Text Parsing

A key feature of Text objects is their ability to be created from and turned back into strings. The former process is called parsing, in which string inputs are automatically structured into Text objects depending on what (markup) language the string is written in. The latter is called formatting, in which Text objects turn themselves back into strings e.g. when printed or saved to a file.

The first section lists the most useful parsing rules. The second explains the different ways a Text can be formatted and uses them to introduce subclasses such as Chat.

Tldr

For most applications it is enough to know that curly brackets {} are special characters and are parsed automatically. To disable automatic parsing, pass parse=False when creating a Text or TextTensor.

Parsing

Default f-string Parsing

By default, a string passed to create a Text or TextTensor is parsed with an f-string-like syntax that extracts its content into (key, value) pairs. The main goal of this default language is to simulate f-string formatting when creating a Text from strings that contain curly braces. Parsing almost always splits the text into segments with no keys. For example:

text = Text("{var_name} or unnamed space {} to fill")

Parses into these (key, value) pairs:

>> text.items()
[("","{var_name}"), (""," or unnamed space "), ("",""), (""," to fill")]

| Pattern | Parsed result | Description |
| --- | --- | --- |
| "{value}", "value" | ("", "value") | Value with an empty key |
| "{value:key}", "value{:key}" | ("key", "value") | Key-value pair. The second pattern adds a key to the entire string segment before the {:key} bracket |
| "{``:key}" | ("key", "") | A key with an empty value |
| "{}" | ("", "") | Empty key-value pair, used for positional string formatting |
| "{0}", "{1}", ... | ("", "0"), ... | Used for positional string formatting, but specifying the order |

To disable automatic parsing, pass parse=False when creating a Text or TextTensor.
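
The rules in the pattern table above can be sketched in plain Python. This is a toy illustration, not LangTorch's parser: `parse_pairs` is a hypothetical helper, it follows the table literally (so "{var_name}" yields the value "var_name", whereas the items() output shown earlier keeps the braces), it ignores escaping, and it assumes that parse=False keeps the whole string as a single unkeyed value.

```python
import re

# Hypothetical helper, not part of LangTorch: a toy parser that follows
# the pattern table literally and ignores escaping rules.
_BRACES = re.compile(r"\{([^{}]*)\}")


def parse_pairs(source: str, parse: bool = True) -> list[tuple[str, str]]:
    """Split `source` into (key, value) pairs using the table's rules."""
    if not parse:
        # Assumption: with parsing disabled the string stays one unkeyed value
        return [("", source)]
    pairs = []
    pos = 0
    for match in _BRACES.finditer(source):
        before = source[pos:match.start()]
        inner = match.group(1)
        pos = match.end()
        if inner.startswith(":"):
            # "value{:key}" assigns the key to the segment before the bracket
            pairs.append((inner[1:], before))
            continue
        if before:
            pairs.append(("", before))
        value, sep, key = inner.rpartition(":")
        if sep:
            # "{value:key}"; a pair of backticks stands for an empty value
            pairs.append((key, "" if value == "``" else value))
        else:
            # "{value}", "{}" and "{0}" carry no key
            pairs.append(("", inner))
    if source[pos:]:
        pairs.append(("", source[pos:]))
    return pairs


print(parse_pairs("Hello!{:user}Hi there!{:assistant}"))
# [('user', 'Hello!'), ('assistant', 'Hi there!')]
print(parse_pairs("{} costs {price}", parse=False))
# [('', '{} costs {price}')]
```

Taking the key from the text after the last colon is a choice made for this sketch; how the real parser treats values that themselves contain colons is not covered here.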

Tip

You can escape special characters such as curly brackets with a backslash so that they are not used in parsing, or wrap text in a pair of backticks (the content within won't be parsed).

Text Subclasses, Chat and String Formatting

Subclassing Text is useful for two main purposes: restricting the type of content entries the text can hold, and adding custom rules for how the content is formatted into a string. By default, the content is formatted by joining the value entries. A Text indicates how it is formatted into a string with its language attribute ("str" by default, "md" when created from markdown, and so on). The most important of these subclasses is Chat.

Representing Chat histories

As LangTorch aims to represent a vast array of objects with the same Text class, chat histories can be created as Text objects with a specific structure:

chat = Text(  
    ("system", "optional system message"),  
    ("user", "Hello!"),  
    ("assistant", "Hi! How can I help you today?")
) 
# We can add a new message with:
chat += ("user", "whatsup")

For a Text to be recognized as a chat history, e.g. by Activation, all of its keys must be one of "system", "user" or "assistant". After transformations this can be harder to keep track of, so a simple subclass of Text makes it easier to guarantee that the content represents chat messages. Setting the allowed_keys class attribute is enough to create a Chat subclass.

Below is the full definition of Chat from langtorch. Apart from allowed_keys, it only adds a convenient string representation that formats the chat history as lines of the form "Role: message".

class Chat(Text):  
    language = 'chat'  
    allowed_keys = ["system", "user", "assistant"]  

    def __str__(self):  
        return "\n".join([f"{k}: {Text(v)}" for k, v in self.items()])  

Usage:

>> Chat({"custom role": "message"})
# ValueError: Invalid key found in ['custom role']. For class Chat only ['system', 'user', 'assistant'] keys are allowed.
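
To make the mechanism concrete without LangTorch installed, here is a minimal sketch of how a class attribute like allowed_keys can restrict content and how overriding __str__ changes formatting. `KeyedText` and `MiniChat` are hypothetical stand-ins written for this illustration; they are not the library's Text and Chat, and the real validation may work differently.

```python
class KeyedText:
    """Hypothetical stand-in for Text: an ordered list of (key, value) pairs."""

    allowed_keys = None  # None means any key is accepted

    def __init__(self, *pairs):
        if self.allowed_keys is not None:
            invalid = [k for k, _ in pairs if k not in self.allowed_keys]
            if invalid:
                raise ValueError(
                    f"Invalid key found in {invalid}. For class "
                    f"{type(self).__name__} only {self.allowed_keys} keys are allowed."
                )
        self.pairs = list(pairs)

    def __str__(self):
        # Default formatting: join the values and ignore the keys
        return "".join(value for _, value in self.pairs)


class MiniChat(KeyedText):
    """Restricts keys to chat roles and prints one 'role: message' line each."""

    allowed_keys = ["system", "user", "assistant"]

    def __str__(self):
        return "\n".join(f"{key}: {value}" for key, value in self.pairs)


print(MiniChat(("user", "Hello!"), ("assistant", "Hi!")))
# user: Hello!
# assistant: Hi!
```

Keeping the check in the base class means a subclass only has to declare its allowed keys, which matches how short the Chat definition above is.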

For local models, a good reason to override __str__ is to apply the chat template that the model expects. For example, to query models that use the ChatML format:

class ChatML(Chat):  
    language = 'chatml'
    allowed_keys = ["user", "assistant", "system"]  

    def __str__(self):  
        formatted = "\n".join(f"<|im_start|>{k}\n{Text(v)}<|im_end|>" for k, v in self.items())  
        return formatted if self.keys()[-1] == "assistant" else formatted + "\n<|im_start|>assistant\n"

print(ChatML(("user", "Hi")))
# Outputs:
# <|im_start|>user
# Hi<|im_end|>
# <|im_start|>assistant

Chat Templates and Accessing Entries

All Text subclasses have loc and iloc accessors, which you can use on chat classes to select specific entries:

user_messages = chat.loc["user"]
print(user_messages.values())

last_user_message = user_messages.iloc[-1]
print(last_user_message)

Outputs:

["Hello",
 "I want to fly from New York to London on June 15th",
 "Can you book a flight for me?"]

Can you book a flight for me?

Many packages represent messages as dictionaries with "role" and "content" keys. You can quickly create either of the above classes from a list of such dictionaries with the from_messages class method, which they inherit from Text.
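
As a rough plain-Python picture of what this conversion and the accessors amount to, the sketch below turns role/content dictionaries into (key, value) tuples and then selects entries by key and by position. The helper names `messages_to_pairs` and `select_by_key` are hypothetical and the message contents are made up for the example; the exact signature and behavior of from_messages, loc and iloc are not shown here.

```python
# Example data in the common role/content format (contents are illustrative)
messages = [
    {"role": "system", "content": "You are a travel assistant."},
    {"role": "user", "content": "Hello"},
    {"role": "assistant", "content": "Hi! How can I help you today?"},
    {"role": "user", "content": "Can you book a flight for me?"},
]


def messages_to_pairs(messages):
    """Turn role/content dictionaries into (key, value) tuples."""
    return [(m["role"], m["content"]) for m in messages]


def select_by_key(pairs, key):
    """Label-based selection, the idea behind loc[key]."""
    return [(k, v) for k, v in pairs if k == key]


pairs = messages_to_pairs(messages)
user_pairs = select_by_key(pairs, "user")
last_user_message = user_pairs[-1][1]  # position-based, the idea behind iloc[-1]
print(last_user_message)
# Can you book a flight for me?
```

The resulting tuples have the same key-value shape as the arguments used to build the chat earlier on this page.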

To create chat prompt templates you can use parsing, described above, in combination with any of the accepted Text input formats (e.g. key-value tuples or dictionaries):

chat_template = ChatML(  
    ("user", "Hello!"),  
    ("assistant", "Hi! How can I help you today?"),  
    ("user", "Tell me about {thing}.")  
)  
# Use iloc to make sure you format the right message
# (assumes iloc entries support in-place updates)
chat_template.iloc[-1] *= {"thing": "the weather"}
print(chat_template.items())

# In this case there are no other "thing" values, so we could've also used
chat_template *= {"thing": "the weather"}
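
The difference between filling one message and filling the whole template can be illustrated on plain (key, value) tuples. This sketch does not reproduce how LangTorch multiplication works internally; `fill` is a hypothetical helper that only does string replacement, and it assumes a non-empty list of pairs.

```python
def fill(pairs, values, index=None):
    """Replace "{name}" placeholders in the message texts.

    index=None fills every message; an integer (negative allowed)
    fills only the message at that position.
    """
    target = None if index is None else index % len(pairs)
    filled = []
    for i, (key, text) in enumerate(pairs):
        if target is None or i == target:
            for name, replacement in values.items():
                text = text.replace("{" + name + "}", replacement)
        filled.append((key, text))
    return filled


template = [
    ("user", "Hello!"),
    ("assistant", "Hi! How can I help you today?"),
    ("user", "Tell me about {thing}."),
]

# Fill only the last message, or every message that contains the placeholder
print(fill(template, {"thing": "the weather"}, index=-1)[-1])
# ('user', 'Tell me about the weather.')
print(fill(template, {"thing": "the weather"}) == fill(template, {"thing": "the weather"}, index=-1))
# True
```

Targeting a position matters once several messages share a placeholder name; with a single "{thing}" both calls give the same result.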

If you are creating a message history rather than a template, you can load the chat structure using either of the less common parsing patterns, {value:key} or value{:key}:

chat = Chat(
    "Hello!{:user}Hi there, how can I help you today?{:assistant}"
    )
# Equivalently
chat = Chat(
    "{Hello!:user}", "{Hi there, how can I help you today?:assistant}"
    )

Used in Subclasses of TextTensor

To make the subclasses work with the rest of the package, we need to make them elements of a TextTensor. The ttype attribute of TextTensors indicates which Text class to use for entries, so a short implementation would look like this:

class ChatTensor(TextTensor):  
    ttype = Chat

The actual ChatTensor you can import from langtorch includes some additional processing to allow for more input formats and error correction.