Text Parsing
A key feature of Text objects is their ability to be created from and turned back into strings. The former process is called parsing, in which string inputs are automatically structured into Text objects depending on what (markup) language the string is written in. The latter is called formatting, in which Text objects turn themselves back into strings, e.g. when printed or saved to a file.
The first section lists some convenient parsing rules. The second explains different ways a Text can be formatted, which is used to introduce subclasses like Chat.
Tldr
For most applications it is sufficient to be aware that curly brackets {} are special characters and will be automatically parsed.
To disable automatic parsing, pass parse=False when creating a Text or TextTensor.
Parsing
Default f-string Parsing
By default, when a string is passed to create a Text or TextTensor, it is parsed with an f-string-like syntax that extracts information into (key, value) pairs. The main goal of this default language is to simulate f-string formatting when creating Text from strings containing curly braces. This almost always involves splitting the text into segments (with no keys), for example:
text = Text("{var_name} or unnamed space {} to fill")
Parses into these (key, value) pairs:
>> text.items()
[("","{var_name}"), (""," or unnamed space "), ("",""), (""," to fill")]
Pattern | Parsed Result | Description |
---|---|---|
"{value}", "value" | ("", "value") | Value with empty key |
"{value:key}", "value{:key}" | ("key", "value") | Key-value pair. The second pattern adds a key to the entire string segment before the {:key} bracket |
"{``:key}" | ("key", "") | A key with an empty value |
"{}" | ("", "") | Empty key-value pair, used for positional string formatting |
"{0}", "{1}", ... | ("", "0"), ... | Used for positional string formatting, but specifying the order |
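To make the rules in the table concrete, here is a minimal sketch in plain Python that reproduces them for simple inputs. It is not LangTorch's parser: parse_pairs is a hypothetical helper written only for illustration, it ignores backslash escapes and nested braces, and it treats a pair of backticks as an empty value only in the "{``:key}" case.

```python
import re

# Hypothetical helper, NOT LangTorch's parser: it only mirrors the table above.
_BRACES = re.compile(r"\{([^{}]*)\}")


def parse_pairs(source):
    """Split `source` into (key, value) pairs using the brace patterns."""
    pairs = []
    position = 0
    for match in _BRACES.finditer(source):
        literal = source[position:match.start()]
        position = match.end()
        content = match.group(1)
        if content.startswith(":"):
            # "value{:key}": the key labels the literal segment before the bracket
            pairs.append((content[1:], literal))
            continue
        if literal:
            pairs.append(("", literal))  # plain text gets an empty key
        value, separator, key = content.rpartition(":")
        if not separator:
            value, key = content, ""  # "{value}", "{}" and "{0}" carry no key
        if value == "``":
            value = ""  # a pair of backticks stands for an empty value
        pairs.append((key, value))
    if source[position:]:
        pairs.append(("", source[position:]))  # trailing plain text
    return pairs


print(parse_pairs("{Hello!:user}"))   # [('user', 'Hello!')]
print(parse_pairs("Hello!{:user}"))   # [('user', 'Hello!')]
print(parse_pairs("{} to fill"))      # [('', ''), ('', ' to fill')]
```

The sketch follows the "Parsed Result" column of the table, so values are stored without their surrounding braces.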
To disable automatic parsing, pass parse=False when creating a Text or TextTensor.
Tip
You can escape special characters such as curly brackets with a backslash so that they are not used in parsing, or wrap text in a pair of backticks (the content within won't be parsed).
Text Subclasses, Chat and String Formatting
Subclassing Text is useful for two main purposes: restricting the type of content entries the text can hold, and adding custom rules for how to format the content into a string. By default, the content is formatted by joining the value entries. A Text indicates how it is formatted into a string with its language attribute ("str" by default, "md" when created from markdown, and so on). The most important of these subclasses is Chat.
Representing Chat histories
As LangTorch aims to represent a vast array of objects with the same Text class, chat histories can be created as Text objects with a specific structure:
chat = Text(
    ("system", "optional system message"),
    ("user", "Hello!"),
    ("assistant", "Hi! How can I help you today?")
)
# We can add a new message with:
chat += ("user", "whatsup")
For a Text to be recognized as a chat history, e.g. by Activation, all of its keys must be "system", "user" or "assistant". After transformations this can be harder to keep track of, so a simple subclass of Text makes it easier to guarantee that the content represents chat messages. We can set the allowed_keys class attribute to create a Chat subclass.
Below is the full definition of Chat from langtorch. Apart from allowed_keys, it only adds a convenient string representation that formats the chat history into lines of the form "Role: message".
class Chat(Text):
    language = 'chat'
    allowed_keys = ["system", "user", "assistant"]

    def __str__(self):
        return "\n".join([f"{k}: {Text(v)}" for k, v in self.items()])

>> Chat({"custom role": "message"})
# ValueError: Invalid key found in ['custom role']. For class Chat only ['system', 'user', 'assistant'] keys are allowed.
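The two things a subclass can customize, key restriction and string formatting, can be sketched without LangTorch. The classes below, KeyedPairs and ChatPairs, are hypothetical stand-ins for Text and Chat. They only illustrate how a class attribute like allowed_keys might be enforced at construction time, and are not how the library implements it.

```python
class KeyedPairs:
    """Hypothetical stand-in for Text: stores (key, value) pairs."""

    allowed_keys = None  # None means that any key is accepted

    def __init__(self, *pairs):
        # Collect every key that the class does not allow
        invalid = [
            key for key, _ in pairs
            if self.allowed_keys is not None and key not in self.allowed_keys
        ]
        if invalid:
            raise ValueError(
                f"Invalid key found in {invalid}. For class {type(self).__name__} "
                f"only {self.allowed_keys} keys are allowed."
            )
        self.pairs = list(pairs)

    def __str__(self):
        # Default formatting: join the values and ignore the keys
        return "".join(value for _, value in self.pairs)


class ChatPairs(KeyedPairs):
    """Restricts keys to chat roles and formats entries as 'role: message'."""

    allowed_keys = ["system", "user", "assistant"]

    def __str__(self):
        return "\n".join(f"{key}: {value}" for key, value in self.pairs)


print(ChatPairs(("user", "Hello!"), ("assistant", "Hi!")))
# user: Hello!
# assistant: Hi!
```

Putting the check in the base class means each subclass only has to declare its allowed_keys and, if needed, override __str__.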
For local models, a good reason to override __str__ formatting is to apply the model's expected chat template. For example, to query models using the ChatML format, use:
class ChatML(Chat):
    language = 'chatml'
    allowed_keys = ["user", "assistant", "system"]

    def __str__(self):
        formatted = "\n".join(f"<|im_start|>{k}\n{Text(v)}<|im_end|>" for k, v in self.items())
        return formatted if self.keys()[-1] == "assistant" else formatted + "\n<|im_start|>assistant\n"

print(ChatML(("user", "Hi")))
# Outputs:
# <|im_start|>user
# Hi<|im_end|>
# <|im_start|>assistant
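The same template logic can be checked in isolation with a plain function over (role, content) tuples. This is only a sketch of the formatting rule used above: to_chatml and its add_generation_prompt parameter are hypothetical names, not part of LangTorch.

```python
def to_chatml(messages, add_generation_prompt=True):
    """Format a list of (role, content) tuples with the ChatML template."""
    formatted = "\n".join(
        f"<|im_start|>{role}\n{content}<|im_end|>" for role, content in messages
    )
    # When the last turn is not the assistant's, open an assistant turn
    # so that the model continues from there
    if add_generation_prompt and messages and messages[-1][0] != "assistant":
        formatted += "\n<|im_start|>assistant\n"
    return formatted


print(to_chatml([("user", "Hi")]))
# <|im_start|>user
# Hi<|im_end|>
# <|im_start|>assistant

print(to_chatml([("user", "Hi"), ("assistant", "Hello!")]))
# <|im_start|>user
# Hi<|im_end|>
# <|im_start|>assistant
# Hello!<|im_end|>
```

When the history already ends with an assistant message, nothing is appended, which matches the conditional in the __str__ method above.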
Chat Templates and Accessing Entries
All Text subclasses have loc and iloc accessors you can use on chat classes to get specific parts:
user_messages = chat.loc["user"]
print(user_messages.values())
last_user_message = user_messages.iloc[-1]
print(last_user_message)
["Hello",
"I want to fly from New York to London on June 15th",
"Can you book a flight for me?"]
Can you book a flight for me?
A lot of packages use dictionaries with "role" and "content" keys to represent messages. You can quickly create either of the above classes from a list of such message dictionaries with from_messages, a class method they inherit from Text.
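The exact signature of from_messages is not shown here, so the sketch below only illustrates the data shapes involved: how a list of role/content dictionaries maps onto the (key, value) tuples used throughout this page, and back. Both helper names are hypothetical.

```python
def messages_to_pairs(messages):
    """Turn [{"role": ..., "content": ...}, ...] into (role, content) tuples."""
    return [(message["role"], message["content"]) for message in messages]


def pairs_to_messages(pairs):
    """Inverse conversion, back to role/content dictionaries."""
    return [{"role": role, "content": content} for role, content in pairs]


messages = [
    {"role": "system", "content": "optional system message"},
    {"role": "user", "content": "Hello!"},
]
pairs = messages_to_pairs(messages)
print(pairs)
# [('system', 'optional system message'), ('user', 'Hello!')]
```

Because each tuple keeps the role as its key, the resulting pairs already satisfy the key restriction that chat classes enforce, as long as the roles are "system", "user" or "assistant".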
To create chat prompt templates you can use parsing, described above, in combination with any of the accepted Text input formats (e.g. key-value tuples or dictionaries):
chat_template = ChatML(
    ("user", "Hello!"),
    ("assistant", "Hi! How can I help you today?"),
    ("user", "Tell me about {thing}.")
)
# Use iloc to make sure you format the right message
chat_template.iloc[-1] *= {"thing": "the weather"}
print(chat_template.items())
# In this case there are no other "thing" values so we could've also used
chat_template *= {"thing": "the weather"}
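As a rough plain-Python analogy for the difference between formatting one message and formatting the whole template, the sketch below uses str.format on (role, content) tuples. It is not how LangTorch implements multiplication, and fill_last and fill_all are hypothetical helpers.

```python
def fill_last(pairs, **values):
    """Fill named placeholders in the final message only."""
    *head, (role, template) = pairs
    return head + [(role, template.format(**values))]


def fill_all(pairs, **values):
    """Fill named placeholders in every message."""
    return [(role, template.format(**values)) for role, template in pairs]


template = [
    ("user", "Hello!"),
    ("assistant", "Hi! How can I help you today?"),
    ("user", "Tell me about {thing}."),
]
print(fill_last(template, thing="the weather")[-1])
# ('user', 'Tell me about the weather.')
```

Here both helpers return the same result because only the last message contains a "thing" placeholder. Targeting a single message matters once earlier messages contain the same placeholder name.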
If you are not creating a template, just a message history, you can load the chat structure using either of the less common parsing patterns, {value:key} or value{:key}:
chat = Chat(
    "Hello!{:user}Hi there, how can I help you today?{:assistant}"
)
# Equivalently
chat = Chat(
    "{Hello!:user}", "{Hi there, how can I help you today?:assistant}"
)
Used in Subclasses of TextTensor
To make the subclasses work with the rest of the package, we need to make them elements of a TextTensor. The ttype attribute of a TextTensor indicates which Text class to use for its entries, so a short implementation would look like this:
class ChatTensor(TextTensor):
    ttype = Chat
The actual ChatTensor you can import from langtorch includes some additional processing to allow for more input formats and error correction.