class
langtorch.Text
The Text
class in LangTorch represents structured textual data, allowing for easy text modification.
Text
represents strings as a sequence of named text segments . Unlike key-value pair dictionaries, in such a sequence the same key may be used multiple times, serving as a name or tag for a particular segment of the text. This structure is similar to markup languages like XML and indeed can be parsed from and to most markup languages.
Content Structure
The whole Text structure can be understood as a YAML file like in examples below. The list of segments is held in the content
attribute and entries can be a text or a (key, value) pair:
content:
- "Hello, "
- name: "Alice"
- "!"
content:
- user: "Hi"
- assistant: "Hello"
- user: "How are you?"
As in XML, the labels can also be nested, i.e. the text of a single section labeled "user" can itself be structured into labelled segment (with nested keys):
content:
- user:
- nested_label1: "text 1."
- nested_label2: "text 2."
As this structure is sometimes hard e.g. to iterate over, to access the content, the .items()
method is much more convenient (along with .values()
and .keys()
). The output of .items()
is always a list of key-value tuples. If a segment has no key (like in the first example) items
represents it as ("", "Hello")
.
Initialization
Text
instances can be initialized on various inputs, including:
- Regular strings
- Tuples representing key-value pairs
- Dictionaries of ordered keys and values
- Other
Text
instances
Text
accepts a sequence of arguments, which can include any mix of these types:
text = Text("Hello, ", ("name", "Alice"), "!")
print(text)
# Output: Hello, Alice!
Attributes
content
: Holds "named strings" or key-value pairs, enabling concatenation, formatting, and structured text manipulations.language
: The same content can represent different text strings when formatted to a string;language
indicates what language the Text will be written in when formatted to a string. Defaults to"str"
.allowed_keys
: Restricts to a list of permissible keys within aText
instance, violations raiseValueError
.
Operations
The Text
class supports various operations for manipulating and transforming the textual data:
- Concatenation:
Text
instances can be concatenated with otherText
instances or strings using the+
operator. - Formatting: The
*
operation between twoText
objects allows for text formatting, where the content of the right (second) text can replace matching segments of the left text. If a key of a segment doesn't match any value of segments from the left text that segment is appended (like with +). This enables dynamic substitution of values based on keys, prompt injections and more and is explained in a separate page here. - Indexing:
Text
supports pandas style indexing withiloc
andloc
properties providing index-based and key-based indexing, respectively. They allow accessing and modifying specific parts of the text using integer indices or string keys (may be buggy in early releases). - Splitting: The
split()
method allows splitting the text into smaller segments based on a specified separator. - Iteration:
Text
instances are iterable, allowing iteration over the individual segments.
Method | Description |
---|---|
__init__(*substrings, parse=True, language='str', **named_substrings) |
Initializes a new Text instance. Allows for various input formats, including parsable strings, string sequences, key-value pairs, and dictionaries. The parse parameter determines whether to parse the input content, and the language parameter specifies the language for parsing and formatting. |
items() |
Returns a list of key-value pairs representing the Text instance's content. |
keys() |
Returns a list of keys from the Text instance's content. |
values() |
Returns a list of values from the Text instance's content. |
set_key(key, inplace=False) inplace: set_key_(key) |
Sets the top-level keys for the Text entries, removing previous keys. If inplace is True , it modifies the instance in-place; otherwise, it returns a new Text instance. |
add_key(key, inplace=False) inplace: add_key_(key) |
Adds a new top-level key for the Text entries, preserving the previous structure. If inplace is True , it modifies the instance in-place; otherwise, it returns a new Text instance. |
to_tensor() |
Converts the Text instance to a TextTensor . If you want to perform some LLM operation on each "substring" of Text (each element in content ), you can use .to_tensor() to create a 1-d tensor with separate entries for each element (normally, creating a TextTensor on a Text assumes the Text to be one element). |
split(sep=' ', mode='auto') |
Splits the Text instance into smaller segments based on the specified separator (sep ). The mode parameter determines the splitting behavior, but currently only supports 'auto' mode. |
apply(func, *args, to='values', **kwargs) |
Applies a specified function to the keys, values, or both of the Text instance. The to parameter determines the target of the function application. |
from_messages(*messages) |
Creates a Text instance from a list of dictionaries representing chat messages. Each dictionary should have keys 'role' and 'content'. |
from_pandoc_json(ast_json) |
Creates a Text instance from a Pandoc AST JSON string. |
upper() , lower() |
Applies the corresponding string method to the Text instance using method_apply() . |
inv() |
Returns the inverse of the Text instance, swapping keys and values. |
load(path, language='str') , or from_file |
Loads a Text instance from a file specified by the path parameter. The language parameter determines the input format. |
save(path) ,or to_file |
Saves the Text instance to a file specified by the path parameter. |
Examples
Initializing a Text with parsing
text = Text("Hello, {name}!")
print(text.items())
# Output: [('','Hello'), ('', ', '),('', 'name'), ('', '!')]
Note that "name" is a value not a key, it would be a key in the completion dictionary
Concatenating Text Instances
greeting = Text("Hello")
name_and_label = Text(("name", "World"))
combined = greeting + ", " + name_and_label
print(combined)
# Output: "Hello, World"
Remember to always add a trailing or prefix space when concatenating Texts that are supposed to be separate words / sentences. This sometimes also applies to multiplication, as in the example below.
Formatting Text Instances
template = Text("Dear {title} {last_name},")
formatted = template * {"title": "Mr.", "last_name": "Doe", "unmatched_key":"value"}
print(formatted)
# Output: "Dear Mr. Doe,value"
Remember that entries of the right text with unmatched keys will be appended to the result
Accessing content
The pairs that make up a text object can be accessed with its .items()
method, and we can edit these entries using indexing similar to pandas, which uses .iloc
and .loc
:
text = Text({"key1":"Value1", "key2":"Value2"})
print(text.keys())
# Output: ['key1', 'key2']
print(text.items())
# Output: [('key1', 'Value1'), ('key2', 'Value2')]
print(text.iloc[0])
# Output: ('key1', 'Value1')
print(text.loc['key1'])
# Output: 'Value1'
The content can have nested entries, e.g. when representing a chat history. To access them or format during multiplication use a dot between the keys, e.g. [('key1',('key2', 'value'))]
can be accessed with loc['key1.key2']
Language Parsing
The Text
class supports parsing (constructing the content structure from just a string) from various markup languages, such as HTML, Markdown, and LaTeX and save them back to text files (potentially in a different language). By default, Text
tries to parse an f-string syntax. Both parsing from and casting to strings is explained 'here'.
Notes
- If you want to perform some LLM operation on each "segment" of
Text
separately, you can use.to_tensor()
to create a 1-d tensor with separate entries for each element (normally, creating aTextTensor
on aText
assumes theText
to be one element). - Be careful with strings with "{" and "}" symbols. If
parse
is not set toFalse
LangTorch will attempt to parse such strings into segments as if it were an f-string.