`class` langtorch.Text

The Text class in LangTorch represents structured textual data, allowing for easy text modification.

Text represents strings as a sequence of named text segments $(\text{label, text})$ . Unlike key-value pair dictionaries, in such a sequence the same key may be used multiple times, serving as a name or tag for a particular segment of the text. This structure is similar to markup languages like XML and indeed can be parsed from and to most markup languages.

Content Structure

The whole Text structure can be understood as a YAML file like in examples below. The list of segments is held in the content attribute and entries can be a text or a (key, value) pair:

content:
  - "Hello, "
  - name: "Alice"
  - "!"

content:
  - user: "Hi"
  - assistant: "Hello"
  - user: "How are you?"

As in XML, the labels can also be nested, i.e. the text of a single section labeled "user" can itself be structured into labelled segment (with nested keys):

content:
  - user:
      - nested_label1: "text 1."
      - nested_label2: "text 2."

As this structure is sometimes hard e.g. to iterate over, to access the content, the .items() method is much more convenient (along with .values() and .keys()). The output of .items() is always a list of key-value tuples. If a segment has no key (like in the first example) items represents it as ("", "Hello").

Initialization

Text instances can be initialized on various inputs, including:

Regular strings
Tuples representing key-value pairs
Dictionaries of ordered keys and values
Other Text instances

Text accepts a sequence of arguments, which can include any mix of these types:

text = Text("Hello, ", ("name", "Alice"), "!")
print(text)
# Output: Hello, Alice!

Attributes

content: Holds "named strings" or key-value pairs, enabling concatenation, formatting, and structured text manipulations.
language: The same content can represent different text strings when formatted to a string; language indicates what language the Text will be written in when formatted to a string. Defaults to "str".
allowed_keys: Restricts to a list of permissible keys within a Text instance, violations raise ValueError.

Operations

The Text class supports various operations for manipulating and transforming the textual data:

Concatenation: Text instances can be concatenated with other Text instances or strings using the + operator.
Formatting: The * operation between two Text objects allows for text formatting, where the content of the right (second) text can replace matching segments of the left text. If a key of a segment doesn't match any value of segments from the left text that segment is appended (like with +). This enables dynamic substitution of values based on keys, prompt injections and more and is explained in a separate page here.
Indexing: Text supports pandas style indexing with iloc and loc properties providing index-based and key-based indexing, respectively. They allow accessing and modifying specific parts of the text using integer indices or string keys (may be buggy in early releases).
Splitting: The split() method allows splitting the text into smaller segments based on a specified separator.
Iteration: Text instances are iterable, allowing iteration over the individual segments.

Method	Description
`__init__(substrings, parse=True, language='str', *named_substrings)`	Initializes a new `Text` instance. Allows for various input formats, including parsable strings, string sequences, key-value pairs, and dictionaries. The `parse` parameter determines whether to parse the input content, and the `language` parameter specifies the language for parsing and formatting.
`items()`	Returns a list of key-value pairs representing the `Text` instance's content.
`keys()`	Returns a list of keys from the `Text` instance's content.
`values()`	Returns a list of values from the `Text` instance's content.
`set_key(key, inplace=False)` inplace: `set_key_(key)`	Sets the top-level keys for the `Text` entries, removing previous keys. If `inplace` is `True`, it modifies the instance in-place; otherwise, it returns a new `Text` instance.
`add_key(key, inplace=False)` inplace: `add_key_(key)`	Adds a new top-level key for the `Text` entries, preserving the previous structure. If `inplace` is `True`, it modifies the instance in-place; otherwise, it returns a new `Text` instance.
`to_tensor()`	Converts the `Text` instance to a `TextTensor`. If you want to perform some LLM operation on each "substring" of `Text` (each element in `content`), you can use `.to_tensor()` to create a 1-d tensor with separate entries for each element (normally, creating a `TextTensor` on a `Text` assumes the `Text` to be one element).
`split(sep=' ', mode='auto')`	Splits the `Text` instance into smaller segments based on the specified separator (`sep`). The `mode` parameter determines the splitting behavior, but currently only supports 'auto' mode.
`apply(func, args, to='values', *kwargs)`	Applies a specified function to the keys, values, or both of the `Text` instance. The `to` parameter determines the target of the function application.
`from_messages(*messages)`	Creates a `Text` instance from a list of dictionaries representing chat messages. Each dictionary should have keys 'role' and 'content'.
`from_pandoc_json(ast_json)`	Creates a `Text` instance from a Pandoc AST JSON string.
`upper()`, `lower()`	Applies the corresponding string method to the `Text` instance using `method_apply()`.
`inv()`	Returns the inverse of the `Text` instance, swapping keys and values.
`load(path, language='str')`, or `from_file`	Loads a `Text` instance from a file specified by the `path` parameter. The `language` parameter determines the input format.
`save(path)`, or `to_file`	Saves the `Text` instance to a file specified by the `path` parameter.

Examples

Initializing a Text with parsing

text = Text("Hello, {name}!")
print(text.items())
# Output: [('','Hello'), ('', ', '),('', 'name'), ('', '!')]

Note that "name" is a value not a key, it would be a key in the completion dictionary

Concatenating Text Instances

greeting = Text("Hello")
name_and_label = Text(("name", "World"))
combined = greeting + ", " + name_and_label
print(combined)
# Output: "Hello, World"

Remember to always add a trailing or prefix space when concatenating Texts that are supposed to be separate words / sentences. This sometimes also applies to multiplication, as in the example below.

Formatting Text Instances

template = Text("Dear {title} {last_name},")
formatted = template * {"title": "Mr.", "last_name": "Doe", "unmatched_key":"value"}
print(formatted)
# Output: "Dear Mr. Doe,value"

Remember that entries of the right text with unmatched keys will be appended to the result

Accessing content

The pairs that make up a text object can be accessed with its .items() method, and we can edit these entries using indexing similar to pandas, which uses .iloc and .loc:

text = Text({"key1":"Value1", "key2":"Value2"})
print(text.keys())
# Output: ['key1', 'key2']
print(text.items())
# Output: [('key1', 'Value1'), ('key2', 'Value2')]
print(text.iloc[0])
# Output: ('key1', 'Value1')
print(text.loc['key1'])
# Output: 'Value1'

The content can have nested entries, e.g. when representing a chat history. To access them or format during multiplication use a dot between the keys, e.g. [('key1',('key2', 'value'))] can be accessed with loc['key1.key2']

Language Parsing

The Text class supports parsing (constructing the content structure from just a string) from various markup languages, such as HTML, Markdown, and LaTeX and save them back to text files (potentially in a different language). By default, Text tries to parse an f-string syntax. Both parsing from and casting to strings is explained 'here'.

Notes

If you want to perform some LLM operation on each "segment" of Text separately, you can use .to_tensor() to create a 1-d tensor with separate entries for each element (normally, creating a TextTensor on a Text assumes the Text to be one element).
Be careful with strings with "{" and "}" symbols. If parse is not set to False LangTorch will attempt to parse such strings into segments as if it were an f-string.

class langtorch.Text