Skip to content

class langtorch.Text

The Text class in LangTorch represents structured textual data, allowing for easy text modification.

Text represents strings as a sequence of named text segments . Unlike key-value pair dictionaries, in such a sequence the same key may be used multiple times, serving as a name or tag for a particular segment of the text. This structure is similar to markup languages like XML and indeed can be parsed from and to most markup languages.

Content Structure

The whole Text structure can be understood as a YAML file like in examples below. The list of segments is held in the content attribute and entries can be a text or a (key, value) pair:

content:
  - "Hello, "
  - name: "Alice"
  - "!"
content:
  - user: "Hi"
  - assistant: "Hello"
  - user: "How are you?"

As in XML, the labels can also be nested, i.e. the text of a single section labeled "user" can itself be structured into labelled segment (with nested keys):

content:
  - user:
      - nested_label1: "text 1."
      - nested_label2: "text 2."

As this structure is sometimes hard e.g. to iterate over, to access the content, the .items() method is much more convenient (along with .values() and .keys()). The output of .items() is always a list of key-value tuples. If a segment has no key (like in the first example) items represents it as ("", "Hello").

Initialization

Text instances can be initialized on various inputs, including:

  • Regular strings
  • Tuples representing key-value pairs
  • Dictionaries of ordered keys and values
  • Other Text instances

Text accepts a sequence of arguments, which can include any mix of these types:

text = Text("Hello, ", ("name", "Alice"), "!")
print(text)
# Output: Hello, Alice!

Attributes

  • content: Holds "named strings" or key-value pairs, enabling concatenation, formatting, and structured text manipulations.
  • language: The same content can represent different text strings when formatted to a string; language indicates what language the Text will be written in when formatted to a string. Defaults to "str".
  • allowed_keys: Restricts to a list of permissible keys within a Text instance, violations raise ValueError.

Operations

The Text class supports various operations for manipulating and transforming the textual data:

  • Concatenation: Text instances can be concatenated with other Text instances or strings using the + operator.
  • Formatting: The * operation between two Text objects allows for text formatting, where the content of the right (second) text can replace matching segments of the left text. If a key of a segment doesn't match any value of segments from the left text that segment is appended (like with +). This enables dynamic substitution of values based on keys, prompt injections and more and is explained in a separate page here.
  • Indexing: Text supports pandas style indexing with iloc and loc properties providing index-based and key-based indexing, respectively. They allow accessing and modifying specific parts of the text using integer indices or string keys (may be buggy in early releases).
  • Splitting: The split() method allows splitting the text into smaller segments based on a specified separator.
  • Iteration: Text instances are iterable, allowing iteration over the individual segments.
Method Description
__init__(*substrings, parse=True, language='str', **named_substrings) Initializes a new Text instance. Allows for various input formats, including parsable strings, string sequences, key-value pairs, and dictionaries. The parse parameter determines whether to parse the input content, and the language parameter specifies the language for parsing and formatting.
items() Returns a list of key-value pairs representing the Text instance's content.
keys() Returns a list of keys from the Text instance's content.
values() Returns a list of values from the Text instance's content.
set_key(key, inplace=False)
inplace: set_key_(key)
Sets the top-level keys for the Text entries, removing previous keys. If inplace is True, it modifies the instance in-place; otherwise, it returns a new Text instance.
add_key(key, inplace=False)
inplace: add_key_(key)
Adds a new top-level key for the Text entries, preserving the previous structure. If inplace is True, it modifies the instance in-place; otherwise, it returns a new Text instance.
to_tensor() Converts the Text instance to a TextTensor. If you want to perform some LLM operation on each "substring" of Text (each element in content), you can use .to_tensor() to create a 1-d tensor with separate entries for each element (normally, creating a TextTensor on a Text assumes the Text to be one element).
split(sep=' ', mode='auto') Splits the Text instance into smaller segments based on the specified separator (sep). The mode parameter determines the splitting behavior, but currently only supports 'auto' mode.
apply(func, *args, to='values', **kwargs) Applies a specified function to the keys, values, or both of the Text instance. The to parameter determines the target of the function application.
from_messages(*messages) Creates a Text instance from a list of dictionaries representing chat messages. Each dictionary should have keys 'role' and 'content'.
from_pandoc_json(ast_json) Creates a Text instance from a Pandoc AST JSON string.
upper(), lower() Applies the corresponding string method to the Text instance using method_apply().
inv() Returns the inverse of the Text instance, swapping keys and values.
load(path, language='str'), or from_file Loads a Text instance from a file specified by the path parameter. The language parameter determines the input format.
save(path),
or to_file
Saves the Text instance to a file specified by the path parameter.

Examples

Initializing a Text with parsing

text = Text("Hello, {name}!")
print(text.items())
# Output: [('','Hello'), ('', ', '),('', 'name'), ('', '!')]

Note that "name" is a value not a key, it would be a key in the completion dictionary

Concatenating Text Instances

greeting = Text("Hello")
name_and_label = Text(("name", "World"))
combined = greeting + ", " + name_and_label
print(combined)
# Output: "Hello, World"

Remember to always add a trailing or prefix space when concatenating Texts that are supposed to be separate words / sentences. This sometimes also applies to multiplication, as in the example below.

Formatting Text Instances

template = Text("Dear {title} {last_name},")
formatted = template * {"title": "Mr.", "last_name": "Doe", "unmatched_key":"value"}
print(formatted)
# Output: "Dear Mr. Doe,value"

Remember that entries of the right text with unmatched keys will be appended to the result

Accessing content

The pairs that make up a text object can be accessed with its .items() method, and we can edit these entries using indexing similar to pandas, which uses .iloc and .loc:

text = Text({"key1":"Value1", "key2":"Value2"})
print(text.keys())
# Output: ['key1', 'key2']
print(text.items())
# Output: [('key1', 'Value1'), ('key2', 'Value2')]
print(text.iloc[0])
# Output: ('key1', 'Value1')
print(text.loc['key1'])
# Output: 'Value1'

The content can have nested entries, e.g. when representing a chat history. To access them or format during multiplication use a dot between the keys, e.g. [('key1',('key2', 'value'))] can be accessed with loc['key1.key2']

Language Parsing

The Text class supports parsing (constructing the content structure from just a string) from various markup languages, such as HTML, Markdown, and LaTeX and save them back to text files (potentially in a different language). By default, Text tries to parse an f-string syntax. Both parsing from and casting to strings is explained 'here'.

Notes

  • If you want to perform some LLM operation on each "segment" of Text separately, you can use .to_tensor() to create a 1-d tensor with separate entries for each element (normally, creating a TextTensor on a Text assumes the Text to be one element).
  • Be careful with strings with "{" and "}" symbols. If parse is not set to False LangTorch will attempt to parse such strings into segments as if it were an f-string.