TextTensor Attributes

Each TextTensor has six main attributes, two per representation: texts (content and ttype), embeddings (embedding and embedding_model), and tokens (tokens and tokenizer).

  • The content attribute is a numpy array holding the textual entries. The ttype is the "text type" of these entries: any subclass of the Text class.

  • The embedding attribute holds a torch.Tensor representation of the text entries, computed with the model specified in embedding_model (the name of an OpenAI model or a local Module).

  • The tokens attribute holds the torch.Tensor of tokenized text entries from content. While embeddings are computed with the "text-embedding-3-small" model by default, tokenizing content requires setting the tokenizer attribute to a tokenizer from the Transformers library.
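To make the pairing concrete, here is a minimal conceptual sketch (plain Python, not langtorch's actual implementation) of how the six attributes group into data/model pairs, with the derived representations left empty until computed:

```python
from dataclasses import dataclass
from typing import Any, Optional

# Conceptual sketch only: the six attributes come in data/model pairs,
# one pair per representation (texts, embeddings, tokens).
@dataclass
class TextTensorSketch:
    # texts: the entries themselves plus their "text type"
    content: list                 # stand-in for the numpy array of entries
    ttype: type = str             # stand-in for a Text subclass
    # embeddings: the tensor plus the model that produces it
    embedding: Optional[Any] = None
    embedding_model: str = "text-embedding-3-small"
    # tokens: the token ids plus the tokenizer that produces them
    tokens: Optional[Any] = None
    tokenizer: Optional[Any] = None

t = TextTensorSketch(content=["Hello", "world"])
print(t.embedding is None and t.tokens is None)  # True: computed lazily later
```

The derived representations start as None; as described below, they are filled in only when needed.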

Setting attributes

The content attribute is always set, as it is what a TextTensor is initialized on.

The embedding and tokens attributes are not computed at initialization, to save costs. They are computed automatically just before an operation that requires them, or can be invoked manually with .embed() and .tokenize(), respectively.

TextTensor subclasses and Text attributes

By default, a TextTensor is a tensor with entries of type Text. Using TextTensor subclasses that set ttype to a subclass of Text brings several benefits, for example:

  • A Text subclass can add a custom __str__ method, e.g. langtorch.Markdown, which formats entries to a string differently (done, for example, when passing them to an LLM)
  • A subclass can set the allowed_keys attribute to require certain keys, e.g. langtorch.Chat, which guarantees a correct chat-history representation with only system, user, or assistant keys
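The allowed_keys idea can be illustrated with a small stand-alone sketch (again, not langtorch's Chat class itself, just the constraint it describes): entries are accepted only if every key comes from a fixed set of roles.

```python
# Conceptual sketch of an allowed_keys-style constraint: a chat history
# may only use the system, user, and assistant roles.
ALLOWED_KEYS = {"system", "user", "assistant"}

def validate_chat(history):
    """Return the history unchanged, or raise if a role is not allowed."""
    for role, _message in history:
        if role not in ALLOWED_KEYS:
            raise ValueError(f"invalid role: {role!r}")
    return history

chat = [("system", "You are helpful."), ("user", "Hi!")]
print(validate_chat(chat) is chat)  # True: all roles are allowed
```

A class enforcing this at construction time guarantees, as the bullet above notes, that the stored history is always a valid chat representation.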