TextTensor Attributes
Each TextTensor has six main attributes, two per representation: texts (`content` and `ttype`), embeddings (`embedding` and `embedding_model`), and tokens (`tokens` and `tokenizer`).
- The `content` attribute is a numpy array holding the textual entries. `ttype` is the "text type" of these entries, which can be any subclass of the `Text` class.
- The `embedding` attribute holds a `torch.Tensor` representation of the text entries, computed using the model specified in `embedding_model` (the name of an OpenAI model or a local Module).
- `tokens` is a `torch.Tensor` of tokenized text entries from `content`. While embeddings are by default computed with the "text-embedding-3-small" model, tokenizing content requires setting the `tokenizer` attribute to a tokenizer from the Transformers library.
Setting attributes
The `content` attribute is always set, as it is what a `TextTensor` is initialized with. The `embedding` and `tokens` are not computed at initialization, to save costs. They are computed automatically right before an operation that requires them, or can be invoked manually with `.embed()` and `.tokenize()` respectively.
TextTensor subclasses and Text attributes
By default, a TextTensor
is a tensor with entries of type Text
. We can get various benefits from using text tensor subclasses that set ttype
to a subclass of Text
, for example:
- A
Text
subclass can add a custom__str__
method, e.g.langtorch.Markdown
, which formats entries to a string differently (which is done e.g. when passint to an LLM ) - Set the
allowed_keys
attribute to require certain keys, e.g.langtorch.Chat
which guarantees a correct chat history representation with only system, user or assistant keys