Text Transforms
Text transforms prepare raw text data for use in neural networks. Unlike images, which are already stored as arrays of numbers, text is a sequence of characters that must be converted into a structured numerical format, typically a tensor of integer IDs.
The two primary steps in any NLP data pipeline are:
- Tokenization: The process of breaking a raw string of text into smaller pieces called "tokens". These can be words, subwords, or characters.
- Numericalization: The process of converting each token into a unique integer ID based on a pre-defined "vocabulary".
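The two steps can be illustrated without any library support. The following is a minimal sketch in plain C++, assuming word-level tokenization on whitespace and a tiny hand-written vocabulary; `whitespace_tokenize` and `numericalize` are hypothetical helper names used for illustration and are not part of the xTorch API.

```cpp
#include <cstdint>
#include <sstream>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical helper (not part of xTorch): split a string on whitespace.
std::vector<std::string> whitespace_tokenize(const std::string& text) {
    std::vector<std::string> tokens;
    std::istringstream stream(text);
    std::string token;
    while (stream >> token) {
        tokens.push_back(token);
    }
    return tokens;
}

// Hypothetical helper: map each token to its integer ID, falling back to
// unk_id for tokens that are missing from the vocabulary.
std::vector<std::int64_t> numericalize(
    const std::vector<std::string>& tokens,
    const std::unordered_map<std::string, std::int64_t>& vocab,
    std::int64_t unk_id) {
    std::vector<std::int64_t> ids;
    ids.reserve(tokens.size());
    for (const auto& token : tokens) {
        auto it = vocab.find(token);
        ids.push_back(it != vocab.end() ? it->second : unk_id);
    }
    return ids;
}
```

With a vocabulary of `{"<unk>": 0, "the": 1, "movie": 2, "was": 3}`, the string "the movie was great" becomes the tokens `the`, `movie`, `was`, `great` and then the IDs `1, 2, 3, 0`, because "great" is not in the vocabulary.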
xTorch provides a suite of transforms to handle these steps, from powerful tokenizers to utilities for managing sequence length.
All text transforms are located under the xt::transforms::text namespace and can be found in the <xtorch/transforms/text/> header directory.
General Usage
Text transforms are used within a Compose pipeline, which is then passed to an NLP Dataset. A key difference from other modalities is the reliance on a vocabulary, a mapping from tokens to integers. This vocabulary is often built from the training corpus or loaded from a pre-trained model's files.
#include <xtorch/xtorch.h>
#include <iostream>
#include <memory>
#include <string>

int main() {
    // In a real application, you would load a vocabulary from a file.
    // For example, for a BERT model, this would be a 'vocab.txt' file.
    std::string vocab_path = "path/to/bert/vocab.txt";

    // 1. Define a pipeline of text transformations.
    auto text_pipeline = std::make_unique<xt::transforms::Compose>(
        // Tokenize the input string using the BERT WordPiece tokenizer
        std::make_shared<xt::transforms::text::BertTokenizer>(/*vocab_path=*/vocab_path),
        // Truncate to 510 tokens so the sequence stays within BERT's
        // 512-token limit once [CLS] and [SEP] are added below
        std::make_shared<xt::transforms::text::Truncate>(510),
        // Add special tokens like [CLS] and [SEP]
        std::make_shared<xt::transforms::text::AddToken>("[CLS]", /*at_beginning=*/true),
        std::make_shared<xt::transforms::text::AddToken>("[SEP]", /*at_beginning=*/false)
        // Note: Padding is often handled by the DataLoader's collate function,
        // but can also be a transform.
    );

    // 2. Pass the pipeline to an NLP Dataset
    auto dataset = xt::datasets::IMDB(
        "./data",
        xt::datasets::DataMode::TRAIN,
        /*download=*/true,
        std::move(text_pipeline)
    );

    // 3. The DataLoader will now yield batches of tokenized and numericalized text
    xt::dataloaders::ExtendedDataLoader data_loader(dataset, 16);
    // ...
}

Available Transforms by Category
Tokenizers
These modules are responsible for the first step: converting a string into a sequence of tokens.
| Transform | Description | Header File |
|---|---|---|
| BertTokenizer | Implements the WordPiece tokenization algorithm used by BERT. It requires a vocab.txt file. | bert_tokenizer.h |
| SentencePieceTokenizer | A tokenizer that uses the SentencePiece library, common in models like XLNet and T5. It requires a spm.model file. | sentence_piece_tokenizer.h |
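To make the WordPiece idea concrete, here is a minimal sketch of greedy longest-match-first splitting for a single word. It is an illustration in plain C++, not the BertTokenizer implementation: `wordpiece_split` is a hypothetical name, the vocabulary is passed in as a set, and the code works on bytes, so it assumes ASCII input and skips the lowercasing and punctuation handling a full tokenizer performs.

```cpp
#include <cstddef>
#include <string>
#include <unordered_set>
#include <vector>

// Sketch of greedy longest-match-first WordPiece splitting for one word.
// Pieces that continue a word carry the "##" prefix, as in BERT vocabularies.
std::vector<std::string> wordpiece_split(
    const std::string& word,
    const std::unordered_set<std::string>& vocab,
    const std::string& unk_token = "[UNK]") {
    std::vector<std::string> pieces;
    std::size_t start = 0;
    while (start < word.size()) {
        std::size_t end = word.size();
        std::string match;
        // Shrink the candidate from the right until it is in the vocabulary.
        while (end > start) {
            std::string candidate = word.substr(start, end - start);
            if (start > 0) {
                candidate = "##" + candidate;
            }
            if (vocab.count(candidate) > 0) {
                match = candidate;
                break;
            }
            --end;
        }
        if (match.empty()) {
            // No vocabulary entry fits here: the whole word is unknown.
            return {unk_token};
        }
        pieces.push_back(match);
        start = end;
    }
    return pieces;
}
```

With the vocabulary `{"un", "##aff", "##able"}`, the word "unaffable" splits into `un`, `##aff`, `##able`, while a word with no matching pieces collapses to the single token `[UNK]`.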
Vocabulary and Numericalization
These transforms handle the conversion between tokens and integer IDs.
| Transform | Description | Header File |
|---|---|---|
| VocabTransform | A transform that takes a vocabulary object and converts a sequence of string tokens into a sequence of integer IDs. | vocab_transform.h |
| StrToIntTransform | A lower-level transform for converting strings to integers, often used internally by VocabTransform. | str_to_int_transform.h |
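When no pre-trained vocabulary file is available, the mapping is usually built from the training corpus. The sketch below shows one common approach, assuming the corpus is already tokenized: special tokens get the lowest IDs, and the remaining tokens are ordered by descending frequency. The `Vocab` struct and `build_vocab` function are hypothetical and do not reflect the vocabulary object that VocabTransform expects.

```cpp
#include <algorithm>
#include <cstdint>
#include <map>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical vocabulary type (not the xTorch API): a two-way mapping
// between tokens and integer IDs.
struct Vocab {
    std::unordered_map<std::string, std::int64_t> token_to_id;
    std::vector<std::string> id_to_token;
};

Vocab build_vocab(const std::vector<std::vector<std::string>>& corpus,
                  const std::vector<std::string>& specials) {
    using Count = std::pair<std::string, std::int64_t>;

    // Count how often each token occurs in the corpus.
    std::map<std::string, std::int64_t> counts;
    for (const auto& sentence : corpus) {
        for (const auto& token : sentence) {
            ++counts[token];
        }
    }

    // Sort by descending frequency. std::map iterates in alphabetical order
    // and stable_sort preserves it, so ties are broken alphabetically.
    std::vector<Count> sorted(counts.begin(), counts.end());
    std::stable_sort(sorted.begin(), sorted.end(),
                     [](const Count& a, const Count& b) { return a.second > b.second; });

    Vocab vocab;
    auto add = [&vocab](const std::string& token) {
        const auto next_id = static_cast<std::int64_t>(vocab.id_to_token.size());
        // emplace returns false in .second if the token is already present.
        if (vocab.token_to_id.emplace(token, next_id).second) {
            vocab.id_to_token.push_back(token);
        }
    };
    for (const auto& token : specials) add(token);   // special tokens first
    for (const auto& entry : sorted) add(entry.first);
    return vocab;
}
```

Keeping `id_to_token` alongside `token_to_id` makes it cheap to convert model output back into readable tokens.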
Sequence Utilities
These transforms are used to format the token sequences to meet the model's requirements.
| Transform | Description | Header File |
|---|---|---|
| PadTransform | Pads a sequence to a specified length with a given padding token ID. | pad_transform.h |
| Truncate | Truncates a sequence to a maximum specified length. | truncate.h |
| AddToken | Adds a special token (e.g., [CLS], [SEP], [EOS]) to the beginning or end of a sequence. | add_token.h |
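The effect of these three utilities on a sequence of IDs can be sketched with plain vectors. The functions below are hypothetical stand-ins written for illustration; they operate on integer IDs and are not the signatures of the xTorch transforms.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

using Ids = std::vector<std::int64_t>;

// Keep at most max_len IDs, dropping the tail.
Ids truncate(Ids ids, std::size_t max_len) {
    if (ids.size() > max_len) {
        ids.resize(max_len);
    }
    return ids;
}

// Insert one special-token ID at the beginning or the end.
Ids add_token(Ids ids, std::int64_t token_id, bool at_beginning) {
    if (at_beginning) {
        ids.insert(ids.begin(), token_id);
    } else {
        ids.push_back(token_id);
    }
    return ids;
}

// Extend the sequence to target_len with pad_id; longer sequences are unchanged.
Ids pad(Ids ids, std::size_t target_len, std::int64_t pad_id) {
    if (ids.size() < target_len) {
        ids.resize(target_len, pad_id);
    }
    return ids;
}
```

For example, taking `{7, 8, 9, 10, 11}`, truncating to 3, adding ID 101 at the beginning and 102 at the end, and padding to length 8 with 0 gives `{101, 7, 8, 9, 102, 0, 0, 0}`. The IDs are illustrative. Order matters: truncating before adding special tokens means the truncation length must leave room for them.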
Data Augmentation
These transforms modify the input text to create new training samples, which can help improve model robustness.
| Transform | Description | Header File |
|---|---|---|
| SynonymReplacement | Randomly replaces words in a sentence with their synonyms. | synonym_replacement.h |
| BackTranslation | Augments text by translating it to another language and then translating it back to the original language. (Note: May require an external translation API). | back_translation.h |
| TextStyleTransfer | A transform for altering the style of the text. | text_style_transfer.h |
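As an illustration of the simplest of these techniques, the following sketch replaces words with synonyms at random. It assumes a caller-supplied synonym dictionary and a per-word replacement probability; `replace_synonyms` is a hypothetical function and does not describe how SynonymReplacement is implemented or configured in xTorch.

```cpp
#include <cstddef>
#include <random>
#include <sstream>
#include <string>
#include <unordered_map>
#include <vector>

// Sketch of synonym replacement: each word that has an entry in the synonym
// dictionary is swapped for a randomly chosen synonym with the given probability.
std::string replace_synonyms(
    const std::string& sentence,
    const std::unordered_map<std::string, std::vector<std::string>>& synonyms,
    double probability,
    std::mt19937& rng) {
    std::istringstream stream(sentence);
    std::bernoulli_distribution should_replace(probability);
    std::string word;
    std::string result;
    while (stream >> word) {
        auto it = synonyms.find(word);
        if (it != synonyms.end() && !it->second.empty() && should_replace(rng)) {
            std::uniform_int_distribution<std::size_t> pick(0, it->second.size() - 1);
            word = it->second[pick(rng)];
        }
        if (!result.empty()) {
            result += ' ';
        }
        result += word;
    }
    return result;
}
```

Passing the random engine in by reference lets the caller seed it, which keeps augmented datasets reproducible. Note that this sketch splits on whitespace, so runs of spaces in the input are collapsed to a single space in the output.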
