Datasets
The Dataset classes are responsible for accessing and preprocessing individual samples of data from a source (like a directory of images or a text file). They are the foundation of the data loading pipeline in xTorch.
A Dataset object is passed to a DataLoader, which then handles the more complex logic of batching, shuffling, and parallel data fetching.
The xt::datasets::Dataset Base Class
All dataset classes in xTorch, both built-in and custom, inherit from a common base class: xt::datasets::Dataset. This class establishes a standard interface that the DataLoader knows how to interact with.
The two core methods of any dataset are:
get(size_t index): Returns the data sample at the given index. This is typically atorch::data::Example<>containing a data tensor and a target tensor.size(): Returns the total number of samples in the dataset as anoptional<size_t>.
Standard Usage
The typical workflow involves three steps:
- Define Transforms: Create a pipeline of data augmentation or preprocessing steps (optional).
- Instantiate a Dataset: Create an instance of a specific dataset class, providing the path to the data and the transform pipeline.
- Pass to a DataLoader: Pass the initialized dataset object to the
ExtendedDataLoader.
#include <xtorch/xtorch.h>
int main() {
// 1. Define a transform pipeline
auto compose = std::make_unique<xt::transforms::Compose>(
std::make_shared<xt::transforms::general::Normalize>(0.5, 0.5)
);
// 2. Instantiate a built-in dataset for CIFAR-10
auto cifar_dataset = xt::datasets::CIFAR10(
"./data",
xt::datasets::DataMode::TRAIN,
/*download=*/true,
std::move(compose)
);
// The dataset can report its size
std::cout << "CIFAR-10 training set size: " << *cifar_dataset.size() << std::endl;
// 3. Pass the dataset to a data loader
xt::dataloaders::ExtendedDataLoader data_loader(cifar_dataset, 64, true);
// The data loader can now be used for training
for (auto& batch : data_loader) {
// ...
}
}Built-in Datasets by Domain
xTorch provides an extensive collection of built-in dataset handlers for many popular public datasets, saving you the effort of writing boilerplate loading code. These are organized by machine learning domain.
Follow the links below for a detailed list of available datasets in each category.
-
Computer Vision: Includes datasets for image classification (
MNIST,CIFAR,ImageNet), object detection (COCO), and segmentation (Cityscapes). -
Natural Language Processing: Includes datasets for text classification (
IMDB,AG_NEWS), question answering (SQuAD), and machine translation (WMT). -
Audio Processing: Includes datasets for speech recognition (
LibriSpeech), sound event classification (UrbanSound8K), and music analysis. -
Time Series: Includes datasets for time series forecasting and classification.
-
Tabular Data: Includes a variety of classic small-scale datasets for classification and regression tasks.
-
Graph Data: Includes datasets for node and graph classification tasks (
Cora). -
General Purpose: Contains flexible dataset classes like
ImageFolderDatasetandCSVDatasetthat allow you to easily load your own custom data without writing a new dataset class from scratch. -
Other Domains: Includes datasets from other domains like biomedical data and recommendation systems.
