Other Specialized Datasets

Beyond the primary domains of vision, language, and audio, xTorch also supports datasets from more specialized fields, so researchers and developers can work on a diverse range of tasks through the same data loading interface.

This section covers datasets for:

  • Biomedical Data Analysis
  • Recommendation Systems
  • Reinforcement Learning

General Usage

The usage pattern for these specialized datasets is similar to that of other xTorch datasets: you instantiate the dataset class and pass it to a DataLoader. However, the structure of the data and the required preprocessing can be highly specific to the domain.

For example, reinforcement learning datasets might represent an entire environment, while recommendation system datasets often consist of user-item interaction pairs. Always refer to the specific dataset's documentation or header file for details on the data format.

#include <xtorch/xtorch.h>
#include <iostream>
 
int main() {
    // Example: Loading the MovieLens dataset for a recommendation task
    auto dataset = xt::datasets::MovieLens(
        "./data",
        /*download=*/true
    );
 
    std::cout << "MovieLens dataset loaded." << std::endl;
    std::cout << "Number of ratings: " << *dataset.size() << std::endl;
 
    // The data loader will provide batches of user-item-rating triplets
    xt::dataloaders::ExtendedDataLoader data_loader(dataset, 256, true);
 
    for (auto& batch : data_loader) {
        // The batch layout depends on the dataset; check its implementation.
        // For MovieLens, it could be tensors of user IDs, item IDs, and ratings.
        // auto user_ids = batch.first;
        // auto item_ids = batch.second;
        // auto ratings = ...
    }
}
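
To make the "user-item-rating triplets" idea concrete, here is a minimal standard-library sketch of how such records might be grouped into column-wise batches. It does not use xTorch: `Rating`, `RatingBatch`, and `make_batches` are hypothetical names introduced for illustration, and the actual batch layout returned by the MovieLens loader may differ.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical record type: one user-item interaction with an explicit rating.
struct Rating {
    std::int64_t user_id;
    std::int64_t item_id;
    float rating;
};

// Column-wise view of one batch: parallel arrays of user IDs, item IDs, ratings.
struct RatingBatch {
    std::vector<std::int64_t> user_ids;
    std::vector<std::int64_t> item_ids;
    std::vector<float> ratings;
};

// Split the records into consecutive batches of at most batch_size rows.
// The last batch is shorter when the record count is not a multiple of batch_size.
std::vector<RatingBatch> make_batches(const std::vector<Rating>& records,
                                      std::size_t batch_size) {
    assert(batch_size > 0);
    std::vector<RatingBatch> batches;
    for (std::size_t start = 0; start < records.size(); start += batch_size) {
        const std::size_t end = std::min(start + batch_size, records.size());
        RatingBatch batch;
        for (std::size_t i = start; i < end; ++i) {
            batch.user_ids.push_back(records[i].user_id);
            batch.item_ids.push_back(records[i].item_id);
            batch.ratings.push_back(records[i].rating);
        }
        batches.push_back(std::move(batch));
    }
    return batches;
}
```

Keeping the three columns as parallel arrays mirrors how a recommender model typically consumes them: user and item IDs index embedding tables, while ratings serve as regression targets. This sketch keeps the records in their original order and does no shuffling.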

Available Datasets by Domain

Biomedical Data

These datasets are used for tasks like disease classification and genomic analysis.

| Dataset Class | Description | Header File |
|---|---|---|
| ADNI | The Alzheimer's Disease Neuroimaging Initiative dataset, used for classifying stages of Alzheimer's from medical imaging and clinical data. | biomedical_data/alzheimers_classification/adni.h |
| TCGA | The Cancer Genome Atlas (TCGA) dataset, containing genomic and clinical data for cancer research. | biomedical_data/cancer_genomics_classification/tcga.h |

Recommendation Systems

These datasets contain user-item interaction data (e.g., ratings, reviews) and are used to train recommender models.

| Dataset Class | Description | Header File |
|---|---|---|
| MovieLens | A classic dataset family containing movie ratings from users. Different versions (e.g., 100K, 1M, 20M) are available. | recommendation_systems/recommendation/movie_lens.h |
| AmazonProductReviews | A large dataset of product reviews from Amazon, useful for training recommendation and sentiment analysis models. | recommendation_systems/recommendation/amazon_product_reviews.h |

Reinforcement Learning

These are not traditional datasets but rather environments or collections of recorded experiences used to train reinforcement learning agents.

| Dataset Class | Description | Header File |
|---|---|---|
| Atari2600ALE | Provides an interface to the Arcade Learning Environment (ALE), allowing agents to be trained on a wide variety of Atari 2600 games. | reinforcement_learning/reinforcement_learning/atari_2600_ale.h |
| MuJoCoGym | Provides an interface to continuous control environments from OpenAI Gym powered by the MuJoCo physics engine (e.g., Hopper, Walker, Humanoid). | reinforcement_learning/continuous_control/mu_jo_co_gym.h |
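
As an illustration of what a "collection of recorded experiences" can look like, the following standard-library sketch stores transitions in a fixed-capacity buffer and samples from it uniformly. It is not part of xTorch: `Transition` and `ReplayBuffer` are hypothetical names, and the classes listed above may expose their data quite differently.

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <utility>
#include <vector>

// Hypothetical transition record: (state, action, reward, next_state, done).
struct Transition {
    std::vector<float> state;
    int action;
    float reward;
    std::vector<float> next_state;
    bool done;
};

// Fixed-capacity buffer that overwrites the oldest transition once full.
class ReplayBuffer {
public:
    explicit ReplayBuffer(std::size_t capacity) : capacity_(capacity) {
        assert(capacity > 0);
    }

    // Append a transition, replacing the oldest entry when at capacity.
    void push(Transition t) {
        if (storage_.size() < capacity_) {
            storage_.push_back(std::move(t));
        } else {
            storage_[next_] = std::move(t);
        }
        next_ = (next_ + 1) % capacity_;
    }

    std::size_t size() const { return storage_.size(); }

    // Draw batch_size transitions uniformly at random, with replacement.
    std::vector<Transition> sample(std::size_t batch_size,
                                   std::mt19937& rng) const {
        assert(!storage_.empty());
        std::uniform_int_distribution<std::size_t> pick(0, storage_.size() - 1);
        std::vector<Transition> batch;
        batch.reserve(batch_size);
        for (std::size_t i = 0; i < batch_size; ++i) {
            batch.push_back(storage_[pick(rng)]);
        }
        return batch;
    }

private:
    std::size_t capacity_;
    std::size_t next_ = 0;  // Slot that the next push writes to once full.
    std::vector<Transition> storage_;
};
```

Passing the random engine in by reference lets the caller fix a seed for reproducible sampling. An environment-style class, by contrast, would generate these transitions on demand rather than read them from storage.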