Audio & Speech Datasets

xTorch provides a rich collection of built-in dataset handlers for a wide range of audio and speech processing tasks, from recognition and classification to synthesis and source separation.

All audio datasets live in the `xt::datasets` namespace, and their headers are located under the `<xtorch/datasets/audio_processing/>` directory.

General Usage

The workflow for using an audio dataset is similar to that of other domains. You typically instantiate the dataset class with the path to your data and an optional pipeline of audio-specific transformations.

```cpp
#include <xtorch/xtorch.h>
#include <iostream>

int main() {
    // 1. Define audio-specific transformations (optional).
    // For example, compute a Mel spectrogram from the raw audio waveform.
    auto transforms = std::make_unique<xt::transforms::Compose>(
        std::make_shared<xt::transforms::signal::MelSpectrogram>()
    );

    // 2. Instantiate a dataset for the Speech Commands task.
    // The dataset handles downloading and loading the data.
    auto dataset = xt::datasets::SpeechCommands(
        "./data",
        xt::datasets::DataMode::TRAIN,
        /*download=*/true,
        std::move(transforms)
    );

    std::cout << "Speech Commands dataset size: " << *dataset.size() << std::endl;

    // 3. Pass the dataset to a DataLoader (batch size 32, shuffle enabled).
    xt::dataloaders::ExtendedDataLoader data_loader(dataset, 32, true);

    // The data loader is now ready for use in a training loop.
    for (auto& batch : data_loader) {
        auto spectrograms = batch.first;
        auto labels = batch.second;
        // ... training step ...
    }
}
```

!!! info "Dataset Constructors"
    Most dataset constructors follow a standard pattern (a usage sketch follows this note):

    `DatasetName(const std::string& root, DataMode mode, bool download, TransformPtr transforms)`

    - `root`: the directory where the data is stored or will be downloaded to.
    - `mode`: `DataMode::TRAIN`, `DataMode::TEST`, or `DataMode::VALIDATION`.
    - `download`: if `true`, the dataset is downloaded when it is not found in the root directory.
    - `transforms`: a `unique_ptr` to a transform pipeline applied to the data.
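As a concrete illustration of this pattern, the sketch below loads the held-out test split of `SpeechCommands`. It assumes that passing `nullptr` for the transform argument is accepted when no preprocessing is required; if your build expects a pipeline object, substitute an empty `xt::transforms::Compose` instead.

```cpp
#include <xtorch/xtorch.h>

int main() {
    // Same constructor pattern as above, different split: load the TEST partition.
    // Assumption: a null transform pointer means "no preprocessing".
    auto test_set = xt::datasets::SpeechCommands(
        "./data",                      // root directory (data already downloaded)
        xt::datasets::DataMode::TEST,  // evaluation split
        /*download=*/false,            // do not re-download
        /*transforms=*/nullptr
    );
}
```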

Available Datasets by Task

Audio Event Detection

| Dataset Class | Description | Header File |
| --- | --- | --- |
| `AudioSet` | A large-scale dataset of manually annotated audio events from YouTube videos. | `audio_event_detection/audioset.h` |

Binary Speech Classification

| Dataset Class | Description | Header File |
| --- | --- | --- |
| `YesNo` | A small dataset of speech recordings of the words "yes" and "no". | `binary_speech_classification/yes_no.h` |

Emotion Recognition

| Dataset Class | Description | Header File |
| --- | --- | --- |
| `IEMOCAP` | The Interactive Emotional Dyadic Motion Capture database for emotion analysis. | `emotion_recognition/iemocap.h` |

Environmental Sound Classification

| Dataset Class | Description | Header File |
| --- | --- | --- |
| `ESC` | Dataset for Environmental Sound Classification (ESC-50 and ESC-10). | `environmental_sound_classification/esc.h` |
| `UrbanSound` | The UrbanSound8K dataset of labeled urban sound recordings. | `environmental_sound_classification/urban_sound.h` |
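Classification-style handlers such as `ESC` plug into the same workflow shown under General Usage. The sketch below is hedged: it assumes the standard four-argument constructor and `(spectrogram, label)` batches; check `environmental_sound_classification/esc.h` for dataset-specific options such as selecting ESC-10 versus ESC-50.

```cpp
#include <xtorch/xtorch.h>

int main() {
    // Sketch: ESC clips converted to Mel spectrograms for classification.
    // Assumes the standard constructor pattern described above.
    auto transforms = std::make_unique<xt::transforms::Compose>(
        std::make_shared<xt::transforms::signal::MelSpectrogram>()
    );

    auto esc = xt::datasets::ESC(
        "./data",
        xt::datasets::DataMode::TRAIN,
        /*download=*/true,
        std::move(transforms)
    );

    // Batches of (spectrogram, class label) pairs, as in the General Usage example.
    xt::dataloaders::ExtendedDataLoader loader(esc, 16, /*shuffle=*/true);
}
```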

Intent Classification

| Dataset Class | Description | Header File |
| --- | --- | --- |
| `Snips` | The Snips Natural Language Understanding benchmark dataset. | `intent_classification/snips.h` |

Music Genre Classification

| Dataset Class | Description | Header File |
| --- | --- | --- |
| `GTZAN` | A popular dataset for music genre recognition. | `music_genre_classification/gtzan.h` |

Music Information Retrieval

| Dataset Class | Description | Header File |
| --- | --- | --- |
| `MillionSongDataset` | A freely available collection of audio features and metadata for a million contemporary popular music tracks. | `music_information_retrieval/million_song_dataset.h` |

Music Source Separation

| Dataset Class | Description | Header File |
| --- | --- | --- |
| `MUSDBHQ` | A high-quality dataset for music source separation. | `music_source_separation/mus_db_hq.h` |

Music Tagging

| Dataset Class | Description | Header File |
| --- | --- | --- |
| `MagnaTagATune` | A dataset of audio clips with associated tags. | `music_tagging/magna_tag_a_tune.h` |

Sound Event Detection

| Dataset Class | Description | Header File |
| --- | --- | --- |
| `FSD50K` | An open dataset of human-labeled sound events. | `sound_event_detection/fsd50k.h` |

Speaker Identification and Verification

| Dataset Class | Description | Header File |
| --- | --- | --- |
| `VoxCeleb` | An audio-visual dataset of short clips of human speech, extracted from interview videos uploaded to YouTube. | `speaker_identification_and_verification/vox_celeb.h` |

Speech Command Recognition

| Dataset Class | Description | Header File |
| --- | --- | --- |
| `FluentSpeechCommands` | An audio dataset for spoken language understanding. | `speech_command_recognition/fluent_speech_commands.h` |
| `SpeechCommands` | A dataset of one-second audio clips of people saying thirty-five different words. | `speech_command_recognition/speech_commands.h` |

Speech Recognition

| Dataset Class | Description | Header File |
| --- | --- | --- |
| `CommonVoice` | A large, multi-language dataset of transcribed speech. | `speech_recognition/common_voice.h` |
| `LibriSpeech` | A corpus of read English speech suitable for training and evaluating speech recognition systems. | `speech_recognition/librispeech.h` |
| `TEDLIUM` | Audio recordings of TED talks with transcriptions. | `speech_recognition/tedlium.h` |
| `TIMIT` | A corpus of phonemically and lexically transcribed speech of American English speakers. | `speech_recognition/timit.h` |
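Speech recognition corpora such as `LibriSpeech` can be wired up in the same way; the main difference is that the target of each sample is a transcription rather than a class index. The sketch below assumes the standard four-argument constructor and a null transform pointer; consult `speech_recognition/librispeech.h` for the exact form of the transcript target and any subset-selection options before writing the training step.

```cpp
#include <xtorch/xtorch.h>

int main() {
    // Sketch: load LibriSpeech for ASR training, assuming the standard
    // constructor pattern. Targets are transcriptions, so check the header
    // for how they are encoded.
    auto librispeech = xt::datasets::LibriSpeech(
        "./data",
        xt::datasets::DataMode::TRAIN,
        /*download=*/true,
        /*transforms=*/nullptr
    );

    xt::dataloaders::ExtendedDataLoader loader(librispeech, 8, /*shuffle=*/true);
}
```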

Speech Separation

| Dataset Class | Description | Header File |
| --- | --- | --- |
| `LibriMix` | A dataset for source separation derived from LibriSpeech. | `speech_separation/libri_mix.h` |

Speech Synthesis

| Dataset Class | Description | Header File |
| --- | --- | --- |
| `CMUArctic` | Speech synthesis databases from Carnegie Mellon University. | `speech_synthesis/cmu_arctic.h` |
| `VCTK092` | Speech data uttered by 110 English speakers with various accents. | `speech_synthesis/vctk_092.h` |
| `LJSpeech` | A public-domain speech dataset of 13,100 short clips of a single speaker reading passages from 7 non-fiction books. | `speech_synthesis/lj_speech.h` |
| `LibriTTS` | A large-scale, multi-speaker English corpus designed for TTS research. | `speech_synthesis/libritts.h` |