xTorch provides a rich collection of built-in dataset handlers for a wide range of audio and speech processing tasks, from recognition and classification to synthesis and source separation.
All audio datasets are located under the xt::datasets namespace and can be found in the <xtorch/datasets/audio_processing/> header directory.
General Usage
The workflow for using an audio dataset is similar to that of other domains. You typically instantiate the dataset class with the path to your data and an optional pipeline of audio-specific transformations.
#include <xtorch/xtorch.h>int main() { // 1. Define audio-specific transformations (optional) // For example, creating a Mel Spectrogram from the raw audio waveform. auto transforms = std::make_unique<xt::transforms::Compose>( std::make_shared<xt::transforms::signal::MelSpectrogram>() ); // 2. Instantiate a dataset for the Speech Commands task // The dataset will handle downloading and loading the data. auto dataset = xt::datasets::SpeechCommands( "./data", xt::datasets::DataMode::TRAIN, /*download=*/true, std::move(transforms) ); std::cout << "Speech Commands dataset size: " << *dataset.size() << std::endl; // 3. Pass the dataset to a DataLoader xt::dataloaders::ExtendedDataLoader data_loader(dataset, 32, true); // The data loader is now ready for use in a training loop for (auto& batch : data_loader) { auto spectrograms = batch.first; auto labels = batch.second; // ... training step ... }}
!!! info "Dataset Constructors"
Most dataset constructors follow a standard pattern:
DatasetName(const std::string& root, DataMode mode, bool download, TransformPtr transforms)
root: The directory where the data is stored or will be downloaded.
mode: DataMode::TRAIN, DataMode::TEST, or DataMode::VALIDATION.
download: If true, the dataset will be downloaded if not found in the root directory.
transforms: A unique_ptr to a transform pipeline to be applied to the data.
Available Datasets by Task
Audio Event Detection
Dataset Class
Description
Header File
AudioSet
A large-scale dataset of manually annotated audio events from YouTube videos.
audio_event_detection/audioset.h
Binary Speech Classification
Dataset Class
Description
Header File
YesNo
A small dataset of speech recordings saying "yes" or "no".
binary_speech_classification/yes_no.h
Emotion Recognition
Dataset Class
Description
Header File
IEMOCAP
Interactive Emotional Dyadic Motion Capture database for emotion analysis.
emotion_recognition/iemocap.h
Environmental Sound Classification
Dataset Class
Description
Header File
ESC
Dataset for Environmental Sound Classification (ESC-50 and ESC-10).
environmental_sound_classification/esc.h
UrbanSound
The UrbanSound8K dataset, containing urban sound recordings.
environmental_sound_classification/urban_sound.h
Intent Classification
Dataset Class
Description
Header File
Snips
The Snips Natural Language Understanding benchmark dataset.
intent_classification/snips.h
Music Genre Classification
Dataset Class
Description
Header File
GTZAN
A popular dataset for music genre recognition.
music_genre_classification/gtzan.h
Music Information Retrieval
Dataset Class
Description
Header File
MillionSongDataset
A freely-available collection of audio features and metadata for a million contemporary popular music tracks.