Signal (Audio) Transforms

Signal transforms serve two purposes in an audio deep learning pipeline: they convert raw audio waveforms into representations that are better suited to neural networks, and they perform data augmentation to improve model robustness.

Common tasks include converting a time-domain waveform into a time-frequency representation (like a spectrogram) and applying augmentations like pitch shifting or adding background noise.

All signal transforms are located under the xt::transforms::signal namespace and can be found in the <xtorch/transforms/signal/> header directory.

General Usage

Audio transforms are designed to be chained together in a Compose pipeline and passed to an audio Dataset. The dataset will then apply this pipeline to each raw audio waveform it loads.

#include <xtorch/xtorch.h>
#include <iostream>
#include <memory>   // std::make_unique, std::make_shared
#include <utility>  // std::move
 
int main() {
    // 1. Define a pipeline of audio transformations
    // This pipeline converts a raw waveform to a Mel Spectrogram and then applies augmentation.
    auto training_transforms = std::make_unique<xt::transforms::Compose>(
        // Convert the waveform to a Mel Spectrogram
        std::make_shared<xt::transforms::signal::MelSpectrogram>(
            /*sample_rate=*/16000,
            /*n_fft=*/400,
            /*n_mels=*/128
        ),
        // Apply Frequency Masking for data augmentation
        std::make_shared<xt::transforms::signal::FrequencyMasking>(
            /*freq_mask_param=*/80
        ),
        // Apply Time Masking for data augmentation
        std::make_shared<xt::transforms::signal::TimeMasking>(
            /*time_mask_param=*/100
        )
    );
 
    // 2. Pass the pipeline to an audio Dataset
    // The SpeechCommands dataset will now apply these transforms to each audio clip.
    auto dataset = xt::datasets::SpeechCommands(
        "./data",
        xt::datasets::DataMode::TRAIN,
        /*download=*/true,
        std::move(training_transforms)
    );
 
    // The data loader will yield batches of augmented spectrograms
    xt::dataloaders::ExtendedDataLoader data_loader(dataset, 32);
    // ...
}
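
Conceptually, a Compose pipeline applies its transforms in the order they were given, feeding each output into the next. The sketch below illustrates that idea in plain C++ with `std::function`. It is not xtorch's implementation: `Signal`, `Transform`, `compose`, `scale`, and `offset` are hypothetical names used only for this illustration, and a plain vector stands in for a tensor.

```cpp
#include <functional>
#include <utility>
#include <vector>

// Hypothetical stand-ins for illustration only; xtorch operates on tensors.
using Signal = std::vector<float>;
using Transform = std::function<Signal(Signal)>;

// Build a single transform that applies each stage in order, left to right.
Transform compose(std::vector<Transform> stages) {
    return [stages = std::move(stages)](Signal x) {
        for (const auto& stage : stages) {
            x = stage(std::move(x));
        }
        return x;
    };
}

// Toy stage: multiply every sample by a gain.
Transform scale(float gain) {
    return [gain](Signal x) {
        for (auto& v : x) v *= gain;
        return x;
    };
}

// Toy stage: add a constant to every sample.
Transform offset(float bias) {
    return [bias](Signal x) {
        for (auto& v : x) v += bias;
        return x;
    };
}
```

Order matters: `compose({scale(2.0f), offset(1.0f)})` and `compose({offset(1.0f), scale(2.0f)})` give different results. For the same reason, the masking transforms in the pipeline above come after MelSpectrogram, since they operate on a spectrogram rather than on a waveform.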

!!! info "Constructor Options"
    Audio transforms are often highly configurable with parameters like sample_rate, n_fft (FFT window size), hop_length, and n_mels (number of Mel bands). Always refer to the specific header file in <xtorch/transforms/signal/> for a full list of available settings.
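
To see how n_fft and hop_length interact, the sketch below computes the shape of a short-time Fourier transform output. It assumes a one-sided spectrum and either no padding or center padding of n_fft / 2 samples on each side. These are common conventions, but they are not confirmed for xtorch, and `freq_bins` and `num_frames` are hypothetical helper names.

```cpp
#include <cstddef>

// Number of frequency bins in the one-sided spectrum of an n_fft-point FFT.
constexpr std::size_t freq_bins(std::size_t n_fft) {
    return n_fft / 2 + 1;
}

// Number of analysis frames for a signal of num_samples samples.
// Assumes hop_length > 0. With center = true, the signal is assumed to be
// padded by n_fft / 2 samples on each side before framing.
constexpr std::size_t num_frames(std::size_t num_samples, std::size_t n_fft,
                                 std::size_t hop_length, bool center) {
    const std::size_t padded = center ? num_samples + 2 * (n_fft / 2) : num_samples;
    if (padded < n_fft) return 0;  // too short for even one frame
    return 1 + (padded - n_fft) / hop_length;
}
```

Under these assumptions, one second of 16 kHz audio with n_fft = 400 and hop_length = 200 gives 201 frequency bins and 81 frames with center padding, or 79 frames without it. A Mel spectrogram then reduces the 201 bins to n_mels bands.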


Available Transforms by Category

Time-Frequency Representations

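These are the most common preprocessing steps. Most convert a 1D waveform into a 2D, image-like time-frequency representation; InverseMelScale and GriffinLim work in the reverse direction, mapping a Mel spectrogram back to a regular spectrogram and a spectrogram back to a waveform.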

| Transform | Description | Header File |
|---|---|---|
| Spectrogram | Creates a standard spectrogram from a waveform. | spectrogram.h |
| MelSpectrogram | Creates a Mel-scaled spectrogram, which is a perceptually relevant representation of audio. | mel_spectrogram.h |
| MFCC | Mel-Frequency Cepstral Coefficients, a compact representation of the spectral envelope. | mfcc.h |
| InverseMelScale | Converts a Mel-spectrogram to a regular spectrogram. | inverse_mel_scale.h |
| GriffinLim | An algorithm to estimate a waveform from a spectrogram (phase reconstruction). | griffin_lim.h |
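
The Mel scale behind MelSpectrogram and MFCC can be illustrated with the widely used HTK-style formula, m = 2595 log10(1 + f / 700). The sketch below is a standalone illustration of how band centers end up densely packed at low frequencies and sparse at high ones. Which Mel variant xtorch uses is an open assumption here, and the function names are hypothetical.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// HTK-style conversion between frequency in Hz and the Mel scale.
double hz_to_mel(double hz) { return 2595.0 * std::log10(1.0 + hz / 700.0); }
double mel_to_hz(double mel) { return 700.0 * (std::pow(10.0, mel / 2595.0) - 1.0); }

// Center frequencies (in Hz) of n_mels bands spaced evenly on the Mel scale
// between f_min and f_max. The n_mels + 2 evenly spaced Mel points include
// both edges; only the n_mels interior points are returned.
std::vector<double> mel_center_frequencies(std::size_t n_mels, double f_min, double f_max) {
    const double mel_min = hz_to_mel(f_min);
    const double mel_max = hz_to_mel(f_max);
    const double step = (mel_max - mel_min) / static_cast<double>(n_mels + 1);
    std::vector<double> centers;
    centers.reserve(n_mels);
    for (std::size_t i = 1; i <= n_mels; ++i) {
        centers.push_back(mel_to_hz(mel_min + step * static_cast<double>(i)));
    }
    return centers;
}
```

Calling `mel_center_frequencies(128, 0.0, 8000.0)` for 16 kHz audio yields centers whose spacing in Hz grows with frequency, which is what makes the representation perceptually motivated.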

Data Augmentation

These transforms modify the audio to create new training samples, improving model generalization.

| Transform | Description | Header File |
|---|---|---|
| TimeMasking | Randomly masks a range of consecutive time steps in a spectrogram. A key component of SpecAugment. | time_masking.h |
| FrequencyMasking | Randomly masks a range of consecutive frequency channels in a spectrogram. A key component of SpecAugment. | frequency_masking.h |
| AddNoise | Adds random noise to the audio waveform. | add_noise.h |
| BackgroundNoiseAddition | Mixes the audio with random clips from a provided set of background noise files. | background_noise_addition.h |
| PitchShift | Shifts the pitch of the audio up or down without changing the tempo. | pitch_shift.h |
| TimeStretch | Stretches or compresses the audio in time without changing the pitch. | time_stretch.h |
| SpeedPerturbation | Changes the speed of the audio, which affects both pitch and tempo. Commonly used in ASR. | speed_perturbation.h |
| DeReverberation | Applies a de-reverberation effect to the audio. | de_reverberation.h |
| TimeWarping | Applies a non-linear warp along the time axis of a spectrogram. | time_warping.h |
| Vol | Changes the volume of the audio. | vol.h |
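
As an illustration of the masking described for TimeMasking and FrequencyMasking, the sketch below zeroes a random band of consecutive rows (frequency) or columns (time) in a spectrogram stored as a plain 2D vector. The mask width is drawn uniformly from 0 to a maximum and the start position uniformly from the valid range, in the spirit of SpecAugment. This is not the xtorch code; the exact sampling rules and fill value there may differ, and all names here are hypothetical.

```cpp
#include <algorithm>
#include <cstddef>
#include <random>
#include <vector>

using Spectrogram = std::vector<std::vector<float>>;  // indexed [frequency][time]

// A band [start, start + width) of consecutive indices.
struct Band { std::size_t start; std::size_t width; };

// Pick a random band inside [0, size) whose width is at most max_width.
Band random_band(std::size_t size, std::size_t max_width, std::mt19937& rng) {
    const std::size_t limit = std::min(max_width, size);
    const std::size_t width = std::uniform_int_distribution<std::size_t>(0, limit)(rng);
    const std::size_t start = std::uniform_int_distribution<std::size_t>(0, size - width)(rng);
    return {start, width};
}

// Zero out a random band of consecutive frequency channels (rows).
void mask_frequency(Spectrogram& spec, std::size_t max_width, std::mt19937& rng) {
    if (spec.empty()) return;
    const Band band = random_band(spec.size(), max_width, rng);
    for (std::size_t f = band.start; f < band.start + band.width; ++f) {
        std::fill(spec[f].begin(), spec[f].end(), 0.0f);
    }
}

// Zero out a random band of consecutive time steps (columns).
// Assumes every row has the same length.
void mask_time(Spectrogram& spec, std::size_t max_width, std::mt19937& rng) {
    if (spec.empty() || spec[0].empty()) return;
    const Band band = random_band(spec[0].size(), max_width, rng);
    const auto first = static_cast<std::ptrdiff_t>(band.start);
    const auto last = static_cast<std::ptrdiff_t>(band.start + band.width);
    for (auto& row : spec) {
        std::fill(row.begin() + first, row.begin() + last, 0.0f);
    }
}
```

Because the band is drawn fresh on every call, each pass over the same clip sees a different mask, which is what makes these transforms useful as augmentation.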

Other Utility Transforms

| Transform | Description | Header File |
|---|---|---|
| Resample | Resamples the audio waveform from one sampling rate to another. | resample.h |
| SlidingWindowCMN | Sliding-window cepstral mean normalization (CMN), a common technique in speech recognition; implementations often offer variance normalization as well. | sliding_window_cmn.h |
| WaveletTransforms | Applies wavelet transforms to the signal. | wavelet_transforms.h |
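
To make the SlidingWindowCMN entry concrete, the sketch below subtracts from each feature frame the mean of the frames in a window centered on it, truncating the window at the edges of the utterance. It is a simplified illustration under those assumptions (mean only, no variance normalization, no minimum-window handling), not the xtorch implementation, and `sliding_window_cmn` is a hypothetical name.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

using Features = std::vector<std::vector<float>>;  // indexed [frame][coefficient]

// Subtract a per-coefficient running mean computed over a window of roughly
// `window` frames centered on each frame. Assumes all frames have the same
// number of coefficients.
Features sliding_window_cmn(const Features& feats, std::size_t window) {
    Features out = feats;
    const std::size_t n = feats.size();
    if (n == 0 || window == 0) return out;
    const std::size_t half = window / 2;
    for (std::size_t t = 0; t < n; ++t) {
        const std::size_t begin = (t > half) ? t - half : 0;
        const std::size_t end = std::min(n, t + half + 1);  // exclusive
        for (std::size_t c = 0; c < feats[t].size(); ++c) {
            float sum = 0.0f;
            for (std::size_t k = begin; k < end; ++k) sum += feats[k][c];
            out[t][c] = feats[t][c] - sum / static_cast<float>(end - begin);
        }
    }
    return out;
}
```

A constant offset in the features, such as one introduced by a fixed channel or microphone response, is removed entirely by this operation, which is why it is common in speech recognition front ends. The inner loop recomputes each window sum from scratch for clarity; a running sum would avoid the repeated work.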