# Video Transforms

Video data adds a temporal dimension to the challenges of computer vision. A video is a sequence of image frames, and processing this data requires handling both the spatial content of each frame and the temporal relationships between them.

Video transforms are designed to operate on these sequences of frames. They are essential for preparing video clips for input to models like video classifiers or action recognition networks.

All video transforms live in the xt::transforms::video namespace and can be found in the <xtorch/transforms/video/> header directory.

## General Usage

Video transforms are used in a Compose pipeline, just like image transforms. However, they expect their input to be a tensor with a temporal dimension, typically with a shape like [Time, Channels, Height, Width].

A common workflow involves:

  1. Loading a video and decoding it into a sequence of frames.
  2. Applying temporal transforms to select or sample frames.
  3. Applying spatial (image) transforms to each of the selected frames.

The following example puts these steps together:

```cpp
#include <xtorch/xtorch.h>
 
int main() {
    // 1. Define a pipeline of video and image transformations
    auto video_pipeline = std::make_unique<xt::transforms::Compose>(
        // --- Temporal Transforms ---
        // Sample 16 frames uniformly from the video clip
        std::make_shared<xt::transforms::video::UniformTemporalSubsample>(16),
        // Randomly reverse the sequence of the 16 frames with a 50% probability
        std::make_shared<xt::transforms::video::RandomClipReverse>(0.5),
 
        // --- Spatial Transforms (applied to each frame) ---
        // Note: You would typically wrap spatial transforms to apply them per-frame.
        // For simplicity, we assume the dataset handles this application logic.
        std::make_shared<xt::transforms::image::Resize>(std::vector<int64_t>{128, 128}),
        std::make_shared<xt::transforms::general::Normalize>(
            std::vector<float>{0.5, 0.5, 0.5},
            std::vector<float>{0.5, 0.5, 0.5}
        )
    );
 
    // 2. Pass the pipeline to a video Dataset
    // The dataset will apply these transforms to each video it loads.
    // auto dataset = xt::datasets::UCF101("./data", std::move(video_pipeline));
 
    // 3. The DataLoader will yield batches of processed video clips
    // xt::dataloaders::ExtendedDataLoader data_loader(dataset, 8);
    // for (auto& batch : data_loader) {
    //     auto video_clips = batch.first; // Shape e.g., [8, 16, 3, 128, 128] for RGB clips
    // }
}
```

## Available Video Transforms

xTorch provides the following transforms for video data:

### Temporal Transforms

These transforms operate along the time dimension of a video clip.

| Transform | Description | Header File |
|-----------|-------------|-------------|
| UniformTemporalSubsample | Subsamples a fixed number of frames from a video clip at a uniform frame rate. This is a common way to create a fixed-size input from videos of varying lengths. | uniform_temporal_subsample.h |
| RandomClipReverse | Randomly reverses the order of frames in a clip with a given probability. A simple form of temporal data augmentation. | random_clip_reverse.h |
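
As a rough illustration of what these two transforms do, here is a minimal standalone sketch of the underlying tensor operations, assuming a libtorch-style [T, C, H, W] clip tensor; the free functions are hypothetical and not the actual xTorch implementations:

```cpp
#include <torch/torch.h>

// Pick num_samples frame indices evenly spaced over the clip's time
// dimension (dim 0 of a [T, C, H, W] tensor) and gather those frames.
torch::Tensor uniform_temporal_subsample(const torch::Tensor& clip,
                                         int64_t num_samples) {
    auto indices = torch::linspace(0, clip.size(0) - 1, num_samples)
                       .round()
                       .to(torch::kLong);
    return clip.index_select(0, indices);
}

// Reverse the time dimension with probability p, otherwise pass through.
torch::Tensor random_clip_reverse(const torch::Tensor& clip, double p) {
    if (torch::rand({1}).item<double>() < p) {
        return torch::flip(clip, {0});
    }
    return clip;
}
```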

### Spatio-Temporal Transforms

These transforms modify both the spatial and temporal aspects of the video.

| Transform | Description | Header File |
|-----------|-------------|-------------|
| FrameInterpolation | Generates intermediate frames between existing ones, which can be used to increase the frame rate or for slow-motion effects. | frame_interpolation.h |
| OpticalFlowWarping | Warps video frames based on calculated optical flow, a technique used in video compression and frame rate conversion. | optical_flow_warping.h |
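
The table does not specify FrameInterpolation's algorithm; as a toy illustration of the idea, the sketch below doubles a clip's frame count by inserting a linearly blended frame between each consecutive pair. Real interpolators typically rely on motion estimation rather than naive blending:

```cpp
#include <torch/torch.h>

// Naive frame interpolation sketch: blend each pair of neighbouring
// frames to synthesize one in-between frame, turning T frames into
// 2T - 1 frames. clip has shape [T, C, H, W].
torch::Tensor blend_interpolate(const torch::Tensor& clip) {
    auto prev = clip.slice(0, 0, clip.size(0) - 1);   // frames 0 .. T-2
    auto next = clip.slice(0, 1, clip.size(0));       // frames 1 .. T-1
    auto mid  = 0.5 * prev + 0.5 * next;              // blended in-betweens

    // Interleave: prev0, mid0, prev1, mid1, ..., then append the last frame.
    auto interleaved = torch::stack({prev, mid}, 1)   // [T-1, 2, C, H, W]
                           .flatten(0, 1);            // [2(T-1), C, H, W]
    return torch::cat({interleaved, clip.slice(0, clip.size(0) - 1)}, 0);
}
```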
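
Flow-based warping is commonly implemented as a displaced resampling of the source frame. The following hedged sketch shows that pattern with libtorch's grid_sample; it assumes a flow tensor of per-pixel (dx, dy) displacements in pixel units and is not drawn from the actual OpticalFlowWarping code:

```cpp
#include <torch/torch.h>

// Warp a frame [1, C, H, W] by an optical-flow field [1, 2, H, W]:
// build a sampling grid of displaced pixel coordinates, normalize it
// to [-1, 1], and bilinearly resample the frame at those positions.
torch::Tensor warp_with_flow(const torch::Tensor& frame,
                             const torch::Tensor& flow) {
    int64_t h = frame.size(2), w = frame.size(3);

    auto ys = torch::arange(h, torch::kFloat).view({1, h, 1}).expand({1, h, w});
    auto xs = torch::arange(w, torch::kFloat).view({1, 1, w}).expand({1, h, w});

    // Displace the base grid by the flow, then normalize to [-1, 1].
    auto x = (xs + flow.select(1, 0)) / static_cast<float>(w - 1) * 2 - 1;
    auto y = (ys + flow.select(1, 1)) / static_cast<float>(h - 1) * 2 - 1;
    auto grid = torch::stack({x, y}, 3);  // [1, H, W, 2], (x, y) order

    return torch::nn::functional::grid_sample(
        frame, grid,
        torch::nn::functional::GridSampleFuncOptions().align_corners(true));
}
```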

!!! info "Applying Image Transforms to Videos"
    To apply a standard image transform (like RandomCrop or ColorJitter) to every frame of a video, you typically need to iterate over the time dimension and apply the transform to each [C, H, W] frame individually; a hypothetical helper for this is sketched below. The Compose applier can be used to chain these operations. Support for per-frame application wrappers may be included in the future.
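
Until such a wrapper exists, a hand-rolled applier is only a few lines. The helper below is hypothetical (not part of xTorch) and accepts any callable that maps a [C, H, W] tensor to a transformed [C, H, W] tensor:

```cpp
#include <torch/torch.h>
#include <functional>
#include <vector>

// Hypothetical helper: apply a [C, H, W] -> [C, H, W] transform to every
// frame of a [T, C, H, W] clip, then re-stack along the time dimension.
torch::Tensor apply_per_frame(
    const torch::Tensor& clip,
    const std::function<torch::Tensor(const torch::Tensor&)>& transform) {
    std::vector<torch::Tensor> frames;
    frames.reserve(clip.size(0));
    for (int64_t t = 0; t < clip.size(0); ++t) {
        frames.push_back(transform(clip[t]));
    }
    return torch::stack(frames, 0);
}
```

Note that re-stacking assumes every transformed frame has the same shape, which holds for deterministic transforms like Resize; for random augmentations, sample the parameters once per clip rather than per frame so that all frames are transformed consistently.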