# Computer Vision Models
xTorch provides a rich and diverse model zoo for a wide array of computer vision tasks. These pre-built models allow you to quickly apply powerful, state-of-the-art architectures to your data.
All computer vision models live under the `xt::models` namespace, and their headers can be found in the `<xtorch/models/computer_vision/>` directory.
## General Usage
Using a pre-built computer vision model is straightforward. You instantiate the model, typically providing task-specific parameters like the number of output classes, and then it's ready for training or inference.
```cpp
#include <xtorch/xtorch.h>
#include <iostream>

int main() {
    torch::Device device(torch::cuda::is_available() ? torch::kCUDA : torch::kCPU);

    // Example: Instantiate a VGG16 model for a 10-class classification problem
    xt::models::VGGNet model(
        xt::models::VGGNetImpl::VGGType::VGG16,
        /*num_classes=*/10
    );
    model.to(device);
    model.train(); // Set to training mode

    // Create a dummy input batch (Batch=4, Channels=3, Height=224, Width=224)
    auto input_tensor = torch::randn({4, 3, 224, 224}).to(device);

    // Perform a forward pass
    auto output = model.forward(input_tensor);

    std::cout << "VGG16 Model Instantiated." << std::endl;
    std::cout << "Output shape: " << output.sizes() << std::endl; // Should be [4, 10]

    return 0;
}
```

!!! info "Model Variants"
    Many model families, such as ResNet, VGGNet, and EfficientNet, have multiple variants (e.g., ResNet18 vs. ResNet50). These are typically selected via an enum or by passing configuration arguments to the constructor. Refer to the specific model's header file for all available options.
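For instance, a deeper ResNet variant would be selected the same way VGG16 is selected above. The snippet below is a hypothetical sketch that assumes ResNet follows the same enum-plus-constructor pattern as VGGNet; the actual type and value names are defined in `image_classification/resnet.h` and may differ:

```cpp
#include <xtorch/xtorch.h>

int main() {
    // Hypothetical: mirrors the VGGNet pattern above, where an Impl-scoped
    // enum selects the variant. Verify the exact names in resnet.h.
    xt::models::ResNet model(
        xt::models::ResNetImpl::ResNetType::ResNet50, // assumed enum
        /*num_classes=*/1000
    );
    return 0;
}
```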
## Available Models by Task

### Image Classification

These models are designed to take an image as input and output a probability distribution over a set of classes (a minimal post-processing sketch follows the table).
| Model Family | Description | Header File |
|---|---|---|
| `LeNet5` | The classic LeNet-5 architecture, foundational for CNNs. | `image_classification/lenet5.h` |
| `AlexNet` | The breakthrough deep CNN from the 2012 ImageNet competition. | `image_classification/alexnet.h` |
| `VGGNet` | A simple and effective architecture with very small (3x3) convolution filters. | `image_classification/vggnet.h` |
| `ResNet` | Residual Networks, which introduced skip connections to enable much deeper models. | `image_classification/resnet.h` |
| `ResNeXt` | An evolution of ResNet that uses grouped convolutions. | `image_classification/resnext.h` |
| `WideResNet` | A ResNet variant that is wider (more channels) but shallower. | `image_classification/wide_resnet.h` |
| `GoogLeNet` | A deep CNN that introduced the "Inception" module. | `image_classification/google_net.h` |
| `Inception` | Later versions of the Inception architecture (e.g., InceptionV3). | `image_classification/inception.h` |
| `InceptionResNet` | A hybrid architecture combining Inception modules with residual connections. | `image_classification/inception_resnet.h` |
| `DenseNet` | Densely Connected Convolutional Networks, where each layer is connected to every other layer. | `image_classification/dense_net.h` |
| `MobileNet` | A family of efficient models for mobile and embedded vision applications. | `image_classification/mobilenet.h` |
| `EfficientNet` | A family of models that scales depth, width, and resolution in a principled way. | `image_classification/efficient_net.h` |
| `Xception` | An architecture based on depthwise separable convolutions. | `image_classification/xception.h` |
| `SENet` | Squeeze-and-Excitation Networks that adaptively recalibrate channel-wise feature responses. | `image_classification/se_net.h` |
| `CBAM` | Convolutional Block Attention Module. | `image_classification/cbam.h` |
| `NetworkInNetwork` | A model that uses micro neural networks in place of linear filters. | `image_classification/network_in_network.h` |
| `PyramidalNet` | A variant of ResNet that gradually increases feature map dimensions. | `image_classification/pyramidal_net.h` |
| `HighwayNetwork` | A deep network with learnable gating mechanisms. | `image_classification/highway_network.h` |
| `AmoebaNet` | An architecture discovered through evolutionary neural architecture search. | `image_classification/amoeba_net.h` |
| `ZefNet` | A visualization-driven model, an early winner of the ImageNet competition. | `image_classification/zefnet.h` |
### Object Detection

These models identify and locate multiple objects within an image by outputting bounding boxes and class labels (a post-processing sketch follows the table).
| Model Family | Description | Header File |
|---|---|---|
| `RCNN` | Region-based CNN, the original groundbreaking model for this task. | `object_detection/rcnn.h` |
| `FastRCNN` | An improved version of R-CNN that is faster to train and test. | `object_detection/fast_rcnn.h` |
| `FasterRCNN` | Introduces a Region Proposal Network (RPN) for end-to-end training. | `object_detection/faster_rcnn.h` |
| `MaskRCNN` | An extension of Faster R-CNN that also adds a branch for predicting segmentation masks. | `object_detection/mask_rcnn.h` |
| `SSD` | Single Shot MultiBox Detector, a one-stage detector that is very fast. | `object_detection/ssd.h` |
| `RetinaNet` | A one-stage detector that introduced the Focal Loss to address class imbalance. | `object_detection/retina_net.h` |
| `YOLO` | You Only Look Once, a family of extremely fast one-stage detectors. | `object_detection/yolo.h` |
| `YOLOX` | An anchor-free version of YOLO. | `object_detection/yolox.h` |
| `DETR` | Detection Transformer, which frames object detection as a set prediction problem. | `object_detection/detr.h` |
| `EfficientDet` | A family of scalable and efficient object detectors. | `object_detection/efficient_det.h` |
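The exact output format of each detector is model-specific, so consult the corresponding header. A step most of them share downstream is non-maximum suppression (NMS), which prunes overlapping boxes. Below is a minimal NMS sketch in plain libtorch, assuming hypothetical `[N, 4]` boxes in `(x1, y1, x2, y2)` format and `[N]` confidence scores; it is an illustration, not xTorch's own post-processing API:

```cpp
#include <torch/torch.h>
#include <iostream>
#include <vector>

// Minimal NMS sketch: keeps the highest-scoring boxes, discarding any box
// whose IoU with an already-kept box exceeds iou_threshold.
// boxes: [N, 4] in (x1, y1, x2, y2); scores: [N].
torch::Tensor nms(const torch::Tensor& boxes, const torch::Tensor& scores,
                  double iou_threshold) {
    auto order = scores.argsort(/*dim=*/0, /*descending=*/true);
    auto areas = (boxes.select(1, 2) - boxes.select(1, 0)) *
                 (boxes.select(1, 3) - boxes.select(1, 1));
    std::vector<int64_t> keep;
    for (int64_t i = 0; i < order.size(0); ++i) {
        int64_t idx = order[i].item<int64_t>();
        bool suppressed = false;
        for (int64_t k : keep) {
            // Intersection rectangle between candidate idx and kept box k.
            auto x1 = torch::max(boxes[idx][0], boxes[k][0]);
            auto y1 = torch::max(boxes[idx][1], boxes[k][1]);
            auto x2 = torch::min(boxes[idx][2], boxes[k][2]);
            auto y2 = torch::min(boxes[idx][3], boxes[k][3]);
            auto inter = torch::clamp(x2 - x1, 0) * torch::clamp(y2 - y1, 0);
            auto iou = inter / (areas[idx] + areas[k] - inter);
            if (iou.item<double>() > iou_threshold) { suppressed = true; break; }
        }
        if (!suppressed) keep.push_back(idx);
    }
    return torch::tensor(keep, torch::kInt64);
}

int main() {
    auto boxes = torch::tensor({{0.0, 0.0, 10.0, 10.0},
                                {1.0, 1.0, 11.0, 11.0},
                                {50.0, 50.0, 60.0, 60.0}});
    auto scores = torch::tensor({0.9, 0.8, 0.7});
    auto kept = nms(boxes, scores, /*iou_threshold=*/0.5);
    std::cout << "Kept indices: " << kept << std::endl; // expect [0, 2]
    return 0;
}
```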
### Image Segmentation

These models classify each pixel in an image to create a segmentation map (a decoding sketch follows the table).
| Model Family | Description | Header File |
|---|---|---|
| `FCN` | Fully Convolutional Network, a foundational model for semantic segmentation. | `image_segmentation/fcn.h` |
| `UNet` | An architecture with a U-shaped encoder-decoder structure, popular for biomedical imaging. | `image_segmentation/unet.h` |
| `SegNet` | A deep encoder-decoder architecture for semantic pixel-wise segmentation. | `image_segmentation/segnet.h` |
| `DeepLab` | A family of models (e.g., DeepLabV3+) using atrous convolutions for segmentation. | `image_segmentation/deep_lab.h` |
| `HRNet` | High-Resolution Network, which maintains high-resolution representations throughout the network. | `image_segmentation/hrnet.h` |
| `PANet` | Path Aggregation Network, which enhances feature fusion. | `image_segmentation/panet.h` |
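Segmentation models conventionally return per-pixel class logits of shape `[N, num_classes, H, W]` (an assumption here; confirm against each model's header). A minimal sketch that decodes such logits into a per-pixel label map:

```cpp
#include <torch/torch.h>
#include <iostream>

int main() {
    // Stand-in for a segmentation model's output: logits for a batch of 2
    // images, 21 classes, at 128x128 resolution -> [2, 21, 128, 128].
    auto logits = torch::randn({2, 21, 128, 128});

    // The segmentation map takes the argmax over the class dimension,
    // yielding one class index per pixel -> [2, 128, 128].
    auto seg_map = logits.argmax(/*dim=*/1);

    std::cout << "Segmentation map shape: " << seg_map.sizes() << std::endl;
    return 0;
}
```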
### Vision Transformers

These models apply the Transformer architecture, originally designed for NLP, to computer vision tasks (a patchification sketch follows the table).
| Model Family | Description | Header File |
|---|---|---|
| `ViT` | Vision Transformer, the original model that applies a pure Transformer to image patches. | `vision_transformers/vit.h` |
| `DeiT` | Data-efficient Image Transformer, which uses knowledge distillation. | `vision_transformers/deit.h` |
| `SwinTransformer` | A hierarchical Vision Transformer using shifted windows. | `vision_transformers/swin_transformer.h` |
| `PVT` | Pyramid Vision Transformer, which introduces a pyramid structure to ViT. | `vision_transformers/pvt.h` |
| `T2TViT` | Token-to-Token Vision Transformer. | `vision_transformers/t2t_vit.h` |
| `MViT` | Multiscale Vision Transformer. | `vision_transformers/mvit.h` |
| `BEiT` | Bidirectional Encoder representation from Image Transformers (BERT pre-training for vision). | `vision_transformers/beit.h` |
| `CLIPViT` | The Vision Transformer backbone used in the CLIP model. | `vision_transformers/clip_vit.h` |
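All of these models build on the same core idea: the image is cut into fixed-size patches that are flattened into a token sequence for a Transformer encoder. The sketch below illustrates that patchification step with plain libtorch; the 16x16 patch size is an assumption, as each model defines its own configuration:

```cpp
#include <torch/torch.h>
#include <iostream>

int main() {
    // One 224x224 RGB image.
    auto image = torch::randn({1, 3, 224, 224});
    const int64_t patch = 16; // assumed patch size; model-specific in practice

    // unfold extracts non-overlapping 16x16 patches:
    // [1, 3*16*16, num_patches] = [1, 768, 196].
    auto patches = torch::nn::functional::unfold(
        image, torch::nn::functional::UnfoldFuncOptions({patch, patch})
                   .stride({patch, patch}));

    // Transpose to a token sequence: [1, 196, 768], i.e. 196 tokens of
    // dimension 768, which a ViT linearly projects and feeds to the encoder.
    auto tokens = patches.transpose(1, 2);
    std::cout << "Token sequence shape: " << tokens.sizes() << std::endl;
    return 0;
}
```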
