High-Performance C++ AI, Simplified

Deploying Transformers: Handling Dynamic Sequence Lengths in TensorRT

Transformer models have revolutionized NLP, but their dynamic nature presents a challenge for inference optimization. A BERT model might need to process a 10-word sentence one moment and a 300-word paragraph the next. Building a separate TensorRT engine for every possible length is not feasible.

The Power of Optimization Profiles

This is where TensorRT's optimization profiles come in. A profile defines a range of valid input dimensions for your engine. You specify three shapes for each dynamic axis:

  • Minimum Shape: The smallest possible input (e.g., batch size 1, sequence length 8).
  • Optimal Shape: The most common input shape you expect. TensorRT will tune its kernels to be fastest for this size.
  • Maximum Shape: The largest possible input (e.g., batch size 16, sequence length 512).

Configuring Profiles in XTorch

XTorch exposes this functionality through a simple command-line interface. You can define your profiles directly when you convert your model.

xtorch convert --model bert.pth --precision fp16 \
--dynamic-shapes input_ids:[1,8],[1,128],[16,512]

In this example, we've told XTorch that the input tensor input_ids is dynamic along both axes. The three bracketed shapes are, in order, the minimum (batch 1, sequence length 8), the optimal (batch 1, sequence length 128), and the maximum (batch 16, sequence length 512). The resulting engine accepts anything in that range, is tuned for the common case, and is ready for deployment on Ignition-Hub.