High-Performance C++ AI, Simplified

Benchmarking XInfer: Squeezing Every Drop of Performance from Your TensorRT Models

In the world of production AI, every millisecond counts. A model that runs flawlessly in a Jupyter notebook can fail spectacularly in the real world if it can't meet the latency demands of its application. This is especially true for TensorRT-optimized models, which are specifically designed for high-throughput, low-latency inference.

But an optimized model is only half of the equation. The other half is the client that calls it. How efficiently does that client handle network requests? Can it reuse persistent connections? Does it support modern protocols like gRPC?

The Contenders: XInfer vs. The World

We decided to put this to the test. We benchmarked three common methods of calling a ResNet-50 model deployed on our Ignition-Hub platform (minimal sketches of the two Python baseline clients follow the list):

  • Standard Python requests library: The ubiquitous, simple-to-use HTTP client.
  • Python aiohttp library: A popular choice for asynchronous, concurrent requests.
  • Our own XInfer SDK: In both its standard and gRPC streaming modes.
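
To make the comparison concrete, here is a minimal sketch of the two Python baseline clients. The endpoint URL and payload shape are placeholders rather than real Ignition-Hub values; substitute the model URL and input format of your own deployment.

```python
import asyncio

import aiohttp
import requests

# Placeholder endpoint and input; substitute your own Ignition-Hub model URL
# and the payload format your deployment expects.
ENDPOINT = "https://api.ignition-hub.example/v1/models/resnet50:predict"
PAYLOAD = {"inputs": [[0.0] * (224 * 224 * 3)]}


def predict_requests(payload: dict) -> dict:
    """One synchronous call; requests opens a fresh connection every time."""
    resp = requests.post(ENDPOINT, json=payload, timeout=10)
    resp.raise_for_status()
    return resp.json()


async def predict_aiohttp(session: aiohttp.ClientSession, payload: dict) -> dict:
    """One asynchronous call; the shared session pools connections."""
    async with session.post(ENDPOINT, json=payload) as resp:
        resp.raise_for_status()
        return await resp.json()


async def run_concurrent(payloads: list[dict]) -> list[dict]:
    """Fire all requests concurrently over a single aiohttp session."""
    async with aiohttp.ClientSession() as session:
        return await asyncio.gather(*(predict_aiohttp(session, p) for p in payloads))


# Example: asyncio.run(run_concurrent([PAYLOAD] * 32))
```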

The Results: Throughput, Latency, and Cold Starts

The results were illuminating. For sequential, one-at-a-time requests, the standard Python clients performed adequately. But under concurrent, high-throughput load, the difference was stark.
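
A simple harness along the following lines is enough to reproduce the sequential case. It is a sketch rather than our exact benchmark code, and it assumes one of the client functions shown earlier: warm up the connection first, time each request, then report latency percentiles and effective throughput.

```python
import statistics
import time


def benchmark(call, payload, n_requests: int = 200, warmup: int = 10) -> dict:
    """Time sequential calls; report latency percentiles (ms) and throughput."""
    for _ in range(warmup):  # warm up connections, caches, and the server
        call(payload)
    latencies_ms = []
    for _ in range(n_requests):
        start = time.perf_counter()
        call(payload)
        latencies_ms.append((time.perf_counter() - start) * 1000.0)
    latencies_ms.sort()
    return {
        "p50_ms": statistics.median(latencies_ms),
        "p95_ms": latencies_ms[int(0.95 * n_requests) - 1],
        "p99_ms": latencies_ms[int(0.99 * n_requests) - 1],
        "throughput_rps": n_requests / (sum(latencies_ms) / 1000.0),
    }


# Example: benchmark(predict_requests, PAYLOAD)
```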

The XInfer gRPC client, by holding a single persistent connection open, reduced network overhead by over 70% compared to making individual HTTP requests. This resulted in a median end-to-end latency that was nearly indistinguishable from the model's pure inference time. We'll dive deep into the numbers and methodology, and show you exactly how to replicate these benchmarks for your own models.
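
We won't reproduce the XInfer gRPC client's API in this post, but the effect of a persistent connection is easy to demonstrate with a plain HTTP session: reusing one connection removes the per-request TCP/TLS setup that inflates small-payload latency. The sketch below contrasts per-call connections with a reused requests.Session against the same placeholder endpoint as before.

```python
import time

import requests

ENDPOINT = "https://api.ignition-hub.example/v1/models/resnet50:predict"  # placeholder
PAYLOAD = {"inputs": [[0.0] * (224 * 224 * 3)]}                           # placeholder


def mean_latency_ms(fn, n: int = 100) -> float:
    """Mean wall-clock time per call, in milliseconds."""
    start = time.perf_counter()
    for _ in range(n):
        fn()
    return (time.perf_counter() - start) / n * 1000.0


# A new connection (TCP + TLS handshake) on every call.
per_request = mean_latency_ms(lambda: requests.post(ENDPOINT, json=PAYLOAD, timeout=10))

# One persistent connection, reused across calls.
with requests.Session() as session:
    persistent = mean_latency_ms(lambda: session.post(ENDPOINT, json=PAYLOAD, timeout=10))

print(f"per-request connection: {per_request:.1f} ms/call")
print(f"persistent connection:  {persistent:.1f} ms/call")
```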