
Benchmarking XInfer: Squeezing Every Drop of Performance from Your TensorRT Models
In the world of production AI, every millisecond counts. A model that runs flawlessly in a Jupyter notebook can fail spectacularly in the real world if it can't meet the latency demands of its application. This is especially true for TensorRT-optimized models, which are specifically designed for high-throughput, low-latency inference.
But an optimized model is only one half of the equation. The other half is the client that calls it. How efficiently does it handle network requests? Can it manage persistent connections? Does it support modern protocols?
The Contenders: XInfer vs. The World
We decided to put this to the test. We benchmarked three common methods for calling a ResNet-50 model deployed on our Ignition-Hub platform:
- Standard Python `requests` library: The ubiquitous, simple-to-use HTTP client.
- Python `aiohttp` library: A popular choice for asynchronous requests.
- Our own `XInfer` SDK: Both in standard and gRPC streaming mode.
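To make the comparison concrete, here is a minimal sketch of the two baseline clients. The endpoint URL, payload shape, and request format are placeholder assumptions for illustration, not the exact setup used in our benchmark; the `XInfer` SDK calls are specific to our platform and are covered in the replication guide below.

```python
import asyncio
import aiohttp
import requests

# Placeholder endpoint and payload; substitute your own Ignition-Hub model URL
# and whatever input format your deployment expects.
ENDPOINT = "https://example.ignition-hub.ai/v1/models/resnet50:predict"
PAYLOAD = {"inputs": [[0.0] * (224 * 224 * 3)]}  # dummy tensor data

def infer_with_requests() -> dict:
    """Baseline 1: one blocking HTTP request per call (new connection each time)."""
    response = requests.post(ENDPOINT, json=PAYLOAD, timeout=10)
    response.raise_for_status()
    return response.json()

async def infer_with_aiohttp(n_requests: int = 32) -> list:
    """Baseline 2: asynchronous requests sharing a single connection pool."""
    async with aiohttp.ClientSession() as session:

        async def one_call():
            async with session.post(ENDPOINT, json=PAYLOAD) as resp:
                resp.raise_for_status()
                return await resp.json()

        return await asyncio.gather(*(one_call() for _ in range(n_requests)))

if __name__ == "__main__":
    print(infer_with_requests())
    print(len(asyncio.run(infer_with_aiohttp())))
```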
The Results: Throughput, Latency, and Cold Starts
The results were illuminating. For single, sequential requests, standard clients performed adequately. But when it came to concurrent requests and high-throughput scenarios, the difference was stark.
The XInfer gRPC client, by leveraging a single persistent connection, reduced network overhead by over 70% compared to making individual HTTP requests. This resulted in a median end-to-end latency that was nearly indistinguishable from the model's pure inference time. We'll dive deep into the numbers and the methodology, and show you exactly how to replicate these benchmarks for your own models.
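If you want to try the pattern before the full methodology walkthrough, a simple latency harness looks roughly like the sketch below. The `benchmark` helper, its parameters, and the percentile bookkeeping are illustrative assumptions rather than our exact benchmarking code: it fires a fixed number of calls at a configurable concurrency level, then reports median latency, p99 latency, and throughput. The `send_request` callable is a placeholder for whichever client you are testing (`requests`, `aiohttp`, or the `XInfer` SDK).

```python
import statistics
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List

def benchmark(send_request: Callable[[], object],
              total_requests: int = 500,
              concurrency: int = 16) -> dict:
    """Fire `total_requests` calls with `concurrency` workers and collect latencies."""

    def timed_call(_: int) -> float:
        # Wall-clock time for a single end-to-end request.
        start = time.perf_counter()
        send_request()
        return time.perf_counter() - start

    wall_start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        latencies: List[float] = list(pool.map(timed_call, range(total_requests)))
    wall_time = time.perf_counter() - wall_start

    latencies.sort()
    return {
        "median_ms": statistics.median(latencies) * 1000,
        "p99_ms": latencies[int(0.99 * len(latencies)) - 1] * 1000,
        "throughput_rps": total_requests / wall_time,
    }

# Example usage with the blocking client from the earlier sketch:
# print(benchmark(infer_with_requests, total_requests=500, concurrency=16))
```

Running the same harness against each client, at several concurrency levels, is enough to reproduce the sequential-versus-concurrent contrast described above.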




