Training gets the conference talks; inference gets the invoice. Once a model serves production traffic, every wasted millisecond multiplies by every request, forever—and most inference fleets waste a lot of them. The good news is that inference performance is an engineering problem with a known toolbox: CUDA-level optimization inside the model, and serving-level discipline around it, where NVIDIA's Triton Inference Server has become the de facto standard.
Key Takeaways
- Profile before optimizing: most “slow model” problems are batching, data movement, or precision problems wearing a disguise.
- Engine-level wins—TensorRT compilation, quantization, kernel fusion—routinely deliver multiples, not percentages.
- Triton's dynamic batching, concurrent model execution, and ensembles turn single-model speed into fleet throughput.
- Define latency SLOs first; every optimization below is a trade against them.
01 Where the milliseconds actually go
Profile a typical unoptimized pipeline and the GPU kernel time is rarely the villain. Latency hides in framework overhead per request, host-to-device copies that serialize with compute, FP32 math where FP8/INT8 would do, and GPUs idling between under-batched requests. The implication: optimization is a pipeline activity, not a model activity. Measure end to end—tokenization to response—then attack the largest bucket.
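One way to find the largest bucket is to wrap each pipeline stage in a timer and compare shares. The sketch below is a minimal illustration using only the standard library; `StageTimer` and the three stage functions are hypothetical stand-ins, and the sleeps simulate work rather than reflect any measured system.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class StageTimer:
    """Accumulates wall-clock seconds per named pipeline stage."""

    def __init__(self):
        self.totals = defaultdict(float)

    @contextmanager
    def stage(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            self.totals[name] += time.perf_counter() - start

    def shares(self):
        """Fraction of total measured time spent in each stage."""
        total = sum(self.totals.values())
        return {name: t / total for name, t in self.totals.items()}

    def largest_bucket(self):
        """Name of the stage with the most accumulated time."""
        return max(self.totals, key=self.totals.get)


# Hypothetical stages; time.sleep stands in for real work.
def tokenize(text):
    time.sleep(0.02)
    return text.split()


def infer(tokens):
    # With a real GPU call, synchronize before the timer stops:
    # kernel launches are asynchronous, so an unsynchronized timer
    # would measure launch time only.
    time.sleep(0.005)
    return len(tokens)


def postprocess(n):
    time.sleep(0.001)
    return {"token_count": n}


timer = StageTimer()
for _ in range(5):
    with timer.stage("tokenize"):
        tokens = tokenize("where do the milliseconds go")
    with timer.stage("infer"):
        out = infer(tokens)
    with timer.stage("postprocess"):
        result = postprocess(out)

print(timer.largest_bucket(), timer.shares())
```

In this simulated run the tokenizer dominates, which is the point: the stage to attack first is whichever one the measurement names, not the one assumed in advance.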
02 The CUDA-level toolbox
- Engine compilation (TensorRT): graph-level fusion, memory planning, and hardware-specific kernel selection—the single highest-leverage step for most models.
- Precision reduction: FP8 on Hopper-class silicon, INT8 with calibration where accuracy budgets allow; validate against your eval set, not vibes.
- CUDA Graphs: capture repetitive launch sequences to amortize CPU launch overhead—decisive for small-batch, latency-critical serving.
- Streams and async transfer: overlap copies with compute so PCIe time disappears behind kernel time instead of adding to it.
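The effect of overlapping copies with compute can be estimated with simple pipeline arithmetic. The sketch below assumes an idealized two-stage pipeline (one host-to-device copy per chunk, one kernel per chunk, a single copy engine, no device-to-host transfer); the function names and the timings in the example are illustrative assumptions, not measurements.

```python
def serial_time_ms(n_chunks, copy_ms, compute_ms):
    """Each chunk is copied, then computed, with no overlap."""
    return n_chunks * (copy_ms + compute_ms)


def overlapped_time_ms(n_chunks, copy_ms, compute_ms):
    """Copy of chunk i+1 runs while chunk i computes.

    The first copy and the last compute cannot be hidden; in between,
    the slower of the two stages sets the pace.
    """
    if n_chunks == 0:
        return 0.0
    return copy_ms + compute_ms + (n_chunks - 1) * max(copy_ms, compute_ms)


# Illustrative numbers: 8 chunks, 2 ms copy, 5 ms compute per chunk.
serial = serial_time_ms(8, 2.0, 5.0)          # 8 * (2 + 5) = 56 ms
overlapped = overlapped_time_ms(8, 2.0, 5.0)  # 2 + 5 + 7 * 5 = 42 ms
print(serial, overlapped)
```

Under these assumptions only the first 2 ms copy stays visible; the other seven hide behind kernel time. If copies were slower than compute, the same formula shows the pipeline becoming transfer-bound instead.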
03 The Triton-level discipline
Triton's job is keeping expensive silicon saturated without violating your SLOs. Dynamic batching coalesces individual requests into larger batches, waiting no longer than a queuing delay you set; it is the single best throughput lever in most fleets. Concurrent model instances let one instance's copies and CPU work overlap with another's compute on the same GPU. Ensembles move pre- and post-processing server-side, eliminating client round-trips between steps. And the Prometheus metrics endpoint exposes queue times, batch sizes, and per-model latency: the dashboard your capacity planning should already be reading.
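To see how a queuing budget trades latency for batch size, the toy simulation below groups request arrival times into batches. It is a deliberately simplified model, not Triton's scheduler: it assumes a model instance is always free and dispatches a batch either when it is full or when its oldest request has waited the full delay. The parameter names echo Triton's `max_batch_size` and `max_queue_delay_microseconds` settings, but `form_batches` itself is hypothetical.

```python
def form_batches(arrivals_us, max_batch_size, max_queue_delay_us):
    """Group arrival timestamps (microseconds) into batches.

    Returns a list of (dispatch_time_us, [arrival_us, ...]) tuples.
    """
    batches = []
    current = []
    for t in sorted(arrivals_us):
        # The oldest queued request hit its deadline before this arrival:
        # dispatch the partial batch at that deadline.
        if current and t > current[0] + max_queue_delay_us:
            batches.append((current[0] + max_queue_delay_us, current))
            current = []
        current.append(t)
        # A full batch dispatches immediately.
        if len(current) == max_batch_size:
            batches.append((t, current))
            current = []
    if current:
        batches.append((current[0] + max_queue_delay_us, current))
    return batches


# Illustrative arrival pattern (microseconds), not real traffic.
arrivals = [0, 100, 150, 900, 950, 2000]

with_delay = form_batches(arrivals, max_batch_size=4, max_queue_delay_us=500)
no_delay = form_batches(arrivals, max_batch_size=4, max_queue_delay_us=0)

print([len(reqs) for _, reqs in with_delay])  # batch sizes with a 500 us budget
print([len(reqs) for _, reqs in no_delay])    # batch sizes with no budget
```

In this model a 500 µs budget turns six single-request executions into three batches, and no request waits longer than the budget in the queue. That bound is what makes the delay a setting you can sweep against a p99 target.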

04 A tuning sequence that works
- Set explicit SLOs (p99 latency, throughput floor) per model.
- Baseline end-to-end with production-shaped traffic.
- Compile to an optimized engine; re-validate accuracy.
- Enable dynamic batching; sweep the queuing delay against p99.
- Add model instances until throughput plateaus or p99 latency breaches the SLO.
- Watch the new bottleneck appear—it is usually preprocessing or the network—and iterate.
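The sweep step in the sequence above comes down to a small selection rule: among the queuing delays whose measured p99 meets the SLO, keep the one with the highest throughput. The sketch below assumes you have already collected throughput and latency samples per delay setting; `p99`, `pick_queue_delay`, and every number in the example are illustrative placeholders, not benchmark results.

```python
def p99(latencies_ms):
    """Nearest-rank 99th percentile of a list of latency samples."""
    if not latencies_ms:
        raise ValueError("need at least one latency sample")
    ordered = sorted(latencies_ms)
    rank = -(-99 * len(ordered) // 100)  # ceil(0.99 * n) in integer math
    return ordered[rank - 1]


def pick_queue_delay(sweep, p99_slo_ms):
    """Choose the delay with the best throughput that still meets the SLO.

    sweep maps queue delay (us) -> (throughput_rps, latency samples in ms).
    Returns None when no setting meets the SLO.
    """
    passing = {
        delay: throughput
        for delay, (throughput, samples) in sweep.items()
        if p99(samples) <= p99_slo_ms
    }
    if not passing:
        return None
    return max(passing, key=passing.get)


# Placeholder sweep results, invented purely to exercise the logic.
sweep = {
    0: (400.0, [10.0] * 99 + [14.0]),
    500: (700.0, [11.0] * 99 + [18.0]),
    2000: (900.0, [13.0] * 98 + [30.0, 32.0]),
}

best = pick_queue_delay(sweep, p99_slo_ms=20.0)
print(best)
```

With these placeholder numbers the 2000 µs setting has the highest throughput but misses the 20 ms p99 target, so the rule selects 500 µs. The SLO decides the answer, which is why it is set first.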
Teams that run this loop routinely cut cost-per-inference by half or better on identical hardware. In a serving fleet, that is not a tuning exercise; it is capacity you already paid for, waiting to be claimed.

