Model performance | Page 2

High performance ML inference with NVIDIA TensorRT

Use TensorRT to achieve 40% lower latency for SDXL and sub-200ms time to first token for Mixtral 8x7B on A100 and H100 GPUs.

Justin Yi and 1 other

Prompt: A friendly robot horse playing in a sunlit meadow

40% faster Stable Diffusion XL inference with NVIDIA TensorRT

By optimizing each component of the SDXL pipeline with NVIDIA TensorRT, we reduced SDXL inference latency by 40% and increased throughput by 70% on NVIDIA H100 GPUs.

Pankaj Gupta and 2 others

Prompt: A movie still of an astronaut coming through a technicolor wormhole

Unlocking the full power of NVIDIA H100 GPUs for ML inference with TensorRT

Double or triple throughput at the same or better latency by switching from A100 to H100 GPUs for model inference with TensorRT and TensorRT-LLM.

Pankaj Gupta and 1 other

Prompt: a retro rocket ship taking off on the beach at sunrise. Model: Playground 2

Faster Mixtral inference with TensorRT-LLM and quantization

Mixtral 8x7B's architecture already gives it faster inference than the similarly powerful Llama 2 70B, but we can make it even faster using TensorRT-LLM and int8 quantization.