Software Engineer

Justin Yi

Model performance

How we built production-ready speculative decoding with TensorRT-LLM

Our TensorRT-LLM Engine Builder now supports speculative decoding, which can improve LLM inference speeds.

Pankaj Gupta

2 others

Glossary

A quick introduction to speculative decoding

Speculative decoding improves LLM inference latency by using a smaller model to generate draft tokens that the larger target model can accept during inference.

Pankaj Gupta

2 others

A ghostly, glowing llama walking ahead of a real llama

Product

Introducing our Speculative Decoding Engine Builder integration for ultra-low-latency LLM inference

Our new Speculative Decoding integration can cut latency in half for production LLM workloads.

Justin Yi

3 others

Model performance

Benchmarking fast Mistral 7B inference

Running Mistral 7B in FP8 on H100 GPUs with TensorRT-LLM, we achieve best-in-class time to first token and tokens per second on independent benchmarks.

Abu Qader

3 others

Prompt: a model bullet train in a snowy village.

Model performance

High performance ML inference with NVIDIA TensorRT

Use TensorRT to achieve 40% lower latency for SDXL and sub-200ms time to first token for Mixtral 8x7B on A100 and H100 GPUs.

Justin Yi

1 other

Prompt: A friendly robot horse playing in a sunlit meadow

Model performance

40% faster Stable Diffusion XL inference with NVIDIA TensorRT

Using NVIDIA TensorRT to optimize each component of the SDXL pipeline, we improved SDXL inference latency by 40% and throughput by 70% on NVIDIA H100 GPUs.

Pankaj Gupta

2 others

Prompt: A movie still of an astronaut coming through a technicolor wormhole

ML models

Build with OpenAI’s Whisper model in five minutes

Deploy OpenAI Whisper for free on Baseten instantly from our model library. Or stick around to learn how to package and deploy Whisper with Truss.

Justin Yi

Prompt: A steampunk gramophone by a window

Machine learning infrastructure that just works

Baseten provides all the infrastructure you need to deploy and serve ML models performantly, scalably, and cost-efficiently.