Model performance

How multi-node inference works for massive LLMs like DeepSeek-R1

Running DeepSeek-R1 on H100 GPUs requires multi-node inference to connect the 16 H100s needed to hold the model weights.
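
As a rough sanity check on that GPU count, here's the memory math, assuming the 671B-parameter model is served in FP8 (one byte per parameter) on 80 GB H100s with eight GPUs per node; the exact figures are illustrative, not from the post:

```python
# Back-of-the-envelope memory math for DeepSeek-R1 on H100s.
# Assumed: 671B parameters in FP8 (1 byte/param), 80 GB per H100, 8 GPUs per node.
PARAMS = 671e9
BYTES_PER_PARAM = 1            # FP8 weights
GPU_MEM_GB = 80
GPUS_PER_NODE = 8

weights_gb = PARAMS * BYTES_PER_PARAM / 1e9      # ~671 GB of weights alone
node_mem_gb = GPU_MEM_GB * GPUS_PER_NODE         # 640 GB: a single node can't hold them
two_node_gb = 2 * node_mem_gb                    # 1280 GB across 16 H100s

print(f"weights: {weights_gb:.0f} GB")
print(f"one 8xH100 node: {node_mem_gb} GB -> too small")
print(f"two nodes (16 H100s): {two_node_gb} GB -> room for weights plus KV cache")
```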

Driving model performance optimization: 2024 highlights

Baseten's model performance team works to optimize customer models for latency, throughput, quality, cost, features, and developer efficiency.

How we built production-ready speculative decoding with TensorRT-LLM

Our TensorRT-LLM Engine Builder now supports speculative decoding, which can improve LLM inference speeds.

Generally Available: The fastest, most accurate and cost-efficient Whisper transcription

At Baseten, we've built the most performant (1000x real-time factor), accurate, and cost-efficient speech-to-text pipeline for production AI audio transcription.

How to build function calling and JSON mode for open-source and fine-tuned LLMs

Use a state machine to generate token masks for logit biasing to enable function calling and structured output at the model server level.
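
As a minimal sketch of that idea (the toy vocabulary, grammar, and function names below are illustrative, not the production implementation): the state machine reports which tokens are valid next, and every other token's logit is pushed to negative infinity before sampling:

```python
import math

# Toy structured-output example: force the model to emit a JSON boolean.
# A real system compiles a JSON schema or function signature into a state
# machine over the tokenizer's vocabulary; this sketch hard-codes a tiny one.
VOCAB = ["true", "false", "null", "hello", "{", "}"]

class BooleanStateMachine:
    """Allows only 'true' or 'false' as the next token, then stops."""
    def __init__(self):
        self.done = False

    def allowed_tokens(self):
        return set() if self.done else {"true", "false"}

    def advance(self, token):
        self.done = True

def mask_logits(logits, allowed):
    # Logit biasing: disallowed tokens get -inf so softmax assigns them zero probability.
    return [l if VOCAB[i] in allowed else -math.inf for i, l in enumerate(logits)]

# Example: the raw logits favor "hello", but the mask forces a valid boolean.
raw_logits = [0.1, 0.2, 1.5, 3.0, 0.0, 0.0]
fsm = BooleanStateMachine()
masked = mask_logits(raw_logits, fsm.allowed_tokens())
best = VOCAB[max(range(len(VOCAB)), key=lambda i: masked[i])]
print(best)  # "false" -- the highest-scoring token the state machine permits
```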

How to double tokens per second for Llama 3 with Medusa

We observe up to a 122% increase in tokens per second for Llama 3 after training custom Medusa heads and running the updated model with TensorRT-LLM.
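
For intuition, here is a heavily simplified greedy accept/verify loop in the spirit of Medusa-style speculative decoding; the stand-in functions below are placeholders for real model calls, and the production engine verifies the whole draft in a single batched forward pass rather than token by token:

```python
# Simplified sketch of Medusa-style speculative decoding with greedy verification.
# The extra "Medusa heads" cheaply guess the next few tokens; the base model then
# checks the guesses, keeping the longest matching prefix.

def medusa_draft(context):
    """Stand-in for the Medusa heads: guess the next 3 tokens in one cheap step."""
    return ["the", "quick", "fox"]

def base_model_next_token(context):
    """Stand-in for the base model's greedy next-token prediction."""
    ground_truth = {"...": "the", "... the": "quick", "... the quick": "brown"}
    return ground_truth.get(context, "<eos>")

def speculative_step(context):
    draft = medusa_draft(context)
    accepted = []
    for token in draft:
        if base_model_next_token(context) == token:
            accepted.append(token)           # guess verified: keep it for free
            context = f"{context} {token}"
        else:
            break                            # first mismatch: fall back to the base model
    return accepted

print(speculative_step("..."))  # ['the', 'quick'] -- two draft tokens accepted
```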

How to serve 10,000 fine-tuned LLMs from a single GPU

LoRA swapping with TRT-LLM supports in-flight batching and loads LoRA weights in 1-2 ms, enabling each request to hit a different fine-tune.
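
A small NumPy sketch of the math that makes this possible (illustrative only, not the TRT-LLM API): every request shares the frozen base weight, and only the tiny low-rank matrices differ per fine-tune, so they are cheap to load and swap per request:

```python
import numpy as np

# Sketch of per-request LoRA application. Every request shares the frozen base
# weight W; only the small low-rank matrices (A, B) differ per fine-tune.
d_model, rank = 8, 2
rng = np.random.default_rng(0)
W = rng.normal(size=(d_model, d_model))           # shared base weight

loras = {
    "customer-a": (rng.normal(size=(rank, d_model)),    # A
                   rng.normal(size=(d_model, rank))),    # B
    "customer-b": (rng.normal(size=(rank, d_model)),
                   rng.normal(size=(d_model, rank))),
}

def forward(x, lora_id):
    A, B = loras[lora_id]                          # per-request adapter lookup
    return x @ W.T + x @ A.T @ B.T                 # base path + low-rank correction

# Two requests in the same batch, each hitting a different fine-tune.
batch = rng.normal(size=(2, d_model))
out_a = forward(batch[0], "customer-a")
out_b = forward(batch[1], "customer-b")
print(out_a.shape, out_b.shape)                    # (8,) (8,)
```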

Benchmarking fast Mistral 7B inference

Running Mistral 7B in FP8 on H100 GPUs with TensorRT-LLM, we achieve best-in-class time to first token and tokens per second on independent benchmarks.

33% faster LLM inference with FP8 quantization

Quantizing open-source LLMs to FP8 resulted in a near-zero increase in perplexity while yielding material performance improvements across latency, throughput, and cost.
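
As a rough illustration of what FP8 weight quantization does (simulated here in NumPy with reduced mantissa precision; real deployments use hardware FP8 kernels, e.g. via TensorRT-LLM):

```python
import numpy as np

# Rough simulation of per-tensor FP8 (e4m3) weight quantization. Illustrative
# only: it mimics the reduced precision to show why the quality impact is small
# while memory and bandwidth are roughly halved versus FP16.
E4M3_MAX = 448.0

def simulate_e4m3(x, mantissa_bits=3):
    # Round to ~4 significant binary digits, a stand-in for e4m3 precision.
    m, e = np.frexp(x)
    m = np.round(m * 2 ** (mantissa_bits + 1)) / 2 ** (mantissa_bits + 1)
    return np.ldexp(m, e)

def quantize_dequantize_fp8(weights):
    scale = np.abs(weights).max() / E4M3_MAX       # per-tensor scale factor
    q = simulate_e4m3(np.clip(weights / scale, -E4M3_MAX, E4M3_MAX))
    return q * scale                               # dequantized weights

rng = np.random.default_rng(0)
w = rng.normal(scale=0.02, size=(4096, 4096)).astype(np.float32)
w_fp8 = quantize_dequantize_fp8(w)
rel_err = np.abs(w - w_fp8).mean() / np.abs(w).mean()
print(f"mean relative weight error: {rel_err:.3%}")   # a few percent per weight
```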