Lead Developer Advocate

Philip Kiely

Model performance

33% faster LLM inference with FP8 quantization

Quantizing open-source LLMs to FP8 resulted in a near-zero increase in perplexity and yielded material improvements in latency, throughput, and cost.

Model performance

High performance ML inference with NVIDIA TensorRT

Use TensorRT to achieve 40% lower latency for SDXL and sub-200ms time to first token for Mixtral 8x7B on A100 and H100 GPUs.

Glossary

FP8: Efficient model inference with 8-bit floating point numbers

The FP8 data format has a wider dynamic range than INT8, which allows quantizing the weights and activations of more LLMs without loss of output quality.

Glossary

The benefits of globally distributed infrastructure for model serving

Multi-cloud and multi-region infrastructure for model serving provides availability, redundancy, lower latency, cost savings, and data residency compliance.

Model performance

40% faster Stable Diffusion XL inference with NVIDIA TensorRT

Using NVIDIA TensorRT to optimize each component of the SDXL pipeline, we improved SDXL inference latency by 40% and throughput by 70% on NVIDIA H100 GPUs.

Glossary

Why GPU utilization matters for model inference

Save money on high-traffic model inference workloads by increasing GPU utilization to maximize performance per dollar for LLMs, SDXL, Whisper, and more.

ML models

The best open source large language model

Explore the best open source large language models for 2025 for any budget, license, and use case.

Model performance

Unlocking the full power of NVIDIA H100 GPUs for ML inference with TensorRT

Double or triple throughput at the same or better latency by switching from A100 to H100 GPUs for model inference with TensorRT/TensorRT-LLM.

Machine learning infrastructure that just works

Baseten provides all the infrastructure you need to deploy and serve ML models performantly, scalably, and cost-efficiently.