Baseten Blog
Announcing our Series B
We’ve spent the last four and a half years building Baseten to be the most performant, scalable, and reliable way to run your machine learning workloads.
The benefits of globally distributed infrastructure for model serving
Multi-cloud and multi-region infrastructure for model serving provides higher availability, redundancy, lower latency, cost savings, and data residency compliance.
New in February 2024
3x throughput with H100 GPUs, 40% lower SDXL latency with TensorRT, and multimodal open source models.
40% faster Stable Diffusion XL inference with NVIDIA TensorRT
Using NVIDIA TensorRT to optimize each component of the SDXL pipeline, we improved SDXL inference latency by 40% and throughput by 70% on NVIDIA H100 GPUs.
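To make the general workflow concrete, here is a minimal sketch of building a TensorRT FP16 engine from an ONNX export of one SDXL pipeline component (for example, the UNet). This is not the exact optimization recipe from the post: the file names are placeholders, and it assumes the component was already exported with fixed input shapes (a real SDXL UNet with dynamic shapes would also need an optimization profile).

```python
# Minimal sketch: build a TensorRT FP16 engine from an ONNX export of one
# SDXL pipeline component (e.g. the UNet). File names are placeholders.
import tensorrt as trt

ONNX_PATH = "unet.onnx"    # assumed: component already exported via torch.onnx.export
ENGINE_PATH = "unet.plan"

logger = trt.Logger(trt.Logger.INFO)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
)
parser = trt.OnnxParser(network, logger)

# Parse the ONNX graph into a TensorRT network definition.
with open(ONNX_PATH, "rb") as f:
    if not parser.parse(f.read()):
        for i in range(parser.num_errors):
            print(parser.get_error(i))
        raise RuntimeError("Failed to parse ONNX model")

# Enable FP16 kernels, then build and serialize the optimized engine
# to disk so it can be loaded at inference time.
config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.FP16)
engine_bytes = builder.build_serialized_network(network, config)
with open(ENGINE_PATH, "wb") as f:
    f.write(engine_bytes)
```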
Why GPU utilization matters for model inference
Save money on high-traffic model inference workloads by increasing GPU utilization to maximize performance per dollar for LLMs, SDXL, Whisper, and more.
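A quick back-of-the-envelope calculation shows why utilization drives performance per dollar: the GPU costs the same per hour whether it is busy or idle, so cost per request falls as utilization rises. The numbers below are hypothetical and only for illustration.

```python
# Back-of-the-envelope cost math: the GPU costs the same per hour whether it
# is busy or idle, so cost per request falls as utilization rises.
# All numbers below are hypothetical, for illustration only.

GPU_COST_PER_HOUR = 4.00                 # assumed hourly price for one GPU
REQUESTS_PER_HOUR_AT_FULL_LOAD = 10_000  # assumed peak throughput of the model server

def cost_per_request(utilization: float) -> float:
    """Cost per request when the GPU serves `utilization` of its peak throughput."""
    served = REQUESTS_PER_HOUR_AT_FULL_LOAD * utilization
    return GPU_COST_PER_HOUR / served

for u in (0.1, 0.5, 0.9):
    print(f"{u:.0%} utilization -> ${cost_per_request(u):.4f} per request")
# 10% utilization -> $0.0040 per request
# 50% utilization -> $0.0008 per request
# 90% utilization -> $0.0004 per request
```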
The best open source large language model
Explore the best open source large language models of 2024 for any budget, license, and use case.
Unlocking the full power of NVIDIA H100 GPUs for ML inference with TensorRT
Double or triple throughput at the same or better latency by switching from A100s to H100 GPUs for model inference with TensorRT/TensorRT-LLM.
New in January 2024
A library for open source models, general availability for L4 GPUs, and performance benchmarking for ML inference.
Introduction to quantizing ML models
Quantizing ML models like LLMs makes it possible to run large models on less expensive GPUs, but it must be done carefully to avoid degrading output quality.
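As a minimal illustration of the idea, here is a sketch of symmetric int8 quantization of a weight tensor in NumPy. It shows both the memory savings (4x smaller than float32) and the rounding error that careless quantization can amplify; the scheme and numbers are illustrative, not the post's exact method, and real LLM quantization approaches (GPTQ, AWQ, etc.) are more involved.

```python
# Minimal sketch of symmetric int8 quantization of a weight tensor.
# Illustrative only; production LLM quantization schemes are more involved.
import numpy as np

weights = np.random.randn(4096).astype(np.float32)  # pretend layer weights

# One scale for the whole tensor: map the largest magnitude to 127.
scale = np.abs(weights).max() / 127.0
q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)

# Dequantize and measure the rounding error introduced.
dequantized = q.astype(np.float32) * scale
error = np.abs(weights - dequantized).max()

print(f"memory: {weights.nbytes} B fp32 -> {q.nbytes} B int8 (4x smaller)")
print(f"max absolute rounding error: {error:.5f}")
```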