Machine learning infrastructure that just works
Baseten provides all the infrastructure you need to deploy and serve ML models performantly, scalably, and cost-efficiently.
The TensorRT-LLM Engine Builder lets developers deploy highly efficient, performant inference servers for open-source and fine-tuned LLMs.
Baseten is a Series B startup building infrastructure for AI. We're actively hiring across many roles; here are ten reasons to join the Baseten team.
LoRA swapping with TRT-LLM supports in-flight batching and loads LoRA weights in 1-2 milliseconds, so each request in a batch can target a different fine-tune.
A separation of concerns between a control plane and workload planes enables multi-cloud, multi-region model serving and self-hosted inference.
To accurately compare tokens per second between different large language models, we need to adjust for tokenizer efficiency.
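As a minimal illustration of the adjustment described above (function name and all figures are hypothetical examples, not benchmarks): raw tokens per second can be normalized by each model's average tokens-per-word ratio on a shared reference corpus, yielding a tokenizer-independent words-per-second rate.

```python
# Hypothetical sketch: normalizing tokens/sec by tokenizer efficiency.
# The numbers below are made-up examples, not measured results.

def words_per_second(tokens_per_second: float, tokens_per_word: float) -> float:
    """Convert raw tokens/sec into words/sec using the model's
    average tokens-per-word ratio on a shared reference corpus."""
    return tokens_per_second / tokens_per_word

# Model A emits more tokens/sec but uses a less efficient tokenizer,
# so Model B actually produces text faster.
a = words_per_second(tokens_per_second=100.0, tokens_per_word=1.60)  # 62.5 words/sec
b = words_per_second(tokens_per_second=90.0, tokens_per_word=1.30)   # ~69.2 words/sec
print(f"A: {a:.1f} words/sec, B: {b:.1f} words/sec")
```

The same normalization applies to any per-token latency metric: divide by the tokens-per-word ratio before comparing models with different vocabularies.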
In this article, we outline a continuous integration and continuous deployment (CI/CD) pipeline for using AI models in production.
In this tutorial, we'll build a streaming endpoint for the XTTS V2 text-to-speech model with real-time narration and a 200 ms time to first chunk.
Learn how continuous and dynamic batching increase throughput during model inference with minimal impact on latency.
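The dynamic batching idea referenced above can be sketched as follows (a simplified illustration, not Baseten's implementation; the function name and parameters are assumptions): requests are collected until either a maximum batch size is reached or a short timeout expires after the first request arrives, trading a bounded amount of latency for higher throughput.

```python
import queue
import time

def dynamic_batcher(requests: "queue.Queue", max_batch: int = 8,
                    max_wait_s: float = 0.01):
    """Yield batches of requests for a model server (illustrative sketch).

    Blocks until one request arrives, then waits up to max_wait_s for
    more, capping each batch at max_batch items.
    """
    while True:
        batch = [requests.get()]  # block until the first request arrives
        deadline = time.monotonic() + max_wait_s
        while len(batch) < max_batch:
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                break  # timeout expired: ship what we have
            try:
                batch.append(requests.get(timeout=remaining))
            except queue.Empty:
                break  # no more requests arrived in time
        yield batch
```

Continuous (in-flight) batching goes further by admitting new requests into a running LLM batch at each generation step rather than only at batch boundaries; the timeout-based sketch above is the simpler dynamic variant.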