Machine learning infrastructure that just works
Baseten provides all the infrastructure you need to deploy and serve ML models performantly, scalably, and cost-efficiently.
Speculative decoding reduces LLM inference latency by using a smaller draft model to generate tokens that the larger target model can verify and accept in a single forward pass.
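A minimal greedy sketch of that draft-and-verify loop, assuming Hugging Face `transformers` with GPT-2 checkpoints standing in for the draft and target models (production implementations use rejection sampling to preserve the target model's output distribution):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
draft = AutoModelForCausalLM.from_pretrained("gpt2")        # small draft model
target = AutoModelForCausalLM.from_pretrained("gpt2-large")  # larger target model

@torch.no_grad()
def speculative_step(ids: torch.Tensor, k: int = 4) -> torch.Tensor:
    """One draft-and-verify step of greedy speculative decoding."""
    # 1. The draft model proposes k tokens autoregressively (cheap).
    draft_ids = ids
    for _ in range(k):
        next_logits = draft(draft_ids).logits[:, -1, :]
        draft_ids = torch.cat([draft_ids, next_logits.argmax(-1, keepdim=True)], dim=-1)
    proposed = draft_ids[:, ids.shape[1]:]                   # the k draft tokens

    # 2. The target model scores the prompt plus all k proposals in ONE forward pass.
    logits = target(draft_ids).logits
    # Greedy target prediction at each draft position, plus one extra token.
    verify = logits[:, ids.shape[1] - 1 :, :].argmax(-1)     # shape [1, k + 1]

    # 3. Accept the longest prefix where draft and target agree, then append
    #    the target's own token (a correction, or a free bonus token).
    matches = (verify[:, :-1] == proposed)[0].long()
    n_accept = int(matches.cumprod(0).sum())
    return torch.cat([ids, verify[:, : n_accept + 1]], dim=-1)

ids = tokenizer("Speculative decoding works by", return_tensors="pt").input_ids
for _ in range(8):
    ids = speculative_step(ids)
print(tokenizer.decode(ids[0]))
```

Each step costs one target forward pass but can emit up to k + 1 tokens, which is where the latency win comes from when the draft model's guesses are frequently accepted.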
Are NVIDIA H200 GPUs cost-effective for model inference? We tested an 8xH200 cluster provided by Lambda to identify the inference workloads it suits best.
Export model inference metrics, such as response time and hardware utilization, to observability platforms like Grafana, New Relic, Datadog, and Prometheus.
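To illustrate the kind of data being exported, here is a generic sketch of a model server exposing latency and utilization metrics in Prometheus format using the `prometheus_client` library; the metric names and the `predict()` stub are illustrative, not Baseten's built-in exporter:

```python
import random
import time

from prometheus_client import Gauge, Histogram, start_http_server

RESPONSE_TIME = Histogram("inference_response_seconds", "End-to-end inference latency")
GPU_UTILIZATION = Gauge("gpu_utilization_percent", "GPU utilization sampled per request")

def predict(prompt: str) -> str:
    time.sleep(random.uniform(0.05, 0.2))        # stand-in for real model inference
    return prompt.upper()

@RESPONSE_TIME.time()                            # records latency into the histogram
def handle_request(prompt: str) -> str:
    result = predict(prompt)
    GPU_UTILIZATION.set(random.uniform(40, 95))  # in practice, read via NVML
    return result

if __name__ == "__main__":
    start_http_server(9100)   # Prometheus scrapes http://localhost:9100/metrics
    while True:
        handle_request("hello")
```

Grafana, New Relic, and Datadog can all ingest metrics exposed in this format, which is why a single Prometheus-compatible endpoint covers the whole list of platforms above.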
Using MongoDB Atlas with Baseten’s Chains framework, you can build high-performance compound AI systems.
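One way the retrieval step of such a system can look, sketched with `pymongo` and an Atlas `$vectorSearch` aggregation; the connection string, database, index name, and `embed()` stub are placeholders, and in a Chains deployment this logic would live inside a Chainlet:

```python
from pymongo import MongoClient

client = MongoClient("mongodb+srv://<user>:<password>@cluster0.mongodb.net")
collection = client["docs_db"]["chunks"]

def embed(text: str) -> list[float]:
    # Placeholder: swap in a real embedding model (e.g. via an API call).
    return [0.0] * 1536

def retrieve(query: str, k: int = 5) -> list[dict]:
    """Return the k stored chunks most similar to the query vector."""
    pipeline = [{
        "$vectorSearch": {
            "index": "vector_index",   # Atlas Vector Search index on the embedding field
            "path": "embedding",
            "queryVector": embed(query),
            "numCandidates": 20 * k,   # oversample candidates, then keep the top k
            "limit": k,
        }
    }]
    return list(collection.aggregate(pipeline))
```

The compound part comes from chaining this retrieval step with embedding and LLM generation steps, each of which can scale independently.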
Use a state machine to generate token masks for logit biasing to enable function calling and structured output at the model server level.
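A toy, self-contained sketch of the idea: the state machine's current state determines which tokens are legal, and every other logit is masked to negative infinity before sampling, so the output can never leave the grammar. The grammar and vocabulary here are illustrative:

```python
import torch

VOCAB = ["{", "}", '"name"', ":", '"Alice"', '"Bob"']

# DFA for the grammar: { "name" : <string> }
# Maps state -> {allowed token -> next state}; state 5 is accepting.
DFA = {
    0: {"{": 1},
    1: {'"name"': 2},
    2: {":": 3},
    3: {'"Alice"': 4, '"Bob"': 4},
    4: {"}": 5},
}

def mask_logits(logits: torch.Tensor, state: int) -> torch.Tensor:
    """Keep logits only for tokens the DFA allows in this state."""
    allowed = {VOCAB.index(tok) for tok in DFA.get(state, {})}
    masked = torch.full_like(logits, float("-inf"))
    for i in allowed:
        masked[i] = logits[i]
    return masked

state, output = 0, []
while state != 5:
    logits = torch.randn(len(VOCAB))   # stand-in for the model's next-token logits
    token = VOCAB[int(mask_logits(logits, state).argmax())]
    output.append(token)
    state = DFA[state][token]          # advance the state machine

print(" ".join(output))                # always a valid instance of the grammar
```

Because the mask is applied at the model server level, the guarantee holds for any sampling strategy layered on top.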
Automatically add function calling and structured output capabilities to any open-source or fine-tuned large language model supported by TensorRT-LLM.
Explore the strengths and weaknesses of state-of-the-art image generation models like FLUX.1, Stable Diffusion 3, SDXL Lightning, and Playground 2.5.
We observe up to a 122% increase in tokens per second for Llama 3 after training custom Medusa heads and running the updated model with TensorRT-LLM.