Baseten Blog | Page 2
Baseten Chains is now GA for production compound AI systems
Baseten Chains delivers ultra-low-latency compound AI at scale, with custom hardware per model and simplified model orchestration.
Private, secure DeepSeek-R1 in production in US & EU data centers
Dedicated deployments of DeepSeek-R1 and DeepSeek-V3 offer private, secure, high-performance inference that's cheaper than OpenAI.
Driving model performance optimization: 2024 highlights
Baseten's model performance team works to optimize customer models for latency, throughput, quality, cost, features, and developer efficiency.
New observability features: activity logging, LLM metrics, and metrics dashboard customization
We added three new observability features for improved monitoring and debugging: an activity log, LLM metrics, and customizable metrics dashboards.
How we built production-ready speculative decoding with TensorRT-LLM
Our TensorRT-LLM Engine Builder now supports speculative decoding, which can improve LLM inference speeds.
A quick introduction to speculative decoding
Speculative decoding reduces LLM inference latency by having a smaller draft model propose tokens that the larger target model then verifies, accepting the ones that match its own predictions.
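To make the draft-and-verify idea concrete, here is a minimal sketch of greedy speculative decoding in plain Python. It is a toy under stated assumptions, not Baseten's or TensorRT-LLM's implementation: the "models" are hypothetical deterministic functions (`toy_target`, `toy_draft`) that map a token sequence to the next token, and the names `speculative_decode` and `greedy_decode` are ours. A real system would verify all draft tokens in a single batched forward pass of the target model, which is where the latency savings come from.

```python
from typing import Callable, List, Sequence

# A "model" here is just a function from a token context to the next token.
NextToken = Callable[[Sequence[int]], int]


def greedy_decode(model: NextToken, prompt: Sequence[int], max_new_tokens: int) -> List[int]:
    """Baseline: one target-model call per generated token."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        tokens.append(model(tokens))
    return tokens


def speculative_decode(
    target: NextToken,
    draft: NextToken,
    prompt: Sequence[int],
    max_new_tokens: int,
    k: int = 4,
) -> List[int]:
    """Greedy speculative decoding: the draft proposes k tokens, the target verifies."""
    tokens = list(prompt)
    goal = len(prompt) + max_new_tokens
    while len(tokens) < goal:
        # 1. The cheap draft model proposes k tokens autoregressively.
        proposed: List[int] = []
        for _ in range(k):
            proposed.append(draft(tokens + proposed))

        # 2. The target checks each proposal. (Simulated sequentially here;
        #    in practice this is one parallel pass over all k positions.)
        accepted: List[int] = []
        for tok in proposed:
            expected = target(tokens + accepted)
            if expected != tok:
                # First mismatch: keep the target's token and drop the rest.
                accepted.append(expected)
                break
            accepted.append(tok)
        else:
            # Every draft token matched, so the target adds one bonus token.
            accepted.append(target(tokens + accepted))

        tokens.extend(accepted)
    return tokens[:goal]


# Hypothetical toy models: the draft agrees with the target most of the time.
def toy_target(ctx: Sequence[int]) -> int:
    return (ctx[-1] * 7 + 3) % 50


def toy_draft(ctx: Sequence[int]) -> int:
    guess = toy_target(ctx)
    return guess if len(ctx) % 3 else (guess + 1) % 50


if __name__ == "__main__":
    prompt = [1, 2, 3]
    fast = speculative_decode(toy_target, toy_draft, prompt, max_new_tokens=20)
    slow = greedy_decode(toy_target, prompt, max_new_tokens=20)
    print(fast == slow)
```

Because a draft token is kept only when it equals what the target would have produced, this greedy variant returns the same sequence as decoding with the target alone. The speedup depends on how often the draft's guesses are accepted.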
Introducing our Speculative Decoding Engine Builder integration for ultra-low-latency LLM inference
Our new Speculative Decoding integration can cut latency in half for production LLM workloads.
Generally Available: The fastest, most accurate and cost-efficient Whisper transcription
At Baseten, we've built the most performant (1000x real-time factor), accurate, and cost-efficient speech-to-text pipeline for production AI audio transcription.
Introducing Custom Servers: Deploy production-ready model servers from Docker images
Deploy production-ready model servers on Baseten directly from any Docker image using just a YAML file.