Baseten Blog | Page 2

Topics

Latest Model performance AI Engineering Infrastructure AI models Glossary Community Product News

Baseten Chains is now GA for production compound AI systems

Baseten Chains delivers ultra-low-latency compound AI at scale, with custom hardware per model and simplified model orchestration.

Marius Killinger

2 others

Baseten Chains delivers ultra-low-latency, scalable compound AI with custom hardware per model and seamless model orchestration.

AI models

Private, secure DeepSeek-R1 in production in US & EU data centers

Dedicated deployments of DeepSeek-R1 and DeepSeek-V3 offer private, secure, high-performance inference that's cheaper than OpenAI

Amir Haghighat

1 other

Model performance

Driving model performance optimization: 2024 highlights

Baseten's model performance team works to optimize customer models for latency, throughput, quality, cost, features, and developer efficiency.

Pankaj Gupta

Product

New observability features: activity logging, LLM metrics, and metrics dashboard customization

We added three new observability features for improved monitoring and debugging: an activity log, LLM metrics, and customizable metrics dashboards.

Suren Atoyan

4 others

Introducing three new observability features on Baseten: the activity log, LLM metrics, and customizable metrics dashboards

Model performance

How we built production-ready speculative decoding with TensorRT-LLM

Our TensorRT-LLM Engine Builder now supports speculative decoding, which can improve LLM inference speeds.

Pankaj Gupta

2 others

Glossary

A quick introduction to speculative decoding

Speculative decoding improves LLM inference latency by using a smaller model to generate draft tokens that the larger target model can accept during inference.

Pankaj Gupta

2 others

A ghostly, glowing llama walking ahead of a real llama

Product

Introducing our Speculative Decoding Engine Builder integration for ultra-low-latency LLM inference

Our new Speculative Decoding integration can cut latency in half for production LLM workloads.

Justin Yi

3 others

Baseten's Speculative Decoding integration can cut latency in half for production LLM workloads.

Model performance

Generally Available: The fastest, most accurate and cost-efficient Whisper transcription

At Baseten, we've built the most performant (1000x real-time factor), accurate, and cost-efficient speech-to-text pipeline for production AI audio transcription

William Gao

3 others

Train speeding down a tunnel with Baseten as a label

Product

Introducing Custom Servers: Deploy production-ready model servers from Docker images

Deploy production-ready model servers on Baseten directly from any Docker image using just a YAML file.

Tianshu Cheng

2 others

Custom Servers on Baseten let you deploy any Docker image as a production-ready ML model server.

1 2 3…14