Baseten Blog
Announcing our Series B
We’ve spent the last four and a half years building Baseten to be the most performant, scalable, and reliable way to run your machine learning workloads.
The benefits of globally distributed infrastructure for model serving
Multi-cloud and multi-region infrastructure for model serving provides higher availability, redundancy, lower latency, cost savings, and data residency compliance.
New in February 2024
3x throughput with H100 GPUs, 40% lower SDXL latency with TensorRT, and multimodal open source models.
40% faster Stable Diffusion XL inference with NVIDIA TensorRT
Using NVIDIA TensorRT to optimize each component of the SDXL pipeline, we improved SDXL inference latency by 40% and throughput by 70% on NVIDIA H100 GPUs.
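To make the general workflow concrete, here is a minimal sketch of building a TensorRT FP16 engine from an ONNX export of one SDXL pipeline component (for example, the UNet). This is not the exact optimization recipe from the post: the file names are placeholders, and it assumes the component was already exported with fixed input shapes (a real SDXL UNet with dynamic shapes would also need an optimization profile).

```python
# Minimal sketch: build a TensorRT FP16 engine from an ONNX export of one
# SDXL pipeline component (e.g. the UNet). File names are placeholders.
import tensorrt as trt

ONNX_PATH = "unet.onnx"    # assumed: component already exported via torch.onnx.export
ENGINE_PATH = "unet.plan"

logger = trt.Logger(trt.Logger.INFO)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
)
parser = trt.OnnxParser(network, logger)

# Parse the ONNX graph into a TensorRT network definition.
with open(ONNX_PATH, "rb") as f:
    if not parser.parse(f.read()):
        for i in range(parser.num_errors):
            print(parser.get_error(i))
        raise RuntimeError("Failed to parse ONNX model")

# Enable FP16 kernels, then build and serialize the optimized engine
# to disk so it can be loaded at inference time.
config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.FP16)
engine_bytes = builder.build_serialized_network(network, config)
with open(ENGINE_PATH, "wb") as f:
    f.write(engine_bytes)
```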
Why GPU utilization matters for model inference
Save money on high-traffic model inference workloads by increasing GPU utilization to maximize performance per dollar for LLMs, SDXL, Whisper, and more.
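A quick back-of-the-envelope calculation shows why utilization drives performance per dollar: the GPU costs the same per hour whether it is busy or idle, so cost per request falls as utilization rises. The numbers below are hypothetical and only for illustration.

```python
# Back-of-the-envelope cost math: the GPU costs the same per hour whether it
# is busy or idle, so cost per request falls as utilization rises.
# All numbers below are hypothetical, for illustration only.

GPU_COST_PER_HOUR = 4.00                 # assumed hourly price for one GPU
REQUESTS_PER_HOUR_AT_FULL_LOAD = 10_000  # assumed peak throughput of the model server

def cost_per_request(utilization: float) -> float:
    """Cost per request when the GPU serves `utilization` of its peak throughput."""
    served = REQUESTS_PER_HOUR_AT_FULL_LOAD * utilization
    return GPU_COST_PER_HOUR / served

for u in (0.1, 0.5, 0.9):
    print(f"{u:.0%} utilization -> ${cost_per_request(u):.4f} per request")
# 10% utilization -> $0.0040 per request
# 50% utilization -> $0.0008 per request
# 90% utilization -> $0.0004 per request
```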
The best open source large language model
Explore the best open source large language models of 2024 for any budget, license, and use case.
Unlocking the full power of NVIDIA H100 GPUs for ML inference with TensorRT
Double or triple throughput at the same or better latency by switching from A100s to H100 GPUs for model inference with TensorRT/TensorRT-LLM.
New in January 2024
A library for open source models, general availability for L4 GPUs, and performance benchmarking for ML inference.
Introduction to quantizing ML models
Quantizing ML models like LLMs makes it possible to run large models on less expensive GPUs, but it must be done carefully to avoid degrading output quality.
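As a minimal illustration of the idea, here is a sketch of symmetric int8 quantization of a weight tensor in NumPy. It shows both the memory savings (4x smaller than float32) and the rounding error that careless quantization can amplify; the scheme and numbers are illustrative, not the post's exact method, and real LLM quantization approaches (GPTQ, AWQ, etc.) are more involved.

```python
# Minimal sketch of symmetric int8 quantization of a weight tensor.
# Illustrative only; production LLM quantization schemes are more involved.
import numpy as np

weights = np.random.randn(4096).astype(np.float32)  # pretend layer weights

# One scale for the whole tensor: map the largest magnitude to 127.
scale = np.abs(weights).max() / 127.0
q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)

# Dequantize and measure the rounding error introduced.
dequantized = q.astype(np.float32) * scale
error = np.abs(weights - dequantized).max()

print(f"memory: {weights.nbytes} B fp32 -> {q.nbytes} B int8 (4x smaller)")
print(f"max absolute rounding error: {error:.5f}")
```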