New in May 2024
TL;DR
This May, we’re spotlighting AI events, multicluster model serving, tokenizer efficiency, and forward-deployed engineering. Join Baseten engineers for upcoming tech talks and workshops on AI phone calling, LLM optimization, and async model inference in NY, SF, and online. Plus, discover our new multicluster architecture for enhanced model serving and compare tokenizer efficiency across LLMs for accurate performance metrics. We’re also hiring forward-deployed engineers—apply today!
Join us in NY, SF, or anywhere
If you’re reading this, we want to hang out with you in June! Baseten engineers are leading tech talks and workshops on topics like AI phone calling, LLM optimization with TensorRT-LLM, and asynchronous model inference all month long.
In New York for #TechWeek:
June 5, 5PM ET: A panel discussion on voice and phone call automation, with a focus on enterprise use cases. Join: lu.ma/phonecall.
June 6, 5PM ET: A tech talk on “production-ready async inference” by Baseten engineer Samiksha Pal. Register: lu.ma/h100.
June 7, 6PM ET: A house party for AI/ML engineers. RSVP: lu.ma/lg1wz57u.
In San Francisco at the AI Engineer World’s Fair:
June 25: Pankaj Gupta and Philip Kiely are hosting a workshop on optimizing LLM inference with TensorRT-LLM. Tickets available at ai.engineer.
Anywhere in the world:
June 12, 10AM PT: Philip Kiely will join Daniel Lenton of unify.ai to discuss dynamic routing for LLM inference requests. Sign up: https://lu.ma/s1ezdjpo.
Multicluster AI inference architecture
One of Baseten’s core abstractions is the separation of our model serving infrastructure into a control plane and a set of workload planes. The control plane is a single Kubernetes cluster that handles centralized resource allocation and backend operations, while workload planes run model inference directly across different regions, cloud providers, and VPCs.
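To make the split concrete, here’s a purely conceptual Python sketch of the idea — not Baseten’s implementation, and every name in it is illustrative — where a control plane picks a regional workload plane with spare capacity and hands it an inference request:

```python
from dataclasses import dataclass

# Conceptual illustration only -- not Baseten's actual code.
# A control plane schedules work; workload planes run inference.

@dataclass
class WorkloadPlane:
    name: str        # e.g., a Kubernetes cluster in one cloud/region/VPC
    region: str
    free_gpus: int

    def run_inference(self, model: str, payload: dict) -> dict:
        # In practice this would call model servers inside the cluster.
        return {"plane": self.name, "model": model, "echo": payload}

class ControlPlane:
    """Central scheduler: allocates resources, picks a workload plane."""

    def __init__(self, planes: list[WorkloadPlane]):
        self.planes = planes

    def route(self, model: str, payload: dict, preferred_region: str | None = None) -> dict:
        candidates = [
            p for p in self.planes
            if p.free_gpus > 0 and (preferred_region is None or p.region == preferred_region)
        ]
        if not candidates:
            raise RuntimeError("no capacity available")
        # Naive policy for the sketch: the plane with the most headroom wins.
        plane = max(candidates, key=lambda p: p.free_gpus)
        return plane.run_inference(model, payload)

if __name__ == "__main__":
    control = ControlPlane([
        WorkloadPlane("cluster-us-east", "us-east", free_gpus=4),
        WorkloadPlane("cluster-eu-west", "eu-west", free_gpus=2),
    ])
    print(control.route("llama-3-8b-instruct", {"prompt": "Hello!"}, preferred_region="eu-west"))
```

The real system adds the hard parts — autoscaling, isolation, observability, and cross-cloud networking — which is what the architecture overview below covers.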
Read our new multicluster architecture overview to learn how the control and workload plane abstractions enable availability, security, compliance, and scale for model inference.
Comparing tokens per second across models
Every LLM processes text as tokens, which can range from a couple of characters to a whole word. LLMs use tokenizers to convert text into tokens and tokens back into text, but tokenizer efficiency varies from model to model.
[Chart: length of tokenized sequences per model for different input samples]
When we measure model performance in tokens per second (TPS), it’s essential to relate that metric back to real-world user experience. Models with more efficient tokenizers, like Llama 3, can offer better perceived speeds than their TPS numbers might suggest because they need fewer tokens to complete the same user requests. If you’re evaluating multiple LLMs for performance, here’s what you need to know about comparing tokens per second across different models.
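You can get a feel for the difference yourself with a quick sketch using Hugging Face tokenizers — the model IDs below are illustrative (some, like Llama 3, require accepting the model license on Hugging Face before the tokenizer will download):

```python
# Rough sketch: count how many tokens each model's tokenizer needs
# for the same input text.
from transformers import AutoTokenizer

MODEL_IDS = [
    "meta-llama/Meta-Llama-3-8B-Instruct",
    "mistralai/Mistral-7B-Instruct-v0.2",
]

sample = "Baseten serves open source models like Llama 3 in production."

for model_id in MODEL_IDS:
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    token_count = len(tokenizer.encode(sample, add_special_tokens=False))
    print(f"{model_id}: {token_count} tokens")
```

The practical upshot: a model whose tokenizer needs 20% fewer tokens for the same reply can run at 20% lower TPS and still feel just as fast to the end user.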
Baseten is hiring!
At Baseten, we’re growing our engineering team. We’re hiring for forward-deployed engineers and other roles across model performance, core product, and more.
Curious about what working as a forward-deployed engineer (FDE) is like? One of our FDEs, Het, wrote about his first six months on the job. He writes, “Being an FDE mixes engineering, sales, and customer support all in one role … as an FDE, the impact is very clear. Every time you close a deal, that’s new revenue hitting the company’s bottom line — because of you! Customers also tend to show a lot of appreciation and gratitude when you help them, which adds to that sense of personal fulfillment.”
If you’re interested in becoming an FDE at Baseten, send in an application and we’ll be in touch shortly! Please reach out to careers@baseten.co with any questions.
We’ll be back next month with more from the world of open source AI!
Thanks for reading,
— The team at Baseten
Subscribe to our newsletter
Stay up to date on model performance, GPUs, and more.