New in July 2023
July was a banner month for open-source foundation models. Llama 2 and Stable Diffusion XL each set a new bar for ease of use and output quality in their categories. But using these models in production, and achieving enough throughput cost-effectively, can be a challenge. That’s why here at Baseten we’ve been focusing on features and explainers around autoscaling, cold starts, and scale-to-zero.
Llama 2 brings new SOTA to OSS LLMs
There aren’t enough acronyms to describe how exciting Llama 2, the new state of the art (SOTA) in open-source large language models (OSS LLMs), is to build with. The model comes in three sizes (7B, 13B, and 70B parameters). The 7B model is small enough to run on an A10, while the 70B model trades blows with GPT-3.5 on output quality. And with the model’s 4k-token context window, it’s the best OSS model yet for chatbots and agents.
Get started with Llama 2:
Deploy Llama 2 from the model library
Learn more about Llama 2 in this month’s Models we Love
Plus: Llama 2 on Baseten takes advantage of Truss’ new streaming output support, so you can stream the model response for a substantially lower time-to-first-token.
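For a feel of how that looks in practice, here’s a minimal sketch that streams a response over plain HTTP with Python’s requests library. The model ID, payload fields, and stream flag are placeholder assumptions; the exact request shape depends on how your deployment is configured.

```python
# Hypothetical example: streaming tokens from a Llama 2 deployment on Baseten.
# MODEL_ID, the payload fields, and the "stream" flag are assumptions; check
# your deployment's documentation for the exact contract.
import os

import requests

MODEL_ID = "YOUR_MODEL_ID"  # placeholder: the ID of your deployed model
API_KEY = os.environ["BASETEN_API_KEY"]

resp = requests.post(
    f"https://app.baseten.co/models/{MODEL_ID}/predict",
    headers={"Authorization": f"Api-Key {API_KEY}"},
    json={"prompt": "Explain autoscaling in one paragraph.", "stream": True},
    stream=True,  # keep the connection open and read chunks as they arrive
)
resp.raise_for_status()
resp.encoding = "utf-8"  # decode streamed chunks as text

# Print each chunk as soon as it lands instead of waiting for the full
# response, which is what lowers the perceived time-to-first-token.
for chunk in resp.iter_content(chunk_size=None, decode_unicode=True):
    print(chunk, end="", flush=True)
```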
Stable Diffusion XL 1.0: Better images, shorter prompts
Stable Diffusion XL 1.0 (SDXL) is a larger, more powerful version of Stable Diffusion that creates high-quality images from shorter prompts. Rather than padding your prompt with a string of adjectives (e.g. “4k cinematic high resolution beautiful artistic photorealism”), use SDXL and just type in exactly what you want to see.
The best way to evaluate a text-to-image model is to give it a try! We took prompts on Twitter and generated a dozen images that show the range of the model’s capabilities and limitations.
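If you’d rather try it programmatically, here’s a hedged sketch of calling an SDXL deployment and saving the generated image. The model ID and the response field holding the base64-encoded image are assumptions; see the quickstart linked below for the exact request and response schema.

```python
# Hypothetical example: generating an image from a short prompt with an SDXL
# deployment on Baseten. MODEL_ID and the "data" response field are
# assumptions; consult your deployment's quickstart for the exact schema.
import base64
import os

import requests

MODEL_ID = "YOUR_MODEL_ID"  # placeholder for your deployed SDXL model
API_KEY = os.environ["BASETEN_API_KEY"]

resp = requests.post(
    f"https://app.baseten.co/models/{MODEL_ID}/predict",
    headers={"Authorization": f"Api-Key {API_KEY}"},
    # Note the short prompt: no pile of style adjectives needed.
    json={"prompt": "a lighthouse at dawn in heavy fog"},
)
resp.raise_for_status()

# Assume the model returns the image as a base64-encoded string.
image_b64 = resp.json()["data"]
with open("lighthouse.png", "wb") as f:
    f.write(base64.b64decode(image_b64))
```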
Get started with Stable Diffusion XL 1.0:
Read a quickstart guide for SDXL deployment and invocation
Learn more about SDXL in this month’s Models we Love
Model autoscaling for cost-effective throughput
Autoscaling is the process of automatically creating and deleting replicas of your machine learning model server in response to incoming traffic. Model traffic is usually inconsistent, so autoscaling helps ensure that you only pay for compute resources that you actually need.
Autoscaling includes scale-to-zero, where a model scales down to zero replicas when not in use, meaning you pay nothing while the model is completely scaled down. There is a catch, though: the first request to a scaled-down model incurs a cold start while the server spins back up. But we’ve been hammering down cold start times for months to give you reliable, performant autoscaling infrastructure. For example, for Stable Diffusion on an A10G, we reliably see cold start times under 15 seconds, from zero to ready for inference.
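To make the idea concrete, here’s a conceptual sketch of the replica math an autoscaler with scale-to-zero might use. The parameter names and policy are invented for illustration; this is not Baseten’s actual implementation.

```python
# Conceptual sketch of replica-count math for autoscaling with scale-to-zero.
# An illustration of the idea only, not Baseten's implementation; the
# parameters and policy are invented for this example.
import math


def target_replicas(
    inflight_requests: int,
    concurrency_target: int = 4,  # requests each replica should handle at once
    max_replicas: int = 10,
) -> int:
    """Pick a replica count for the current load."""
    if inflight_requests == 0:
        # Scale to zero: no traffic means no replicas and no compute bill.
        # The next request pays a one-time cold start while a replica boots.
        return 0
    return min(max_replicas, math.ceil(inflight_requests / concurrency_target))


# 0 requests -> 0 replicas; 1 -> 1; 9 -> 3; 100 -> capped at 10.
for load in (0, 1, 9, 100):
    print(load, "->", target_replicas(load))
```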
To learn more about how autoscaling works, take a look at our recent explainer on autoscaling features.
And while autoscaling features don’t kick in until the model is active, you can now avoid resource waste by stopping accidental or unwanted deployments in the Baseten UI.
We’ll be back next month with more open-source models, ML project tutorials, and infrastructure content.
Thanks all!
— The team at Baseten