New in August 2023
This month, we're reintroducing Truss, our open-source tool for packaging and serving ML models. Plus, learn how to optimize Stable Diffusion XL inference to run in under 2 seconds and build your own open-source version of ChatGPT with Llama 2 and Chainlit.
Reintroducing Truss for model packaging and serving
One year after its initial launch, we’re introducing Truss 0.6.0, which synthesizes a year of learnings into a tool for packaging and serving ML models. And we rebuilt the docs from scratch to make Truss easy to learn.
We built Truss to solve three frustrations:
The dev loop for serving ML models is too long.
There's no clear path from notebook or script to model server.
Configuring a model server requires detailed knowledge of Docker and Kubernetes.
Truss solves the first frustration with a live reload workflow for model serving similar to what you get when developing a website with React or Django. Learn about live reload in the Truss user guide.
For the second issue, a clear path from notebook to model server, Truss' simplified model.py file lets you implement an interface between an ML model and the model server. Get started building trusses with the quickstart guide.
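To make that concrete, here's a minimal model.py in the style of the Truss quickstart. The Model class with load() and predict() methods is the Truss interface; the text-classification pipeline is just a stand-in for your own model.

```python
# model/model.py — a minimal sketch of the Truss model interface.
from transformers import pipeline


class Model:
    def __init__(self, **kwargs):
        self._model = None

    def load(self):
        # Called once when the model server starts; load weights here.
        self._model = pipeline("text-classification")

    def predict(self, model_input):
        # Called on every request; model_input is the parsed request body.
        return self._model(model_input["text"])
```

With live reload, edits to this file are patched onto the running model server in seconds instead of triggering a full rebuild and redeploy.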
Finally, Truss requires zero knowledge of Docker or Kubernetes. In fact, you don’t even need them installed on your local machine to use Truss. Instead, every attribute of the model server is controlled from the config.yaml file. Truss is still very flexible; you can review a full list of config options in this reference.
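For example, here's a sketch of a config.yaml using common options from the config reference; treat the exact keys and values as assumptions to check against your Truss version.

```yaml
# config.yaml — a few common options; see the config reference for the full list.
model_name: text-classifier
python_version: py39
requirements: # pinned versions here are illustrative
  - transformers==4.31.0
  - torch==2.0.1
resources:
  cpu: "1"
  memory: 2Gi
  use_gpu: false
  # accelerator: A10G  # request a specific GPU when use_gpu is true
```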
Give Truss a star on GitHub to follow along with development progress. We’re building in public and welcome bug reports, feature suggestions, and pull requests!
SDXL inference in 1.92 seconds
By default, it takes about 8 seconds to generate an image with Stable Diffusion XL on an A100 and three times as long on an A10. We cut inference time to as low as 1.92 seconds on an A100 by:
Reducing the number of inference steps from the default of 50, with minimal impact on output quality.
Setting classifier-free guidance (CFG) to zero after 40% of the steps.
Swapping in the refiner model for the last 20% of the steps.
Choosing an fp16 VAE and efficient attention to improve memory efficiency.
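Here's a sketch of most of these optimizations using Hugging Face diffusers. The step count, the madebyollin/sdxl-vae-fp16-fix VAE checkpoint, and the 80/20 base/refiner split expressed via denoising_end and denoising_start are assumptions drawn from public diffusers examples, not necessarily our exact implementation; zeroing CFG partway through additionally requires a step-end callback, which is omitted here.

```python
import torch
from diffusers import (
    AutoencoderKL,
    StableDiffusionXLImg2ImgPipeline,
    StableDiffusionXLPipeline,
)

# fp16-safe VAE (a community fix) so decoding can stay in half precision
vae = AutoencoderKL.from_pretrained(
    "madebyollin/sdxl-vae-fp16-fix", torch_dtype=torch.float16
)

base = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    vae=vae,
    torch_dtype=torch.float16,
    variant="fp16",
).to("cuda")

refiner = StableDiffusionXLImg2ImgPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-refiner-1.0",
    vae=vae,
    torch_dtype=torch.float16,
    variant="fp16",
).to("cuda")

# Memory-efficient attention (requires the xformers package)
base.enable_xformers_memory_efficient_attention()
refiner.enable_xformers_memory_efficient_attention()

prompt = "a photograph of an astronaut riding a horse"
num_steps = 25  # assumption: well under the default of 50

# The base model denoises the first 80% of steps and hands off latents...
latents = base(
    prompt=prompt,
    num_inference_steps=num_steps,
    denoising_end=0.8,
    output_type="latent",
).images

# ...and the refiner finishes the last 20%.
image = refiner(
    prompt=prompt,
    num_inference_steps=num_steps,
    denoising_start=0.8,
    image=latents,
).images[0]

image.save("astronaut.png")
```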
You can use this optimized version of SDXL today from our model library or on GitHub.
We go into all the details in this guide to inference optimization. Give it a read to learn techniques for getting faster results from ML models.
OSS ChatGPT with Llama 2 and Chainlit
ChatGPT gained 100 million users in 2 months by combining a state-of-the-art large language model with an intuitive interface that showed off the model’s capabilities. Now, you can build a similar application using entirely open-source technologies: Llama 2 for the model and Chainlit for the user interface.
Llama 2, a recent open-source LLM from Meta, is competitive with GPT-3.5, the closed-source LLM behind ChatGPT, on output quality. But a high-quality model, while necessary, is not sufficient for a great user experience.
That’s where Chainlit comes in. It’s an open-source tool for building a ChatGPT-like interface on top of any LLM. Out of the box, it gives you important features like streaming model output, prompt history, context-aware chat, and chat restart.
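As a taste of how little code the UI side takes, here's a minimal sketch of a Chainlit app that forwards each chat message to a Llama 2 endpoint. The query_llama helper, MODEL_URL variable, and request/response shape are hypothetical placeholders; the resources below walk through a complete implementation, including streaming.

```python
# app.py — a minimal Chainlit chat UI backed by a Llama 2 endpoint.
# Run with: chainlit run app.py
import os

import chainlit as cl
import httpx

# Hypothetical: the URL of your deployed Llama 2 model's API
MODEL_URL = os.environ["MODEL_URL"]


async def query_llama(prompt: str) -> str:
    # Hypothetical request/response shape; match your deployment's API.
    async with httpx.AsyncClient(timeout=60) as client:
        resp = await client.post(MODEL_URL, json={"prompt": prompt})
        resp.raise_for_status()
        return resp.json()["output"]


@cl.on_message
async def main(message: cl.Message):
    # Forward the user's message to the model and send back the reply.
    # (Older Chainlit versions pass the message as a plain string.)
    reply = await query_llama(message.content)
    await cl.Message(content=reply).send()
```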
Build this open-source ChatGPT for yourself with these resources:
Deploy your own with this cookbook in the Chainlit documentation
Walk through the project with a tutorial from the Baseten blog
We’ll be back next month with another dispatch from the world of open-source AI and ML!
Thanks for reading!
— The team at Baseten