New in February 2024
TL;DR
Latency, throughput, quality, cost. These factors determine the success of an ML model in production, and we’re excited to share February’s improvements across all four factors in this newsletter. From lower latency and higher throughput with TensorRT on H100 GPUs to sub-one-second image generation with SDXL Lightning, we’ve continued our focus on model performance. New open source models bring multimodal capabilities and best-in-class quality, and our refreshed billing dashboard gives you daily insight into usage and spend.
NVIDIA H100 GPUs for model inference
We’re now offering model inference on H100 GPUs — the world’s most powerful GPU for running ML models.
H100 GPUs feature:
989.5 teraFLOPS of fp16 tensor compute (vs 312 teraFLOPS for the 80GB A100)
80 GB of VRAM (matching the 80GB A100)
3.35 TB/s memory bandwidth (vs 2.039 TB/s for the 80GB A100)
This translates to extraordinary performance for model inference, especially for models optimized with TensorRT-LLM. In our testing, we saw 3x higher throughput at constant latency for Mistral 7B versus A100 GPUs. Because that throughput gain more than offsets the H100's higher per-minute price, it works out to a 45% reduction in cost for running high-traffic workloads.
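To see how a throughput gain becomes a cost reduction, here's the back-of-envelope math as a minimal sketch. The per-minute prices below are hypothetical placeholders, not our actual pricing (see the H100 changelog for that):

```python
# Back-of-envelope cost-per-request math. Both prices are HYPOTHETICAL
# placeholders; see the H100 changelog for actual pricing.
a100_price_per_min = 0.10   # hypothetical $/min for an 80GB A100
h100_price_per_min = 0.16   # hypothetical $/min for an H100
throughput_gain = 3.0       # 3x higher throughput at constant latency

# Cost per request scales with price per minute divided by requests per minute.
relative_cost = (h100_price_per_min / a100_price_per_min) / throughput_gain
print(f"H100 cost per request vs A100: {relative_cost:.0%}")  # ~53%
print(f"Savings per request: {1 - relative_cost:.0%}")        # ~47%
```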
See our H100 changelog for details on pricing and instance types. H100 GPUs are available for all users; you can deploy a model on an H100 GPU today.
40% faster SDXL with TensorRT
TensorRT, NVIDIA's software development kit for high-performance deep learning inference, is a powerful tool for making models run faster, especially on top-end GPUs like the A100 and H100.
While TensorRT is often used via TensorRT-LLM to optimize language models, you can also use the base TensorRT to optimize a wider range of models. We optimized Stable Diffusion XL with TensorRT and saw 40% lower latency and 70% higher throughput on H100 GPUs compared to a baseline implementation.
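If you want to verify latency numbers like these against your own traffic, a simple timing harness is enough to compare two deployments of the same model. This is a minimal sketch: the model IDs, API key, and input schema are placeholders, and the URLs follow Baseten's standard model invocation pattern:

```python
import time
import requests

# Hypothetical baseline and TensorRT-optimized deployments of the same model;
# substitute your own model IDs and API key.
ENDPOINTS = {
    "baseline": "https://model-BASELINE_ID.api.baseten.co/production/predict",
    "tensorrt": "https://model-TENSORRT_ID.api.baseten.co/production/predict",
}
HEADERS = {"Authorization": "Api-Key YOUR_API_KEY"}
PAYLOAD = {"prompt": "a lighthouse at dawn, photorealistic"}  # placeholder schema

for name, url in ENDPOINTS.items():
    start = time.perf_counter()
    requests.post(url, headers=HEADERS, json=PAYLOAD, timeout=120)
    print(f"{name}: {time.perf_counter() - start:.2f}s per request")
```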
Deploy TensorRT-optimized models from our model library to leverage these performance gains in your product.
Real-time image generation with SDXL Lightning
If SDXL on an H100 isn’t fast enough for you, consider SDXL Lightning, a new implementation of few-step image generation. SDXL Lightning shows notable improvements over other fast image models like SDXL Turbo, including full 1024x1024 output image size and closer prompt adherence. However, there is still a compromise in quality versus the base SDXL model, especially for highly detailed images.
Deploy SDXL Lightning in one click from our model library and start generating images in less than one second each.
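Once deployed, invoking the model from Python looks roughly like the sketch below. The exact fields depend on the model library implementation, so treat the `prompt` input and base64 `result` output as assumptions:

```python
import base64
import requests

# Call a deployed SDXL Lightning model. The model ID is a placeholder, and
# the input/output field names are assumptions; check your deployment's docs.
url = "https://model-YOUR_MODEL_ID.api.baseten.co/production/predict"
headers = {"Authorization": "Api-Key YOUR_API_KEY"}

resp = requests.post(url, headers=headers, json={"prompt": "a watercolor fox"})
resp.raise_for_status()

# Assuming the image comes back base64-encoded, decode and save it.
with open("fox.png", "wb") as f:
    f.write(base64.b64decode(resp.json()["result"]))
```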
Qwen-VL: an open source visual language model
Alibaba has released Qwen, a family of open source language models comparable to Llama 2. Qwen is short for Tongyi Qianwen (通义千问), which we translated as “Responding to any and all of your questions, no matter the subject or the quantity.”
Within the Qwen family, Qwen-VL stands out as a large vision language model. Qwen-VL describes images in natural language, with grounding that identifies where in the image each described object lies.
Deploy Qwen-VL for a peek into the future of multimodal models that combine vision and language.
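To get a feel for grounding, here's a minimal sketch that runs the chat-tuned checkpoint straight from Hugging Face, following the usage pattern from the model card. The image path is a placeholder, and the model ships custom code, so `trust_remote_code=True` is required:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the chat-tuned Qwen-VL checkpoint (needs a GPU and custom model code).
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-VL-Chat", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-VL-Chat", device_map="cuda", trust_remote_code=True
).eval()

# Interleave an image (placeholder path; a URL also works) with a grounding prompt.
query = tokenizer.from_list_format([
    {"image": "path/to/photo.jpg"},
    {"text": "Describe this image and draw a box around each animal."},
])
response, history = model.chat(tokenizer, query=query, history=None)
print(response)  # grounded objects come back as <ref>...</ref><box>...</box> tags
```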
Best-in-class open source text embedding
Nomic Embed v1.5 is a text embedding model that beats OpenAI’s text-embedding-3-small on benchmarks while using only half the dimensionality. Nomic Embed v1.5 offers:
Optimized embeddings for retrieval, search, clustering, or classification.
Adjustable dimensionality with Matryoshka Representation Learning.
Deploy Nomic Embed v1.5 from the Baseten model library for accurate, efficient text embedding.
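Here's a minimal sketch of adjustable dimensionality in practice, using the Hugging Face checkpoint with sentence-transformers and following the resizing recipe from the model card as we understand it: layer-norm, truncate, then re-normalize. Note that Nomic Embed expects a task prefix (such as `search_document:`) on every input:

```python
import torch.nn.functional as F
from sentence_transformers import SentenceTransformer

# nomic-embed-text-v1.5 ships custom modeling code, hence trust_remote_code.
model = SentenceTransformer("nomic-ai/nomic-embed-text-v1.5", trust_remote_code=True)

# Every input needs a task prefix: search_document, search_query,
# clustering, or classification.
docs = ["search_document: TensorRT cut SDXL latency by 40% on H100 GPUs."]
embeddings = model.encode(docs, convert_to_tensor=True)

# Matryoshka resizing: layer-norm, truncate to the target dimension,
# then L2-normalize. Here we keep 256 of the full 768 dimensions.
matryoshka_dim = 256
embeddings = F.layer_norm(embeddings, normalized_shape=(embeddings.shape[1],))
embeddings = F.normalize(embeddings[:, :matryoshka_dim], p=2, dim=1)
print(embeddings.shape)  # torch.Size([1, 256])
```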
Product update: improved billing visibility
In February, we released a refreshed billing dashboard that gives you detailed insights into your model usage and associated spend. Here’s what we added:
A new graph for daily costs, requests, and billable minutes.
Billing and usage information for the previous billing period.
Request count visibility within the model usage table.
We’ll be back next month with more from the world of open source ML!
Thanks for reading,
— The team at Baseten