Glossary
Building high-performance compound AI applications with MongoDB Atlas and Baseten
Learn how to build high-performance compound AI systems with MongoDB Atlas and Baseten’s Chains framework.
Compound AI systems explained
Compound AI systems combine multiple models and processing steps, and are forming the next generation of AI products.
How latent consistency models work
Latent Consistency Models (LCMs) improve on generative AI methods to produce high-quality images in just 2-4 steps, taking less than a second for inference.
Control plane vs workload plane in model serving infrastructure
A separation of concerns between a control plane and workload planes enables multi-cloud, multi-region model serving and self-hosted inference.
Comparing tokens per second across LLMs
To accurately compare tokens per second between different large language models, we need to adjust for tokenizer efficiency.
Continuous vs dynamic batching for AI inference
Learn how continuous and dynamic batching increase throughput during model inference with minimal impact on latency.
FP8: Efficient model inference with 8-bit floating point numbers
The FP8 data format has a wider dynamic range than INT8, which allows quantizing weights and activations for more LLMs without loss of output quality.
The benefits of globally distributed infrastructure for model serving
Multi-cloud and multi-region infrastructure for model serving provides availability, redundancy, lower latency, cost savings, and data residency compliance.
Why GPU utilization matters for model inference
Save money on high-traffic model inference workloads by increasing GPU utilization to maximize performance per dollar for LLMs, SDXL, Whisper, and more.
Introduction to quantizing ML models
Quantizing ML models like LLMs makes it possible to run big models on less expensive GPUs, but it must be done carefully to avoid reducing output quality.
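As a rough illustration of what quantizing involves, the sketch below applies symmetric per-tensor INT8 quantization to a handful of made-up weights and measures the rounding error introduced. The function names (`quantize_int8`, `dequantize`) and the sample values are hypothetical and chosen only for this example; production quantization of LLMs typically uses finer-grained scales and calibration data.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    # One scale for the whole tensor; fall back to 1.0 for an all-zero tensor.
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float values from the integer codes."""
    return [q * scale for q in quantized]


# Hypothetical weights, for illustration only.
weights = [0.02, -0.5, 0.31, 1.27, -1.0]
quantized, scale = quantize_int8(weights)
restored = dequantize(quantized, scale)

# The worst-case rounding error is bounded by half of one quantization step.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(quantized, scale, max_error)
```

Because a single scale is set by the largest weight, one outlier stretches the step size for every other value, which is one reason careless quantization can reduce quality.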