Co-Founder
Machine learning infrastructure that just works
Baseten provides all the infrastructure you need to deploy and serve ML models performantly, scalably, and cost-efficiently.
The FP8 data format has an expanded dynamic range versus INT8 which allows for quantizing weights and activations for more LLMs without loss of output quality.
Using NVIDIA TensorRT to optimize each component of the SDXL pipeline, we improved SDXL inference latency by 40% and throughput by 70% on NVIDIA H100 GPUs.
Double or triple throughput at same-or-better latencies by switching to H100 GPUs from A100s for model inference with TensorRT/TensorRT-LLM.
Mixtral 8x7B structurally has faster inference than the similarly powerful Llama 2 70B, but we can make it even faster using TensorRT-LLM and INT8 quantization.
Truss' live reload feature revolutionizes iterative development, turning the lengthy 3-30 minute model deployment process into an almost instant task.
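To make the FP8 versus INT8 dynamic range point above concrete, here is a minimal Python sketch. It assumes the two commonly used FP8 encodings, E4M3 and E5M2, since the summary above does not say which one is meant. The helper name `dynamic_range_ratio` and the `FORMATS` table are hypothetical, introduced only for illustration.

```python
import math

# Smallest positive and largest finite magnitude for each format.
# Assumption: FP8 values follow the widely used E4M3 / E5M2 definitions
# (E4M3: max 448, smallest subnormal 2**-9; E5M2: max 57344, smallest
# subnormal 2**-16). INT8 is treated as a plain signed integer grid.
FORMATS = {
    "INT8": (1.0, 127.0),
    "FP8 E4M3": (2.0 ** -9, 448.0),
    "FP8 E5M2": (2.0 ** -16, 57344.0),
}


def dynamic_range_ratio(smallest: float, largest: float) -> float:
    """Ratio between the largest and smallest nonzero representable magnitudes."""
    return largest / smallest


if __name__ == "__main__":
    for name, (smallest, largest) in FORMATS.items():
        ratio = dynamic_range_ratio(smallest, largest)
        # Express the same ratio in decibels for easier comparison.
        decibels = 20 * math.log10(ratio)
        print(f"{name}: ratio = {ratio:.0f}, about {decibels:.1f} dB")
```

Both formats use 8 bits, but FP8 spends some of them on an exponent, so its representable magnitudes span several orders of magnitude more than INT8's evenly spaced grid. That wider span is what lets outlier-heavy weights and activations be quantized without a carefully tuned scale factor.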