BEI models
BEI is Baseten's solution for production-grade deployment of (text) embedding, reranking, and prediction models via TensorRT-LLM. With BEI you get the following benefits:
- Lowest-latency inference of any embedding solution (vLLM, SGLang, Infinity, TEI, Ollama)
- Highest-throughput inference of any embedding solution (vLLM, SGLang, Infinity, TEI, Ollama), thanks to XQA kernels, FP8 quantization, and dynamic batching
- High parallelism: up to 1400 client embeddings per second (see the request sketch after this list)
- Cached model weights for fast vertical scaling and high availability, with no Hugging Face Hub dependency at runtime
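
To make the parallelism point concrete, here is a minimal sketch of sending concurrent embedding requests to a BEI deployment. It assumes the deployment exposes an OpenAI-compatible `/v1/embeddings` endpoint; the base URL, model name, and the `BASETEN_API_KEY` environment variable are placeholders to replace with your own deployment's values.

```python
# Minimal sketch: concurrent embedding requests against a BEI deployment,
# assuming an OpenAI-compatible endpoint. URL and model name are placeholders.
import asyncio
import os

from openai import AsyncOpenAI

client = AsyncOpenAI(
    api_key=os.environ["BASETEN_API_KEY"],
    # Hypothetical deployment URL; substitute your model's actual endpoint.
    base_url="https://model-xxxxxxxx.api.baseten.co/environments/production/sync/v1",
)


async def embed(texts: list[str]) -> list[list[float]]:
    """Embed one batch of texts in a single request."""
    response = await client.embeddings.create(
        model="my-embedding-model",  # placeholder model name
        input=texts,
    )
    return [item.embedding for item in response.data]


async def main() -> None:
    # Fire many small batches concurrently; the server-side dynamic batching
    # mentioned above is what makes this client-side parallelism pay off.
    batches = [[f"document {i}-{j}" for j in range(8)] for i in range(16)]
    results = await asyncio.gather(*(embed(batch) for batch in batches))
    print(sum(len(r) for r in results), "embeddings computed")


if __name__ == "__main__":
    asyncio.run(main())
```

Because requests are batched dynamically on the server, keeping many small requests in flight (as `asyncio.gather` does here) is how a single client approaches the throughput figures listed above.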