Rime serves speech synthesis API with stellar uptime using Baseten
- <300 milliseconds p99 latency
- 200+ distinct voices
- 100% uptime to date
Company overview
Based in San Francisco, Rime is on a mission to deliver the fastest and most lifelike speech synthesis API on the market. Founded by PhD researchers and linguists, Rime trains its own text-to-speech models with a focus on accurate accents and speech patterns. Enterprises rely on Rime for highly customized and lifelike speech synthesis that seamlessly integrates into their products.
Challenges
Bringing a custom model API to market
Rime’s founding team had trained a best-in-class speech synthesis model from scratch, but they needed a way to bring it to market.
Lily Clifford, Co-Founder and CEO of Rime, and her team decided to build an enterprise-grade developer API for their models. For that, they needed fast, reliable, and secure inference infrastructure that could run their custom-built models seamlessly.
Time to market was crucial for us. We knew we had the best models, but if we couldn’t stand up serving infrastructure quickly, there would be no way to build a business around our AI.
Rime’s early team realized they could get to market faster with a higher-performance, lower-cost inference platform if they didn’t have to build from scratch themselves.
Reducing latency for real-time speech
As developers embraced Rime’s API for its quality, the team realized that they needed to support latency-sensitive real-time use cases. Every millisecond mattered.
Rime needed to offer their enterprise customers p99 latency SLAs below 300 milliseconds. While their highly optimized model easily fit within this budget, network latency constantly threatened the SLA, especially when sending multi-megabyte audio responses.
As a phonetician, I know the huge impact that tiny changes can make on how speech is perceived. For our API to be integrated into real-time conversations, it had to be fast enough to keep up with human speech.
Serving enterprise customers
Rime’s fast, high-quality speech synthesis quickly caught the attention of enterprises, who entrusted Rime with mission-critical use cases.
Every API provider’s dream is to support large-scale users who derive massive business value from the API. But enterprise adoption introduced new dimensions to Rime’s infrastructure needs: large GPU allocations, multi-region compute availability, new compliance measures, and strict uptime SLAs.
We were proving to the market that we have the world’s best text-to-speech models. We just needed to match that with the world’s best infrastructure.
Solution
Distributed model serving infrastructure
Rime partnered with Baseten to bring its custom AI models to market on production-grade infrastructure.
Rime uses a multi-region deployment strategy, ensuring that API calls are routed to a geographically proximate datacenter, which reduces network overhead. Along with smart batching, appropriate separation of CPU and GPU workloads, and optimization of multi-megabyte response payloads, this distributed model serving infrastructure ensures low-latency inference on every API request.
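Rime’s production batching logic isn’t public, but the “smart batching” pattern mentioned above generally looks like the minimal sketch below: requests accumulate until a batch fills or a short timeout expires, so the GPU sees well-sized batches without stalling low-traffic periods. All names and parameters here are illustrative, not Rime’s or Baseten’s actual code.

```python
import asyncio
import time

# Illustrative tuning knobs: batch size trades throughput against the
# queueing delay added to each request.
MAX_BATCH_SIZE = 8
MAX_WAIT_MS = 10

async def batch_worker(queue: asyncio.Queue, run_model):
    """Pull requests off the queue and run them through the model in batches."""
    while True:
        batch = [await queue.get()]  # block until at least one request arrives
        deadline = time.monotonic() + MAX_WAIT_MS / 1000
        while len(batch) < MAX_BATCH_SIZE:
            timeout = deadline - time.monotonic()
            if timeout <= 0:
                break
            try:
                batch.append(await asyncio.wait_for(queue.get(), timeout))
            except asyncio.TimeoutError:
                break
        texts, futures = zip(*batch)
        # One forward pass for the whole batch (run_model is a stand-in for
        # the actual GPU inference call).
        for fut, audio in zip(futures, run_model(list(texts))):
            fut.set_result(audio)

async def synthesize(queue: asyncio.Queue, text: str):
    """Enqueue one request and await its result from the batch worker."""
    fut = asyncio.get_running_loop().create_future()
    await queue.put((text, fut))
    return await fut
```

The timeout bounds worst-case queueing delay, which matters when every millisecond counts against a 300-millisecond p99 budget.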
Baseten’s forward-deployed engineers have worked closely with us throughout the entire process of bringing our API to market. A massive shoutout to the entire team for solving our most pressing infrastructure challenges.
Flexible GPU allocation with scale
Rime started out serving its model on inexpensive NVIDIA T4-backed instances. As usage exploded, the team moved to larger, more powerful NVIDIA A10G GPUs for lower latency and more cost-effective throughput at scale, and is now experimenting with options like NVIDIA L4 and fractional NVIDIA H100 GPUs to handle demand.
Large allocations of on-demand GPUs are hard to come by on cloud service providers. Often, asking for dozens or hundreds of instances requires a multi-year commitment to a specific GPU type in a specific region.
With Baseten, Rime has the flexibility to scale up and down with their choice of hardware and region without multi-year commitments. Now, the Rime team can do short-notice load testing and onboard high-volume users at any time.
It would be hard or impossible to get the GPUs we need right when we need them on the same terms that Baseten gives through any cloud service provider – this flexibility is essential as we scale.
Results
Landed major deals with compliance-sensitive enterprise customers
Since switching their infrastructure to Baseten, Rime has signed significant new customers and partners, including ConverseNow, a leading AI-powered restaurant ordering platform.
Baseten’s SOC 2 Type II certification and HIPAA compliance are essential for supporting security-conscious customers who require those same compliance measures to be in place before they can use any API.
We were able to sidestep potential time-consuming challenges by relying on Baseten’s platform-wide security and compliance posture.
Met all customer latency SLAs
Rime has consistently matched or beat their 300-millisecond p99 latency SLA for enterprise customers with optimized network serving and multi-region model deployments.
Thanks to more powerful GPUs and carefully architected inference code, Rime’s p50 response time is less than half of their p99 SLA, meaning that the typical request is more than twice as fast as advertised.
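For readers less familiar with percentile SLAs, here is a quick sketch of how p50 and p99 are computed from recorded request latencies. The sample data is made up purely for illustration; it does not reflect Rime’s real numbers.

```python
import random

# Toy latency samples in milliseconds (synthetic, for illustration only).
latencies = sorted(random.gauss(120, 30) for _ in range(10_000))

def percentile(sorted_samples, p):
    """Nearest-rank percentile: roughly the value below which p% of samples fall."""
    k = max(0, min(len(sorted_samples) - 1,
                   round(p / 100 * len(sorted_samples)) - 1))
    return sorted_samples[k]

p50 = percentile(latencies, 50)  # median: the "typical" request
p99 = percentile(latencies, 99)  # tail: what the SLA is written against
print(f"p50={p50:.0f} ms, p99={p99:.0f} ms")
```

SLAs are written against p99 rather than the average because tail latency is what a real-time caller actually experiences on a bad request.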
We would have been happy if Baseten had simply matched the latency we were seeing with our previous provider. Instead, they set a new standard for fast infrastructure.
This exceptional speed positions Rime as the go-to choice for real-time speech synthesis.
Exceeded all customer uptime SLAs
With Baseten, Rime has had no trouble exceeding their uptime SLAs for demanding enterprise use cases. Rime has maintained perfect reliability while scaling up their inference workloads to handle massive spikes in usage from new customers.
Additionally, Rime exports their inference metrics from Baseten to Grafana, unifying their data for maximum observability and treating model inference as the critical platform component it is.
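The exact export configuration lives in Baseten’s documentation, but as a hedged illustration, a poller like the one below could pull Prometheus-format metrics and forward inference-related series to any downstream consumer. The URL, auth header, and metric names here are hypothetical, not Baseten’s documented API.

```python
import requests

# Hypothetical endpoint and auth header; consult Baseten's docs for the
# real metrics-export configuration. Metric names are also illustrative.
METRICS_URL = "https://app.baseten.co/metrics"
HEADERS = {"Authorization": "Api-Key YOUR_BASETEN_API_KEY"}

def fetch_inference_metrics():
    """Pull Prometheus-format text and keep only inference-related samples."""
    resp = requests.get(METRICS_URL, headers=HEADERS, timeout=10)
    resp.raise_for_status()
    for line in resp.text.splitlines():
        if line.startswith("#"):  # skip HELP/TYPE comment lines
            continue
        if "inference" in line or "latency" in line:
            yield line  # each line: metric_name{labels} value

for sample in fetch_inference_metrics():
    print(sample)
```

In practice, a team would more likely point a Prometheus scrape job at the export endpoint and let Grafana query Prometheus, rather than polling by hand.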
Rime’s state-of-the-art p99 latency and 100% uptime over 2024 is driven by our shared laser focus on fundamentals, and we’re excited to push the frontier even further with Baseten.
Explore Baseten today
We love partnering with companies developing innovative AI products, offering them the most customizable model deployment with the lowest latency.