Wake scaled-to-zero models
ML models deployed on Baseten can automatically scale to zero when not in use, so that you're not paying for idle GPU time. When a scaled-to-zero model is invoked, it spins up on a new instance to resume handling requests. Spinning up an instance can take a moment; this delay is called the cold start time.
But what if you want your model to scale up from zero before it gets invoked so that the user isn’t waiting those extra seconds for the cold start?
When you use model endpoints in production, certain user actions, like loading a page or typing in a text box, can signal that a model is about to be invoked. In these cases, you can call the new wake endpoint to spin up an instance for a scaled-to-zero model before its first request arrives, so the user isn't waiting through the cold start.
Here’s an example invocation for the wake endpoint:
curl -X POST https://app.baseten.co/models/MODEL_ID/wake \
-H 'Authorization: Api-Key YOUR_API_KEY'
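If you want to fire the wake request from a script or a server-side hook (for example, when a page that uses the model starts loading), the curl call above can be wrapped in a small function that reports whether the request went through. This is a minimal sketch under stated assumptions: the function name `wake_model` and the `BASETEN_API_KEY` environment variable are hypothetical names chosen for illustration, and treating any 2xx HTTP status as "accepted" is an assumption, not documented behavior of the wake endpoint. Because the request carries your API key, a trigger like this would typically run on your own backend rather than in browser code.

```shell
#!/usr/bin/env bash
# Minimal sketch: call the Baseten wake endpoint for one model.
# Assumptions (hypothetical, not from the Baseten docs):
#   - the API key is stored in the BASETEN_API_KEY environment variable
#   - any 2xx HTTP status means the wake request was accepted

wake_model() {
  model_id="$1"
  if [ -z "$model_id" ]; then
    echo "usage: wake_model MODEL_ID" >&2
    return 2
  fi
  if [ -z "${BASETEN_API_KEY:-}" ]; then
    echo "BASETEN_API_KEY is not set" >&2
    return 2
  fi

  # -s hides progress output, -o /dev/null discards the response body,
  # -w prints only the HTTP status code (curl prints 000 if it cannot connect).
  status=$(curl -s -o /dev/null -w '%{http_code}' -X POST \
    "https://app.baseten.co/models/${model_id}/wake" \
    -H "Authorization: Api-Key ${BASETEN_API_KEY}")

  case "$status" in
    2??) echo "wake request accepted for ${model_id} (HTTP ${status})" ;;
    *)   echo "wake request failed for ${model_id} (HTTP ${status})" >&2
         return 1 ;;
  esac
}
```

Usage: run `export BASETEN_API_KEY=YOUR_API_KEY`, then `wake_model MODEL_ID`. Returning a non-zero exit code on failure lets a calling script decide whether to retry or simply let the first real request absorb the cold start.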
You can also trigger a wake manually with the wake button on any scaled-to-zero model in your Baseten workspace.