How to deploy Stable Diffusion using Truss

Stable Diffusion is an open-source image generation model developed by Stability AI. It goes image for image with DALL·E, but unlike DALL·E’s proprietary license, its use is governed by the CreativeML Open RAIL-M license. While this dramatically lowers the cost of using the model, running it still requires some technical aptitude, not to mention a high-end GPU.

I wanted to give Stable Diffusion a try, but didn’t want to send NVIDIA a couple thousand dollars for a shiny new RTX 4090 to run it on. Plus, I wanted my colleagues to be able to generate images too. So I set out to deploy the model on Baseten.

Spoiler alert: it worked. And you can deploy the model for yourself in two clicks from our model library.

If you’re curious about the process of deploying a text-to-image model on Baseten, read on and I’ll walk you through it step by step.

Prerequisites

To deploy Stable Diffusion on Baseten from the terminal, you'll need to do a couple of things to get started:

  • Create a Baseten API key. If you don't have a Baseten account, you can sign up for free and you'll automatically get more than enough free credits to follow this tutorial.

  • Install Truss from PyPI (the install command is shown below).
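
Truss is published on PyPI, so installing it is a single pip command:

pip install --upgrade truss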

We don’t need to download the model weights directly or otherwise use the build instructions from the Stable Diffusion GitHub repository as we’ll get everything we need through the Truss.

Packaging the model

I used Truss, Baseten’s open source package for serving and deploying models, to prepare the model for production.

To create the Truss, I opened up Terminal and ran:

truss init ./stable-diffusion

This created the folder structure that I used to package up the model. I edited two files from inside this folder: config.yaml and model/model.py.
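
The generated scaffold looks roughly like this (the exact files vary a bit by Truss version, but these are the two that matter here):

stable-diffusion/
├── config.yaml
└── model/
    └── model.py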

In the config file, I made a couple of adjustments. First, I added the package dependencies:

requirements:
- diffusers
- transformers
- torch
- scipy
- accelerate
- pillow

Then I configured the Truss to use a GPU. Stable Diffusion needs a GPU for inference, not just training, to generate images.

resources:
  cpu: "3"
  memory: 14Gi
  use_gpu: true
  accelerator: A10G
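
Putting those together, the relevant parts of my config.yaml looked something like this (model_name is just a label of my choosing, and I’ve omitted the other generated defaults):

model_name: stable-diffusion
requirements:
- diffusers
- transformers
- torch
- scipy
- accelerate
- pillow
resources:
  cpu: "3"
  memory: 14Gi
  use_gpu: true
  accelerator: A10G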

Then it was time to work on the model code itself. Truss lets you define your model’s behavior in a model/model.py file, then builds that into a Docker image with a server that hosts your model. Here is the full model/model.py file, which I’ll explain in detail below.

import base64
import os
from io import BytesIO
from typing import Dict, List

import torch
from diffusers import (
    DDIMScheduler,
    DPMSolverMultistepScheduler,
    EulerDiscreteScheduler,
    LMSDiscreteScheduler,
    PNDMScheduler,
    StableDiffusionPipeline,
)
from PIL import Image

torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.benchmark = True


class Model:
    def __init__(self, **kwargs) -> None:
        self._data_dir = kwargs["data_dir"]
        self._config = kwargs["config"]
        self._secrets = kwargs.get("secrets")
        self.model_id = "stabilityai/stable-diffusion-2-1-base"
        self.model = None
        self.schedulers = None

    def load(self):
        self.model = StableDiffusionPipeline.from_pretrained(
            str(self._data_dir),
            revision="fp16",
            torch_dtype=torch.float16,
        ).to("cuda")

        schedulers = [
            ("ddim", DDIMScheduler),
            ("dpm", DPMSolverMultistepScheduler),
            ("euler", EulerDiscreteScheduler),
            ("lms", LMSDiscreteScheduler),
            ("pndm", PNDMScheduler),
        ]

        self.schedulers = self.load_schedulers(schedulers)
        self.model.scheduler = self.schedulers["ddim"]

    def load_schedulers(self, schedulers_to_add):
        schedulers = {}
        for name, scheduler in schedulers_to_add:
            schedulers[name] = scheduler.from_config(self.model.scheduler.config)

        print(schedulers)
        return schedulers

    def preprocess(self, request):
        scheduler = request.pop("scheduler", "ddim")
        self.model.scheduler = self.schedulers[scheduler]

        return request

    def convert_to_b64(self, image: Image) -> str:
        buffered = BytesIO()
        image[0].save(buffered, format="JPEG")
        img_b64 = base64.b64encode(buffered.getvalue()).decode("utf-8")
        return img_b64

    @torch.inference_mode()
    def predict(self, request: Dict) -> Dict[str, List]:
        results = []
        random_seed = int.from_bytes(os.urandom(2), "big")
        # Pop the seed so it isn't also forwarded to the pipeline below
        generator = torch.Generator("cuda").manual_seed(
            request.pop("seed", random_seed)
        )
        try:
            output = self.model(
                **request,
                generator=generator,
                return_dict=False,
            )
            b64_results = self.convert_to_b64(output[0])
            results = results + [b64_results]

        except Exception as exc:
            return {"status": "error", "data": None, "message": str(exc)}

        return {"status": "success", "data": results, "message": None}

Let's break that big chunk of code down.

The init function stores the data directory, config, and secrets that Truss passes in, along with the Hugging Face model ID.

def __init__(self, **kwargs) -> None:
    self._data_dir = kwargs["data_dir"]
    self._config = kwargs["config"]
    self._secrets = kwargs.get("secrets")
    self.model_id = "stabilityai/stable-diffusion-2-1-base"
    self.model = None
    self.schedulers = None

The load function loads the Stable Diffusion pipeline onto the GPU in half precision and sets up the available schedulers, defaulting to DDIM.

def load(self):
    self.model = StableDiffusionPipeline.from_pretrained(
        str(self._data_dir),
        revision="fp16",
        torch_dtype=torch.float16,
    ).to("cuda")

    schedulers = [
        ("ddim", DDIMScheduler),
        ("dpm", DPMSolverMultistepScheduler),
        ("euler", EulerDiscreteScheduler),
        ("lms", LMSDiscreteScheduler),
        ("pndm", PNDMScheduler),
    ]

    self.schedulers = self.load_schedulers(schedulers)
    self.model.scheduler = self.schedulers["ddim"]

A helper function takes the list of generated images, saves the first one as a JPEG, and converts it to a base64 string.

def convert_to_b64(self, image: Image) -> str:
    buffered = BytesIO()
    image[0].save(buffered, format="JPEG")
    img_b64 = base64.b64encode(buffered.getvalue()).decode("utf-8")
    return img_b64

Finally, the predict function actually runs inference on the model. Before it runs, the preprocess function swaps in whichever scheduler the request names (defaulting to ddim). The predict function then passes the request, including the prompt, through the model, runs the resulting image through the base64 helper function, and returns a response with the encoded image. Here's the line that actually invokes the model:

output = self.model(
    **request,
    generator=generator,
    return_dict=False,
)
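
Because predict unpacks the request straight into the pipeline, a request can carry the scheduler name and seed handled above plus any standard StableDiffusionPipeline argument alongside the prompt. For example (num_inference_steps is a standard pipeline argument, not something specific to this Truss):

{"prompt": "A robot making art", "scheduler": "euler", "seed": 42, "num_inference_steps": 30}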

Stable Diffusion uses Hugging Face and PyTorch, which are both supported frameworks on Truss and Baseten. So it only takes a few lines of code to load and run inference on the model in production.

Prompt: a painting of a robot

Serving and deployment

After packaging the model with Truss, I deployed it to Baseten with the Truss CLI, pasting my API key when prompted:

truss push

With the model deployed, I called it from the command line:

truss predict -d '{"prompt": "A robot making art"}'
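
The response mirrors the dictionary that predict returns, something like this:

{"status": "success", "data": ["<long base64-encoded JPEG>"], "message": null}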

That base64 string isn’t much use on its own; I’d rather have it saved as an image file. So I wrote a quick helper script, show.py, to save and display the output.

import base64
import json
import os
import sys

# Read the JSON response piped in from `truss predict`
resp = sys.stdin.read()
image = json.loads(resp)["data"][0]
img = base64.b64decode(image)

# Name the file after the tail of the base64 string and open it (macOS `open` command)
file_name = f'{image[-10:].replace("/", "")}.jpeg'
img_file = open(file_name, "wb")
img_file.write(img)
img_file.close()
os.system(f"open {file_name}")

To use it, I pipe the output of the model call into the script:

truss predict -d '{"prompt": "A robot making art"}' | python show.py

Prompt: a robot making art
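
If you’d rather skip the CLI, you can also call the deployed model over HTTP from Python. Here’s a minimal sketch, assuming Baseten’s standard production endpoint format and that the endpoint returns the same JSON the model produces; the model ID and API key are placeholders for your own deployment’s details:

import base64
import requests

# Placeholder URL: substitute your deployed model's ID and your API key
resp = requests.post(
    "https://model-MODEL_ID.api.baseten.co/production/predict",
    headers={"Authorization": "Api-Key YOUR_API_KEY"},
    json={"prompt": "A robot making art"},
)

# Pull the base64 image out of the response and save it to disk
b64_image = resp.json()["data"][0]
with open("robot.jpeg", "wb") as f:
    f.write(base64.b64decode(b64_image))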