Build your own open-source ChatGPT with Llama 2 and Chainlit
Llama 2 is a recent open-source large language model (LLM) that is competitive on output quality with GPT-3.5, the closed-source LLM that powers ChatGPT. AI-powered products like ChatGPT require high model quality, but model quality alone isn’t enough to make a compelling user experience.
Chainlit is an open-source tool for creating a ChatGPT-style interface on top of any LLM. We’ll use it to get a chat UI out of the box with support for streaming model output, prompt history, context-aware chat, chat restart, and many other essential features.
This tutorial takes you through the Baseten + Llama 2 Chainlit cookbook to build a ChatGPT-style interface for your favorite open-source LLMs, like Llama 2.
What makes ChatGPT great?
It took ChatGPT two months to hit 100 million users. What makes it such a compelling user experience, and how can we match that with open-source tools?
Quality output: GPT-3.5 is a very capable model that creates coherent output for a variety of topics. As of Llama 2, we now have an open-source model that performs comparably across most benchmarks.
Conversation in context: ChatGPT goes beyond a question-answer loop to pass the message history into each model call, making chat feel like a natural conversation. We can do the same thanks to Llama 2’s 4096-token context window (which matches GPT-3.5 base).
Streaming output: ChatGPT feels fast because you get the first word of the result as soon as it’s generated rather than waiting for the full response. Baseten also supports streaming model output.
Prompt history: The ChatGPT web UI includes a number of quality-of-life features like prompt history, which we’ll get out of the box from Chainlit.
Plus, our open-source version adds the security and privacy of running a private instance of the model on SOC 2 Type II certified and HIPAA compliant infrastructure. And you can use whatever open-source LLM best matches your specific use case.
Building an open-source ChatGPT
With Llama 2 on Baseten as the backend and Chainlit as the frontend, let’s build an open-source ChatGPT.
To follow along, download the Chainlit cookbook from GitHub.
Get approved for Llama 2 access
Llama 2 currently requires approval for access after accepting Meta’s license for the model. To request access to Llama 2:
Go to https://ai.meta.com/resources/models-and-libraries/llama-downloads/ and request access using the email associated with your HuggingFace account.
Go to https://huggingface.co/meta-llama/Llama-2-7b and request access.
Once you have access:
Create a HuggingFace access token
Set it as a secret in your Baseten account with the name hf_access_token
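If you want to confirm that your token was granted access before deploying, here’s an optional sketch (not part of the cookbook) using the huggingface_hub client; the call raises a gated-repo error if your account hasn’t been approved yet:

from huggingface_hub import model_info

# Hypothetical pre-flight check, not part of the cookbook: raises an
# error if your token doesn't have access to the gated Llama 2 repo.
model_info("meta-llama/Llama-2-7b", token="hf_your_token_here")
print("Llama 2 access confirmed")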
Deploy Llama-2-chat
After you have the required access and have set the access token secret in your Baseten account, you can deploy Llama-2-chat 7B with just a couple of clicks from the Baseten model library.
You can also deploy the 13B or 70B versions of the model from GitHub. These models run on A100 GPUs, so inference is more expensive than with the 7B variant, which runs on an A10.
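Once the model is deployed, it’s worth smoke-testing the endpoint before wiring up a UI. Here’s a minimal sketch that calls the same endpoint and auth header the cookbook uses later in this tutorial (the version ID and API key are placeholders from your Baseten dashboard):

import requests

# Placeholders: substitute your own model version ID and Baseten API key
resp = requests.post(
    "https://app.baseten.co/model_versions/YOUR_VERSION_ID/predict",
    headers={"Authorization": "Api-Key YOUR_API_KEY"},
    json={"prompt": "What do llamas eat?"},
)
print(resp.json())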
Set up Chainlit
To follow this step, download the Chainlit cookbook from GitHub and navigate to the baseten-llama-2-chat directory in your terminal.
First, install the latest version of Chainlit:
pip install --upgrade chainlit
Open the file .env.example and do the following:
Set BASETEN_API_KEY to an API key for your Baseten account
Set VERSION_ID to the model version ID listed for Llama 2 on your model dashboard in your Baseten account
Rename .env.example to .env
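With placeholder values filled in, the finished .env might look something like this:

BASETEN_API_KEY=your_baseten_api_key
VERSION_ID=your_model_version_id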
With that configuration set, you’re ready to run the cookbook:
chainlit run app.py -w
This will open the Chainlit UI on localhost, where you can start chatting with Llama 2. The -w flag watches app.py for changes and automatically reloads the app as you edit it.
Understanding key details
Rather than go through the whole app.py file line by line, let’s focus on three key details that make this chatbot work. You can see the full file on GitHub.
Use conversation history and context
Context is essential for the chatbot. Before sending the prompt to the model, prepend the prompt history from the Chainlit user session.
prompt_history = cl.user_session.get("prompt_history")
prompt = f"{prompt_history}{message}"
After the model response is generated and shown to the user, both the message and response are added to the prompt history and saved to the Chainlit user session.
prompt_history += message + response
cl.user_session.set("prompt_history", prompt_history)
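One detail these snippets assume: prompt_history must already exist in the session when the first message arrives. A minimal sketch of seeding it in a Chainlit on_chat_start handler (the empty-string default is our assumption; the cookbook may seed a system prompt instead):

import chainlit as cl

@cl.on_chat_start
async def start():
    # Assumption: start each session with an empty history; the cookbook
    # may instead seed this with a system prompt for Llama-2-chat.
    cl.user_session.set("prompt_history", "")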
Stream Llama 2 output
We’re going to send a POST request to the model endpoint with the argument stream=True to get streaming output, then iterate over that response later in the code. Note also the max_new_tokens argument, which is used to make sure we take advantage of the full context window in Llama 2.
import json
import requests

s = requests.Session()
with s.post(
    f"https://app.baseten.co/model_versions/{version_id}/predict",
    headers={
        "Authorization": f"Api-Key {baseten_api_key}"
    },
    data=json.dumps({
        "prompt": prompt, # Prompt includes conversation history
        "stream": True, # Stream results a token at a time
        "max_new_tokens": 4096 # Use full context window
    }),
    stream=True, # Stream the HTTP response so tokens arrive as generated
) as resp:
    # Process response here
Strip unneeded characters
When Llama 2 generates a response, the first seven characters are [/INST], the tag that closes the user’s instruction in Llama 2’s chat format, and we want to strip them from the output. That’s a bit tricky with streaming: you can’t just parse the full output before showing it to the user.
Instead, in the response block, fill a buffer until it contains those first seven characters, then stream the remaining output to the user.
buffer = ""
response = ""
start_response = False
for token in resp.iter_content(1):
    token = token.decode('utf-8')
    buffer += token
    if not start_response:
        # Keep buffering until the leading [/INST] tag has fully arrived
        if "[/INST]" in buffer:
            start_response = True
    else:
        response += token
        await ui_msg.stream_token(token)
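Putting it all together, here’s a consolidated sketch of the handler (simplified, not the cookbook file verbatim; it assumes a Chainlit version where on_message receives a cl.Message and that the .env values are available as environment variables):

import json
import os

import chainlit as cl
import requests

baseten_api_key = os.environ["BASETEN_API_KEY"]
version_id = os.environ["VERSION_ID"]

@cl.on_chat_start
async def start():
    # Seed an empty history for the new session
    cl.user_session.set("prompt_history", "")

@cl.on_message
async def main(message: cl.Message):
    # Prepend conversation history to the new message
    prompt_history = cl.user_session.get("prompt_history")
    prompt = f"{prompt_history}{message.content}"

    ui_msg = cl.Message(content="")
    response = ""

    s = requests.Session()
    with s.post(
        f"https://app.baseten.co/model_versions/{version_id}/predict",
        headers={"Authorization": f"Api-Key {baseten_api_key}"},
        data=json.dumps({"prompt": prompt, "stream": True, "max_new_tokens": 4096}),
        stream=True,
    ) as resp:
        # Buffer past the leading [/INST] tag, then stream tokens to the UI
        buffer = ""
        start_response = False
        for token in resp.iter_content(1):
            token = token.decode("utf-8")
            buffer += token
            if not start_response:
                if "[/INST]" in buffer:
                    start_response = True
            else:
                response += token
                await ui_msg.stream_token(token)

    await ui_msg.send()
    # Save the updated history for the next turn
    cl.user_session.set("prompt_history", prompt_history + message.content + response)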
Using Llama 2
Let’s see how our all-open-source chatbot performs. Remember, we’re using the scaled-down 7B version.
A useful result
Llama 2 is capable of expanding on the details of one option from a list:
And a hilarious failure
Like all LLMs, Llama 2 has issues. Apparently, Neptune and Uranus are not real planets:
On its own, that’s just a series of hallucinations and contradictions. What makes it hilarious is the next chat, where Llama pretends to hallucinate then … changes its mind?
Other Llama 2 projects
There’s a lot that you can build with Llama 2! If you want to work more with the model’s context window, try our tutorial for building a chatbot with LangChain. And if you want to evaluate the most powerful version of the model, you can get a packaged version of Llama 2 70B on GitHub that’s ready to deploy on Baseten.