Getting started with foundation models

TL;DR

This post will help you develop a high-level understanding of foundation models, with a focus on the importance of training data and model adaptation for downstream tasks. If you’re interested in deploying open source foundation models, check out our Model Library. You can also download, customize, and deploy foundation models by using Truss, our open source model packaging framework. A link to each of the models available on Baseten has been provided at the end of this post.

For many of us, our first exposure to foundation models is through an app or service such as Lensa or ChatGPT. Lensa is built on Stable Diffusion, an open source foundation model from Stability AI (developed in collaboration with LMU Munich and Runway) that generates photorealistic images from text input, while ChatGPT is built on OpenAI's generative pre-trained transformer (GPT)-3.5 model. While incredibly popular, Lensa and ChatGPT represent a mere fraction of the apps and services built on top of foundation models in the last few years.

To add to the list, several Baseten engineers created ChatLLaMA, an open source ChatGPT alternative that uses the 7-billion-parameter variant of the LLaMA model, fine-tuned on the Alpaca dataset. In this post we'll cover the basics of foundation models, using Meta's LLaMA family of models as an example.

ChatLLaMA is a great example of using a web app to interact with a foundation model

What makes a model a foundation model?

Foundation models are a class of models that meet two general criteria:

  • Foundation models are trained on a dataset that is both broad in scope and massive in size

  • Foundation models can be further adapted to a wide variety of downstream applications

Some examples of foundation models include Stable Diffusion, which allows you to generate original images from text prompts, Whisper, which transcribes audio files across multiple languages, and LLaMA, a text-generating language model.

Traditional machine learning models tend to do one thing and do it well. You wouldn't use a regression model (a non-foundation model most often used to describe the relationship between dependent and independent variables) for object detection. A foundation model, by contrast, can be adapted to a multitude of downstream tasks beyond what it was originally trained for, using methods of adaptation such as in-context learning and fine-tuning.

Training a foundation model utilizes incredible amounts of data

We cannot overstate the importance of data when it comes to training foundation models. In traditional machine learning, we train models ourselves, either locally or using cloud computing resources such as GPUs or TPUs. Our data can be labeled or unlabeled, and our training set usually consists of a single type of data. Furthermore, the size of the dataset used to train a traditional machine learning model is manageable from both a cost and computing perspective.

Foundation models, on the other hand, use significantly larger datasets that often combine structured, text, speech, and image data into one comprehensive training corpus. While foundation models are also trained on GPUs and TPUs, it would be incredibly expensive for you or me to train one. Instead, foundation models are trained by organizations such as OpenAI, Google, and a variety of universities.

A note on machine learning methods

  • At a very high level, supervised learning is used for classification and regression problems and relies on labeled data for both input and output. Labeling can range from tagging images of animals with the correct species name to providing in-depth metadata for a medical image. Because labeling requires some level of human intervention (hence the name “supervised”), there is a practical limit on how much data can be labeled, in terms of both time and cost.

  • Unsupervised learning is used to find patterns and underlying structure in a dataset and relies on unlabeled data. Because there is no human in the loop for data labeling, this method is referred to as “unsupervised”.

  • Semi-supervised learning combines both supervised and unsupervised learning methodologies by using a large, unlabeled dataset and a much smaller, labeled dataset.

Because foundation models leverage unsupervised and semi-supervised learning methods, they're able to take advantage of these enormous datasets without incurring the cost of labeling them. The sketch below makes the supervised/unsupervised distinction concrete.
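As a minimal illustration using scikit-learn's toy iris dataset, the supervised classifier below needs the label vector y for training, while the unsupervised clustering model works from the features alone:

```python
# Contrast supervised vs. unsupervised learning on scikit-learn's
# toy iris dataset (a minimal illustrative sketch).
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)  # X: features, y: human-provided labels

# Supervised: learns a mapping from features to labels, so every
# training example needs a (costly) human-provided label in y.
clf = LogisticRegression(max_iter=1000).fit(X, y)
print(clf.predict(X[:3]))  # predicted species for the first three flowers

# Unsupervised: no labels at all; the model looks for structure
# (here, three clusters) in the raw features alone.
km = KMeans(n_clusters=3, n_init=10).fit(X)
print(km.labels_[:3])  # cluster assignments discovered without labels
```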

To highlight the size and scope of the data used to train a foundation model, let’s look at the dataset used to create Meta’s LLaMA model:

The disk size and sampling percentage of the datasets used to train Meta's LLaMA model:

| Dataset | Sampling proportion | Disk size |
| --- | --- | --- |
| CommonCrawl | 67.0% | 3.3 TB |
| C4 | 15.0% | 783 GB |
| GitHub | 4.5% | 328 GB |
| Wikipedia | 4.5% | 83 GB |
| Books | 4.5% | 85 GB |
| ArXiv | 2.5% | 92 GB |
| StackExchange | 2.0% | 78 GB |

Even after deduplication and tokenization of the dataset, training the 65-billion-parameter LLaMA model on 2,048 A100 GPUs with 80GB of RAM each took researchers 21 days. By some quick back-of-the-envelope math, repeating that training just once using commercially available A100 GPUs would cost well into the millions of dollars.
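At an assumed on-demand rate of roughly $2 per A100 GPU-hour (an illustrative figure, not a quoted price), the arithmetic looks like this:

```python
# Back-of-the-envelope cost of one LLaMA-65B training run.
gpus = 2048
days = 21
usd_per_gpu_hour = 2.00  # assumed, illustrative rate; real prices vary

gpu_hours = gpus * days * 24              # ~1.03 million GPU-hours
estimated_cost = gpu_hours * usd_per_gpu_hour
print(f"{gpu_hours:,} GPU-hours -> ~${estimated_cost:,.0f}")
# 1,032,192 GPU-hours -> ~$2,064,384
```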

Interested in learning more about GPU architecture and how to choose the right GPU for your workload? Baseten technical writer Philip Kiely has a great series up on our blog! 

Because the cost of developing a foundation model from scratch is prohibitive, as developers we typically interact with foundation models in their pre-trained state, meaning the model has been trained by an organization and released in some capacity to the public. We can then adapt a pre-trained foundation model through various methods, including training on our own, much smaller, datasets.

There are multiple methods for adapting foundation models for downstream applications 

In the case of ChatLLaMA, the 7-billion-parameter variant of the LLaMA model was fine-tuned on the Alpaca dataset. LLaMA itself was trained on publicly available data from the web, gathered from sites such as GitHub, Wikipedia, and ArXiv, while Alpaca is specifically an instruction-following dataset. By fine-tuning LLaMA on Alpaca, our engineers at Baseten essentially “updated” the base model to be adept at answering questions and following instructions.

Fine-tuning and in-context learning are both in-depth model adaptation methods that deserve their own posts. At a very high level, fine-tuning tends to be the more computationally expensive method: it uses transfer learning to update some or all of the weights of a pre-trained model using a new dataset.
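To make that concrete, here's a heavily compressed sketch of what a fine-tuning run might look like with the Hugging Face transformers library. The checkpoint name is an illustrative community upload, and the real ChatLLaMA work involved far more preprocessing and infrastructure than this:

```python
# A compressed, illustrative fine-tuning sketch using Hugging Face
# transformers; checkpoint and hyperparameters are assumptions.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling,
                          Trainer, TrainingArguments)

base = "decapoda-research/llama-7b-hf"  # illustrative checkpoint name
tokenizer = AutoTokenizer.from_pretrained(base)
tokenizer.pad_token = tokenizer.eos_token  # LLaMA has no pad token by default
model = AutoModelForCausalLM.from_pretrained(base)

# An instruction-following dataset in the spirit of Alpaca.
data = load_dataset("tatsu-lab/alpaca", split="train")
data = data.map(lambda ex: tokenizer(ex["text"], truncation=True, max_length=512))

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="llama-alpaca",
                           per_device_train_batch_size=1,
                           num_train_epochs=3,
                           learning_rate=2e-5),
    train_dataset=data,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()  # updates the pre-trained weights on the new dataset
```

Because some or all of the 7 billion weights get updated, a run like this needs one or more large GPUs, which is why fine-tuning is the costlier of the two adaptation methods.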

In contrast, in-context learning is a prompt-based method of adaptation: example input/output pairs are included in the prompt itself, and when given a test input, the model returns an output consistent with the examples it's been shown. No model weights are updated with in-context learning. Of particular interest is the fact that we don't quite understand how in-context learning works, only that it's an emergent capability of foundation models, in which a model performs tasks it was not explicitly trained to do.
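Here's a minimal sketch of the idea, using a small stand-in model via the transformers pipeline API (strong in-context learning only emerges at much larger scales, so don't expect much from gpt2 itself). The examples live entirely in the prompt, and no weights change:

```python
# In-context (few-shot) learning: input/output pairs go into the
# prompt itself; no model weights are updated.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")  # small stand-in model

prompt = (
    "Translate English to French.\n"
    "sea otter -> loutre de mer\n"
    "cheese -> fromage\n"
    "peppermint -> "
)
# The model is expected to continue the pattern from the examples alone.
print(generator(prompt, max_new_tokens=5)[0]["generated_text"])
```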

Download, customize, and deploy foundation models with Baseten

Now that we’ve covered the basics of foundation models, we’d love for you to download, customize, and deploy foundation models using Truss, our open source model packaging framework. And if you’re interested in fine-tuning, please reach out to us on Twitter, as we have some early capabilities in the works.

Foundation models available 

Text generation

Speech recognition

Image generation

Text-to-speech

Image classification

You can also deploy the above models to Baseten directly through our Model Library, and check out the accompanying documentation.