The best open source large language model

Large language models (LLMs) are the definitive category in generative AI. But with tens of thousands of options, it can be hard to feel confident about making the right tradeoffs between output quality, speed, and cost — especially when models specialize in different tasks.

Drawing on technical specifications, customer conversations, and our own testing, we’ve put together this list to help you find the right starting point among open source text generation models for chat, code completion, retrieval-augmented generation, and other LLM use cases.

Best overall open source LLM: Llama 3.1 70B Instruct

Meta's latest LLM family, Llama 3.1, offers instruct-tuned models at 8B, 70B, and 405B parameters with excellent benchmark performance. The mid-sized model, Llama 3.1 70B Instruct, compares favorably to GPT-3.5, Gemini Pro 1.5, and Claude 3 Sonnet. The models were trained on over 15 trillion tokens, with an emphasis on code, and have a knowledge cutoff of December 2023.

What we love about Llama 3.1 70B Instruct:

  • 128k-token context window with excellent retrieval benchmarks for building RAG-type applications.

  • Along with Meta's systematic investments in safety, Llama 3.1 models have been instruct-tuned to reduce false refusal rates.

  • Strong code generation and mathematical reasoning capabilities in a general model.

  • New, more efficient tokenizer encodes the same text in up to 15% fewer tokens, reducing both latency and cost per request.
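The long context window makes simple retrieval-augmented generation pipelines practical: retrieve the most relevant chunks of your documents and pack as many as fit into the prompt. Here’s a minimal sketch of that idea; it uses naive keyword-overlap retrieval and a word count as a rough proxy for tokens, where a real pipeline would use embeddings and the model’s actual tokenizer.

```python
def retrieve(query: str, chunks: list[str], top_k: int = 3) -> list[str]:
    """Rank chunks by naive keyword overlap with the query."""
    q_words = set(query.lower().split())
    scored = sorted(
        chunks,
        key=lambda c: len(q_words & set(c.lower().split())),
        reverse=True,
    )
    return scored[:top_k]


def build_prompt(query: str, chunks: list[str], budget_words: int = 90_000) -> str:
    """Pack retrieved chunks into the prompt, stopping at a rough word budget."""
    context, used = [], 0
    for chunk in retrieve(query, chunks):
        words = len(chunk.split())
        if used + words > budget_words:
            break
        context.append(chunk)
        used += words
    return (
        "Answer using only the context below.\n\n"
        + "\n---\n".join(context)
        + f"\n\nQuestion: {query}"
    )


docs = [
    "Llama 3.1 supports a 128k-token context window.",
    "Phi 3 Mini has 3.8 billion parameters.",
    "Mixtral uses a Mixture of Experts architecture.",
]
print(build_prompt("What is the context window of Llama 3.1?", docs))
```

The resulting string is what you’d send as the user message; with a 128k-token window, the word budget can be far larger than smaller models allow.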

What to watch out for with Llama 3.1 70B Instruct:

  • Llama 3.1 70B only supports eight languages, while many models support 2-3x as many.

  • Llama 3.1 models have a custom commercial license that also applies to any fine-tuned derivatives.

Get started with Llama 3.1 70B Instruct or try the smaller but still excellent Llama 3.1 8B Instruct.

The best big LLM: Llama 3.1 405B Instruct

Llama 3.1 405B is an open source model that truly rivals heavyweights like GPT-4o. While other large models like Mistral Large 2 and Cohere Command R+ are also extremely powerful, Llama 3.1 405B is licensed for commercial use with restrictions that few startups or enterprises would run up against.

What we love about Llama 3.1 405B:

  • Benchmarks favorably against the best closed-source models and backs up those scores with excellent observed real-world performance.

  • Massive 128k-token context window for retrieval-augmented generation and tool use.

  • Built-in function calling with specialized tool and function tokens.
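Function calling is usually exposed through an OpenAI-compatible chat API. Below is a hedged sketch of what a request payload looks like; the model id and the `get_weather` tool schema are illustrative placeholders, not any specific provider’s API.

```python
import json

# Illustrative tool schema in the OpenAI-compatible format most
# Llama 3.1 serving stacks accept; the function itself is hypothetical.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

request = {
    "model": "llama-3.1-405b-instruct",  # placeholder model id
    "messages": [{"role": "user", "content": "What's the weather in Paris?"}],
    "tools": tools,
    "tool_choice": "auto",  # let the model decide whether to call the tool
}
print(json.dumps(request, indent=2))
```

When the model decides to call a tool, the response contains the function name and arguments as JSON, which your application executes before sending the result back in a follow-up message.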

What to watch out for with Llama 3.1 405B:

  • The model is so large that it generally must be run at FP8 on H100 GPUs.

  • Inference is expensive even with optimizations; for many use cases a less powerful model like Llama 3.1 70B will suffice at a lower cost.

  • Llama 3.1 models have a custom commercial license that also applies to any fine-tuned derivatives.

Contact us for access to Llama 3.1 405B.

Best small LLM under 7 billion parameters: Phi 3 Mini

On the opposite end of the spectrum, Phi 3 Mini is an open source instruct-tuned LLM from Microsoft that achieves state-of-the-art performance for its size at just 3.8 billion parameters. Phi 3 Mini runs fast on inexpensive hardware, making it a strong option for low-cost inference.

What we love about Phi 3 Mini:

  • Excellent output quality rivals 7B LLMs from just a few months ago.

  • 128k-token context window variant allows for unprecedented use cases for models of this size class.

  • Permissive MIT license for unrestricted commercial use.

What to watch out for with Phi 3 Mini models:

  • While the LLM is outstanding for its class, output quality falls behind larger models, especially for factual recall.

  • The 4k-token context window variant consistently scores slightly higher on evals; only use the 128k-token variant when the increased context window is strictly necessary.

  • Phi 3 is an English-only model.

Deploy Phi 3 Mini 4k (or the 128k variant) on a T4 GPU.

Another great LLM family: Mistral and Mixtral

Mistral AI is a foundation model lab founded in France that builds both open source and proprietary language models. Their open source models include three sizes of base and instruct-tuned foundation LLMs as well as vision models and domain-specific models for math and code.

What we love about Mistral models:

  • Apache 2.0 licenses on the open models (Mistral 7B, Mixtral 8x7B, and Mixtral 8x22B) permit unrestricted commercial use.

  • Mixtral's Mixture of Experts architecture activates only a fraction of its parameters per token, delivering strong output quality at lower inference cost.

What to watch out for with Mistral models:

  • Batching model requests reduces efficiency gains from Mixture of Experts architecture for 8x7B and 8x22B models.

  • Newer models like Mistral Large are not licensed for commercial use.

  • Light-touch alignment may not be suitable for all use cases.

Choose a model from the Mistral family: 7B, 8x7B, 8x22B, or Pixtral!

Best ML model for code generation: Code Llama

Code Llama is a project by Meta to fine-tune their Llama 2 family of models to specialize in code generation tasks. The Code Llama family has twelve models, as the model is available in four sizes (7B, 13B, 34B, and the new 70B) across three variants (Base, Instruct, and Python).

The models were trained on over 500 billion tokens of code, with additional specialized training variant-by-variant (e.g. the Python variant was trained on another 100 billion tokens of Python code). The largest models (34B and 70B parameters) outperform GPT-3.5 on evaluation benchmarks targeted at code generation.

What we love about Code Llama:

  • 70B Instruct variant scores 67.8 on HumanEval (pass@1) vs 67.0 for GPT-4.

  • Four sizes (7B, 13B, 34B, and 70B) and three variants (Base, Instruct, and Python) for maximum flexibility.

  • 7B and 13B sizes are lower latency than 34B and have built-in code completion capabilities.

  • Large context window (up to 100K tokens) is essential for working with code as context (code is much more token-dense than natural language).
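Code completion in the 7B and 13B models works via fill-in-the-middle prompting with special sentinel tokens: the model sees the code before and after the cursor and generates the missing span. Here’s a sketch of how that prompt is assembled; the token spelling follows the published Code Llama infilling format, but check your serving stack’s documentation before relying on it.

```python
def infill_prompt(prefix: str, suffix: str) -> str:
    """Build a fill-in-the-middle prompt: the model generates the code
    that belongs between the prefix and suffix, after the <MID> sentinel."""
    return f"<PRE> {prefix} <SUF>{suffix} <MID>"


# Code before and after the cursor in an editor.
prefix = 'def remove_non_ascii(s: str) -> str:\n    """'
suffix = "\n    return result"
print(infill_prompt(prefix, suffix))
```

Generation stops when the model emits its end-of-infill token, and your editor integration splices the completion back between the prefix and suffix.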

What to watch out for:

  • The most powerful 34B and 70B models — the only ones to surpass GPT on benchmarks — do not have code completion capabilities out of the box.

  • Only the Instruct variant is capable of responding to natural language prompts; the other two variants are code completion models.

  • Llama models have a special license that also applies to Code Llama models.

Try Code Llama 7B Instruct for chat-based coding.

Best model for fine-tuning: Llama 3.1

The Llama 3.1 family of LLMs offers the most flexibility for fine-tuning projects across size (8B, 70B, 405B) and variant (base and instruct). Given Llama 3.1 models’ strong base performance, any model from the family is a powerful foundation to build on.

What we love about Llama 3.1 for fine tuning:

  • Base models in three different sizes (8B, 70B, 405B) let you make tradeoffs between cost and performance.

  • New Llama 3.1 license explicitly allows for derivatives and teacher models.

  • Llama models have a history as popular choices for fine tuning work, so there’s plenty of research and tooling to build on.

What to watch out for:

  • Heavy-handed alignment in the instruct/chat variants of the models means you may need to start from scratch from the base variant.

  • Llama 3.1 models have a special license that also applies to fine tuned variants.
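If you do fine-tune the instruct variants rather than starting from base, your training examples need to match the model’s chat template. Here’s a sketch of the Llama 3.1 format; in practice you’d use the tokenizer’s `apply_chat_template` method rather than hand-rolling strings like this.

```python
def to_llama3_chat(prompt: str, response: str) -> str:
    """Format one training example with Llama 3.1's chat special tokens."""
    return (
        "<|begin_of_text|>"
        "<|start_header_id|>user<|end_header_id|>\n\n"
        f"{prompt}<|eot_id|>"
        "<|start_header_id|>assistant<|end_header_id|>\n\n"
        f"{response}<|eot_id|>"
    )


example = to_llama3_chat(
    "What is Llama 3.1?",
    "Llama 3.1 is an open source LLM family from Meta.",
)
print(example)
```

Mismatched templates are one of the most common causes of degraded output after fine-tuning, so it’s worth verifying the formatted examples against the model card before training.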

Experiment with Llama 3.1 8B Instruct on autoscaling infrastructure.

Can open source models replace OpenAI and ChatGPT?

Yes. Llama 3.1 405B compares favorably to GPT-4o on most benchmarks.

Newer open source LLMs like Llama 3.1 70B Instruct compare favorably to closed-source options like GPT-3.5, Gemini Pro 1.5, and Claude 3 Sonnet. And with fine-tuning, open source models can match or beat the best closed-source models on specific tasks at much lower cost.

How much should I trust model evaluation benchmarks?

Model evaluation benchmarks measure an LLM’s performance on a fixed set of tasks. These benchmarks are designed to assess the accuracy and quality of the model’s output. While there is no universal standard benchmark, there are a number of popular options including ARC, HellaSwag, and MMLU.

There are some worries about the usefulness of evaluation benchmarks. Generally, evaluation benchmarks could be too narrow to fully capture a model’s performance, and more recently there have been concerns about evaluation sets leaking into models’ training data. These problems have solutions. It’s standard practice to look at a model’s average performance across several benchmarks to account for the limitations of any one benchmark, and researchers check for contamination of their training data before releasing models.
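Averaging across benchmarks is simple to do yourself when comparing candidates. The scores below are hypothetical placeholders for two unnamed models, not real results:

```python
# Hypothetical scores for two candidate models across three benchmarks.
scores = {
    "model-a": {"ARC": 71.2, "HellaSwag": 85.0, "MMLU": 68.4},
    "model-b": {"ARC": 69.8, "HellaSwag": 87.1, "MMLU": 70.2},
}

# Average each model's performance to smooth out any one benchmark's quirks.
averages = {model: sum(s.values()) / len(s) for model, s in scores.items()}
best = max(averages, key=averages.get)

for model, avg in averages.items():
    print(f"{model}: {avg:.1f}")
```

A single-number average still hides per-task differences, so it’s best treated as a shortlist filter before running your own evals.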

Benchmark performance is a solid signal when picking an LLM, but it isn’t the whole story. There’s no need to switch models every time a new variant comes out with a slight uptick in benchmark score; the most important thing is to evaluate model output against your exact use case.

What about domain-specific language models?

In many cases, you can get better domain-specific performance for LLMs by fine-tuning open-source models or building custom models. Domain-specific models can rival frontier models on specific topics, like medicine or finance, with a fraction of the parameters, leading to more cost-efficient inference. For example, Writer built custom domain-specific LLMs for medicine and finance and optimized them for fast inference with Baseten.

The best open source LLM

There’s no one best open source LLM, only the LLM that’s best for you. The right choice depends on capabilities, features, price point, and license. New models are released every day, and it can feel overwhelming to keep up. But finding the right model for your use case is possible with a bit of guidance and experimentation.

Deploy the best open source LLM for your use case in just a couple of clicks: