Nomic Embed v1.5
SOTA text embedding model with variable dimensionality — outperforms OpenAI text-embedding-ada-002 and text-embedding-3-small models.
Deploy Nomic Embed v1.5 behind an API endpoint in seconds.
Nomic Embed v1.5 is a best-in-class open-source text embedding model that outperforms OpenAI's text-embedding-ada-002 and text-embedding-3-small models on the MTEB average benchmark. Nomic Embed v1.5 has three notable properties beyond its benchmark performance:
You can adjust the model's dimensionality to trade off between cost and accuracy.
You can create task-optimized embeddings for retrieval, search, clustering, or classification.
Embeddings are normalized during inference to a length of 1, so you can use cosine similarity just like with OpenAI embeddings.
Nomic Embed v1.5 has a context length of 8,192 tokens and is a fully open model with weights, data, and training code available under the Apache 2.0 license.
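For a quick look at the open weights in action, here's a minimal sketch using the sentence-transformers library (the model ID and trust_remote_code flag follow Nomic's published model card; a production deployment would typically call the model behind an API endpoint instead):

from sentence_transformers import SentenceTransformer

# trust_remote_code is needed because the model uses a custom architecture
model = SentenceTransformer("nomic-ai/nomic-embed-text-v1.5", trust_remote_code=True)

# Task prefixes are prepended to each input (see "Task-specific embeddings" below)
docs = ["search_document: The quick brown fox jumps over the lazy dog."]
embeddings = model.encode(docs)
print(embeddings.shape)  # (1, 768)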
Adjustable dimensionality
Operating a text embedding model in production generally comes with three sets of costs:
Generating embeddings from an initial corpus in a batch inference job.
Storing your embeddings in a vector database.
Generating new embedding vectors on the fly in response to queries and comparing them to existing embedding vectors in the database.
All three costs grow as the dimensionality (that is, the length) of the output vector increases, especially the second one, storage, since longer vectors take up more bytes in your database.
Nomic Embed v1.5 already has half the maximum dimensionality of text-embedding-3-small while offering similar benchmark performance. This means you're using half the space in your vector DB to store the embeddings.
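For a rough sense of what that saves, here's a back-of-the-envelope sketch, assuming float32 vectors and ignoring index overhead (the corpus size is illustrative):

# float32 = 4 bytes per dimension
num_vectors = 10_000_000
for dims in (1536, 768):  # text-embedding-3-small vs. Nomic Embed v1.5
    gb = num_vectors * dims * 4 / 1e9
    print(f"{dims} dims: {gb:.1f} GB")
# 1536 dims: 61.4 GB
# 768 dims: 30.7 GB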
But Nomic Embed v1.5 takes things even further with an optional dimensionality argument. Like other state-of-the-art text embedding models, it lets you adjust the dimensionality of your output vectors, from the default of 768 down to as low as 64 dimensions.
Of course, reducing the dimensionality reduces the amount of information encoded in the output vector. But the drop in performance is not linear. Thanks to a technique called Matryoshka Representation Learning, information is encoded more densely in the early part of the vector than the later parts, so reducing dimensionality has a lesser impact on results quality. Learn more about Matryoshka Representation Learning and Nomic Embed v1.5’s adjustable dimensionality in Nomic’s model writeup.
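To make the truncation concrete, here's a minimal client-side sketch of the Matryoshka trick: keep the leading dimensions of a full-length embedding, then re-normalize to unit length (the random vector is a stand-in for a real embedding; the hosted model does this server-side via the dimensionality argument):

import numpy as np

def truncate_embedding(vec, dims):
    # Keep the first `dims` dimensions, then re-normalize to unit length
    truncated = vec[:dims]
    return truncated / np.linalg.norm(truncated)

full = np.random.randn(768)         # stand-in for a 768-dim embedding
full = full / np.linalg.norm(full)  # normalized, as the model outputs
print(truncate_embedding(full, 256).shape)  # (256,)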
Task-specific embeddings
Nomic Embed v1.5 takes an optional task_type parameter, which adjusts embeddings to perform better for specific tasks. The task type is prepended to each input string, as shown in the sketch after the list below.
Available task types are:
search_document (default): optimize document-based search for RAG.
search_query: optimize for semantic search.
clustering: find similarities within a dataset.
classification: define categories and apply them to queries.
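Here's a minimal sketch of that prefixing (with_task_prefix is a hypothetical helper; when you pass task_type to the deployed model, the prefix is applied for you):

TASK_TYPES = {"search_document", "search_query", "clustering", "classification"}

def with_task_prefix(texts, task_type="search_document"):
    # Hypothetical helper: prepend the task type to each input string
    assert task_type in TASK_TYPES
    return [f"{task_type}: {text}" for text in texts]

print(with_task_prefix(["What is Matryoshka Representation Learning?"], "search_query"))
# ['search_query: What is Matryoshka Representation Learning?']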
Normalized implementation
Nomic Embed v1.5 has normalized outputs: every output vector from the model has a length of 1, so all vectors correspond to points on the same unit sphere. This matters when storing model outputs in a vector database, as it lets you use cosine similarity to compare them.
Nomic Embed v1.5's normalization is part of the provided implementation and uses PyTorch's F.normalize with the L2 (Euclidean) norm. The model code from the Truss shows that normalization steps are applied both before and after the Matryoshka dimensionality adjustment: a layer normalization before truncation and an L2 normalization after.
import torch.nn.functional as F

embeddings = self._model.encode(texts, convert_to_tensor=True)
# Layer-normalize before truncation
embeddings = F.layer_norm(embeddings, normalized_shape=(embeddings.shape[1],))
# Matryoshka adjustment: keep only the first m_dim dimensions
if m_dim < 768:
    embeddings = embeddings[:, :m_dim]
# L2-normalize so every output vector has length 1
embeddings = F.normalize(embeddings, p=2, dim=1)
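Because every output vector is unit-length, the cosine similarity denominator is 1, so cosine similarity reduces to a plain dot product. A quick check with stand-in unit vectors:

import numpy as np

a = np.random.randn(768); a /= np.linalg.norm(a)  # stand-in unit vectors;
b = np.random.randn(768); b /= np.linalg.norm(b)  # real embeddings are already unit-length

cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
assert np.isclose(cosine, np.dot(a, b))  # identical for unit-length vectors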
While this normalization process lets you use the same cosine similarity function that other popular embedding models use, including OpenAI's models, it's important to note that text embedding outputs cannot be compared model-to-model: output vectors are only meaningful when compared to other output vectors from the same model. That's why model selection matters for text embedding models, as switching to a new model means re-generating embeddings for your entire corpus.