voyage-3 & voyage-3-lite: A new generation of small yet mighty general-purpose embedding models
TL;DR – We are excited to announce voyage-3 and voyage-3-lite embedding models, advancing the frontier of retrieval quality, latency, and cost. voyage-3 outperforms OpenAI v3 large by 7.55% on average across all evaluated domains, including code, law, finance, multilingual, and long-context, with 2.2x lower costs and 3x smaller embedding dimension, resulting in 3x lower vectorDB costs. voyage-3-lite offers 3.82% better retrieval accuracy than OpenAI v3 large while costing 6x less and having 6x smaller embedding dimension. Both models support a 32K-token context length, 4x more than OpenAI.
In the last nine months, we have released our Voyage 2 series of embedding models, including state-of-the-art general-purpose models, such as voyage-large-2, and domain-specific models, such as voyage-code-2, voyage-law-2, voyage-finance-2, and voyage-multilingual-2, all extensively trained on data from their respective domains. For example, voyage-multilingual-2 demonstrates superior retrieval quality in French, German, Japanese, Spanish, and Korean, while still providing best-in-class performance in English. We have also fine-tuned models for companies with specific use cases and data, e.g., Harvey.ai.
Now, we are thrilled to introduce our Voyage 3 series embedding models, voyage-3 and voyage-3-lite, with voyage-3-large coming in a few weeks. These models outperform competitors1 in retrieval quality while significantly reducing price and downstream costs for vectorDB. Specifically, voyage-3:
- Outperforms OpenAI v3 large across all eight evaluated domains (tech, code, web, law, finance, multilingual, conversation, and long-context) by 7.55% on average.
- Costs 2.2x less than OpenAI v3 large and 1.6x less than Cohere English v3, at $0.06 per 1M tokens.
- Has a 3-4x smaller embedding dimension (1024) compared to OpenAI (3072) and E5 Mistral (4096), resulting in 3-4x lower vectorDB costs.
- Supports a 32K-token context length, compared to OpenAI (8K) and Cohere (512).

voyage-3-lite is a lighter-weight model optimized for latency and cost, which:
- Outperforms OpenAI v3 large by 3.82% on average across the domains.
- Costs 6.5x less than OpenAI v3 large, at $0.02 per 1M tokens.
- Outperforms OpenAI v3 small by 7.58% at the same price.
- Has a 6-8x smaller embedding dimension (512) compared to OpenAI (3072) and E5 Mistral (4096), resulting in 6-8x lower vectorDB costs.
- Supports a 32K-token context length, compared to OpenAI (8K) and Cohere (512).
The table below summarizes the key aspects of these models alongside a few competitors, accompanied by a plot of retrieval quality versus cost2.
| Model | Dimensions | Context Length | Cost (per 1M tokens) | Retrieval Quality (NDCG@10) |
|---|---|---|---|---|
| voyage-3 | 1024 | 32K | $0.06 | 76.72 |
| voyage-3-lite | 512 | 32K | $0.02 | 72.98 |
| OpenAI v3 large | 3072 | 8K | $0.13 | 69.17 |
| OpenAI v3 small | 1536 | 8K | $0.02 | 67.08 |
| Cohere English v3 | 1024 | 512 | $0.10 | 59.33 |
| E5 Mistral | 4096 | 4K | $0.10 | 70.13 |
| BGE M3 | 1024 | 8K | $0.016 | 66.61 |
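The vectorDB cost claims in the table follow from a simple observation: raw vector storage (and, roughly, index memory) scales linearly with embedding dimension. A back-of-the-envelope sketch, assuming float32 embeddings (actual vectorDB pricing adds index and metadata overhead):

```python
# Rough storage math behind the "smaller dimension -> lower vectorDB cost" claim:
# a float32 vector occupies 4 bytes per dimension, so storage scales linearly
# with embedding dimension.

def storage_gb(num_vectors: int, dim: int, bytes_per_value: int = 4) -> float:
    """Raw vector storage in GB for float32 embeddings (indexes add overhead)."""
    return num_vectors * dim * bytes_per_value / 1e9

million = 1_000_000
voyage_3 = storage_gb(million, 1024)         # ~4.1 GB per 1M vectors
openai_v3_large = storage_gb(million, 3072)  # ~12.3 GB per 1M vectors

print(openai_v3_large / voyage_3)  # dimension ratio: 3.0
```

The same arithmetic gives the 6x figure for voyage-3-lite's 512 dimensions versus OpenAI's 3072.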

voyage-3 and voyage-3-lite are the result of several research innovations, including an improved architecture, distillation from larger models, over 2T high-quality tokens in pre-training, and retrieval result alignment via human feedback.
Recommendations. All users of general-purpose embeddings can upgrade to voyage-3 for better retrieval quality at a lower cost, or to voyage-3-lite for further cost savings. If you are particularly interested in code, law, finance, or multilingual retrieval, the Voyage 2 series domain-specific models (voyage-code-2, voyage-law-2, voyage-finance-2, and voyage-multilingual-2) remain the best choices for their respective domains, even though voyage-3 is highly competitive as well (see the section below). If you already use Voyage embeddings, you can simply specify "voyage-3" or "voyage-3-lite" as the model parameter in Voyage API calls, for both the corpus and the queries.
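As a minimal sketch of such an API call, the request below assembles an embeddings payload with the model parameter set to "voyage-3". The endpoint URL and field names (input, model, input_type) are assumptions based on Voyage's API documentation at the time of writing; consult the docs for the authoritative request format.

```python
import json
import urllib.request

# Assumed endpoint; verify against Voyage's API docs.
VOYAGE_EMBED_URL = "https://api.voyageai.com/v1/embeddings"

def build_embed_request(texts, model="voyage-3", input_type="document"):
    """Assemble the JSON payload: the model parameter selects voyage-3 or voyage-3-lite."""
    return {"input": list(texts), "model": model, "input_type": input_type}

def embed(texts, api_key, model="voyage-3", input_type="document"):
    """POST the payload and return one embedding (a list of floats) per input text."""
    req = urllib.request.Request(
        VOYAGE_EMBED_URL,
        data=json.dumps(build_embed_request(texts, model, input_type)).encode(),
        headers={"Authorization": f"Bearer {api_key}",
                 "Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return [item["embedding"] for item in body["data"]]

# Embed the corpus with input_type="document" and queries with input_type="query".
payload = build_embed_request(["What is NDCG@10?"], model="voyage-3", input_type="query")
```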
Evaluation Details
Datasets. We evaluate on 40 domain-specific retrieval datasets spanning eight domains: technical documentation, code, law, finance, web reviews, multilingual, long documents, and conversations. Each dataset consists of a corpus to be retrieved from and a set of queries. The corpus typically comprises documents from a particular domain, such as answers on StackExchange, court opinions, or technical documentation, and the queries can be questions, summaries of a long document, or simply individual documents. The following table lists the datasets in the eight categories except multilingual. The multilingual domain comprises 62 datasets covering 26 languages, including French, German, Japanese, Spanish, Korean, Bengali, Portuguese, and Russian. Each of the first five languages has multiple datasets; the other languages have one dataset each and are grouped into an OTHER category in the multilingual radar chart below.
| Category | Descriptions | Datasets |
|---|---|---|
| TECH | Technical documentation | Cohere, 5G, OneSignal, LangChain, PyTorch |
| CODE | Code snippets, docstrings | LeetCodeCpp, LeetCodeJava, LeetCodePython, HumanEval, MBPP, DS1000-referenceonly, DS1000, apps_5doc |
| LAW | Cases, court opinions, statutes, patents | LeCaRDv2, LegalQuAD, LegalSummarization, AILA casedocs, AILA statutes |
| FINANCE | SEC filings, finance QA | RAG benchmark (Apple-10K-2022), FinanceBench, TAT-QA, Finance Alpaca, FiQA Personal Finance, Stock News Sentiment, ConvFinQA, FinQA, HC3 Finance |
| WEB | Reviews, forum posts, policy pages | Huffpostsports, Huffpostscience, Doordash, Health4CA |
| LONG-CONTEXT | Long documents on assorted topics: government reports, academic papers, and dialogues | NarrativeQA, Needle, Passkey, QMSum, SummScreenFD, WikimQA |
| CONVERSATION | Meeting transcripts, dialogues | Dialog Sum, QA Conv, HQA |
A list of all evaluation datasets is available in this spreadsheet.
Models. We evaluate voyage-3 and voyage-3-lite alongside several alternatives, including: OpenAI v3 small (text-embedding-3-small) and large (text-embedding-3-large), E5 Mistral (intfloat/e5-mistral-7b-instruct), BGE M3 (BAAI/bge-m3), Cohere English v3 (embed-english-v3.0), and voyage-large-2-instruct. For the domain-specific and multilingual datasets, we also evaluate voyage-law-2, voyage-finance-2, voyage-multilingual-2, Multilingual E5 (intfloat/multilingual-e5-large), and Cohere multilingual v3 (embed-multilingual-v3.0).
Metrics. Given a query, we retrieve the top 10 documents based on cosine similarities and report the normalized discounted cumulative gain (NDCG@10), a standard metric for retrieval quality; with binary relevance it behaves like a rank-weighted variant of recall.
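The evaluation procedure above can be sketched in a few lines: rank documents by cosine similarity to the query, keep the top 10, and score the ranked list with binary-relevance NDCG@10.

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def top_k(query_vec, doc_vecs, k=10):
    """Indices of the k documents most similar to the query."""
    ranked = sorted(range(len(doc_vecs)),
                    key=lambda i: cosine(query_vec, doc_vecs[i]),
                    reverse=True)
    return ranked[:k]

def ndcg_at_k(retrieved, relevant, k=10):
    """Binary-relevance NDCG@k: DCG of the retrieved list over the ideal DCG."""
    dcg = sum(1 / math.log2(rank + 2)
              for rank, doc in enumerate(retrieved[:k]) if doc in relevant)
    ideal = sum(1 / math.log2(rank + 2) for rank in range(min(len(relevant), k)))
    return dcg / ideal if ideal else 0.0
```

A retrieved list that places every relevant document first scores 1.0; relevant documents found lower in the top 10 contribute less, which is what distinguishes NDCG from plain recall.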
Results
Retrieval Across Domains. As discussed earlier and shown in the first radar chart of this post, voyage-3 outperforms OpenAI v3 large by an average of 7.55% across domains. Furthermore, voyage-3 trails closely behind Voyage’s domain-specific models as shown in the bar plots below.

Multilingual Retrieval. As shown in the radar chart below, voyage-3's multilingual retrieval quality is just slightly behind voyage-multilingual-2, with lower latency and at half the cost. voyage-3-lite outperforms all non-Voyage models, besting OpenAI v3 large, Cohere multilingual v3, and Multilingual E5 by 4.55%, 3.13%, and 3.89%, respectively.

All the evaluation results are available in this spreadsheet.
Try Voyage 3 Series!
Give voyage-3 and voyage-3-lite a try today! The first 200M tokens are free. Head over to our docs to learn more. If you’re also interested in fine-tuning embeddings, we’d love to hear from you—please email us at contact@voyageai.com. Follow us on X (Twitter) and LinkedIn, and join our Discord for more updates.
- Cohere English v3 average NDCG@10 for LAW and LONG-CONTEXT is 33.32% and 42.48%, respectively. We rounded these values up to 45% for the radar chart visualization. ↩︎
- E5 Mistral and BGE M3 are open-source models. We use $0.10 per 1M tokens for E5 Mistral, following the industry-standard price for 7B-parameter models, and $0.016 for BGE M3, using Fireworks.ai's price for 350M-parameter models. ↩︎
Source: Voyage AI