Hong Liu Profile
Hong Liu

@HongLiu9903

Followers
286
Following
39
Media
7
Statuses
53

Co-founder, Lead Research @VoyageAI.

Joined October 2021
@HongLiu9903
Hong Liu
8 days
Greater things to come!
Tweet media one
0
2
10
@HongLiu9903
Hong Liu
14 days
RT @dittycheria: We just launched Voyage-context-3, a new embedding model that gives AI a full-document view while preserving chunk-level p….
0
12
0
@HongLiu9903
Hong Liu
14 days
voyage-context-3 marks a paradigm shift in reducing reliance on chunking. The idea dates back to last year, when @Yujie_Qian and I discussed how to embed contextual information without breaking VectorDBs. It turns out a new training objective is the key to productionizing the idea.
@VoyageAI
Voyage AI by MongoDB
14 days
📢 voyage-context-3: contextualized chunk embeddings.
- Auto-captures chunk-level detail & global doc context, w/o metadata augmentation.
- Beats OpenAI-v3-large by 14.24% & Cohere-v4 by 7.89%.
- Binary 512-dim matches OpenAI (float, 3072-dim) in accuracy, but 192x cheaper in…
Tweet media one
0
0
5
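The "192x cheaper" figure in the announcement above is straightforward storage arithmetic; a quick check, assuming float32 (4 bytes per dimension) for the 3072-dim vectors and 1 bit per binary dimension:

```python
# Storage per vector, assuming float32 (4 bytes/dim) for the 3072-dim
# float embedding and 1 bit/dim for the 512-dim binary embedding.
float_bytes = 3072 * 4    # 12288 bytes per float32 vector
binary_bytes = 512 // 8   # 64 bytes per binary vector

print(float_bytes / binary_bytes)  # 192.0 -> the "192x cheaper" storage factor
```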
@HongLiu9903
Hong Liu
1 month
6/6 Come see the future of multimodal embeddings! @HaonanC80190 @luo_yuping
📝 Paper: Code: 🤖 Models: 📚 Datasets: 🌐 Homepage:
haon-chen.github.io
0
1
10
@HongLiu9903
Hong Liu
1 month
5/6 The most exciting implication: we can now continuously feed massive, unlabeled web-scale multimodal data into MoCa to constantly improve our embeddings. When it comes to multimodal embedding, scaling might be all you need.
Tweet media one
1
1
9
@HongLiu9903
Hong Liu
1 month
4/6 The result? MoCa efficiently transforms causal VLMs into powerful bidirectional embedding models. We've set a new SoTA for single-vector embeddings on MMEB and ViDoRe V2. Our 3B-parameter model even matches or outperforms strong 7B+ baselines! 🤯
Tweet media one
Tweet media two
1
0
6
@HongLiu9903
Hong Liu
1 month
3/6 MoCa's approach: we jointly reconstruct masked image patches and text tokens using a bidirectional attention mechanism. This captures the rich context in interleaved data. After this continual pre-training, a lightweight contrastive fine-tuning stage suffices.
1
0
3
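The joint objective described in 3/6 can be caricatured in a few lines. Everything below (shapes, mask ratios, the random stand-in "predictions") is illustrative only, not the authors' implementation; the point is that a single loss combines masked-patch regression with masked-token prediction:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy interleaved sample: 16 image patches (8-dim features) and 12 text
# tokens (vocab of 50). All shapes are made up for illustration.
patches = rng.normal(size=(16, 8))
tokens = rng.integers(0, 50, size=12)

# Mask part of each modality; with bidirectional attention the model can
# use every unmasked position, in both modalities, to fill in the blanks.
patch_mask = np.zeros(16, dtype=bool)
patch_mask[:8] = True
token_mask = np.zeros(12, dtype=bool)
token_mask[:4] = True

# Stand-in predictions; in MoCa these would come from a VLM whose causal
# attention has been switched to bidirectional.
pred_patches = rng.normal(size=(16, 8))
pred_logits = rng.normal(size=(12, 50))

# Joint loss: MSE on masked patches + cross-entropy on masked tokens.
mse = np.mean((pred_patches[patch_mask] - patches[patch_mask]) ** 2)
log_p = pred_logits - np.log(np.exp(pred_logits).sum(axis=1, keepdims=True))
ce = -np.mean(log_p[np.arange(12), tokens][token_mask])
joint_loss = mse + ce
print(joint_loss)
```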
@HongLiu9903
Hong Liu
1 month
2/6 Why the change? Current multimodal embeddings often struggle. They rely on causal attention (as in LLMs), which might be suboptimal for embeddings, and on contrastive learning, which needs curated, labeled pairs and is hard to scale. We saw a better way.
1
0
3
@HongLiu9903
Hong Liu
1 month
1/6 Introducing MoCa, a new method for continual pre-training of multimodal embeddings! 🚀 MoCa is the first to effectively scale with unlabeled interleaved image-text data, marking a paradigm shift in multimodal embeddings. Paper, code, & checkpoints! 👇 #AI #Multimodal #ML #NLP
Tweet media one
1
40
139
@HongLiu9903
Hong Liu
2 months
Congrats! Code retrieval is undervalued. It's great to see more players in the game six months after the voyage-code-3 release.
@MistralAI
Mistral AI
2 months
Introducing Codestral Embed, the new state-of-the-art embedding model for code.
Tweet media one
0
0
3
@HongLiu9903
Hong Liu
3 months
🔥 Mind-blown by embedding model progress! In the past two months, we made voyage-3.5-lite outperform its 3x larger predecessor, voyage-3. The secret? Distilling from a larger model (voyage-3-large) is incredibly effective. The future of embeddings is here!
@VoyageAI
Voyage AI by MongoDB
3 months
📢 Meet voyage-3.5 and voyage-3.5-lite!
• Flexible dim. and quantizations.
• voyage-3.5 & 3.5-lite (int8, 2048 dim.) are 8% & 6% more accurate than OpenAI-v3-large, and 2.2x & 6.5x cheaper, resp. Also 83% less vectorDB cost!
• 3.5-lite ~ Cohere-v4 in quality, but 83% cheaper.
Tweet media one
0
0
2
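The distillation claim above is easy to illustrate: train a small student to reproduce a frozen teacher's embeddings. A minimal linear sketch with made-up dimensions (nothing here reflects Voyage's actual recipe):

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy distillation: a frozen linear "teacher" maps 32-dim inputs to 16-dim
# embeddings; a "student" (here also linear, for brevity) is trained to
# match them with an MSE loss. Dimensions and losses are illustrative only.
X = rng.normal(size=(256, 32))        # batch of inputs
teacher_W = rng.normal(size=(32, 16))
target = X @ teacher_W                # teacher embeddings (frozen)

W = np.zeros((32, 16))                # student parameters
lr = 0.1
for _ in range(200):
    pred = X @ W
    grad = 2 * X.T @ (pred - target) / len(X)  # d(MSE)/dW
    W -= lr * grad                             # gradient descent step

final_mse = float(np.mean((X @ W - target) ** 2))
print(final_mse)  # tiny: the student has matched the teacher's embeddings
```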
@HongLiu9903
Hong Liu
4 months
We trained voyage-code-3 back last November. So far, no other model is even close in code retrieval. Happy to see it shine in the brilliant @continuedev code assistants!
@continuedev
Continue
4 months
@metcalfc wrote a deep dive on why your custom AI code assistant should include embeddings and a reranker from @VoyageAI🥇
Tweet media one
0
0
3
@HongLiu9903
Hong Liu
5 months
Proud of the team for what we have achieved! Joining MongoDB opens a new chapter of innovations to reshape the landscape of information retrieval and semantic search.
@dittycheria
Dev Ittycheria
5 months
The risk of hallucinations currently holds enterprises back from deploying AI apps. Excited to share that VoyageAI has joined MongoDB to make high-quality AI-powered search and retrieval easy, enabling organizations to build trustworthy AI apps at scale.
0
0
5
@HongLiu9903
Hong Liu
7 months
Since the difference is astonishing, I have questions for the authors:
Did you use the same setup for baselines and your models?
Did your model use the same prompt in evaluation as the one you mention on HF?
Most importantly, can you release code to reproduce the results?
1
0
1
@HongLiu9903
Hong Liu
7 months
The paper does not provide code, so I use the standard mteb library. I use the prompt on the HF page for the SFR model. The reported voyage-code-2 result is 23.3% worse than my reproduction. The reported SFR-Embedding-Code-2B_R result is 14.5% better than my reproduction.
1
0
0
@HongLiu9903
Hong Liu
7 months
Tried to reproduce the CoIR results. TL;DR: SFR-Embedding-Code-2B_R is 26.5% worse than voyage-code-2, as opposed to what is claimed in the paper.
Tweet media one
@SFResearch
Salesforce AI Research
7 months
🚨🚨🚨 Just released! 🚨🚨🚨
🚀 Introducing the Salesforce Code Embedding Model Family (SFR-Embedding-Code), ranked #1 on the CoIR Benchmark! 🚀
Available in 2 sizes: 2B, 400M. Key Highlights:
1️⃣ 2B Model: Achieves #1 on CoIR.
2️⃣ 400M Model: Best-performing model under 0.5B.
4
2
13
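For readers checking the numbers in this thread: gaps like "26.5% worse" read naturally as relative differences in average benchmark score. A sketch with hypothetical scores (NOT the actual CoIR numbers), assuming gap = (a − b) / b:

```python
def relative_gap(a: float, b: float) -> float:
    """Relative difference of score a vs. baseline b, in percent."""
    return (a - b) / b * 100

# Hypothetical average scores, chosen only to show the computation.
voyage_code_2 = 70.0
sfr_reproduced = 51.45

print(round(relative_gap(sfr_reproduced, voyage_code_2), 1))  # -26.5
```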
@HongLiu9903
Hong Liu
7 months
Let’s go large!
@spyced
Jonathan Ellis
7 months
I ran a fresh evaluation of embedding models tuned for semantic retrieval, including the newest models from Voyage, Jina, Cohere, and NVIDIA. Link in thread.
Tweet media one
Tweet media two
0
0
2
@HongLiu9903
Hong Liu
7 months
voyage-3-large embodies all the insights we've learned along the way. It outperforms every model we tested, on every type of retrieval task, by a considerable margin.
@VoyageAI
Voyage AI by MongoDB
7 months
📢 Announcing the new SOTA voyage-3-large embedding model!
• +9.74% over OpenAI and +20.71% over Cohere.
• Flexible dim. (256-2048) and quantizations (float, int8, binary).
• +8.56% over OpenAI with 1/24x the storage cost.
• +1.16% over OpenAI with 1/192x the storage cost ($10K → $52).
Tweet media one
0
1
12
@HongLiu9903
Hong Liu
8 months
While the year-old voyage-code-2 is already unparalleled in code retrieval, voyage-code-3 pushes the boundaries even further.
@VoyageAI
Voyage AI by MongoDB
8 months
Voyage created a total of 238 new high-quality reasoning-intensive code retrieval datasets that address the shortcomings of existing benchmarks (noisy labels, overly simplistic tasks, and data contamination). voyage-code-3 outperforms all other models in every group of datasets.
Tweet media one
0
1
2