Hong Liu Profile
Hong Liu

@HongLiu9903

Followers
310
Following
49
Media
7
Statuses
59

Co-founder, Lead Research @VoyageAI.

Joined October 2021
@HongLiu9903
Hong Liu
25 days
🚀 Unveiling the first synthetic pretraining method that doesn’t rely on teacher distillation. Big shoutout to @ZitongYang0 @Aonan12 and the team!
@ZitongYang0
Zitong Yang
25 days
📜 Paper on a new pretraining paradigm: Synthetic Bootstrapped Pretraining (SBP). SBP goes beyond next-token supervision within a single document by leveraging inter-document correlations to synthesize new data for training — no teacher needed. Validation: 1T data + 3B model from scratch. 🧵
0
0
9
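The core SBP step named above, pairing correlated documents so one can supervise synthesis conditioned on the other, can be sketched in miniature. Everything below is a toy: `bow_cosine` stands in for a real embedding model, and only the pairing step is shown (training the actual synthesizer is out of scope).

```python
from collections import Counter
from math import sqrt

def bow_cosine(a: str, b: str) -> float:
    """Bag-of-words cosine similarity; a stand-in for a real embedding model."""
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ca[w] * cb[w] for w in ca)
    na = sqrt(sum(v * v for v in ca.values()))
    nb = sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb) if na and nb else 0.0

def related_pairs(docs: list[str], threshold: float = 0.2) -> list[tuple[str, str]]:
    """Collect (seed, target) document pairs whose similarity exceeds a threshold.
    In SBP, such pairs would supervise a synthesizer that writes a new document
    conditioned on the seed; here we only perform the pairing step."""
    pairs = []
    for i, a in enumerate(docs):
        for j, b in enumerate(docs):
            if i != j and bow_cosine(a, b) > threshold:
                pairs.append((a, b))
    return pairs

docs = [
    "transformers scale with data and compute",
    "scaling laws for transformers and data",
    "recipes for sourdough bread at home",
]
print(len(related_pairs(docs)))  # → 2 (the two ML docs pair in both directions)
```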
@HongLiu9903
Hong Liu
2 months
Instruction-following takes reranker capabilities to the next level 🔥 Huge thanks to @zhmeishi and @AkshayGoindani1 for driving this leap forward!
@VoyageAI
Voyage AI by MongoDB
2 months
📣 Announcing rerank-2.5 and 2.5-lite: our latest generation of rerankers! • First reranker with instruction-following capabilities • rerank-2.5 and 2.5-lite are 7.94% and 7.16% more accurate than Cohere Reranker v3.5 • Additional 8.13% and 7.55% performance gain with
1
1
11
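As a rough illustration of what instruction-following reranking buys, here is a toy reranker where the instruction is simply concatenated to the query before scoring. This is a hypothetical sketch, not the rerank-2.5 API or its actual conditioning mechanism.

```python
from collections import Counter
from math import sqrt

def bow_cosine(a: str, b: str) -> float:
    """Bag-of-words cosine similarity; a toy stand-in for a cross-encoder score."""
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ca[w] * cb[w] for w in ca)
    na = sqrt(sum(v * v for v in ca.values()))
    nb = sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb) if na and nb else 0.0

def rerank(query: str, docs: list[str], instruction: str = "") -> list[str]:
    """Order docs by relevance to the query, conditioned on a natural-language
    instruction (naively concatenated here; a real instruction-following
    reranker conditions the model itself on the instruction)."""
    q = f"{instruction} {query}".strip()
    return sorted(docs, key=lambda d: bow_cosine(q, d), reverse=True)

docs = [
    "python the programming language tutorial",
    "python the snake species habitat",
]
# The same ambiguous query ranks differently under different instructions.
print(rerank("python", docs, instruction="the programming language")[0])  # → python the programming language tutorial
print(rerank("python", docs, instruction="the snake species")[0])         # → python the snake species habitat
```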
@HongLiu9903
Hong Liu
3 months
Greater things to come!
0
2
10
@dittycheria
Dev Ittycheria
3 months
We just launched Voyage-context-3, a new embedding model that gives AI a full-document view while preserving chunk-level precision that offers better retrieval performance than leading alternatives. When building AI that reads and reasons over documents (such as reports,
1
13
22
@HongLiu9903
Hong Liu
3 months
voyage-context-3 marks a paradigm shift in reducing reliance on chunking. The idea dates back to last year, when @Yujie_Qian and I discussed how to embed contextual information without breaking VectorDBs. It turns out a new training objective is the key to productionizing the idea.
@VoyageAI
Voyage AI by MongoDB
3 months
📢 voyage-context-3: contextualized chunk embeddings - Auto captures of chunk level detail & global doc context, w/o metadata augmentation - Beats OpenAI-v3-large by 14.24% & Cohere-v4 by 7.89% - Binary 512-dim matches OpenAI (float, 3072-dim) in accuracy, but 192x cheaper in
0
0
5
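The interface voyage-context-3 exposes, chunk vectors that also carry document-level context, can be illustrated with a toy interpolation. The hashed bag-of-words `embed` and the blending weight `alpha` below are stand-ins; the real model learns this jointly via a new training objective rather than interpolating.

```python
import zlib
import numpy as np

def embed(text: str, dim: int = 64) -> np.ndarray:
    """Hashed bag-of-words vector; a deterministic stand-in for a real encoder."""
    v = np.zeros(dim)
    for tok in text.lower().split():
        v[zlib.crc32(tok.encode()) % dim] += 1.0
    n = np.linalg.norm(v)
    return v / n if n else v

def contextualized_chunks(chunks: list[str], alpha: float = 0.7) -> list[np.ndarray]:
    """Give each chunk vector a share of whole-document context, so retrieval
    reflects both chunk-level detail and global document meaning. Vectors stay
    one-per-chunk, so existing VectorDBs are unaffected."""
    doc_vec = embed(" ".join(chunks))
    out = []
    for c in chunks:
        v = alpha * embed(c) + (1 - alpha) * doc_vec
        out.append(v / np.linalg.norm(v))
    return out

chunks = ["revenue grew 12% year over year", "the board approved a buyback"]
vecs = contextualized_chunks(chunks)
print(len(vecs), vecs[0].shape)  # → 2 (64,)
```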
@HongLiu9903
Hong Liu
4 months
6 / 6 Come see the future of multimodal embeddings! @HaonanC80190 @luo_yuping 📝 Paper: https://t.co/m0535UY4Hx 💻 Code: https://t.co/sJQE4pJgWa 🤖 Models: https://t.co/50oJY7lJbv 📚 Datasets: https://t.co/oQBsywdgUQ 🌐 Homepage: https://t.co/O1kixSp32w
0
1
11
@HongLiu9903
Hong Liu
4 months
5 / 6 The most exciting implication: We can now continuously feed massive, unlabeled web-scale multimodal data into MoCa to constantly improve our embeddings. When it comes to multimodal embedding, scaling might be all you need.
1
1
9
@HongLiu9903
Hong Liu
4 months
4 / 6 The result? MoCa efficiently transforms causal VLMs into powerful bidirectional embedding models. We've set the SoTA for single-vector embeddings on MMEB and ViDoRe V2. Our 3B parameter model even matches or outperforms strong 7B+ baselines! 🤯
1
0
6
@HongLiu9903
Hong Liu
4 months
3 / 6 MoCa's approach: We jointly reconstruct masked image patches and text tokens using a bidirectional attention mechanism. This captures the rich context from interleaved data. After this continual pre-training, a lightweight contrastive fine-tuning stage suffices.
1
0
3
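The two ingredients named above, bidirectional attention and masked reconstruction targets over interleaved data, can be sketched as mask construction. This is an illustrative toy, not MoCa's implementation.

```python
import numpy as np

def attention_mask(seq_len: int, causal: bool) -> np.ndarray:
    """mask[i, j] = 1 means position j is visible to position i. A causal LLM
    sees only the past; switching to bidirectional attention lets every token
    (image patch or text) attend over the full interleaved sequence."""
    if causal:
        return np.tril(np.ones((seq_len, seq_len)))
    return np.ones((seq_len, seq_len))

def reconstruction_targets(seq_len: int, ratio: float = 0.3, seed: int = 0) -> np.ndarray:
    """Pick positions to mask; the joint reconstruction loss (image patches
    and text tokens alike) is computed only at these positions."""
    rng = np.random.default_rng(seed)
    return rng.random(seq_len) < ratio

L = 6
# Causal attention exposes 21 of 36 pairs; bidirectional exposes all 36.
print(int(attention_mask(L, causal=True).sum()), int(attention_mask(L, causal=False).sum()))  # → 21 36
```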
@HongLiu9903
Hong Liu
4 months
2 / 6 Why the change? Current multimodal embeddings often struggle. They rely on: Causal attention (like in LLMs), which might be suboptimal for embeddings. Contrastive learning, which needs curated, labeled pairs and is hard to scale. We saw a better way.
1
0
3
@HongLiu9903
Hong Liu
4 months
1/6 Introducing MoCa, a new method for continual pre-training of multimodal embeddings! 🚀 MoCa is the first to effectively scale with unlabeled interleaved image-text data, marking a paradigm shift in multimodal embeddings. Paper, code, & checkpoints! 👇 #AI #Multimodal #ML #NLP
1
40
142
@HongLiu9903
Hong Liu
5 months
Congrats! Code retrieval is undervalued. It’s great to see more players in the game 6 months after the voyage-code-3 release
@MistralAI
Mistral AI
5 months
Introducing Codestral Embed, the new state-of-the-art embedding model for code.
0
0
3
@HongLiu9903
Hong Liu
5 months
🔥 Mind-blown by embedding model progress! In the past two months, we made voyage-3.5-lite outperform its 3x larger predecessor, voyage-3. The secret? Distilling from a larger model (voyage-3-large) is incredibly effective. The future of embeddings is here!
@VoyageAI
Voyage AI by MongoDB
5 months
📢 Meet voyage-3.5 and voyage-3.5-lite! • flexible dim. and quantizations • voyage-3.5 & 3.5-lite (int8, 2048 dim.) are 8% & 6% more accurate than OpenAI-v3-large, and 2.2x & 6.5x cheaper, resp. Also 83% less vectorDB cost! • 3.5-lite ~ Cohere-v4 in quality, but 83% cheaper.
0
0
2
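Embedding distillation of the kind described can be sketched in miniature: fit a small student projection so its outputs match the teacher's vectors. The synthetic data and least-squares fit below are stand-ins for real text and SGD training.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_student, d_teacher = 200, 8, 32

# Synthetic stand-ins: teacher embeddings are a linear function of the
# student's features plus noise (real distillation targets would come from
# running the larger model over a text corpus).
student_feats = rng.normal(size=(n, d_student))
W_true = rng.normal(size=(d_student, d_teacher))
teacher = student_feats @ W_true + 0.1 * rng.normal(size=(n, d_teacher))

# Fit the student's projection head to match the teacher's vectors.
# Least squares here; in practice, SGD on an MSE or cosine objective.
W_hat, *_ = np.linalg.lstsq(student_feats, teacher, rcond=None)
mse = float(np.mean((student_feats @ W_hat - teacher) ** 2))
print(mse < 0.05)  # residual is near the injected noise level
```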
@HongLiu9903
Hong Liu
7 months
We trained voyage-code-3 back last Nov. So far no other model is even close in code retrieval. Happy to see it shine in the brilliant @continuedev code assistants!
@continuedev
Continue
7 months
@metcalfc wrote a deep dive on why your custom AI code assistant should include embeddings and a reranker from @VoyageAI🥇
0
0
3
@HongLiu9903
Hong Liu
8 months
Proud of the team for what we have achieved! Joining MongoDB opens a new chapter of innovation to reshape the landscape of information retrieval and semantic search.
@dittycheria
Dev Ittycheria
8 months
The risk of hallucinations currently holds enterprises back from deploying AI apps. Excited to share that VoyageAI has joined MongoDB to make high-quality AI-powered search and retrieval easy, enabling organizations to build trustworthy AI apps at scale. https://t.co/8I2x6OLzwR
0
0
5
@HongLiu9903
Hong Liu
9 months
Since the difference is astonishing, I have questions for the authors: Did you use the same setup for the baselines and your models? Did your model use the same prompt in evaluation as the one mentioned on HF? Most importantly, can you release code to reproduce the results?
1
0
1
@HongLiu9903
Hong Liu
9 months
The paper does not provide code, so I used the standard mteb ( https://t.co/0a36upMiol). I used the prompt from the HF page for the SFR model. The reported voyage-code-2 result is 23.3% worse than my reproduction; the reported SFR-Embedding-Code-2B_R result is 14.5% better than my reproduction.
1
0
0
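The relative gaps quoted in this thread reduce to a one-line calculation. The scores below are illustrative placeholders, not the actual CoIR numbers.

```python
def relative_gap(reported: float, reproduced: float) -> float:
    """Percent difference of a reported score relative to a reproduced one.
    Positive means the paper's number is higher than the reproduction."""
    return 100.0 * (reported - reproduced) / reproduced

# Hypothetical scores: a paper reports 0.60 where reproduction gives 0.50.
print(round(relative_gap(0.60, 0.50), 1))  # → 20.0
```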
@HongLiu9903
Hong Liu
9 months
Tried to reproduce the CoIR results. TL;DR: SFR-Embedding-Code-2B_R is 26.5% worse than voyage-code-2, as opposed to what is claimed in the paper.
@SFResearch
Salesforce AI Research
9 months
🚨🚨🚨Just released!🚨🚨🚨 🚀Introducing the Salesforce Code Embedding Model Family (SFR-Embedding-Code), ranked #1 on CoIR Benchmark! 🚀 Available in 2 sizes: 2B, 400M. Key Highlights: 1️⃣ 2B Model: Achieves #1 on CoIR. 2️⃣400M Model: Best-performing model under 0.5B
4
2
13