
Hong Liu
@HongLiu9903
310 Followers · 49 Following · 7 Media · 59 Statuses
Co-founder, Lead Research @VoyageAI.
Joined October 2021
🚀 Unveiling the first synthetic pretraining method that doesn’t rely on teacher distillation. Big shoutout to @ZitongYang0 @Aonan12 and the team!
📜 Paper on a new pretraining paradigm: Synthetic Bootstrapped Pretraining (SBP). SBP goes beyond next-token supervision within a single document by leveraging inter-document correlations to synthesize new data for training — no teacher needed. Validation: 1T data + a 3B model from scratch. 🧵
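For readers who want to see the shape of the idea, here is a rough sketch of the SBP loop as described in the tweet. The `model` interface, function names, and similarity threshold are all hypothetical illustrations, not the paper's code.

```python
# Hypothetical sketch of the SBP loop: mine correlated document pairs, learn a
# cross-document conditional with the same model, synthesize, and keep pretraining.
import numpy as np

def nearest_neighbor_pairs(doc_embeddings: np.ndarray, threshold: float = 0.8):
    """Mine correlated document pairs by cosine similarity.

    Assumes `doc_embeddings` is row-normalized, shape (n_docs, dim).
    """
    sims = doc_embeddings @ doc_embeddings.T
    np.fill_diagonal(sims, -1.0)          # exclude trivial self-pairs
    nn = sims.argmax(axis=1)              # best neighbor per document
    return [(i, int(j)) for i, j in enumerate(nn) if sims[i, j] >= threshold]

def synthetic_bootstrapped_pretraining(corpus, model, rounds=1):
    """`model` is an assumed interface: embed_documents, finetune_conditional,
    sample_given, and pretrain are placeholders for the real training stack."""
    for _ in range(rounds):
        pairs = nearest_neighbor_pairs(model.embed_documents(corpus))
        # Cross-document supervision: learn p(doc_b | doc_a) with the same model.
        model.finetune_conditional([(corpus[a], corpus[b]) for a, b in pairs])
        # Synthesize new documents from the learned conditional -- no teacher model.
        corpus = corpus + [model.sample_given(corpus[a]) for a, _ in pairs]
        # Resume ordinary next-token pretraining on the enlarged corpus.
        model.pretrain(corpus)
    return model
```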
Instruction-following takes reranker capabilities to the next level 🔥 Huge thanks to @zhmeishi and @AkshayGoindani1 for driving this leap forward!
📣 Announcing rerank-2.5 and 2.5-lite: our latest generation of rerankers! • First reranker with instruction-following capabilities • rerank-2.5 and 2.5-lite are 7.94% and 7.16% more accurate than Cohere Reranker v3.5 • Additional 8.13% and 7.55% performance gain with
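As context for the instruction-following claim, a minimal sketch of calling a Voyage reranker from the Python client. Folding the instruction into the query string is my assumption about how to pass it; check the official docs for the supported mechanism.

```python
import voyageai

vo = voyageai.Client()  # reads VOYAGE_API_KEY from the environment

documents = [
    "Q2 revenue grew 12% year over year, driven by enterprise contracts.",
    "The company repurchased $50M of shares in Q2.",
    "Headcount was flat quarter over quarter.",
]

# Assumption: the instruction is prepended to the query text.
instruction = "Prefer passages with concrete financial figures."
query = f"{instruction}\nWhat drove revenue growth last quarter?"

reranking = vo.rerank(query, documents, model="rerank-2.5", top_k=2)
for r in reranking.results:
    print(round(r.relevance_score, 3), r.document)
```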
We just launched voyage-context-3, a new embedding model that gives AI a full-document view while preserving chunk-level precision, offering better retrieval performance than leading alternatives. When building AI that reads and reasons over documents (such as reports,
voyage-context-3 marks a paradigm shift that reduces reliance on chunking. The idea dates back to last year, when @Yujie_Qian and I discussed how to embed contextual information without breaking VectorDBs. It turns out a new training objective is the key to productionizing the idea.
📢 voyage-context-3: contextualized chunk embeddings - Automatically captures chunk-level detail & global doc context, w/o metadata augmentation - Beats OpenAI-v3-large by 14.24% & Cohere-v4 by 7.89% - Binary 512-dim matches OpenAI (float, 3072-dim) in accuracy, but 192x cheaper in
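A minimal sketch of what calling contextualized chunk embeddings might look like from the Python client. The `contextualized_embed` method name, its arguments, and the result shape follow my reading of the announcement and may differ from the shipped API; verify against the current docs.

```python
import voyageai

vo = voyageai.Client()

# One document, pre-split into chunks. Each chunk gets its own vector, but the
# model sees the whole document, so every vector also carries global context.
doc_chunks = [
    "Apple reported record Q4 revenue.",
    "Services grew 16%, offsetting a decline in iPad sales.",
    "Guidance for the holiday quarter was above consensus.",
]

result = vo.contextualized_embed(
    inputs=[doc_chunks],          # list of documents, each a list of chunks
    model="voyage-context-3",
    input_type="document",
)
chunk_vectors = result.results[0].embeddings  # one vector per chunk
print(len(chunk_vectors), "chunk vectors, dim", len(chunk_vectors[0]))
```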
6 / 6 Come see the future of multimodal embeddings! @HaonanC80190 @luo_yuping
📝 Paper: https://t.co/m0535UY4Hx
💻 Code: https://t.co/sJQE4pJgWa
🤖 Models: https://t.co/50oJY7lJbv
📚 Datasets: https://t.co/oQBsywdgUQ
🌐 Homepage: https://t.co/O1kixSp32w
5 / 6 The most exciting implication: We can now continuously feed massive, unlabeled web-scale multimodal data into MoCa to constantly improve our embeddings. When it comes to multimodal embedding, scaling might be all you need.
4 / 6 The result? MoCa efficiently transforms causal VLMs into powerful bidirectional embedding models. We've set the SoTA for single-vector embeddings on MMEB and ViDoRe V2. Our 3B-parameter model even matches or outperforms strong 7B+ baselines! 🤯
3 / 6 MoCa's approach: We jointly reconstruct masked image patches and text tokens using a bidirectional attention mechanism. This captures the rich context from interleaved data. After this continual pre-training, a lightweight contrastive fine-tuning stage suffices.
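A toy PyTorch sketch of the joint-reconstruction objective described in this step, assuming a generic bidirectional encoder over interleaved patch/text embeddings. Dimensions, heads, and the loss weighting are illustrative; this is not the MoCa codebase.

```python
import torch
import torch.nn as nn

d = 256
encoder = nn.TransformerEncoder(  # bidirectional: no causal mask is applied
    nn.TransformerEncoderLayer(d_model=d, nhead=8, batch_first=True),
    num_layers=2,
)
text_head = nn.Linear(d, 32000)   # predicts masked text token ids
patch_head = nn.Linear(d, d)      # regresses masked image-patch embeddings
mask_emb = nn.Parameter(torch.zeros(d))

def joint_reconstruction_loss(seq, is_text, token_ids, mask_ratio=0.25):
    # seq: (B, L, d) interleaved patch+text embeddings; is_text: (B, L) bool
    masked = torch.rand(seq.shape[:2]) < mask_ratio
    x = torch.where(masked.unsqueeze(-1), mask_emb.expand_as(seq), seq)
    h = encoder(x)                           # every position sees full context
    txt, img = masked & is_text, masked & ~is_text
    loss_txt = nn.functional.cross_entropy(text_head(h[txt]), token_ids[txt])
    loss_img = nn.functional.mse_loss(patch_head(h[img]), seq[img])
    return loss_txt + loss_img               # jointly reconstruct both modalities

B, L = 2, 64
print(joint_reconstruction_loss(
    torch.randn(B, L, d),
    torch.rand(B, L) < 0.5,
    torch.randint(0, 32000, (B, L)),
))
```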
2 / 6 Why the change? Current multimodal embeddings often struggle. They rely on:
• Causal attention (like in LLMs), which might be suboptimal for embeddings.
• Contrastive learning, which needs curated, labeled pairs and is hard to scale.
We saw a better way.
1/6 Introducing MoCa, a new method for continual pre-training of multimodal embeddings! 🚀 MoCa is the first to effectively scale with unlabeled interleaved image-text data, marking a paradigm shift in multimodal embeddings. Paper, code, & checkpoints! 👇 #AI #Multimodal #ML #NLP
🔥 Mind-blown by embedding model progress! In the past two months, we made voyage-3.5-lite outperform its 3x larger predecessor, voyage-3. The secret? Distilling from a larger model (voyage-3-large) is incredibly effective. The future of embeddings is here!
📢 Meet voyage-3.5 and voyage-3.5-lite! • flexible dim. and quantizations • voyage-3.5 & 3.5-lite (int8, 2048 dim.) are 8% & 6% more accurate than OpenAI-v3-large, and 2.2x & 6.5x cheaper, resp. Also 83% less vectorDB cost! • 3.5-lite ~ Cohere-v4 in quality, but 83% cheaper.
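On the distillation point: a minimal sketch of one common recipe, where a small student is trained to match a frozen teacher's embedding geometry on unlabeled text. The toy encoders, dimensions, and cosine objective are my illustration, not Voyage's actual setup.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

teacher_dim, student_dim = 1024, 512
teacher = nn.EmbeddingBag(30000, teacher_dim)   # stand-in for voyage-3-large
student = nn.EmbeddingBag(30000, student_dim)   # stand-in for voyage-3.5-lite
proj = nn.Linear(teacher_dim, student_dim, bias=False)  # align dimensions
teacher.requires_grad_(False)                   # teacher stays frozen

opt = torch.optim.AdamW(list(student.parameters()) + list(proj.parameters()), lr=1e-4)

for step in range(100):
    token_ids = torch.randint(0, 30000, (32, 64))   # stand-in text batch
    with torch.no_grad():
        t = teacher(token_ids)                      # teacher embeddings
    s = student(token_ids)                          # student embeddings
    # Maximize cosine similarity between student and projected teacher vectors.
    loss = 1 - F.cosine_similarity(s, proj(t), dim=-1).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```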
We trained voyage-code-3 back in November last year. So far, no other model even comes close in code retrieval. Happy to see it shine in the brilliant @continuedev code assistants!
@metcalfc wrote a deep dive on why your custom AI code assistant should include embeddings and a reranker from @VoyageAI🥇
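The two-stage pattern the post describes (dense retrieval with voyage-code-3, then a reranker over the shortlist) looks roughly like this. The toy corpus and the reranker model string are my assumptions.

```python
import numpy as np
import voyageai

vo = voyageai.Client()

corpus = [
    "def quicksort(arr): ...",
    "class LRUCache: ...",
    "async def fetch_url(session, url): ...",
]

# Stage 1: embedding retrieval with voyage-code-3 (cosine similarity).
doc_vecs = np.asarray(
    vo.embed(corpus, model="voyage-code-3", input_type="document").embeddings)
query = "least recently used cache implementation"
q_vec = np.asarray(
    vo.embed([query], model="voyage-code-3", input_type="query").embeddings[0])
scores = doc_vecs @ q_vec / (np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(q_vec))
candidates = [corpus[i] for i in np.argsort(scores)[::-1][:2]]

# Stage 2: rerank the shortlist for higher precision.
reranked = vo.rerank(query, candidates, model="rerank-2.5-lite", top_k=1)
print(reranked.results[0].document)
```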
Proud of the team for what we have achieved! Joining MongoDB opens a new chapter of innovation to reshape the landscape of information retrieval and semantic search.
The risk of hallucinations currently holds enterprises back from deploying AI apps. Excited to share that VoyageAI has joined MongoDB to make high-quality AI-powered search and retrieval easy, enabling organizations to build trustworthy AI apps at scale. https://t.co/8I2x6OLzwR
Since the difference is astonishing, I have questions for the authors: Did you use the same setup for the baselines and your models? Did your model use the same evaluation prompt as the one mentioned on HF? Most importantly, can you release code to reproduce the results?
The paper does not provide code, so I used the standard mteb (https://t.co/0a36upMiol) and the prompt from the HF page for the SFR model. The reported voyage-code-2 result is 23.3% worse than my reproduction, while the reported SFR-Embedding-Code-2B_R result is 14.5% better than my reproduction.
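For anyone wanting to re-run this, a sketch of the kind of mteb harness involved, with a thin wrapper around the Voyage API. The wrapper and the placeholder task are mine; substitute the exact code-retrieval tasks from the paper, and note that newer mteb versions may expect task objects via mteb.get_tasks rather than name strings.

```python
import numpy as np
import voyageai
from mteb import MTEB

class VoyageWrapper:
    """Thin mteb-compatible wrapper around the Voyage embedding API."""
    def __init__(self, model_name: str):
        self.client = voyageai.Client()   # reads VOYAGE_API_KEY
        self.model_name = model_name

    def encode(self, sentences, **kwargs):
        embs = []
        for i in range(0, len(sentences), 128):   # arbitrary batch size
            res = self.client.embed(list(sentences[i:i + 128]), model=self.model_name)
            embs.extend(res.embeddings)
        return np.asarray(embs)

# "SciFact" is only a placeholder retrieval task known to exist in mteb;
# swap in the code-retrieval tasks the paper evaluates on.
evaluation = MTEB(tasks=["SciFact"])
evaluation.run(VoyageWrapper("voyage-code-2"), output_folder="results/voyage-code-2")
```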