
Hong Liu
@HongLiu9903
310 Followers · 49 Following · 7 Media · 59 Statuses
Co-founder, Lead Research @VoyageAI.
Joined October 2021
🚀 Unveiling the first synthetic pretraining method that doesn’t rely on teacher distillation. Big shoutout to @ZitongYang0 @Aonan12 and the team!
📜 Paper on a new pretraining paradigm: Synthetic Bootstrapped Pretraining (SBP). SBP goes beyond next-token supervision within a single document by leveraging inter-document correlations to synthesize new data for training — no teacher needed. Validation: 1T data + a 3B model from scratch. 🧵
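For readers who want to see the shape of the idea, here is a rough sketch of the SBP loop as described in the tweet. The `model` interface, function names, and similarity threshold are all hypothetical illustrations, not the paper's code.

```python
# Hypothetical sketch of the SBP loop: mine correlated document pairs, learn a
# cross-document conditional with the same model, synthesize, and keep pretraining.
import numpy as np

def nearest_neighbor_pairs(doc_embeddings: np.ndarray, threshold: float = 0.8):
    """Mine correlated document pairs by cosine similarity.

    Assumes `doc_embeddings` is row-normalized, shape (n_docs, dim).
    """
    sims = doc_embeddings @ doc_embeddings.T
    np.fill_diagonal(sims, -1.0)          # exclude trivial self-pairs
    nn = sims.argmax(axis=1)              # best neighbor per document
    return [(i, int(j)) for i, j in enumerate(nn) if sims[i, j] >= threshold]

def synthetic_bootstrapped_pretraining(corpus, model, rounds=1):
    """`model` is an assumed interface: embed_documents, finetune_conditional,
    sample_given, and pretrain are placeholders for the real training stack."""
    for _ in range(rounds):
        pairs = nearest_neighbor_pairs(model.embed_documents(corpus))
        # Cross-document supervision: learn p(doc_b | doc_a) with the same model.
        model.finetune_conditional([(corpus[a], corpus[b]) for a, b in pairs])
        # Synthesize new documents from the learned conditional -- no teacher model.
        corpus = corpus + [model.sample_given(corpus[a]) for a, _ in pairs]
        # Resume ordinary next-token pretraining on the enlarged corpus.
        model.pretrain(corpus)
    return model
```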
Instruction-following takes reranker capabilities to the next level 🔥 Huge thanks to @zhmeishi and @AkshayGoindani1 for driving this leap forward!
📣 Announcing rerank-2.5 and 2.5-lite: our latest generation of rerankers! • First reranker with instruction-following capabilities • rerank-2.5 and 2.5-lite are 7.94% and 7.16% more accurate than Cohere Reranker v3.5 • Additional 8.13% and 7.55% performance gain with
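As context for the instruction-following claim, a minimal sketch of calling a Voyage reranker from the Python client. Folding the instruction into the query string is my assumption about how to pass it; check the official docs for the supported mechanism.

```python
import voyageai

vo = voyageai.Client()  # reads VOYAGE_API_KEY from the environment

documents = [
    "Q2 revenue grew 12% year over year, driven by enterprise contracts.",
    "The company repurchased $50M of shares in Q2.",
    "Headcount was flat quarter over quarter.",
]

# Assumption: the instruction is prepended to the query text.
instruction = "Prefer passages with concrete financial figures."
query = f"{instruction}\nWhat drove revenue growth last quarter?"

reranking = vo.rerank(query, documents, model="rerank-2.5", top_k=2)
for r in reranking.results:
    print(round(r.relevance_score, 3), r.document)
```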
We just launched voyage-context-3, a new embedding model that gives AI a full-document view while preserving chunk-level precision, offering better retrieval performance than leading alternatives. When building AI that reads and reasons over documents (such as reports,
voyage-context-3 marks a paradigm shift that reduces reliance on chunking. The idea dates back to last year, when @Yujie_Qian and I discussed how to embed contextual information without breaking VectorDBs. It turns out a new training objective is the key to productionizing the idea.
📢 voyage-context-3: contextualized chunk embeddings - Automatically captures chunk-level detail & global doc context, w/o metadata augmentation - Beats OpenAI-v3-large by 14.24% & Cohere-v4 by 7.89% - Binary 512-dim matches OpenAI (float, 3072-dim) in accuracy, but 192x cheaper in
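A minimal sketch of what calling contextualized chunk embeddings might look like from the Python client. The `contextualized_embed` method name, its arguments, and the result shape follow my reading of the announcement and may differ from the shipped API; verify against the current docs.

```python
import voyageai

vo = voyageai.Client()

# One document, pre-split into chunks. Each chunk gets its own vector, but the
# model sees the whole document, so every vector also carries global context.
doc_chunks = [
    "Apple reported record Q4 revenue.",
    "Services grew 16%, offsetting a decline in iPad sales.",
    "Guidance for the holiday quarter was above consensus.",
]

result = vo.contextualized_embed(
    inputs=[doc_chunks],          # list of documents, each a list of chunks
    model="voyage-context-3",
    input_type="document",
)
chunk_vectors = result.results[0].embeddings  # one vector per chunk
print(len(chunk_vectors), "chunk vectors, dim", len(chunk_vectors[0]))
```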
6 / 6 Come see the future of multimodal embeddings! @HaonanC80190 @luo_yuping
📝 Paper: https://t.co/m0535UY4Hx
💻 Code: https://t.co/sJQE4pJgWa
🤖 Models: https://t.co/50oJY7lJbv
📚 Datasets: https://t.co/oQBsywdgUQ
🌐 Homepage: https://t.co/O1kixSp32w
5 / 6 The most exciting implication: We can now continuously feed massive, unlabeled web-scale multimodal data into MoCa to constantly improve our embeddings. When it comes to multimodal embedding, scaling might be all you need.
4 / 6 The result? MoCa efficiently transforms causal VLMs into powerful bidirectional embedding models. We've set the SoTA for single-vector embeddings on MMEB and ViDoRe V2. Our 3B-parameter model even matches or outperforms strong 7B+ baselines! 🤯
3 / 6 MoCa's approach: We jointly reconstruct masked image patches and text tokens using a bidirectional attention mechanism. This captures the rich context from interleaved data. After this continual pre-training, a lightweight contrastive fine-tuning stage suffices.
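A toy PyTorch sketch of the joint-reconstruction objective described in this step, assuming a generic bidirectional encoder over interleaved patch/text embeddings. Dimensions, heads, and the loss weighting are illustrative; this is not the MoCa codebase.

```python
import torch
import torch.nn as nn

d = 256
encoder = nn.TransformerEncoder(  # bidirectional: no causal mask is applied
    nn.TransformerEncoderLayer(d_model=d, nhead=8, batch_first=True),
    num_layers=2,
)
text_head = nn.Linear(d, 32000)   # predicts masked text token ids
patch_head = nn.Linear(d, d)      # regresses masked image-patch embeddings
mask_emb = nn.Parameter(torch.zeros(d))

def joint_reconstruction_loss(seq, is_text, token_ids, mask_ratio=0.25):
    # seq: (B, L, d) interleaved patch+text embeddings; is_text: (B, L) bool
    masked = torch.rand(seq.shape[:2]) < mask_ratio
    x = torch.where(masked.unsqueeze(-1), mask_emb.expand_as(seq), seq)
    h = encoder(x)                           # every position sees full context
    txt, img = masked & is_text, masked & ~is_text
    loss_txt = nn.functional.cross_entropy(text_head(h[txt]), token_ids[txt])
    loss_img = nn.functional.mse_loss(patch_head(h[img]), seq[img])
    return loss_txt + loss_img               # jointly reconstruct both modalities

B, L = 2, 64
print(joint_reconstruction_loss(
    torch.randn(B, L, d),
    torch.rand(B, L) < 0.5,
    torch.randint(0, 32000, (B, L)),
))
```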
2 / 6 Why the change? Current multimodal embeddings often struggle. They rely on:
• Causal attention (like in LLMs), which might be suboptimal for embeddings.
• Contrastive learning, which needs curated, labeled pairs and is hard to scale.
We saw a better way.
1/6 Introducing MoCa, a new method for continual pre-training of multimodal embeddings! 🚀 MoCa is the first to effectively scale with unlabeled interleaved image-text data, marking a paradigm shift in multimodal embeddings. Paper, code, & checkpoints! 👇 #AI #Multimodal #ML #NLP
🔥 Mind-blown by embedding model progress! In the past two months, we made voyage-3.5-lite outperform its 3x larger predecessor, voyage-3. The secret? Distilling from a larger model (voyage-3-large) is incredibly effective. The future of embeddings is here!
📢 Meet voyage-3.5 and voyage-3.5-lite! • flexible dim. and quantizations • voyage-3.5 & 3.5-lite (int8, 2048 dim.) are 8% & 6% more accurate than OpenAI-v3-large, and 2.2x & 6.5x cheaper, resp. Also 83% less vectorDB cost! • 3.5-lite ~ Cohere-v4 in quality, but 83% cheaper.
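On the distillation point: a minimal sketch of one common recipe, where a small student is trained to match a frozen teacher's embedding geometry on unlabeled text. The toy encoders, dimensions, and cosine objective are my illustration, not Voyage's actual setup.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

teacher_dim, student_dim = 1024, 512
teacher = nn.EmbeddingBag(30000, teacher_dim)   # stand-in for voyage-3-large
student = nn.EmbeddingBag(30000, student_dim)   # stand-in for voyage-3.5-lite
proj = nn.Linear(teacher_dim, student_dim, bias=False)  # align dimensions
teacher.requires_grad_(False)                   # teacher stays frozen

opt = torch.optim.AdamW(list(student.parameters()) + list(proj.parameters()), lr=1e-4)

for step in range(100):
    token_ids = torch.randint(0, 30000, (32, 64))   # stand-in text batch
    with torch.no_grad():
        t = teacher(token_ids)                      # teacher embeddings
    s = student(token_ids)                          # student embeddings
    # Maximize cosine similarity between student and projected teacher vectors.
    loss = 1 - F.cosine_similarity(s, proj(t), dim=-1).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```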
We trained voyage-code-3 back in November last year. So far, no other model even comes close in code retrieval. Happy to see it shine in the brilliant @continuedev code assistants!
@metcalfc wrote a deep dive on why your custom AI code assistant should include embeddings and a reranker from @VoyageAI🥇
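The two-stage pattern the post describes (dense retrieval with voyage-code-3, then a reranker over the shortlist) looks roughly like this. The toy corpus and the reranker model string are my assumptions.

```python
import numpy as np
import voyageai

vo = voyageai.Client()

corpus = [
    "def quicksort(arr): ...",
    "class LRUCache: ...",
    "async def fetch_url(session, url): ...",
]

# Stage 1: embedding retrieval with voyage-code-3 (cosine similarity).
doc_vecs = np.asarray(
    vo.embed(corpus, model="voyage-code-3", input_type="document").embeddings)
query = "least recently used cache implementation"
q_vec = np.asarray(
    vo.embed([query], model="voyage-code-3", input_type="query").embeddings[0])
scores = doc_vecs @ q_vec / (np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(q_vec))
candidates = [corpus[i] for i in np.argsort(scores)[::-1][:2]]

# Stage 2: rerank the shortlist for higher precision.
reranked = vo.rerank(query, candidates, model="rerank-2.5-lite", top_k=1)
print(reranked.results[0].document)
```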
Proud of the team for what we have achieved! Joining MongoDB opens a new chapter of innovation to reshape the landscape of information retrieval and semantic search.
The risk of hallucinations currently holds enterprises back from deploying AI apps. Excited to share that VoyageAI has joined MongoDB to make high-quality AI-powered search and retrieval easy, enabling organizations to build trustworthy AI apps at scale. https://t.co/8I2x6OLzwR
Since the difference is astonishing, I have questions for the authors: Did you use the same setup for the baselines and your models? Did your model use the same evaluation prompt as the one mentioned on HF? Most importantly, can you release code to reproduce the results?
The paper does not provide code, so I used the standard mteb (https://t.co/0a36upMiol) and the prompt from the HF page for the SFR model. The reported voyage-code-2 result is 23.3% worse than my reproduction, while the reported SFR-Embedding-Code-2B_R result is 14.5% better than my reproduction.
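For anyone wanting to re-run this, a sketch of the kind of mteb harness involved, with a thin wrapper around the Voyage API. The wrapper and the placeholder task are mine; substitute the exact code-retrieval tasks from the paper, and note that newer mteb versions may expect task objects via mteb.get_tasks rather than name strings.

```python
import numpy as np
import voyageai
from mteb import MTEB

class VoyageWrapper:
    """Thin mteb-compatible wrapper around the Voyage embedding API."""
    def __init__(self, model_name: str):
        self.client = voyageai.Client()   # reads VOYAGE_API_KEY
        self.model_name = model_name

    def encode(self, sentences, **kwargs):
        embs = []
        for i in range(0, len(sentences), 128):   # arbitrary batch size
            res = self.client.embed(list(sentences[i:i + 128]), model=self.model_name)
            embs.extend(res.embeddings)
        return np.asarray(embs)

# "SciFact" is only a placeholder retrieval task known to exist in mteb;
# swap in the code-retrieval tasks the paper evaluates on.
evaluation = MTEB(tasks=["SciFact"])
evaluation.run(VoyageWrapper("voyage-code-2"), output_folder="results/voyage-code-2")
```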