Ruohong Zhang

@RuohongZhang

Followers
117
Following
89
Media
12
Statuses
18

Grokking @xAI ; PhD @cmu LTI

Joined August 2020
@RuohongZhang
Ruohong Zhang
10 months
[p1] Improve Visual Language Model Chain-of-Thought Reasoning. Paper link: Project page (to be updated upon approval of release): Content: 1. We distill 193K CoT data. 2. Train with SFT. 3. DPO to further improve performance.
Tweet media one
3
38
215
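For orientation, here is a minimal sketch of the three-stage recipe described in this thread (distill CoT, SFT, DPO). All function and field names are illustrative placeholders, not the paper's released code.

```python
# Minimal sketch of the three-stage recipe (hypothetical helper names).

def distill_cot(examples, teacher_generate, extract_answer):
    """Stage 1: generate CoT rationales with a stronger teacher model and
    keep only rationales whose final answer matches the annotation."""
    kept = []
    for ex in examples:
        rationale = teacher_generate(ex["image"], ex["question"])
        if extract_answer(rationale) == ex["answer"]:
            kept.append({**ex, "rationale": rationale})
    return kept

def three_stage_pipeline(examples, teacher_generate, extract_answer,
                         sft_train, build_pairs, dpo_train, base_model):
    cot_data = distill_cot(examples, teacher_generate, extract_answer)  # Stage 1
    sft_model = sft_train(base_model, cot_data)                         # Stage 2: SFT on CoT
    pairs = build_pairs(sft_model, examples)                            # chosen vs. rejected CoT
    return dpo_train(sft_model, pairs)                                  # Stage 3: DPO
```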
@RuohongZhang
Ruohong Zhang
6 months
RT @tingchenai: Yes, the native voice experience is coming to Grok soon! Let us know what specific features you want to see (or hear)!
0
23
0
@RuohongZhang
Ruohong Zhang
6 months
RT @lmarena_ai: As part of Chatbot Arena's graduation🎓, we're excited to announce that we changed our X handle to @lmarena_ai! For open-sou….
0
16
0
@RuohongZhang
Ruohong Zhang
10 months
[p6] DPO credit assignment on VQA. The DPO model assigns negative scores to the first hallucinated item or the wrong piece of associated knowledge at the token level, even though it is trained only on a binary reward.
Tweet media one
8
1
6
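The token-level credit assignment described above can be inspected through DPO's implicit per-token reward, beta * (log pi_theta - log pi_ref). A minimal sketch, assuming text-only causal-LM policy and reference models (image inputs and the paper's exact setup are omitted):

```python
import torch
import torch.nn.functional as F

def token_level_dpo_scores(policy, reference, tokenizer, prompt, response, beta=0.1):
    """Per-token implicit DPO reward for the response tokens."""
    enc = tokenizer(prompt + response, return_tensors="pt")
    prompt_len = len(tokenizer(prompt)["input_ids"])
    with torch.no_grad():
        logits_pi = policy(**enc).logits     # [1, T, V]
        logits_ref = reference(**enc).logits
    ids = enc["input_ids"][0, 1:]            # next-token targets, [T-1]
    logp_pi = F.log_softmax(logits_pi[0, :-1], dim=-1).gather(1, ids[:, None])[:, 0]
    logp_ref = F.log_softmax(logits_ref[0, :-1], dim=-1).gather(1, ids[:, None])[:, 0]
    scores = beta * (logp_pi - logp_ref)     # per-token implicit reward
    return scores[prompt_len - 1:]           # keep only response tokens
```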
@RuohongZhang
Ruohong Zhang
10 months
[p5] The DPO model can serve as a reward model. Our DPO model can rank candidates on MMMU and other datasets, whereas RLAIF does not show gains when used as a reward model.
Tweet media one
1
1
5
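For reference, repurposing a DPO policy as a reward model typically means scoring each of K candidates with the standard DPO implicit reward and picking the highest-scoring one; whether the paper uses exactly this quantity is an assumption here.

```latex
\hat{r}(x, y) = \beta\left[\log \pi_\theta(y \mid x) - \log \pi_{\mathrm{ref}}(y \mid x)\right],
\qquad
y^{\star} = \arg\max_{k \in \{1,\dots,K\}} \hat{r}(x, y_k)
```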
@RuohongZhang
Ruohong Zhang
10 months
[p4] RL shows improved performance and generalization. Preference pairs are built by comparing the predicted answer with the annotated answer (similar to math reasoning). Our DPO data outperforms SOTA RLAIF data on VQA datasets.
Tweet media one
Tweet media two
1
0
4
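One plausible reading of how the preference pairs are built: sample several CoT responses per question, treat answer-matching ones as chosen and the rest as rejected. The answer-extraction logic and helper names below are illustrative, not the paper's exact procedure.

```python
import itertools
import re

def final_answer(cot_text):
    """Naive answer extractor; real parsing depends on the prompt format."""
    match = re.search(r"answer is\s*(.+)", cot_text, flags=re.IGNORECASE)
    return match.group(1).strip().rstrip(".") if match else cot_text.strip()

def build_preference_pairs(question, annotation, sampled_responses, max_pairs=4):
    """Chosen = responses whose final answer matches the annotation."""
    correct = [r for r in sampled_responses if final_answer(r) == annotation]
    wrong = [r for r in sampled_responses if final_answer(r) != annotation]
    pairs = [{"prompt": question, "chosen": c, "rejected": w}
             for c, w in itertools.product(correct, wrong)]
    return pairs[:max_pairs]
```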
@RuohongZhang
Ruohong Zhang
10 months
[p3] SFT improves CoT reasoning. We show: 1. training on direct prediction gives a positive but limited gain on CoT reasoning; 2. SFT with CoT + direct gives the best performance across datasets.
Tweet media one
1
0
4
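To illustrate the CoT + direct mixture used for SFT, here is a hypothetical formatting of the two supervision types for one question; the prompt wording is an assumption, not the paper's template.

```python
def to_sft_examples(question, answer, rationale):
    """Return one direct-answer target and one CoT target for the same question."""
    direct = {
        "prompt": f"{question}\nAnswer with a single word or phrase.",
        "target": answer,
    }
    cot = {
        "prompt": f"{question}\nThink step by step, then give the answer.",
        "target": f"{rationale} The answer is {answer}.",
    }
    return [direct, cot]
```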
@RuohongZhang
Ruohong Zhang
10 months
[p2] Distillation of 193K CoT examples on 9 VQA datasets: common world knowledge (A-OKVQA), chart interpretation (ChartQA), document information localization (DocVQA, InfoVQA), real-world text extraction (TextVQA), scientific reasoning (AI2D, SQA), and math (MathVision, G-LLaVA).
Tweet media one
Tweet media two
1
0
5
@RuohongZhang
Ruohong Zhang
1 year
RT @natolambert: It's not PPO > DPO, it's policy-generated data > stale data. In this paper, we answer this question by performing a rigo…
0
78
0
@RuohongZhang
Ruohong Zhang
1 year
RT @stefan_fee: Crazy finding!!!!! -> "Without introducing any additional data or advanced training techniques, and merely by reformatt…
0
23
0
@RuohongZhang
Ruohong Zhang
1 year
[p6] Additionally, we provide: 1) a 900K detailed video caption dataset, and 2) a high-quality QA evaluation benchmark for video LMMs. Check the project page for more details.
Tweet media one
Tweet media two
0
0
1
@RuohongZhang
Ruohong Zhang
1 year
[p5] We adopt full-model DPO training for 3 epochs with lr 5e-7, which gives an 8.1% gain over the SFT counterpart.
Tweet media one
1
0
1
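The stated hyperparameters, expressed as an illustrative config dict; the field names are placeholders to be mapped onto whatever DPO trainer is used, and beta is an assumed value that is not given in the post.

```python
# Hyperparameters from the post above, plus one assumed value.
dpo_config = {
    "learning_rate": 5e-7,    # lr reported in the post
    "num_train_epochs": 3,    # 3 epochs
    "tune_full_model": True,  # full-model (not adapter-only) training
    "beta": 0.1,              # assumed KL-penalty strength; not stated in the post
}
```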
@RuohongZhang
Ruohong Zhang
1 year
[p4] Our reward mechanism, which uses the detailed caption as a proxy for the video, is well aligned with a GPT-4V reward that takes video frames as input. We measure quality by the consistency of preference agreement with GPT-4V.
Tweet media one
1
0
1
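A minimal sketch of the preference-agreement measure mentioned above: count how often the caption-proxy reward and the frame-based GPT-4V reward prefer the same response in a pair. The input format and names are assumptions.

```python
def preference_agreement(proxy_scores, gpt4v_scores):
    """proxy_scores / gpt4v_scores: lists of (score_a, score_b), one tuple per pair."""
    assert len(proxy_scores) == len(gpt4v_scores)
    comparable = [
        ((pa, pb), (ga, gb))
        for (pa, pb), (ga, gb) in zip(proxy_scores, gpt4v_scores)
        if pa != pb and ga != gb            # skip ties under either reward
    ]
    if not comparable:
        return float("nan")
    agree = sum((pa > pb) == (ga > gb) for (pa, pb), (ga, gb) in comparable)
    return agree / len(comparable)
```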
@RuohongZhang
Ruohong Zhang
1 year
[p3] Video LMMs can be made more effective with DPO training using a language-model reward that leverages detailed video captions as proxies for video content, leading to cost-effective preference optimization for video LMM alignment.
Tweet media one
1
0
1
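A sketch of the caption-as-proxy reward idea: a text-only judge model scores a candidate answer against the detailed video caption instead of the raw frames. The prompt template and judge interface below are illustrative assumptions, not the paper's exact setup.

```python
JUDGE_TEMPLATE = (
    "Video description:\n{caption}\n\n"
    "Question: {question}\n"
    "Candidate answer: {response}\n\n"
    "Rate the answer's correctness and faithfulness to the video description "
    "on a scale of 1-5. Reply with a single number."
)

def caption_proxy_reward(judge, caption, question, response):
    """Score a response using a text-only LLM call (`judge`) and the caption as proxy."""
    prompt = JUDGE_TEMPLATE.format(caption=caption, question=question, response=response)
    reply = judge(prompt)
    try:
        return float(reply.strip().split()[0])
    except (ValueError, IndexError):
        return 0.0                           # unparseable judgement
```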
@RuohongZhang
Ruohong Zhang
1 year
[p2] We introduce a pipeline to develop instruction-following video LMMs from large-scale, high-quality video captions, following 1) caption pre-training, 2) instruction SFT, and 3) direct preference optimization (DPO).
Tweet media one
1
0
1
@RuohongZhang
Ruohong Zhang
1 year
[p1] 🐕Direct Preference Optimization of Video Large Multimodal Models from Language Model Reward🐕. Paper link: Project page: How can we effectively align video large multimodal models (LMMs) with preference modeling?
Tweet media one
2
16
66
@RuohongZhang
Ruohong Zhang
1 year
RT @EdwardSun0909: 🌟Easy-to-Hard Generalization: Scalable Alignment Beyond Human Supervision 🌟. How can we keep im….
0
57
0