Ruohong Zhang

@RuohongZhang

Followers
117
Following
89
Media
12
Statuses
18

Grokking @xAI ; PhD @cmu LTI

Joined August 2020
@RuohongZhang
Ruohong Zhang
10 months
[p1] Improve Visual Language Model Chain-of-Thought Reasoning. Paper link: Project page (to be updated upon approval of release): Content: 1. We distill 193K CoT data. 2. Train with SFT. 3. DPO to further improve performance.
Tweet media one
3
38
215
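For orientation, here is a minimal sketch of the three-stage recipe described in this thread (distill CoT, SFT, DPO). All function and field names are illustrative placeholders, not the paper's released code.

```python
# Minimal sketch of the three-stage recipe (hypothetical helper names).

def distill_cot(examples, teacher_generate, extract_answer):
    """Stage 1: generate CoT rationales with a stronger teacher model and
    keep only rationales whose final answer matches the annotation."""
    kept = []
    for ex in examples:
        rationale = teacher_generate(ex["image"], ex["question"])
        if extract_answer(rationale) == ex["answer"]:
            kept.append({**ex, "rationale": rationale})
    return kept

def three_stage_pipeline(examples, teacher_generate, extract_answer,
                         sft_train, build_pairs, dpo_train, base_model):
    cot_data = distill_cot(examples, teacher_generate, extract_answer)  # Stage 1
    sft_model = sft_train(base_model, cot_data)                         # Stage 2: SFT on CoT
    pairs = build_pairs(sft_model, examples)                            # chosen vs. rejected CoT
    return dpo_train(sft_model, pairs)                                  # Stage 3: DPO
```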
@RuohongZhang
Ruohong Zhang
6 months
RT @tingchenai: Yes, the native voice experience is coming to Grok soon! Let us know what specific features you want to see (or hear)!
0
23
0
@RuohongZhang
Ruohong Zhang
6 months
RT @lmarena_ai: As part of Chatbot Arena's graduation🎓, we're excited to announce that we changed our X handle to @lmarena_ai! For open-sou….
0
16
0
@RuohongZhang
Ruohong Zhang
10 months
[p6] DPO credit assignment on VQA. The DPO model assigns negative scores to the first hallucinated item or the wrong piece of associated knowledge at the token level, even though it is trained only on a binary reward.
Tweet media one
8
1
6
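The token-level credit assignment described above can be inspected through DPO's implicit per-token reward, beta * (log pi_theta - log pi_ref). A minimal sketch, assuming text-only causal-LM policy and reference models (image inputs and the paper's exact setup are omitted):

```python
import torch
import torch.nn.functional as F

def token_level_dpo_scores(policy, reference, tokenizer, prompt, response, beta=0.1):
    """Per-token implicit DPO reward for the response tokens."""
    enc = tokenizer(prompt + response, return_tensors="pt")
    prompt_len = len(tokenizer(prompt)["input_ids"])
    with torch.no_grad():
        logits_pi = policy(**enc).logits     # [1, T, V]
        logits_ref = reference(**enc).logits
    ids = enc["input_ids"][0, 1:]            # next-token targets, [T-1]
    logp_pi = F.log_softmax(logits_pi[0, :-1], dim=-1).gather(1, ids[:, None])[:, 0]
    logp_ref = F.log_softmax(logits_ref[0, :-1], dim=-1).gather(1, ids[:, None])[:, 0]
    scores = beta * (logp_pi - logp_ref)     # per-token implicit reward
    return scores[prompt_len - 1:]           # keep only response tokens
```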
@RuohongZhang
Ruohong Zhang
10 months
[p5] The DPO model can serve as a reward model. Our DPO model can rank candidates on MMMU and other datasets, whereas RLAIF does not show gains when used as a reward model.
Tweet media one
1
1
5
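For reference, repurposing a DPO policy as a reward model typically means scoring each of K candidates with the standard DPO implicit reward and picking the highest-scoring one; whether the paper uses exactly this quantity is an assumption here.

```latex
\hat{r}(x, y) = \beta\left[\log \pi_\theta(y \mid x) - \log \pi_{\mathrm{ref}}(y \mid x)\right],
\qquad
y^{\star} = \arg\max_{k \in \{1,\dots,K\}} \hat{r}(x, y_k)
```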
@RuohongZhang
Ruohong Zhang
10 months
[p4] RL shows improved performance and generalization. Preference pairs are built by comparing the predicted answer with the annotated answer (similar to math reasoning). Our DPO data outperforms SOTA RLAIF data on VQA datasets.
Tweet media one
Tweet media two
1
0
4
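One plausible reading of how the preference pairs are built: sample several CoT responses per question, treat answer-matching ones as chosen and the rest as rejected. The answer-extraction logic and helper names below are illustrative, not the paper's exact procedure.

```python
import itertools
import re

def final_answer(cot_text):
    """Naive answer extractor; real parsing depends on the prompt format."""
    match = re.search(r"answer is\s*(.+)", cot_text, flags=re.IGNORECASE)
    return match.group(1).strip().rstrip(".") if match else cot_text.strip()

def build_preference_pairs(question, annotation, sampled_responses, max_pairs=4):
    """Chosen = responses whose final answer matches the annotation."""
    correct = [r for r in sampled_responses if final_answer(r) == annotation]
    wrong = [r for r in sampled_responses if final_answer(r) != annotation]
    pairs = [{"prompt": question, "chosen": c, "rejected": w}
             for c, w in itertools.product(correct, wrong)]
    return pairs[:max_pairs]
```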
@RuohongZhang
Ruohong Zhang
10 months
[p3] SFT improves CoT reasoning. We show: 1. training on direct prediction gives a positive but limited gain on CoT reasoning; 2. SFT with CoT + direct gives the best performance across datasets.
Tweet media one
1
0
4
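To illustrate the CoT + direct mixture used for SFT, here is a hypothetical formatting of the two supervision types for one question; the prompt wording is an assumption, not the paper's template.

```python
def to_sft_examples(question, answer, rationale):
    """Return one direct-answer target and one CoT target for the same question."""
    direct = {
        "prompt": f"{question}\nAnswer with a single word or phrase.",
        "target": answer,
    }
    cot = {
        "prompt": f"{question}\nThink step by step, then give the answer.",
        "target": f"{rationale} The answer is {answer}.",
    }
    return [direct, cot]
```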
@RuohongZhang
Ruohong Zhang
10 months
[p2] Distillation of 193K CoT examples on 9 VQA datasets: common world knowledge (A-OKVQA), chart interpretation (ChartQA), document information localization (DocVQA, InfoVQA), real-world text extraction (TextVQA), scientific reasoning (AI2D, SQA), and math (MathVision, G-LLaVA).
Tweet media one
Tweet media two
1
0
5
@RuohongZhang
Ruohong Zhang
1 year
RT @natolambert: It's not PPO > DPO, it's policy-generated data > stale data. In this paper, we answer this question by performing a rigo…
0
78
0
@RuohongZhang
Ruohong Zhang
1 year
RT @stefan_fee: Crazy finding!!!!! -> "Without introducing any additional data or advanced training techniques, and merely by reformatt…
0
23
0
@RuohongZhang
Ruohong Zhang
1 year
[p6] Additionally, we provide: 1) a 900K detailed video caption dataset, and 2) a high-quality QA evaluation benchmark for video LMMs. Check the project page for more details.
Tweet media one
Tweet media two
0
0
1
@RuohongZhang
Ruohong Zhang
1 year
[p5] We adopt full-model DPO training for 3 epochs with lr 5e-7, which gives an 8.1% gain over the SFT counterpart.
Tweet media one
1
0
1
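The stated hyperparameters, expressed as an illustrative config dict; the field names are placeholders to be mapped onto whatever DPO trainer is used, and beta is an assumed value that is not given in the post.

```python
# Hyperparameters from the post above, plus one assumed value.
dpo_config = {
    "learning_rate": 5e-7,    # lr reported in the post
    "num_train_epochs": 3,    # 3 epochs
    "tune_full_model": True,  # full-model (not adapter-only) training
    "beta": 0.1,              # assumed KL-penalty strength; not stated in the post
}
```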
@RuohongZhang
Ruohong Zhang
1 year
[p4] Our reward mechanism, which uses the detailed caption as a proxy for the video, is well aligned with a GPT-4V reward that takes video frames as input. We measure quality by the consistency of preference agreement with GPT-4V.
Tweet media one
1
0
1
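A minimal sketch of the preference-agreement measure mentioned above: count how often the caption-proxy reward and the frame-based GPT-4V reward prefer the same response in a pair. The input format and names are assumptions.

```python
def preference_agreement(proxy_scores, gpt4v_scores):
    """proxy_scores / gpt4v_scores: lists of (score_a, score_b), one tuple per pair."""
    assert len(proxy_scores) == len(gpt4v_scores)
    comparable = [
        ((pa, pb), (ga, gb))
        for (pa, pb), (ga, gb) in zip(proxy_scores, gpt4v_scores)
        if pa != pb and ga != gb            # skip ties under either reward
    ]
    if not comparable:
        return float("nan")
    agree = sum((pa > pb) == (ga > gb) for (pa, pb), (ga, gb) in comparable)
    return agree / len(comparable)
```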
@RuohongZhang
Ruohong Zhang
1 year
[p3] Video LMMs can be made more effective with DPO training using a language-model reward that leverages detailed video captions as proxies for video content, leading to cost-effective preference optimization for video LMM alignment.
Tweet media one
1
0
1
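A sketch of the caption-as-proxy reward idea: a text-only judge model scores a candidate answer against the detailed video caption instead of the raw frames. The prompt template and judge interface below are illustrative assumptions, not the paper's exact setup.

```python
JUDGE_TEMPLATE = (
    "Video description:\n{caption}\n\n"
    "Question: {question}\n"
    "Candidate answer: {response}\n\n"
    "Rate the answer's correctness and faithfulness to the video description "
    "on a scale of 1-5. Reply with a single number."
)

def caption_proxy_reward(judge, caption, question, response):
    """Score a response using a text-only LLM call (`judge`) and the caption as proxy."""
    prompt = JUDGE_TEMPLATE.format(caption=caption, question=question, response=response)
    reply = judge(prompt)
    try:
        return float(reply.strip().split()[0])
    except (ValueError, IndexError):
        return 0.0                           # unparseable judgement
```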
@RuohongZhang
Ruohong Zhang
1 year
[p2] We introduce a pipeline to develop instruction-following video LMMs from large-scale, high-quality video captions, following 1) caption pre-training, 2) instruction SFT, and 3) direct preference optimization (DPO).
Tweet media one
1
0
1
@RuohongZhang
Ruohong Zhang
1 year
[p1] 🐕Direct Preference Optimization of Video Large Multimodal Models from Language Model Reward🐕. Paper link: Project page: How can we effectively align video large multimodal models (LMMs) with preference modeling?
Tweet media one
2
16
66
@RuohongZhang
Ruohong Zhang
1 year
RT @EdwardSun0909: 🌟Easy-to-Hard Generalization: Scalable Alignment Beyond Human Supervision 🌟. How can we keep im….
0
57
0