Cheng-Yu Hsieh Profile
Cheng-Yu Hsieh

@cydhsieh

Followers: 585 · Following: 142 · Media: 26 · Statuses: 54

PhD student @UWcse

Seattle, USA
Joined September 2022
@cydhsieh
Cheng-Yu Hsieh
4 months
Excited to introduce FocalLens: an instruction tuning framework that turns existing VLMs/MLLMs into text-conditioned vision encoders that produce visual embeddings focusing on relevant visual information given natural language instructions! 📢: @HPouransari will be presenting
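As a rough illustration of what such a text-conditioned encoder enables (this is not the FocalLens API; the encode_image interface and the random-projection stand-in below are purely hypothetical), instruction-conditioned embeddings can drive instruction-aware image retrieval with plain cosine similarity:

```python
# Hypothetical sketch: instruction-aware image-image retrieval with a
# text-conditioned vision encoder. The encoder below is a random-projection
# stand-in so the example runs end to end; it is NOT the FocalLens model.
import numpy as np

def encode_image(image: np.ndarray, instruction: str) -> np.ndarray:
    """Placeholder for a text-conditioned vision encoder: same image,
    different instruction -> different embedding."""
    rng = np.random.default_rng(abs(hash(instruction)) % (2**32))
    proj = rng.standard_normal((image.size, 128))
    emb = image.flatten() @ proj
    return emb / (np.linalg.norm(emb) + 1e-8)

# Toy "gallery" of images (random pixels stand in for real data).
rng = np.random.default_rng(0)
gallery = [rng.random((8, 8)) for _ in range(5)]
query_img = gallery[2] + 0.01 * rng.random((8, 8))  # near-duplicate of gallery[2]

instruction = "focus on the number of objects in the image"
query_emb = encode_image(query_img, instruction)
gallery_embs = np.stack([encode_image(img, instruction) for img in gallery])

# Rank gallery images by cosine similarity under this instruction.
scores = gallery_embs @ query_emb
print("ranking:", np.argsort(-scores))
```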
@cydhsieh
Cheng-Yu Hsieh
2 months
RT @jae_sung_park96: 🔥We are excited to present our work Synthetic Visual Genome (SVG) at #CVPR25 tomorrow! 🕸️ Dense scene graph with d…
@cydhsieh
Cheng-Yu Hsieh
2 months
RT @ajratner: Agentic AI will transform every enterprise, but only if agents are trusted experts. The key: Evaluation & tuning on specializ…
@cydhsieh
Cheng-Yu Hsieh
3 months
RT @PeterSushko: 1/8🧵 Thrilled to announce RealEdit (to appear in CVPR 2025)! We introduce a real-world image-editing dataset sourced from…
@cydhsieh
Cheng-Yu Hsieh
4 months
RT @jramapuram: Stop by poster #596 at 10A-1230P tomorrow (Fri 25 April) at #ICLR2025 to hear more about Sigmoid Attention! We just pushe…
@cydhsieh
Cheng-Yu Hsieh
4 months
🚀 By better focusing on the relevant visual information, FocalLens improves over standard CLIP models on a variety of downstream tasks, including image-image retrieval, image-text retrieval, and classification!
@cydhsieh
Cheng-Yu Hsieh
4 months
We train FocalLens using visual instruction tuning data in the form of (image, instruction, output), by aligning the instruction-conditioned visual representations of the images to their corresponding outputs.
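A minimal sketch of this kind of alignment objective, assuming a CLIP-style symmetric contrastive loss in which each (image, instruction) pair's own output embedding is the positive; the module names, feature shapes, and temperature below are illustrative assumptions, not the authors' training code:

```python
# Sketch: align instruction-conditioned image embeddings with the embeddings
# of their target outputs via a symmetric InfoNCE loss over the batch.
import torch
import torch.nn.functional as F

class ConditionedImageEncoder(torch.nn.Module):
    """Stand-in: fuses image features with an instruction embedding."""
    def __init__(self, dim=64):
        super().__init__()
        self.fuse = torch.nn.Linear(2 * dim, dim)

    def forward(self, image_feat, instruction_emb):
        fused = self.fuse(torch.cat([image_feat, instruction_emb], dim=-1))
        return F.normalize(fused, dim=-1)

def contrastive_alignment_loss(img_emb, out_emb, temperature=0.07):
    # Matched (image, output) pairs along the diagonal are positives.
    logits = img_emb @ out_emb.t() / temperature
    targets = torch.arange(img_emb.size(0))
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

# Toy batch of precomputed features standing in for (image, instruction, output).
B, D = 8, 64
image_feat = torch.randn(B, D)
instruction_emb = torch.randn(B, D)
output_emb = F.normalize(torch.randn(B, D), dim=-1)

encoder = ConditionedImageEncoder(D)
loss = contrastive_alignment_loss(encoder(image_feat, instruction_emb), output_emb)
loss.backward()
print(float(loss))
```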
@cydhsieh
Cheng-Yu Hsieh
4 months
‼️Most vision encoders generate fixed representations independent of the task or context of interest. For example, CLIP embeddings emphasize high-level semantics but often omit finer-grained details such as background, quantity, or spatial relations — which can be critical for
@cydhsieh
Cheng-Yu Hsieh
5 months
RT @JieyuZhang20: The 2nd Synthetic Data for Computer Vision workshop at @CVPR! We had a wonderful time last year, and we want to build on…
@cydhsieh
Cheng-Yu Hsieh
5 months
RT @MahtabBg: I'm excited to announce that our work (AURORA) got accepted into #CVPR2025 🎉! Special thanks to my coauthors: @ch1m1m0ry0, @cyd…
@cydhsieh
Cheng-Yu Hsieh
6 months
RT @YungSungChuang: (1/5) 🚨 LLMs can now self-improve to generate better citations ✅ 📝 We design automatic rewards to assess citation quality…
@cydhsieh
Cheng-Yu Hsieh
8 months
RT @MahtabBg: Introducing AURORA 🌟: Our new training framework to enhance multimodal language models with Perception Tokens; a game-changer…
@cydhsieh
Cheng-Yu Hsieh
10 months
RT @kamath_amita: Hard negative finetuning can actually HURT compositionality, because it teaches VLMs THAT caption perturbations change me…
Linked paper (arxiv.org): Several benchmarks have concluded that our best vision-language models (e.g., CLIP) are lacking in compositionality. Given an image, these benchmarks probe a model's ability to identify its…
@cydhsieh
Cheng-Yu Hsieh
1 year
🤔 In training vision models, what value do AI-generated synthetic images provide compared to the upstream (real) data used in training the generative models in the first place? 💡 We find using "relevant" upstream real data still leads to much stronger results compared to using…
@scottgeng00
Scott Geng
1 year
Will training on AI-generated synthetic data lead to the next frontier of vision models? 🤔 Our new paper suggests NO—for now. Synthetic data doesn't magically enable generalization beyond the generator's original training set. 📜: Details below 🧵 (1/n)
@cydhsieh
Cheng-Yu Hsieh
1 year
‼️ LLMs hallucinate facts even if provided with correct/relevant contexts. 💡 We find models' attention weight distribution on input context versus their own generated tokens serves as a strong detector for such hallucinations. 🚀 The detector transfers across models/tasks, and can…
@YungSungChuang
Yung-Sung Chuang
1 year
🚨 Can we "internally" detect if LLMs are hallucinating facts not present in the input documents? 🤔 Our findings:
- 👀 Lookback ratio—the extent to which LLMs put attention weights on context versus their own generated tokens—plays a key role.
- 🔍 We propose a hallucination
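A minimal sketch of a lookback-ratio-style signal, following the idea described in the quoted thread rather than the paper's exact recipe; the lookback_ratio helper and the toy attention tensor are assumptions for illustration:

```python
# Sketch: for each newly generated token, compare the attention mass placed
# on the input context versus on previously generated tokens.
import torch

def lookback_ratio(attn, context_len):
    """attn: (num_heads, gen_len, total_len) attention weights for the
    generated positions, where total_len = context_len + gen_len.
    Returns a (num_heads, gen_len) tensor of context-vs-own-tokens ratios."""
    ctx_mass = attn[:, :, :context_len].sum(dim=-1)
    gen_mass = attn[:, :, context_len:].sum(dim=-1)
    return ctx_mass / (ctx_mass + gen_mass + 1e-8)

# Toy example: 4 heads, 5 generated tokens, a 10-token context.
heads, gen_len, ctx_len = 4, 5, 10
attn = torch.rand(heads, gen_len, ctx_len + gen_len)
attn = attn / attn.sum(dim=-1, keepdim=True)  # rows sum to 1, like softmaxed attention

ratios = lookback_ratio(attn, ctx_len)
# A low average ratio (little attention on the context) could flag spans where
# the model leans on its own generations — a possible hallucination cue.
print(ratios.mean(dim=0))
```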
@cydhsieh
Cheng-Yu Hsieh
1 year
🧵 (n/n) 🚀 A huge shout out to our amazing team that makes this work possible: @YungSungChuang, @chunliang_tw, @ZifengWang315, Long T. Le, Abhishek Kumar, James Glass, @ajratner, @chl260, @RanjayKrishna, @tomaspfister!!
@cydhsieh
Cheng-Yu Hsieh
1 year
🧵 (5/n) 3⃣ Finally, we show our method is complementary to existing re-ordering-based methods that place relevant documents at the beginning/end of the input prompt, offering a new layer to improve current RAG pipelines.
@cydhsieh
Cheng-Yu Hsieh
1 year
🧵 (4/n) We show that: 1⃣ Models' calibrated attention reflects the relevance of a document to a user query well, outperforming existing re-ranking metrics. 2⃣ Calibrated attention further improves models' RAG performance (by over 10 pp) over the standard baseline.
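A minimal sketch of scoring retrieved documents by the attention mass a model places on their tokens, in the spirit of the thread above; the thread does not spell out its calibration step, so the per-document length normalization below is an assumption standing in for it, and attention_document_scores is a hypothetical helper:

```python
# Sketch: score each retrieved document by the attention mass the model
# assigns to its tokens, then re-rank documents by that score.
import torch

def attention_document_scores(attn_to_context, doc_spans):
    """attn_to_context: (gen_len, context_len) attention from generated
    positions onto the context. doc_spans: list of (start, end) token spans,
    one per retrieved document. Returns one relevance score per document."""
    mass_per_token = attn_to_context.mean(dim=0)  # average over generated positions
    scores = []
    for start, end in doc_spans:
        span = mass_per_token[start:end]
        scores.append(span.sum() / max(end - start, 1))  # length-normalized mass
    return torch.stack(scores)

# Toy example: 3 documents packed into a 30-token context, 6 generated tokens.
attn = torch.rand(6, 30)
attn = attn / attn.sum(dim=-1, keepdim=True)
spans = [(0, 10), (10, 20), (20, 30)]

scores = attention_document_scores(attn, spans)
ranking = torch.argsort(scores, descending=True)
print(scores, ranking)  # higher score = treated as more relevant to the query
```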