Liyan Tang Profile
Liyan Tang

@LiyanTang4

233 Followers · 166 Following · 19 Media · 172 Statuses

Final-year PhD @UTAustin || NLP || MiniCheck || Prev Intern @GoogleDeepMind, @bespokelabsai, @AmazonScience

Austin, TX, US
Joined February 2022
@LiyanTang4
Liyan Tang
2 years
🔎📄 New model & benchmark to check LLMs’ output against docs (e.g., fact-check RAG)
🕵️ MiniCheck: a model with GPT-4 accuracy at 400x lower cost
📚 LLM-AggreFact: a collection of 10 human-labeled datasets of errors in model outputs
https://t.co/oFTS68mQOL
w/ @PhilippeLaban, @gregd_nlp 🧵
3 · 28 · 90
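A minimal sketch of how a MiniCheck-style grounding check could be run, assuming the pip package minicheck and the usage pattern from the repo README at github.com/Liyan06/MiniCheck; the model-name string and the score() signature with its four-tuple return are recollections of that README, not guaranteed:

    # Check whether a claim is supported by a source document with MiniCheck.
    # Package, model name, and return signature follow my reading of the
    # repo README and may differ; verify against github.com/Liyan06/MiniCheck.
    from minicheck.minicheck import MiniCheck

    doc = ("A group of students gather in the school library "
           "to study for their upcoming final exams.")
    claim = "The students are preparing for their final exams."

    scorer = MiniCheck(model_name="flan-t5-large", cache_dir="./ckpts")
    pred_label, raw_prob, _, _ = scorer.score(docs=[doc], claims=[claim])
    print(pred_label, raw_prob)  # 1 = claim supported, plus support probability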
@LiyanTang4
Liyan Tang
2 months
Our paper "ChartMuseum 🖼️" is now accepted to #NeurIPS2025 Datasets and Benchmarks Track! Even the latest models, such as GPT-5 and Gemini-2.5-Pro, still cannot do well on challenging 📉chart understanding questions , especially on those that involve visual reasoning 👀!
@LiyanTang4
Liyan Tang
6 months
Introducing ChartMuseum🖼️, testing visual reasoning with diverse real-world charts!
✍🏻 Entirely human-written questions by 13 CS researchers
👀 Emphasis on visual reasoning – hard to verbalize via text CoTs
📉 Humans reach 93%, but Gemini-2.5-Pro gets only 63% and Qwen2.5-72B 38%
1 · 20 · 36
@gregd_nlp
Greg Durrett
3 months
📢I'm joining NYU (Courant CS + Center for Data Science) starting this fall! I’m excited to connect with new NYU colleagues and keep working on LLM reasoning, reliability, coding, creativity, and more! I’m also looking to build connections in the NYC area more broadly. Please…
94 · 48 · 764
@ZEYULIU10
Leo Liu
5 months
LLMs trained to memorize new facts can’t use those facts well.🤔 We apply a hypernetwork to ✏️edit✏️ the gradients for fact propagation, improving accuracy by 2x on a challenging subset of RippleEdit!💡 Our approach, PropMEND, extends MEND with a new objective for propagation.
5 · 72 · 198
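To make the "hypernetwork that edits gradients" idea concrete, here is an illustrative PyTorch sketch, not the PropMEND implementation: a small network maps the raw memorization gradient of one weight matrix to an edited gradient before the update is applied. (MEND additionally exploits the low-rank structure of per-example gradients, which this sketch ignores.)

    # Illustrative hypernetwork that rewrites a fine-tuning gradient before
    # it is applied, in the spirit of MEND/PropMEND (not the authors' code).
    import torch
    import torch.nn as nn

    class GradientHypernet(nn.Module):
        def __init__(self, dim, hidden=128):
            super().__init__()
            # maps each gradient row to an edited gradient row
            self.net = nn.Sequential(
                nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, dim))

        def forward(self, grad):
            # grad: (out_dim, in_dim) gradient of one weight matrix
            return self.net(grad)

    dim = 64
    hyper = GradientHypernet(dim)
    W = nn.Parameter(torch.randn(32, dim))   # weight being edited
    raw_grad = torch.randn(32, dim)          # grad of a memorization loss wrt W
    edited = hyper(raw_grad)                 # hypernetwork rewrites the gradient
    with torch.no_grad():
        W -= 1e-2 * edited                   # apply the edited update

At meta-training time the hypernetwork itself would be trained so that the edited update makes the new fact propagate (e.g., to multi-hop questions), which is where PropMEND's new objective comes in.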
@xiye_nlp
Xi Ye
5 months
🤔 Recent mech interp work showed that retrieval heads can explain some long-context behavior. But can we use this insight for retrieval? 📣 Introducing QRHeads (query-focused retrieval heads), which enhance retrieval. Main contributions: 🔍 Better head detection: we find a…
2 · 23 · 70
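The flavor of using attention heads for retrieval can be sketched as follows: score each passage by how much attention a chosen head sends from query tokens back to passage tokens. The layer/head indices are hypothetical and gpt2 is only a stand-in; the paper's head-detection and scoring procedure is more involved:

    # Illustrative head-based passage scoring (not the QRHeads procedure).
    import torch
    from transformers import AutoTokenizer, AutoModelForCausalLM

    tok = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")
    model.eval()

    LAYER, HEAD = 8, 3  # hypothetical "retrieval head"

    def head_score(passage: str, query: str) -> float:
        p_ids = tok(passage, return_tensors="pt").input_ids
        q_ids = tok(" " + query, return_tensors="pt").input_ids
        ids = torch.cat([p_ids, q_ids], dim=1)
        with torch.no_grad():
            att = model(ids, output_attentions=True).attentions[LAYER][0, HEAD]
        n_p = p_ids.shape[1]
        # mean attention mass flowing from query tokens back to passage tokens
        return att[n_p:, :n_p].sum(dim=-1).mean().item()

    docs = ["Paris is the capital of France.",
            "The mitochondria is the powerhouse of the cell."]
    query = "What is the capital of France?"
    print(max(docs, key=lambda d: head_score(d, query)))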
@fangcong_y10593
Fangcong Yin
6 months
Solving complex problems with CoT requires combining different skills. We can do this by:
🧩 Modifying the CoT data format to be “composable” with other skills
🔥 Training models on each skill
📌 Combining those models
This leads to better 0-shot reasoning on tasks involving skill composition!
5 · 38 · 87
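One way to picture a "composable" CoT format, purely as an illustration (the paper's actual data format may differ): wrap each skill's reasoning in its own tagged segment, so that single-skill training data can be concatenated when skills are combined:

    # Hypothetical "composable" CoT record: each skill's reasoning lives in
    # its own tagged segment (an illustration, not the paper's exact format).
    def make_composable_cot(question, skill_steps):
        # skill_steps: list of (skill_name, reasoning_text) pairs
        body = "\n".join(f"<{name}>\n{text}\n</{name}>"
                         for name, text in skill_steps)
        return f"Question: {question}\n{body}\nAnswer:"

    print(make_composable_cot(
        "If 3 pens cost $4.50, what do 7 pens cost in euros (1 USD = 0.9 EUR)?",
        [("unit_price", "4.50 / 3 = 1.50 USD per pen"),
         ("scaling", "7 * 1.50 = 10.50 USD"),
         ("unit_conversion", "10.50 * 0.9 = 9.45 EUR")]))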
@PuyuanPeng
Puyuan Peng
6 months
The paper is out! https://t.co/GikR01dy5S
@PuyuanPeng
Puyuan Peng
7 months
Announcing the new SotA voice-cloning TTS model: 𝗩𝗼𝗶𝗰𝗲𝗦𝘁𝗮𝗿 ⭐️ VoiceStar is
- autoregressive,
- voice-cloning,
- robust,
- duration controllable, and
- capable of *test-time extrapolation*: it generates speech longer than the training duration!
Code & Model: https://t.co/7vxDpnayks
0 · 12 · 60
@gregd_nlp
Greg Durrett
6 months
Check out ChartMuseum from @LiyanTang4 @_grace_kim and many other collaborators from UT! Chart questions take us beyond current benchmarks for math/multi-hop QA/etc., which CoT is very good at, to *visual reasoning*, which is hard to express with text CoT!
@LiyanTang4
Liyan Tang
6 months
Introducing ChartMuseum🖼️, testing visual reasoning with diverse real-world charts!
✍🏻 Entirely human-written questions by 13 CS researchers
👀 Emphasis on visual reasoning – hard to verbalize via text CoTs
📉 Humans reach 93%, but Gemini-2.5-Pro gets only 63% and Qwen2.5-72B 38%
1 · 11 · 34
@LiyanTang4
Liyan Tang
6 months
Read the full paper: ChartMuseum: Testing Visual Reasoning Capabilities of Large Vision-Language Models https://t.co/88PuGbZKYc 🏅Leaderboard: https://t.co/utp0ghM2UO 🤗 Dataset: https://t.co/Sg9lwuSXSc Code:
Link card: github.com · [NeurIPS 2025] ChartMuseum: Testing Visual Reasoning Capabilities of Large Vision-Language Models (Liyan06/ChartMuseum)
1 · 2 · 5
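A sketch of an evaluation loop over the benchmark; the HF dataset id lytang/ChartMuseum and the field names image/question/answer are assumptions inferred from the links above, so check the leaderboard repo for the real ones:

    # Evaluation-loop sketch. Dataset id and field names are assumptions;
    # verify against the ChartMuseum repo/leaderboard before use.
    from datasets import load_dataset

    ds = load_dataset("lytang/ChartMuseum", split="test")

    def my_vlm(image, question):
        # stand-in: call your vision-language model here
        return "unanswerable"

    correct = sum(
        my_vlm(ex["image"], ex["question"]).strip().lower()
        == ex["answer"].strip().lower()
        for ex in ds)
    print(f"accuracy: {correct / len(ds):.1%}")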
@LiyanTang4
Liyan Tang
6 months
❌ Extended thinking in CoTs yields minimal improvement in chart understanding. ⁉️ Why? Fundamental limitations in models' visual reasoning capabilities. We identify 4 key shortcomings below and find that models sometimes cannot find the right strategy for visual questions.
1 · 0 · 4
@LiyanTang4
Liyan Tang
6 months
Existing chart QA benchmarks have limitations:
❌ Limited real-world chart sources
❌ Questions created with an LLM in the loop
❌ Saturated/similar model performance
ChartMuseum:
✅ 184 chart sources
✅ Entirely human-written questions
✅ Clear distinctions in model performance
1 · 0 · 6
@LiyanTang4
Liyan Tang
6 months
Introducing ChartMuseum🖼️, testing visual reasoning with diverse real-world charts!
✍🏻 Entirely human-written questions by 13 CS researchers
👀 Emphasis on visual reasoning – hard to verbalize via text CoTs
📉 Humans reach 93%, but Gemini-2.5-Pro gets only 63% and Qwen2.5-72B 38%
2 · 34 · 78
@PhilippeLaban
Philippe Laban
6 months
🆕paper: LLMs Get Lost in Multi-Turn Conversation. In real life, people don’t speak in perfect prompts. So we simulate multi-turn conversations — less lab-like, more like real use. We find that LLMs get lost in conversation. 👀What does that mean? 🧵1/N 📄 https://t.co/xt2EfGRh7e
4 · 39 · 130
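The simulation idea can be pictured like this: deliver the same underspecified task either as one fully specified prompt or "sharded" across turns, then compare the final answers. The chat helper and the shards below are hypothetical stand-ins, not the paper's harness:

    # Same task, delivered fully specified vs. "sharded" across turns.
    # `chat` is a hypothetical function: message history in, reply out.
    shards = [
        "I need a Python function that deduplicates a list.",
        "Oh, and it should preserve the original order.",
        "Also, treat strings case-insensitively.",
    ]

    def run_full(chat):
        return chat([{"role": "user", "content": " ".join(shards)}])

    def run_sharded(chat):
        history = []
        for shard in shards:
            history.append({"role": "user", "content": shard})
            history.append({"role": "assistant", "content": chat(history)})
        return history[-1]["content"]  # answer after the last shard

    def dummy_chat(history):
        return f"(reply to {len(history)} messages)"  # stand-in model

    print(run_full(dummy_chat))
    print(run_sharded(dummy_chat))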
@AnirudhKhatry
Anirudh Khatry
7 months
🚀 Introducing CRUST-Bench, a benchmark for C-to-Rust transpilation of full codebases 🛠️
A dataset of 100 real-world C repositories across various domains, each paired with:
🦀 Handwritten safe Rust interfaces
🧪 Rust test cases to validate correctness
🧵[1/6]
3 · 21 · 68
@gregd_nlp
Greg Durrett
7 months
New work led by @LiyanTang4 with a strong new model for chart understanding! Check out the blog post, model, and playground! Very fun to play around with Bespoke-MiniChart-7B and see what a 7B VLM can do!
@bespokelabsai
Bespoke Labs
7 months
Announcing Bespoke-MiniChart-7B, a new SOTA in chart understanding for models of comparable size on seven benchmarks, on par with Gemini-1.5-Pro and Claude-3.5! 🚀 Beyond its real-world applications, chart understanding is a good, challenging problem for VLMs, since it requires…
1 · 10 · 32
@LiyanTang4
Liyan Tang
7 months
Check out my work at @bespokelabsai! We release Bespoke-MiniChart-7B, a new SOTA in chart understanding at its size. Chart understanding is really fun and challenging, requiring reasoning skills beyond math reasoning. It's a great starting point for open chart model development!
@bespokelabsai
Bespoke Labs
7 months
Announcing Bespoke-MiniChart-7B, a new SOTA in chart understanding for models of comparable size on seven benchmarks, on par with Gemini-1.5-Pro and Claude-3.5! 🚀 Beyond its real-world applications, chart understanding is a good, challenging problem for VLMs, since it requires…
0 · 11 · 31
@gregd_nlp
Greg Durrett
7 months
Check out Manya's work on evaluation for open-ended tasks! The criteria from EvalAgent can be plugged into LLM-as-a-judge or used for refinement. Great tool with a ton of potential, and there's LOTS to do here for making LLMs better at writing!
@ManyaWadhwa1
Manya Wadhwa
7 months
Evaluating language model responses on open-ended tasks is hard! 🤔 We introduce EvalAgent, a framework that identifies nuanced and diverse criteria 📋✍️. EvalAgent identifies 👩‍🏫🎓 expert advice on the web that implicitly addresses the user’s prompt 🧵👇
1 · 5 · 52
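A sketch of the "plug criteria into LLM-as-a-judge" use that Greg describes above; the criteria strings and the llm completion function are placeholders, not EvalAgent's actual outputs or API:

    # Placeholder criteria plugged into a simple LLM-as-a-judge loop.
    # `llm` is a hypothetical completion function; the criteria are
    # illustrative, not EvalAgent's real output.
    criteria = [
        "States a clear thesis in the first paragraph",
        "Supports each claim with a concrete example",
        "Matches the tone a general audience expects",
    ]

    def judge(response, llm):
        verdicts = {}
        for criterion in criteria:
            prompt = (f"Criterion: {criterion}\n\nResponse:\n{response}\n\n"
                      "Does the response satisfy the criterion? Answer yes or no.")
            verdicts[criterion] = llm(prompt)
        return verdicts

    def dummy_llm(prompt):
        return "yes"  # stand-in for a real model call

    print(judge("LLMs are transforming writing. For example, ...", dummy_llm))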
@gregd_nlp
Greg Durrett
7 months
Check out Ramya et al.'s work on understanding discourse similarities in LLM-generated text! We see this as an important step in quantifying the "sameyness" of LLM text, which we think will be a step towards fixing it!
@ramya_namuduri
Ramya Namuduri
7 months
Have that eerie feeling of déjà vu when reading model-generated text 👀, but can’t pinpoint the specific words or phrases 👀? ✨We introduce QUDsim, to quantify discourse similarities beyond lexical, syntactic, and content overlap.
0 · 4 · 24
@bespokelabsai
Bespoke Labs
7 months
OpenAI’s o4 just showed that multi-turn tool use is a huge deal for AI agents. Today, we show how to do the same with your own agents, using RL and open-source models. We used GRPO on only 100 high-quality questions from the BFCL benchmark, and post-trained a 7B Qwen model to…
21 · 52 · 380
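At the core of the GRPO recipe mentioned here is a group-relative advantage: sample several rollouts per question, score each one (e.g., did the tool call succeed), and standardize the rewards within the group. A minimal sketch with made-up rewards:

    # Group-relative advantages, the normalization GRPO uses.
    # The rewards below are invented (1.0 = successful tool call).
    import statistics

    def grpo_advantages(rewards):
        mu = statistics.mean(rewards)
        sigma = statistics.pstdev(rewards) or 1.0  # guard: identical rewards
        return [(r - mu) / sigma for r in rewards]

    group = [1.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0]
    print(grpo_advantages(group))

Rollouts that beat their own group's average get positive advantage, so the policy needs no separate value model, which is what makes the recipe cheap enough to run on only 100 questions.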