Jie Ruan @JieRuan75 X Profile

Jie Ruan

@JieRuan75

Followers

149

Following

135

Media

3

Statuses

59

PhD student at the University of Michigan @UMich | MPhil from Peking University @PKU1898.

Joined November 2023

Jie Ruan

@JieRuan75

2 months

🔍 LLMs now give medical diagnoses, legal advice, and even tackle scientific problems. ❓ Your LLM sounds smart. But what if it’s just good at faking expertise? 🚀 We built ExpertLongBench to find out. 📉 And the results? They revealed several concerns. 👇 🔗

1

20

33

Jie Ruan

@JieRuan75

1 day

RT @liusiyang_641: New RAG-empowered tool for cancer care prep. Helps patients go from uninformed to visit-ready - by guiding knowledge, v…

0

11

0

Jie Ruan

@JieRuan75

1 day

RT @NaihaoDeng: 🚨 Excited to share a set of papers I led or collaborated on that are being presented at #ACL2025 this week! 🧵👇 1. Rethinki…

arxiv.org

Recent advances in table understanding have focused on instruction-tuning large language models (LLMs) for table-related tasks. However, existing research has overlooked the impact of...

0

10

0

Jie Ruan

@JieRuan75

8 days

RT @OwainEvans_UK: New paper & surprising result. LLMs transmit traits to other models via hidden signals in data. Datasets consisting only…

0

1K

0

Jie Ruan

@JieRuan75

1 month

RT @zkjzou: 🔥 Excited to introduce ManyICLBench (ACL 2025). 🧐 Do many-shot ICL tasks evaluate LCLMs' ability to retrieve the most similar ex…

arxiv.org

Many-shot in-context learning (ICL) has emerged as a unique setup to both utilize and test the ability of large language models to handle long context. This paper delves into long-context language...

0

19

0

Jie Ruan

@JieRuan75

1 month

RT @MKhalifaaaa: 🚨 Deadline for SCALR 2025 Workshop: Test‑time Scaling & Reasoning Models at COLM '25 @COLM_conf is approaching! 🚨 https:/…

0

11

0

Jie Ruan

@JieRuan75

2 months

11/ Grateful to collaborate with @InderjeetNair, @ ShuyangCao, @amyliiu, @sheza_munir, and @LuWang__, with support from @launchnlp, @michigan_AI, and many amazing domain experts!

0

1

Jie Ruan

@JieRuan75

2 months

10/ Your input is invaluable in making ExpertLongBench more representative and impactful across expert domains. Let’s build better evaluations for expert-level AI, together 🔬🧠⚖️

1

0

1

Jie Ruan

@JieRuan75

2 months

9/ 📢 We actively encourage contributions from the research community, including:
- ✅ Proposing new tasks and contributing data
- 🔁 Suggesting improvements to existing ones
- 🧠 Sharing domain-specific insights ⚖️🧪🏥📚

1

0

1

Jie Ruan

@JieRuan75

2 months

8/ ✅ LLMs are great at trivia. ❌ But when it comes to replacing real experts? They’ve got a long road ahead. 🔗 Start here:
Leaderboard:
Paper:
Data:
Code:

github.com

Contribute to launchnlp/ExpertLongBench development by creating an account on GitHub.

1

0

1

Jie Ruan

@JieRuan75

2 months

7/ Worse: models often "cover" the right checklist items… but get them wrong. ✅ High coverage ≠ high quality. 🚨 A model might sound right but still mislead you. That’s risky in law, medicine, science.

1

0

1

Jie Ruan

@JieRuan75

2 months

6/ So… how did today’s best LLMs do?
🟥 GPT-4o?
🟥 Claude 3?
🟥 Gemini?
🥶 Top score: 26.8 F1. 📉 On task T2: Legal Statement of Fact Generation, the best model scored just 7.9. Let that sink in. ⚠️ Even the best models barely passed.

1

0

1

Jie Ruan

@JieRuan75

2 months

5/ 💡 Also cool: CLEAR works with open-source models like Qwen2.5-72B. No need to worry about shifting OpenAI APIs: get reproducible results using Qwen. 📈 We found high agreement and strong correlation, giving evaluation that’s scalable, transparent, and reproducible.

1

0

1

Jie Ruan

@JieRuan75

2 months

4/ 🧠 But how to evaluate complex outputs like legal summaries or ESG reports? ✅ We built CLEAR, a checklist-based eval framework grounded in expert-written rubrics.

1

0

1

Jie Ruan

@JieRuan75

2 months

3/ EXPERTLONGBENCH spans 11 tasks across 9 domains:
⚖️ Law
🧪 Chemistry
🏥 Healthcare
📚 Education
💰 Finance
🧬 Biology
and more.
Input: up to 200K tokens
Output: up to 5,000+ tokens
⚠️ This isn’t a quiz. This is work.

1

0

1

Jie Ruan

@JieRuan75

2 months

2/ 🤖 Most benchmarks?
✅ Multiple-choice. ✅ Short answers.
But real experts…
✍️ draft legal briefs
🩺 write clinical notes
🧪 explain chemical syntheses
Stuff that takes tens of hours. We turned those into benchmark tasks.

1

0

1

Jie Ruan

@JieRuan75

4 months

RT @YunxiangZhang4: 🚨 New Benchmark Drop! Can LLMs actually do ML research? Not toy problems, not Kaggle tweaks, but real, unsolved ML confe…

0

35

0

Jie Ruan

@JieRuan75

9 months

RT @shi_weiyan: It feels emotional to hear that #EMNLP is going back to China after 10 years 🥹🥹🥹 thanks @emnlpmeeting ❤️❤️❤️

0

6

0

Jie Ruan

@JieRuan75

9 months

RT @liusiyang_641: "The Invisible Minority" – Older Adults 👵👴 Age bias is often overlooked compared to gender or race, yet by 2030, 1 in 6…

0

14

0

Jie Ruan

@JieRuan75

9 months

RT @FrederickXZhang: Heard of the Alaska-Hawaii merger? 🤔 Wonder if LLMs know it’s pending government approval before it can happen? They stu…

0

18

0