Jacqueline He
@jcqln_h
200 Followers · 6K Following · 7 Media · 205 Statuses
cs phd @uwnlp, prev. bse cs @princeton
Joined May 2018
Are modern large language models (LLMs) vulnerable to privacy attacks that can determine whether given data was used for training? Models and datasets are quite large; what should we even expect? Our new paper looks into exactly this question. 🧵 (1/10)
💬 2 · 🔁 21 · ❤️ 115
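For context, the classic test here is a membership inference attack: score a candidate text by its loss under the model and guess "seen in training" when the loss is unusually low. A minimal sketch, assuming a HuggingFace causal LM; the model choice and threshold are illustrative, not the paper's setup:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def nll_score(model, tokenizer, text):
    """Average per-token negative log-likelihood of `text` under the model;
    unusually low loss is weak evidence the text was seen in training."""
    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**enc, labels=enc["input_ids"])
    return out.loss.item()

tok = AutoTokenizer.from_pretrained("gpt2")               # illustrative model choice
lm = AutoModelForCausalLM.from_pretrained("gpt2").eval()
score = nll_score(lm, tok, "Some candidate passage.")
member_guess = score < 3.0  # hypothetical threshold; calibrate on data with known membership
```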
🌟 Acknowledgements: PIC is a collaborative endeavor between @uwnlp and @PrincetonPLI. Thanks to the dream team! @HowardYen1 @margs_li @StellaLisy @ZhiyuanZeng_ @WeijiaShi2 @tsvetshop @danqi_chen @PangWeiKoh @LukeZettlemoyer
💬 0 · 🔁 0 · ❤️ 5
Ultimately, faithful generation isn’t just about reducing errors; it’s about controlling what LMs say. PIC provides a framework for doing so at the level of discrete claims. 🔗 Code: https://t.co/oYN4DQNSYv 📄 Paper:
arxiv.org: “A central challenge in language models (LMs) is faithfulness hallucination: the generation of information unsubstantiated by input context. To study this problem, we propose Precise Information...”
💬 1 · 🔁 0 · ❤️ 4
Notably, replacing Llama 3.1 8B Instruct with PIC-LM yields: 52.5% → 61.5% EM on retrieval-augmented ambiguous QA; 65.9% → 86.0% factual precision in a chain-of-verification pipeline on a birthplace factoid task; 13.5% → 22.6% F1@5 in a chain-of-verification pipeline on QamPARI.
💬 1 · 🔁 0 · ❤️ 4
We further show that improvements in PIC can also bring end-task factuality gains. When situated in modular pipelines like RAG or chain-of-verification, where claims are either externally provided or self-generated, PIC-LM can boost factual accuracy.
💬 1 · 🔁 0 · ❤️ 4
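For intuition, here is a minimal sketch of how a claim-grounded LM could slot into a chain-of-verification pipeline; every callable (draft_lm, extract_claims, verify, grounded_lm) is a hypothetical stand-in, not the paper's interface:

```python
def chain_of_verification(question, draft_lm, extract_claims, verify, grounded_lm):
    """Sketch of a verification loop with a claim-grounded LM at the end."""
    draft = draft_lm(question)                    # 1. initial, possibly hallucinated answer
    claims = extract_claims(draft)                # 2. break the draft into atomic claims
    verified = [c for c in claims if verify(c)]   # 3. keep only claims that pass checking
    prompt = (                                    # 4. regenerate grounded ONLY in verified claims
        "Answer using only these claims:\n"
        + "\n".join(f"- {c}" for c in verified)
        + f"\n\nQ: {question}"
    )
    return grounded_lm(prompt)
```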
Next, we propose PIC-LM, initialized from Llama 3.1 8B Instruct and post-trained with SFT + DPO using a weakly supervised preference-data construction strategy. On PIC-Bench, PIC-LM outperforms all open baselines and closes the gap to GPT-4o.
💬 1 · 🔁 0 · ❤️ 4
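A rough sketch of how weakly supervised DPO preference pairs could be built by ranking sampled generations on claim support; claim_support_score is a hypothetical scorer, and this is not necessarily the paper's exact recipe:

```python
def build_preference_pair(prompt, candidates, claim_support_score):
    """Weak supervision for DPO (a sketch): rank sampled generations by how
    well they are supported by / cover the input claims, then pair the best
    candidate against the worst."""
    ranked = sorted(candidates, key=claim_support_score, reverse=True)
    return {"prompt": prompt, "chosen": ranked[0], "rejected": ranked[-1]}
```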
We first propose PIC-Bench, which measures PIC across 8 long-form generation tasks (e.g., summarization, LFQA) and 2 settings: including all input claims, or only a subset. We benchmark a range of LMs and find that even GPT-4o intrinsically hallucinates in >70% of its outputs.
💬 1 · 🔁 0 · ❤️ 4
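One plausible way to score this kind of precision, using an entailment checker as a stand-in for the benchmark's actual verifier:

```python
def pic_precision(output_claims, input_claims, entails):
    """Fraction of generated claims supported by at least one input claim.
    `entails(premise, hypothesis)` stands in for an NLI-style checker."""
    supported = sum(
        any(entails(src, out) for src in input_claims) for out in output_claims
    )
    return supported / max(len(output_claims), 1)
```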
Unlike most hallucination work, which targets factuality and extrinsic errors, PIC isolates a simple case of intrinsic hallucination: models must generate text using only a provided set of claims. This task should be easy—yet modern LMs struggle.
💬 2 · 🔁 0 · ❤️ 4
LMs often output answers that sound right but aren’t supported by the input context. This is intrinsic hallucination: the generation of plausible but unsupported content. We propose Precise Information Control (PIC): a task requiring LMs to ground their output only in a given set of verifiable claims.
💬 2 · 🔁 26 · ❤️ 50
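To make the task concrete, a sketch of what a PIC-style input could look like; the template is illustrative, not the paper's exact prompt:

```python
def pic_prompt(instruction, claims):
    """Illustrative PIC-style input: the model must respond using ONLY the
    listed verifiable claims (format is an assumption, not the paper's)."""
    claim_block = "\n".join(f"{i + 1}. {c}" for i, c in enumerate(claims))
    return (
        f"{instruction}\n\n"
        "Use only the following claims; do not add outside information:\n"
        f"{claim_block}"
    )
```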
Check out our work on LLMs and scientific knowledge updates!
LLMs are helpful for scientific research, but will they continue to be helpful? Introducing 🔍ScienceMeter: current knowledge-update methods preserve 86% of prior scientific knowledge, acquire 72% of new knowledge, and project 38%+ of future knowledge ( https://t.co/zDjjl5GBaZ ).
💬 0 · 🔁 11 · ❤️ 54
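The three numbers map to three evaluation axes. A sketch of that report, where knows() is a hypothetical knowledge probe rather than ScienceMeter's actual interface:

```python
def knowledge_report(model, prior, new, future, knows):
    """Sketch of the three axes above: after updating a model, measure
    preservation of prior knowledge, acquisition of new knowledge, and
    projection onto future (unseen) claims."""
    def rate(items):
        return sum(knows(model, x) for x in items) / len(items)
    return {
        "preservation": rate(prior),
        "acquisition": rate(new),
        "projection": rate(future),
    }
```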
congrats @kjha02 !! cool work 🎊🎉🎇
Oral @icmlconf !!! Can't wait to share our work and hear the community's thoughts on it, should be a fun talk! Can't thank my collaborators enough: @cogscikid @liangyanchenggg @SimonShaoleiDu @maxhkw @natashajaques
💬 1 · 🔁 0 · ❤️ 0
Is a single accuracy number all we can get from model evals?🤔
🚨 Does NOT tell where the model fails
🚨 Does NOT tell how to improve it
Introducing EvalTree🌳
🔍 identifying LM weaknesses in natural language
🚀 weaknesses serve as actionable guidance
(paper & demo 🔗 in 🧵) [1/n]
💬 5 · 🔁 102 · ❤️ 266
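One way such weakness groups could be surfaced (a sketch; EvalTree's actual tree-building algorithm may differ):

```python
import numpy as np
from sklearn.cluster import KMeans

def weakness_groups(failed_examples, embed, k=5):
    """Sketch, NOT EvalTree's method: embed the inputs a model got wrong,
    cluster them, then describe each cluster in natural language (by
    inspection or with an LM) to name the shared missing skill."""
    X = np.stack([embed(ex) for ex in failed_examples])
    labels = KMeans(n_clusters=k, n_init="auto").fit_predict(X)
    groups = {i: [] for i in range(k)}
    for ex, lab in zip(failed_examples, labels):
        groups[int(lab)].append(ex)
    return groups
```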
How well do data-selection methods work for instruction-tuning at scale? Turns out, when you look at large, varied data pools, lots of recent methods lag behind simple baselines, and a simple embedding-based method (RDS) does best! More below ⬇️ (1/8)
💬 4 · 🔁 68 · ❤️ 328
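A minimal sketch of embedding-similarity selection in the spirit of RDS; the cosine-scoring details here are simplified assumptions:

```python
import numpy as np

def select_top_k(pool_embs, target_embs, k):
    """Score each pool example by its max cosine similarity to the
    target-task examples, then keep the k highest-scoring ones."""
    pool = pool_embs / np.linalg.norm(pool_embs, axis=1, keepdims=True)
    target = target_embs / np.linalg.norm(target_embs, axis=1, keepdims=True)
    scores = (pool @ target.T).max(axis=1)  # best target match per pool example
    return np.argsort(-scores)[:k]          # indices of selected examples
```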
We trained a diffusion LM!
🔁 Adapted from Mistral v0.1/v0.3.
📊 Beats AR models on GSM8K when we finetune on math data.
📈 Performance improves with more test-time compute (reward guidance or more diffusion steps).
Check out @jaesungtae's thread for more details!
1/ Excited to share new work we’ve been building over the past year: TESS 2! TESS 2 is a 7B instruction-tuned diffusion LM that performs close to its AR counterparts on general QA tasks, trained by adapting an existing pretrained AR model. 🧵
💬 1 · 🔁 8 · ❤️ 39
Asking the right questions can make or break decisions in high-stakes fields like medicine, law, and beyond✴️ Our new framework ALFA (ALignment with Fine-grained Attributes) teaches LLMs to PROACTIVELY seek information through better questions🏥❓ (co-led with @jiminmun_) 👉🏻🧵
💬 7 · 🔁 46 · ❤️ 198
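A sketch of how fine-grained attribute scoring could pick better clarifying questions; the scorers and aggregation are hypothetical stand-ins, not ALFA's trained components:

```python
def score_question(question, attribute_scorers):
    """Average a question's ratings along fine-grained attributes
    (e.g., clarity, relevance, answerability); scorers are hypothetical."""
    return sum(s(question) for s in attribute_scorers) / len(attribute_scorers)

def best_question(candidates, attribute_scorers):
    """Pick the candidate question with the highest aggregate score."""
    return max(candidates, key=lambda q: score_question(q, attribute_scorers))
```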
Can AI really help with literature reviews? 🧐 Meet Ai2 ScholarQA, an experimental solution that allows you to ask questions that require multiple scientific papers to answer. It gives more in-depth, detailed, and contextual answers with table comparisons, expandable sections
💬 14 · 🔁 73 · ❤️ 221
Extremely excited to share that I will be joining @UBC_CS as an Assistant Professor this summer! I will be recruiting students this coming cycle!
💬 15 · 🔁 18 · ❤️ 148
🚨 I’m on the job market this year! 🚨 I’m completing my @uwcse Ph.D. (2025), where I identify and tackle key LLM limitations like hallucinations by developing new models—Retrieval-Augmented LMs—to build more reliable real-world AI systems. Learn more in the thread! 🧵
💬 26 · 🔁 119 · ❤️ 820
Check out our OpenScholar project!! Huge congrats to @AkariAsai for leading the project — working with her has been a wonderful experience!! 🌟
1/ Introducing ᴏᴘᴇɴꜱᴄʜᴏʟᴀʀ: a retrieval-augmented LM to help scientists synthesize knowledge 📚 @uwnlp @allen_ai With open models & 45M-paper datastores, it outperforms proprietary systems & matches human experts. Try out our demo! We also introduce ꜱᴄʜᴏʟᴀʀQᴀʙᴇɴᴄʜ,
💬 0 · 🔁 0 · ❤️ 7
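The core retrieve-then-generate loop behind such systems, sketched with illustrative interfaces (not OpenScholar's actual API):

```python
def answer_with_retrieval(question, retrieve, lm, k=8):
    """Fetch relevant passages from a paper datastore, then condition
    generation on them so the answer can cite its sources."""
    passages = retrieve(question, top_k=k)
    context = "\n\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    prompt = f"Context:\n{context}\n\nQuestion: {question}\nAnswer with citations like [1]:"
    return lm(prompt)
```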
Introducing HELMET, a long-context benchmark that supports lengths of 128K tokens and beyond, covering 7 diverse applications. We evaluated 51 long-context models and found that HELMET provides more reliable signals for model development https://t.co/fMOtJrnyhm A 🧵 on why you should use HELMET⛑️
💬 2 · 🔁 32 · ❤️ 79