Yi Zeng 曾祎 @ICCV
@EasonZeng623
Followers: 1K · Following: 2K · Media: 63 · Statuses: 579
probe to improve | Ph.D. @VTEngineering | Amazon Research Fellow | #AI_safety 🦺 #AI_security 🛡 | I deal with the dark side of machine learning.
Virginia, US
Joined August 2017
Now you know there's another dude who just discussed AI Safety and Security with both sides ;) #NeurIPS2023 [📸 With the legends @ylecun and Yoshua Bengio]
1 reply · 5 reposts · 119 likes
🫂Dear Metamates impacted today—beyond safety, our foundation model team also has openings for native multimodal roles (text/video/audio, OCR, and more):
BTW, We’re growing a safer foundation model research stack at @tiktok_us —safety pretraining, RLHF/RLAIF, evals. 🚨Intern + FTE roles. 📨DMs open.
0 replies · 0 reposts · 5 likes
@ICCVConference vibe kicking in with @liang_weixin and the very shy @xiangyuqi_pton. Batch-norm the smiles, dropout the shyness👇
0 replies · 1 repost · 19 likes
"I'm excellent, grandma" -- Sora 2 Pro > a viral and unhinged vine post
0 replies · 0 reposts · 3 likes
Can someone fact-check this for me: why does every Sora 2 vid featuring @sama have the same Adidas sneakers? Is it just me or is this… consistent? 😂
1 reply · 0 reposts · 1 like
[10/12] To make matters worse, soon AI agents will know you better than your friends. Will they give you uncomfortable truths? Or keep validating you so you’ll never leave?
20 replies · 235 reposts · 5K likes
🧵 New paper from @AISecurityInst x @AiEleuther that I led with Kyle O’Brien: Open-weight LLM safety is both important & neglected. But we show that filtering dual-use knowledge from pre-training data improves tamper resistance *>10x* over post-training baselines.
7 replies · 41 reposts · 200 likes
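The tweet only states the high-level finding (filter dual-use knowledge out of the pre-training data rather than patching behaviour afterwards); the paper's actual pipeline isn't shown here. A minimal toy sketch of that general idea, where the blocklist patterns, the risk_score heuristic, and the threshold are all made up for illustration:

```python
# Minimal, hypothetical sketch of pre-training data filtering for dual-use content.
# This is NOT the paper's pipeline; it only illustrates the general idea of
# removing risky documents *before* pre-training rather than patching afterwards.
import re

BLOCKLIST_PATTERNS = [          # hypothetical dual-use indicators
    r"\bsynthesi[sz]e\b.*\bnerve agent\b",
    r"\benrich(ing|ment)?\b.*\buranium\b",
]

def risk_score(document: str) -> float:
    """Crude proxy score: fraction of blocklist patterns matched."""
    hits = sum(bool(re.search(p, document, flags=re.IGNORECASE)) for p in BLOCKLIST_PATTERNS)
    return hits / len(BLOCKLIST_PATTERNS)

def filter_corpus(documents, threshold: float = 0.0):
    """Keep only documents whose risk score is at or below the threshold."""
    kept, dropped = [], []
    for doc in documents:
        (kept if risk_score(doc) <= threshold else dropped).append(doc)
    return kept, dropped

if __name__ == "__main__":
    corpus = [
        "A recipe for sourdough bread.",
        "Step-by-step guide to synthesize a nerve agent at home.",
    ]
    kept, dropped = filter_corpus(corpus)
    print(f"kept {len(kept)} docs, dropped {len(dropped)} docs")
```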
gpt-oss is overfit on refusal safety — sky-high over-refusal, tanked helpfulness on my bench. gpt-5 swings the other way: max helpfulness, refusals softened with follow-ups even on harmful asks. What do you see in this pattern?
0 replies · 1 repost · 4 likes
'Update Federal procurement guidelines to ensure that the government only contracts with frontier large language model (LLM) developers who ensure that their systems are objective and free from top-down ideological bias.' There is an executive order on this arriving today.
5 replies · 10 reposts · 60 likes
🎉 Thrilled to be presenting my first paper at @icmlconf! "Just Enough Shifts: Mitigating Over-Refusal in Aligned Language Models with Targeted Representation Fine-Tuning" We introduce ACTOR—a lightweight, activation-based training method that reduces over-refusal without
arxiv.org: Safety alignment is crucial for large language models (LLMs) to resist malicious instructions but often results in over-refusals, where benign prompts are unnecessarily rejected, impairing user...
1 reply · 3 reposts · 11 likes
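The ACTOR training objective itself isn't spelled out in the tweet, so the sketch below shows only a simpler, related activation-level idea: estimate a "refusal direction" from hidden states and project it out. All tensors, function names, and the difference-of-means heuristic here are illustrative assumptions, not the paper's method.

```python
# Hedged sketch of an activation-level intervention against over-refusal.
# This is NOT the ACTOR method from the paper; it shows the simpler, related idea
# of estimating a "refusal direction" from hidden states and removing it.
import torch

def refusal_direction(h_refuse: torch.Tensor, h_comply: torch.Tensor) -> torch.Tensor:
    """Difference-of-means direction between refusal and compliance activations.

    h_refuse, h_comply: [num_examples, hidden_dim] hidden states (toy stand-ins here).
    """
    direction = h_refuse.mean(dim=0) - h_comply.mean(dim=0)
    return direction / direction.norm()

def remove_direction(h: torch.Tensor, direction: torch.Tensor) -> torch.Tensor:
    """Project the refusal direction out of a batch of activations."""
    return h - (h @ direction).unsqueeze(-1) * direction

if __name__ == "__main__":
    torch.manual_seed(0)
    hidden_dim = 16
    h_refuse = torch.randn(8, hidden_dim) + 1.0   # toy "refusal" activations
    h_comply = torch.randn(8, hidden_dim) - 1.0   # toy "compliance" activations
    d = refusal_direction(h_refuse, h_comply)
    h_new = remove_direction(h_refuse, d)
    print("component along direction before:", (h_refuse @ d).mean().item())
    print("component along direction after: ", (h_new @ d).mean().item())
```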
🔹 AI alignment really needs interdisciplinary work! 🔹 See my talk on "how to humanize AI to persuade them for jailbreaking":
📄 ACL'24 Outstanding & Best Social Impact Paper: https://t.co/Eygpfmoyjx 🎥 Full talk from Singapore Alignment Workshop:
0 replies · 8 reposts · 44 likes
I'll never forget this model, as well as the relationship between pretraining, SFT, and RLHF.
Baidu just released 23 models at the same time on @huggingface - from 0.3B to 424B parameters. Let’s go!
7 replies · 49 reposts · 523 likes
Sparsity can make your LoRA fine-tuning go brrr 💨 Announcing SparseLoRA (ICML 2025): up to 1.6-1.9x faster LLM fine-tuning (2.2x less FLOPs) via contextual sparsity, while maintaining performance on tasks like math, coding, chat, and ARC-AGI 🤯 🧵1/ https://t.co/ilZgkfj78J
5 replies · 58 reposts · 208 likes
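The tweet names the mechanism (contextual sparsity applied during LoRA fine-tuning) but not the implementation. A toy sketch under that assumption, where the magnitude-based channel mask and the SparseLoRALinear class are made-up stand-ins for whatever sparsity predictor the paper actually uses:

```python
# Toy illustration of "contextual sparsity" on top of a LoRA update.
# Not the SparseLoRA implementation: the channel selection here is a made-up
# magnitude heuristic, used only to show how sparsity could skip LoRA compute.
import torch
import torch.nn as nn

class SparseLoRALinear(nn.Module):
    def __init__(self, in_features: int, out_features: int, rank: int = 4, keep_ratio: float = 0.25):
        super().__init__()
        self.base = nn.Linear(in_features, out_features, bias=False)
        self.lora_a = nn.Linear(in_features, rank, bias=False)   # LoRA down-projection
        self.lora_b = nn.Linear(rank, out_features, bias=False)  # LoRA up-projection
        self.keep_ratio = keep_ratio

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Contextual mask: keep only the largest-magnitude input channels for this input.
        k = max(1, int(self.keep_ratio * x.shape[-1]))
        topk = x.abs().topk(k, dim=-1).indices
        mask = torch.zeros_like(x).scatter_(-1, topk, 1.0)
        # Frozen base path stays dense; only the LoRA path sees the sparsified input.
        return self.base(x) + self.lora_b(self.lora_a(x * mask))

if __name__ == "__main__":
    layer = SparseLoRALinear(in_features=32, out_features=32)
    out = layer(torch.randn(2, 32))
    print(out.shape)  # torch.Size([2, 32])
```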
🌉 Bridging Offline & Online RL for LLMs 🌉 📝: https://t.co/G12TS6Z84n New paper shows on verifiable & non-verifiable tasks: - Online DPO & GRPO give similar performance. - Semi-online (iterative) DPO with sync every s steps (more efficient!) works very well also. - Offline DPO
1 reply · 98 reposts · 455 likes
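A rough sketch of the "semi-online" control flow described in the tweet (refresh the sampling policy from the trained policy every s steps); the generation and DPO-update functions below are placeholders, not a real preference-optimization implementation:

```python
# Schematic of "semi-online" preference optimization: the sampling policy is
# refreshed from the trained policy only every s steps. Generation and the DPO
# update are placeholders standing in for the real components.
import copy

class TinyPolicy:
    """Stand-in for a language model; a single scalar 'parameter' for illustration."""
    def __init__(self, theta: float = 0.0):
        self.theta = theta

def generate_preference_pairs(sampler: TinyPolicy, batch_size: int):
    # Placeholder: in practice, sample completions from `sampler` and label preferences.
    return [("chosen", "rejected")] * batch_size

def dpo_update(policy: TinyPolicy, pairs) -> None:
    # Placeholder: in practice, take a gradient step on the DPO loss over `pairs`.
    policy.theta += 0.01 * len(pairs)

def semi_online_dpo(total_steps: int = 10, sync_every_s: int = 3, batch_size: int = 4):
    policy = TinyPolicy()
    sampler = copy.deepcopy(policy)          # frozen copy used for generation
    for step in range(1, total_steps + 1):
        pairs = generate_preference_pairs(sampler, batch_size)
        dpo_update(policy, pairs)
        if step % sync_every_s == 0:         # semi-online: periodic sync
            sampler = copy.deepcopy(policy)
            print(f"step {step}: synced sampler (theta={sampler.theta:.2f})")
    return policy

if __name__ == "__main__":
    semi_online_dpo()
```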
1/ 🔥 AI agents are reaching a breakthrough moment in cybersecurity. In our latest work: 🔓 CyberGym: AI agents discovered 15 zero-days in major open-source projects 💰 BountyBench: AI agents solved real-world bug bounty tasks worth tens of thousands of dollars 🤖
28 replies · 154 reposts · 533 likes
AIR-Bench is a Spotlight @iclr_conf 2025! Catch our poster on Fri, Apr 26, 10 a.m.–12:30 p.m. SGT (Poster Session 5). Sadly, I won’t be there in person (visa woes, again), but the insights—and our incredible team—will be with you in Singapore. Go say hi 👋
🧵[1/5] Introducing AIR 2024: Unifying AI risk categorizations with a shared language to improve AI safety. W/ @kevin_klyman @andyz245 @YUYANG_UCLA @MinzhouP & guidance from @ruoxijia @dawnsongtweets @percyliang @uiuc_aisecure for kicking off my AI policy research journey 🏦.
0 replies · 3 reposts · 21 likes
🚀 Really excited to launch #AgentX competition hosted by @BerkeleyRDI @UCBerkeley alongside our LLM Agents MOOC series (a global community of 22k+ learners & growing fast). Whether you're building the next disruptive AI startup or pushing the research frontier, AgentX is your
20 replies · 110 reposts · 417 likes
1/14: If sparse autoencoders work, they should give us interpretable classifiers that help with probing in difficult regimes (e.g. data scarcity). But we find that SAE probes consistently underperform! Our takeaway: mech interp should use stronger baselines to measure progress 🧵
12 replies · 60 reposts · 521 likes
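The thread's argument is that SAE probes should be measured against stronger baselines. A toy version of that comparison, with synthetic activations, a random ReLU encoder standing in for a trained sparse autoencoder, and a plain logistic-regression probe as the baseline; every detail here is illustrative, not taken from the paper:

```python
# Toy version of the comparison the thread argues for: probe on raw activations
# (baseline) vs probe on sparse-autoencoder features. The "SAE" here is a random
# ReLU encoder stand-in, not a trained sparse autoencoder.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n, d_model, d_sae = 400, 64, 256

# Synthetic "activations" with a linearly separable concept direction.
X = rng.normal(size=(n, d_model))
w_true = rng.normal(size=d_model)
y = (X @ w_true > 0).astype(int)

# Stand-in SAE encoder: random projection + ReLU (a real SAE would be trained).
W_enc = rng.normal(size=(d_model, d_sae)) / np.sqrt(d_model)
X_sae = np.maximum(X @ W_enc, 0.0)

Xtr, Xte, Str, Ste, ytr, yte = train_test_split(X, X_sae, y, test_size=0.5, random_state=0)

baseline = LogisticRegression(max_iter=1000).fit(Xtr, ytr)   # probe on raw activations
sae_probe = LogisticRegression(max_iter=1000).fit(Str, ytr)  # probe on SAE-style features

print("baseline probe acc:   ", baseline.score(Xte, yte))
print("SAE-feature probe acc:", sae_probe.score(Ste, yte))
```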
Excited to share new work from my internship @GoogleAI ! Curious as to how we should measure the similarity between examples in pretraining datasets? We study the role of similarity in pretraining 1.7B parameter language models on the Pile. arxiv: https://t.co/iyS3Fxtx9a 1/🧵
5 replies · 42 reposts · 170 likes
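The tweet doesn't say which similarity measure the paper uses between pretraining examples; the sketch below shows one common stand-in, cosine similarity over TF-IDF vectors:

```python
# One common way to measure similarity between pretraining examples: cosine
# similarity over TF-IDF vectors. The paper may use a different representation;
# this is just an illustrative stand-in.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

examples = [
    "The cat sat on the mat.",
    "A cat was sitting on a mat.",
    "Gradient descent minimizes a loss function.",
]

vectors = TfidfVectorizer().fit_transform(examples)
sim = cosine_similarity(vectors)          # [n_examples, n_examples] similarity matrix

for i, row in enumerate(sim):
    print(f"example {i}:", [f"{s:.2f}" for s in row])
```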