Jifan Chen
@Jifan_chen
Followers
406
Following
4K
Media
53
Statuses
481
Building code agents @awscloud. Ph.D. from @UTAustin. Interpretable and Robust Models #NLProc. I have a super powerful language model in my brain.
Joined March 2014
CS == Counter-Strike?
0
0
0
You spend $1B training a model A. Someone on your team leaves and launches their own model API B. You're suspicious. Was B derived (e.g., fine-tuned) from A? But you only have black-box access to B... With our paper, you can still tell, with strong statistical guarantees
Did someone steal your language model? We can tell you, as long as you shuffled your training data. All we need is some text from their model! Concretely, suppose Alice trains an open-weight model and Bob uses it to produce text. Can Alice prove Bob used her model?
55
215
2K
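The thread above rests on a simple statistical idea: because Alice's training-data order was a secret random shuffle, a model derived from hers may retain a measurable trace of that order, and independence can be rejected with a permutation test. Below is a minimal sketch of that style of test; the function name, the rank statistic, and the assumption that per-example loss tracks training position are all illustrative stand-ins, not the paper's actual method.

```python
import numpy as np

def shuffle_order_pvalue(per_example_loss, true_order, n_perms=1000, seed=0):
    """Permutation test: is the suspect model's per-example loss unusually
    correlated with Alice's secret random training order?

    Illustrative only -- a stand-in for the paper's actual test statistic.
    per_example_loss[j]: suspect model's loss on training example j.
    true_order[i]:       the example Alice trained on at step i (her secret shuffle).
    """
    rng = np.random.default_rng(seed)
    n = len(true_order)
    pos = np.empty(n, dtype=int)
    pos[true_order] = np.arange(n)  # training position of each example
    observed = np.corrcoef(per_example_loss, pos)[0, 1]
    # Null distribution: correlation under random re-orderings of the data
    null = np.array([
        np.corrcoef(per_example_loss, rng.permutation(pos))[0, 1]
        for _ in range(n_perms)
    ])
    # One-sided p-value: how often chance looks at least as order-aligned
    return (1 + np.sum(null >= observed)) / (1 + n_perms)
```

A small p-value says the suspect model's losses align with Alice's secret order far better than chance, which an independently trained model has no way to do; only Alice knows `true_order`, so only she can run the test.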
Unfortunately I won't be able to attend #COLM2025 in person this year, but please check out our work being presented by my advisors/collaborators! If you are interested in evaluation of open-ended tasks/creativity/reasoning please reach out and we can schedule a chat!
On my way to #COLM2025! Check out:
https://t.co/snFTIg24Am - QUDsim: Discourse templates in LLM stories
https://t.co/xqvbDvH5v0 - EvalAgent: retrieval-based eval targeting implicit criteria
https://t.co/f3JRojHeLb - RoboInstruct: code generation for robotics with simulators
0
3
19
Breaking the Code: Security Assessment of AI Code Agents Through Systematic Jailbreaking Attacks
Code-capable large language model (LLM) agents are increasingly embedded into software engineering workflows where they can read, write, and execute code, raising the stakes of
0
5
17
Check out JAWS-Bench, a benchmark that stress-tests code agents across three workspaces, led by @ShoumikSaha7 this summer. You build agents? Test them where attackers live: repos, files, tools. You do safety? Care about what runs, not just what the model says.
Code agents don't just talk -- they execute. What happens when you jailbreak them? Announcing JAWS-Bench (from my summer at @amazon AWS): a benchmark to jailbreak code agents across 3 workspaces -- empty → single-file → multi-file. The results? They break. A lot. Details in the thread.
0
1
8
Modeling Abstention via Selective Help-seeking: LLMs learn to use search tools to answer questions they would otherwise hallucinate on. But can this also teach them what they know vs. what they don't? @momergul_ introduces MASH, which trains LLMs for search and gets abstentions for free!
1
22
37
Congrats Greg! The new logo actually maintains the UT legacy. I liked it a lot!
I'm joining NYU (Courant CS + Center for Data Science) starting this fall! I'm excited to connect with new NYU colleagues and keep working on LLM reasoning, reliability, coding, creativity, and more! I'm also looking to build connections in the NYC area more broadly. Please
1
0
5
ok it really *does* feel like having an ambitious STEM PhD in your pocket!
9
5
166
Our team is *hiring* interns & researchers! We're a small team of hardcore researchers & engineers working on foundation models, agentic methods, and embodiment. If you have strong publications and related experience, please fill out the application form: https://t.co/U4gOvNQ9qR
1
3
14
Really happy to finally see this work published after several delays. Sometimes good things take time! Good food for thought over the weekend : ) #ACL2025
Excited to share our #ACL2025NLP paper, "CiteEval: Principle-Driven Citation Evaluation for Source Attribution"! If you're working on RAG, Deep Research, and Trustworthy AI, this is for you. Why? Citation quality is
1
0
5
Very excited to share the project I've been working on over the past several months! We proposed Deep Researcher with Test-Time Diffusion, a novel method to leverage iterative draft+revision to tackle complex questions demanding exhaustive search and reasoning.
3
9
28
Introducing Kiro, an all-new agentic IDE that has a chance to transform how developers build software. Let me highlight three key innovations that make Kiro special: 1 - Kiro introduces spec-driven development, helping developers express their intent clearly through natural
130
408
2K
Happy to share that EvalAgent has been accepted to #COLM2025 @COLM_conf! We introduce a framework to identify implicit and diverse evaluation criteria for various open-ended tasks!
Evaluating language model responses on open-ended tasks is hard! We introduce EvalAgent, a framework that identifies nuanced and diverse criteria. EvalAgent finds expert advice on the web that implicitly addresses the user's prompt.
1
19
77
After three successful runs of #DL4C at ICLR'22 (remote), ICLR'23 (Kigali/remote), and ICLR'25 (Singapore), I'm thrilled to announce the 4th #DL4C workshop, "Deep Learning for Code in the Agentic Era", is coming to #NeurIPS2025 in San Diego, marking our first
Excited to announce that the 4th #DL4C workshop, "Deep Learning for Code in the Agentic Era", is coming to @NeurIPSConf 2025! AI coding agents are transforming software development at an unprecedented pace. Join us to explore the cutting edge of agent-based programming,
2
7
24
Seven years ago, I co-led a paper called HotpotQA that has motivated and facilitated much #AI #Agents research since. Today, I'm asking that you stop using HotpotQA blindly for agents research in 2025 and beyond. In my new blog post, I revisit the brief history of
7
46
224
What happens when an LLM is asked to use information that contradicts its knowledge? We explore knowledge conflict in a new preprint. TLDR: Performance drops, and this could affect the overall performance of LLMs in model-based evaluation. 1/8 #NLProc #LLM #AIResearch
4
23
86
LLMs trained to memorize new facts can't use those facts well. We apply a hypernetwork to "edit" the gradients for fact propagation, improving accuracy by 2x on a challenging subset of RippleEdit! Our approach, PropMEND, extends MEND with a new objective for propagation.
5
75
197
Great to work on this benchmark with astronomers in our NSF-Simons CosmicAI institute! What I like about it: (1) focus on data processing & visualization, a "bite-sized" AI4Sci task (not automating all of research) (2) eval with VLM-as-a-judge (possible with strong, modern VLMs)
How good are LLMs at scientific computing and visualization? AstroVisBench tests how well LLMs implement scientific workflows in astronomy and visualize results. SOTA models like Gemini 2.5 Pro & Claude 4 Opus only match ground-truth scientific utility 16% of the time.
2
4
25
Have you thought about making your reasoning model stronger through *skill composition*? It's not as hard as you'd imagine! Check out our work!!!
Solving complex problems with CoT requires combining different skills. We can do this by:
- Modifying the CoT data format to be "composable" with other skills
- Training models on each skill
- Combining those models
This leads to better 0-shot reasoning on tasks involving skill composition!
1
2
11