Princeton NLP Group
@princeton_nlp
5K Followers · 283 Following · 34 Media · 258 Statuses
Princeton NLP Group led by @prfsanjeevarora @danqi_chen @karthik_r_n
Princeton, NJ
Joined August 2020
AlgoTune is a benchmark that penalizes expensive models, since we give each model a budget of $1 to solve each task. Cool to see open-weight models doing well!
We just benchmarked Qwen 3 Coder and GLM 4.5 on AlgoTune, and they manage to beat Claude Opus 4! We're excited to see if the models that will be released this week manage to make progress. Also: I just defended my PhD and I'm on the industry job market, my DMs are open :)
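As a rough illustration of what a $1-per-task budget means in practice, the sketch below converts a flat budget into an affordable token count at a few hypothetical per-million-token prices. The model names and prices are invented for illustration; AlgoTune's actual cost accounting may differ.

```python
# Rough sketch: how many output tokens a $1-per-task budget buys at
# hypothetical per-million-token prices (not AlgoTune's real accounting).
BUDGET_USD = 1.00

# Hypothetical price points (USD per 1M output tokens), for illustration only.
prices_per_million = {
    "pricey-frontier-model": 75.00,
    "mid-tier-model": 10.00,
    "cheap-open-weight-model": 0.60,
}

def affordable_tokens(budget: float, price_per_million: float) -> int:
    """Output tokens a flat budget buys at a per-million-token price."""
    return int(budget / price_per_million * 1_000_000)

for name, price in prices_per_million.items():
    print(f"{name}: ~{affordable_tokens(BUDGET_USD, price):,} tokens per task")
```

Under these made-up prices, a cheap open-weight model gets orders of magnitude more tokens per task than a frontier model, which is one way a fixed dollar budget can favor cheaper models.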
What happens if you compare LMs on SWE-bench without the fancy scaffolds? Our new leaderboard, "SWE-bench (bash only)", shows you which LMs are the best at getting the job done with just bash. More on why this is important below.
Shoutout to all the @Princeton researchers participating in @icmlconf #ICML2025. Browse some of the cutting-edge research from AI Lab students, postdocs, and faculty being presented this year: https://t.co/pCRjcKaH7G
As we optimize model reasoning against verifiable objectives, how does this affect humans' ability to understand that reasoning and achieve better collaborative outcomes? In our new preprint, we investigate human-centric model reasoning for knowledge transfer:
Improved reasoning increases performance on benchmarks, but can models pass their knowledge on to humans? We evaluate models' ability to teach novel solutions to users. See our new paper!
Introducing SWE-bench Multilingual: a new eval in the SWE-bench family to test LLM coding abilities in *9* programming languages, fully integrated with SB so it can plug into existing workflows. Claude 3.7 gets 43% on SB Multilingual vs 63% on SB Verified, a 20 pt drop!
Join us on May 21st. I'll talk about how we built SWE-bench & SWE-agent and what I'm excited about for the future of autonomous AI systems.
Can language model systems autonomously complete entire tasks end-to-end? In our next Expert Exchange webinar, @OfirPress explores autonomous LM systems for software engineering, featuring SWE-bench and SWE-agent, used by OpenAI, Meta, & more. Register:
Our warmest congratulations to @danqi_chen, @stanfordnlp grad and now Associate Professor at @PrincetonCS and Associate Director of @PrincetonPLI, on her stunning @iclr_conf keynote!
Claude can play Pokemon, but can it play DOOM? With a simple agent, we let VLMs play it, and found that Sonnet 3.7 gets the furthest, finding the blue room! Our VideoGameBench (twenty games from the '90s) and agent are open source, so you can try it yourself now.
Can language models effectively impersonate you to family and friends? We find that they can: 44% of the time, close friends and family misidentify Llama-3.1-8b as human…
Congrats on the Verified and Multimodal SWE-bench numbers. https://t.co/Wt18YbfPyX
We just updated the SWE-bench Multimodal leaderboard. Congrats to Globant, Zencoder, and the Agentless team from UIUC on their strong results.
Ever wondered how prevalent some type of web content is during LM pre-training? In our new paper, we propose WebOrganizer, which *constructs domains* based on the topic and format of CommonCrawl web pages. Key takeaway: domains help us curate better pre-training data!
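A toy sketch of the domain idea described above: assign each page a (topic, format) pair and treat the cross-product as its domain label. The taxonomies and keyword heuristics below are invented for illustration; WebOrganizer itself uses learned classifiers, not keyword matching.

```python
# Toy illustration of topic x format domains for web pages.
# These tiny taxonomies and keyword rules are made up for this sketch;
# the real WebOrganizer trains classifiers over CommonCrawl pages.

TOPICS = {"science": ["experiment", "theorem"], "sports": ["match", "league"]}
FORMATS = {"tutorial": ["step", "how to"], "news": ["reported", "today"]}

def classify(text: str, taxonomy: dict, default: str) -> str:
    """Return the first taxonomy label whose keywords appear in the text."""
    text = text.lower()
    for label, keywords in taxonomy.items():
        if any(k in text for k in keywords):
            return label
    return default

def domain(page_text: str) -> str:
    """Domain = topic label crossed with format label."""
    topic = classify(page_text, TOPICS, "other")
    fmt = classify(page_text, FORMATS, "other")
    return f"{topic}/{fmt}"

print(domain("How to run the experiment: step 1 ..."))  # science/tutorial
print(domain("The league match was reported today"))    # sports/news
```

The point of the cross-product is that two pages on the same topic can land in different domains if one is a tutorial and the other is news, which gives a curation pipeline finer-grained buckets to up- or down-weight.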
This Tuesday (Feb 18), @_carlosejimenez will discuss SWE-bench and the future of codegen evals, as part of the Conference on Synthetic Software in NYC. @KLieret will also be there. RSVP: https://t.co/2yFJEpy6jE
SWE-agent 1.0 is the open-source SOTA on SWE-bench Lite! Tons of new features: massively parallel runs, cloud-based deployment, extensive configurability with tool bundles, and a new command-line interface & utilities.
Introducing Goedel-Prover: a 7B LLM achieving SOTA open-source performance in automated theorem proving!
Improving +7% over the previous open-source SOTA on miniF2F. Ranking 1st on the PutnamBench leaderboard. Solving 1.9x as many problems as prior works on Lean.
Congrats to o3-mini on setting a new high score on SciCode! R1 clocks in at an impressive 4.6%, matching Claude 3.5. SciCode is our super-tough programming benchmark, written by PhDs in various scientific domains.
SciCode is our super-tough coding benchmark testing LMs' ability to write code based on research in physics, biology, materials science, and more. o1 is the SOTA with 7%. To make it easier to use, we're putting it into the Inspect AI format, as a few groups were asking for this.
Congrats to the DeepSeek team on the impressive SWE-bench results!
DeepSeek-R1 is here! Performance on par with OpenAI-o1. Fully open-source model & technical report. MIT licensed: distill & commercialize freely! Website & API are live now! Try DeepThink at https://t.co/v1TFy7LHNy today! 1/n