Princeton NLP Group
@princeton_nlp
5K Followers · 283 Following · 34 Media · 258 Statuses
Princeton NLP Group led by @prfsanjeevarora @danqi_chen @karthik_r_n
Princeton, NJ
Joined August 2020
AlgoTune is a benchmark that penalizes expensive models, since we give each model a budget of $1 to solve each task. Cool to see open-weight models doing well!
We just benchmarked Qwen 3 Coder and GLM 4.5 on AlgoTune, and they manage to beat Claude Opus 4! We're excited to see if the models that will be released this week manage to make progress. Also: I just defended my PhD and I'm on the industry job market, my DMs are open :)
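As a rough illustration of what a $1-per-task budget means in practice, the sketch below converts a flat budget into an affordable token count at a few hypothetical per-million-token prices. The model names and prices are invented for illustration; AlgoTune's actual cost accounting may differ.

```python
# Rough sketch: how many output tokens a $1-per-task budget buys at
# hypothetical per-million-token prices (not AlgoTune's real accounting).
BUDGET_USD = 1.00

# Hypothetical price points (USD per 1M output tokens), for illustration only.
prices_per_million = {
    "pricey-frontier-model": 75.00,
    "mid-tier-model": 10.00,
    "cheap-open-weight-model": 0.60,
}

def affordable_tokens(budget: float, price_per_million: float) -> int:
    """Output tokens a flat budget buys at a per-million-token price."""
    return int(budget / price_per_million * 1_000_000)

for name, price in prices_per_million.items():
    print(f"{name}: ~{affordable_tokens(BUDGET_USD, price):,} tokens per task")
```

Under these made-up prices, a cheap open-weight model gets orders of magnitude more tokens per task than a frontier model, which is one way a fixed dollar budget can favor cheaper models.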
What happens if you compare LMs on SWE-bench without the fancy scaffolds? Our new leaderboard, "SWE-bench (bash only)", shows you which LMs are the best at getting the job done with just bash. More on why this is important below.
Shoutout to all the @Princeton researchers participating in @icmlconf #ICML2025. Browse some of the cutting-edge research from AI Lab students, postdocs, and faculty being presented this year: https://t.co/pCRjcKaH7G
As we optimize model reasoning against verifiable objectives, how does this affect humans' ability to understand that reasoning and achieve better collaborative outcomes? In our new preprint, we investigate human-centric model reasoning for knowledge transfer:
Improved reasoning increases performance on benchmarks, but can models pass their knowledge on to humans? We evaluate models' ability to teach novel solutions to users. See our new paper!
Introducing SWE-bench Multilingual: a new eval in the SWE-bench family to test LLM coding abilities in *9* programming languages, fully integrated with SB so it can plug into existing workflows. Claude 3.7 gets 43% on SB Multilingual vs 63% on SB Verified, a 20 pt drop!
Join us on May 21st. I'll talk about how we built SWE-bench & SWE-agent and what I'm excited about for the future of autonomous AI systems.
Can language model systems autonomously complete entire tasks end-to-end? In our next Expert Exchange webinar, @OfirPress explores autonomous LM systems for software engineering, featuring SWE-bench and SWE-agent, used by OpenAI, Meta, & more. Register:
Our warmest congratulations to @danqi_chen, @stanfordnlp grad and now Associate Professor at @PrincetonCS and Associate Director of @PrincetonPLI, on her stunning @iclr_conf keynote!
Claude can play Pokemon, but can it play DOOM? With a simple agent, we let VLMs play it, and found that Sonnet 3.7 gets the furthest, finding the blue room! Our VideoGameBench (twenty games from the '90s) and agent are open source, so you can try it yourself now.
Can language models effectively impersonate you to family and friends? We find that they can: 44% of the time, close friends and family misidentify Llama-3.1-8b as human…
Congrats on the Verified and Multimodal SWE-bench numbers. https://t.co/Wt18YbfPyX
We just updated the SWE-bench Multimodal leaderboard. Congrats to Globant, Zencoder, and the Agentless team from UIUC on their strong results.
Ever wondered how prevalent some type of web content is during LM pre-training? In our new paper, we propose WebOrganizer, which *constructs domains* based on the topic and format of CommonCrawl web pages. Key takeaway: domains help us curate better pre-training data!
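A toy sketch of the domain idea described above: assign each page a (topic, format) pair and treat the cross-product as its domain label. The taxonomies and keyword heuristics below are invented for illustration; WebOrganizer itself uses learned classifiers, not keyword matching.

```python
# Toy illustration of topic x format domains for web pages.
# These tiny taxonomies and keyword rules are made up for this sketch;
# the real WebOrganizer trains classifiers over CommonCrawl pages.

TOPICS = {"science": ["experiment", "theorem"], "sports": ["match", "league"]}
FORMATS = {"tutorial": ["step", "how to"], "news": ["reported", "today"]}

def classify(text: str, taxonomy: dict, default: str) -> str:
    """Return the first taxonomy label whose keywords appear in the text."""
    text = text.lower()
    for label, keywords in taxonomy.items():
        if any(k in text for k in keywords):
            return label
    return default

def domain(page_text: str) -> str:
    """Domain = topic label crossed with format label."""
    topic = classify(page_text, TOPICS, "other")
    fmt = classify(page_text, FORMATS, "other")
    return f"{topic}/{fmt}"

print(domain("How to run the experiment: step 1 ..."))  # science/tutorial
print(domain("The league match was reported today"))    # sports/news
```

The point of the cross-product is that two pages on the same topic can land in different domains if one is a tutorial and the other is news, which gives a curation pipeline finer-grained buckets to up- or down-weight.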
This Tuesday (Feb 18), @_carlosejimenez will discuss SWE-bench and the future of codegen evals, as part of the Conference on Synthetic Software in NYC. @KLieret will also be there. RSVP: https://t.co/2yFJEpy6jE
SWE-agent 1.0 is the open-source SOTA on SWE-bench Lite! Tons of new features: massively parallel runs, cloud-based deployment, extensive configurability with tool bundles, and a new command-line interface & utilities.
Introducing Goedel-Prover: a 7B LLM achieving SOTA open-source performance in automated theorem proving!
Improving +7% over the previous open-source SOTA on miniF2F. Ranking 1st on the PutnamBench leaderboard. Solving 1.9x as many problems as prior works on Lean.
Congrats to o3-mini on setting a new high score on SciCode! R1 clocks in at an impressive 4.6%, matching Claude 3.5. SciCode is our super-tough programming benchmark, written by PhDs in various scientific domains.
SciCode is our super-tough coding benchmark testing LMs' ability to write code based on research in physics, biology, materials science, and more. o1 is the SOTA with 7%. To make it easier to use, we're putting it into the Inspect AI format, as a few groups were asking for this.
Congrats to the DeepSeek team on the impressive SWE-bench results!
DeepSeek-R1 is here! Performance on par with OpenAI-o1. Fully open-source model & technical report. MIT licensed: distill & commercialize freely! Website & API are live now! Try DeepThink at https://t.co/v1TFy7LHNy today! 1/n