Princeton NLP Group

@princeton_nlp

Followers 5K · Following 283 · Media 34 · Statuses 258

Princeton NLP Group led by @prfsanjeevarora @danqi_chen @karthik_r_n

Princeton, NJ
Joined August 2020
@OfirPress
Ofir Press
4 months
AlgoTune is a benchmark that penalizes expensive models, since we give each model a budget of $1 to solve each task. Cool to see open-weight models doing well!
@ori_press
Ori Press
4 months
We just benchmarked Qwen 3 Coder and GLM 4.5 on AlgoTune, and they manage to beat Claude Opus 4! We're excited to see if the models that will be released this week manage to make progress. Also: I just defended my PhD and I'm on the industry job market; my DMs are open :)
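The $1-per-task budget AlgoTune enforces can be pictured as a capped query loop: each model call costs money, and a model that is out of budget before it finds a correct solution simply fails the task. The sketch below is illustrative only; `query_model`, `price_per_token`, and `task.check` are hypothetical names, not AlgoTune's actual API.

```python
# Hypothetical sketch of a budget-capped evaluation loop (not AlgoTune's real code).

BUDGET_USD = 1.00  # each model gets $1 per task


def solve_with_budget(task, query_model, price_per_token, budget=BUDGET_USD):
    """Query the model until it solves the task or the budget runs out."""
    spent = 0.0
    attempts = []
    while spent < budget:
        answer, tokens_used = query_model(task, attempts)
        spent += tokens_used * price_per_token
        attempts.append(answer)
        if task.check(answer):
            return answer, spent
    return None, spent  # budget exhausted without a correct solution
```

Under this scheme a cheaply served open-weight model can afford many more attempts than an expensive frontier model, which is how the benchmark penalizes cost.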
@_carlosejimenez
carlos
4 months
What happens if you compare LMs on SWE-bench without the fancy scaffolds? Our new leaderboard “SWE-bench (bash only)” shows you which LMs are the best at getting the job done with just bash. More on why this is important 👇
@PrincetonAInews
Princeton Laboratory for Artificial Intelligence
4 months
Shoutout to all the @Princeton researchers participating in @icmlconf #ICML2025! Browse some of the cutting-edge research from AI Lab students, postdocs, and faculty being presented this year: https://t.co/pCRjcKaH7G
@BenShi34
Ben Shi
6 months
As we optimize model reasoning against verifiable objectives, how does this affect humans' understanding of that reasoning, and the collaborative outcomes it enables? In our new preprint, we investigate human-centric model reasoning for knowledge transfer 🧵:
@_carlosejimenez
carlos
6 months
Improved reasoning increases performance on benchmarks, but are models able to pass their knowledge on to humans? 🧐 We evaluate models’ communication abilities in teaching novel solutions to users! See our new paper!
@BenShi34
Ben Shi
6 months
As we optimize model reasoning against verifiable objectives, how does this affect humans' understanding of that reasoning, and the collaborative outcomes it enables? In our new preprint, we investigate human-centric model reasoning for knowledge transfer 🧵:
@plodq
Kabir
7 months
Introducing SWE-bench Multilingual: a new eval in the SWE-bench family to test LLM coding abilities in *9* programming languages, fully integrated with SB so it can plug into existing workflows. Claude 3.7 gets 43% on SB Multilingual vs. 63% on SB Verified, a 20-point drop! 🧵
@OfirPress
Ofir Press
7 months
Join us on May 21st: I'll talk about how we built SWE-bench & SWE-agent, and what I'm excited about for the future of autonomous AI systems.
@PyTorch
PyTorch
7 months
Can language model systems autonomously complete entire tasks end-to-end? In our next Expert Exchange webinar, @OfirPress explores autonomous LM systems for software engineering, featuring SWE-bench and SWE-agent, used by OpenAI, Meta, & more. 🔗 Register:
@stanfordnlp
Stanford NLP Group
7 months
Our warmest congratulations to @danqi_chen, @stanfordnlp grad and now Associate Professor at @PrincetonCS and Associate Director of @PrincetonPLI, on her stunning @iclr_conf keynote!
@a1zhang
Alex L Zhang
7 months
Claude can play Pokemon, but can it play DOOM? With a simple agent, we let VLMs play it and found that Sonnet 3.7 got the furthest, reaching the blue room! Our VideoGameBench (twenty games from the '90s) and agent are open source, so you can try it yourself now --> 🧵
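The "simple agent" idea in VideoGameBench-style setups is essentially a perception-action loop: capture a frame, ask the VLM for a keypress, send it, repeat. A minimal sketch, assuming a hypothetical `vlm(frame, history)` callable and a game object with `screenshot()`/`press()`/`done()` — none of these names come from the actual benchmark:

```python
# Minimal VLM-plays-a-game agent loop (illustrative names, not the benchmark's API).

VALID_KEYS = {"up", "down", "left", "right", "fire", "use"}


def play(game, vlm, max_steps=100):
    """Let a vision-language model play: look at a frame, choose a key, repeat."""
    history = []
    for _ in range(max_steps):
        frame = game.screenshot()
        action = vlm(frame, history).strip().lower()
        if action not in VALID_KEYS:  # VLMs often answer in prose; fall back safely
            action = "use"
        game.press(action)
        history.append(action)
        if game.done():
            break
    return history
```

Keeping the action space tiny and validating every model response is what makes such a scaffold "simple": all the intelligence lives in the VLM, not the loop.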
@BenShi34
Ben Shi
8 months
Can language models effectively impersonate you to family and friends? We find that they can: 44% of the time, close friends and family misidentify Llama-3.1-8b as human… 🧵👇
@OfirPress
Ofir Press
8 months
We just updated the SWE-bench Multimodal leaderboard. Congrats to Globant, Zencoder, and the Agentless team from UIUC for their strong results.
@princeton_nlp
Princeton NLP Group
8 months
Nothing like a sunny hike to welcome spring!
@_awettig
Alex Wettig
9 months
🤔 Ever wondered how prevalent a given type of web content is during LM pre-training? In our new paper, we propose WebOrganizer, which *constructs domains* based on the topic and format of CommonCrawl web pages 🌐 Key takeaway: domains help us curate better pre-training data! 🧵/N
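The core idea — a "domain" as the cross-product of a page's topic and format — can be sketched in a few lines. The classifiers and labels below are placeholders; WebOrganizer's actual taxonomies and models differ.

```python
# Sketch of "domains = topic x format" (hypothetical classifiers, not the paper's).
from collections import Counter


def assign_domain(page, topic_of, format_of):
    """A domain is the (topic, format) pair of a web page."""
    return (topic_of(page), format_of(page))


def domain_distribution(pages, topic_of, format_of):
    """Count pages per domain to see what the pre-training corpus is made of."""
    return Counter(assign_domain(p, topic_of, format_of) for p in pages)
```

Once every page carries a domain label, curation becomes a reweighting problem: up- or down-sample domains rather than individual documents.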
@OfirPress
Ofir Press
9 months
This Tuesday (Feb 18), @_carlosejimenez will discuss SWE-bench and the future of codegen evals, as part of the Conference on Synthetic Software in NYC. @KLieret will also be there. RSVP: https://t.co/2yFJEpy6jE
@KLieret
Kilian Lieret
9 months
SWE-agent 1.0 is the open-source SOTA on SWE-bench Lite! Tons of new features: massively parallel runs, cloud-based deployment, extensive configurability with tool bundles, and a new command-line interface & utilities.
@Yong18850571
Yong Lin
10 months
🚀 Introducing Goedel-Prover: a 7B LLM achieving SOTA open-source performance in automated theorem proving! 🔥 ✅ Improving +7% over the previous open-source SOTA on miniF2F 🏆 Ranking 1st on the PutnamBench leaderboard 🤖 Solving 1.9× the total problems of prior works on Lean
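For readers unfamiliar with the format: benchmarks like miniF2F give the model a formal Lean statement, and the model must produce a proof that the Lean checker accepts. A toy example of such a goal and a machine-checkable proof (illustrative, not an actual benchmark problem):

```lean
-- A formal statement plus a proof term that Lean's kernel verifies.
-- `Nat.add_comm` is a core Lean 4 lemma; no extra libraries needed.
theorem toy_goal (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b
```

Because the checker either accepts or rejects a proof, theorem proving gives the prover-LLM a fully verifiable training and evaluation signal.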
@OfirPress
Ofir Press
10 months
Congrats to o3-mini on setting a new high score on SciCode! R1 clocks in at an impressive 4.6%, matching Claude 3.5. SciCode is our super-tough programming benchmark, written by PhDs in various scientific domains.
@OfirPress
Ofir Press
10 months
SciCode is our super-tough coding benchmark testing the ability of LMs to write code based on research in physics/biology/materials science/... o1 is the SoTA with 7%. To make it easier to use, we're putting it into the Inspect AI format, as a few groups were asking for this.
@princeton_nlp
Princeton NLP Group
10 months
Congrats to the DeepSeek team on the impressive SWE-bench results!
@deepseek_ai
DeepSeek
10 months
🚀 DeepSeek-R1 is here! ⚡ Performance on par with OpenAI-o1 📖 Fully open-source model & technical report 🏆 MIT licensed: distill & commercialize freely! 🌐 Website & API are live now! Try DeepThink at https://t.co/v1TFy7LHNy today! 🐋 1/n