Berkeley AI Research
@berkeley_ai
233K Followers · 377 Following · 41 Media · 1K Statuses
We're graduate students, postdocs, faculty and scientists at the cutting edge of artificial intelligence research.
Berkeley, CA
Joined July 2017
Our new work on controlling recsys with natural language, led by @MicahCarroll, with great collaborators Addie Foote, @kjfeng_, Marcus Williams, @ancadianadragan, @wbradknox
https://t.co/5mZdCDNZTP
arxiv.org
When users are dissatisfied with recommendations from a recommender system, they often lack fine-grained controls for changing them. Large language models (LLMs) offer a solution by allowing users...
What really matters in matrix-whitening optimizers (Shampoo/SOAP/PSGD/Muon)? We ran a careful comparison, dissecting each algorithm. Interestingly, we find that proper matrix-whitening can be seen as *two* transformations, and not all optimizers implement both. Blog:
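For intuition, here is a minimal NumPy sketch (not any one optimizer's actual implementation) of Shampoo-style matrix whitening, which applies *two* transformations: a left preconditioner built from row statistics (GGᵀ) and a right one built from column statistics (GᵀG). Applied together, they drive the update's singular values toward 1:

```python
import numpy as np

def shampoo_whiten(G, eps=1e-8):
    """Whiten a gradient matrix G with Shampoo-style two-sided
    preconditioning: L^{-1/4} @ G @ R^{-1/4}."""
    # Left statistics (row-space covariance) and right statistics
    # (column-space covariance), damped for numerical stability.
    L = G @ G.T + eps * np.eye(G.shape[0])
    R = G.T @ G + eps * np.eye(G.shape[1])

    def inv_fourth_root(M):
        # Inverse fourth root of a symmetric PSD matrix via eigh.
        w, V = np.linalg.eigh(M)
        return V @ np.diag(np.clip(w, eps, None) ** -0.25) @ V.T

    return inv_fourth_root(L) @ G @ inv_fourth_root(R)
```

A one-sided variant (only L or only R) leaves half the anisotropy untouched, which is one axis along which these optimizers can differ.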
Autoregressive language models learn to compress data by mapping sequences to high-dimensional representations and decoding one token at a time. The quality of compression, as defined by the ability to predict the next token given a prompt, progressively improves (as measured by
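The compression framing can be made concrete: under arithmetic coding, encoding a sequence costs roughly the model's total cross-entropy in bits, so better next-token prediction is literally better compression. A toy sketch (function name hypothetical):

```python
import math

def bits_to_encode(token_probs):
    """Bits an arithmetic coder needs for a sequence, given the model's
    probability assigned to each actually-observed next token."""
    return sum(-math.log2(p) for p in token_probs)

# A uniform model over a 256-token vocabulary pays 8 bits per token;
# a model that puts probability 0.5 on each correct token pays 1 bit.
```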
LLMs have shown a remarkable ability to “self-refine” and learn from their mistakes via in-context learning. But in robotics, most methods are single-shot. How can we bring inference-time adaptation to robot learning? A 🧵:
New work from @aditya_oberai & @seohong_park: instead of 1-step or n-step TD backups, can we "divide and conquer" over the trajectory, backing up finer and finer increments? Improves on the bias of TD(0) and the variance of MC. The principle is old, but getting it to work takes some care!
TD learning can suffer on long tasks: ↑ deep Bellman recursions → ↓ scalability (despite big data). We introduce a new method (TRL) with a "divide-and-conquer" value update, which scales well to long horizons!
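The "divide and conquer" idea can be illustrated on a single recorded trajectory: instead of chaining k one-step backups, split the segment at its midpoint and combine the two halves, so the recursion depth is log(k) rather than k. A toy sketch of the recursion identity (not TRL's actual update rule):

```python
def segment_return(rewards, i, k, gamma=0.99):
    """Discounted return over rewards[i:k], computed by bisection:
    R[i:k] = R[i:j] + gamma^(j-i) * R[j:k], with j the midpoint."""
    if k - i == 1:
        return rewards[i]
    j = (i + k) // 2
    return (segment_return(rewards, i, j, gamma)
            + gamma ** (j - i) * segment_return(rewards, j, k, gamma))
```

TRL applies the analogous split to learned value estimates, so each backup spans half the remaining horizon rather than one step.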
🌍 LLMs can use long chain-of-thought (CoT) to reason in English, but what about other languages? New paper w/ @BerkeleyNLP: We study how scaling, pretraining, post-training & inference affect long CoT across 9 languages. Spoiler: English long CoT ≠ multilingual long CoT 🧵
Our new paper with Sonali Sharma and @RoxanaDaneshjou is out in @npjDigitalMed! We examine how medical safety and disclaimer messages in public LLMs have changed over time when answering patient questions.
Generative AI models are giving fewer medical disclaimers over time. 📉 In 2022, ~26% of AI health answers had a disclaimer. By 2025? <1%. As models get smarter, they’re getting less safe. Patients may take outputs as medical advice. https://t.co/2OYQvKdezT
AI can now see, reason, and segment the Earth. 🌍 Meet LISAt, our #NeurIPS2025 Datasets & Benchmarks paper - the first foundation model that turns language queries into pixel-level satellite segmentations. 🛰️ (1/n) 🔗 https://t.co/ApVZgGF0cU
@NeurIPSConf @berkeley_ai
Can a robot inspect all views of an object? Today @IROS, we present Omni-Scan from @berkeley_ai, a novel method for bimanual robotic 360° object scanning & reconstruction using 3D Gaussian Splats. (1/8) 🔗 https://t.co/8emyJfUNk4
🧠 New preprint: How Do LLMs Use Their Depth? We uncover a “Guess-then-Refine” mechanism across layers - early layers predict high-frequency tokens as guesses; later layers refine them as context builds Paper - https://t.co/5PitHjmJJZ
@neuranna @GopalaSpeech @berkeley_ai
Catch up on our most recent Community Lecture: “Transmission Versus Truth: What Will It Take to Make an AI as Smart as a 4-Year-Old?” with Alison Gopnik. This was the last of six community lectures for 2025, and all are available to watch on SFI’s YouTube channel. Watch here:
New evaluation results from @AnthropicAI's Claude Sonnet 4.5 system card on our CyberGym benchmark reveal a striking trend: AI cybersecurity capabilities are advancing at unprecedented speed, from ~10% (Claude-Sonnet-3.7) to ~30% success rates (Claude-Sonnet-4.5) (with single
1/ 🔥 AI agents are reaching a breakthrough moment in cybersecurity. In our latest work: 🔓 CyberGym: AI agents discovered 15 zero-days in major open-source projects 💰 BountyBench: AI agents solved real-world bug bounty tasks worth tens of thousands of dollars 🤖
Amazing! 10 @BerkeleyEECS @SkyCompLab grad students are Amazon AI PhD Fellows! Congrats! Learn more about our fellows here: https://t.co/zuCGKlmSNe
#AmazonAIFellowship
@BerkeleySky
eecs.berkeley.edu
Today, Amazon announced its new AI PhD Fellowship program, offering two years of funding to over 100 PhD students across nine universities. Ten of these inaugural fellowships have been awarded to...
🎓 Amazon launches AI PhD Fellowship program, providing $68 million over two years to fund PhD students at 9 universities pursuing research in machine learning, computer vision, and natural-language processing. #AmazonAIFellowship
Humans handle dynamic situations easily; what about models? Turns out, they break in three distinct ways: ⛔ Force Stop → Reasoning leakage (won’t stop) ⚡️ Speedup → Panic (rushed answers) ❓ Info Updates → Self-doubt (reject updates) 👉 Check out https://t.co/wKrnsMkiFY
Simulation drives robotics progress, but how do we close the reality gap? Introducing GaussGym: an open-source framework for learning locomotion from pixels with ultra-fast parallelized photorealistic rendering across >4,000 iPhone, GrandTour, ARKit, and Veo scenes! Thread 🧵
How can a robot provide details of plant anatomy for plant phenotyping? Today @IROS2025, we present Botany-Bot from @berkeley_ai @Siemens. Botany-Bot 1) creates segmented 3D models of plants using Gaussian splats and GarField, and 2) uses a robot arm to expose hidden details. (1/9)
✨Introducing ECHO, the newest in-the-wild image generation benchmark! You’ve seen new image models and new use cases discussed on social media, but old benchmarks don’t test them! We distilled this qualitative discussion into a structured benchmark. 🔗 https://t.co/wJmmEY8TFQ
Do you have many models to choose from and little labeled data with which to evaluate them? Check out our #NeurIPS2025 paper, which presents a method that uses both labeled + unlabeled data to estimate model performance more accurately than previous methods.
New #NeurIPS2025 paper: how should we evaluate machine learning models without a large, labeled dataset? We introduce Semi-Supervised Model Evaluation (SSME), which uses labeled and unlabeled data to estimate performance! We find SSME is far more accurate than standard methods.
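One simple instance of the labeled+unlabeled idea (a hedged sketch, not SSME itself): blend the small labeled set's empirical accuracy with a confidence-based estimate on the unlabeled pool, where mean top-class probability equals expected accuracy if the model is well calibrated:

```python
def blended_accuracy(labeled_probs, labels, unlabeled_probs, w=0.5):
    """labeled_probs / unlabeled_probs: lists of per-class probability
    vectors; labels: true class indices for the labeled examples."""
    # Empirical accuracy on the (small) labeled set.
    acc_l = sum(max(range(len(p)), key=p.__getitem__) == y
                for p, y in zip(labeled_probs, labels)) / len(labels)
    # Calibration-based estimate on the unlabeled pool: mean confidence
    # of the predicted class.
    acc_u = sum(max(p) for p in unlabeled_probs) / len(unlabeled_probs)
    return w * acc_l + (1 - w) * acc_u
```

The weight `w` trades off the labeled estimate's lack of bias against the unlabeled estimate's lower variance; SSME's actual estimator is more sophisticated than this averaging.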
Super excited about @wenjie_ma's work on verifying math proofs! ✅ 24 competitions, 3 SoTAs (o3, Gemini-2.5-Pro, R1) ✅ Strong evaluator -- a carefully designed evaluator with simple ensemble beats agentic ones ✅ Strong best-of-n performance Check out the paper & website!
LLMs solving math benchmarks with verifiable answers like AIME? ✅ LLMs solving math proofs? ❌ Still an open problem. RL works great for final-answer problems, but proofs are different: - Often no single checkable answer - Correct answers can hide flawed reasoning The key
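The "simple ensemble beats agentic" and best-of-n claims suggest a selection pattern like the following (a hypothetical sketch, not the paper's evaluator): score each candidate proof with several verifier judgments and keep the candidate with the most accepts:

```python
def best_of_n(proofs, verifiers):
    """proofs: candidate proof strings; verifiers: callables returning
    True/False for 'this proof is valid'. Returns the candidate that
    the verifier ensemble accepts most often."""
    return max(proofs, key=lambda p: sum(v(p) for v in verifiers))
```

With a strong verifier, best-of-n turns extra samples into accuracy; with a weak one, it just selects for proofs that fool the verifier, which is why evaluator quality is the crux.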