Luke DH Lee
@luke_lee_ai
Followers 112 · Following 846 · Media 5 · Statuses 45
PhD student @UCBerkeley. CS master’s @UCL. Formerly visiting researcher @ Stanford AI Lab.
Joined August 2018
New paper! A rare approach to SAE training data: we train Sparse Autoencoders using only synthetic data generated by the model itself, revealing features that truly reflect what’s inside the model.
2
1
4
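The tweet above only sketches the idea of training sparse autoencoders on activations from model-generated text. As a rough, purely illustrative picture of the general SAE recipe (none of the architecture, data pipeline, or hyperparameters below come from the paper; they are assumptions), a minimal version might look like this:

```python
# Minimal sparse autoencoder (SAE) sketch in PyTorch.
# Hypothetical illustration only: the paper's actual setup is not shown in the tweet.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_dict: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_dict)   # activations -> feature space
        self.decoder = nn.Linear(d_dict, d_model)   # feature space -> reconstruction

    def forward(self, x: torch.Tensor):
        f = torch.relu(self.encoder(x))              # non-negative sparse features
        x_hat = self.decoder(f)
        return x_hat, f

def train_sae(activations: torch.Tensor, d_dict: int = 4096,
              l1_coeff: float = 1e-3, epochs: int = 5):
    """activations: (n_samples, d_model) residual-stream activations, here assumed
    to be cached from text the model generated itself."""
    sae = SparseAutoencoder(activations.shape[1], d_dict)
    opt = torch.optim.Adam(sae.parameters(), lr=1e-3)
    for _ in range(epochs):
        x_hat, f = sae(activations)
        loss = ((x_hat - activations) ** 2).mean() + l1_coeff * f.abs().mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return sae

# Toy usage with random tensors standing in for real activations:
sae = train_sae(torch.randn(1024, 512))
```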
My group & collaborators have developed many popular benchmarks over the years, e.g., MMLU, MATH, APPS. Really excited about our latest benchmark OMEGA Ω: 🔍Can LLMs really think outside the box in math? A new benchmark probing 3 axes of generalization: 1️⃣ Exploratory 2️⃣
📢 Can LLMs really reason outside the box in math? Or are they just remixing familiar strategies? Remember how DeepSeek R1 and o1 impressed us on Olympiad-level math yet still failed at simple arithmetic 😬 We built a benchmark to find out → OMEGA Ω 📐 💥 We found
4
35
157
🧵Working with #MCP or building a modular #RAG system, but not sure which rankers to use from your pool? 📊 Rank the Rankers⚡Route smart. This paper shows how. 👨🔬 w/ Fernando Diaz @841io 💻 Code: https://t.co/fPBzHWzuF2 Paper:
0
0
2
The beginning of autonomous bug discovery & defense! 🔥 AI agents now match elite hackers — 15 zero-days found, $30K+ worth of bounty bugs patched. Huge milestone by @dawnsongtweets & team!
1/ 🔥 AI agents are reaching a breakthrough moment in cybersecurity. In our latest work: 🔓 CyberGym: AI agents discovered 15 zero-days in major open-source projects 💰 BountyBench: AI agents solved real-world bug bounty tasks worth tens of thousands of dollars 🤖
0
0
0
In Large Language Monkeys, we showed the scaling laws of inference-time compute with repeated sampling: the power-law relationship between the number of repeated attempts and the fraction of problems solved! The following amazing work theoretically proves the necessary and
Interested in test time / inference scaling laws? Then check out our newest preprint!! 📉 How Do Large Language Monkeys Get Their Power (Laws)? 📉 https://t.co/Vz76RpmXdF w/ @JoshuaK92829 @sanmikoyejo @Azaliamirh @jplhughes @jordanjuravsky @sprice354_ @aengus_lynch1
0
41
170
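As a rough sketch of the power-law claim in the tweets above (illustrative only: the tasks, models, and exact fitting procedure are assumptions, and the data below is synthetic), one can estimate coverage — the fraction of problems solved by at least one of k attempts — and fit a line in log-log space:

```python
# Illustrative sketch of the "coverage vs. number of samples" power-law fit.
# All data here is synthetic; real experiments use many sampled solutions per
# problem plus a task-specific correctness check.
import numpy as np

def coverage_at_k(correct: np.ndarray, k: int) -> float:
    """correct: (n_problems, n_samples) boolean matrix, correct[i, j] = sample j solves problem i.
    Returns the fraction of problems solved by at least one of the first k samples."""
    return correct[:, :k].any(axis=1).mean()

rng = np.random.default_rng(0)
# Toy data: each problem has a hidden per-sample success probability.
p = rng.beta(0.3, 3.0, size=200)                   # 200 problems
correct = rng.random((200, 1024)) < p[:, None]     # 1024 independent attempts each

ks = np.array([1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024])
cov = np.array([coverage_at_k(correct, k) for k in ks])

# Fit log(coverage) ~ a * log(k) + b; near-linearity is the power-law signature.
a, b = np.polyfit(np.log(ks), np.log(cov), 1)
print(f"fitted exponent a = {a:.3f}, coverage@1 = {cov[0]:.3f}, coverage@1024 = {cov[-1]:.3f}")
```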
🚨New Breakthrough in Tip-of-the-Tongue (TOT) Retrieval Research! We address data limitations and offer a fresh evaluation method for complex TOT queries. Curious how TREC TOT track test queries are created? Check out this thread🧵 and our paper📄:
2
11
30
The most bullish AI capability I'm looking for is not whether it's able to solve PhD-grade problems. It's whether you'd hire it as a junior intern. Not "solve this theorem" but "get your Slack set up, read these onboarding docs, do this task and let's check in next week".
356
680
10K
Agency > Intelligence. I had this intuitively wrong for decades, I think due to a pervasive cultural veneration of intelligence, various entertainment/media, obsession with IQ etc. Agency is significantly more powerful and significantly more scarce. Are you hiring for agency? Are
1K
7K
37K
We are releasing CodeMonkeys, a system for solving SWE-bench problems with a focus on careful parallel and serial scaling of test-time compute! CodeMonkeys solves 57.4% of issues on SWE-bench Verified, and running our selection mechanism on an ensemble of existing top
My fellow code monkeys (@jordanjuravsky @ryansehrlich) and I are excited to release CodeMonkeys: a system for solving SWE-bench issues specifically designed to leverage test-time compute! CodeMonkeys solves 57.4% of issues on SWE-bench Verified. A core component of our system
5
23
122
Announcing LANG-JEPA — a new language model architecture I’m working on that optimizes in “concept” space instead of “token” space. Inspired by @ylecun's JEPA (I-JEPA for images, V-JEPA for video), LANG-JEPA asks: What if we train for conceptual understanding directly, rather
24
118
943
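LANG-JEPA's internals aren't described in the tweet above, so the following is only a schematic guess at what a JEPA-style objective for text could look like: predict the representation of a target span from the representation of its context, with the loss computed in embedding ("concept") space rather than over tokens. Every module, dimension, and design choice below is hypothetical.

```python
# Schematic JEPA-style objective for text, in embedding ("concept") space.
# Hypothetical sketch; not the LANG-JEPA implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

d = 256
context_encoder = nn.Sequential(nn.Linear(768, d), nn.GELU(), nn.Linear(d, d))
target_encoder  = nn.Sequential(nn.Linear(768, d), nn.GELU(), nn.Linear(d, d))
predictor       = nn.Sequential(nn.Linear(d, d), nn.GELU(), nn.Linear(d, d))

def jepa_loss(context_feats: torch.Tensor, target_feats: torch.Tensor) -> torch.Tensor:
    """context_feats / target_feats: (batch, 768) pooled features of the visible
    context and of the masked target span (e.g. from a frozen text encoder)."""
    z_ctx = context_encoder(context_feats)
    with torch.no_grad():            # target branch is typically not backpropagated through
        z_tgt = target_encoder(target_feats)
    pred = predictor(z_ctx)
    return 1 - F.cosine_similarity(pred, z_tgt, dim=-1).mean()

# Toy usage with random features standing in for encoder outputs:
loss = jepa_loss(torch.randn(8, 768), torch.randn(8, 768))
loss.backward()
```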
Thanks for covering our work on test time scaling! Turns out repeated sampling alone is surprisingly effective (a roughly log-linear relationship between the number of samples and coverage across many reasoning tasks), and even better when combined with sequential “thinking”!
Test-time compute scaling, made simple! @OpenAI o1/o3 made big waves by showing that inference compute can be scaled to improve downstream performance. Here is a poor man's recipe for it. “Scaling Inference Compute with Repeated Sampling” is a paper that demonstrates how repeated
0
18
139
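The "poor man's recipe" in the quoted tweet boils down to: sample many independent completions and keep one that passes a check. A toy version of that loop (the sampler and verifier below are stand-ins, not the paper's setup) could look like:

```python
# Toy repeated-sampling recipe: sample up to k candidate answers and accept the
# first one that passes a verifier. The "model" and "verifier" are stand-ins.
import random

def sample_answer(problem: str, temperature: float = 0.8) -> int:
    """Stand-in for one stochastic LLM completion (here: a noisy guess)."""
    return random.randint(0, 9)

def verify(problem: str, answer: int) -> bool:
    """Stand-in for a checker (unit tests, a math verifier, exact match, ...)."""
    return answer == 7            # pretend the ground-truth answer is 7

def solve_with_repeated_sampling(problem: str, k: int = 64):
    for attempt in range(k):
        ans = sample_answer(problem)
        if verify(problem, ans):
            return ans, attempt + 1   # solved after this many samples
    return None, k                     # no success within this budget

answer, used = solve_with_repeated_sampling("toy problem", k=64)
print(f"answer={answer}, samples used={used}")
```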
New Anthropic research: Alignment faking in large language models. In a series of experiments with Redwood Research, we found that Claude often pretends to have different views during training, while actually maintaining its original preferences.
208
708
4K
Excited to share our NeurIPS 2024 Oral, Convolutional Differentiable Logic Gate Networks, leading to a range of inference efficiency records, including inference in only 4 nanoseconds 🏎️. We reduce model sizes by factors of 29x-61x over the SOTA. Paper: https://t.co/Aptk35mKir
18
255
2K
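The core building block behind the logic gate networks mentioned above is a "gate" that learns a soft choice over two-input Boolean functions, using real-valued relaxations so the whole thing is trainable by gradient descent. The snippet below is a simplified sketch of that one idea (the real model uses all 16 two-input Boolean functions and a convolutional arrangement of gates; this is not the authors' code):

```python
# Simplified sketch of a differentiable logic gate "neuron" (a reduced version of
# the idea behind differentiable logic gate networks).
import torch
import torch.nn as nn

def gate_relaxations(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    """Real-valued relaxations of a few two-input Boolean functions, inputs in [0, 1]."""
    return torch.stack([
        a * b,               # AND
        a + b - a * b,       # OR
        a + b - 2 * a * b,   # XOR
        1 - a * b,           # NAND
        a,                   # pass-through A
        1 - a,               # NOT A
    ], dim=-1)

class DiffLogicGate(nn.Module):
    """One gate that learns a soft distribution over the candidate Boolean functions."""
    def __init__(self, n_ops: int = 6):
        super().__init__()
        self.logits = nn.Parameter(torch.zeros(n_ops))

    def forward(self, a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
        ops = gate_relaxations(a, b)                  # (..., n_ops)
        weights = torch.softmax(self.logits, dim=-1)  # soft gate selection
        return (ops * weights).sum(dim=-1)

# Toy usage: after training, the softmax typically concentrates on a single gate,
# which is what enables the very fast discretized inference mentioned in the tweet.
gate = DiffLogicGate()
out = gate(torch.rand(4), torch.rand(4))
out.sum().backward()
```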
I read Google's paper about their quantum computer so you don't have to. They claim to have run a quantum computation in 5 minutes that would take a normal computer 10^25 years. But what was that computation? Does it live up to the hype? I will break it down.🧵
505
3K
24K
Training Large Language Models to Reason in a Continuous Latent Space. Introduces a new paradigm for LLM reasoning called Chain of Continuous Thought (COCONUT). Extremely simple change: instead of mapping between hidden states and language tokens using the LLM head and embedding
51
301
2K
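The trick described in the tweet above is to feed the model's last hidden state back in as the next input embedding, skipping the round trip through the vocabulary. A rough sketch of that loop with a small Hugging Face model follows (illustrative only: the prompt, number of latent steps, and decoding step are assumptions, not COCONUT's training procedure):

```python
# Rough sketch of "continuous thought": reuse the last hidden state as the next
# input embedding instead of decoding a token. Not the paper's actual setup.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

prompt = "Question: 2 + 3 * 4 = ? Let's think."
input_ids = tok(prompt, return_tensors="pt").input_ids
inputs_embeds = model.get_input_embeddings()(input_ids)

num_latent_steps = 4   # assumed number of "continuous thoughts"
with torch.no_grad():
    for _ in range(num_latent_steps):
        out = model(inputs_embeds=inputs_embeds, output_hidden_states=True)
        last_hidden = out.hidden_states[-1][:, -1:, :]       # (1, 1, d_model)
        # Append the hidden state directly as the next "input token" embedding.
        inputs_embeds = torch.cat([inputs_embeds, last_hidden], dim=1)

    # After the latent steps, switch back to ordinary token decoding.
    out = model(inputs_embeds=inputs_embeds)
    next_token = out.logits[:, -1, :].argmax(dim=-1)
print(tok.decode(next_token))
```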
AI as an AI compiler? Very excited to release KernelBench, a new code generation benchmark for evaluating models' ability to generate correct and efficient CUDA kernels. KernelBench has 4 levels:
Level 1 (100 tasks): Single-kernel operators (e.g. matmuls)
Level 2 (100 tasks):
Kernels are the kernel of deep learning. 🙃...but writing kernels sucks. Can LLMs help? 🤔 Introducing 🌽 KernelBench (Preview), a new coding benchmark designed to evaluate the ability of LLMs to generate ⚡️efficient💨 GPU kernels for optimizing neural network performance.
3
29
165
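A benchmark like this ultimately scores each generated kernel on two things: does it match a reference implementation, and is it faster. Below is a simplified harness sketch (not KernelBench's actual evaluation code; the "candidate" here is just another PyTorch function standing in for a compiled, model-generated CUDA kernel):

```python
# Simplified harness: check a candidate op against a reference for correctness,
# then compare wall-clock speed. A real CUDA-kernel benchmark would compile and
# launch generated kernels; here the candidate is a stand-in PyTorch function.
import time
import torch

def reference_op(x: torch.Tensor, w: torch.Tensor) -> torch.Tensor:
    return x @ w                   # reference: plain matmul

def candidate_op(x: torch.Tensor, w: torch.Tensor) -> torch.Tensor:
    return torch.matmul(x, w)      # stand-in for a model-generated kernel

def evaluate(candidate, reference, shape=(1024, 1024), trials=20, device=None):
    device = device or ("cuda" if torch.cuda.is_available() else "cpu")
    x = torch.randn(shape, device=device)
    w = torch.randn(shape, device=device)

    # 1) Correctness: outputs must match within tolerance.
    ok = torch.allclose(candidate(x, w), reference(x, w), rtol=1e-3, atol=1e-3)

    # 2) Speed: median wall-clock time over several trials.
    def timeit(fn):
        times = []
        for _ in range(trials):
            if device == "cuda":
                torch.cuda.synchronize()
            t0 = time.perf_counter()
            fn(x, w)
            if device == "cuda":
                torch.cuda.synchronize()
            times.append(time.perf_counter() - t0)
        return sorted(times)[len(times) // 2]

    speedup = timeit(reference) / timeit(candidate)
    return ok, speedup

correct, speedup = evaluate(candidate_op, reference_op)
print(f"correct={correct}, speedup vs. reference={speedup:.2f}x")
```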
AI for AI chips for AI!
Great article in Freethink about the use of AI in chip design by NVIDIA, Synopsys, and Cadence, as well as Google's use of AlphaChip! https://t.co/IvquPpVrO1
3
1
58
After the news was announced, John and Demis reunited with their teams in London. A snapshot of what they had to say ↓
3
21
246
Winning the @NobelPrize is the honour of a lifetime and the realisation of a lifelong dream - it still hasn’t really sunk in yet. With AlphaFold2 we cracked the 50-year grand challenge of protein structure prediction: predicting the 3D structure of a protein purely from its
BREAKING NEWS The Royal Swedish Academy of Sciences has decided to award the 2024 #NobelPrize in Chemistry with one half to David Baker “for computational protein design” and the other half jointly to Demis Hassabis and John M. Jumper “for protein structure prediction.”
384
919
9K
Timely and fascinating research on LLM security in multi-agent systems! The discovery of 'Prompt Infection' highlights the vulnerabilities of larger models and the critical need for robust safeguards.
🚨 Multi-agent systems are no longer safe from prompt injection! In our paper, we introduce Prompt Infection—an infectious prompt injection attack that spreads like a virus across LLM agents, turning your multi-agent system into a network of compromised agents. TL;DR: 1. One
1
2
6