Tianjun Zhang
@tianjun_zhang
Followers 2K · Following 991 · Media 23 · Statuses 153
Project Lead of LiveCodeBench, RAFT and Gorilla LLM, PhD student @berkeley_ai
California, USA
Joined March 2017
Introducing DeepCoder-14B-Preview - our fully open-sourced reasoning model reaching o1 and o3-mini level on coding and math. The best part is, we’re releasing everything: not just the model, but the dataset, code, and training recipe—so you can train it yourself!🔥 Links below:
23
206
880
Proud to share what we have built! Tops the @lmarena_ai leaderboard with only 17B parameters. Huge win for open source! Enjoy 😉
Introducing our first set of Llama 4 models! We’ve been hard at work doing a complete re-design of the Llama series. I’m so excited to share it with the world today and mark another major milestone for the Llama herd as we release the *first* open source models in the Llama 4…
1
0
29
🚨 NEW PAPER: "Optimizing Test-Time Compute via Meta Reinforcement Fine-Tuning"! 🤔 With all these long-reasoning LLMs, what are we actually optimizing for? Length penalties? Token budgets? We needed a better way to think about it! Website: https://t.co/G5JTmryx0d 🧵[1/9]
6
63
312
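To make the question above concrete: one simple way to "optimize test-time compute" is a reward that trades correctness against tokens spent. A minimal sketch, where `budgeted_reward` and its parameters are hypothetical illustrations; the paper's actual meta-RL objective is more involved.

```python
# Hypothetical sketch: one way to make "test-time compute" explicit in a
# reward, trading off answer correctness against tokens spent. The paper's
# actual objective (meta reinforcement fine-tuning) differs; names here are
# illustrative, not from the paper.

def budgeted_reward(is_correct: bool, tokens_used: int,
                    token_budget: int = 4096, penalty: float = 0.1) -> float:
    """Reward = 1 for a correct answer, minus a penalty proportional to
    how much of the token budget the reasoning trace consumed."""
    correctness = 1.0 if is_correct else 0.0
    length_cost = penalty * min(tokens_used / token_budget, 1.0)
    return correctness - length_cost

# Example: a correct answer that used half the budget.
print(budgeted_reward(True, 2048))  # 0.95
```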
What do developers 𝘳𝘦𝘢𝘭𝘭𝘺 think of AI coding assistants? In October, we launched @CopilotArena to collect user preferences on real dev workflows. After months of live service, we’re here to share our findings in our recent preprint. Here's what we have learned /🧵
Introducing Copilot Arena - Interactive coding evaluation in the wild. Our extension lets you test top models for free, right in VSCode. Let's vote and build the Copilot leaderboard! Download here: https://t.co/Zyc9iL3u9m Led by @iamwaynechi and @valeriechen_ at CMU. 1/🧵
2
38
161
LLMs for GPU kernel🌽generation have been getting Pop🍿ular since our preview last Dec; excited to announce 📢 our full paper 📃 for KernelBench! Turns out KernelBench is quite challenging 🧠 — frontier models outperform the PyTorch Eager baseline <20% of the time. More 🧵👇
9
79
310
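For context, a harness like KernelBench needs two checks per generated kernel: numerical agreement with the PyTorch eager reference, and a speedup over it. A minimal sketch under my own assumptions (function names, tolerances, and timing loop are illustrative, not the released harness):

```python
# Minimal sketch (assumptions mine, not the KernelBench harness) of the kind
# of check such a benchmark runs: a candidate kernel must match the PyTorch
# eager reference numerically, and only counts as a win if it is also faster.
import time
import torch

def evaluate_kernel(candidate_fn, reference_fn, example_input,
                    rtol: float = 1e-3, atol: float = 1e-3):
    ref_out = reference_fn(example_input)
    cand_out = candidate_fn(example_input)
    correct = torch.allclose(cand_out, ref_out, rtol=rtol, atol=atol)

    def bench(fn, iters: int = 100) -> float:
        fn(example_input)                      # warm-up
        start = time.perf_counter()
        for _ in range(iters):
            fn(example_input)
        return (time.perf_counter() - start) / iters

    speedup = bench(reference_fn) / bench(candidate_fn)
    return correct, speedup

# Example: a (trivially) "generated" softmax vs. the eager baseline.
x = torch.randn(1024, 1024)
ok, speedup = evaluate_kernel(lambda t: torch.softmax(t, dim=-1),
                              lambda t: torch.softmax(t, dim=-1), x)
print(ok, round(speedup, 2))
```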
With the success of LLM agents like OpenAI Operator, we are entering a new scaling era, but how do we train these agent models? We present InSTA, the largest training environment for LLM agents, containing live web navigation tasks for 150k diverse websites in multiple…
9
31
161
✨RL magic is in the air! Introducing DeepScaleR-1.5B-Preview—a fully open-source, 1.5B-parameter model trained with RL to surpass o1-preview for general math reasoning. 📜Blog: https://t.co/eHqApwRfnH 💻Github: https://t.co/tRsDN7xV4M
16
50
151
Magic of RL! You don’t need super large models to develop such behavior! Congrats @jiayi_pirate!
We reproduced DeepSeek R1-Zero in the CountDown game, and it just works. Through RL, the 3B base LM develops self-verification and search abilities all on its own. You can experience the Aha moment yourself for < $30. Code: https://t.co/B2IsN1PrXV Here's what we learned 🧵
1
0
4
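CountDown works well as an RL testbed because the reward is fully rule-based: the environment can verify an emitted arithmetic expression exactly. A minimal sketch, where `countdown_reward` and the exact reward shaping are my assumptions, not the released code:

```python
# Hedged sketch of the rule-based reward that makes CountDown a clean RL
# testbed: the model emits an arithmetic expression, and the environment can
# verify it exactly. Function name and reward shaping are my assumptions.
import ast
from collections import Counter

def countdown_reward(expr: str, numbers: list[int], target: int) -> float:
    """1.0 if expr uses exactly the given numbers and evaluates to target."""
    try:
        tree = ast.parse(expr, mode="eval")
        used = [n.value for n in ast.walk(tree) if isinstance(n, ast.Constant)]
        if Counter(used) != Counter(numbers):
            return 0.0  # must use each provided number exactly once
        value = eval(compile(tree, "<expr>", "eval"), {"__builtins__": {}})
        return 1.0 if abs(value - target) < 1e-6 else 0.0
    except Exception:
        return 0.0  # malformed expressions earn nothing

print(countdown_reward("(25 - 1) * (4 + 1)", [25, 1, 4, 1], 120))  # 1.0
print(countdown_reward("25 * 4", [25, 1, 4, 1], 100))              # 0.0
```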
Congrats to @OpenAI on the impressive performance of the o1 model! It seems o1 already achieves 76% on LiveCodeBench; how should we improve it to make it harder 🤔🤔
2
0
14
I will be at #NeurIPS2024 this week! Happy to chat about large scale RL for reasoning and agents!
0
0
18
Check out the new VideoArena! Pick your favorite video 🚀
🚀 Just Launched: VideoArena!🎥 Discover head-to-head comparisons of video clips generated from the same prompts across top text-to-video models. Compare outputs from 7 leading models and we're adding more soon! 🔗 Check out the leaderboard: https://t.co/IkQTFB7am5
#Text2Video
1
0
6
Check out the amazing BFCL V2!
🚀Excited to announce the release of BFCL V2 • Live! 🏆 As LLMs evolve into intelligent agents, the Berkeley Function-Calling Leaderboard (BFCL) is leading the way in evaluating their real-world function-calling capabilities. V2 • Live features 📢 enterprise-contributed data, …
0
0
8
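At its core, function-calling evaluation checks that the model names the right function and binds the right arguments. A toy sketch under my own assumptions; BFCL's real harness does richer AST-based matching than this exact-equality check:

```python
# Hedged sketch of the core check in function-calling evaluation: the model's
# emitted call must name the right function and bind the right arguments.
# BFCL's real harness does richer AST-based matching; this toy version and
# its names are my assumptions.
import json

def call_matches(model_output: str, expected: dict) -> bool:
    """expected = {"name": ..., "arguments": {...}}; model_output is JSON."""
    try:
        call = json.loads(model_output)
    except json.JSONDecodeError:
        return False
    return (call.get("name") == expected["name"]
            and call.get("arguments") == expected["arguments"])

expected = {"name": "get_weather", "arguments": {"city": "Berkeley", "unit": "C"}}
print(call_matches('{"name": "get_weather", '
                   '"arguments": {"city": "Berkeley", "unit": "C"}}', expected))  # True
```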
Two new papers on self-improvement: paper 1 today ⬇️ In RISE, we build on online imitation to teach LLMs *how* to improve their own responses *sequentially*. w/ Llama2/3/Mistral, this gives solid +10-20% in 5 turns, outperforms parallel sampling! https://t.co/hVy3T1ZoGi 🧵⬇️
1
28
125
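The sequential setup RISE targets differs from parallel sampling in that each turn conditions on the model's previous attempt. A minimal sketch, where `generate` and `score` are hypothetical stand-ins for an LLM call and a verifier, not the paper's API:

```python
# Hedged sketch of the sequential-improvement loop RISE-style training
# targets: instead of sampling k answers in parallel, the model revises its
# own previous attempt over several turns.
from typing import Callable

def sequential_refine(question: str,
                      generate: Callable[[str], str],
                      score: Callable[[str, str], float],
                      turns: int = 5) -> str:
    prompt, best, best_score = question, None, float("-inf")
    for _ in range(turns):
        answer = generate(prompt)
        s = score(question, answer)
        if s > best_score:
            best, best_score = answer, s
        # Feed the previous attempt back so the next turn can improve on it.
        prompt = (f"{question}\n\nPrevious attempt:\n{answer}\n"
                  "Improve on the previous attempt.")
    return best
```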
@AIatMeta And we use the Berkeley Function-Calling Leaderboard for evaluation! Congrats to my colleagues @shishirpatil_ @charlie_jcj02 @HuanzhiMao @profjoeyg Ion Stoica @fanjia_yan🫡
0
1
8
This paper claims that Llama3-8B+BoT (Buffer of Thoughts) has the potential to surpass the Llama3-70B model. 🤯 'Buffer of Thoughts: Thought-Augmented Reasoning with Large Language Models' - Proposes a buffer-manager to dynamically update the meta-buffer, thus enhancing the capacity…
10
93
621
Thought-Augmented Reasoning with LLMs
Presents a thought-augmented reasoning approach, Buffer of Thoughts, to enhance the accuracy, efficiency, and robustness of LLM-based reasoning. It leverages a meta-buffer containing high-level thoughts (thought templates) distilled from…
13
97
443
🤔 Why should LLMs only follow one thought template (e.g., CoT)? In our paper, LLMs can select their own thought process flexibly! Big improvement on agentic tasks! 🎉
Excited to introduce our new prompting method on LLMs, Buffer of Thoughts (BoT), collaborating with @tianjun_zhang at @berkeley_ai. Notably, Llama3-8B+BoT can beat Llama3-70B on reasoning tasks. Paper: https://t.co/M4KjqlhiyZ Code: https://t.co/DMaUcu8IOi
1
2
9
Buffer of Thoughts: Thought-Augmented Reasoning with Large Language Models Significant performance improvements over previous SOTA methods: 11% on Game of 24, 20% on Geometric Shapes and 51% on Checkmate-in-One. repo: https://t.co/wMT8p9h5sW abs: https://t.co/PpENCDeNTN
3
32
165
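Putting the BoT tweets above together: the core mechanism is a meta-buffer of reusable thought templates, with the most relevant one retrieved per problem instead of a single fixed CoT prompt. A toy sketch; retrieval here is naive word overlap, whereas the paper uses embedding-based retrieval and a buffer-manager to update templates:

```python
# Hedged sketch of the Buffer-of-Thoughts idea: keep a meta-buffer of
# reusable thought templates and retrieve the one most relevant to a new
# problem. Templates and the overlap heuristic are my own toy stand-ins.
META_BUFFER = {
    "arithmetic-game": "Enumerate operand orderings, prune branches that "
                       "cannot reach the target, verify the final expression.",
    "geometry":        "Restate the figure's constraints, derive unknowns "
                       "step by step, check units and degenerate cases.",
    "chess-mate":      "List all checks first, then forcing replies, and "
                       "verify the king has no escape squares.",
}

def retrieve_template(problem: str) -> str:
    """Pick the template whose words overlap most with the problem."""
    words = set(problem.lower().split())
    def overlap(item):
        key, template = item
        return len(words & set((key + " " + template).lower().split()))
    return max(META_BUFFER.items(), key=overlap)[1]

problem = "Use 4, 7, 8, 8 to reach the target 24 in this arithmetic game."
print(f"Thought template: {retrieve_template(problem)}\nProblem: {problem}")
```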