
Beidi Chen
@BeidiChen
Followers: 15K · Following: 1K · Media: 35 · Statuses: 530
Asst. Prof @CarnegieMellon, @amazon Scholar, Prev: Visiting Researcher @Meta, Postdoc @Stanford, Ph.D. @RiceUniversity, Large-Scale ML, a fan of Dota2.
Joined November 2011
🎉 Glad to see our attention sink is widely adopted and contributing to strong open-source models ~ please check out this post by @Guangxuan_Xiao for many insights and hypotheses. It would be interesting for folks who’ve seen artifacts / outliers in generated content and models.
I've written the full story of Attention Sinks — a technical deep-dive into how the mechanism was developed and how our research ended up being used in OpenAI's new OSS models. For those interested in the details:
3
6
125
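For readers new to the mechanism: the StreamingLLM-style recipe keeps the first few tokens (the attention sinks) in the KV cache alongside a sliding window of recent tokens. Open models differ in how they realize the sink, so the following is only a minimal eviction sketch with illustrative sizes, not the actual implementation:

```python
import torch

def evict_kv_cache(keys, values, num_sink=4, window=1024):
    """Toy sink-aware cache eviction: always keep the first `num_sink`
    tokens (the attention sinks) plus the most recent `window` tokens,
    and drop everything in between.
    keys/values: tensors of shape [seq_len, num_heads, head_dim]."""
    seq_len = keys.shape[0]
    if seq_len <= num_sink + window:
        return keys, values                       # nothing to evict yet
    keep = torch.cat([
        torch.arange(num_sink),                   # sink tokens at the front
        torch.arange(seq_len - window, seq_len),  # recent sliding window
    ])
    return keys[keep], values[keep]
```

The point of the sketch is the `keep` index set: evicting those first tokens is what destabilized generation in the original StreamingLLM experiments.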
RT @InfiniAILab: 🤖 GPT-5 supports 128K output / 400K input tokens. 📜 Wiles’s Fermat proof took ~88K tokens — the final output only. 🧩 Add…
0
2
0
RT @Guangxuan_Xiao: The release of GPT-OSS-120B & GPT-OSS-20B models today incorporates my Attention Sink work…
0
47
0
Big Congrats @Anshumali_ 🎈🎉
Congrats to Rice CS' @Anshumali_ Shrivastava, who has been promoted to full professor. Shrivastava is well on his way to revolutionizing how LLMs & other deep learning models are trained & stored, using new algorithms to make AI scalable & more accessible.
1
0
27
RT @haoailab: (1/n) 🚀 With FastVideo, you can now generate a 5-second video in 5 seconds on a single H200 GPU! Introducing the FastWan series…
0
110
0
🥳
Huge thanks to @tinytitans_icml for an amazing workshop — see you next year! Honored to receive a Best Paper Award 🏆. Let’s unlock the potential of sparsity! Next up: scaling to hundreds/thousands of rollouts? Or making powerful R1/K2-level LLMs (not just 8B 4-bit models) run…
8
5
145
RT @IronSteveZhou: I will be in front of the GSM-Infinite poster tomorrow, 2-4:30 pm 🫡 East Exhibition Hall E-2901. Please come and say hi…
0
2
0
Cool blog! Initially, the long-term memory and consistency requirements of world models were something I felt current video-gen techniques couldn’t get us to yet, so I wasn’t a believer in world models built just by scaling video generation. But I gave it a second thought — maybe I was worrying too much…
What exactly is a "world model"? And what limits existing video generation models from being true world models? In my new blog post, I argue that a true video world model must be causal, interactive, persistent, real-time, and physically accurate.
0
3
13
👀
We are just a little bit faster than @nvidia GPUs on Qwen 235B. 18X faster. @CerebrasSystems inference is blazing fast. Come build cool stuff on Cerebras inference.
0
0
7
RT @svlevine: Action chunking is a great idea in robotics: by getting a model to produce a short sequence of actions, it _just works better…
0
108
0
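Since the RT above is truncated, here is a minimal sketch of what action chunking means in a control loop: the policy predicts a short sequence of actions and the controller executes the whole chunk open-loop before querying the model again. The `policy` and `env` callables are hypothetical stand-ins, not any specific robotics API:

```python
def rollout_with_chunking(policy, env, horizon=200, chunk=8):
    """Toy control loop with action chunking: query the policy once per
    `chunk` steps and execute the whole predicted chunk open-loop,
    instead of re-planning after every single action."""
    obs = env.reset()
    for _ in range(horizon // chunk):
        actions = policy(obs)              # predicts `chunk` actions at once
        for action in actions[:chunk]:
            obs, reward, done, info = env.step(action)
            if done:
                return obs
    return obs
```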
Always wanted to get rid of it!! I remember suspecting a correlation between the success of Llama 3 and the big expansion of its vocab size 😁 It was also very painful for speculative decoding (we once wanted to use an SSM as a draft model for long-context transformers but failed due…
Tokenization has been the final barrier to truly end-to-end language models. We developed the H-Net: a hierarchical network that replaces tokenization with a dynamic chunking process directly inside the model, automatically discovering and operating over meaningful units of data.
3
8
148
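To make "dynamic chunking" concrete: a tokenizer-free model has to decide its own segment boundaries over raw bytes inside the forward pass. Below is a minimal sketch of that idea, assuming a learned per-position boundary score; it illustrates the concept, not the H-Net architecture:

```python
import torch
import torch.nn as nn

class DynamicChunker(nn.Module):
    """Toy boundary predictor: embeds raw bytes, scores a chunk boundary
    at each position, and mean-pools the bytes between boundaries into
    variable-length chunk embeddings."""
    def __init__(self, dim=256):
        super().__init__()
        self.embed = nn.Embedding(256, dim)   # one entry per byte value
        self.boundary = nn.Linear(dim, 1)     # boundary score per position

    def forward(self, byte_ids):              # byte_ids: LongTensor [seq_len]
        x = self.embed(byte_ids)              # [seq_len, dim]
        is_boundary = self.boundary(x).squeeze(-1) > 0
        chunks, start = [], 0
        for i in range(len(byte_ids)):
            if is_boundary[i] or i == len(byte_ids) - 1:
                chunks.append(x[start:i + 1].mean(dim=0))
                start = i + 1
        return torch.stack(chunks)            # [num_chunks, dim]
```

Everything downstream then operates on variable-length learned units instead of a fixed external vocabulary, which is also why a huge tokenizer vocab stops being a constraint.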
RT @allen_ai: Introducing FlexOlmo, a new paradigm for language model training that enables the co-development of AI through data collabora…
0
74
0
RT @WentaoGuo7: 🦆🚀QuACK🦆🚀: new SOL mem-bound kernel library without a single line of CUDA C++, all straight in Python thanks to CuTe-DSL. On…
0
73
0
I was asked many times lately which repo to use by students working on test-time scaling with slightly modified attention or generation workflows (customized reward models / search). HF is a bit too time-consuming, especially with tons of token generation, and SGLang/vLLM are a bit hard…
🧵 Glad to introduce LiteSys, the inference framework we used in 📄 Kinetics: Rethinking Test-Time Scaling Laws to evaluate test-time scaling (32K+ generated tokens) at scale. If you are: ✅ Looking for an inference framework that's easy to extend…
2
24
224
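For readers wondering what a "customized reward model / search" workflow looks like at its simplest, here is a hedged best-of-N sketch; `generate` and `reward_model` are hypothetical placeholders for whatever hooks the serving framework exposes, not LiteSys APIs:

```python
def best_of_n(prompt, generate, reward_model, n=8):
    """Toy test-time scaling via best-of-N: sample n candidate
    completions, score each with a reward model, and return the
    highest-scoring one. `generate(prompt) -> str` and
    `reward_model(prompt, completion) -> float` are stand-ins."""
    candidates = [generate(prompt) for _ in range(n)]
    scores = [reward_model(prompt, c) for c in candidates]
    best = max(range(n), key=lambda i: scores[i])
    return candidates[best]
```

Swapping the inner loop for beam search or a verifier-guided tree search is where a framework that is easy to extend pays off.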
RT @aviral_kumar2: Given the confusion around what RL does for reasoning in LLMs, @setlur_amrith & I wrote a new blog post on when RL simpl…
pinnate-flare-8f3.notion.site · Amrith Setlur and Aviral Kumar, Carnegie Mellon University
0
39
0