Cade Daniel 🇺🇸 Profile
Cade Daniel 🇺🇸

@cdnamz

Followers: 1K
Following: 175K
Media: 12
Statuses: 341

systems performance

San Francisco, CA
Joined July 2012
@HrishbhDalal
Hrishbh Dalal
1 year
What if we could teach LLMs to be algorithm inventors? I trained an LLM to improve sorting algorithms through pure reinforcement learning - and it discovered optimizations giving 47.92x speedups over an optimized Python-based Timsort baseline! No cold-start data needed. I used
20
63
784
@jefrankle
Jonathan Frankle
1 year
The hardest part about finetuning LLMs is that people generally don't have high-quality labeled data. Today, @databricks introduced TAO, a new finetuning method that only needs inputs, no labels necessary. Best of all, it actually beats supervised finetuning on labeled data.
13
137
898
@simran_s_arora
Simran Arora
1 year
BASED ✌️ turns 1! One year since its launch at NeurIPS 2023 — and it's helped shape the new wave of efficient LMs. ⚡️ Fastest linear attention kernels 🧠 405B models trained on 16 GPUs 💥 Inspired Mamba-v2, RWKVs, MiniMax Check out our retrospective below!
3
57
106
@hongyangzh
Hongyang Zhang
1 year
Jointly announcing EAGLE-3 with SGLang: Setting a new record in LLM inference acceleration! - 5x🚀than vanilla (on HF) - 1.4x🚀than EAGLE-2 (on HF) - A record of ~400 TPS on Llama 3.1 8B with a single H100 (on SGLang) - 1.65x🚀in latency even for large bs=64 (on SGLang) - A new
14
43
298
@shanli_xing
Shanli Xing
1 year
🚀Meet flashinfer.sampling—our sorting-free GPU kernels for lightning-fast #LLM sampling. Our implementation achieves over 50% reduction in sampling time. Blog post: https://t.co/R780Rth03x
1
33
181
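The tweet's sorting-free claim can be illustrated with a minimal sketch. flashinfer's actual kernels use rejection-based GPU sampling (their design, not shown here); the Gumbel-max trick below is a well-known, simpler example of how categorical sampling can avoid the sort/CDF-scan that naive top-p implementations pay for. All names in this snippet are illustrative, not flashinfer's API.

```python
import numpy as np

def gumbel_max_sample(logits: np.ndarray, rng: np.random.Generator) -> int:
    """Sample a token index ~ softmax(logits) without sorting or a CDF scan.

    Adding i.i.d. Gumbel noise to the logits and taking the argmax is
    distributionally equivalent to categorical sampling from softmax(logits).
    """
    gumbel = -np.log(-np.log(rng.uniform(size=logits.shape)))
    return int(np.argmax(logits + gumbel))

rng = np.random.default_rng(0)
logits = np.array([0.1, 2.0, -1.0, 0.5])
counts = np.bincount(
    [gumbel_max_sample(logits, rng) for _ in range(10_000)], minlength=4
)
# Index 1 has ~70% of the softmax mass, so it should dominate the counts.
```

The argmax reduction parallelizes well on a GPU, which is the structural reason sorting-free formulations can be much faster than sort-based ones at large vocabulary sizes.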
@simonguozirui
Simon Guo
1 year
LLMs for GPU kernel🌽generation have been getting Pop🍿ular since our preview last Dec; excited to announce 📢 our full paper 📃 for KernelBench! Turns out KernelBench is quite challenging 🧠 — frontier models outperform the PyTorch Eager baseline <20% of the time. More 🧵👇
9
73
306
@cdnamz
Cade Daniel 🇺🇸
1 year
Welcome @istoica05
@haozhangml
Hao Zhang
1 year
Thrilled to see @istoica05 joining X and couldn't agree more with his insights on the importance of shared infrastructure. "Open source" encompasses more than just open weights—it includes open data, open artifacts, and open infrastructure!
0
0
11
@cdnamz
Cade Daniel 🇺🇸
1 year
Congrats!
@victor207755822
Deli Chen
1 year
Unbelievable results, feels like a dream—our R1 model is now #1 in the world (with style control)! 🌍🏆 Beyond words right now. 🤯 All I know is we keep pushing forward to make open-source AGI a reality for everyone. 🚀✨ #OpenSource #AI #AGI #DeepSeekR1
0
0
3
@Grad62304977
Grad
1 year
People waking up to take their bitter lesson pill https://t.co/fswrLVjfCC
@rm_rafailov
Rafael Rafailov @ NeurIPS
1 year
DeepSeek R1 with "Cold Start" pretty much works as expected. I still don't buy the R1 Zero result, the base models barely output coherent solutions without finagling. My bet is there is some correction/reflection/backtracking-like data in mid-training.
3
3
89
@cdnamz
Cade Daniel 🇺🇸
1 year
love finding bangers so damn good they force a follow
0
1
12
@Suhail
Suhail
1 year
Once the AI labs realize they need to make products for survival, they will immediately reformulate their strategy to competing with the most obvious working thing that is vaguely under the guise of the original mission. You should presume you will be ruthlessly copied.
30
39
774
@rchoudhury997
Rohan Choudhury
1 year
Excited to finally release our NeurIPS 2024 (spotlight) paper! We introduce Run-Length Tokenization (RLT), a simple way to significantly speed up your vision transformer on video with no loss in performance!
22
170
1K
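The core idea of Run-Length Tokenization, as I understand the paper, is run-length-style compression over time: patch tokens that are (nearly) unchanged from the previous frame are dropped before the transformer sees them. A toy numpy sketch, with hypothetical names and a simplified "keep if it changed" rule rather than the paper's exact mechanism:

```python
import numpy as np

def drop_static_patches(patches: np.ndarray, tol: float = 1e-6):
    """Keep a patch token only if it differs from the same spatial patch in
    the previous frame (run-length-style compression along time).

    patches: [T, N, D] array of patch features for T frames, N patches each.
    Returns (frame, patch) index pairs of the tokens that survive.
    """
    T, N, _ = patches.shape
    kept = [(0, n) for n in range(N)]  # frame 0 is always kept in full
    for t in range(1, T):
        changed = np.abs(patches[t] - patches[t - 1]).max(axis=-1) > tol
        kept.extend((t, n) for n in np.nonzero(changed)[0])
    return kept

# A fully static 4-frame clip compresses down to the first frame's tokens.
video = np.tile(np.random.default_rng(1).normal(size=(1, 16, 8)), (4, 1, 1))
kept = drop_static_patches(video)
```

Since videos have large static regions, the token count (and thus the quadratic attention cost) drops substantially with no change to the model itself.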
@vima_gupta
Vima Gupta
1 year
1/7 🧵 MoEs: A tale of expectation vs reality Marketing: "Only compute the expert parameters you need!" Reality: Batch 16 requests → ALL experts activate At serving time (vLLM/TGI), arithmetic intensity: AI ≈ (num_tokens * top_k) / total_experts In simpler terms: Your decode
4
7
32
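The thread's estimate AI ≈ (num_tokens * top_k) / total_experts is easy to sanity-check numerically. A small sketch (helper names are mine, assuming uniform routing) that computes both the expected tokens per expert and the chance a given expert sits idle:

```python
from math import comb

def tokens_per_expert(num_tokens: int, top_k: int, total_experts: int) -> float:
    """Expected tokens routed to each expert under uniform routing.

    This is the quantity the thread's AI estimate tracks: when it is small,
    each expert's GEMM is tiny, so decode stays memory-bound even though
    "only top_k experts" fire per token.
    """
    return num_tokens * top_k / total_experts

def p_expert_idle(num_tokens: int, top_k: int, total_experts: int) -> float:
    """Probability a given expert receives zero tokens (uniform routing)."""
    p_miss = comb(total_experts - 1, top_k) / comb(total_experts, top_k)
    return p_miss ** num_tokens

# Mixtral-like config: batch of 16 decode tokens, top-2 of 8 experts.
tpe = tokens_per_expert(16, 2, 8)   # ~4 tokens per expert GEMM: memory-bound
idle = p_expert_idle(16, 2, 8)      # 0.75**16 ≈ 1%: nearly all experts fire
```

With only ~4 tokens per expert GEMM and a ~1% chance any expert is idle, the tweet's "Batch 16 → ALL experts activate" reality checks out: the weights of every expert get loaded from HBM while each does very little compute.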
@gm8xx8
𝚐𝔪𝟾𝚡𝚡𝟾
1 year
Pie: Pooling CPU Memory for LLM Inference paper: https://t.co/PoSbHta0n3 Pie is an LLM inference framework that tackles the memory challenges of large models by enabling efficient GPU-CPU memory swapping and adaptive expansion. It optimizes memory usage without increasing
arxiv.org
The rapid growth of LLMs has revolutionized natural language processing and AI analysis, but their increasing size and memory demands present significant challenges. A common solution is to spill...
1
40
171
@mitrma
Michael Matthews
1 year
🍎 The core of Kinetix is our new 2D rigid body physics engine: Jax2D. This is a minimal rewrite of the classic Box2D engine made by @erin_catto. Jax2D allows us to run thousands of heterogeneous parallel environments on a single GPU (yes, you can vmap over different tasks!) 8/
4
4
40
@MillionInt
Jerry Tworek
1 year
ARR is the only meaningful AGI metric
6
6
69
@vllm_project
vLLM
1 year
Speculative decoding is one of the best tools in vLLM's inference-optimization toolbox, accelerating inference without accuracy loss. Check out our blog post for more details about the state of spec decode in vLLM today! 🧵 https://t.co/swMbYFX8Vl
vllm.ai
Speculative decoding in vLLM is a powerful technique that accelerates token generation by leveraging both small and large models in tandem. In this blog, we’ll
5
50
236
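The "no accuracy loss" property comes from the verification step: the large model checks the draft model's guesses in a single batched forward pass and only keeps tokens it would have produced itself. A minimal sketch of the greedy special case (vLLM's real implementation also handles stochastic sampling via rejection sampling and appends a bonus token; names here are illustrative, not vLLM's API):

```python
def verify_greedy(proposed, target_preds):
    """Greedy speculative-decoding verification.

    proposed:     tokens the small draft model guessed.
    target_preds: the large model's greedy token at each of those positions,
                  computed in ONE batched forward pass over the draft.
    Accept the longest matching prefix; on the first mismatch, emit the
    target's token instead and stop. The output is exactly what the target
    model alone would have generated greedily.
    """
    accepted = []
    for p, t in zip(proposed, target_preds):
        if p == t:
            accepted.append(p)   # draft guessed right: this token is "free"
        else:
            accepted.append(t)   # correction from the target model
            break
    return accepted
```

When the draft agrees often, several tokens come out of one large-model forward pass, which is where the speedup lives; when it disagrees, you still make one token of progress.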
@neurosp1ke
Andreas Köpf
1 year
If you are interested in the latest GPU MODE news (upcoming lectures, videos etc.) please follow our new official twitter/x account: @GPU_MODE
0
2
17
@simran_s_arora
Simran Arora
1 year
Want Llama 405B, but wish it scaled linearly in sequence length??? Enter LoLCATS: an efficient method for "turning Transformers to linear attention models", all on an academic budget!! We use LoLCATS to linearize the *full Llama 3.1 model family* for the first time – 20+ points
9
87
646
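"Scaling linearly in sequence length" refers to the kernel-feature-map form of attention: instead of materializing the n×n softmax matrix, you carry running sums of feature-mapped keys. A minimal sketch with a fixed elu(x)+1 feature map; LoLCATS' actual contribution is *learning* feature maps that mimic the original softmax attention, which this snippet does not do:

```python
import numpy as np

def feature_map(x):
    # Fixed positive feature map elu(x) + 1; LoLCATS learns this instead.
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention(q, k, v):
    """Causal linear attention in O(n) time with O(1) state per step.

    q, k: [n, d]; v: [n, dv]. We maintain S = sum_j phi(k_j) v_j^T and
    z = sum_j phi(k_j) over past positions, so each step is O(d * dv).
    """
    n, d = q.shape
    dv = v.shape[1]
    S = np.zeros((d, dv))
    z = np.zeros(d)
    out = np.zeros((n, dv))
    for i in range(n):
        fk = feature_map(k[i])
        S += np.outer(fk, v[i])      # accumulate key-value outer products
        z += fk                      # accumulate normalizer
        fq = feature_map(q[i])
        out[i] = (fq @ S) / (fq @ z + 1e-6)
    return out

rng = np.random.default_rng(0)
q, k, v = (rng.normal(size=(6, 4)) for _ in range(3))
out = linear_attention(q, k, v)
```

Note that position 0 can only attend to itself, so `out[0]` recovers `v[0]` exactly (up to the epsilon), which is a quick correctness check on the recurrence.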
@Ar_Douillard
Arthur Douillard
1 year
KV Prediction for Improved Time to First Token LLM inference can be split into two phases: prefilling and decoding. The decoding phase runs in autoregressive mode, where tokens are generated one by one, re-using the previous Key/Value tensors in the KV cache. To speed up that
3
44
205
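The two phases the tweet describes can be sketched in a few lines of numpy. This toy single-head model skips projections and causal masking in prefill for brevity, and uses the raw hidden states as stand-ins for the K/V projections; the point is only the cache mechanics (prefill populates it once, decode appends one row per step instead of recomputing everything):

```python
import numpy as np

def attn(q, K, V):
    """Single-head scaled-dot-product attention of q over cached K, V."""
    w = np.exp(q @ K.T / np.sqrt(K.shape[1]))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V

rng = np.random.default_rng(0)
d = 8
prompt = rng.normal(size=(5, d))       # 5 prompt-token "hidden states"

# Prefill: process the whole prompt at once and populate the KV cache.
K_cache, V_cache = prompt.copy(), prompt.copy()
_ = attn(prompt, K_cache, V_cache)

# Decode: one token at a time, appending to and re-using the cache, so each
# step attends over seq_len cached rows instead of recomputing all K/V.
for _ in range(3):
    x = rng.normal(size=(1, d))
    K_cache = np.vstack([K_cache, x])
    V_cache = np.vstack([V_cache, x])
    y = attn(x, K_cache, V_cache)
```

Time-to-first-token is dominated by the prefill step, which is why methods like the KV Prediction paper above target that phase specifically.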