Cade Daniel 🇺🇸 Profile
Cade Daniel 🇺🇸

@cdnamz

Followers: 1K
Following: 175K
Media: 12
Statuses: 341

systems performance

San Francisco, CA
Joined July 2012
@HrishbhDalal
Hrishbh Dalal
1 year
What if we could teach LLMs to be algorithm inventors? I trained an LLM to improve sorting algorithms through pure reinforcement learning - and it discovered optimizations giving 47.92x speedups over an optimized Python-based Timsort baseline! No cold-start data needed. I used
20
63
784
@jefrankle
Jonathan Frankle
1 year
The hardest part about finetuning LLMs is that people generally don't have high-quality labeled data. Today, @databricks introduced TAO, a new finetuning method that only needs inputs, no labels necessary. Best of all, it actually beats supervised finetuning on labeled data.
13
137
898
@simran_s_arora
Simran Arora
1 year
BASED ✌️ turns 1! One year since its launch at NeurIPS 2023 — and it's helped shape the new wave of efficient LMs. ⚡️ Fastest linear attention kernels 🧠 405B models trained on 16 GPUs 💥 Inspired Mamba-v2, RWKVs, MiniMax Check out our retrospective below!
3
57
106
@hongyangzh
Hongyang Zhang
1 year
Jointly announcing EAGLE-3 with SGLang: Setting a new record in LLM inference acceleration! - 5x🚀than vanilla (on HF) - 1.4x🚀than EAGLE-2 (on HF) - A record of ~400 TPS on Llama 3.1 8B with a single H100 (on SGLang) - 1.65x🚀in latency even for large bs=64 (on SGLang) - A new
14
43
298
@shanli_xing
Shanli Xing
1 year
🚀Meet flashinfer.sampling—our sorting-free GPU kernels for lightning-fast #LLM sampling. Our implementation achieves over 50% reduction in sampling time. Blog post: https://t.co/R780Rth03x
1
33
181
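The tweet's sorting-free claim can be illustrated with a minimal sketch. flashinfer's actual kernels use rejection-based GPU sampling (their design, not shown here); the Gumbel-max trick below is a well-known, simpler example of how categorical sampling can avoid the sort/CDF-scan that naive top-p implementations pay for. All names in this snippet are illustrative, not flashinfer's API.

```python
import numpy as np

def gumbel_max_sample(logits: np.ndarray, rng: np.random.Generator) -> int:
    """Sample a token index ~ softmax(logits) without sorting or a CDF scan.

    Adding i.i.d. Gumbel noise to the logits and taking the argmax is
    distributionally equivalent to categorical sampling from softmax(logits).
    """
    gumbel = -np.log(-np.log(rng.uniform(size=logits.shape)))
    return int(np.argmax(logits + gumbel))

rng = np.random.default_rng(0)
logits = np.array([0.1, 2.0, -1.0, 0.5])
counts = np.bincount(
    [gumbel_max_sample(logits, rng) for _ in range(10_000)], minlength=4
)
# Index 1 has ~70% of the softmax mass, so it should dominate the counts.
```

The argmax reduction parallelizes well on a GPU, which is the structural reason sorting-free formulations can be much faster than sort-based ones at large vocabulary sizes.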
@simonguozirui
Simon Guo
1 year
LLMs for GPU kernel🌽generation have been getting Pop🍿ular since our preview last Dec; excited to announce 📢 our full paper 📃 for KernelBench! Turns out KernelBench is quite challenging 🧠 — frontier models outperform the PyTorch Eager baseline <20% of the time. More 🧵👇
9
73
306
@cdnamz
Cade Daniel 🇺🇸
1 year
Welcome @istoica05
@haozhangml
Hao Zhang
1 year
Thrilled to see @istoica05 joining X and couldn't agree more with his insights on the importance of shared infrastructure. "Open source" encompasses more than just open weights—it includes open data, open artifacts, and open infrastructure!
0
0
11
@cdnamz
Cade Daniel 🇺🇸
1 year
Congrats!
@victor207755822
Deli Chen
1 year
Unbelievable results, feels like a dream—our R1 model is now #1 in the world (with style control)! 🌍🏆 Beyond words right now. 🤯 All I know is we keep pushing forward to make open-source AGI a reality for everyone. 🚀✨ #OpenSource #AI #AGI #DeepSeekR1
0
0
3
@Grad62304977
Grad
1 year
People waking up to take their bitter lesson pill https://t.co/fswrLVjfCC
@rm_rafailov
Rafael Rafailov @ NeurIPS
1 year
DeepSeek R1 with "Cold Start" pretty much works as expected. I still don't buy the R1 Zero result, the base models barely output coherent solutions without finagling. My bet is there is some correction/reflection/backtracking-like data in mid-training.
3
3
89
@cdnamz
Cade Daniel 🇺🇸
1 year
love finding bangers so damn good they force a follow
0
1
12
@Suhail
Suhail
1 year
Once the AI labs realize they need to make products for survival, they will immediately reformulate their strategy to competing with the most obvious working thing that is vaguely under the guise of the original mission. You should presume you will be ruthlessly copied.
30
39
774
@rchoudhury997
Rohan Choudhury
1 year
Excited to finally release our NeurIPS 2024 (spotlight) paper! We introduce Run-Length Tokenization (RLT), a simple way to significantly speed up your vision transformer on video with no loss in performance!
22
170
1K
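The core idea of Run-Length Tokenization, as I understand the paper, is run-length-style compression over time: patch tokens that are (nearly) unchanged from the previous frame are dropped before the transformer sees them. A toy numpy sketch, with hypothetical names and a simplified "keep if it changed" rule rather than the paper's exact mechanism:

```python
import numpy as np

def drop_static_patches(patches: np.ndarray, tol: float = 1e-6):
    """Keep a patch token only if it differs from the same spatial patch in
    the previous frame (run-length-style compression along time).

    patches: [T, N, D] array of patch features for T frames, N patches each.
    Returns (frame, patch) index pairs of the tokens that survive.
    """
    T, N, _ = patches.shape
    kept = [(0, n) for n in range(N)]  # frame 0 is always kept in full
    for t in range(1, T):
        changed = np.abs(patches[t] - patches[t - 1]).max(axis=-1) > tol
        kept.extend((t, n) for n in np.nonzero(changed)[0])
    return kept

# A fully static 4-frame clip compresses down to the first frame's tokens.
video = np.tile(np.random.default_rng(1).normal(size=(1, 16, 8)), (4, 1, 1))
kept = drop_static_patches(video)
```

Since videos have large static regions, the token count (and thus the quadratic attention cost) drops substantially with no change to the model itself.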
@vima_gupta
Vima Gupta
1 year
1/7 🧵 MoEs: A tale of expectation vs reality Marketing: "Only compute the expert parameters you need!" Reality: Batch 16 requests → ALL experts activate At serving time (vLLM/TGI), arithmetic intensity: AI ≈ (num_tokens * top_k) / total_experts In simpler terms: Your decode
4
7
32
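The thread's estimate AI ≈ (num_tokens * top_k) / total_experts is easy to sanity-check numerically. A small sketch (helper names are mine, assuming uniform routing) that computes both the expected tokens per expert and the chance a given expert sits idle:

```python
from math import comb

def tokens_per_expert(num_tokens: int, top_k: int, total_experts: int) -> float:
    """Expected tokens routed to each expert under uniform routing.

    This is the quantity the thread's AI estimate tracks: when it is small,
    each expert's GEMM is tiny, so decode stays memory-bound even though
    "only top_k experts" fire per token.
    """
    return num_tokens * top_k / total_experts

def p_expert_idle(num_tokens: int, top_k: int, total_experts: int) -> float:
    """Probability a given expert receives zero tokens (uniform routing)."""
    p_miss = comb(total_experts - 1, top_k) / comb(total_experts, top_k)
    return p_miss ** num_tokens

# Mixtral-like config: batch of 16 decode tokens, top-2 of 8 experts.
tpe = tokens_per_expert(16, 2, 8)   # ~4 tokens per expert GEMM: memory-bound
idle = p_expert_idle(16, 2, 8)      # 0.75**16 ≈ 1%: nearly all experts fire
```

With only ~4 tokens per expert GEMM and a ~1% chance any expert is idle, the tweet's "Batch 16 → ALL experts activate" reality checks out: the weights of every expert get loaded from HBM while each does very little compute.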
@gm8xx8
𝚐𝔪𝟾𝚡𝚡𝟾
1 year
Pie: Pooling CPU Memory for LLM Inference paper: https://t.co/PoSbHta0n3 Pie is an LLM inference framework that tackles the memory challenges of large models by enabling efficient GPU-CPU memory swapping and adaptive expansion. It optimizes memory usage without increasing
arxiv.org
The rapid growth of LLMs has revolutionized natural language processing and AI analysis, but their increasing size and memory demands present significant challenges. A common solution is to spill...
1
40
171
@mitrma
Michael Matthews
1 year
🍎 The core of Kinetix is our new 2D rigid body physics engine: Jax2D. This is a minimal rewrite of the classic Box2D engine made by @erin_catto. Jax2D allows us to run thousands of heterogeneous parallel environments on a single GPU (yes, you can vmap over different tasks!) 8/
4
4
40
@MillionInt
Jerry Tworek
1 year
ARR is the only meaningful AGI metric
6
6
69
@vllm_project
vLLM
1 year
Speculative decoding is one of the best tools in vLLM's inference-optimization toolbox, accelerating inference without accuracy loss. Check out our blog post for more details about the state of spec decode in vLLM today! 🧵 https://t.co/swMbYFX8Vl
vllm.ai
Speculative decoding in vLLM is a powerful technique that accelerates token generation by leveraging both small and large models in tandem. In this blog, we’ll
5
50
236
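The "no accuracy loss" property comes from the verification step: the large model checks the draft model's guesses in a single batched forward pass and only keeps tokens it would have produced itself. A minimal sketch of the greedy special case (vLLM's real implementation also handles stochastic sampling via rejection sampling and appends a bonus token; names here are illustrative, not vLLM's API):

```python
def verify_greedy(proposed, target_preds):
    """Greedy speculative-decoding verification.

    proposed:     tokens the small draft model guessed.
    target_preds: the large model's greedy token at each of those positions,
                  computed in ONE batched forward pass over the draft.
    Accept the longest matching prefix; on the first mismatch, emit the
    target's token instead and stop. The output is exactly what the target
    model alone would have generated greedily.
    """
    accepted = []
    for p, t in zip(proposed, target_preds):
        if p == t:
            accepted.append(p)   # draft guessed right: this token is "free"
        else:
            accepted.append(t)   # correction from the target model
            break
    return accepted
```

When the draft agrees often, several tokens come out of one large-model forward pass, which is where the speedup lives; when it disagrees, you still make one token of progress.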
@neurosp1ke
Andreas Köpf
1 year
If you are interested in the latest GPU MODE news (upcoming lectures, videos etc.) please follow our new official twitter/x account: @GPU_MODE
0
2
17
@simran_s_arora
Simran Arora
1 year
Want Llama 405B, but wish it scaled linearly in sequence length??? Enter LoLCATS: an efficient method for "turning Transformers to linear attention models", all on an academic budget!! We use LoLCATS to linearize the *full Llama 3.1 model family* for the first time – 20+ points
9
87
646
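"Scaling linearly in sequence length" refers to the kernel-feature-map form of attention: instead of materializing the n×n softmax matrix, you carry running sums of feature-mapped keys. A minimal sketch with a fixed elu(x)+1 feature map; LoLCATS' actual contribution is *learning* feature maps that mimic the original softmax attention, which this snippet does not do:

```python
import numpy as np

def feature_map(x):
    # Fixed positive feature map elu(x) + 1; LoLCATS learns this instead.
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention(q, k, v):
    """Causal linear attention in O(n) time with O(1) state per step.

    q, k: [n, d]; v: [n, dv]. We maintain S = sum_j phi(k_j) v_j^T and
    z = sum_j phi(k_j) over past positions, so each step is O(d * dv).
    """
    n, d = q.shape
    dv = v.shape[1]
    S = np.zeros((d, dv))
    z = np.zeros(d)
    out = np.zeros((n, dv))
    for i in range(n):
        fk = feature_map(k[i])
        S += np.outer(fk, v[i])      # accumulate key-value outer products
        z += fk                      # accumulate normalizer
        fq = feature_map(q[i])
        out[i] = (fq @ S) / (fq @ z + 1e-6)
    return out

rng = np.random.default_rng(0)
q, k, v = (rng.normal(size=(6, 4)) for _ in range(3))
out = linear_attention(q, k, v)
```

Note that position 0 can only attend to itself, so `out[0]` recovers `v[0]` exactly (up to the epsilon), which is a quick correctness check on the recurrence.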
@Ar_Douillard
Arthur Douillard
1 year
KV Prediction for Improved Time to First Token LLM inference can be split into two phases: prefilling and decoding. The decoding phase runs in autoregressive mode, where tokens are generated one by one, re-using the previous Key/Value tensors in the KV cache. To speed up that
3
44
205
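The two phases the tweet describes can be sketched in a few lines of numpy. This toy single-head model skips projections and causal masking in prefill for brevity, and uses the raw hidden states as stand-ins for the K/V projections; the point is only the cache mechanics (prefill populates it once, decode appends one row per step instead of recomputing everything):

```python
import numpy as np

def attn(q, K, V):
    """Single-head scaled-dot-product attention of q over cached K, V."""
    w = np.exp(q @ K.T / np.sqrt(K.shape[1]))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V

rng = np.random.default_rng(0)
d = 8
prompt = rng.normal(size=(5, d))       # 5 prompt-token "hidden states"

# Prefill: process the whole prompt at once and populate the KV cache.
K_cache, V_cache = prompt.copy(), prompt.copy()
_ = attn(prompt, K_cache, V_cache)

# Decode: one token at a time, appending to and re-using the cache, so each
# step attends over seq_len cached rows instead of recomputing all K/V.
for _ in range(3):
    x = rng.normal(size=(1, d))
    K_cache = np.vstack([K_cache, x])
    V_cache = np.vstack([V_cache, x])
    y = attn(x, K_cache, V_cache)
```

Time-to-first-token is dominated by the prefill step, which is why methods like the KV Prediction paper above target that phase specifically.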