Carson Poole
@CarsonPoole
934 Followers · 1K Following · 131 Media · 1K Statuses
this was lots of fun and lots of all-nighters over the past few weeks. really happy with what we achieved!
NVIDIA Blackwell can achieve 303 output tokens/s for DeepSeek R1 in FP4 precision, per our benchmarking of an Avian API endpoint.

Artificial Analysis benchmarked DeepSeek R1 on an @avian_io private API endpoint. Running DeepSeek R1 in FP4 precision on NVIDIA Blackwell, their
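the output tokens/s figure quoted above is just token count over wall-clock time. a minimal sketch of that measurement, with a plain iterable standing in for a streaming API response (this is not Avian's actual client, and the function name is made up for illustration):

```python
import time

def output_tokens_per_sec(token_stream):
    """Consume a stream of generated tokens and report throughput.

    `token_stream` is any iterable yielding tokens, standing in here
    for a streaming inference API response.
    """
    start = time.perf_counter()
    count = sum(1 for _ in token_stream)  # drain the stream, counting tokens
    elapsed = time.perf_counter() - start
    return count / elapsed if elapsed > 0 else float("inf")

tps = output_tokens_per_sec(iter(["Deep", "Seek", " R1", " says", " hi"]))
```

a real benchmark would also separate time-to-first-token from steady-state decode throughput, which this sketch ignores.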
someone pls make a wordle that has a daily leaderboard for who can make the highest-logprob sentence in some range of tokens
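the scoring for such a leaderboard is just a sum of per-token logprobs under a fixed model. a toy sketch, with an invented unigram table standing in for a real LM (all names and probabilities here are made up for illustration):

```python
import math

# Invented per-token log-probabilities; a real leaderboard would score
# with an actual LLM's logprobs over its own tokenizer's tokens.
toy_logprobs = {"the": math.log(0.05), "cat": math.log(0.02), "sat": math.log(0.01)}

def sentence_logprob(tokens, table, floor=math.log(1e-9)):
    """Score a token sequence: sum of per-token logprobs, with a floor
    for out-of-vocabulary tokens. Higher (closer to 0) tops the board."""
    return sum(table.get(t, floor) for t in tokens)

score = sentence_logprob(["the", "cat", "sat"], toy_logprobs)
```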
I have never understood why people need these tools. What is hard about (# billion params) * (2 for 16-bit, 1 for 8-bit, etc.) * (fudge factor for activations, KV cache) < (VRAM on your GPU in GB)?
I made an internal tool for myself to check the VRAM required to run models on GPUs. Open-sourcing it today! "do-i-have-the-vram" checks the amount of VRAM you need to load the model, without loading the model! use it by running `pip install do-i-have-the-vram`
the contrast is really striking between Google's release of Lion and everybody quietly switching to Muon
Hello World! Welcome to Drafted, an AI tool that lets anyone design a home from scratch, tailored to your life. https://t.co/zoa23fKUvV
techcrunch.com
Drafted is now nearly five months old, and it's everything Atmos wasn't.
a phenomenon I haven't seen anybody point out: what happens when you can "few-shot" a robot? with sufficient scale this ability emerged in LLMs. instead of training it to perform a specific task, can you show it 2-3 representative examples of itself doing said task?
We got our robots to wash pans, clean windows, make peanut butter sandwiches, and more! Fine-tuning our latest model enables all of these tasks, and this has interesting implications for robotics, Moravec's paradox, and the future of large models in embodied AI. More below!
the momentum is building
Nice, short post illustrating how simple text (discrete) diffusion can be. Diffusion (i.e. parallel, iterated denoising, top) is the pervasive generative paradigm in image/video, but autoregression (i.e. go left to right, bottom) is the dominant paradigm in text. For audio I've
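the parallel, iterated-denoising loop from the quoted post can indeed be shown in a few lines. a toy sketch where the "denoiser" trivially knows the answer — this illustrates only the sampling loop (reveal a few masked tokens per step, in parallel), not a real model:

```python
import random

MASK = "_"
target = list("hello world")  # toy "denoiser": it just knows the answer

def denoise_step(x, k):
    """Reveal up to k masked positions in parallel (a stand-in for a
    model predicting tokens at masked positions each diffusion step)."""
    masked = [i for i, t in enumerate(x) if t == MASK]
    for i in random.sample(masked, min(k, len(masked))):
        x[i] = target[i]
    return x

x = [MASK] * len(target)
for step in range(4):  # iterated denoising: a few tokens per step
    x = denoise_step(x, 3)
```

contrast with autoregression, which would fill the 11 positions strictly left to right, one per step.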
TIL the originator of the phrase "embarrassingly parallel" is Cleve Moler, the creator of MATLAB (sorry if that gives you painful flashbacks)
the affordability of a kWh for the median US household over time. feels like this needs a @CommunityNotes for being so egregiously misleading
in 2021 I emailed Philippe Tillet (creator of Triton) about adding 4bit datatypes, and he was (reasonably!) skeptical at the time. not a dunk - Philippe is obviously world-class; just a reminder to update your mental models while you update your language models :)
NVFP4: 4-bit pretraining for LLMs • New format w/ 2-level scaling + RHT + stochastic rounding • Trains 12B model on 10T tokens • Matches FP8 baseline: MMLU-pro 62.58% vs 62.62% • 6.8× efficiency boost potential → faster, cheaper frontier LLMs
another one
Imagine you are the boss of Google DeepMind. To train the best diffusion language model in the world within 1 year, using 800 TPU pods, which model size will you go for? We built Quokka to help you decide — the first-ever large-scale scaling law for DLMs. Interesting facts: 1.
took this on my flight last week lmao
$2.2 billion solar plant in California turned off after years of wasted money: "Never lived up to its promises" https://t.co/TuRZYvDyjX
1 matmul → tenth grade math class
100 matmuls → you've solved a system of equations
100,000 matmuls → you overfit a linear regression
1 million matmuls → your MacBook's M4 sounds like a jet engine
1 quintillion matmuls → you have summoned god from silicon
the way people are now saying, "I was asking Chat," or "just ask Chat how to do it" is a phenomenon I haven't seen since the verbifying of Google