
xjdr (@_xjdr)
Followers: 23K · Following: 26K · Media: 741 · Statuses: 6K
I don't think people understand how funding works and how crippling a seed / series A valuation of $500M is. I'm wishing them the best of luck, but they are now doing a very hard thing on very hard mode.
$56M seed for Stripe's former CTO of 7 yrs sounds about right. Will 100% raise $100M in 6 months at a $1B+ val - I don’t make the rules
63
65
3K
this is the first potential "stop what i am doing and investigate everything about this" thing i've seen in a while.
Everything you love about generative models — now powered by real physics! Announcing the Genesis project — after a 24-month large-scale research collaboration involving over 20 research labs — a generative physics engine able to generate 4D dynamical worlds powered by a physics
28
40
980
This has been and will continue to be my recommendation for anyone in this position. Learn jax and sign up for it; it's one of the best things Google has ever done. You can do meaningful research for free, but the learning curve is steep. Strap in.
To get money, you need a job in AI. To get a job in AI, you need to understand Cuda, cloud computing, distributed systems, Pytorch/Jax, and Triton. To learn Cuda, cloud computing, distributed systems, Pytorch/Jax, and Triton, you need money. Where is the on-ramp here?
15
49
917
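As a rough illustration of what that jax learning curve is about (my sketch, not something from the original thread): the core of the library is a small set of composable transforms like grad, jit, and vmap over plain functions.

```python
# Minimal jax sketch (illustrative): differentiate and compile a tiny loss.
import jax
import jax.numpy as jnp

def loss(params, x, y):
    # simple linear model: params is (w, b)
    w, b = params
    pred = x @ w + b
    return jnp.mean((pred - y) ** 2)

# grad differentiates, jit compiles via XLA (same code runs on CPU/GPU/TPU)
grad_fn = jax.jit(jax.grad(loss))

key = jax.random.PRNGKey(0)
x = jax.random.normal(key, (32, 4))
y = jnp.ones((32,))
params = (jnp.zeros((4,)), 0.0)

grads = grad_fn(params, x, y)
print(jax.tree_util.tree_map(jnp.shape, grads))  # gradient shapes mirror params
```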
I was on a hiring committee at google and it was pretty easy to just ask "tell me in detail how something you worked on and were proud of works".
People who complain about leetcode questions during the interview process need to put themselves in the shoes of the company. You have 1000 resumes to get through, at least 50% of whom can’t actually code. How would you filter through this stack in a reasonable amount of time?
32
19
907
If they didn't at least pick up R1 and V3 and make it better with their fucking rockstar team and 150,000 GPUs, everyone should be fired on the spot and I will hear no more about it.
Elon Musk says Grok 3 will be released in "a week or two" and it is "scary smart", displaying reasoning skills that outperform any other AI model that has been released
16
10
774
whalebros cooked here. Not only does it seem to replicate the o1-preview results, it seems to pretty effectively replicate (at least parts of) the process. My guess is it uses something very similar to the Let's Verify Step by Step ORMs / PRMs to train and reward the CoT.
🚀 DeepSeek-R1-Lite-Preview is now live: unleashing supercharged reasoning power! 🔍 o1-preview-level performance on AIME & MATH benchmarks. 💡 Transparent thought process in real-time. 🛠️ Open-source models & API coming soon! 🌐 Try it now at #DeepSeek
16
25
769
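To unpack the ORM/PRM guess above: a process reward model scores each chain-of-thought step, and the per-step scores get aggregated into a reward. The sketch below is a hedged illustration of that idea only; the step_reward callable and the product aggregation are assumptions, not anything DeepSeek has published.

```python
# Hypothetical sketch of a process-reward-model (PRM) style scorer, in the
# spirit of "Let's Verify Step by Step"; not DeepSeek's training code.
from typing import Callable, List

def score_chain_of_thought(
    steps: List[str],
    step_reward: Callable[[str, List[str]], float],
) -> float:
    """Score a chain of thought step by step.

    step_reward(step, context) is assumed to be a learned PRM returning the
    probability that the step is correct given the steps before it.
    A common aggregation is the product (or min) of per-step scores.
    """
    total = 1.0
    context: List[str] = []
    for step in steps:
        total *= step_reward(step, context)
        context.append(step)
    return total

# Usage with a stand-in reward (a real PRM would be a trained model):
dummy_prm = lambda step, ctx: 0.95 if "therefore" in step.lower() else 0.9
cot = ["Let x be the number of apples.", "x + 3 = 7", "Therefore x = 4."]
print(score_chain_of_thought(cot, dummy_prm))
```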
This is how i've been doing my cuda / ptx work for the last few weeks and i can both attest to R1 being particularly cracked at it AND that if you actually run a benchmark / compiler in the loop it does much better than you could possibly imagine. is this fast takeoff? almost.
uh it might be over. they put r1 in a loop for 15 minutes and it generated: "better than the optimized kernels developed by skilled engineers in some cases"
17
34
762
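A hedged sketch of the "benchmark / compiler in the loop" workflow described above: generate a candidate kernel, compile it, run it, and feed the compiler errors or timings back into the next prompt. The generate_kernel callable stands in for a call to R1 (or any model API) and is hypothetical; nvcc is just invoked as a subprocess.

```python
# Hypothetical compiler/benchmark-in-the-loop refinement of a CUDA kernel.
import subprocess, tempfile, pathlib

def compile_and_time(cuda_src: str) -> tuple[bool, str]:
    """Compile a candidate kernel with nvcc and run its built-in benchmark.
    Returns (ok, feedback) where feedback is compiler errors or timing output."""
    with tempfile.TemporaryDirectory() as tmp:
        src = pathlib.Path(tmp) / "kernel.cu"
        binary = pathlib.Path(tmp) / "kernel"
        src.write_text(cuda_src)
        build = subprocess.run(["nvcc", "-O3", str(src), "-o", str(binary)],
                               capture_output=True, text=True)
        if build.returncode != 0:
            return False, build.stderr          # feed errors back to the model
        run = subprocess.run([str(binary)], capture_output=True, text=True)
        return run.returncode == 0, run.stdout  # feed timings back to the model

def refine(generate_kernel, prompt: str, rounds: int = 10) -> str:
    """Loop the model against the compiler/benchmark; keep the latest working kernel."""
    feedback, best = "", ""
    for _ in range(rounds):
        candidate = generate_kernel(prompt, feedback)  # model call (assumed API)
        ok, feedback = compile_and_time(candidate)
        if ok:
            best = candidate
    return best
```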
Llamas. Tokenizer free?! USING ENTROPY STEERING?!?!! sometimes the universe conspires to make a paper just for you and it feels wonderful when it happens.
🚀 Introducing the Byte Latent Transformer (BLT) – An LLM architecture that scales better than Llama 3 using byte-patches instead of tokens 🤯 Paper 📄 Code 🛠️
12
38
718
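For intuition, here is a minimal sketch of entropy-driven byte patching in the spirit of BLT (my reconstruction, not the paper's code): a small byte-level model supplies next-byte probabilities, and a new patch starts wherever that model's entropy spikes. The next_byte_probs callable and the threshold are assumptions for illustration.

```python
# Entropy-based byte patching sketch (illustrative, not the BLT implementation).
import math
from typing import Callable, List, Sequence

def entropy(probs: Sequence[float]) -> float:
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def patch_bytes(
    data: bytes,
    next_byte_probs: Callable[[bytes], Sequence[float]],
    threshold: float = 3.0,
) -> List[bytes]:
    """Group bytes into patches; boundaries fall where the model is uncertain."""
    patches, start = [], 0
    for i in range(1, len(data)):
        h = entropy(next_byte_probs(data[:i]))  # uncertainty about byte i
        if h > threshold:
            patches.append(data[start:i])
            start = i
    patches.append(data[start:])
    return patches

# Usage with a uniform stand-in model (a real system would use a learned one):
uniform = lambda prefix: [1.0 / 256] * 256
print(patch_bytes(b"hello world", uniform, threshold=6.0))
```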
This is potentially a very significant discovery for a lot of reasons. For now, it's safe to say that entropy-based sampling and training techniques are shaping up to be unreasonably effective at combating entropy collapse and hallucinations in current models.
a few days ago @_xjdr and i discovered that each llm (the ones we tested at least) has a unique, stable entropy/varentropy characteristic which is reproducible from *entirely random* hidden state prompts.
24
64
706
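The entropy/varentropy characteristic mentioned above can be computed directly from a model's next-token logits. The sketch below is illustrative and is not the entropix source.

```python
# Entropy and varentropy of a next-token distribution, from raw logits.
import numpy as np

def entropy_varentropy(logits: np.ndarray) -> tuple[float, float]:
    """Return (entropy, varentropy) of the softmax distribution over logits.

    Varentropy is the variance of the surprisal -log p under that same
    distribution; together the pair characterizes how uncertain and how
    'spiky' the model's prediction is.
    """
    logits = logits - logits.max()                   # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    surprisal = -np.log(np.clip(probs, 1e-12, 1.0))
    h = float((probs * surprisal).sum())             # entropy
    v = float((probs * (surprisal - h) ** 2).sum())  # varentropy
    return h, v

# Example on a made-up logit vector:
print(entropy_varentropy(np.array([2.0, 1.0, 0.5, -1.0])))
```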
watching @yacineMTB discover terence tao, zig and nvim out loud in real time is like watching a baby deer learn to walk. nature really is beautiful.
17
5
626
It would take a long-ass article to articulate this properly but this is not a vaguepost. I have spent the last few months working on some very hard problems (more on that soon). I've been using a combination of R1 and DeepResearch to build and formalize the ideas and proofs.
as much as i am paying attention to AI each and every day, the future snuck up on me last night and this is Day 0 of a brand new world. i can confidently say that now.
33
22
628
Sorry for the sorry state of the entropix repo, i unexpectedly had to be heads down on some last min lab closure mop up work and was AFK. Now that i have some compute again (HUGE shout outs to @0xishand, @Yuchenj_UW and @evanjconrad) we're in the amazing position that we need.
32
27
605
if scale were really all you needed, amazon and microsoft wouldn't need to use other people's models and google would be winning in every way.
Claude will help power Amazon's next-generation AI assistant, Alexa+. Amazon and Anthropic have worked closely together over the past year, with @mikeyk leading a team that helped Amazon get the full benefits of Claude's capabilities.
28
19
568
to double down on this, the specific original goal was to see what we could accomplish with a vanilla OSS model without touching the weights or the architecture at all. This is a series of inference-time compute experiments that essentially use the model outputs as read-only.
@_xjdr is measuring the total variation in all the token choices per individual prediction and using that as a heuristic. you can actually visualize this measurement. this is not an architectural tweak, this is doing fancy state modification based off that
14
18
529
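A hypothetical sketch of what "inference-time compute with the model as read-only" can look like: the weights are never touched, only the sampling policy changes based on the uncertainty of the current prediction. The thresholds and fallback behaviors below are made up for illustration and are not the entropix heuristics.

```python
# Uncertainty-aware sampling sketch: the model is read-only, only the
# decision of *how* to pick the next token changes.
import numpy as np

def entropy_of(probs: np.ndarray) -> float:
    return float(-(probs * np.log(np.clip(probs, 1e-12, 1.0))).sum())

def adaptive_sample(logits: np.ndarray, rng: np.random.Generator,
                    low: float = 0.5, high: float = 3.0) -> int:
    """Pick a next token based on how uncertain the current prediction is."""
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    h = entropy_of(probs)
    if h < low:
        return int(probs.argmax())               # confident: take the argmax
    if h > high:
        # very uncertain: a real system might branch or inject a "think" step;
        # here we just flatten the distribution (higher temperature) and sample
        hot = probs ** 0.5
        hot /= hot.sum()
        return int(rng.choice(len(hot), p=hot))
    return int(rng.choice(len(probs), p=probs))  # ordinary sampling otherwise

rng = np.random.default_rng(0)
print(adaptive_sample(np.array([4.0, 1.0, 0.5, 0.1]), rng))
```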
One of Jeff Dean's superpowers is being able to come up with reasonable approximations for very complex problems quickly. He also has the "latency numbers every engineer should know" that helped him reason about MapReduce, search indexes, etc. for this reason as well. Incredibly
It's really great to see the impact that TPUs have had and continue to have on Google's ability to do machine learning training and inference at scale, and to provide that same capability to @googlecloud customers via Cloud TPUs. Here's a bit of backstory on how they came to be.
5
33
516
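To make that back-of-envelope style concrete, here is a small worked example using the classic "latency numbers" figures (circa-2010 values from Jeff Dean's talks); the 10 GB index workload below is my own made-up example, not one from the original tweet.

```python
# Back-of-envelope latency arithmetic with the classic figures.
NS = 1e-9
latency = {
    "main_memory_ref":        100 * NS,
    "read_1mb_from_memory":   250_000 * NS,
    "read_1mb_from_ssd":    1_000_000 * NS,
    "disk_seek":           10_000_000 * NS,
    "dc_round_trip":          500_000 * NS,
}

# Example estimate: scan a 10 GB index from local memory vs. pull it from a
# remote node's SSD over the datacenter network.
scan_memory = 10_000 * latency["read_1mb_from_memory"]                  # 10,000 MB
fetch_remote = latency["dc_round_trip"] + 10_000 * latency["read_1mb_from_ssd"]
print(f"scan from local memory : {scan_memory:.2f} s")   # ~2.5 s
print(f"fetch from remote SSD  : {fetch_remote:.2f} s")  # ~10 s
```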
i think it's worth taking a moment to put into perspective how cool this work is. GPT-2 is really what the entire OpenAI empire was built on / was deemed too dangerous to release a few short years ago, and it is now reproducible in less than 8 min on a single (large) machine.
New NanoGPT training speed record: 3.28 FineWeb val loss in 7.23 minutes on 8xH100. Previous record: 7.8 minutes. Changelog: added U-net-like connectivity pattern; doubled learning rate. This record is by @brendanh0gan
13
34
514
hahahaha what?!?! "The test cluster comprised 25 storage nodes (2 NUMA domains/node, 1 storage service/NUMA, 2×400Gbps NICs/node) and 50 compute nodes (2 NUMA domains, 192 physical cores, 2.2 TiB RAM, and 1×200 Gbps NIC/node). Sorting 110.5 TiB of data across 8,192 partitions.
🚀 Day 5 of #OpenSourceWeek: 3FS, Thruster for All DeepSeek Data Access. Fire-Flyer File System (3FS) - a parallel file system that utilizes the full bandwidth of modern SSDs and RDMA networks. ⚡ 6.6 TiB/s aggregate read throughput in a 180-node cluster. ⚡ 3.66 TiB/min.
12
15
494
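Quick sanity arithmetic on the quoted throughput, assuming (my assumption, not stated in the thread) that the 180-node read cluster's nodes look like the 2×400 Gbps storage nodes from the sort test:

```python
# Per-node throughput implied by "6.6 TiB/s aggregate read across 180 nodes".
TiB = 2**40
aggregate = 6.6 * TiB            # bytes per second across the cluster
per_node = aggregate / 180       # bytes per second per node
per_node_gbps = per_node * 8 / 1e9
print(f"per-node read throughput: {per_node / 2**30:.1f} GiB/s "
      f"(~{per_node_gbps:.0f} Gbps, vs 800 Gbps of NIC per storage node)")
```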