Henry Ko
@henryHM_ko
Followers 874 · Following 225 · Media 4 · Statuses 54
performance and efficiency in ML | CS @ UC Berkeley, @BerkeleyML
Joined June 2024
I had a lot of fun writing a new blog: "Optimizing NSA for TPUs"
There's an accompanying Colab notebook with it too! I hope this helps people tinker with NSA in JAX + TPU kernels with Pallas
5 · 44 · 452
I spent the last ~2 weeks recreating the @thinkymachines "LoRA Without Regret" experiments from scratch.
SFT: Qwen3-4B on the No Robots dataset
RL: Qwen3-1.7B on the MATH dataset
so cool to see rank-1 LoRA matching the performance of full finetuning! 🤯
13 · 39 · 434
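For readers who want to poke at the idea behind the experiment above: LoRA keeps the pretrained weight frozen and learns a low-rank update, W + B @ A, so a rank-1 adapter trains only a tiny sliver of the parameters. This is a minimal NumPy sketch of that mechanism (not the author's code; all shapes and names here are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

d_in, d_out, rank = 16, 16, 1  # rank-1 LoRA, as in the experiments above

# Frozen pretrained weight (stand-in for a transformer projection matrix).
W = rng.normal(size=(d_in, d_out))

# LoRA factors: A gets a small random init, B starts at zero, so at step 0
# the adapted model computes exactly the same function as the base model.
A = rng.normal(scale=0.01, size=(rank, d_out))
B = np.zeros((d_in, rank))

def adapted_forward(x, W, A, B, scale=1.0):
    """Forward pass with the low-rank update folded into the frozen weight."""
    return x @ (W + scale * (B @ A))

x = rng.normal(size=(4, d_in))
# Sanity check: with B = 0, LoRA output equals the frozen model's output.
assert np.allclose(adapted_forward(x, W, A, B), x @ W)

# Trainable-parameter comparison: rank * (d_in + d_out) vs d_in * d_out.
lora_params = rank * (d_in + d_out)
full_params = d_in * d_out
print(lora_params, full_params)  # 32 vs 256 at this toy size
```

During finetuning only A and B receive gradients; at rank 1 that is 32 trainable values here versus 256 for full finetuning, which is the asymmetry the tweet's result makes surprising.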
Excited to see Gated DeltaNet being adopted in the @Alibaba_Qwen series! It has also previously demonstrated strong effectiveness in @nvidia's Jet-Nemotron.
🚀 Introducing Qwen3-Next-80B-A3B — the FUTURE of efficient LLMs is here!
🔹 80B params, but only 3B activated per token → 10x cheaper training, 10x faster inference than Qwen3-32B (esp. at 32K+ context!)
🔹 Hybrid Architecture: Gated DeltaNet + Gated Attention → best of speed &
9 · 53 · 551
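The "80B params, 3B activated" claim comes from mixture-of-experts routing: each token is sent to only its top-k experts, so only a small fraction of the expert weights run per token. A toy NumPy router, with invented sizes (this is not Qwen's architecture, just the sparse-activation idea):

```python
import numpy as np

rng = np.random.default_rng(0)
n_tokens, d_model, n_experts, k = 6, 8, 8, 1

x = rng.normal(size=(n_tokens, d_model))
router_w = rng.normal(size=(d_model, n_experts))
# One weight matrix per expert (a stand-in for each expert's FFN).
experts = rng.normal(size=(n_experts, d_model, d_model))

def moe_forward(x, router_w, experts, k=1):
    """Each token runs only its top-k experts (sparse activation)."""
    logits = x @ router_w                            # (tokens, n_experts)
    topk = np.argsort(logits, axis=-1)[:, -k:]       # chosen expert ids per token
    sel = np.take_along_axis(logits, topk, axis=-1)  # logits of chosen experts
    w = np.exp(sel - sel.max(-1, keepdims=True))
    w = w / w.sum(-1, keepdims=True)                 # softmax over chosen experts
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        for j, e in enumerate(topk[t]):
            out[t] += w[t, j] * (x[t] @ experts[e])  # only these experts run
    return out, topk

out, topk = moe_forward(x, router_w, experts, k=k)
# Fraction of expert parameters touched per token:
active_fraction = k / n_experts
print(active_fraction)  # 0.125 here -- the "3B active of 80B" idea, at toy scale
```

Total parameter count grows with n_experts, but per-token compute grows only with k, which is where the claimed training and inference savings come from.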
i remade tiny-tpu to support both inference and training! we successfully tested our architecture on the classic XOR problem. here's what i learned throughout the process:👇
as the unemployed friend on a monday afternoon, i spent the past two months building a TPU without any prerequisite knowledge in digital logic, ASIC design, or verilog. here are my coolest unlocks so far:
38 · 55 · 324
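XOR is the classic sanity check for training hardware because it needs a hidden layer: no linear model can fit it. As a software reference point (this is plain NumPy, not the tiny-tpu Verilog), a minimal from-scratch MLP that learns the same problem the architecture above was tested on:

```python
import numpy as np

rng = np.random.default_rng(1)
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = np.array([[0.], [1.], [1.], [0.]])  # XOR targets

# Tiny 2 -> 8 -> 1 MLP; sizes are arbitrary illustration choices.
W1 = rng.normal(size=(2, 8)); b1 = np.zeros(8)
W2 = rng.normal(size=(8, 1)); b2 = np.zeros(1)

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def forward(X):
    h = np.tanh(X @ W1 + b1)
    return h, sigmoid(h @ W2 + b2)

def bce(p, y):
    return -(y * np.log(p) + (1 - y) * np.log(1 - p)).mean()

_, p0 = forward(X)
loss_start = bce(p0, y)

lr = 1.0
for _ in range(5000):
    h, p = forward(X)
    dlogit = (p - y) / len(X)            # grad of BCE w.r.t. output logits
    dW2 = h.T @ dlogit; db2 = dlogit.sum(0)
    dh = (dlogit @ W2.T) * (1 - h ** 2)  # backprop through tanh
    dW1 = X.T @ dh; db1 = dh.sum(0)
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2

_, p = forward(X)
loss_end = bce(p, y)
print(loss_start, "->", loss_end)
```

The forward pass (matmul, activation) and the backward pass (the transposed matmuls computing dW1, dW2) are exactly the operations a training-capable TPU has to support in hardware.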
Today we're putting out an update to the JAX TPU book, this time on GPUs. How do GPUs work, especially compared to TPUs? How are they networked? And how does this affect LLM training? 1/n
38 · 525 · 4K
Do you want to train massive deep learning models with ease? The 10 new tutorial notebooks in our popular UvA DL course show you how, implementing data, pipeline, and tensor parallelism (and more) from scratch in JAX+Flax! 🚀🚀 Check them out here: https://t.co/7AtaB7j8l9 🧵 1/11
10 · 122 · 617
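The simplest of the three parallelism schemes those notebooks cover is data parallelism: shard the batch across devices, compute per-shard gradients, then all-reduce them. A toy NumPy sketch of that invariant (the notebooks do this with JAX+Flax collectives; names and the linear model here are invented for illustration):

```python
import numpy as np

def loss_grad(w, x, y):
    """Mean squared error of a linear model and its analytic gradient."""
    err = x @ w - y
    return (err ** 2).mean(), 2 * x.T @ err / len(x)

rng = np.random.default_rng(0)
w = np.zeros(3)
X = rng.normal(size=(8, 3))
Y = X @ np.array([1.0, -2.0, 0.5])  # targets from a known linear map

# "Devices": split the batch into equal shards.
n_devices = 4
xs = np.split(X, n_devices)
ys = np.split(Y, n_devices)

# Each device computes a gradient on its own shard...
grads = [loss_grad(w, xi, yi)[1] for xi, yi in zip(xs, ys)]
# ...then an all-reduce averages them (what jax.lax.pmean does on real hardware).
g_sync = np.mean(grads, axis=0)

# With equal shard sizes, this equals the full-batch gradient exactly.
_, g_full = loss_grad(w, X, Y)
assert np.allclose(g_sync, g_full)
```

The equality holds because the mean over shards of per-shard mean gradients equals the full-batch mean when shards are equal-sized; that is what lets each device update identical weight copies in lockstep.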
Introducing our latest technical report: Context Rot: How Increasing Input Tokens Impacts LLM Performance
Our results reveal that models do not use their context uniformly. Full report in replies.
39 · 91 · 898
Context windows are huge now (1M+ tokens) but context depth remains limited. Attention can only resolve one link at a time. Our tiny 5-layer model beats GPT-4.5 on a task requiring deep recursion. How? It learned to divide & conquer. Why this matters 🧵
5 · 7 · 56
looking for my next thing! thinking about dropping out. would love to learn more about opportunities within hardware acceleration or interpretability. dms are open. happy to chat. would love to hear what makes you excited!
25 · 22 · 138
Recordings:
@oswaldjoh and @ninoscherrer will present MesaNet at the ASAP seminar on Tuesday, June 24 at 2 PM ET! MesaNet is a locally optimal test-time training (TTT) layer that optimizes the key-value reconstruction objective over the entire history. If you're into TTT, don't miss it!
1 · 13 · 65
NVIDIA Tensor Core Evolution From Volta To Blackwell
Amdahl's Law, Strong Scaling, Asynchronous Execution, Blackwell, Hopper, Ampere, Turing, Volta, TMA
https://t.co/qZRKz2VdKw (newsletter.semianalysis.com)
4 · 30 · 241
Great deep dive into TPUs with amazing visuals by our very own @henryHM_ko!
I wrote a new blog on TPUs -- it's been fun seeing how different they are from GPUs and also drawing things on excalidraw again✏️ https://t.co/kEZXbB8vmX
0 · 3 · 22
I wrote a new blog on TPUs -- it's been fun seeing how different they are from GPUs and also drawing things on excalidraw again✏️ https://t.co/kEZXbB8vmX
37 · 190 · 1K
ksim is a JAX-based framework that makes your wackiest RL ideas simple to implement. Why use it? It's modular. Trying new architectures, updating rollout logic, and reformulating your objective is as easy as overriding a method. https://t.co/5lqlYTXv1a
github.com
RL training library for humanoid locomotion and manipulation. Built on top of MuJoCo and JAX. - kscalelabs/ksim
In the last month, we’ve been building an open-source framework for robot learning and sim-to-real transfer, made for RL whole-body control from simple walking to complex human imitation Check out the details on HN: https://t.co/QWAVQloqPs Get started in 5 minutes ⬇️
0 · 1 · 23
(1/5) I’m pleased to share that my research with @seowondeog12052 has been accepted to RECOMB 2025 (Poster) and IEEE EMBC 2025 (Paper)! Preprint: https://t.co/sW1AsKkZDL We introduce a generative approach to pesticide design—optimizing small molecules to reduce toxicity.
arxiv.org
Global climate change has reduced crop resilience and pesticide efficacy, making reliance on synthetic pesticides inevitable, even though their widespread use poses significant health and...
2 · 3 · 8
per the event description: Viren Jain is a Senior Staff Research Scientist at Google in Mountain View, California, where he leads Google's Connectomics team, responsible for tools like SegCLR and TensorStore. https://t.co/KCNeLkI63a
luma.com
0 · 2 · 11
Google's TPUv7 is out! ML accelerator marketing material is usually pretty inscrutable (what numbers are even comparable?), so here I'll explain concretely how this compares with Nvidia. 🧵
18 · 143 · 2K
excited to share what I’ve been working on @trychroma! we introduce representative generative benchmarking - custom eval sets built from your own data link to technical report in replies
17 · 28 · 303