
Sukjun (June) Hwang
@sukjun_hwang
Followers: 3K
Following: 539
Media: 14
Statuses: 80
ML PhD student @mldcmu advised by @_albertgu
Pittsburgh, PA
Joined April 2023
Tokenization has been the final barrier to truly end-to-end language models. We developed the H-Net: a hierarchical network that replaces tokenization with a dynamic chunking process directly inside the model, automatically discovering and operating over meaningful units of data
98
744
5K
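As a rough mental model of what "dynamic chunking inside the model" means (an illustrative sketch only, not the released H-Net code; the module names and the GRU stand-in for the outer SSM stage are assumptions), the idea is to score a boundary probability at every byte position and pool each resulting span into one unit for a larger inner network:

```python
# Minimal sketch of the dynamic-chunking idea: a small byte-level encoder scores a
# boundary probability at every position, positions above a threshold close a chunk,
# and pooled chunk representations are handed to a larger "main" network.
import torch
import torch.nn as nn

class DynamicChunker(nn.Module):
    def __init__(self, d_model=256, vocab=256, threshold=0.5):
        super().__init__()
        self.embed = nn.Embedding(vocab, d_model)                   # raw bytes, no tokenizer
        self.encoder = nn.GRU(d_model, d_model, batch_first=True)   # stand-in for the outer SSM stage
        self.boundary_head = nn.Linear(d_model, 1)                  # per-position boundary score
        self.threshold = threshold

    def forward(self, byte_ids):
        h, _ = self.encoder(self.embed(byte_ids))                   # (B, T, d)
        p_boundary = torch.sigmoid(self.boundary_head(h)).squeeze(-1)  # (B, T)
        chunks = []
        for b in range(byte_ids.size(0)):
            start, seq = 0, []
            for t in range(byte_ids.size(1)):
                if p_boundary[b, t] > self.threshold or t == byte_ids.size(1) - 1:
                    seq.append(h[b, start : t + 1].mean(dim=0))     # pool one discovered chunk
                    start = t + 1
            chunks.append(torch.stack(seq))                         # variable number of chunks per sequence
        return chunks, p_boundary                                   # chunks would feed the inner/main model

x = torch.randint(0, 256, (1, 64))                                  # a batch of raw bytes
chunks, p = DynamicChunker()(x)
print(len(chunks[0]), "chunks discovered for 64 bytes")
```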
Coming from a computer vision background and now in sequence modeling, I’m often struck by how disconnected LLMs and vision feel. Our work, AUSM, treats video as language -- and it reveals a few blind spots we’ve overlooked.
We connect the autoregressive pipeline of LLMs with streaming video perception. Introducing AUSM: Autoregressive Universal Video Segmentation Model. A step toward unified, scalable video perception — inspired by how LLMs unified NLP. 📝
4
8
137
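A minimal sketch of the "video as language" analogy, with entirely illustrative module names and shapes (this is not the AUSM architecture): frames play the role of tokens, a carried state plays the role of the LM's context, and a segmentation mask is decoded at each step.

```python
# Toy streaming segmenter: frames are consumed one at a time, a recurrent state
# carries history, and per-frame masks are decoded, mirroring how an autoregressive
# LM consumes tokens causally.
import torch
import torch.nn as nn

class StreamingSegmenter(nn.Module):
    def __init__(self, d=64, num_classes=21):
        super().__init__()
        self.frame_enc = nn.Conv2d(3, d, kernel_size=3, padding=1)
        self.state_mix = nn.Conv2d(2 * d, d, kernel_size=1)    # fuse history with the new frame
        self.mask_head = nn.Conv2d(d, num_classes, kernel_size=1)

    def step(self, frame, state=None):
        feat = self.frame_enc(frame)                           # (B, d, H, W)
        if state is None:
            state = torch.zeros_like(feat)
        state = self.state_mix(torch.cat([state, feat], dim=1))  # updated memory
        return self.mask_head(state), state                    # per-frame masks + carried state

model = StreamingSegmenter()
state = None
video = torch.randn(8, 3, 64, 64)                              # 8 frames, processed causally
for frame in video:
    masks, state = model.step(frame.unsqueeze(0), state)
print(masks.shape)                                             # torch.Size([1, 21, 64, 64])
```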
1/ Pretraining is hitting a data wall; scaling raw web data alone leads to diminishing returns. Today @datologyai shares BeyondWeb, our synthetic data approach & all the learnings from scaling it to trillions of tokens 🧑🏼‍🍳
- 3B LLMs beat 8B models 🚀
- Pareto frontier for performance
23
125
705
Self-Questioning Language Models: LLMs that learn to generate their own questions and answers via asymmetric self-play RL. There is no external training data – the only input is a single prompt specifying the topic.
25
184
1K
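The asymmetric self-play setup described above is easy to picture as a loop. Below is a toy, self-contained sketch of that loop; the proposer_generate/solver_answer stubs and the reward shaping are illustrative assumptions rather than the paper's exact recipe. The proposer invents questions from the topic prompt alone, the solver samples several answers, majority agreement stands in for correctness, and the proposer is rewarded for questions that are neither trivially easy nor impossible.

```python
# Toy asymmetric self-play loop: no external training data, only a topic prompt.
import random
from collections import Counter

TOPIC_PROMPT = "Pose a short arithmetic word problem."

def proposer_generate(prompt):
    # stub "proposer": in the real setting this is the LM sampling a question
    a, b = random.randint(1, 9), random.randint(1, 9)
    return f"What is {a} + {b}?"

def solver_answer(question):
    # stub "solver": sometimes right, sometimes off by one
    nums = [int(t.strip("?")) for t in question.split() if t.strip("?").isdigit()]
    return sum(nums) + random.choice([0, 0, 0, 1])

for step in range(5):
    question = proposer_generate(TOPIC_PROMPT)
    answers = [solver_answer(question) for _ in range(5)]        # several solver rollouts
    majority, _ = Counter(answers).most_common(1)[0]
    solver_rewards = [float(a == majority) for a in answers]     # majority vote as proxy correctness
    success = sum(solver_rewards) / len(solver_rewards)
    proposer_reward = 1.0 if 0.0 < success < 1.0 else 0.0        # reward learnably-hard questions
    print(step, question, answers, round(success, 2), proposer_reward)
    # in the real method these rewards drive policy-gradient updates of both roles
```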
🚨 The era of infinite internet data is ending, so we ask: 👉 What’s the right generative modelling objective when data—not compute—is the bottleneck?
TL;DR:
▶️ Compute-constrained? Train autoregressive models
▶️ Data-constrained? Train diffusion models
Get ready for 🤿 1/n
128
192
1K
I’ll be giving the first H-Net talk this afternoon at 4:30-5 PT at the ES-FoMo workshop! Come support the fight against Big Token 🙏
Looking forward to seeing everyone for ES-FoMo part three tomorrow! We'll be in East Exhibition Hall A (the big one), and we've got an exciting schedule of invited talks, orals, and posters planned for you tomorrow. Let's meet some of our great speakers! 1/
4
11
140
1/ So much of privacy research is designing post-hoc methods to make models memorization-free. It’s time we turn that around with architectural changes. Excited to add Memorization Sinks to the transformer architecture at #ICML2025 to isolate memorization during LLM training 🧵
1
23
59
Just realized we forgot to link the code, so check it out! Model checkpoints are included, so you can play with it yourself and see what boundaries it's learning. Code: https://t.co/BtQaU383xJ Paper: https://t.co/AVW1Rtzpqw 12/10
2
15
96
Albert has written amazing blog posts full of behind-the-scenes stories and wonderful insights about H-Net. You should check them out! https://t.co/NL9Eus1YBa
This was an incredibly important project to me - I’ve wanted to solve it for years, but had no idea how. This was all @sukjun_hwang and @fluorane's amazing work! I wrote about the story of its development, and what might be coming next. The H-Net:
5
6
106
We’re incredibly excited to see how H-Nets will allow models to learn more efficiently, with fewer priors and less pre-processing, across all sorts of modalities! This work was a collaboration with @cartesia_ai 10/10
7
4
152
Finally, a key ingredient of H-Net is using state space models (SSMs) such as Mamba layers in the outer stages. SSMs naturally compress data into their recurrent states, which is not only more efficient, but turns out to be crucial for building higher-level abstractions. 9/
1
7
117
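A quick way to see why the recurrent state matters here: it is a fixed-size summary of everything seen so far, so the cost of carrying byte-level context does not grow with sequence length. The toy linear recurrence below is not a real Mamba layer (which uses selective, input-dependent parameters), just the simplest stand-in that shows the compression.

```python
# Toy linear state-space recurrence: however long the byte stream, everything seen
# so far is summarized into a fixed-size recurrent state h.
import torch

d_state, d_in = 16, 8
A = 0.9 * torch.eye(d_state)            # state transition (decay)
B = torch.randn(d_state, d_in) * 0.1    # input projection
C = torch.randn(d_in, d_state) * 0.1    # readout

h = torch.zeros(d_state)                # constant-size memory, independent of sequence length
stream = torch.randn(10_000, d_in)      # an arbitrarily long input stream
for x_t in stream:
    h = A @ h + B @ x_t                 # h_t = A h_{t-1} + B x_t
    y_t = C @ h                         # y_t = C h_t
print(h.shape)                          # torch.Size([16]) no matter how long the stream was
```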
DNA is an unusual “language”, and previous architectures showed different modeling power on DNA sequences (e.g., Mamba > Transformer). But any of them can be wrapped inside an H-Net for much stronger scaling, learning nearly 4 times as efficiently from data! 8/
2
11
149
On languages that lack easy segmentation cues like English’s whitespace, H-Net’s advantage over token-based baselines grows even further. While code is compressible enough that heuristic tokenizers also perform very well on it, languages like Chinese are more challenging, and there H-Net shows its strongest results. 7/
1
4
124
Because it operates over finer-grained bytes instead of pre-defined tokens, H-Net is dramatically more robust to textual perturbations. This all comes for *free*, without needing to introduce adversarial training or modify the data mix at all. 6/
1
4
145
When completely matched for both data (bytes/batch) and compute (FLOPs/byte), H-Net outperforms the tokenized Transformer as well as all byte-level baselines. A 2-stage H-Net matches the downstream evaluations of a Transformer twice its size! 5/
2
6
156
By using a data-dependent chunking strategy, a 1-stage H-Net learns a compression ratio similar to BPE tokenization, but already scales better! Iterating the hierarchy to 2 stages allows it to operate over even higher levels of abstraction, learning even faster from data. 4/
1
8
198
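For reference, "compression ratio" here just means average bytes per chunk, directly comparable to bytes per token under BPE. A tiny illustration, with made-up boundary positions and token counts purely for the demo:

```python
# Compare bytes-per-chunk implied by learned boundaries against bytes-per-token under BPE.
text = "hierarchical networks can learn their own chunk boundaries".encode("utf-8")

boundary_positions = [12, 21, 25, 31, 37, 41, 47, 58]       # hypothetical learned boundaries
learned_ratio = len(text) / len(boundary_positions)          # bytes per learned chunk

bpe_tokens = 9                                               # hypothetical BPE token count for the same text
bpe_ratio = len(text) / bpe_tokens                           # bytes per BPE token

print(f"{learned_ratio:.2f} bytes/chunk vs {bpe_ratio:.2f} bytes/token")
```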
H-Net introduces several technical components, including a similarity-score routing module and an EMA-based smoothing module, that allow discrete chunk boundaries to be learned stably. And because it’s fully end-to-end, H-Net can be *recursively iterated* to more stages of hierarchy! 3/
5
40
416
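To make the two named components more concrete, here is a rough sketch of their flavor, with the caveat that the paper's actual routing and smoothing modules differ in the details: boundaries are scored by how dissimilar adjacent positions look under learned projections, and an EMA-style blend keeps the discrete chunking decisions smooth enough to train.

```python
# Illustrative similarity-score boundaries plus EMA-style smoothing (not the paper's exact modules).
import torch
import torch.nn.functional as F

def similarity_boundaries(h, Wq, Wk):
    """h: (T, d). Score a boundary wherever position t looks unlike position t-1."""
    q, k = h @ Wq, h @ Wk
    cos = F.cosine_similarity(q[1:], k[:-1], dim=-1)        # similarity of neighbors
    p = 0.5 * (1.0 - cos)                                   # dissimilar neighbors -> likely boundary
    return torch.cat([torch.ones(1), p])                    # position 0 always starts a chunk

def ema_smooth(h, p):
    """Blend each position with a running average, weighted by boundary confidence."""
    out, running = [], h[0]
    for t in range(h.size(0)):
        running = p[t] * h[t] + (1 - p[t]) * running        # confident boundary -> reset toward h[t]
        out.append(running)
    return torch.stack(out)

T, d = 32, 16
h = torch.randn(T, d)
Wq, Wk = torch.randn(d, d), torch.randn(d, d)
p = similarity_boundaries(h, Wq, Wk)
print(ema_smooth(h, p).shape, (p > 0.5).sum().item(), "boundaries")
```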
H-Net operates over any language, discovering semantic units where heuristic tokenizers fail. By directly learning patterns from raw bytes, dynamic chunking enables models that are more flexible, powerful, and robust. Paper: https://t.co/AVW1RtyRAY w/ @_albertgu and @fluorane 2/
3
27
351
🦆🚀QuACK🦆🚀: a new speed-of-light (SOL) memory-bound kernel library without a single line of CUDA C++, written straight in Python thanks to CuTe-DSL. On an H100 with 3 TB/s of memory bandwidth, it runs 33%-50% faster than highly optimized libraries like PyTorch's torch.compile and Liger. 🤯 With @tedzadouri and @tri_dao
13
73
332
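For context, "speed-of-light" for a memory-bound kernel means its runtime is set by how many bytes it moves through HBM rather than by FLOPs, so the right yardstick is achieved bandwidth against the H100's roughly 3.3 TB/s peak. The snippet below is plain PyTorch rather than QuACK or CuTe-DSL, just to show how that roofline comparison is typically measured:

```python
# Estimate achieved HBM bandwidth of a simple memory-bound op and compare it to the roofline.
import torch, time

assert torch.cuda.is_available()
x = torch.randn(512 * 1024 * 1024 // 4, device="cuda")      # ~512 MB of fp32

def timed(fn, iters=20):
    for _ in range(3):                                        # warm-up
        fn()
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    for _ in range(iters):
        fn()
    torch.cuda.synchronize()
    return (time.perf_counter() - t0) / iters

y = torch.empty_like(x)
sec = timed(lambda: torch.mul(x, 2.0, out=y))                # reads x once, writes y once
bytes_moved = 2 * x.numel() * x.element_size()               # one read + one write per element
print(f"achieved ~{bytes_moved / sec / 1e12:.2f} TB/s vs ~3.3 TB/s peak on H100")
```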