Sukjun (June) Hwang

@sukjun_hwang

Followers: 3K · Following: 539 · Media: 14 · Statuses: 80

ML PhD student @mldcmu advised by @_albertgu

Pittsburgh, PA
Joined April 2023
@sukjun_hwang
Sukjun (June) Hwang
2 months
Tokenization has been the final barrier to truly end-to-end language models. We developed the H-Net: a hierarchical network that replaces tokenization with a dynamic chunking process directly inside the model, automatically discovering and operating over meaningful units of data.
Tweet media one
Tweet media two
98
744
5K
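For readers who want a concrete picture of the dynamic-chunking idea in the tweet above, here is a minimal, hedged sketch: every function name and the boundary rule are placeholders of my own, not the released H-Net code. It only shows the data flow: encode raw bytes, predict chunk boundaries, pool each chunk, and hand the much shorter sequence to an inner model.

```python
# Conceptual sketch of dynamic chunking (placeholders only, not the H-Net code):
# encode bytes -> predict chunk boundaries -> pool chunks -> run an inner model
# on the much shorter chunk sequence.
import numpy as np

rng = np.random.default_rng(0)
D = 16  # hidden size, arbitrary for the sketch

def encode(byte_values):
    """Stand-in byte encoder: one vector per input byte."""
    table = rng.normal(size=(256, D))
    return table[byte_values]                      # (L, D)

def predict_boundaries(h):
    """Stand-in for the learned chunking module: a 0/1 flag per position
    marking the end of a chunk (here just a fixed random rule)."""
    return rng.random(len(h)) > 0.7                # (L,) boolean

def main_network(chunks):
    """Stand-in for the inner model that runs at the coarser chunk rate."""
    return [c + 0.1 for c in chunks]

text = "hello world, this is a byte sequence".encode("utf-8")
bytes_in = np.frombuffer(text, dtype=np.uint8)

h = encode(bytes_in)
ends = predict_boundaries(h)

# Pool each chunk (mean over its bytes): far fewer positions than raw bytes.
chunks, start = [], 0
for i, is_end in enumerate(ends):
    if is_end or i == len(h) - 1:
        chunks.append(h[start:i + 1].mean(axis=0))
        start = i + 1

processed = main_network(chunks)
print(f"{len(bytes_in)} bytes -> {len(processed)} chunks "
      f"(~{len(bytes_in) / len(processed):.1f} bytes per chunk)")
```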
@sukjun_hwang
Sukjun (June) Hwang
8 days
Coming from a computer vision background and now in sequence modeling, I’m often struck by how disconnected LLMs and vision feel. Our work, AUSM, treats video as language -- and it reveals a few blind spots we’ve overlooked.
@miran_heo
Miran Heo
8 days
We connect the autoregressive pipeline of LLMs with streaming video perception. Introducing AUSM: Autoregressive Universal Video Segmentation Model. A step toward unified, scalable video perception — inspired by how LLMs unified NLP. 📝
4
8
137
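A conceptual sketch of the "video as language" framing, assuming nothing about AUSM's actual architecture: frames are consumed one at a time while a carried state plays the role of the LLM context, so segmentation proceeds causally, like next-token decoding. All names, shapes, and update rules are illustrative.

```python
# Conceptual sketch only (not AUSM itself): streaming video segmentation as an
# autoregressive loop over frames, with a carried state standing in for the
# LLM-style context.
import numpy as np

rng = np.random.default_rng(0)

def segment_frame(frame, state):
    """Stand-in model step: produce a per-pixel mask and an updated state."""
    logits = frame.mean(axis=-1) + state.mean()
    mask = logits > logits.mean()
    new_state = 0.9 * state + 0.1 * frame.mean(axis=(0, 1))
    return mask, new_state

T, H, W, C = 8, 32, 32, 3
video = rng.random((T, H, W, C))        # fake video stream
state = np.zeros(C)                     # recurrent "context"

masks = []
for t in range(T):                      # strictly causal, frame by frame,
    mask, state = segment_frame(video[t], state)   # like next-token decoding
    masks.append(mask)

print(f"processed {len(masks)} frames, mask shape {masks[0].shape}")
```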
@main_horse
main
8 days
μtransfer for Mamba2 & Muon
Tweet media one
4
23
195
@pratyushmaini
Pratyush Maini
22 days
1/Pretraining is hitting a data wall; scaling raw web data alone leads to diminishing returns. Today @datologyai shares BeyondWeb, our synthetic data approach & all the learnings from scaling it to trillions of tokens🧑🏼‍🍳
- 3B LLMs beat 8B models🚀
- Pareto frontier for performance
Tweet media one
23
125
705
@lchen915
Lili
1 month
Self-Questioning Language Models: LLMs that learn to generate their own questions and answers via asymmetric self-play RL. There is no external training data – the only input is a single prompt specifying the topic.
Tweet media one
25
184
1K
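To make the self-play setup concrete, here is a toy, hedged sketch of an asymmetric proposer/solver loop whose only external input is a single topic prompt; the two roles, the toy task, and the reward below are stand-ins of mine, not the paper's recipe.

```python
# Toy sketch of an asymmetric self-play loop in the spirit of the tweet; the
# roles, task, and reward are illustrative stand-ins, not the paper's method.
import random

random.seed(0)
TOPIC_PROMPT = "arithmetic word problems"   # the single external input

def propose_question(topic):
    """Proposer role: invent a question about the topic (toy version)."""
    a, b = random.randint(1, 9), random.randint(1, 9)
    return f"[{topic}] What is {a} + {b}?", a + b

def solve(question):
    """Solver role: attempt an answer (toy version with occasional mistakes)."""
    parts = question.split()
    target = int(parts[-3]) + int(parts[-1].rstrip("?"))
    return target if random.random() > 0.2 else target + 1

for step in range(5):
    question, reference = propose_question(TOPIC_PROMPT)
    answer = solve(question)
    solver_reward = float(answer == reference)
    # Asymmetric objectives: the solver is rewarded for being right, while the
    # proposer can be rewarded for questions that are neither trivial nor
    # unsolvable (a common self-play recipe; the paper's may differ).
    print(f"step {step}: {question} -> {answer} (solver reward {solver_reward})")
```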
@mihirp98
Mihir Prabhudesai
2 months
🚨 The era of infinite internet data is ending. So we ask: 👉 What’s the right generative modelling objective when data—not compute—is the bottleneck?
TL;DR:
▶️ Compute-constrained? Train Autoregressive models
▶️ Data-constrained? Train Diffusion models
Get ready for 🤿 1/n
Tweet media one
128
192
1K
@_albertgu
Albert Gu
2 months
I'll be giving the first H-Net talk this afternoon at 4:30-5 PT at the ES-FoMo workshop! come support the fight against Big Token 🙏
@ESFoMo
ES-FoMo@ICML2025
2 months
Looking forward to seeing everyone for ES-FoMo part three tomorrow! We'll be in East Exhibition Hall A (the big one), and we've got an exciting schedule of invited talks, orals, and posters planned for you tomorrow. Let's meet some of our great speakers! 1/
Tweet media one
4
11
140
@gaurav_ghosal
Gaurav Ghosal
2 months
1/So much of privacy research is designing post-hoc methods to make models memorization-free. It’s time we turn that around with architectural changes. Excited to add Memorization Sinks to the transformer architecture this #ICML2025 to isolate memorization during LLM training🧵
Tweet media one
1
23
59
@sukjun_hwang
Sukjun (June) Hwang
2 months
Just realized we forgot to link the code! Check it out: model checkpoints are included so you can play with it yourself and see what boundaries it's learning.
Code: https://t.co/BtQaU383xJ
Paper: https://t.co/AVW1Rtzpqw
12/10
arxiv.org
Major progress on language models (LMs) in recent years has largely resulted from moving away from specialized models designed for specific tasks, to general models based on powerful architectures...
2
15
96
@sukjun_hwang
Sukjun (June) Hwang
2 months
Albert has written amazing blog posts full of behind-the-scenes stories and wonderful insights about H-Net. You should check them out! https://t.co/NL9Eus1YBa
@_albertgu
Albert Gu
2 months
This was an incredibly important project to me - I’ve wanted to solve it for years, but had no idea how. This was all @sukjun_hwang and @fluorane's amazing work! I wrote about the story of its development, and what might be coming next. The H-Net:
5
6
106
@sukjun_hwang
Sukjun (June) Hwang
2 months
We’re incredibly excited to see how H-Nets will allow models to learn more efficiently, with fewer priors and less pre-processing, across all sorts of modalities! This work was a collaboration with @cartesia_ai 10/10
7
4
152
@sukjun_hwang
Sukjun (June) Hwang
2 months
Finally, a key ingredient of H-Net is using state space models (SSMs) such as Mamba layers in the outer stages. SSMs naturally compress data into their recurrent states, which is not only more efficient, but turns out to be crucial for building higher-level abstractions. 9/
Tweet media one
1
7
117
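A minimal sketch of the compression property being described, using a toy diagonal linear SSM rather than Mamba itself: however long the input, everything the model has seen is folded into a fixed-size recurrent state.

```python
# Toy diagonal linear SSM (not Mamba itself): the whole input history is
# folded into a fixed-size recurrent state, regardless of sequence length,
# which is the compression property the tweet refers to.
import numpy as np

rng = np.random.default_rng(0)
N = 8                                  # state size, fixed a priori
a = rng.uniform(0.8, 0.99, size=N)     # per-channel decay
b = rng.normal(size=N)                 # input projection
c = rng.normal(size=N)                 # output projection

def ssm_scan(x):
    """h_t = a * h_{t-1} + b * x_t ;  y_t = <c, h_t>"""
    h, ys = np.zeros(N), []
    for x_t in x:
        h = a * h + b * x_t            # all past inputs are summarized in h
        ys.append(c @ h)
    return np.array(ys), h

for L in (64, 1024):
    _, final_state = ssm_scan(rng.normal(size=L))
    print(f"sequence length {L:5d} -> state shape {final_state.shape}")
```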
@sukjun_hwang
Sukjun (June) Hwang
2 months
DNA is an unusual “language”, and previous architectures showed different modeling power on DNA sequences (e.g., Mamba > Transformer). But any of them can be wrapped inside an H-Net for much stronger scaling, learning nearly 4 times more efficiently from data! 8/
Tweet media one
2
11
149
@sukjun_hwang
Sukjun (June) Hwang
2 months
On languages that lack English-style segmentation cues, H-Net’s advantage over token-based baselines grows even further. While code is compressible and heuristic tokenizers also perform very well on it, languages like Chinese are more challenging, and that is where H-Net shows its strongest results. 7/
Tweet media one
1
4
124
@sukjun_hwang
Sukjun (June) Hwang
2 months
Because it operates over finer-grained bytes instead of pre-defined tokens, H-Net is dramatically more robust to textual perturbations. This all comes for *free*, without needing to introduce adversarial training or modify the data mix at all. 6/
Tweet media one
1
4
145
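For illustration only (my own perturbation choices, not the paper's evaluation suite), the snippet below shows the kind of textual noise meant here and why it looks small from a byte-level model's point of view: each edit is a single-step change in byte edit distance, with no tokenizer in the loop that could re-segment the surrounding word.

```python
# Illustration of byte-level robustness (my own perturbation choices, not the
# paper's evaluation suite): each edit is a one-step change in byte edit
# distance, and a byte-level model consumes it directly.
import random

random.seed(0)
text = "The quick brown fox jumps over the lazy dog"

def case_flip(s):
    i = random.choice([k for k, ch in enumerate(s) if ch.isalpha()])
    return s[:i] + s[i].swapcase() + s[i + 1:]

def char_drop(s):
    i = random.randrange(len(s))
    return s[:i] + s[i + 1:]

def char_repeat(s):
    i = random.randrange(len(s))
    return s[:i] + s[i] + s[i:]

def edit_distance(a, b):
    """Plain Levenshtein distance over byte sequences."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[-1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

for name, fn in [("case flip", case_flip), ("char drop", char_drop),
                 ("char repeat", char_repeat)]:
    noisy = fn(text)
    d = edit_distance(text.encode(), noisy.encode())
    print(f"{name:11s} (byte edit distance {d}): {noisy}")
```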
@sukjun_hwang
Sukjun (June) Hwang
2 months
When completely matched for both data (bytes/batch) and compute (FLOPs/byte), H-Net outperforms the tokenized Transformer as well as all byte-level baselines. A 2-stage H-Net matches the downstream evaluations of a Transformer twice its size! 5/
Tweet media one
2
6
156
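A back-of-the-envelope sketch of what matching FLOPs/byte entails, using the common ~6 × params approximation for training FLOPs per token; the parameter count and bytes-per-token figure below are made-up round numbers for illustration, not the paper's configurations.

```python
# Back-of-the-envelope: what "matched FLOPs/byte" means, using the common
# ~6 * params FLOPs-per-training-token approximation. The numbers below are
# made-up round figures, not the paper's configurations.
PARAMS_TOKEN_MODEL = 1.3e9   # hypothetical BPE-tokenized Transformer
BYTES_PER_TOKEN    = 4.4     # rough BPE compression on English text

# The tokenized model spends its compute once per *token*, so per byte:
flops_per_byte_token_model = 6 * PARAMS_TOKEN_MODEL / BYTES_PER_TOKEN

# A model that applied all of its parameters once per *byte* would need to be
# smaller by the same factor to stay on the same FLOPs/byte budget -- which is
# why H-Net runs most of its parameters at a coarser, dynamically chunked rate.
params_at_byte_rate_same_budget = PARAMS_TOKEN_MODEL / BYTES_PER_TOKEN

print(f"tokenized model budget : {flops_per_byte_token_model:.2e} FLOPs/byte")
print(f"naive byte-rate params : {params_at_byte_rate_same_budget:.2e}")
```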
@sukjun_hwang
Sukjun (June) Hwang
2 months
By using a data-dependent chunking strategy, a 1-stage H-Net learns a compression ratio similar to BPE tokenization, but already scales better! Iterating the hierarchy to 2 stages allows it to operate over even higher levels of abstraction, learning even faster from data. 4/
Tweet media one
1
8
198
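A tiny arithmetic sketch (illustrative ratios, not measured values) of why iterating the hierarchy raises the level of abstraction: each stage's compression multiplies, so the innermost network runs over far fewer positions per byte.

```python
# Illustrative (not measured) ratios: compression compounds across stages, so
# a 2-stage hierarchy lets the innermost network run over far fewer positions.
stage1_bytes_per_chunk = 4.5    # hypothetical, roughly BPE-like
stage2_chunks_per_chunk = 2.0   # hypothetical second-stage ratio

one_stage = stage1_bytes_per_chunk
two_stage = stage1_bytes_per_chunk * stage2_chunks_per_chunk

print(f"1-stage: inner network runs once per ~{one_stage:.1f} bytes")
print(f"2-stage: inner network runs once per ~{two_stage:.1f} bytes")
```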
@sukjun_hwang
Sukjun (June) Hwang
2 months
H-Net introduces several technical components, including a similarity-score routing module and EMA-based smoothing module, to allow learning discrete chunk boundaries stably. And because it’s fully end-to-end, H-Net can be *recursively iterated* to more stages of hierarchy! 3/
Tweet media one
5
40
416
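Here is a rough sketch of the two ingredients named in the tweet above; the exact formulas are my own illustration rather than the paper's modules: a boundary score that rises when adjacent positions look dissimilar, and an EMA smoother that damps the discrete boundary signal.

```python
# Rough illustration; the formulas are stand-ins, not the paper's exact
# modules. (1) A boundary score that is high when adjacent positions look
# dissimilar; (2) an EMA smoother that damps the jittery, discrete signal.
import numpy as np

rng = np.random.default_rng(0)
L, D = 32, 16
h = rng.normal(size=(L, D))            # per-position hidden states (stand-in)

def boundary_scores(h):
    """Score in [0, 1]: high when position t is unlike position t-1."""
    a, b = h[1:], h[:-1]
    cos = (a * b).sum(-1) / (np.linalg.norm(a, axis=-1) *
                             np.linalg.norm(b, axis=-1))
    p = 0.5 * (1.0 - cos)              # dissimilar neighbours -> p near 1
    return np.concatenate([[1.0], p])  # position 0 always starts a chunk

def ema_smooth(p, decay=0.7):
    """Exponential moving average: one simple way to stabilize the signal."""
    out, running = [], 0.0
    for pt in p:
        running = decay * running + (1 - decay) * pt
        out.append(running)
    return np.array(out)

p = boundary_scores(h)
boundaries = ema_smooth(p) > 0.5       # discretize the smoothed score
print(f"{L} positions -> {int(boundaries.sum())} chunk boundaries")
```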
@sukjun_hwang
Sukjun (June) Hwang
2 months
H-Net operates over any language, discovering semantic units where heuristic tokenizers fail. By directly learning patterns from raw bytes, dynamic chunking enables models that are more flexible, powerful, and robust. Paper: https://t.co/AVW1RtyRAY w/ @_albertgu and @fluorane 2/
Tweet media one
Tweet media two
3
27
351
@WentaoGuo7
Wentao Guo
2 months
🦆🚀QuACK🦆🚀: a new SOL (speed-of-light) memory-bound kernel library without a single line of CUDA C++, all straight in Python thanks to CuTe-DSL. On H100 with 3TB/s, it performs 33%-50% faster than highly optimized libraries like PyTorch's torch.compile and Liger. 🤯 With @tedzadouri and @tri_dao
Tweet media one
13
73
332
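This is not CuTe-DSL code (I have not verified that API here); it is just a PyTorch sketch of the kind of memory-bound kernel being discussed, written with torch.compile, one of the baselines the tweet compares against. RMSNorm moves far more bytes than it computes FLOPs, so its throughput is limited by memory bandwidth.

```python
# Not CuTe-DSL / QuACK code -- a PyTorch sketch of the kind of memory-bound
# kernel in question, using torch.compile (one of the baselines in the tweet).
import torch

@torch.compile  # can fuse the elementwise ops so x is read roughly once
def rmsnorm(x: torch.Tensor, weight: torch.Tensor, eps: float = 1e-6):
    rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + eps)
    return x * rms * weight

device = "cuda" if torch.cuda.is_available() else "cpu"
x = torch.randn(4096, 4096, device=device)
w = torch.ones(4096, device=device)
print(rmsnorm(x, w).shape)
```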