
Sukjun (June) Hwang
@sukjun_hwang
Followers: 3K
Following: 539
Media: 14
Statuses: 80
ML PhD student @mldcmu advised by @_albertgu
Pittsburgh, PA
Joined April 2023
Tokenization has been the final barrier to truly end-to-end language models. We developed the H-Net: a hierarchical network that replaces tokenization with a dynamic chunking process directly inside the model, automatically discovering and operating over meaningful units of data
98
744
5K
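As a rough mental model of what "dynamic chunking inside the model" means (an illustrative sketch only, not the released H-Net code; the module names and the GRU stand-in for the outer SSM stage are assumptions), the idea is to score a boundary probability at every byte position and pool each resulting span into one unit for a larger inner network:

```python
# Minimal sketch of the dynamic-chunking idea: a small byte-level encoder scores a
# boundary probability at every position, positions above a threshold close a chunk,
# and pooled chunk representations are handed to a larger "main" network.
import torch
import torch.nn as nn

class DynamicChunker(nn.Module):
    def __init__(self, d_model=256, vocab=256, threshold=0.5):
        super().__init__()
        self.embed = nn.Embedding(vocab, d_model)                   # raw bytes, no tokenizer
        self.encoder = nn.GRU(d_model, d_model, batch_first=True)   # stand-in for the outer SSM stage
        self.boundary_head = nn.Linear(d_model, 1)                  # per-position boundary score
        self.threshold = threshold

    def forward(self, byte_ids):
        h, _ = self.encoder(self.embed(byte_ids))                   # (B, T, d)
        p_boundary = torch.sigmoid(self.boundary_head(h)).squeeze(-1)  # (B, T)
        chunks = []
        for b in range(byte_ids.size(0)):
            start, seq = 0, []
            for t in range(byte_ids.size(1)):
                if p_boundary[b, t] > self.threshold or t == byte_ids.size(1) - 1:
                    seq.append(h[b, start : t + 1].mean(dim=0))     # pool one discovered chunk
                    start = t + 1
            chunks.append(torch.stack(seq))                         # variable number of chunks per sequence
        return chunks, p_boundary                                   # chunks would feed the inner/main model

x = torch.randint(0, 256, (1, 64))                                  # a batch of raw bytes
chunks, p = DynamicChunker()(x)
print(len(chunks[0]), "chunks discovered for 64 bytes")
```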
Coming from a computer vision background and now in sequence modeling, I’m often struck by how disconnected LLMs and vision feel. Our work, AUSM, treats video as language -- and it reveals a few blind spots we’ve overlooked.
We connect the autoregressive pipeline of LLMs with streaming video perception. Introducing AUSM: Autoregressive Universal Video Segmentation Model. A step toward unified, scalable video perception — inspired by how LLMs unified NLP. 📝
4
8
137
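A minimal sketch of the "video as language" analogy, with entirely illustrative module names and shapes (this is not the AUSM architecture): frames play the role of tokens, a carried state plays the role of the LM's context, and a segmentation mask is decoded at each step.

```python
# Toy streaming segmenter: frames are consumed one at a time, a recurrent state
# carries history, and per-frame masks are decoded, mirroring how an autoregressive
# LM consumes tokens causally.
import torch
import torch.nn as nn

class StreamingSegmenter(nn.Module):
    def __init__(self, d=64, num_classes=21):
        super().__init__()
        self.frame_enc = nn.Conv2d(3, d, kernel_size=3, padding=1)
        self.state_mix = nn.Conv2d(2 * d, d, kernel_size=1)    # fuse history with the new frame
        self.mask_head = nn.Conv2d(d, num_classes, kernel_size=1)

    def step(self, frame, state=None):
        feat = self.frame_enc(frame)                           # (B, d, H, W)
        if state is None:
            state = torch.zeros_like(feat)
        state = self.state_mix(torch.cat([state, feat], dim=1))  # updated memory
        return self.mask_head(state), state                    # per-frame masks + carried state

model = StreamingSegmenter()
state = None
video = torch.randn(8, 3, 64, 64)                              # 8 frames, processed causally
for frame in video:
    masks, state = model.step(frame.unsqueeze(0), state)
print(masks.shape)                                             # torch.Size([1, 21, 64, 64])
```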
1/ Pretraining is hitting a data wall; scaling raw web data alone leads to diminishing returns. Today @datologyai shares BeyondWeb, our synthetic data approach & all the learnings from scaling it to trillions of tokens 🧑🏼‍🍳
- 3B LLMs beat 8B models 🚀
- Pareto frontier for performance
23
125
705
Self-Questioning Language Models: LLMs that learn to generate their own questions and answers via asymmetric self-play RL. There is no external training data – the only input is a single prompt specifying the topic.
25
184
1K
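The asymmetric self-play setup described above is easy to picture as a loop. Below is a toy, self-contained sketch of that loop; the proposer_generate/solver_answer stubs and the reward shaping are illustrative assumptions rather than the paper's exact recipe. The proposer invents questions from the topic prompt alone, the solver samples several answers, majority agreement stands in for correctness, and the proposer is rewarded for questions that are neither trivially easy nor impossible.

```python
# Toy asymmetric self-play loop: no external training data, only a topic prompt.
import random
from collections import Counter

TOPIC_PROMPT = "Pose a short arithmetic word problem."

def proposer_generate(prompt):
    # stub "proposer": in the real setting this is the LM sampling a question
    a, b = random.randint(1, 9), random.randint(1, 9)
    return f"What is {a} + {b}?"

def solver_answer(question):
    # stub "solver": sometimes right, sometimes off by one
    nums = [int(t.strip("?")) for t in question.split() if t.strip("?").isdigit()]
    return sum(nums) + random.choice([0, 0, 0, 1])

for step in range(5):
    question = proposer_generate(TOPIC_PROMPT)
    answers = [solver_answer(question) for _ in range(5)]        # several solver rollouts
    majority, _ = Counter(answers).most_common(1)[0]
    solver_rewards = [float(a == majority) for a in answers]     # majority vote as proxy correctness
    success = sum(solver_rewards) / len(solver_rewards)
    proposer_reward = 1.0 if 0.0 < success < 1.0 else 0.0        # reward learnably-hard questions
    print(step, question, answers, round(success, 2), proposer_reward)
    # in the real method these rewards drive policy-gradient updates of both roles
```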
🚨 The era of infinite internet data is ending, so we ask: 👉 What’s the right generative modelling objective when data—not compute—is the bottleneck?
TL;DR:
▶️ Compute-constrained? Train autoregressive models
▶️ Data-constrained? Train diffusion models
Get ready for 🤿 1/n
128
192
1K
I’ll be giving the first H-Net talk this afternoon at 4:30-5 PT at the ES-FoMo workshop! Come support the fight against Big Token 🙏
Looking forward to seeing everyone for ES-FoMo part three tomorrow! We'll be in East Exhibition Hall A (the big one), and we've got an exciting schedule of invited talks, orals, and posters planned for you tomorrow. Let's meet some of our great speakers! 1/
4
11
140
1/ So much of privacy research is designing post-hoc methods to make models memorization-free. It’s time we turn that around with architectural changes. Excited to add Memorization Sinks to the transformer architecture at #ICML2025 to isolate memorization during LLM training 🧵
1
23
59
Just realized we forgot to link the code, so check it out! Model checkpoints are included, so you can play with it yourself and see what boundaries it's learning. Code: https://t.co/BtQaU383xJ Paper: https://t.co/AVW1Rtzpqw 12/10
2
15
96
Albert has written amazing blog posts full of behind-the-scenes stories and wonderful insights about H-Net. You should check them out! https://t.co/NL9Eus1YBa
This was an incredibly important project to me - I’ve wanted to solve it for years, but had no idea how. This was all @sukjun_hwang and @fluorane's amazing work! I wrote about the story of its development, and what might be coming next. The H-Net:
5
6
106
We’re incredibly excited to see how H-Nets will allow models to learn more efficiently, with fewer priors and less pre-processing, across all sorts of modalities! This work was a collaboration with @cartesia_ai 10/10
7
4
152
Finally, a key ingredient of H-Net is using state space models (SSMs) such as Mamba layers in the outer stages. SSMs naturally compress data into their recurrent states, which is not only more efficient, but turns out to be crucial for building higher-level abstractions. 9/
1
7
117
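A quick way to see why the recurrent state matters here: it is a fixed-size summary of everything seen so far, so the cost of carrying byte-level context does not grow with sequence length. The toy linear recurrence below is not a real Mamba layer (which uses selective, input-dependent parameters), just the simplest stand-in that shows the compression.

```python
# Toy linear state-space recurrence: however long the byte stream, everything seen
# so far is summarized into a fixed-size recurrent state h.
import torch

d_state, d_in = 16, 8
A = 0.9 * torch.eye(d_state)            # state transition (decay)
B = torch.randn(d_state, d_in) * 0.1    # input projection
C = torch.randn(d_in, d_state) * 0.1    # readout

h = torch.zeros(d_state)                # constant-size memory, independent of sequence length
stream = torch.randn(10_000, d_in)      # an arbitrarily long input stream
for x_t in stream:
    h = A @ h + B @ x_t                 # h_t = A h_{t-1} + B x_t
    y_t = C @ h                         # y_t = C h_t
print(h.shape)                          # torch.Size([16]) no matter how long the stream was
```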
DNA is an unusual “language”, and previous architectures showed different modeling power on DNA sequences (e.g., Mamba > Transformer). But any of them can be wrapped inside an H-Net for much stronger scaling, learning nearly 4 times as efficiently from data! 8/
2
11
149
On languages that lack easy segmentation cues like English’s whitespace, H-Net’s advantage over token-based baselines grows even further. While code is compressible enough that heuristic tokenizers also perform very well on it, languages like Chinese are more challenging, and there H-Net shows its strongest results. 7/
1
4
124
Because it operates over finer-grained bytes instead of pre-defined tokens, H-Net is dramatically more robust to textual perturbations. This all comes for *free*, without needing to introduce adversarial training or modify the data mix at all. 6/
1
4
145
When completely matched for both data (bytes/batch) and compute (FLOPs/byte), H-Net outperforms the tokenized Transformer as well as all byte-level baselines. A 2-stage H-Net matches the downstream evaluations of a Transformer twice its size! 5/
2
6
156
By using a data-dependent chunking strategy, a 1-stage H-Net learns a compression ratio similar to BPE tokenization, but already scales better! Iterating the hierarchy to 2 stages allows it to operate over even higher levels of abstraction, learning even faster from data. 4/
1
8
198
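For reference, "compression ratio" here just means average bytes per chunk, directly comparable to bytes per token under BPE. A tiny illustration, with made-up boundary positions and token counts purely for the demo:

```python
# Compare bytes-per-chunk implied by learned boundaries against bytes-per-token under BPE.
text = "hierarchical networks can learn their own chunk boundaries".encode("utf-8")

boundary_positions = [12, 21, 25, 31, 37, 41, 47, 58]       # hypothetical learned boundaries
learned_ratio = len(text) / len(boundary_positions)          # bytes per learned chunk

bpe_tokens = 9                                               # hypothetical BPE token count for the same text
bpe_ratio = len(text) / bpe_tokens                           # bytes per BPE token

print(f"{learned_ratio:.2f} bytes/chunk vs {bpe_ratio:.2f} bytes/token")
```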
H-Net introduces several technical components, including a similarity-score routing module and an EMA-based smoothing module, that allow discrete chunk boundaries to be learned stably. And because it’s fully end-to-end, H-Net can be *recursively iterated* to more stages of hierarchy! 3/
5
40
416
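To make the two named components more concrete, here is a rough sketch of their flavor, with the caveat that the paper's actual routing and smoothing modules differ in the details: boundaries are scored by how dissimilar adjacent positions look under learned projections, and an EMA-style blend keeps the discrete chunking decisions smooth enough to train.

```python
# Illustrative similarity-score boundaries plus EMA-style smoothing (not the paper's exact modules).
import torch
import torch.nn.functional as F

def similarity_boundaries(h, Wq, Wk):
    """h: (T, d). Score a boundary wherever position t looks unlike position t-1."""
    q, k = h @ Wq, h @ Wk
    cos = F.cosine_similarity(q[1:], k[:-1], dim=-1)        # similarity of neighbors
    p = 0.5 * (1.0 - cos)                                   # dissimilar neighbors -> likely boundary
    return torch.cat([torch.ones(1), p])                    # position 0 always starts a chunk

def ema_smooth(h, p):
    """Blend each position with a running average, weighted by boundary confidence."""
    out, running = [], h[0]
    for t in range(h.size(0)):
        running = p[t] * h[t] + (1 - p[t]) * running        # confident boundary -> reset toward h[t]
        out.append(running)
    return torch.stack(out)

T, d = 32, 16
h = torch.randn(T, d)
Wq, Wk = torch.randn(d, d), torch.randn(d, d)
p = similarity_boundaries(h, Wq, Wk)
print(ema_smooth(h, p).shape, (p > 0.5).sum().item(), "boundaries")
```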
H-Net operates over any language, discovering semantic units where heuristic tokenizers fail. By directly learning patterns from raw bytes, dynamic chunking enables models that are more flexible, powerful, and robust. Paper: https://t.co/AVW1RtyRAY w/ @_albertgu and @fluorane 2/
3
27
351
🦆🚀QuACK🦆🚀: a new speed-of-light (SOL) memory-bound kernel library without a single line of CUDA C++, written straight in Python thanks to CuTe-DSL. On an H100 with 3 TB/s of memory bandwidth, it runs 33%-50% faster than highly optimized libraries like PyTorch's torch.compile and Liger. 🤯 With @tedzadouri and @tri_dao
13
73
332
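For context, "speed-of-light" for a memory-bound kernel means its runtime is set by how many bytes it moves through HBM rather than by FLOPs, so the right yardstick is achieved bandwidth against the H100's roughly 3.3 TB/s peak. The snippet below is plain PyTorch rather than QuACK or CuTe-DSL, just to show how that roofline comparison is typically measured:

```python
# Estimate achieved HBM bandwidth of a simple memory-bound op and compare it to the roofline.
import torch, time

assert torch.cuda.is_available()
x = torch.randn(512 * 1024 * 1024 // 4, device="cuda")      # ~512 MB of fp32

def timed(fn, iters=20):
    for _ in range(3):                                        # warm-up
        fn()
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    for _ in range(iters):
        fn()
    torch.cuda.synchronize()
    return (time.perf_counter() - t0) / iters

y = torch.empty_like(x)
sec = timed(lambda: torch.mul(x, 2.0, out=y))                # reads x once, writes y once
bytes_moved = 2 * x.numel() * x.element_size()               # one read + one write per element
print(f"achieved ~{bytes_moved / sec / 1e12:.2f} TB/s vs ~3.3 TB/s peak on H100")
```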