 
            
Sukjun (June) Hwang
@sukjun_hwang
Followers: 3K · Following: 572 · Media: 14 · Statuses: 88
ML PhD student @mldcmu advised by @_albertgu
Pittsburgh, PA
Joined April 2023
            
            
           Tokenization has been the final barrier to truly end-to-end language models. We developed the H-Net: a hierarchical network that replaces tokenization with a dynamic chunking process directly inside the model, automatically discovering and operating over meaningful units of data 
          
                
98 · 750 · 5K
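The dynamic chunking idea can be pictured in a few lines: a small scoring module marks a position as a likely chunk boundary when its hidden state looks dissimilar to the previous one, so the hierarchy discovers its own units instead of relying on a fixed tokenizer. The sketch below is a minimal PyTorch illustration of that intuition, assuming simple linear projections and a 0.5·(1 − cos) mapping; it is not the paper's exact routing module.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BoundaryScorer(nn.Module):
    """Toy dynamic-chunking scorer: a position is a likely chunk boundary
    when its hidden state is dissimilar to the previous one."""
    def __init__(self, d_model: int):
        super().__init__()
        self.q = nn.Linear(d_model, d_model, bias=False)
        self.k = nn.Linear(d_model, d_model, bias=False)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (batch, seq_len, d_model) low-level (e.g. byte) hidden states
        cos = F.cosine_similarity(self.q(h[:, 1:]), self.k(h[:, :-1]), dim=-1)
        p_boundary = 0.5 * (1.0 - cos)                  # dissimilar -> likely boundary
        first = torch.ones_like(p_boundary[:, :1])      # sequence start is always a boundary
        return torch.cat([first, p_boundary], dim=1)    # (batch, seq_len), values in [0, 1]
```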
              
             We scaled up an "alternative" paradigm in RL: *divide and conquer*. Compared to Q-learning (TD learning), divide and conquer can naturally scale to much longer horizons. Blog post:  https://t.co/xtXBzya0bI  Paper:  https://t.co/nqYkLucsWu  ↓ 
          
                
11 · 72 · 423
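As a rough illustration of the contrast being drawn (my reading of the tweet, not the paper's algorithm): a TD backup bootstraps one step at a time, so value errors compound with the horizon, while a divide-and-conquer backup stitches a goal-reaching value out of two half-horizon pieces through an intermediate subgoal, so the recursion over subproblems is only logarithmically deep. The dictionary-based values and the max-over-midpoints rule below are illustrative assumptions.

```python
def td_backup(V, r, s_next, gamma=0.99):
    """One-step temporal-difference target: bootstraps over a single step,
    so value errors can compound over very long horizons."""
    return r + gamma * V[s_next]

def divide_and_conquer_backup(V, s, g, candidate_midpoints):
    """Split 'reach g from s' at a subgoal m and stitch the two halves.
    Here V[(a, b)] is a goal-conditioned value that behaves like a negative
    path cost, so the halves add; maxing picks the best subgoal. Each half
    covers roughly half the horizon, so the recursion is only log-deep."""
    return max(V[(s, m)] + V[(m, g)] for m in candidate_midpoints)
```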
              
             We've raised $100M from Kleiner Perkins, Index Ventures, Lightspeed, and NVIDIA. Today we're introducing Sonic-3 - the state-of-the-art model for realtime conversation. What makes Sonic-3 great: - Breakthrough naturalness - laughter and full emotional range - Lightning fast - 
          
                
1K · 1K · 8K
              
             I have been thinking a lot recently about framing a variety of inference-time tasks as doing algorithm design with access to strong oracles (e.g. generators, different types of verifiers, convolved scores, ...) --- as an alternative to "end-to-end" analyses. 
           New paper we're excited to get online! Taming Imperfect Process Verifiers: A Sampling Perspective on Backtracking. A totally new framework based on ~backtracking~ for using process verifiers to guide inference, w/ connections to approximate counting/sampling in theoretical CS. 
            
                
3 · 11 · 55
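A minimal sketch of the backtracking idea as described in the quoted announcement, with hypothetical `generate_step` and `verify_prefix` callables standing in for the generator and the imperfect process verifier; the threshold-and-retry logic is my own illustrative choice, not the paper's procedure.

```python
def backtracking_decode(generate_step, verify_prefix, max_steps=32,
                        threshold=0.5, max_retries=8):
    """Grow a partial solution step by step; when the (imperfect) process
    verifier scores the new prefix below `threshold`, discard the step and
    resample it, backtracking one level if retries are exhausted."""
    prefix = []
    for _ in range(max_steps):
        for _ in range(max_retries):
            step = generate_step(prefix)                  # oracle: propose next step
            if verify_prefix(prefix + [step]) >= threshold:
                prefix.append(step)
                break
        else:                                             # no acceptable continuation:
            if prefix:
                prefix.pop()                              # backtrack one step
        if prefix and prefix[-1] == "<done>":
            break
    return prefix
```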
              
             three years ago, DiT replaced the legacy unet with a transformer-based denoising backbone. we knew the bulky VAEs would be the next to go -- we just waited until we could do it right. today, we introduce Representation Autoencoders (RAE). >> Retire VAEs. Use RAEs. 👇(1/n) 
          
                
56 · 331 · 2K
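My reading of the announcement, as a toy sketch: replace the trained VAE encoder with a frozen, pretrained representation encoder, learn only a lightweight decoder back to pixels, and let the diffusion backbone model that representation space. The encoder interface and the MLP decoder below are assumptions for illustration, not the RAE architecture.

```python
import torch
import torch.nn as nn

class ToyRAE(nn.Module):
    """Frozen pretrained encoder + small learned decoder (no KL term, no
    sampled latent); the diffusion backbone (not shown) would operate on
    the encoder's representations."""
    def __init__(self, encoder: nn.Module, d_rep: int, n_pixels: int):
        super().__init__()
        self.encoder = encoder.eval()
        for p in self.encoder.parameters():
            p.requires_grad_(False)                  # encoder stays frozen
        self.decoder = nn.Sequential(                # learned pixel decoder
            nn.Linear(d_rep, 4 * d_rep), nn.GELU(),
            nn.Linear(4 * d_rep, n_pixels),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        with torch.no_grad():
            z = self.encoder(x)                      # (batch, d_rep) representation
        return self.decoder(z)                       # reconstruction trained against x
```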
              
             Introducing *dual representations*! tl;dr: We represent a state by the "set of similarities" to all other states. This dual perspective has lots of nice properties and practical benefits in RL. Blog post:  https://t.co/lw1PortD9E  Paper:  https://t.co/zYKFjyOy7C  ↓ 
          
                
14 · 96 · 787
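The core idea reads naturally as: describe a state not by its own features but by its vector of similarities to other states. A tiny illustration follows; the RBF kernel and the reference set are my own choices for the example, not the paper's construction.

```python
import numpy as np

def dual_representation(state, reference_states, similarity):
    """Represent `state` by its similarities to a collection of other states."""
    return np.array([similarity(state, s) for s in reference_states])

# Illustrative RBF similarity on raw feature vectors.
rbf = lambda a, b: float(np.exp(-np.sum((np.asarray(a) - np.asarray(b)) ** 2)))
references = [np.zeros(3), np.ones(3), np.array([1.0, 0.0, 1.0])]
phi = dual_representation(np.array([0.5, 0.5, 0.5]), references, rbf)  # shape (3,)
```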
              
             💡Can we trust synthetic data for statistical inference? We show that synthetic data (e.g. LLM simulations) can significantly improve the performance of inference tasks. The key intuition lies in the interactions between the moments of synthetic data and those of real data 
          
                
2 · 36 · 135
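The intuition about mixing real and synthetic moments can be pictured with a deliberately naive estimator: shrink the real-data mean toward the synthetic-data mean by a tunable amount, trading bias from imperfect simulations against variance from scarce real samples. This is only a cartoon of the idea; the weighting rule below is an assumption, not the paper's estimator.

```python
import numpy as np

def shrinkage_mean(real, synthetic, lam=0.3):
    """Naive moment combination: lam = 0 trusts only real data,
    lam = 1 trusts only the synthetic (e.g. LLM-simulated) moment."""
    return (1.0 - lam) * float(np.mean(real)) + lam * float(np.mean(synthetic))

# A few real observations plus many cheap synthetic draws.
real = [2.1, 1.8, 2.4]
synthetic = np.random.default_rng(0).normal(loc=2.0, scale=0.5, size=10_000)
estimate = shrinkage_mean(real, synthetic)
```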
              
             There’s been a lot of work on unlearning in LLMs, trying to erase memorization without hurting capabilities — but we haven’t seen much success. ❓What if unlearning is actually doomed from the start? 👇This thread explains why and how *memorization sinks* offer a new way forward. 
          
                
6 · 39 · 174
              
             LLMs lose diversity after RL post-training, and this hurts test-time scaling & creativity. Why does this collapse happen, and how can we fix it? Our new work introduces: 🔍 RL as Sampling (analysis) 🗺️ Outcome-based Exploration (intervention) [1/n] 
          
                
9 · 87 · 468
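A small sketch of what an outcome-level exploration bonus could look like, based only on the name in the tweet: reward the policy for producing final answers it has not produced often, so RL post-training keeps sampling distinct answers instead of collapsing onto one. The count-based bonus is an illustrative assumption, not the paper's intervention.

```python
from collections import Counter

outcome_counts: Counter = Counter()

def reward_with_outcome_bonus(task_reward: float, outcome: str, scale: float = 0.1) -> float:
    """Add a count-based bonus on the *final outcome* (e.g. the answer string),
    not on intermediate tokens or states, to preserve answer diversity."""
    outcome_counts[outcome] += 1
    return task_reward + scale / outcome_counts[outcome] ** 0.5
```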
              
             Coming from a computer vision background and now in sequence modeling, I’m often struck by how disconnected LLMs and vision feel. Our work, AUSM, treats video as language -- and it reveals a few blind spots we’ve overlooked. 
           We connect the autoregressive pipeline of LLMs with streaming video perception. Introducing AUSM: Autoregressive Universal Video Segmentation Model. A step toward unified, scalable video perception — inspired by how LLMs unified NLP. 📝 
          
                
4 · 8 · 134
              
1/Pretraining is hitting a data wall; scaling raw web data alone leads to diminishing returns. Today @datologyai shares BeyondWeb, our synthetic data approach & all the learnings from scaling it to trillions of tokens 🧑🏼‍🍳 - 3B LLMs beat 8B models🚀 - Pareto frontier for performance 
          
                
23 · 124 · 713
              
             Self-Questioning Language Models: LLMs that learn to generate their own questions and answers via asymmetric self-play RL. There is no external training data – the only input is a single prompt specifying the topic. 
          
                
27 · 183 · 1K
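A toy sketch of an asymmetric self-play round as the tweet describes it: a proposer writes questions about the topic, a solver answers each one several times, and rewards come only from the interaction itself, with no external data. `propose`, `solve`, and `score_answers` stand in for LLM calls and a self-consistency style reward; the proposer's "medium difficulty" reward is my own assumption, not the paper's objective.

```python
def self_questioning_round(propose, solve, score_answers, topic_prompt, n_questions=4):
    """One round of asymmetric self-play driven by a single topic prompt."""
    transcripts = []
    for _ in range(n_questions):
        question = propose(topic_prompt)
        answers = [solve(question) for _ in range(4)]
        solver_reward = score_answers(question, answers)        # e.g. agreement rate
        proposer_reward = 1.0 - abs(2.0 * solver_reward - 1.0)  # neither trivial nor impossible
        transcripts.append((question, answers, proposer_reward, solver_reward))
    return transcripts  # would feed an RL update (e.g. policy gradient) for both roles
```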
              
🚨 The era of infinite internet data is ending. So we ask: 👉 What’s the right generative modelling objective when data—not compute—is the bottleneck? TL;DR: ▶️Compute-constrained? Train Autoregressive models ▶️Data-constrained? Train Diffusion models Get ready for 🤿 1/n 
          
                
127 · 196 · 1K
              
             I'll be giving the first H-Net talk this afternoon at 4:30-5 PT at the ES-FoMo workshop! come support the fight against Big Token 🙏 
           Looking forward to seeing everyone for ES-FoMo part three tomorrow! We'll be in East Exhibition Hall A (the big one), and we've got an exciting schedule of invited talks, orals, and posters planned for you tomorrow. Let's meet some of our great speakers! 1/ 
            
                
4 · 11 · 139
              
1/So much of privacy research is designing post-hoc methods to make models memorization-free. It’s time we turn that around with architectural changes. Excited to add Memorization Sinks to the transformer architecture this #ICML2025 to isolate memorization during LLM training🧵 
          
                
1 · 24 · 61
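My reading of the architectural idea, as a toy gated MLP: route each training document through a document-specific slice of "sink" neurons alongside the shared ones, so verbatim memorization concentrates there and the slice can simply be switched off (or pruned) at inference. The hash-based gating and layer sizes below are illustrative assumptions, not the paper's design.

```python
import torch
import torch.nn as nn

class MemorizationSinkMLP(nn.Module):
    def __init__(self, d_model: int, d_shared: int, d_sink: int, n_groups: int = 64):
        super().__init__()
        self.shared = nn.Linear(d_model, d_shared)          # always-on pathway
        self.sinks = nn.Linear(d_model, d_sink * n_groups)  # document-specific sink groups
        self.out = nn.Linear(d_shared + d_sink * n_groups, d_model)
        self.d_sink, self.n_groups = d_sink, n_groups

    def forward(self, x: torch.Tensor, doc_id: int | None = None) -> torch.Tensor:
        h_shared = torch.relu(self.shared(x))
        h_sink = torch.relu(self.sinks(x))
        mask = torch.zeros_like(h_sink)
        if doc_id is not None:                              # training: open one sink group
            g = hash(doc_id) % self.n_groups
            mask[..., g * self.d_sink:(g + 1) * self.d_sink] = 1.0
        # at inference doc_id is None, so the sink pathway contributes nothing
        return self.out(torch.cat([h_shared, h_sink * mask], dim=-1))
```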
              
             Just realized we forgot to link the code, check it out! Model checkpoints are included so you can play with it yourself and see what boundaries it's learning Code:  https://t.co/BtQaU383xJ  Paper:  https://t.co/AVW1Rtzpqw  12/10 
          
            
arxiv.org: Major progress on language models (LMs) in recent years has largely resulted from moving away from specialized models designed for specific tasks, to general models based on powerful architectures...
            
                
2 · 16 · 99
              
             Albert has written amazing blog posts full of behind-the-scenes stories and wonderful insights about H-Net. You should check them out!  https://t.co/NL9Eus1YBa 
          
           This was an incredibly important project to me - I’ve wanted to solve it for years, but had no idea how. This was all @sukjun_hwang and @fluorane's amazing work! I wrote about the story of its development, and what might be coming next. The H-Net: 
          
                
5 · 5 · 106
              
We’re incredibly excited to see how H-Nets will allow models to learn more efficiently, with fewer priors and less pre-processing, across all sorts of modalities! This work was a collaboration with @cartesia_ai 10/10 
          
                
7 · 4 · 153
              
             Finally, a key ingredient of H-Net is using state space models (SSMs) such as Mamba layers in the outer stages. SSMs naturally compress data into their recurrent states, which is not only more efficient, but turns out to be crucial toward building higher-level abstractions. 9/ 
          
                
1 · 7 · 117
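The compression property this points to shows up even in the most minimal state-space recurrence: the entire history is folded into a fixed-size state, unlike attention's growing key-value cache. Below is a diagonal, time-invariant toy scan for illustration only; it is not Mamba's selective variant.

```python
import torch

def ssm_scan(A: torch.Tensor, B: torch.Tensor, C: torch.Tensor, x: torch.Tensor) -> torch.Tensor:
    """h_t = A * h_{t-1} + B * x_t ;  y_t = <C, h_t>.
    A, B, C: (d_state,) diagonal/readout parameters; x: (seq_len,) scalar inputs.
    The fixed-size state h must compress everything seen so far."""
    h = torch.zeros_like(B)
    ys = []
    for x_t in x:
        h = A * h + B * x_t          # elementwise recurrence (diagonal A)
        ys.append((C * h).sum())
    return torch.stack(ys)
```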
              
             DNA is an unusual “language”, and previous architectures showed different modeling power on DNA sequences (e.g., Mamba > Transformer). But any of them can be wrapped inside an H-Net for much stronger scaling, learning nearly 4 times as efficiently with data! 8/ 
          
                
2 · 11 · 149