
Shivam Duggal
@ShivamDuggal4
Followers: 1K · Following: 2K · Media: 17 · Statuses: 162
PhD Student @MIT | Prev: Carnegie Mellon University @SCSatCMU | Research Scientist @UberATG
Joined June 2017
Compression is the heart of intelligence. From Occam to Kolmogorov: shorter programs = smarter representations. Meet KARL: Kolmogorov-Approximating Representation Learning. Given an image, a token budget T, and a target quality 𝜖, KARL finds the smallest t ≤ T that reconstructs the image within 𝜖 🧵
14
63
354
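As a rough illustration of the objective in the tweet above (not the actual single-pass KARL architecture, which avoids test-time search), the sketch below brute-forces the quantity being approximated: the smallest token count t ≤ T whose reconstruction error stays within 𝜖. The `reconstruction_error` stand-in is hypothetical.

```python
def min_tokens_within_epsilon(image, budget_T, epsilon, reconstruction_error):
    """Brute-force version of the target KARL approximates in one pass:
    the smallest t <= T whose reconstruction error is within epsilon."""
    for t in range(1, budget_T + 1):
        if reconstruction_error(image, t) <= epsilon:
            return t        # smallest sufficient token count for this image
    return budget_T         # budget exhausted: fall back to the full cap

# Hypothetical stand-in for decode(encode(image, t)) vs. image; a real
# tokenizer/decoder pair would go here. The fake error curve decays with t.
fake_error = lambda image, t: 1.0 / (1.0 + t)
print(min_tokens_within_epsilon(image=None, budget_T=32, epsilon=0.05,
                                reconstruction_error=fake_error))  # -> 19
```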
Over the past year, my lab has been working on fleshing out theory/applications of the Platonic Representation Hypothesis. Today I want to share two new works on this topic: Eliciting higher alignment: https://t.co/KY4fjNeCBd Unpaired rep learning: https://t.co/vJTMoyJj5J 1/9
6
61
396
LLMs, trained only on text, might already know more about other modalities than we realized; we just need to find ways to elicit it. project page: https://t.co/8cIf1DW0OQ w/ @phillip_isola and @thisismyhat
16
63
571
[1/7] Paired multimodal learning shows that training with text can help vision models learn better image representations. But can unpaired data do the same? Our new work shows that the answer is yes! w/ @shobsund @ChenyuW64562111, Stefanie Jegelka and @phillip_isola
6
37
306
Reinforcement Learning (RL) has long been the dominant method for fine-tuning, powering many state-of-the-art LLMs. Methods like PPO and GRPO explore in action space. But can we instead explore directly in parameter space? YES we can. We propose a scalable framework for
88
387
3K
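The tweet is cut off before the framework itself, so the following is only a generic evolution-strategies-style sketch of what "exploring in parameter space" can mean: perturb the weights directly, score each perturbation with the task reward, and move along the reward-weighted noise. All names and hyperparameters here are illustrative, not the authors' method.

```python
import numpy as np

def parameter_space_step(theta, reward_fn, rng, sigma=0.1, lr=0.01, pop=16):
    """One ES-style update: exploration happens by perturbing the parameters
    themselves, not by sampling actions from a policy."""
    noise = rng.standard_normal((pop, theta.size))          # one perturbation per member
    rewards = np.array([reward_fn(theta + sigma * n) for n in noise])
    advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    grad_estimate = (advantages[:, None] * noise).mean(axis=0) / sigma
    return theta + lr * grad_estimate

# Toy reward: higher as the parameters approach a fixed target vector.
target = np.array([1.0, -2.0, 0.5])
reward = lambda th: -np.sum((th - target) ** 2)
rng = np.random.default_rng(0)
theta = np.zeros(3)
for _ in range(300):
    theta = parameter_space_step(theta, reward, rng)
print(theta)  # ends up close to the target with no action-space gradient at all
```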
I wrote this blog post that tries to go further toward design principles for neural nets and optimizers. The post presents a visual intro to optimization on normed manifolds and a Muon variant for the manifold of matrices with unit condition number https://t.co/EhhKN2Jylx
Efficient training of neural networks is difficult. Our second Connectionism post introduces Modular Manifolds, a theoretical step toward more stable and performant training by co-designing neural net optimizers with manifold constraints on weight matrices.
23
52
465
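The linked post is the reference for the actual Muon variant; as a minimal, generic illustration of "a manifold constraint on a weight matrix", the sketch below takes a plain gradient step and then retracts onto the set of matrices with unit condition number (all singular values equal). This is not the optimizer from the post.

```python
import numpy as np

def project_unit_condition(W):
    """Retract onto the set of matrices whose singular values are all equal
    (condition number 1): keep U, V and replace the spectrum by its mean."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    return s.mean() * (U @ Vt)

def constrained_step(W, grad, lr=0.1):
    # Ambient-space gradient step followed by retraction onto the manifold.
    return project_unit_condition(W - lr * grad)

rng = np.random.default_rng(0)
W = project_unit_condition(rng.standard_normal((4, 3)))
W_next = constrained_step(W, rng.standard_normal((4, 3)))
s = np.linalg.svd(W_next, compute_uv=False)
print(s.max() / s.min())  # condition number stays 1 (up to float error) after the step
```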
nearly everything in AI can be understood through the lens of compression
- the architecture is just a schema for when & how to compress
- optimization is a compression *process*, with its own compression level and duration
- (architecture + data + optimization) = model
- in other
68
67
1K
Amazing work @IdanShenfeld @jyo_pari SFT w/ per-token supervision is probably too constrained to map new/old data into a shared weight space. Wondering if adding continuous thinking tokens (so still no RL) before supervised prediction could relax this, while staying off-policy?
For agents to improve over time, they can’t afford to forget what they’ve already mastered. We found that supervised fine-tuning forgets more than RL when training on a new task! Want to find out why? 👇
0
1
11
New paper! We explore a radical paradigm for AI evals: assessing LLMs on *unsolved* questions. Instead of contrived exams where progress ≠ value, we eval LLMs on organic, unsolved problems via reference-free LLM validation & community verification. LLMs solved ~10/500 so far:
15
74
369
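The paper's actual pipeline isn't described in the tweet; the sketch below is only a guess at the shape of "reference-free LLM validation followed by community verification": several judge models vote on a candidate answer with no gold reference, and a quorum promotes it to human review. Every name here (`Submission`, `reference_free_screen`, the toy judge) is hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class Submission:
    question: str
    answer: str
    judge_votes: List[bool] = field(default_factory=list)

def reference_free_screen(sub: Submission,
                          judges: List[Callable[[str, str], bool]],
                          quorum: float = 0.8) -> bool:
    """No gold answer exists, so several LLM judges independently check the
    answer for validity; a vote fraction >= quorum sends it to human review."""
    sub.judge_votes = [judge(sub.question, sub.answer) for judge in judges]
    return sum(sub.judge_votes) / len(judges) >= quorum

# Placeholder judge; a real one would wrap an LLM call that critiques the answer.
toy_judge = lambda question, answer: len(answer.strip()) > 0
sub = Submission("An unsolved research question", "A candidate answer")
print(reference_free_screen(sub, judges=[toy_judge] * 3))  # True -> escalate to community verification
```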
Enjoying GPT-5 a lot! Research Q: maybe intelligence is discovering the simplest algorithm that generalizes (5-digit → N-digit addition). GPT-5 may be close for +/*, but what enables RL on top of (constrained) next-token pretraining to discover the least-KC algorithm for all tasks? Thoughts?
There will always exist an input of length N for which multiplication using autoregressive models will break. Example of GPT-5 thinking getting 10-digit multiplication wrong below: correct answer 88,723,505,107,555,515,626 vs GPT's answer 88,723,505,296,555,515,626
0
1
1
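Worth noting: the two candidate products in the quoted tweet differ by exactly 189,000,000,000, i.e. only a block of middle digits is off. Checking such claims is trivial outside the model, since Python integers are arbitrary precision. The operands below are made up, because the quoted tweet only reports the two candidate products.

```python
def check_product_claim(a: int, b: int, claimed: int) -> bool:
    """Exact check of a multiplication answer; Python ints never overflow,
    so any digit count is handled by a single comparison."""
    return a * b == claimed

# Made-up 10-digit operands (the quoted tweet does not give the originals).
a, b = 9_409_123_457, 9_428_765_311
truth = a * b
print(truth)
print(check_product_claim(a, b, truth))                     # True
print(check_product_claim(a, b, truth + 189_000_000_000))   # False: differs in the middle of the number
```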
Talking about KARL today — our recent work on a Kolmogorov Complexity–inspired adaptive tokenizer. Details about the paper here: https://t.co/H10MkJyM8o More broadly, quite excited about representation learning — and understanding large models — through the lens of compression.
@ShivamDuggal4 of @MIT in a deep dive @ploutosai w/ @ceciletamura, Head of Community. Don't miss it! https://t.co/Sfu9LMNKNq
0
2
21
Strongest compressors might not be the best decoders for your task. RL can adapt pre-trained models into more "sophisticated" decoders, tuned to the task’s specific demands. Exciting thread & research! Question: is next-token prediction really the final chapter in pretraining?
Really cool thread! Cross entropy = optimal compressor for the observed pretraining data. RL objective = reward-weighted compression objective.
0
2
10
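The "cross entropy = optimal compressor" line is a standard Shannon / arithmetic-coding fact, illustrated below: a model that assigns probability p to the observed next token can encode it in about -log2(p) bits, so total code length is just the base-2 cross entropy summed over the sequence.

```python
import math

def code_length_bits(token_probs):
    """Total code length of a sequence under a model: the sum of -log2(p) for
    the probability the model assigned to each observed token (its base-2
    cross entropy), which arithmetic coding achieves to within ~2 bits."""
    return sum(-math.log2(p) for p in token_probs)

# Probabilities a hypothetical LM assigned to the tokens it actually saw.
observed = [0.5, 0.25, 0.9, 0.1]
print(code_length_bits(observed))  # ~6.47 bits: better prediction = shorter code
```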
We ran more experiments to better understand “why” diffusion models do better in data-constrained settings than autoregressive. Our findings support the hypothesis that diffusion models benefit from learning over multiple token orderings, which contributes to their robustness and
@mihirp98 Interesting, thanks! So this is your next hypothesis then, right? > Diffusion models randomly factorize the joint, which enables them to generate tokens in random orders, which we think can’t simply be recreated by just random input masking while still having the next token
8
62
545
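A toy illustration of "learning over multiple token orderings" (not the authors' experiment): the chain rule factorizes the joint under any permutation of positions, and a training loss that samples a random order each step exposes the model to many such factorizations, which is the flexibility the thread attributes to diffusion. The `uniform` conditional below is a placeholder model.

```python
import math
import random

def log_joint_under_order(tokens, order, cond_logprob):
    """Chain rule under an arbitrary ordering: sums log p(x_pos | revealed
    context) while revealing positions in the given permutation."""
    total, context = 0.0, {}
    for pos in order:
        total += cond_logprob(tokens[pos], context)
        context[pos] = tokens[pos]
    return total

# Placeholder conditional: uniform over a 4-symbol vocabulary, so every
# ordering yields the same log-likelihood for this toy model.
uniform = lambda token, context: math.log(0.25)
tokens = ["a", "b", "b", "c"]
print(log_joint_under_order(tokens, [0, 1, 2, 3], uniform))                 # left-to-right (AR-style)
print(log_joint_under_order(tokens, random.sample(range(4), 4), uniform))   # a random ordering
```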
One "Skild brain" powers all embodiments—amazing work! Huge congratulations to entire team. Excited to see what’s next. Miss you all <3 !
Modern AI is confined to the digital world. At Skild AI, we are building towards AGI for the real world, unconstrained by robot type or task — a single, omni-bodied brain. Today, we are sharing our journey, starting with early milestones, with more to come in the weeks ahead.
0
2
18
For @NeurIPSConf, we can't update the main PDF or upload a separate rebuttal PDF — so no way to include any new images or visual results? What if reviewers ask for more vision experiments? 🥲 Any suggestions or workarounds?
5
0
11
Great work from great people! @mihirp98 @pathak2206 AR aligns w/ compression theory (KC, MDL, arithmetic coding), but diffusion is MLE too. Can we interpret diffusion similarly? Curious how compression explains AR vs. diffusion scaling laws. (Ilya’s talk touches on this too.)
🚨 The era of infinite internet data is ending, So we ask: 👉 What’s the right generative modelling objective when data—not compute—is the bottleneck? TL;DR: ▶️Compute-constrained? Train Autoregressive models ▶️Data-constrained? Train Diffusion models Get ready for 🤿 1/n
1
2
12
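For reference, the textbook relations behind "AR aligns with compression theory" and the question of whether diffusion can be read the same way (standard facts, not results from the quoted thread): arithmetic coding under an autoregressive model achieves roughly the negative log-likelihood in bits, while a diffusion model's variational bound upper-bounds that same code length.

```latex
% Autoregressive model: arithmetic coding reaches the NLL to within ~2 bits.
L_{\mathrm{AR}}(x) \;\approx\; -\sum_{t=1}^{n} \log_2 p_\theta(x_t \mid x_{<t})

% Diffusion model: the negative ELBO upper-bounds the NLL, so it still gives a
% (possibly looser) code length for x.
-\log_2 p_\theta(x) \;\le\; \mathbb{E}_{q(z_{1:T}\mid x)}\!\left[ \log_2 \frac{q(z_{1:T}\mid x)}{p_\theta(x,\, z_{1:T})} \right]
```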
Indeed! I find H-Net to be closely related to KARL — and even our earlier work ALIT (the recurrent tokenizer in the figure below) shares strong connections. Loved reading H-Net, like all of @_albertgu’s work. Congrats to @sukjun_hwang and team!
Single-pass Adaptive Image Tokenization for Minimum Program Search KARL is a single-pass adaptive image tokenizer that predicts how many tokens are needed based on Kolmogorov Complexity, without test-time search. It halts once enough information is captured, using token count as
1
3
31
Our new work on adaptive image tokenization: Image → T tokens
* variable T, based on image complexity
* single forward pass both infers T and tokenizes to T tokens
* approximates minimum description length encoding of the image
Compression is the heart of intelligence. From Occam to Kolmogorov: shorter programs = smarter representations. Meet KARL: Kolmogorov-Approximating Representation Learning. Given an image, a token budget T, and a target quality 𝜖, KARL finds the smallest t ≤ T that reconstructs the image within 𝜖 🧵
0
30
202
Excited to share this work on studying representation learning from a compression perspective! Grateful to my amazing advisors—Professors Bill Freeman, Antonio Torralba, @phillip_isola @MITCSAIL 📄 Paper: https://t.co/DkzCgYNhS5 💻 Code: https://t.co/zUHzgd79qb AIT meets AIT!
github.com
Single-pass Adaptive Image Tokenization for Minimum Program Search | What's the Kolmogorov Complexity of an Image? - ShivamDuggal4/karl
0
1
14
Hint at modeling interestingness! 👀 Adaptive image tokenizers may go beyond KC—capturing sophistication or logical depth? Measure Δ in reconstruction as tokens increase: Big Δ → structure; Small Δ → trivial/noise; Mid Δ → maybe… interesting? Future work awaits! (13/n)
1
0
4
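A tiny sketch of the Δ-in-reconstruction probe from the tweet above (illustrative only; the quality curves are fake stand-ins for an actual adaptive tokenizer): compute how much reconstruction quality improves each time one more token is allowed, then read the profile of those gains.

```python
def gain_profile(image, max_tokens, quality_fn):
    """Per-budget reconstruction gains: quality(t+1) - quality(t). Sharp early
    gains suggest structure, a flat profile suggests trivial content or noise,
    and sustained mid-sized gains are the 'maybe interesting' regime."""
    q = [quality_fn(image, t) for t in range(1, max_tokens + 1)]
    return [b - a for a, b in zip(q, q[1:])]

# Fake quality curves standing in for decode(encode(image, t)):
structured = lambda img, t: 1 - 0.5 ** t        # saturates quickly
noise_like = lambda img, t: 0.10 + 0.001 * t    # barely improves with more tokens
print(gain_profile(None, 6, structured))   # big, shrinking deltas
print(gain_profile(None, 6, noise_like))   # uniformly tiny deltas
```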
KC isn’t everything! There is more to AIT. Pure noise & rich structure can both have high KC—but only one is interesting. What’s truly interesting often lies in the middle: patterns that are partly predictable, partly surprising. (See @pbloemesquire’s excellent slide deck!)
1
2
6