Shivam Duggal Profile
Shivam Duggal

@ShivamDuggal4

Followers: 1K · Following: 2K · Media: 17 · Statuses: 162

PhD Student @MIT | Prev: Carnegie Mellon University @SCSatCMU | Research Scientist @UberATG

Joined June 2017
@ShivamDuggal4
Shivam Duggal
3 months
Compression is the heart of intelligence. From Occam to Kolmogorov: shorter programs = smarter representations. Meet KARL: Kolmogorov-Approximating Representation Learning. Given an image, a token budget T, and a target quality 𝜖, KARL finds the smallest t ≤ T to reconstruct it within 𝜖 🧵
14
63
354
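A minimal sketch of the idea in the KARL tweet above, assuming a hypothetical encoder/decoder interface; `encode`, `decode`, and `recon_error` are placeholders rather than KARL's actual API, and KARL itself infers the token count in a single forward pass rather than by the explicit search shown here.

```python
# Hedged sketch: pick the smallest token count t <= T whose reconstruction
# meets the target quality eps. encode/decode/recon_error are placeholders.
def smallest_sufficient_t(image, encode, decode, recon_error, T, eps):
    tokens = encode(image, num_tokens=T)      # hypothetical: produce T tokens once
    for t in range(1, T + 1):
        recon = decode(tokens[:t])            # reconstruct from the first t tokens
        if recon_error(image, recon) <= eps:  # good enough: halt at this budget
            return t, recon
    return T, decode(tokens)                  # budget exhausted: use all T tokens
```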
@phillip_isola
Phillip Isola
20 hours
Over the past year, my lab has been working on fleshing out theory/applications of the Platonic Representation Hypothesis. Today I want to share two new works on this topic: Eliciting higher alignment: https://t.co/KY4fjNeCBd Unpaired rep learning: https://t.co/vJTMoyJj5J 1/9
6
61
396
@SophieLWang
Sophie Wang
5 days
LLMs, trained only on text, might already know more about other modalities than we realized; we just need to find ways to elicit it. project page: https://t.co/8cIf1DW0OQ w/ @phillip_isola and @thisismyhat
16
63
571
@sharut_gupta
Sharut Gupta
1 day
[1/7] Paired multimodal learning shows that training with text can help vision models learn better image representations. But can unpaired data do the same? Our new work shows that the answer is yes! w/ @shobsund @ChenyuW64562111, Stefanie Jegelka and @phillip_isola
6
37
306
@yule_gan
Yulu Gan
5 days
Reinforcement Learning (RL) has long been the dominant method for fine-tuning, powering many state-of-the-art LLMs. Methods like PPO and GRPO explore in action space. But can we instead explore directly in parameter space? YES we can. We propose a scalable framework for
88
387
3K
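Since the tweet above is truncated and does not name its method, here is only a generic sketch of what "exploring in parameter space" can look like, in the style of evolution strategies with reward-weighted perturbations; it is not the framework proposed in that thread.

```python
import numpy as np

# Generic parameter-space exploration (evolution-strategies style), shown only to
# illustrate the contrast with action-space exploration in PPO/GRPO.
def es_step(theta, reward_fn, sigma=0.02, pop=32, lr=0.1, seed=0):
    rng = np.random.default_rng(seed)
    noise = rng.standard_normal((pop, theta.size))              # perturb the weights directly
    rewards = np.array([reward_fn(theta + sigma * n) for n in noise])
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)   # normalize rewards
    grad_est = (adv[:, None] * noise).mean(axis=0) / sigma      # reward-weighted direction
    return theta + lr * grad_est
```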
@jxbz
Jeremy Bernstein
15 days
I wrote this blog post that tries to go further toward design principles for neural nets and optimizers. The post presents a visual intro to optimization on normed manifolds and a Muon variant for the manifold of matrices with unit condition number https://t.co/EhhKN2Jylx
@thinkymachines
Thinking Machines
15 days
Efficient training of neural networks is difficult. Our second Connectionism post introduces Modular Manifolds, a theoretical step toward more stable and performant training by co-designing neural net optimizers with manifold constraints on weight matrices.
23
52
465
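Not the Modular Manifolds construction itself, just a toy sketch of the general pattern the post discusses: take an optimizer step, then pull the weight matrix back onto a chosen constraint set. The unit-spectral-norm projection below is a stand-in assumption, not the constraint used in the blog post.

```python
import numpy as np

# Toy manifold-constrained update (stand-in constraint, NOT the Modular Manifolds method):
# unconstrained gradient step, then retract the matrix to unit spectral norm.
def constrained_step(W, grad, lr=1e-2):
    W = W - lr * grad                        # plain gradient step
    sigma_max = np.linalg.norm(W, 2)         # largest singular value
    return W / max(sigma_max, 1e-12)         # rescale so the spectral norm is 1
```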
@jxmnop
Jack Morris
1 month
nearly everything in AI can be understood through the lens of compression
- the architecture is just a schema for when & how to compress
- optimization is a compression *process*, with its own compression level and duration
- (architecture + data + optimization) = model
- in other
68
67
1K
@ShivamDuggal4
Shivam Duggal
1 month
Amazing work @IdanShenfeld @jyo_pari! SFT w/ per-token supervision is probably too constrained to map new/old data into a shared weight space. Wondering if adding continuous thinking tokens (so still no RL) before supervised prediction could relax this, while staying off-policy?
@jyo_pari
Jyo Pari
1 month
For agents to improve over time, they can’t afford to forget what they’ve already mastered. We found that supervised fine-tuning forgets more than RL when training on a new task! Want to find out why? 👇
0
1
11
@kenziyuliu
Ken Liu
2 months
New paper! We explore a radical paradigm for AI evals: assessing LLMs on *unsolved* questions. Instead of contrived exams where progress ≠ value, we eval LLMs on organic, unsolved problems via reference-free LLM validation & community verification. LLMs solved ~10/500 so far:
15
74
369
@ShivamDuggal4
Shivam Duggal
2 months
Enjoying GPT-5 a lot! Research Q: Maybe intelligence is discovering the simplest algorithm that generalizes (5→N digit addition). GPT-5 may be close for +/*, but what enables RL on top of (constrained) next-token pretraining to discover the least-KC algorithm for all tasks? Thoughts?
@DimitrisPapail
Dimitris Papailiopoulos
2 months
There will always exist an input of length N for which multiplication using autoregressive models will break. Example of GPT-5 Thinking getting 10-digit multiplication wrong below: correct answer 88,723,505,107,555,515,626 vs GPT's answer 88,723,505,296,555,515,626
0
1
1
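The operands behind the example above are not shown in the thread, so the numbers below are placeholders; the point is only that a model's claimed product can be checked against Python's exact integer arithmetic in one comparison.

```python
# Exact-arithmetic check of a claimed product. Operands are hypothetical;
# the thread only shows the correct and the incorrect 21-digit products.
a, b = 9_876_543_210, 8_765_432_109      # placeholder 10-digit operands
claimed = a * b + 189_000_000_000        # stand-in error, mirroring the offset in the tweet
print(a * b == claimed)                  # False: the claim fails the exact check
```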
@ShivamDuggal4
Shivam Duggal
2 months
Talking about KARL today — our recent work on a Kolmogorov Complexity–inspired adaptive tokenizer. Details about the paper here: https://t.co/H10MkJyM8o More broadly, quite excited about representation learning — and understanding large models — through the lens of compression.
@ceciletamura
Cecile Tamura
2 months
@ShivamDuggal4 of @MIT in a deep dive @ploutosai w/ @ceciletamura, Head of Community. Don't miss it! https://t.co/Sfu9LMNKNq
0
2
21
@ShivamDuggal4
Shivam Duggal
2 months
Strongest compressors might not be the best decoders for your task. RL can adapt pre-trained models into more "sophisticated" decoders, tuned to the task’s specific demands. Exciting thread & research! Question: is next-token prediction really the final chapter in pretraining?
@_arohan_
rohan anil
2 months
Really cool thread! Cross entropy = optimal compressor for observed pretraining data. RL objective = reward-weighted compression objective.
0
2
10
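A small numeric illustration of the "cross entropy = compressor" reading above: under arithmetic coding, a token the model assigns probability p costs about -log2 p bits, so the average cross-entropy in bits is the compressed length per token. The probabilities below are made up.

```python
import math

# Cross-entropy in bits = average ideal code length per token under arithmetic coding.
# The probabilities the "model" assigns to the observed tokens here are invented.
p_observed = [0.5, 0.25, 0.9, 0.05]
bits_per_token = [-math.log2(p) for p in p_observed]     # -log2 p bits for each token
print(sum(bits_per_token) / len(bits_per_token))         # avg bits/token == cross-entropy
```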
@mihirp98
Mihir Prabhudesai
2 months
We ran more experiments to better understand “why” diffusion models do better in data-constrained settings than autoregressive. Our findings support the hypothesis that diffusion models benefit from learning over multiple token orderings, which contributes to their robustness and
@giffmana
Lucas Beyer (bl16)
3 months
@mihirp98 Interesting, thanks! So this is your next hypothesis then, right? > Diffusion models randomly factorize the joint, which enables them to generate tokens in random orders, which we think can’t simply be recreated by just random input masking while still having the next token
8
62
545
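One way to make the "multiple token orderings" point above concrete: the joint p(x_1..x_n) can be factorized along any permutation of positions, and a left-to-right AR model only ever trains the conditionals of one ordering. The snippet below just enumerates the conditionals implied by a random ordering; it is an illustration, not either paper's training objective.

```python
import random

# List the (target position | observed prefix) conditionals for one random ordering.
# A left-to-right AR model uses only the identity ordering; any-order objectives
# expose the model to many such factorizations of the same joint.
def random_factorization(positions, seed=0):
    order = list(positions)
    random.Random(seed).shuffle(order)
    return [(pos, tuple(order[:i])) for i, pos in enumerate(order)]

print(random_factorization(range(4)))   # e.g. predict x2 | {}, then x0 | {x2}, ...
```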
@ShivamDuggal4
Shivam Duggal
2 months
One "Skild brain" powers all embodiments—amazing work! Huge congratulations to the entire team. Excited to see what’s next. Miss you all <3 !
@SkildAI
Skild AI
2 months
Modern AI is confined to the digital world. At Skild AI, we are building towards AGI for the real world, unconstrained by robot type or task — a single, omni-bodied brain. Today, we are sharing our journey, starting with early milestones, with more to come in the weeks ahead.
0
2
18
@ShivamDuggal4
Shivam Duggal
3 months
For @NeurIPSConf, we can't update the main PDF or upload a separate rebuttal PDF — so no way to include any new images or visual results? What if reviewers ask for more vision experiments? 🥲 Any suggestions or workarounds?
5
0
11
@ShivamDuggal4
Shivam Duggal
3 months
Great work from great people! @mihirp98 @pathak2206 AR aligns w/ compression theory (KC, MDL, arithmetic coding), but diffusion is MLE too. Can we interpret diffusion similarly? Curious how compression explains AR vs. diffusion scaling laws. (Ilya’s talk touches on this too.)
@mihirp98
Mihir Prabhudesai
3 months
🚨 The era of infinite internet data is ending, So we ask: 👉 What’s the right generative modelling objective when data—not compute—is the bottleneck? TL;DR: ▶️Compute-constrained? Train Autoregressive models ▶️Data-constrained? Train Diffusion models Get ready for 🤿 1/n
1
2
12
@ShivamDuggal4
Shivam Duggal
3 months
Indeed! I find H-Net to be closely related to KARL — and even our earlier work ALIT (the recurrent tokenizer in the figure below) shares strong connections. Loved reading H-Net, like all of @_albertgu’s work. Congrats to @sukjun_hwang and team!
@gm8xx8
𝚐𝔪𝟾𝚡𝚡𝟾
3 months
Single-pass Adaptive Image Tokenization for Minimum Program Search: KARL is a single-pass adaptive image tokenizer that predicts how many tokens are needed based on Kolmogorov Complexity, without test-time search. It halts once enough information is captured, using token count as
1
3
31
@phillip_isola
Phillip Isola
3 months
Our new work on adaptive image tokenization: Image —> T tokens
* variable T, based on image complexity
* single forward pass both infers T and tokenizes to T tokens
* approximates minimum description length encoding of the image
@ShivamDuggal4
Shivam Duggal
3 months
Compression is the heart of intelligence. From Occam to Kolmogorov: shorter programs = smarter representations. Meet KARL: Kolmogorov-Approximating Representation Learning. Given an image, a token budget T, and a target quality 𝜖, KARL finds the smallest t ≤ T to reconstruct it within 𝜖 🧵
0
30
202
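A back-of-envelope reading of the "minimum description length" claim above, under the assumption of a discrete codebook: t tokens drawn from a codebook of size V cost roughly t * log2(V) bits, ignoring the decoder's own description length. The codebook size and budgets below are illustrative, not KARL's.

```python
import math

# Rough description length of an image under a discrete tokenizer:
# t tokens from a codebook of size V ~ t * log2(V) bits (decoder cost ignored).
def description_length_bits(num_tokens, codebook_size=1024):
    return num_tokens * math.log2(codebook_size)

for t in (16, 64, 256):
    print(t, description_length_bits(t))   # simpler image -> fewer tokens -> fewer bits
```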
@ShivamDuggal4
Shivam Duggal
3 months
Excited to share this work on studying representation learning from a compression perspective! Grateful to my amazing advisors—Professors Bill Freeman, Antonio Torralba, @phillip_isola @MITCSAIL 📄 Paper: https://t.co/DkzCgYNhS5 💻 Code: https://t.co/zUHzgd79qb AIT meets AIT!
Link card (github.com): Single-pass Adaptive Image Tokenization for Minimum Program Search | What's the Kolmogorov Complexity of an Image? · ShivamDuggal4/karl
0
1
14
@ShivamDuggal4
Shivam Duggal
3 months
Hint at modeling interestingness! 👀 Adaptive image tokenizers may go beyond KC—capturing sophistication or logical depth? Measure Δ in reconstruction as tokens increase:
Big Δ → structure
Small Δ → trivial/noise
Mid Δ → maybe… interesting?
Future work awaits! (13/n)
1
0
4
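A hedged sketch of the Δ heuristic in the tweet above: track how much reconstruction quality improves as the token budget grows, then bucket images by the size of the gain. `quality_at`, the budgets, and the thresholds are placeholders, not values from the paper.

```python
# Track the improvement in reconstruction quality as the token budget grows,
# then bucket by the average gain, mirroring the Big/Small/Mid Δ heuristic above.
def delta_profile(quality_at, budgets=(4, 16, 64, 256)):
    q = [quality_at(t) for t in budgets]          # quality_at: placeholder callable
    return [b - a for a, b in zip(q, q[1:])]      # per-step improvement Δ

def bucket(deltas, lo=0.02, hi=0.2):              # thresholds are illustrative
    avg = sum(deltas) / len(deltas)
    if avg > hi:
        return "structured (big Δ)"
    if avg < lo:
        return "trivial or noise (small Δ)"
    return "maybe interesting (mid Δ)"
```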
@ShivamDuggal4
Shivam Duggal
3 months
KC isn’t everything! There is more to AIT. Pure noise & rich structure can both have high KC—but only one is interesting. What’s truly interesting often lies in the middle: patterns that are partly predictable, partly surprising. (See @pbloemesquire’s excellent slide deck!)
1
2
6