
Shivam Duggal
@ShivamDuggal4
Followers: 1K · Following: 2K · Media: 17 · Statuses: 162
PhD Student @MIT | Prev: Carnegie Mellon University @SCSatCMU | Research Scientist @UberATG
Joined June 2017
Compression is the heart of intelligence. From Occam to Kolmogorov: shorter programs = smarter representations. Meet KARL: Kolmogorov-Approximating Representation Learning. Given an image, a token budget T, and a target quality 𝜖, KARL finds the smallest t ≤ T that reconstructs the image within 𝜖 🧵
14
63
354
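As a rough illustration of the objective in the tweet above (not the actual single-pass KARL architecture, which avoids test-time search), the sketch below brute-forces the quantity being approximated: the smallest token count t ≤ T whose reconstruction error stays within 𝜖. The `reconstruction_error` stand-in is hypothetical.

```python
def min_tokens_within_epsilon(image, budget_T, epsilon, reconstruction_error):
    """Brute-force version of the target KARL approximates in one pass:
    the smallest t <= T whose reconstruction error is within epsilon."""
    for t in range(1, budget_T + 1):
        if reconstruction_error(image, t) <= epsilon:
            return t        # smallest sufficient token count for this image
    return budget_T         # budget exhausted: fall back to the full cap

# Hypothetical stand-in for decode(encode(image, t)) vs. image; a real
# tokenizer/decoder pair would go here. The fake error curve decays with t.
fake_error = lambda image, t: 1.0 / (1.0 + t)
print(min_tokens_within_epsilon(image=None, budget_T=32, epsilon=0.05,
                                reconstruction_error=fake_error))  # -> 19
```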
Over the past year, my lab has been working on fleshing out theory/applications of the Platonic Representation Hypothesis. Today I want to share two new works on this topic: Eliciting higher alignment: https://t.co/KY4fjNeCBd Unpaired rep learning: https://t.co/vJTMoyJj5J 1/9
6
61
396
LLMs, trained only on text, might already know more about other modalities than we realized; we just need to find ways to elicit it. project page: https://t.co/8cIf1DW0OQ w/ @phillip_isola and @thisismyhat
16
63
571
[1/7] Paired multimodal learning shows that training with text can help vision models learn better image representations. But can unpaired data do the same? Our new work shows that the answer is yes! w/ @shobsund @ChenyuW64562111, Stefanie Jegelka and @phillip_isola
6
37
306
Reinforcement Learning (RL) has long been the dominant method for fine-tuning, powering many state-of-the-art LLMs. Methods like PPO and GRPO explore in action space. But can we instead explore directly in parameter space? YES we can. We propose a scalable framework for
88
387
3K
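The tweet is cut off before the framework itself, so the following is only a generic evolution-strategies-style sketch of what "exploring in parameter space" can mean: perturb the weights directly, score each perturbation with the task reward, and move along the reward-weighted noise. All names and hyperparameters here are illustrative, not the authors' method.

```python
import numpy as np

def parameter_space_step(theta, reward_fn, rng, sigma=0.1, lr=0.01, pop=16):
    """One ES-style update: exploration happens by perturbing the parameters
    themselves, not by sampling actions from a policy."""
    noise = rng.standard_normal((pop, theta.size))          # one perturbation per member
    rewards = np.array([reward_fn(theta + sigma * n) for n in noise])
    advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    grad_estimate = (advantages[:, None] * noise).mean(axis=0) / sigma
    return theta + lr * grad_estimate

# Toy reward: higher as the parameters approach a fixed target vector.
target = np.array([1.0, -2.0, 0.5])
reward = lambda th: -np.sum((th - target) ** 2)
rng = np.random.default_rng(0)
theta = np.zeros(3)
for _ in range(300):
    theta = parameter_space_step(theta, reward, rng)
print(theta)  # ends up close to the target with no action-space gradient at all
```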
I wrote this blog post that tries to go further toward design principles for neural nets and optimizers. The post presents a visual intro to optimization on normed manifolds and a Muon variant for the manifold of matrices with unit condition number https://t.co/EhhKN2Jylx
Efficient training of neural networks is difficult. Our second Connectionism post introduces Modular Manifolds, a theoretical step toward more stable and performant training by co-designing neural net optimizers with manifold constraints on weight matrices.
23
52
465
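The linked post is the reference for the actual Muon variant; as a minimal, generic illustration of "a manifold constraint on a weight matrix", the sketch below takes a plain gradient step and then retracts onto the set of matrices with unit condition number (all singular values equal). This is not the optimizer from the post.

```python
import numpy as np

def project_unit_condition(W):
    """Retract onto the set of matrices whose singular values are all equal
    (condition number 1): keep U, V and replace the spectrum by its mean."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    return s.mean() * (U @ Vt)

def constrained_step(W, grad, lr=0.1):
    # Ambient-space gradient step followed by retraction onto the manifold.
    return project_unit_condition(W - lr * grad)

rng = np.random.default_rng(0)
W = project_unit_condition(rng.standard_normal((4, 3)))
W_next = constrained_step(W, rng.standard_normal((4, 3)))
s = np.linalg.svd(W_next, compute_uv=False)
print(s.max() / s.min())  # condition number stays 1 (up to float error) after the step
```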
nearly everything in AI can be understood through the lens of compression
- the architecture is just a schema for when & how to compress
- optimization is a compression *process*, with its own compression level and duration
- (architecture + data + optimization) = model
- in other
68
67
1K
Amazing work @IdanShenfeld @jyo_pari SFT w/ per-token supervision is probably too constrained to map new/old data into a shared weight space. Wondering if adding continuous thinking tokens (so still no RL) before supervised prediction could relax this, while staying off-policy?
For agents to improve over time, they can’t afford to forget what they’ve already mastered. We found that supervised fine-tuning forgets more than RL when training on a new task! Want to find out why? 👇
0
1
11
New paper! We explore a radical paradigm for AI evals: assessing LLMs on *unsolved* questions. Instead of contrived exams where progress ≠ value, we eval LLMs on organic, unsolved problems via reference-free LLM validation & community verification. LLMs solved ~10/500 so far:
15
74
369
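The paper's actual pipeline isn't described in the tweet; the sketch below is only a guess at the shape of "reference-free LLM validation followed by community verification": several judge models vote on a candidate answer with no gold reference, and a quorum promotes it to human review. Every name here (`Submission`, `reference_free_screen`, the toy judge) is hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class Submission:
    question: str
    answer: str
    judge_votes: List[bool] = field(default_factory=list)

def reference_free_screen(sub: Submission,
                          judges: List[Callable[[str, str], bool]],
                          quorum: float = 0.8) -> bool:
    """No gold answer exists, so several LLM judges independently check the
    answer for validity; a vote fraction >= quorum sends it to human review."""
    sub.judge_votes = [judge(sub.question, sub.answer) for judge in judges]
    return sum(sub.judge_votes) / len(judges) >= quorum

# Placeholder judge; a real one would wrap an LLM call that critiques the answer.
toy_judge = lambda question, answer: len(answer.strip()) > 0
sub = Submission("An unsolved research question", "A candidate answer")
print(reference_free_screen(sub, judges=[toy_judge] * 3))  # True -> escalate to community verification
```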
Enjoying GPT-5 a lot! Research Q: maybe intelligence is discovering the simplest algorithm that generalizes (5-digit → N-digit addition). GPT-5 may be close for +/*, but what enables RL on top of (constrained) next-token pretraining to discover the least-KC algorithm for all tasks? Thoughts?
There will always exist an input of length N for which multiplication using autoregressive models will break. Example of GPT-5 thinking getting 10-digit multiplication wrong below: correct answer 88,723,505,107,555,515,626 vs GPT's answer 88,723,505,296,555,515,626
0
1
1
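Worth noting: the two candidate products in the quoted tweet differ by exactly 189,000,000,000, i.e. only a block of middle digits is off. Checking such claims is trivial outside the model, since Python integers are arbitrary precision. The operands below are made up, because the quoted tweet only reports the two candidate products.

```python
def check_product_claim(a: int, b: int, claimed: int) -> bool:
    """Exact check of a multiplication answer; Python ints never overflow,
    so any digit count is handled by a single comparison."""
    return a * b == claimed

# Made-up 10-digit operands (the quoted tweet does not give the originals).
a, b = 9_409_123_457, 9_428_765_311
truth = a * b
print(truth)
print(check_product_claim(a, b, truth))                     # True
print(check_product_claim(a, b, truth + 189_000_000_000))   # False: differs in the middle of the number
```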
Talking about KARL today — our recent work on a Kolmogorov Complexity–inspired adaptive tokenizer. Details about the paper here: https://t.co/H10MkJyM8o More broadly, quite excited about representation learning — and understanding large models — through the lens of compression.
@ShivamDuggal4 of @MIT in a deep dive @ploutosai w/ @ceciletamura, Head of Community. Don't miss it! https://t.co/Sfu9LMNKNq
0
2
21
Strongest compressors might not be the best decoders for your task. RL can adapt pre-trained models into more "sophisticated" decoders, tuned to the task’s specific demands. Exciting thread & research! Question: is next-token prediction really the final chapter in pretraining?
Really cool thread! Cross entropy = optimal compressor for the observed pretraining data. RL objective = reward-weighted compression objective.
0
2
10
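The "cross entropy = optimal compressor" line is a standard Shannon / arithmetic-coding fact, illustrated below: a model that assigns probability p to the observed next token can encode it in about -log2(p) bits, so total code length is just the base-2 cross entropy summed over the sequence.

```python
import math

def code_length_bits(token_probs):
    """Total code length of a sequence under a model: the sum of -log2(p) for
    the probability the model assigned to each observed token (its base-2
    cross entropy), which arithmetic coding achieves to within ~2 bits."""
    return sum(-math.log2(p) for p in token_probs)

# Probabilities a hypothetical LM assigned to the tokens it actually saw.
observed = [0.5, 0.25, 0.9, 0.1]
print(code_length_bits(observed))  # ~6.47 bits: better prediction = shorter code
```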
We ran more experiments to better understand “why” diffusion models do better in data-constrained settings than autoregressive. Our findings support the hypothesis that diffusion models benefit from learning over multiple token orderings, which contributes to their robustness and
@mihirp98 Interesting, thanks! So this is your next hypothesis then, right? > Diffusion models randomly factorize the joint, which enables them to generate tokens in random orders, which we think can’t simply be recreated by just random input masking while still having the next token
8
62
545
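A toy illustration of "learning over multiple token orderings" (not the authors' experiment): the chain rule factorizes the joint under any permutation of positions, and a training loss that samples a random order each step exposes the model to many such factorizations, which is the flexibility the thread attributes to diffusion. The `uniform` conditional below is a placeholder model.

```python
import math
import random

def log_joint_under_order(tokens, order, cond_logprob):
    """Chain rule under an arbitrary ordering: sums log p(x_pos | revealed
    context) while revealing positions in the given permutation."""
    total, context = 0.0, {}
    for pos in order:
        total += cond_logprob(tokens[pos], context)
        context[pos] = tokens[pos]
    return total

# Placeholder conditional: uniform over a 4-symbol vocabulary, so every
# ordering yields the same log-likelihood for this toy model.
uniform = lambda token, context: math.log(0.25)
tokens = ["a", "b", "b", "c"]
print(log_joint_under_order(tokens, [0, 1, 2, 3], uniform))                 # left-to-right (AR-style)
print(log_joint_under_order(tokens, random.sample(range(4), 4), uniform))   # a random ordering
```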
One "Skild brain" powers all embodiments—amazing work! Huge congratulations to entire team. Excited to see what’s next. Miss you all <3 !
Modern AI is confined to the digital world. At Skild AI, we are building towards AGI for the real world, unconstrained by robot type or task — a single, omni-bodied brain. Today, we are sharing our journey, starting with early milestones, with more to come in the weeks ahead.
0
2
18
For @NeurIPSConf, we can't update the main PDF or upload a separate rebuttal PDF — so no way to include any new images or visual results? What if reviewers ask for more vision experiments? 🥲 Any suggestions or workarounds?
5
0
11
Great work from great people! @mihirp98 @pathak2206 AR aligns w/ compression theory (KC, MDL, arithmetic coding), but diffusion is MLE too. Can we interpret diffusion similarly? Curious how compression explains AR vs. diffusion scaling laws. (Ilya’s talk touches on this too.)
🚨 The era of infinite internet data is ending, So we ask: 👉 What’s the right generative modelling objective when data—not compute—is the bottleneck? TL;DR: ▶️Compute-constrained? Train Autoregressive models ▶️Data-constrained? Train Diffusion models Get ready for 🤿 1/n
1
2
12
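For reference, the textbook relations behind "AR aligns with compression theory" and the question of whether diffusion can be read the same way (standard facts, not results from the quoted thread): arithmetic coding under an autoregressive model achieves roughly the negative log-likelihood in bits, while a diffusion model's variational bound upper-bounds that same code length.

```latex
% Autoregressive model: arithmetic coding reaches the NLL to within ~2 bits.
L_{\mathrm{AR}}(x) \;\approx\; -\sum_{t=1}^{n} \log_2 p_\theta(x_t \mid x_{<t})

% Diffusion model: the negative ELBO upper-bounds the NLL, so it still gives a
% (possibly looser) code length for x.
-\log_2 p_\theta(x) \;\le\; \mathbb{E}_{q(z_{1:T}\mid x)}\!\left[ \log_2 \frac{q(z_{1:T}\mid x)}{p_\theta(x,\, z_{1:T})} \right]
```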
Indeed! I find H-Net to be closely related to KARL — and even our earlier work ALIT (the recurrent tokenizer in the figure below) shares strong connections. Loved reading H-Net, like all of @_albertgu’s work. Congrats to @sukjun_hwang and team!
Single-pass Adaptive Image Tokenization for Minimum Program Search KARL is a single-pass adaptive image tokenizer that predicts how many tokens are needed based on Kolmogorov Complexity, without test-time search. It halts once enough information is captured, using token count as
1
3
31
Our new work on adaptive image tokenization: Image → T tokens
* variable T, based on image complexity
* single forward pass both infers T and tokenizes to T tokens
* approximates minimum description length encoding of the image
Compression is the heart of intelligence. From Occam to Kolmogorov: shorter programs = smarter representations. Meet KARL: Kolmogorov-Approximating Representation Learning. Given an image, a token budget T, and a target quality 𝜖, KARL finds the smallest t ≤ T that reconstructs the image within 𝜖 🧵
0
30
202
Excited to share this work on studying representation learning from a compression perspective! Grateful to my amazing advisors—Professors Bill Freeman, Antonio Torralba, @phillip_isola @MITCSAIL 📄 Paper: https://t.co/DkzCgYNhS5 💻 Code: https://t.co/zUHzgd79qb AIT meets AIT!
github.com
Single-pass Adaptive Image Tokenization for Minimum Program Search | What's the Kolmogorov Complexity of an Image? - ShivamDuggal4/karl
0
1
14
Hint at modeling interestingness! 👀 Adaptive image tokenizers may go beyond KC—capturing sophistication or logical depth? Measure Δ in reconstruction as tokens increase: Big Δ → structure; Small Δ → trivial/noise; Mid Δ → maybe… interesting? Future work awaits! (13/n)
1
0
4
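A tiny sketch of the Δ-in-reconstruction probe from the tweet above (illustrative only; the quality curves are fake stand-ins for an actual adaptive tokenizer): compute how much reconstruction quality improves each time one more token is allowed, then read the profile of those gains.

```python
def gain_profile(image, max_tokens, quality_fn):
    """Per-budget reconstruction gains: quality(t+1) - quality(t). Sharp early
    gains suggest structure, a flat profile suggests trivial content or noise,
    and sustained mid-sized gains are the 'maybe interesting' regime."""
    q = [quality_fn(image, t) for t in range(1, max_tokens + 1)]
    return [b - a for a, b in zip(q, q[1:])]

# Fake quality curves standing in for decode(encode(image, t)):
structured = lambda img, t: 1 - 0.5 ** t        # saturates quickly
noise_like = lambda img, t: 0.10 + 0.001 * t    # barely improves with more tokens
print(gain_profile(None, 6, structured))   # big, shrinking deltas
print(gain_profile(None, 6, noise_like))   # uniformly tiny deltas
```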
KC isn’t everything! There is more to AIT. Pure noise & rich structure can both have high KC—but only one is interesting. What’s truly interesting often lies in the middle: patterns that are partly predictable, partly surprising. (See @pbloemesquire’s excellent slide deck!)
1
2
6