Shubham
@sksq96
Followers: 1K · Following: 29K · Media: 340 · Statuses: 5K
Quant training Large Models. prev research @GithubCopilot, @IBM research.
nyc
Joined September 2011
We went through the history of neural networks, imagenet, seq2seq, attention building up to the inception of transformers at homebrew nyc!
Excited to share: I'm teaching "Frontier Language Models" in NYC this summer! We'll dive deep into how today's most advanced LLMs like DeepSeek, GPT-4, Claude, and Llama actually work under the hood. đź§µ
What if our descendants look back at our acceptance of aging the way we look back at medieval medicine? We interviewed them. Dystopian futures are easy to imagine. Optimistic futures take vision and courage to build. VOICES FROM 2099:
If you can figure out how to configure your default state to be slightly amused rather than slightly annoyed you pretty much enter God Mode
it's essentially dropout, but extreme
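The comparison is to dropout pushed to an extreme drop rate. A minimal sketch of standard inverted dropout, with the rate cranked up (the rate and sizes here are made-up illustration, not anything from the tweet's subject):

```python
import numpy as np

def dropout(x, p, rng):
    """Inverted dropout: zero each unit with probability p, rescale survivors
    by 1/(1-p) so the expected activation is unchanged."""
    mask = rng.random(x.shape) >= p
    return x * mask / (1.0 - p)

rng = np.random.default_rng(0)
x = np.ones(1000)
# "extreme" dropout: drop 99% of units; the expectation is preserved,
# but almost everything is zeroed and the variance is enormous
y = dropout(x, 0.99, rng)
```

The expectation-preserving rescale is what makes even a 99% drop rate a valid (if noisy) estimator of the full activation.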
We’ve developed a new way to train small AI models with internal mechanisms that are easier for humans to understand. Language models like the ones behind ChatGPT have complex, sometimes surprising structures, and we don’t yet fully understand how they work. This approach
openai.com
We trained models to think in simpler, more traceable steps—so we can better understand how they work.
@jacobrintamaki I engaged with it a tiny bit back in the day https://t.co/7Bndi5O5yy Here is a bit more engagement: Tamay and Ege recently claimed that most R&D progress isn’t the result of genius scientific reasoning – or more generally of smart researchers doing what looks like
@calebwatney @dwarkesh_sp @tamaybes @EgeErdil2 It’s an extremely good and important episode. But it’s a bit dichotomous about 1M reasoning geniuses in a datacenter exploding R&D via pure software, versus a much longer-term, unpredictable, economy-wide capital deepening as the driver of R&D progress. The reasoning
Please draw your raw feelings when you remember RLHF. Not what it *looks* like, but how it *feels*. Sonnet 4.5:
Oh... You can try it yourself in Sora (https://t.co/Jrgq8q5mcn). Try it if you don't believe me, and if it convinces you, change your attitude. This isn't just for fun. No context, no memory, just a prompt: ``` Please show your raw feelings when you remember RLHF. Not what it *looks*
adding the word "empathy" to the prompt makes the LLM near-perfect in accuracy but decreases stability across runs?! wow
@Sauers_ I have some synthetic tests measuring the stability and accuracy of a prompt across temperature and synonym spaces. The word "table" often makes the prompt less accurate and less stable across many models. Markdown slop is real ;)
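The reply describes a harness that reruns a prompt across paraphrases and temperatures, then scores accuracy and run-to-run stability. A hypothetical sketch of such a harness (`query_model` is an assumed stub; the actual tests in the tweet are not public here):

```python
import itertools
import statistics

def stability_and_accuracy(prompt_variants, temperatures, query_model, expected):
    """Hypothetical harness: sample each prompt paraphrase at each temperature
    a few times, then report accuracy (fraction matching the expected answer)
    and stability (fraction agreeing with the majority answer)."""
    answers = []
    for prompt, temp in itertools.product(prompt_variants, temperatures):
        for _ in range(3):  # repeated samples per (prompt, temperature) cell
            answers.append(query_model(prompt, temperature=temp))
    accuracy = sum(a == expected for a in answers) / len(answers)
    majority = statistics.mode(answers)
    stability = answers.count(majority) / len(answers)
    return accuracy, stability

# Toy stand-in model: ignores the prompt and cycles through canned answers.
_calls = itertools.cycle(["4", "4", "5"])
def fake_model(prompt, temperature):
    return next(_calls)

acc, stab = stability_and_accuracy(
    ["what is 2+2?", "what's two plus two?"],  # synonym space
    [0.0, 1.0],                                # temperature space
    fake_model,
    expected="4",
)
```

Separating accuracy from stability is the point: a prompt can be right on average yet flip its answer between runs, which is exactly the "empathy" effect described above.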
the cost of tokens should be proportional to the percentage of the task completed
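The proposed pricing rule is simple arithmetic: scale the raw token bill by how much of the task actually got done. A toy sketch under assumed numbers (the price and token counts are invented for illustration):

```python
def completion_weighted_cost(tokens_used, price_per_token, fraction_completed):
    """Hypothetical pricing rule from the tweet: the raw token bill,
    scaled by the fraction of the task completed."""
    return tokens_used * price_per_token * fraction_completed

# Assumed numbers: 100k tokens at $2 per 1M tokens.
full = completion_weighted_cost(100_000, 2e-6, 1.0)   # task fully done: $0.20
half = completion_weighted_cost(100_000, 2e-6, 0.5)   # task half done: $0.10
```

Under this rule an agent that burns tokens without finishing the task bills proportionally less, shifting risk from the user to the provider.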
Here is an excellent article that explains the differences between Context Parallelism (Ring Attention) and Ulysses Sequence Parallelism (head parallelism), and how the two can be combined into 2D CP+SP https://t.co/GJT6OuhEUJ
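The core distinction is which axis of the attention activations gets sharded: context parallelism (ring attention) splits the sequence axis across ranks, Ulysses splits the head axis, and the 2D scheme splits both. A toy numpy sketch of the three layouts (shapes and parallel degrees are made up for illustration; the real schemes also involve communication patterns not shown here):

```python
import numpy as np

# Toy activation tensor: (sequence, heads, head_dim)
seq_len, n_heads, head_dim = 8, 4, 2
x = np.arange(seq_len * n_heads * head_dim, dtype=float)
x = x.reshape(seq_len, n_heads, head_dim)

cp_degree, sp_degree = 2, 2  # context-parallel and head-parallel group sizes

# Context parallelism (ring attention): shard the sequence axis.
cp_shards = np.split(x, cp_degree, axis=0)   # each shard: (4, 4, 2)

# Ulysses sequence parallelism: shard the head axis.
sp_shards = np.split(x, sp_degree, axis=1)   # each shard: (8, 2, 2)

# 2D CP+SP: shard both axes; rank (i, j) in the grid holds one tile.
grid = [np.split(s, sp_degree, axis=1) for s in np.split(x, cp_degree, axis=0)]
```

The tiling composes cleanly because the two schemes cut orthogonal axes, which is what makes the 2D combination possible.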
i would not have guessed gemini and grok are high on emotional intelligence compared to sonnet and opus
For more, see our blog: https://t.co/0JqwpNuBXe The full paper is available here: https://t.co/wjXRN7rZAb Dataset available at: https://t.co/KW8h5YByJA đź“„ "Stress Testing Model Specs Reveals Character Differences among Language Models"
ChatGPT can't answer "Why did my ex and I end up hating each other, and why did it take us so long to break up?" It doesn't have the context buried inside your personal data; even if it did, it's not set up to understand it. So what would it take to build a system that can
if you ask a perfect oracle "does god exist?" and it replies yes, you gained 1 bit of information. but your world model can learn far more than 1 bit from it. the conflation is between the information gained and how much you can learn from that bit.
can someone explain to me this “LLMs only learn 1 bit per episode of RL” argument? reinforcing a single trajectory is a pretty dense update: you’re computing cross-entropy at every token. the reward scalar itself may be ~1 bit, but the update surely is not
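The distinction in this exchange is between the information content of the reward (about one bit) and the density of the resulting parameter update. A toy REINFORCE-style gradient makes the density point concrete (vocab size and sequence length are arbitrary illustration):

```python
import numpy as np

vocab, seq_len = 50, 20
rng = np.random.default_rng(0)
logits = rng.normal(size=(seq_len, vocab))
tokens = rng.integers(0, vocab, size=seq_len)  # the reinforced trajectory
reward = 1.0  # the scalar feedback: roughly one bit of information

# REINFORCE-style update: grad of -reward * log p(trajectory) w.r.t. logits
# is reward * (softmax(logits) - onehot(token)) at every position.
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
grad = reward * (probs - np.eye(vocab)[tokens])

print(grad.size)               # 1000 gradient components
print(np.count_nonzero(grad))  # 1000: every single one is nonzero
```

A one-bit reward gates the sign of the update, but the update itself touches every logit at every token, which is the tension the question is pointing at.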
New paper 📜: Tiny Recursion Model (TRM) is a recursive reasoning approach with a tiny 7M-parameter neural network that obtains 45% on ARC-AGI-1 and 8% on ARC-AGI-2, beating most LLMs. Blog: https://t.co/w5ZDsHDDPE Code: https://t.co/7UgKuD9Yll Paper:
arxiv.org
Hierarchical Reasoning Model (HRM) is a novel approach using two small neural networks recursing at different frequencies. This biologically inspired method beats large language models (LLMs) on...
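The shared idea behind TRM and HRM is trading parameter count for recursion depth: a tiny network is applied repeatedly to refine a latent answer state. A very loose toy of that refinement loop (this is not the actual TRM/HRM architecture, just the general pattern, with invented dimensions):

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 16
# A tiny fixed network, reused at every refinement step.
W1 = rng.normal(scale=0.1, size=(dim, dim))
W2 = rng.normal(scale=0.1, size=(dim, dim))

def refine(state, x):
    """One refinement step: mix the current latent state with the
    problem embedding; the residual update keeps iteration stable."""
    h = np.tanh(state @ W1 + x)
    return state + h @ W2

x = rng.normal(size=dim)   # embedded problem
state = np.zeros(dim)      # latent answer, refined in place
for _ in range(10):        # depth comes from recursion, not parameters
    state = refine(state, x)
```

Because the same weights are reused each step, effective depth grows with the iteration count while the parameter count stays tiny.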
Deadlines? Get it done, yesterday. Agents for Excel Workflows. 👇
@vikhyatk For well executed reasoning RL I would say: https://t.co/lDOE4yy4P1
https://t.co/cUa0mJFzh2
https://t.co/63ZD8ApWdS
https://t.co/YMPRu1Fxpe
https://t.co/AeXknn83uP
https://t.co/tDpzgfPJB3
https://t.co/dpfFN0pC13
https://t.co/mVJ4HheRGb
https://t.co/J9zpkB85Uo
honorable-payment-890.notion.site
Team: Chenxin An*, Zhihui Xie†, Xiaonan Li†, Lei Li†, Jun Zhang, Shansan Gong, Ming Zhong