Shruti Joshi
@_shruti_joshi_
Followers
404
Following
2K
Media
3
Statuses
181
PhD student in identifiable representation learning. Prev. research programmer @MPI_IS Tübingen, undergrad @IITKanpur '19.
Montreal, Canada
Joined August 2018
1\ Hi, can I get an unsupervised sparse autoencoder for steering, please? I only have unlabeled data varying across multiple unknown concepts. Oh, and make sure it learns the same features each time! Yes! A freshly brewed Sparse Shift Autoencoder (SSAE) coming right up. 🧶
1
10
46
[1/9] While pretraining data might be hitting a wall, novel methods for modeling it are just getting started! We introduce future summary prediction (FSP), where the model predicts future sequence embeddings to reduce teacher forcing & shortcut learning. Predict a learned …
10
47
221
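The tweet is cut off, but the stated core idea (predict a learned embedding of the future sequence alongside next tokens) can be sketched in code. Everything below is an assumption from the tweet alone (module names, shapes, the stop-gradient, the MSE loss), not the paper's implementation:

```python
# Rough sketch of a future-summary-prediction (FSP) auxiliary objective,
# inferred from the tweet alone. All names and design choices are assumed.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FSPHead(nn.Module):
    def __init__(self, d_model: int, d_summary: int):
        super().__init__()
        self.proj = nn.Linear(d_model, d_summary)

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        # hidden: (batch, seq, d_model) hidden states of the language model
        return self.proj(hidden)  # predicted embedding of the sequence's future

def fsp_loss(pred: torch.Tensor, future_summary: torch.Tensor) -> torch.Tensor:
    # future_summary: target embedding of upcoming tokens, produced by a
    # (jointly learned) summary encoder; detached so only the LM is pushed.
    return F.mse_loss(pred, future_summary.detach())

# Training would presumably combine this with next-token prediction, e.g.
# loss = ce_loss + lambda_fsp * fsp_loss(fsp_head(hidden), future_summary)
```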
I'm at CoLM this week! Come check out our work on evaluating RMs for agent trajectories! These days, I'm thinking about forecasting generalization, scaling laws, and safety/adversarial attacks. Ping me if you wanna chat about research!
i will be presenting AgentRewardBench at #COLM2025 next week! session: #3, date: wednesday 11am to 1pm, poster: #545. come learn more about the paper, my recent works, or just chat about anything (montreal, mila, etc.). here's a teaser of my poster :)
0
5
7
I will be at the Actionable Interpretability Workshop (@ActInterp, #ICML) presenting *SSAEs* in the East Ballroom A from 1-2pm. Drop by (or send a DM) to chat about (actionable) interpretability, (actionable) identifiability, and everything in between!
1
6
24
🚨 New Paper! 🚨 Guard models slow, language-specific, and modality-limited? Meet OmniGuard, which detects harmful prompts across multiple languages & modalities using one approach, with SOTA performance in all 3 modalities while being 120X faster! https://t.co/r6DGPDfwle
1
43
79
⚡⚡ Llama-Nemotron-Ultra-253B just dropped: our most advanced open reasoning model 🧵
3
13
44
Thoughtology paper is out! 🔥 We study the reasoning chains of DeepSeek-R1 across a variety of tasks and settings and find several surprising and interesting phenomena! Incredible effort by the entire team! https://t.co/CDlFHD28xQ
Models like DeepSeek-R1 mark a fundamental shift in how LLMs approach complex problems. In our preprint on R1 Thoughtology, we study R1's reasoning chains across a variety of tasks, investigating its capabilities, limitations, and behaviour. https://t.co/Cyy18kYQ45
1
5
26
Presenting ✨ CHASE: Generating challenging synthetic data for evaluation ✨ Work w/ fantastic advisors @DBahdanau and @sivareddyg Thread 🧵:
1
18
41
Curious to find out more? Check out our pre-print at: https://t.co/iKlCxwNhgC. Work done with an amazing set of researchers: @andrea_dittadi, @seblachap, and @dhanya_sridhar!
0
0
6
5\ So, does it actually work? We show that SSAE accurately steers embeddings on both semi-synthetic and real-world datasets (like TruthfulQA) using Llama-3.1-8B, handling in- and out-of-distribution data with ease.
1
0
1
4\ What does this mean for steering? You get steering vectors for individual concepts: each vector consistently steers only a single concept and can be scaled according to the context.
1
0
1
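As a hypothetical illustration of what "scaled according to the context" could look like in code (the `decoder` layout, the additive update, and all names here are my assumptions, not the paper's API):

```python
# Hypothetical use of an SSAE steering vector: add one concept's decoder
# direction to an embedding, scaled by alpha. Names/shapes are assumed.
import torch

def steer(embedding: torch.Tensor, decoder: torch.Tensor,
          concept: int, alpha: float) -> torch.Tensor:
    # decoder: (d_embed, n_concepts); column k is concept k's direction
    v = decoder[:, concept]
    return embedding + alpha * v  # shift only along that one concept

# e.g. steered = steer(h, ssae.decoder.weight, k, alpha=2.0)
```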
3\ With sufficiently diverse data (such as in the real world), SSAEs remain identifiable up to permutation and scaling: repeated runs yield consistent representations, differing only by trivial indeterminacies.
1
0
3
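One practical way to test "identifiable up to permutation and scaling" is to train twice and match concept directions across runs. A minimal sketch of such a check, with assumed decoder shapes (my own sketch, not the paper's evaluation code):

```python
# Sanity check for the identifiability claim: decoders from two independent
# runs should agree up to permutation and scaling, so matched columns
# should have |cosine similarity| near 1.
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_score(dec_a: np.ndarray, dec_b: np.ndarray) -> float:
    # dec_*: assumed (d_embed, n_concepts) decoder matrices from two runs
    a = dec_a / np.linalg.norm(dec_a, axis=0, keepdims=True)  # remove scale
    b = dec_b / np.linalg.norm(dec_b, axis=0, keepdims=True)
    cos = np.abs(a.T @ b)                     # |cos| also forgives sign flips
    rows, cols = linear_sum_assignment(-cos)  # best one-to-one matching
    return float(cos[rows, cols].mean())      # ~1.0 if runs really agree
```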
2\ The SSAE is designed to map the difference between text embeddings (varying across multiple unknown concepts) to a sparse representation. Unlike standard SAEs, which impose sparsity on the concept representations themselves, we impose it on the shifts between them.
1
0
2
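Read literally, the tweet above suggests an architecture along these lines. This is a minimal sketch of my reading (the linear encoder/decoder, the L1 penalty, and all names are assumptions, not the actual SSAE implementation):

```python
# Minimal sketch: encode the *shift* between two embeddings sparsely,
# rather than each embedding itself. Design choices below are assumed.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseShiftAutoencoder(nn.Module):
    def __init__(self, d_embed: int, n_concepts: int):
        super().__init__()
        self.encoder = nn.Linear(d_embed, n_concepts)
        self.decoder = nn.Linear(n_concepts, d_embed, bias=False)

    def forward(self, e1: torch.Tensor, e2: torch.Tensor):
        shift = e1 - e2          # differs in only a few unknown concepts
        z = self.encoder(shift)  # sparse code for the shift
        return self.decoder(z), z

def ssae_loss(model, e1, e2, l1: float = 1e-3):
    recon, z = model(e1, e2)
    # reconstruct the shift; the L1 term makes the shift's code sparse
    return F.mse_loss(recon, e1 - e2) + l1 * z.abs().mean()
```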
📣 📣 📣 Our new paper investigates the question of how many images 🖼️ of a concept are required by a diffusion model 🤖 to imitate it. This question is critical for understanding and mitigating the copyright and privacy infringements of these models! https://t.co/bvdVU1M0Hh
10
61
225
🚨 NEW PAPER OUT 🚨 Excited to share our latest research initiative on in-context learning and meta-learning through the lens of information theory! 🧠 https://t.co/Tj5cYudDwy Check out our insights and empirical experiments!
Introducing our new paper explaining in-context learning through the lens of Occam's razor, giving a normative account of next-token prediction objectives. This was with @Tom__Marty @tejaskasetty @le0gagn0n @sarthmit @MahanFathi @dhanya_sridhar @g_lajoie_
0
3
7
I am thrilled to announce that I will be joining the Gatsby Computational Neuroscience Unit at UCL as a Lecturer (Assistant Professor) in Feb 2025! Looking forward to working with the exceptional talent at @GatsbyUCL on cutting-edge problems in deep learning and causality.
We are delighted to announce that Dr Leena Chennuru Vankadara will join the Unit as Lecturer in Feb 2025, developing theoretical understandings of scaling and generalization in deep learning and causality. Welcome aboard @leenaCvankadara! Learn more at https://t.co/jASvmzGZFP
10
6
66
Presenting tomorrow at #NAACL2024: Can LLMs in-context learn to use new programming libraries and functions? Yes. Kind of. Internship @allen_ai work with @pdasigi and my advisors @DBahdanau and @sivareddyg.
3
21
74
Adversarial Triggers For LLMs Are NOT Universal! 😲 It is believed that adversarial triggers that jailbreak a model transfer universally to other models. But we show triggers don't reliably transfer, especially to RLHF/DPO models. Paper: https://t.co/nRdw2h1rgS
3
32
98
📢 Exciting new work on AI safety! Do adversarial triggers transfer universally across models (as has been claimed)? No. Are models aligned by supervised fine-tuning safe against adversarial triggers? No. RLHF and DPO are far better!
0
6
20