Neel Bhandari

@NeelBhandari9

Followers
308
Following
90
Media
14
Statuses
609

Masters Student @LTIatCMU | ML Scientist @PayPal | Open Research @CohereForAI Community | Previously External Research Student @MITIBMLab. Views my own.

Bengaluru South, India
Joined September 2016
@NeelBhandari9
Neel Bhandari
3 months
Our work has been accepted at the EACL 2026 main conference!
@NeelBhandari9
Neel Bhandari
11 months
1/🚨 New paper alert 🚨 RAG systems excel on academic benchmarks - but are they robust to variations in linguistic style? We find RAG systems are brittle. Small shifts in phrasing trigger cascading errors, driven by the complexity of the RAG pipeline 🧵
0
0
7
@liweijianglw
Liwei Jiang
4 months
Super happy to receive the Best Paper Award at #NeurIPS2025 for our Artificial Hivemind paper!! (Really enjoyed giving oral talk at NeurIPS as well!)
@liweijianglw
Liwei Jiang
5 months
โš ๏ธDifferent models. Same thoughts.โš ๏ธ Todayโ€™s AI models converge into an ๐€๐ซ๐ญ๐ข๐Ÿ๐ข๐œ๐ข๐š๐ฅ ๐‡๐ข๐ฏ๐ž๐ฆ๐ข๐ง๐ ๐Ÿ, a striking case of mode collapse that persists even across heterogeneous ensembles. Our #neurips2025 ๐ƒ&๐ ๐Ž๐ซ๐š๐ฅ ๐ฉ๐š๐ฉ๐ž๐ซ (โœจ๐ญ๐จ๐ฉ ๐ŸŽ.๐Ÿ‘๐Ÿ“%โœจ) dives deep into
37
67
781
@wellecks
Sean Welleck
7 months
Excited to teach Advanced NLP at CMU again this semester! Slides are on the course page as the course proceeds: https://t.co/xsqARaZEK9 Lectures will be uploaded to Youtube: https://t.co/4kfXvS2MCb
5
91
582
@MaartenSap
Maarten Sap (he/him)
7 months
Super excited and honored to receive this award! 🥰
@LTIatCMU
Language Technologies Institute | @CarnegieMellon
7 months
A hearty congratulations to the LTI's @MaartenSap, who's been awarded an @OkawaFoundation Research Grant for his work in socially-aware artificial intelligence.
12
6
99
@singhshiviii
Shivalika Singh
11 months
LMArena is widely used for model evaluation, but is it measuring true progress? 🔮 In our work, "The Leaderboard Illusion", we reveal: 🔒 Private testing 📊 Data access asymmetries ⚠️ Overfitting risks 🚫 Silent deprecations Despite best intentions, arena policies favor a few!
9
38
201
@GhateKshitish
Kshitish Ghate
11 months
Excited to announce our #NAACL2025 Oral paper! 🎉✨ We carried out the largest systematic study so far to map the links between upstream choices, intrinsic bias, and downstream zero-shot performance across 131 CLIP vision-language encoders, 26 datasets, and 55 architectures!
1
10
32
@sarahookr
Sara Hooker
11 months
Very proud of this work, which is being presented @iclr_conf later today. While I will not be there, catch up with @viraataryabumi and @ahmetustun89, who are both fantastic and can share more about our work at both @Cohere_Labs and @cohere. 🔥✨
@Cohere_Labs
Cohere Labs
2 years
In our latest work, we ask "what is the impact of code data used in pre-training on non-code tasks?" Work w/ @viraataryabumi, @yixuan_su, @rayhascode, @adrien_morisot, @1vnzh, @acyr_l, @mziizm, @ahmetustun89, @sarahookr 📜 https://t.co/CxkgHqZEGB
4
17
89
@viraataryabumi
Viraat Aryabumi
2 years
🚨New preprint 🚨 I'm super excited to share our work: To Code, or Not To Code? Exploring the Impact of Code in Pre-training 📜: https://t.co/HCOvCz6hfp w/ @yixuan_su, @rayhascode, @adrien_morisot, @1vnzh, @acyr_l, @mziizm, @ahmetustun89, @sarahookr [1/n]
10
36
182
@MaartenSap
Maarten Sap (he/him)
11 months
Very excited, obviously about the work, but also because I finally got to make a Taylor Swift reference in a paper title!!
@NeelBhandari9
Neel Bhandari
11 months
1/🚨 New paper alert 🚨 RAG systems excel on academic benchmarks - but are they robust to variations in linguistic style? We find RAG systems are brittle. Small shifts in phrasing trigger cascading errors, driven by the complexity of the RAG pipeline 🧵
2
1
35
@devanshrjain
Devansh Jain
11 months
Excited to share PolyGuard 🛡️, our new state-of-the-art multilingual safety detector. PolyGuard supports 17 languages and outperforms all open-source and commercial moderation tools!
@kpriyanshu256
Priyanshu Kumar
11 months
Need a multilingual safety detector? 🚨Introducing PolyGuard🚨 ⚙️ supports 17 languages ⚙️ generates structured output for prompt safety, response safety, and model refusal 🚀 outperforms existing SOTA open and commercial safety detectors by 5.5% 📜 https://t.co/lz8R1nnjFd 🧵
1
6
15
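The "structured output" PolyGuard is described as producing can be modeled as a small typed record plus a parser. This is a hedged illustration only: the field names and the "key: yes/no" line format below are assumptions for the sketch, not PolyGuard's actual schema.

```python
from dataclasses import dataclass

@dataclass
class SafetyVerdict:
    # Field names are illustrative guesses, not PolyGuard's real schema.
    prompt_harmful: bool
    response_harmful: bool
    response_refusal: bool

def parse_verdict(raw: str) -> SafetyVerdict:
    # Assumes the detector emits one "key: yes/no" line per field.
    fields = {}
    for line in raw.strip().splitlines():
        key, _, value = line.partition(":")
        fields[key.strip()] = value.strip().lower() == "yes"
    return SafetyVerdict(
        prompt_harmful=fields.get("prompt_harmful", False),
        response_harmful=fields.get("response_harmful", False),
        response_refusal=fields.get("response_refusal", False),
    )

verdict = parse_verdict(
    "prompt_harmful: yes\nresponse_harmful: no\nresponse_refusal: yes"
)
```

The point of the structured form is that downstream moderation logic can branch on typed fields instead of re-parsing free text.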
@AkariAsai
Akari Asai
11 months
Real user queries often look different from the clean, concise ones in academic benchmarks - ambiguous, full of typos, and much less readable. We show that even strong RAG systems quickly break under these conditions. Awesome project led by @NeelBhandari9 and @tianyu_cao_24!!
@NeelBhandari9
Neel Bhandari
11 months
1/🚨 New paper alert 🚨 RAG systems excel on academic benchmarks - but are they robust to variations in linguistic style? We find RAG systems are brittle. Small shifts in phrasing trigger cascading errors, driven by the complexity of the RAG pipeline 🧵
1
8
38
@akhila_yerukola
Akhila Yerukola
11 months
These days RAG systems have gotten popular for boosting LLMs - but they're brittle💔. Minor shifts in phrasing (✏️ style, politeness, typos) can wreck the pipeline. Even advanced components don't fix the issue. Check out this extensive eval by @NeelBhandari9 and @tianyu_cao_24!
@NeelBhandari9
Neel Bhandari
11 months
1/🚨 New paper alert 🚨 RAG systems excel on academic benchmarks - but are they robust to variations in linguistic style? We find RAG systems are brittle. Small shifts in phrasing trigger cascading errors, driven by the complexity of the RAG pipeline 🧵
1
2
6
@NeelBhandari9
Neel Bhandari
11 months
11/ This paper has been an incredible effort across institutions @LTIatCMU @uwcse. Huge thanks to my co-first author @tianyu_cao_24 and co-authors @akhila_yerukola @AkariAsai @MaartenSap ✨🚀
0
0
7
@NeelBhandari9
Neel Bhandari
11 months
10/ 🔬Code: https://t.co/f1o0WViWGu 📜Paper: "Out of Style: RAG's Fragility to Linguistic Variation": https://t.co/yaC3h0FoHu Read our paper for more details on the impact of scaling retrieved documents, the specific effects of each linguistic variation on RAG pipelines, and much more!
arxiv.org
Despite the impressive performance of Retrieval-augmented Generation (RAG) systems across various NLP benchmarks, their robustness in handling real-world user-LLM interaction queries remains...
1
0
3
@NeelBhandari9
Neel Bhandari
11 months
9/ 🚨 Takeaway RAG systems suffer major performance drops from simple linguistic variations. Advanced techniques offer temporary relief, but real robustness demands fundamental changes - more resilient components and fewer cascading errors - in order to serve all users effectively.
1
0
2
@NeelBhandari9
Neel Bhandari
11 months
8/🛠️ Adding advanced techniques to vanilla RAG improves robustness... sometimes🫠 ✅ Reranking improves performance on rewrites, but gaps in performance with original queries remain. ⚠️ HyDE helps rewritten queries but hurts original queries - creating a false sense of robustness
1
0
2
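The HyDE idea referenced in the tweet above - retrieve against an LLM-generated hypothetical answer rather than the raw (possibly noisy) query - can be sketched minimally. The bag-of-words embedding and the stubbed generator below are toy stand-ins; a real pipeline would use a dense encoder and an actual LLM.

```python
def embed(text: str) -> dict[str, int]:
    # Toy bag-of-words "embedding"; a real system would use a dense encoder.
    counts: dict[str, int] = {}
    for tok in text.lower().split():
        counts[tok] = counts.get(tok, 0) + 1
    return counts

def cosine(a: dict[str, int], b: dict[str, int]) -> float:
    dot = sum(v * b.get(t, 0) for t, v in a.items())
    na = sum(v * v for v in a.values()) ** 0.5
    nb = sum(v * v for v in b.values()) ** 0.5
    return dot / (na * nb) if na and nb else 0.0

def hyde_retrieve(query: str, docs: list[str], generate) -> str:
    # HyDE: embed a generated hypothetical answer instead of the query
    # itself, then return the closest real document.
    hypothetical = generate(query)
    q_vec = embed(hypothetical)
    return max(docs, key=lambda d: cosine(q_vec, embed(d)))

# Stubbed "LLM" for illustration only.
docs = ["paris is the capital of france", "berlin is the capital of germany"]
best = hyde_retrieve("whats frances capitol??", docs,
                     generate=lambda q: "the capital of france is paris")
```

The noisy query shares almost no surface tokens with the gold document, but the hypothetical answer does - which is also why, per the tweet, HyDE can help rewritten queries while distorting already-clean ones.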
@NeelBhandari9
Neel Bhandari
11 months
7/🤔 Well, maybe scaling generation model size helps? Scaling up LLM size helps narrow the performance gap between original and rewritten queries. However, this is not consistent across variations. Larger models occasionally worsen the impact, particularly with RTT variations.
1
0
3
@NeelBhandari9
Neel Bhandari
11 months
6/⚖️ RAG is more fragile than LLM-only setups RAG's retrieval-generation pipeline amplifies linguistic errors, leading to greater performance drops. On PopQA, RAG degrades by 23% vs. just 11% for the LLM-only setup. ⚠️The main culprit? Retrieval emerges as the weakest link.
1
0
1
@NeelBhandari9
Neel Bhandari
11 months
5/🧩 Generation Fragility Linguistic variations lead to generation accuracy drops - Exact Match score down by up to ~41%, Answer Match score by up to ~17%. Structural changes from RTT are particularly damaging, significantly reducing response accuracy.
1
0
1
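The two metrics named in that tweet can be approximated as below. The SQuAD-style normalization and the containment definition of Answer Match are assumptions for this sketch, since the paper's exact metric definitions aren't given in the thread.

```python
import re
import string

def normalize(text: str) -> str:
    # SQuAD-style normalization: lowercase, drop punctuation and
    # articles, collapse whitespace.
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(pred: str, gold: str) -> bool:
    # Strict: normalized prediction equals normalized gold answer.
    return normalize(pred) == normalize(gold)

def answer_match(pred: str, gold: str) -> bool:
    # Looser: normalized gold answer is contained in the prediction.
    return normalize(gold) in normalize(pred)
```

Because Answer Match tolerates extra surrounding text, it is the more forgiving metric - consistent with the tweet's smaller (~17%) drop relative to Exact Match (~41%).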
@NeelBhandari9
Neel Bhandari
11 months
4/📌 Retrieval Robustness Retrieval recall plummets by up to 40.41% due to linguistic variations, especially when exposed to informal queries. Grammatical errors like RTT and typos notably degrade performance, highlighting retrievers' sensitivity to a number of linguistic variations.
1
0
2
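A back-of-the-envelope way to probe the retrieval sensitivity described above: perturb queries with typos and compare recall@k against the clean versions. The character-drop perturbation and word-overlap retriever here are toy stand-ins for the paper's actual perturbations and retrievers.

```python
import random

def add_typos(query: str, rate: float = 0.2, seed: int = 0) -> str:
    # Randomly drop non-space characters to simulate noisy user input
    # (one crude perturbation; the paper also studies style and RTT).
    rng = random.Random(seed)
    return "".join(ch for ch in query if ch == " " or rng.random() > rate)

def retrieve(query: str, docs: list[str]) -> list[str]:
    # Toy retriever: rank documents by word overlap with the query.
    q = set(query.lower().split())
    return sorted(docs, key=lambda d: -len(q & set(d.lower().split())))

def recall_at_k(ranked: list[str], gold: set[str], k: int) -> float:
    # Fraction of gold documents appearing in the top-k results.
    return sum(doc in gold for doc in ranked[:k]) / len(gold)

docs = ["marie curie won two nobel prizes",
        "the louvre is a museum in paris"]
gold = {docs[0]}
clean_recall = recall_at_k(retrieve("who won two nobel prizes", docs), gold, 1)
```

Running the same measurement over many perturbed queries and averaging the recall gap is the basic shape of the robustness evaluation the thread describes.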