Darshan Deshpande Profile
Darshan Deshpande

@getdarshan

Followers
196
Following
395
Media
19
Statuses
154

Research Scientist working on RL environments and evals @PatronusAI | ex-Research @USC_ISI

San Francisco, CA
Joined July 2020
@VarunGangal
Varun Gangal
15 days
👋 Folks at #NeurIPS2025, come stop by the poster for our Memtrack env at the SEA workshop, happening at Upper Level 23ABC from 3:50pm onwards. Our env studies how well an agent dropped into a workplace can context engineer by composing tool calls to access intertwined
arxiv.org
Recent works on context and memory benchmarking have primarily focused on conversational instances but the need for evaluating memory in dynamic enterprise environments is crucial for its...
@getdarshan
Darshan Deshpande
15 days
🚨We will be presenting Memtrack today at the SEA workshop from 3:50pm onwards at #NeurIPS2025. Memtrack is a SoTA eval env to study an agent's ability to memorize and retrieve facts through exploration over interleaved enterprise Slack, Linear, and Git threads in a multi-QA setting
0
3
7
@getdarshan
Darshan Deshpande
22 days
I will be at #NeurIPS2025 from Dec 2nd-7th. Happy to meet old and new friends and chat about non-deterministic evals, long-horizon RL, and world building 🌍
1
0
1
@getdarshan
Darshan Deshpande
2 months
Creating a bounty program out of benchmark datasets that prohibit training on them, only to then build RL environments that can be trained on using Prime's "open source" training services. This is a scammy practice under the name of open science!
@willccbb
will brown
2 months
if you or a loved one is looking to learn about building environments and get a bag in the process, inquire within. our bounty list is bigger and better than ever
3
0
10
@getdarshan
Darshan Deshpande
2 months
Excited to have contributed to OpenEnv before its release today! Thanks to @Meta and @huggingface for working towards standardizing RL environment creation!
@PatronusAI
PatronusAI
2 months
We’re excited to support @Meta and @huggingface's OpenEnv launch today! OpenEnv provides an open-source framework for building and interacting with agentic execution environments. This allows researchers and developers to create isolated, secure, deployable, and usable
0
0
2
@PatronusAI
PatronusAI
5 months
Thank you, @BerkeleyRDI, for hosting the Agentic AI Summit and having us! @getdarshan, one of our research scientists and the lead for agent evaluation at Patronus, presented at the summit! Here are a few takeaways: * Given context explosion and increasing domain depth and
0
1
2
@clefourrier
Clémentine Fourrier 🍊 is off till Dec 2026 hiking
7 months
Check out the very cool work from our friends @PatronusAI 🔥 here! https://t.co/gINwBGZAtn
huggingface.co
1
7
17
@getdarshan
Darshan Deshpande
7 months
Non-deterministic trajectories need autonomous supervision. Introducing Percival, a SoTA system that detects issues in long-context agent traces and suggests fixes. The time to make a move towards autonomous evaluations is now! 🔥
@PatronusAI
PatronusAI
7 months
1/ 🔥🔥 Big news: We’re launching Percival, the first AI agent that can evaluate and fix other AI agents! 🤖 Percival is an evaluation agent that doesn’t just detect failures in agent traces — it can fix them. Percival outperformed SOTA LLMs by 2.9x on the TRAIL dataset,
1
4
10
@AnnieFranco
Annie Franco
9 months
Building good benchmarks is hard, and @PatronusAI has released what may be the coolest agent eval yet: ✅ Realistic and objectively useful task ✅ Multilingual, multimodal, and multi-domain ✅ Easy for humans, still challenging for agents
@PatronusAI
PatronusAI
9 months
1/ Ever tried to remember the name of a movie you’ve seen – you can picture the scenes clearly, but the movie name won’t come to you? Introducing BLUR: the first agent benchmark for tip-of-the-tongue search and reasoning 🔥 We benchmarked SOTA agents and found that the
1
4
6
@AnnieFranco
Annie Franco
9 months
My colleague Chris McConnell and I greatly enjoyed seeing @skychwang @getdarshan @rebeccatqian @anandnk24 bring this project to life. We’re excited to finally see it out in the world, and look forward to collaborating on the next one!
1
2
4
@PatronusAI
PatronusAI
9 months
We're excited to introduce the BLUR Leaderboard on @huggingface 🔥 Earlier today, we open sourced BLUR: the first agent benchmark for tip-of-the-tongue search and reasoning. It measures how effectively agents can help you identify something you vaguely remember, but can’t
huggingface.co
2
11
43
@PatronusAI
PatronusAI
9 months
1/ Ever tried to remember the name of a movie you’ve seen – you can picture the scenes clearly, but the movie name won’t come to you? Introducing BLUR: the first agent benchmark for tip-of-the-tongue search and reasoning 🔥 We benchmarked SOTA agents and found that the
1
6
46
@getdarshan
Darshan Deshpande
1 year
While experimenting with alignment methods, we observed that APO was more robust to noise in synthetic training data compared to DPO or KTO. Thanks for the excellent contribution to the community, @KarelDoostrlnck and team 🚀
@KarelDoostrlnck
Karel
1 year
Happy to see @PatronusAI use our Anchored Preference Optimization (APO) objective in their study!
1
1
6
@getdarshan
Darshan Deshpande
1 year
I'm calling it right now - distilling reasoning chains is going to be the next big thing! ⛏️ @OpenAI #OpenAi #o3
0
0
1
@getdarshan
Darshan Deshpande
1 year
I am excited to announce the release of our Glider model - small size, multi-metric evals, explainable highlight spans, multilingual generalization, amazing subjective metric performance - check it out!! Paper:
@PatronusAI
PatronusAI
1 year
1/ Introducing Glider - the smallest model to beat GPT-4o-mini on eval tasks ⚡🚀 - Open source, open weights, open code - Explainable evaluations by nature - Trained on 183 criteria and 685 domains Try it out for free at https://t.co/ZZai84VulJ 🔥
0
0
1
@PatronusAI
PatronusAI
1 year
1/ Introducing Lynx v2.0: an 8B State-of-the-Art RAG hallucination detection model 🚀 - Beats Claude-3.5-Sonnet on HaluBench by 2.2% - 3.4% higher accuracy than Lynx v1.1 on HaluBench - Optimized for long context use cases - Detects 8 types of common hallucinations, including
3
10
22
@getdarshan
Darshan Deshpande
1 year
Hey everyone, I am at #EMNLP2024 this week, co-presenting our work on Prototype-based Networks with @ZSourati. Please reach out if you are interested in AI evaluations, interpretability, or model alignment!
0
0
1
@NECLabsEU
NEC Laboratories Europe
1 year
Prototype-based networks can greatly enhance the robustness of #languagemodels in text classification, addressing real-world needs by combining robustness & interpretability for #trustworthyAI. Learn how in our Findings of #EMNLP24 accepted paper. https://t.co/fszqgkgXQy #NECLabs
0
4
7
@PatronusAI
PatronusAI
1 year
Llama Guard is Off Duty 😲 It’s weak at toxicity detection! We benchmarked popular toxicity datasets spanning languages like Portuguese, Ukrainian, and Turkish, and found that Llama Guard has a very high false negative rate for toxic content! We found that base models like
1
2
18