
Amit Sharma
@amt_shrma
4K Followers · 2K Following · 54 Media · 1K Statuses
Researcher @MSFTResearch. Co-founder pywhy/dowhy. Work on causality & machine learning. Searching for a path to causal AI https://t.co/tn9kMAmlKw
Bengaluru
Joined October 2010
New paper: On the unreasonable effectiveness of LLMs for causal inference. GPT4 achieves new SoTA on a wide range of causal tasks: graph discovery (97%, 13 pts gain), counterfactual reasoning (92%, 20 pts gain) & actual causality. How is this possible? 🧵
arxiv.org
The causal capabilities of large language models (LLMs) are a matter of significant debate, with critical implications for the use of LLMs in societally impactful domains such as medicine,...
Overall, the benchmark is challenging & motivates improvements to the search and reasoning process. Joint w/ @naga86 @abhinav_java, @ashmitkx, Sukruta @MSFTResearch. Big Q: What would be the key to improving performance further?
huggingface.co
We also quantify the search, branching, and backtracking behavior using models' reasoning traces. Among DR models, the OpenAI model tends to have the highest number of branching and backtracking events. The number of searches averages between 20 and 40 per query.
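The paper's actual trace format is not shown here, so the bracketed event markers below are an assumption; this is a minimal sketch of how search, branching, and backtracking events could be counted from a reasoning trace:

```python
import re

def count_events(trace: str) -> dict:
    """Count occurrences of (hypothetical) event markers in a reasoning trace.

    Assumes events appear as literal tokens like [SEARCH], [BRANCH],
    [BACKTRACK]; real provider traces will need their own parsers.
    """
    return {
        kind: len(re.findall(rf"\[{kind}\]", trace))
        for kind in ("SEARCH", "BRANCH", "BACKTRACK")
    }

trace = "[SEARCH] q1 [BRANCH] [SEARCH] q2 [BACKTRACK] [SEARCH] q3"
print(count_events(trace))  # {'SEARCH': 3, 'BRANCH': 1, 'BACKTRACK': 1}
```

Aggregating these per-query counts across a model's runs gives the 20–40 searches-per-query statistic reported above.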
Finally, if you are curious, the OpenAI DR model performs best among SoTA models, with an F1 of 0.55, followed by Perplexity DR. Both needle-in-a-haystack and broad search tasks are challenging, the hardest being materials identification. Non-reasoning models are unable to do well.
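For intuition on what an F1 of 0.55 means at the claim level, here is a minimal sketch of claim-level F1 under a simplifying assumption of exact-match claims (the paper's actual matching procedure may be softer, e.g. LLM-judged):

```python
def claim_f1(predicted: set, reference: set) -> float:
    """F1 = harmonic mean of precision and recall over matched claims."""
    if not predicted or not reference:
        return 0.0
    tp = len(predicted & reference)  # claims the model got right
    if tp == 0:
        return 0.0
    precision = tp / len(predicted)
    recall = tp / len(reference)
    return 2 * precision * recall / (precision + recall)

# 5 predicted claims, 6 reference claims, 3 correct:
# precision = 0.6, recall = 0.5
print(round(claim_f1({"a", "b", "c", "d", "e"},
                     {"a", "b", "c", "x", "y", "z"}), 3))  # 0.545
```

So an F1 near 0.55 roughly corresponds to recovering about half the reference claims while keeping a comparable precision.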
Other tasks include prior art search--is a given idea novel?--and finding datasets that satisfy certain properties, another useful task for scientists. We also have general interest tasks such as cultural awards & flight incidents. Each task includes output claims for evaluation.
For example, a reasoning problem may ask for properties of a material mentioned in a scientific article. We invert the problem to ask: which material has exactly these properties? With some work, the properties can be extended so that the material(s) can be uniquely identified.
Our key idea: problem inversion. Take an existing long-context or document reasoning problem and invert it, turning its answer into the question! Now the task is to search the web to find this info. This allows easy addition of new DR problems: the 1st live benchmark for DR.
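A minimal sketch of the inversion idea, with a hypothetical helper and data format (the benchmark's real schema and pipeline are not shown in this thread):

```python
def invert_problem(document_qa: dict) -> dict:
    """Turn a document-grounded QA pair into a web-search (deep research) task.

    Hypothetical: the original problem asks about a known document; the
    inverted problem states the answer's identifying properties and asks
    the model to rediscover the entity from the open web.
    """
    return {
        "query": f"Find the entity such that: {document_qa['answer_properties']}",
        "reference_claims": [document_qa["answer"]],
    }

original = {
    "answer": "Graphene",
    "answer_properties": "a 2D carbon allotrope with very high tensile strength",
}
print(invert_problem(original)["query"])
```

Because new documents (and hence new QA pairs) appear continuously, inversion gives a steady supply of fresh problems, which is what makes the benchmark "live".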
But there’s one issue. Whether a task is DR depends on the corpus. “Oscar movies from books by women authors” may be a DR query, but if a webpage comes up providing exactly this info, it no longer requires research. So how to benchmark DR models as the web continually updates?
Thus, a deep research task can be defined as a <query, claims> tuple. "Claims" can be a nested list, with subclaims supporting each claim. Key Insight: As long as a model can generate the claims, writing the report is a long-form generation task that can be evaluated separately.
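The <query, claims> tuple with nested subclaims can be sketched as a simple data structure; the class and field names here are illustrative, not the benchmark's actual schema:

```python
from dataclasses import dataclass, field

@dataclass
class Claim:
    text: str
    subclaims: list["Claim"] = field(default_factory=list)  # nested support

@dataclass
class DeepResearchTask:
    query: str
    claims: list[Claim]  # reference claims a correct report must contain

def flatten(claims):
    """Depth-first traversal of a nested claim list."""
    for c in claims:
        yield c.text
        yield from flatten(c.subclaims)

task = DeepResearchTask(
    query="Which Oscar-winning films were adapted from books by women authors?",
    claims=[
        Claim(
            "Film X won Best Picture",
            [Claim("Film X is adapted from a novel by Author Y")],
        )
    ],
)
print(list(flatten(task.claims)))
```

Evaluation then reduces to matching a model's generated claims against this flattened reference set, independently of how the final report is written up.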
Another perspective: Deep research is like an extreme form of multi-hop QA. Some problems require intensive searching, some demand deep reasoning. Deep research combines both, corresponding to important scientific & business tasks, e.g., material identification & prior art search.
Our core thesis: The defining element of deep research is not the report, but the *information synthesis* process used to generate the claims within a report. And we show how the claim synthesis process can be objectively evaluated. LiveDRBench:
Deep research has emerged as a popular task with many recently released models. But beyond lengthy reports, what exactly defines the task? And how to quantify progress? [New Paper!] We provide an objective defn. centered on claim discovery & a 100-problem benchmark spanning…
RT @abhinav_java: 🚀 Meet FrugalRAG at #ICML2025 in Vancouver 🇨🇦!
📍 July 18 – VecDB Workshop, West 208–209
📍 July 19 – ES-FoMO Workshop, Eas…
RT @AniketVashisht8: Extremely happy to have our work on Teaching Transformers Causal Reasoning through Axiomatic Training accepted at ICML….
RT @sirbayes: @amt_shrma Sounds very cool. Here is link to paper (hard to find since it seems to be a TMLR paper, not an official ICLR pape….
PywhyLLM: Creating an API for language models to interact with causal methods and vice versa. v0.1 out, welcome feedback. If you are at #iclr2025, come check out our poster today at 10am-12:30pm.
What changes for causality research in the age of LLMs and what does not? Enjoyed this conversation with Alex Molak on how LLMs are accelerating causal discovery, how diverse environments can help causal agents learn, and how causality is critical for verifying AI actions. Link 👇
Job Alert: @MSFTResearch India is hiring postdocs! A chance to work with some amazing colleagues while doing world-class research. Apply here: DM me if interested in ML/reasoning/causality.
Excited to present Axiomatic Training at #NeurIPS2024, a new paradigm to teach causal reasoning to language models! I try to summarize what LLM systems can do today and what new training paradigms we need to improve their causal reasoning. Slides:
RT @CaLM_Workshop: We are happy 😁 to announce 📢 the First Workshop on Causality and Large Models (C♥️LM) at #NeurIPS2024. 📜 Submission dea…