Avi Caciularu
@clu_avi
576 Followers · 1K Following · 13 Media · 280 Statuses
Research Scientist @GoogleAI | previously ML & NLP PhD student @biunlp, intern at @allen_ai, @Microsoft, @AIatMeta.
Joined July 2009
🚨 New Paper 🚨 Are current LLMs up to the task of following *complex* instructions over content-rich text? Our new dataset, TACT, sheds some light on this challenge. How does it work? https://t.co/4u3iTC087B Work by @GoogleAI & @GoogleDeepMind 👇🧵
2 replies · 41 reposts · 106 likes
Save the date for ISCOL'25! The conference will be held on December 18th at Bar-Ilan University. The call for papers is now live on our website:
iscol-meeting.github.io
Join ISCOL 2025 on December 18th at Bar-Ilan University to exchange ideas in Computational Linguistics and NLP across academia and industry in Israel.
0 replies · 7 reposts · 30 likes
Today, we are announcing the latest advancements to Google Earth AI — a platform designed to unlock a new level of planetary understanding. This includes new research on Geospatial Reasoning and expanded access to our specialized models. 🧵↓
14 replies · 85 reposts · 702 likes
Excited to present this at #EMNLP2025 in just over a month! It turns out that even flagship models like GPT-5 still struggle at faithfully expressing uncertainty 🤔 📊 Full results for the newest models are now live👇 https://t.co/CAqplLQ5s1
arxiv.org
A critical component in the trustworthiness of LLMs is reliable uncertainty communication, yet LLMs often use assertive language when conveying false claims, leading to over-reliance and eroded...
🎉 Delighted to announce that MetaFaith has been accepted to #EMNLP2025 Main! In this work we systematically study how well LLMs can express their internal uncertainty in words, offering a metacognition-inspired way to improve this ability 🧠✨ Check out more details below!👇
0 replies · 1 repost · 6 likes
🎉 Delighted to announce that MetaFaith has been accepted to #EMNLP2025 Main! In this work we systematically study how well LLMs can express their internal uncertainty in words, offering a metacognition-inspired way to improve this ability 🧠✨ Check out more details below!👇
🔥 Excited to share MetaFaith: Understanding and Improving Faithful Natural Language Uncertainty Expression in LLMs🔥 How can we make LLMs talk about uncertainty in a way that truly reflects what they internally "know"? Check out our new preprint to find out! Details in 🧵(1/n):
0 replies · 2 reposts · 11 likes
Many factual QA benchmarks have become saturated, yet factuality still poses a very real issue! ✨We present MoNaCo, an Ai2 benchmark of human-written time-consuming questions that, on average, require 43.3 documents per question!✨ 📣Blogpost: https://t.co/GQD83gdHgg 🧵(1/5)
1 reply · 15 reposts · 41 likes
LLMs power research, decision-making, and exploration, but most benchmarks don't test how well they stitch together evidence across dozens (or hundreds) of sources. Meet MoNaCo, our new question-answering eval for cross-source reasoning. 👇
10 replies · 39 reposts · 226 likes
Producing reasoning texts boosts the capabilities of AI models, but do we humans correctly understand these texts? Our latest research suggests that we do not. This highlights a new angle on the "Are they transparent?" debate: they might be, but we misinterpret them. 🧵
8 replies · 29 reposts · 141 likes
I will be presenting our work 𝗠𝗗𝗖𝘂𝗿𝗲 at #ACL2025NLP in Vienna this week! 🇦🇹 Come by if you’re interested in multi-doc reasoning and/or scalable creation of high-quality post-training data! 📍 Poster Session 4 @ Hall 4/5 🗓️ Wed, July 30 | 11-12:30 🔗
aclanthology.org
Gabrielle Kaili-May Liu, Bowen Shi, Avi Caciularu, Idan Szpektor, Arman Cohan. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2025.
🔥Thrilled to introduce MDCure: A Scalable Pipeline for Multi-Document Instruction-Following 🔥 How can we systematically and scalably improve LLMs' ability to handle complex multi-document tasks? Check out our new preprint to find out! Details in 🧵 (1/n):
1 reply · 5 reposts · 26 likes
This new benchmark created by @valentina__py should be the new default replacing IFEval. Some of the best frontier models get <50%, and it comes with separate training prompts so people don't effectively train on test. Wild gap from o3 to Gemini 2.5 Pro of like 30 points.
Introducing IFBench, a benchmark to measure how well AI models follow new, challenging, and diverse verifiable instructions. Top models like Gemini 2.5 Pro or Claude 4 Sonnet are only able to score up to 50%, presenting an open frontier for post-training. 🧵
10 replies · 24 reposts · 197 likes
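To make "verifiable instructions" concrete: each instruction ships with a small program that checks compliance deterministically, so scoring needs no human or LLM judge. A minimal sketch in Python; the specific rules below are illustrative inventions, not IFBench's actual instruction set.

```python
import re

# Hypothetical verifiable constraints in the spirit of IFEval/IFBench:
# every instruction has a deterministic checker.

def check_word_count(response: str, max_words: int = 50) -> bool:
    """Instruction: 'answer in at most 50 words'."""
    return len(response.split()) <= max_words

def check_bullet_count(response: str, n: int = 3) -> bool:
    """Instruction: 'use exactly three bullet points'."""
    return len(re.findall(r"^- ", response, flags=re.MULTILINE)) == n

def check_no_commas(response: str) -> bool:
    """Instruction: 'do not use any commas'."""
    return "," not in response

response = "- first point\n- second point\n- third point"
results = [check_word_count(response), check_bullet_count(response), check_no_commas(response)]
print(f"passed {sum(results)}/{len(results)} instructions")
```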
Excited for the release of SciArena with @allen_ai! LLMs are now an integral part of research workflows, and SciArena helps measure progress on scientific literature tasks. Also checkout the preprint for a lot more results/analyses. Led by: @YilunZhao_NLP, @kaiyan_z 📄 paper:
Introducing SciArena, a platform for benchmarking models across scientific literature tasks. Inspired by Chatbot Arena, SciArena applies a crowdsourced LLM evaluation approach to the scientific domain. 🧵
1 reply · 12 reposts · 82 likes
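For context on the Chatbot-Arena-style setup mentioned above: models are ranked from crowdsourced pairwise preference votes, typically via Elo-style updates. A minimal sketch of that rating mechanism; the constants and vote data are illustrative, not SciArena's actual parameters.

```python
from collections import defaultdict

def elo_update(r_a, r_b, a_won, k=32):
    """One rating update from a single pairwise preference vote."""
    expect_a = 1 / (1 + 10 ** ((r_b - r_a) / 400))
    score = 1.0 if a_won else 0.0
    delta = k * (score - expect_a)
    return r_a + delta, r_b - delta

ratings = defaultdict(lambda: 1000.0)
votes = [  # (model shown as A, model shown as B, did A win?)
    ("model_x", "model_y", True),
    ("model_y", "model_z", True),
    ("model_x", "model_z", True),
]
for a, b, a_won in votes:
    ratings[a], ratings[b] = elo_update(ratings[a], ratings[b], a_won)
print({m: round(r, 1) for m, r in ratings.items()})
```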
Gemini 2.5 Pro + 2.5 Flash are now stable and generally available. Plus, get a preview of Gemini 2.5 Flash-Lite, our fastest + most cost-efficient 2.5 model yet. 🔦 Exciting steps as we expand our 2.5 series of hybrid reasoning models that deliver amazing performance at the…
259 replies · 464 reposts · 4K likes
🚨 RAG is a popular approach, but what happens when the retrieved sources provide conflicting information?🤔 We're excited to introduce our paper: “DRAGged into CONFLICTS: Detecting and Addressing Conflicting Sources in Search-Augmented LLMs”🚀 A thread 🧵👇
2 replies · 14 reposts · 36 likes
🔥 Excited to share MetaFaith: Understanding and Improving Faithful Natural Language Uncertainty Expression in LLMs🔥 How can we make LLMs talk about uncertainty in a way that truly reflects what they internally "know"? Check out our new preprint to find out! Details in 🧵(1/n):
2 replies · 4 reposts · 13 likes
🚨 Introducing LAQuer, accepted to #ACL2025 (main conf)! LAQuer provides more granular attribution for LLM generations: users can just highlight any output fact (top) and get the supporting input snippet (bottom). This reduces the amount of text the user has to read by 2…
3 replies · 36 reposts · 84 likes
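The interaction described above (highlight an output fact, get back the supporting input snippet) can be approximated by matching the highlighted span against source sentences. The lexical-overlap heuristic below is only a crude stand-in for LAQuer's actual attribution method:

```python
def attribute_span(output_span: str, source_sentences: list[str]) -> str:
    """Return the source sentence sharing the most words with the
    highlighted output span -- a crude proxy for fine-grained attribution."""
    span_words = set(output_span.lower().split())
    return max(source_sentences,
               key=lambda s: len(span_words & set(s.lower().split())))

sources = [
    "The model was trained on 2 trillion tokens of web text.",
    "Evaluation covered five question-answering benchmarks.",
    "Training took 21 days on 1,024 accelerators.",
]
print(attribute_span("trained on 2 trillion tokens", sources))
```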
RefVNLI: Towards Scalable Evaluation of Subject-driven Text-to-image Generation
1 reply · 52 reposts · 135 likes
Wanna check how well a model can share knowledge between languages? Of course you do! 🤩 But can you do it without access to the model’s weights? Now you can with ECLeKTic 🤯
1 reply · 16 reposts · 43 likes
New #ICLR2025 paper! The KoLMogorov Test: can CodeLMs compress data by code generation? The optimal compression for a sequence is the shortest program that generates it. Empirically, LMs struggle even on simple sequences, but can be trained to outperform current methods! 🧵1/7
8 replies · 47 reposts · 292 likes
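To make the premise concrete: the Kolmogorov complexity of a sequence is the length of the shortest program that outputs it, so a program that exploits the sequence's structure acts as a compressor. A toy illustration of the idea, not the paper's actual protocol:

```python
sequence = list(range(1, 101)) * 2  # 200 numbers with an obvious pattern

# Baseline "compression": a program that stores the data verbatim.
verbatim_program = f"out = {sequence}"

# A much shorter program that regenerates the sequence from its structure.
short_program = "out = list(range(1, 101)) * 2"

for program in (verbatim_program, short_program):
    scope = {}
    exec(program, scope)             # run the candidate decompressor
    assert scope["out"] == sequence  # it must reproduce the data exactly
    print(f"{len(program)} chars")   # program length = compressed size
```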
How can we interpret LLM features at scale? 🤔 Current pipelines use activating inputs, which is costly and ignores how features causally affect model outputs! We propose efficient output-centric methods that better predict how steering a feature will affect model outputs. New…
6 replies · 27 reposts · 114 likes
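The contrast drawn above: input-centric pipelines describe a feature by the inputs that activate it, while output-centric methods steer the feature and watch how the output distribution shifts. A toy numpy sketch of steering along a feature direction; the shapes, scale, and random "model" are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, vocab = 16, 5

hidden = rng.normal(size=d_model)            # a residual-stream activation
feature = rng.normal(size=d_model)
feature /= np.linalg.norm(feature)           # unit-norm feature direction
unembed = rng.normal(size=(d_model, vocab))  # toy unembedding matrix

def next_token_probs(h: np.ndarray) -> np.ndarray:
    logits = h @ unembed
    exp = np.exp(logits - logits.max())      # stable softmax
    return exp / exp.sum()

def steer(h: np.ndarray, direction: np.ndarray, alpha: float) -> np.ndarray:
    """Add the feature direction to the activation with strength alpha."""
    return h + alpha * direction

print("before steering:", next_token_probs(hidden).round(3))
print("after steering: ", next_token_probs(steer(hidden, feature, 4.0)).round(3))
```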
Today we published FACTS Grounding, a benchmark and leaderboard for evaluating the factuality of LLMs when grounding to the input context. The leaderboard is on Kaggle and we plan to maintain it and track progress. https://t.co/kqQvasZ57n
https://t.co/C5G3aicRR8
deepmind.google
Our comprehensive benchmark and online leaderboard offer a much-needed measure of how accurately LLMs ground their responses in provided source material and avoid hallucinations
1 reply · 8 reposts · 26 likes
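The task behind the benchmark: decide, claim by claim, whether a response is supported by the provided context. The benchmark's actual scoring is more sophisticated; the word-overlap heuristic below is only a crude stand-in to make the setup concrete.

```python
import string

def is_grounded(sentence: str, context: str, threshold: float = 0.7) -> bool:
    """Crude support check: the fraction of the sentence's longer words
    that also appear verbatim in the context."""
    words = [w.strip(string.punctuation) for w in sentence.lower().split()]
    content = [w for w in words if len(w) > 3]
    if not content:
        return True
    hits = sum(w in context.lower() for w in content)
    return hits / len(content) >= threshold

context = "The launch happened in 2024 and reached 120 countries."
response = ["The launch happened in 2024.", "It was universally praised."]
for sent in response:
    print(sent, "->", "grounded" if is_grounded(sent, context) else "unsupported")
```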