Avi Caciularu Profile
Avi Caciularu

@clu_avi

Followers: 576 · Following: 1K · Media: 13 · Statuses: 280

Research Scientist @GoogleAI | previously ML & NLP PhD student @biunlp, intern at @allen_ai, @Microsoft, @AIatMeta.

Joined July 2009
@clu_avi
Avi Caciularu
1 year
🚨 New Paper 🚨 Are current LLMs up to the task of solving *complex* instructions based on content-rich text? Our new dataset, TACT, sheds some light on this challenge. How does it work? https://t.co/4u3iTC087B Work by @GoogleAI & @GoogleDeepMind 👇🧵
2
41
106
@iscol_meeting
ISCOL 2025
18 days
Save the date for ISCOL'25! The conference will be held on December 18th at Bar-Ilan University. The call for papers is now live on our website:
iscol-meeting.github.io
Join ISCOL 2025 on December 18th at Bar-Ilan University for exchanging ideas in Computational Linguistics and NLP across academia and industry in Israel.
0
7
30
@GoogleResearch
Google Research
21 days
Today, we are announcing the latest advancements to Google Earth AI — a platform designed to unlock a new level of planetary understanding. This includes new research on Geospatial Reasoning and expanded access to our specialized models. 🧵↓
14
85
702
@pybeebee
Gabrielle Kaili-May Liu
1 month
Excited to present this at #EMNLP2025 in just over a month! It turns out that even flagship models like GPT-5 still struggle at faithfully expressing uncertainty 🤔 📊 Full results for the newest models are now live👇 https://t.co/CAqplLQ5s1
arxiv.org
A critical component in the trustworthiness of LLMs is reliable uncertainty communication, yet LLMs often use assertive language when conveying false claims, leading to over-reliance and eroded...
@pybeebee
Gabrielle Kaili-May Liu
3 months
🎉 Delighted to announce that MetaFaith has been accepted to #EMNLP2025 Main! In this work we systematically study how well LLMs can express their internal uncertainty in words, offering a metacognition-inspired way to improve this ability 🧠✨ Check out more details below!👇
0
1
6
@pybeebee
Gabrielle Kaili-May Liu
5 months
🔥 Excited to share MetaFaith: Understanding and Improving Faithful Natural Language Uncertainty Expression in LLMs🔥 How can we make LLMs talk about uncertainty in a way that truly reflects what they internally "know"? Check out our new preprint to find out! Details in 🧵(1/n):
0
2
11
@TomerWolfson
Tomer Wolfson
3 months
Many factual QA benchmarks have become saturated, yet factuality still poses a very real issue! ✨We present MoNaCo, an Ai2 benchmark of human-written time-consuming questions that, on average, require 43.3 documents per question!✨ 📣Blogpost: https://t.co/GQD83gdHgg 🧵(1/5)
1
15
41
@allen_ai
Ai2
3 months
LLMs power research, decision‑making, and exploration—but most benchmarks don’t test how well they stitch together evidence across dozens (or hundreds) of sources. Meet MoNaCo, our new eval for cross‑source reasoning in question answering. 👇
10
39
226
@mosh_levy
Mosh Levy
3 months
Producing reasoning texts boosts the capabilities of AI models, but do we humans correctly understand these texts? Our latest research suggests that we do not. This highlights a new angle on the "Are they transparent?" debate: they might be, but we misinterpret them. 🧵
8
29
141
@pybeebee
Gabrielle Kaili-May Liu
4 months
I will be presenting our work 𝗠𝗗𝗖𝘂𝗿𝗲 at #ACL2025NLP in Vienna this week! 🇦🇹 Come by if you’re interested in multi-doc reasoning and/or scalable creation of high-quality post-training data! 📍 Poster Session 4 @ Hall 4/5 🗓️ Wed, July 30 | 11-12:30 🔗
aclanthology.org
Gabrielle Kaili-May Liu, Bowen Shi, Avi Caciularu, Idan Szpektor, Arman Cohan. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2025.
@pybeebee
Gabrielle Kaili-May Liu
1 year
🔥Thrilled to introduce MDCure: A Scalable Pipeline for Multi-Document Instruction-Following 🔥 How can we systematically and scalably improve LLMs' ability to handle complex multi-document tasks? Check out our new preprint to find out! Details in 🧵 (1/n):
1
5
26
@natolambert
Nathan Lambert
4 months
This new benchmark created by @valentina__py should be the new default replacing IFEval. Some of the best frontier models get <50% and it comes with separate training prompts so people don’t effectively train on test. Wild gap from o3 to Gemini 2.5 pro of like 30 points.
@allen_ai
Ai2
4 months
Introducing IFBench, a benchmark to measure how well AI models follow new, challenging, and diverse verifiable instructions. Top models like Gemini 2.5 Pro or Claude 4 Sonnet are only able to score up to 50%, presenting an open frontier for post-training. 🧵
10
24
197
@armancohan
Arman Cohan
5 months
Excited for the release of SciArena with @allen_ai! LLMs are now an integral part of research workflows, and SciArena helps measure progress on scientific literature tasks. Also check out the preprint for a lot more results/analyses. Led by: @YilunZhao_NLP, @kaiyan_z 📄 paper:
@allen_ai
Ai2
5 months
Introducing SciArena, a platform for benchmarking models across scientific literature tasks. Inspired by Chatbot Arena, SciArena applies a crowdsourced LLM evaluation approach to the scientific domain. 🧵
1
12
82
@sundarpichai
Sundar Pichai
5 months
Gemini 2.5 Pro + 2.5 Flash are now stable and generally available. Plus, get a preview of Gemini 2.5 Flash-Lite, our fastest + most cost-efficient 2.5 model yet. 🔦 Exciting steps as we expand our 2.5 series of hybrid reasoning models that deliver amazing performance at the…
259
464
4K
@ArieCattan
Arie Cattan
5 months
🚨 RAG is a popular approach but what happens when the retrieved sources provide conflicting information?🤔 We're excited to introduce our paper: “DRAGged into CONFLICTS: Detecting and Addressing Conflicting Sources in Search-Augmented LLMs”🚀 A thread 🧵👇
2
14
36
@pybeebee
Gabrielle Kaili-May Liu
5 months
🔥 Excited to share MetaFaith: Understanding and Improving Faithful Natural Language Uncertainty Expression in LLMs🔥 How can we make LLMs talk about uncertainty in a way that truly reflects what they internally "know"? Check out our new preprint to find out! Details in 🧵(1/n):
2
4
13
@hirscheran
Eran Hirsch
5 months
🚨 Introducing LAQuer, accepted to #ACL2025 (main conf)! LAQuer provides more granular attribution for LLM generations: users can highlight any output fact (top) and get the attributing input snippet (bottom). This reduces the amount of text the user has to read by 2…
3
36
84
@_akhaliq
AK
7 months
RefVNLI: Towards Scalable Evaluation of Subject-driven Text-to-image Generation
1
52
135
@omerNLP
omer goldman
8 months
Wanna check how well a model can share knowledge between languages? Of course you do! 🤩 But can you do it without access to the model’s weights? Now you can with ECLeKTic 🤯
1
16
43
@OriYoran
Ori Yoran
8 months
New #ICLR2025 paper! The KoLMogorov Test: can CodeLMs compress data by code generation? The optimal compression for a sequence is the shortest program that generates it. Empirically, LMs struggle even on simple sequences, but can be trained to outperform current methods! 🧵1/7
8
47
292
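The core idea in the tweet above (a sequence's best compression is the shortest program that regenerates it) can be made concrete with a toy sketch. Everything below is an illustrative assumption, not material from the paper: a repetitive sequence, one hand-written candidate Python program, and a size comparison between the program's source and the raw data.

```python
import io
import contextlib

# Toy illustration of compression by code generation (Kolmogorov-style):
# a sequence is "compressed" into a short program that reproduces it.
# The sequence and candidate program are illustrative assumptions, not
# examples from the KoLMogorov Test paper.
sequence = [1, 2, 3] * 100  # 300 numbers with an obvious repeating pattern

# A candidate "compression": source code that regenerates the sequence.
program = "print([1, 2, 3] * 100)"

raw_size = len(str(sequence))  # bytes to store the data verbatim
code_size = len(program)       # bytes to store the program instead

# A program only counts as a compression if it reproduces the data exactly,
# so run it and check its output against the original sequence.
buf = io.StringIO()
with contextlib.redirect_stdout(buf):
    exec(program)
assert buf.getvalue().strip() == str(sequence)

print(f"raw: {raw_size} bytes, program: {code_size} bytes")
# The shortest correct program's length is the sequence's Kolmogorov
# complexity; the tweet's claim is that code LMs struggle to find short
# generating programs even for simple sequences like this one.
```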
@megamor2
Mor Geva
10 months
How can we interpret LLM features at scale? 🤔 Current pipelines use activating inputs, which is costly and ignores how features causally affect model outputs! We propose efficient output-centric methods that better predict how steering a feature will affect model outputs. New…
6
27
114
@clu_avi
Avi Caciularu
11 months
🤔🤔🤔
@OfficialLoganK
Logan Kilpatrick
11 months
Just when you thought it was over... we’re introducing Gemini 2.0 Flash Thinking, a new experimental model that unlocks stronger reasoning capabilities and shows its thoughts. The model plans (with thoughts visible), can solve complex problems with Flash speeds, and more 🧵
0
0
0
@goldshtn
Sasha Goldshtein
11 months
Today we published FACTS Grounding, a benchmark and leaderboard for evaluating the factuality of LLMs when grounding to the input context. The leaderboard is on Kaggle and we plan to maintain it and track progress. https://t.co/kqQvasZ57n https://t.co/C5G3aicRR8
deepmind.google
Our comprehensive benchmark and online leaderboard offer a much-needed measure of how accurately LLMs ground their responses in provided source material and avoid hallucinations
1
8
26