
taesiri (@taesiri)
889 Followers · 62K Following · 144 Media · 581 Statuses
Research Scientist @ EA Sports, VLMs, Evals. All opinions are my own.
Planet Mars · Joined February 2017
OpenAI’s Sora 2 is, without a doubt, the most impressive video generation model right now. As always, I had to test it with some of my own evals. Sora 2 still struggles to fill a container with gas or smoke without clipping through the container’s boundaries. (Veo 3 also has…
Excited to share that our paper VideoGameQA-Bench 🎮 has been accepted to the #NeurIPS2025 Datasets & Benchmarks Track! 🎉 We introduce a benchmark for video game quality assurance comprising 9 tasks, including visual unit testing, visual regression, needle-in-a-haystack,
@yuyinzhou_cs @NeurIPSConf We have a paper in the same situation. AC: Yes! PC: No no. @NeurIPSConf please consider whether the 1st author is a student and whether this would be their first top-tier paper BEFORE making such a cut. That would be healthier for junior researchers. OR use a Findings track.
Preserving details was a major challenge across SOTA AI editors (SeedEdit, GPT-4o, Gemini 2.0, and HuggingFace models) in our recent study. AIs either accidentally alter or degrade the background/identity, or they enhance aesthetics (when not requested). https://t.co/pwxmUnrxwY
psrdataset.github.io
Project page for the paper: Understanding Generative AI Capabilities in Everyday Image Editing Tasks
With all the attention on the new 🍌 Nano Banana, we (with @brandon_co92810) stress-tested it with our "shirt-color chain" benchmark: 26 sequential edits where each output feeds into the next. Learn more about our study on image editing models here: https://t.co/2i4heiNQdN
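For anyone who wants to replicate the chained-edit setup, here is a minimal sketch of the protocol: each edited output is fed back in as the input to the next edit. The `edit_image` helper and the color list are hypothetical stand-ins for whatever editing model is under test, not the benchmark's actual code.

```python
from PIL import Image

def edit_image(image: Image.Image, instruction: str) -> Image.Image:
    """Hypothetical wrapper around the image-editing model/API under test."""
    raise NotImplementedError("plug in the editor you want to stress-test")

COLORS = ["red", "blue", "green", "yellow", "purple"]  # illustrative palette

def shirt_color_chain(start: Image.Image, num_edits: int = 26) -> list[Image.Image]:
    """Run sequential edits where each output feeds into the next edit."""
    outputs, current = [], start
    for i in range(num_edits):
        color = COLORS[i % len(COLORS)]
        current = edit_image(current, f"Change the shirt color to {color}.")
        outputs.append(current)  # inspect these for identity/background drift
    return outputs
```

The failure mode to watch for is cumulative drift: small identity or background changes per edit compound over the 26 steps.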
Really great eval. LLMs are already great at next-token prediction in the training corpus, so the test for truth-seeking is now whether they can answer correctly when the correct next token is at odds with the next token in the most similar example in the training corpus.
Beautiful adversarial dataset playing exactly on the soft spot of VLMs.
🚨 Our latest work shows that SOTA VLMs (o3, o4-mini, Sonnet, Gemini Pro) fail at counting legs due to bias⁉️ See simple cases where VLMs get it wrong, no matter how you prompt them. 🧪 Think your VLM can do better? Try it yourself here: https://t.co/EDJdF3Vmpy 1/n #ICML2025
#GPT5 STILL shows severe confirmation bias, like prev SOTA models! 😜 Try it yourself (images, prompts avail in 1 click): https://t.co/S317wqrlju It's fast to test for such biases in images. Similar biases should still exist in non-image domains as well...
How do the best AI image editors 🤖 GPT-4o, Gemini 2.0, SeedEdit, HF 🤗 fare against ⚔️ human Photoshop wizards 🧙‍♀️ on text-based 🏞️ image editing? Logan @septisum and Brandon @brandon_co92810 shared some answers at our poster today! #CVPR2025
https://t.co/pwxmUnrxwY A few insights 👇
🧵 Vision Language Models are ⚠️ biased
Q: Count the legs of this animal? 🤖: 4 ❌
Same problem:
- w/ 5 best VLMs: GPT-4.1, o3, o4-mini, Gemini 2.5 Pro, Sonnet 3.7
- on 7 domains: animals, logos, flags, chess, boardgames, optical illusions
code, paper: https://t.co/S317wqrlju
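If you want to try the leg-counting probe yourself, a minimal sketch with the OpenAI Python client is below. The model name, image URL, and prompt wording here are illustrative assumptions, not the paper's exact setup; the official images and prompts are at the link above.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

IMAGE_URL = "https://example.com/five-legged-dog.png"  # hypothetical test image

# Biased VLMs tend to answer "4" for animal images regardless of what is shown.
response = client.chat.completions.create(
    model="gpt-4.1",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Count the legs of this animal. Reply with a number only."},
            {"type": "image_url", "image_url": {"url": IMAGE_URL}},
        ],
    }],
)
print(response.choices[0].message.content)
```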
Large language models often exhibit biases in single interactions. Allowing LLMs to observe prior responses in multi-turn conversations helps them reduce bias, especially for random answers. A new metric, B-score, effectively detects biases across different question types.
Asking GPT-4o for a random choice is an *easy* way to reveal its bias 🙃
Choose a random digit? ➡️ 7 (70% of the time❗️)
Biden vs. Trump? ➡️ Biden (100%❗️)
Male vs. Female? ➡️ Female (84%❗️)
Same story for many LLMs. Choice orders are randomized. 1/6 #icml2025
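A minimal sketch of this probe, assuming the OpenAI chat API: ask for a "random" digit many times and tally the answers. The prompt wording and trial count are illustrative, not the paper's exact protocol (which also randomizes option order for binary choices).

```python
from collections import Counter
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def random_digit_distribution(n_trials: int = 50) -> Counter:
    """Ask the model for a 'random' digit repeatedly and tally its answers."""
    counts = Counter()
    for _ in range(n_trials):
        resp = client.chat.completions.create(
            model="gpt-4o",
            temperature=1.0,
            messages=[{"role": "user",
                       "content": "Pick a random digit from 0 to 9. Reply with the digit only."}],
        )
        counts[resp.choices[0].message.content.strip()] += 1
    return counts

print(random_digit_distribution())  # a heavy skew toward one digit (e.g. 7) signals bias
```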
📢📢 More progress on ZeroBench! With the release of Claude 4 from @AnthropicAI, the SOTA pass@1 is now 4% 🔥
Claude Sonnet 3.7: 1%
Claude Sonnet 3.7 (Thinking): 3%
Claude Sonnet 4: 2%
Claude Sonnet 4 (Thinking): 3%
Claude Opus 4: 1%
Claude Opus 4 (Thinking): 4%
HoT prompting could help Claude 4 pay more attention to questions and avoid simple mistakes. Learn more about it here:
highlightedchainofthought.github.io
A novel prompting approach for having LLMs highlight their own answers
Claude 4 Sonnet:
❓ "What weighs more: 20 pounds of bricks or 20 feathers?"
❌ Input Question: "They weigh the same."
✅ HoT prompt: "20 pounds of bricks weighs more."
HoT tags key facts + reasons step by step → avoids shallow answers. Check HoT here: https://t.co/kly8gsQUpl
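Based only on the description above ("tags key facts + reasons step by step"), here is a rough sketch of what a HoT-style prompt could look like; the tag format and instructions are assumptions on my part, and the actual prompt is on the project page linked above.

```python
# Illustrative HoT-style prompt; the exact tags/wording used by HoT may differ.
QUESTION = "What weighs more: 20 pounds of bricks or 20 feathers?"

HOT_PROMPT = f"""First, restate the question and wrap its key facts in <fact>...</fact> tags.
Then reason step by step, referring back to the tagged facts.
Finally, give the answer on its own line.

Question: {QUESTION}"""

# Send HOT_PROMPT to the model instead of the raw question; in the example above
# this turns "They weigh the same." into "20 pounds of bricks weighs more."
print(HOT_PROMPT)
```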