
taesiri (@taesiri)
889 Followers · 62K Following · 144 Media · 581 Statuses
Research Scientist @ EA Sports, VLMs, Evals. All opinions are my own.
Planet Mars · Joined February 2017
OpenAI’s Sora 2 is, without a doubt, the most impressive video generation model right now. As always, I had to test it with some of my own evals. Sora 2 still struggles to fill a container with gas or smoke without clipping through the container’s boundaries. (Veo 3 also has…
Excited to share that our paper VideoGameQA-Bench 🎮 has been accepted to the #NeurIPS2025 Datasets & Benchmarks Track! 🎉 We introduce a benchmark for video game quality assurance comprising 9 tasks, including visual unit testing, visual regression, needle-in-a-haystack,
@yuyinzhou_cs @NeurIPSConf We have a paper in the same situation. AC: Yes! PC: No no. @NeurIPSConf please consider whether the 1st author is a student and whether this would be their first top-tier paper BEFORE making such a cut. That would be healthier for junior researchers. OR use a Findings track.
Preserving details was a major challenge across SOTA AI editors (SeedEdit, GPT-4o, Gemini 2.0, and HuggingFace models) in our recent study. AIs either accidentally alter or degrade the background/identity, or they enhance aesthetics (when not requested). https://t.co/pwxmUnrxwY
psrdataset.github.io
Project page for the paper: Understanding Generative AI Capabilities in Everyday Image Editing Tasks
With all the attention on the new 🍌 Nano Banana, we (with @brandon_co92810) stress-tested it with our "shirt-color chain" benchmark: 26 sequential edits where each output feeds into the next. Learn more about our study on image editing models here: https://t.co/2i4heiNQdN
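For anyone who wants to replicate the chained-edit setup, here is a minimal sketch of the protocol: each edited output is fed back in as the input to the next edit. The `edit_image` helper and the color list are hypothetical stand-ins for whatever editing model is under test, not the benchmark's actual code.

```python
from PIL import Image

def edit_image(image: Image.Image, instruction: str) -> Image.Image:
    """Hypothetical wrapper around the image-editing model/API under test."""
    raise NotImplementedError("plug in the editor you want to stress-test")

COLORS = ["red", "blue", "green", "yellow", "purple"]  # illustrative palette

def shirt_color_chain(start: Image.Image, num_edits: int = 26) -> list[Image.Image]:
    """Run sequential edits where each output feeds into the next edit."""
    outputs, current = [], start
    for i in range(num_edits):
        color = COLORS[i % len(COLORS)]
        current = edit_image(current, f"Change the shirt color to {color}.")
        outputs.append(current)  # inspect these for identity/background drift
    return outputs
```

The failure mode to watch for is cumulative drift: small identity or background changes per edit compound over the 26 steps.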
Really great eval. LLMs are already great at next-token prediction in the training corpus, so the test for truth-seeking is now whether they can answer correctly when the correct next token is at odds with the next token in the most similar example in the training corpus.
Beautiful adversarial dataset playing exactly on the soft spot of VLMs.
🚨 Our latest work shows that SOTA VLMs (o3, o4-mini, Sonnet, Gemini Pro) fail at counting legs due to bias⁉️ See simple cases where VLMs get it wrong, no matter how you prompt them. 🧪 Think your VLM can do better? Try it yourself here: https://t.co/EDJdF3Vmpy 1/n #ICML2025
#GPT5 STILL shows severe confirmation bias, like prev SOTA models! 😜 Try it yourself (images, prompts avail in 1 click): https://t.co/S317wqrlju It's fast to test for such biases in images. Similar biases should still exist in non-image domains as well...
How do the best AI image editors 🤖 GPT-4o, Gemini 2.0, SeedEdit, HF 🤗 fare against ⚔️ human Photoshop wizards 🧙‍♀️ on text-based 🏞️ image editing? Logan @septisum and Brandon @brandon_co92810 shared some answers at our poster today! #CVPR2025
https://t.co/pwxmUnrxwY A few insights 👇
🧵 Vision Language Models are ⚠️ biased
Q: Count the legs of this animal? 🤖: 4 ❌
Same problem:
- w/ 5 best VLMs: GPT-4.1, o3, o4-mini, Gemini 2.5 Pro, Sonnet 3.7
- on 7 domains: animals, logos, flags, chess, boardgames, optical illusions
code, paper: https://t.co/S317wqrlju
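If you want to try the leg-counting probe yourself, a minimal sketch with the OpenAI Python client is below. The model name, image URL, and prompt wording here are illustrative assumptions, not the paper's exact setup; the official images and prompts are at the link above.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

IMAGE_URL = "https://example.com/five-legged-dog.png"  # hypothetical test image

# Biased VLMs tend to answer "4" for animal images regardless of what is shown.
response = client.chat.completions.create(
    model="gpt-4.1",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Count the legs of this animal. Reply with a number only."},
            {"type": "image_url", "image_url": {"url": IMAGE_URL}},
        ],
    }],
)
print(response.choices[0].message.content)
```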
Large language models often exhibit biases in single interactions. Allowing LLMs to observe prior responses in multi-turn conversations helps them reduce bias, especially for random answers. A new metric, B-score, effectively detects biases across different question types.
Asking GPT-4o for a random choice is an *easy* way to reveal its bias 🙃
Choose a random digit? ➡️ 7 (70% of the time❗️)
Biden vs. Trump? ➡️ Biden (100%❗️)
Male vs. Female? ➡️ Female (84%❗️)
Same story for many LLMs. Choice orders are randomized. 1/6 #icml2025
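A minimal sketch of this probe, assuming the OpenAI chat API: ask for a "random" digit many times and tally the answers. The prompt wording and trial count are illustrative, not the paper's exact protocol (which also randomizes option order for binary choices).

```python
from collections import Counter
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def random_digit_distribution(n_trials: int = 50) -> Counter:
    """Ask the model for a 'random' digit repeatedly and tally its answers."""
    counts = Counter()
    for _ in range(n_trials):
        resp = client.chat.completions.create(
            model="gpt-4o",
            temperature=1.0,
            messages=[{"role": "user",
                       "content": "Pick a random digit from 0 to 9. Reply with the digit only."}],
        )
        counts[resp.choices[0].message.content.strip()] += 1
    return counts

print(random_digit_distribution())  # a heavy skew toward one digit (e.g. 7) signals bias
```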
📢📢 More progress on ZeroBench! With the release of Claude 4 from @AnthropicAI, the SOTA pass@1 is now 4% 🔥
Claude Sonnet 3.7: 1%
Claude Sonnet 3.7 (Thinking): 3%
Claude Sonnet 4: 2%
Claude Sonnet 4 (Thinking): 3%
Claude Opus 4: 1%
Claude Opus 4 (Thinking): 4%
HoT prompting could help Claude 4 pay more attention to questions and avoid simple mistakes. Learn more about it here:
highlightedchainofthought.github.io
A novel prompting approach for having LLMs highlight their own answers
Claude 4 Sonnet:
❓ "What weighs more: 20 pounds of bricks or 20 feathers?"
❌ Input Question: "They weigh the same."
✅ HoT prompt: "20 pounds of bricks weighs more."
HoT tags key facts + reasons step by step → avoids shallow answers. Check HoT here: https://t.co/kly8gsQUpl
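Based only on the description above ("tags key facts + reasons step by step"), here is a rough sketch of what a HoT-style prompt could look like; the tag format and instructions are assumptions on my part, and the actual prompt is on the project page linked above.

```python
# Illustrative HoT-style prompt; the exact tags/wording used by HoT may differ.
QUESTION = "What weighs more: 20 pounds of bricks or 20 feathers?"

HOT_PROMPT = f"""First, restate the question and wrap its key facts in <fact>...</fact> tags.
Then reason step by step, referring back to the tagged facts.
Finally, give the answer on its own line.

Question: {QUESTION}"""

# Send HOT_PROMPT to the model instead of the raw question; in the example above
# this turns "They weigh the same." into "20 pounds of bricks weighs more."
print(HOT_PROMPT)
```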