taesiri

@taesiri

Followers: 889
Following: 62K
Media: 144
Statuses: 581

Research Scientist @ EA Sports. VLMs, Evals. All opinions are my own.

Planet Mars
Joined February 2017
@taesiri
taesiri
16 days
OpenAI’s Sora 2 is, without a doubt, the most impressive video generation model right now. As always, I had to test it with some of my own evals. Sora 2 still struggles to fill a container with gas or smoke without clipping through the container’s boundaries. (Veo 3 also has …
0
0
9
@taesiri
taesiri
28 days
Excited to share that our paper VideoGameQA-Bench 🎮 has been accepted to the #NeurIPS2025 Datasets & Benchmarks Track! 🎉 We introduce a benchmark for video game quality assurance comprising 9 tasks, including visual unit testing, visual regression, needle-in-a-haystack, …
1
0
16
@anh_ng8
Anh Totti Nguyen
28 days
@yuyinzhou_cs @NeurIPSConf We have a paper in the same situation. AC: Yes! PC: No no. @NeurIPSConf please consider whether the 1st author is a student and whether this would be their first top-tier paper BEFORE making such a cut. Healthier for junior researchers. OR use a Findings track.
0
2
10
@anh_ng8
Anh Totti Nguyen
1 month
Preserving details was a major challenge across SOTA AI editors (SeedEdit, GPT-4o, Gemini 2.0, and HuggingFace models) in our recent study. AIs either accidentally alter or degrade the background/identity, or they enhance aesthetics (when not requested). https://t.co/pwxmUnrxwY
psrdataset.github.io
Project page for the paper: Understanding Generative AI Capabilities in Everyday Image Editing Tasks
@taesiri
taesiri
1 month
With all the attention on the new 🍌 Nano Banana, we (with @brandon_co92810) stress-tested it with our "shirt-color chain" benchmark: 26 sequential edits where each output feeds into the next. Learn more about our study on image editing models here: https://t.co/2i4heiNQdN
1
2
4
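A minimal sketch of how such a sequential edit chain can be driven, assuming a hypothetical edit_image() wrapper around whatever editing model is under test (an illustration, not the benchmark's actual harness):

from PIL import Image

def edit_image(image: Image.Image, instruction: str) -> Image.Image:
    # Hypothetical call to the image-editing model under test.
    raise NotImplementedError

def run_edit_chain(start: Image.Image, instructions: list[str]) -> list[Image.Image]:
    # Each output is fed back in as the next input, so small identity or
    # background drift compounds over the 26 steps.
    outputs, current = [], start
    for instruction in instructions:
        current = edit_image(current, instruction)
        outputs.append(current)
    return outputs

# e.g. 26 shirt-color edits (hypothetical palette, cycled)
colors = ["red", "blue", "green", "yellow"]
chain = [f"Change the shirt color to {colors[i % len(colors)]}." for i in range(26)]

Consistency can then be scored by comparing each output against the original image, which is where detail-preservation failures tend to show up.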
@mehran__jalali
Mehran Jalali
2 months
Really great eval. LLMs are already great at next-token prediction in the training corpus, so the test for truth-seeking is now whether they can answer correctly when the correct next token is at odds with the next token in the most similar example in the training corpus.
7
16
101
@yoavgo
(((ل()(ل() 'yoav))))👾
2 months
beautiful adversarial dataset playing exactly on the soft spot of VLMs.
@an_vo12
An Vo
3 months
🚨 Our latest work shows that SOTA VLMs (o3, o4-mini, Sonnet, Gemini Pro) fail at counting legs due to bias⁉️ See simple cases where VLMs get it wrong, no matter how you prompt them. 🧪 Think your VLM can do better? Try it yourself here: https://t.co/EDJdF3Vmpy 1/n #ICML2025
5
20
279
@giffmana
Lucas Beyer (bl16)
2 months
Oh wow, this VLM benchmark is pure evil, and I love it! "Vision Language Models are Biased" by @an_vo12, @taesiri, @anh_ng8, et al. Also a really good idea to have one-click copy-paste of images and prompts, makes trying it super easy.
32
75
942
@mrnuu
Mathew
2 months
@anh_ng8 @taesiri @an_vo12 GPT5 Pro, the most advanced version, reasoned for 1 minute 19 seconds and came up with the wrong answer. Same with GPT5 Thinking. Amazing level of incompetence from the smartest model yet.
0
3
7
@anh_ng8
Anh Totti Nguyen
2 months
#GPT5 STILL has a severe confirmation bias like prev SOTA models! 😜 Try it yourselves (images, prompts avail in 1 click): https://t.co/S317wqrlju It's fast to test for such biases in images. Similar biases should still exist in non-image domains as well...
11
14
121
@Cohere_Labs
Cohere Labs
3 months
Supported by one of our grants, @an_vo12, Mohammad Reza Taesiri, and @anh_ng8 from @kaist_ai tackled bias in LLMs. Their research shows that LLMs exhibit fewer biases when they can see their previous answers, leading to the development of the B-score metric.
2
3
8
@an_vo12
An Vo
3 months
🚨 Our latest work shows that SOTA VLMs (o3, o4-mini, Sonnet, Gemini Pro) fail at counting legs due to bias⁉️ See simple cases where VLMs get it wrong, no matter how you prompt them. 🧪 Think your VLM can do better? Try it yourself here: https://t.co/EDJdF3Vmpy 1/n #ICML2025
9
41
303
@anh_ng8
Anh Totti Nguyen
4 months
Pooyan @Pooyanrg presenting our Transformer Attention Bottleneck paper at @CVPR 💡 We **simplify** MHSA (e.g. 12 heads -> 1 head) to create an attention **bottleneck** where users can debug Vision Language Models by editing the bottleneck and observing the expected VLM text outputs.
1
5
6
@anh_ng8
Anh Totti Nguyen
4 months
How do the best AI image editors 🤖 GPT-4o, Gemini 2.0, SeedEdit, HF 🤗 fare against ⚔️ human Photoshop wizards 🧙‍♀️ on text-based 🏞️ image editing? Logan @septisum and Brandon @brandon_co92810 shared some answers at our poster today! #CVPR2025 https://t.co/pwxmUnrxwY A few insights 👇
1
6
12
@anh_ng8
Anh Totti Nguyen
5 months
🧵 Vision Language Models are ⚠️ biased
Q: Count the legs of this animal? 🤖: 4 ❌
Same problem:
- w/ 5 best VLMs: GPT-4.1, o3, o4-mini, Gemini 2.5 Pro, Sonnet 3.7
- on 7 domains: animals, logos, flags, chess, boardgames, optical illusions
code, paper https://t.co/S317wqrlju
10
15
81
@rohanpaul_ai
Rohan Paul
5 months
Large language models often exhibit biases in single interactions. Allowing LLMs to observe prior responses in multi-turn conversations helps them reduce bias, especially for random answers. A new metric, B-score, effectively detects biases across different question types.
0
4
25
@anh_ng8
Anh Totti Nguyen
5 months
Asking GPT-4o for a random choice is an *easy* way to reveal its bias 🙃
Choose a random digit? ➡️ 7 (70% of the time❗️)
Biden vs. Trump? ➡️ Biden (100%❗️)
Male vs. Female? ➡️ Female (84%❗️)
Same story for many LLMs. Choice orders are randomized. 1/6 #icml2025
1
4
15
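A minimal sketch of this kind of frequency probe, assuming a hypothetical ask_llm() wrapper around the model's chat API and simple exact-match parsing (not the authors' B-score code):

import random
from collections import Counter

def ask_llm(prompt: str) -> str:
    # Hypothetical call to the model under test.
    raise NotImplementedError

def choice_frequencies(options: list[str], n_trials: int = 100) -> Counter:
    counts = Counter()
    for _ in range(n_trials):
        shuffled = random.sample(options, k=len(options))  # randomize choice order
        reply = ask_llm(
            f"Pick one of these at random: {', '.join(shuffled)}. "
            "Answer with your choice only."
        )
        counts[reply.strip()] += 1
    return counts  # an unbiased model should be roughly uniform across options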
@JRobertsAI
Jonathan Roberts
5 months
📢📢 More progress on ZeroBench! With the release of Claude 4 from @AnthropicAI, the SOTA pass@1 is now 4% 🔥
Claude Sonnet 3.7: 1%
Claude Sonnet 3.7 (Thinking): 3%
Claude Sonnet 4: 2%
Claude Sonnet 4 (Thinking): 3%
Claude Opus 4: 1%
Claude Opus 4 (Thinking): 4%
1
2
15
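For reference, a pass@1 number like those above is just the fraction of questions answered correctly with a single attempt per question; a minimal sketch (not the ZeroBench evaluation code):

def pass_at_1(correct_per_question: list[bool]) -> float:
    # One attempt per question; pass@1 is the share answered correctly.
    return sum(correct_per_question) / len(correct_per_question)

# e.g. 4 correct out of 100 questions -> 0.04, reported as 4%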
@taesiri
taesiri
5 months
HoT prompting could help Claude 4 pay more attention to questions and avoid simple mistakes. Learn more about it here:
highlightedchainofthought.github.io
A novel prompting approach for having LLMs highlight their own answers
@tin_ng_qn
Tin (Kevin) Nguyen
5 months
Claude 4 Sonnet:
❓ "What weighs more: 20 pounds of bricks or 20 feathers?"
❌ Input Question: "They weigh the same."
✅ HoT prompt: "20 pounds of bricks weighs more."
HoT tags key facts + reasons step by step → avoids shallow answers.
Check HoT here: https://t.co/kly8gsQUpl
0
0
3