arlo_son Profile
arlo_son

@gson_AI

Followers
207
Following
3K
Media
24
Statuses
268

Undergraduate @ Yonsei. UIC Economics.

Joined February 2023
@gson_AI
arlo_son
6 months
#NLProc AI Co-Scientists 🤖 can generate ideas, but can they spot mistakes? (not yet! 🚫) In my recent paper, we introduce SPOT, a dataset of STEM manuscripts (math, materials science, chemistry, physics, etc), annotated with real errors. SOTA models like o3, gemini-2.5-pro
4
38
162
@AiEleuther
EleutherAI
2 months
We are announcing an opportunity for paid question writers to contribute to a new PhD-level math benchmark. Accepted contributors will be paid per question and will be invited to be authors on the resulting dataset paper. Check out the link below for more information!
1
5
24
@gson_AI
arlo_son
6 months
Imagine you’re collaborating with an AI co-scientist: you ask it to proofread your manuscript and flag any errors. Which LLM would you choose? 🤔 We evaluated the new Claude 4 models on SPOT. It looks like o3 is still the best model for this.
2
5
8
@BlancheMinerva
Stella Biderman ✈️ NeurIPS 2025
6 months
People are really eager to use AIs "to accelerate science" (whatever that means). Designing meaningful tests tailored to proposed use-cases is a lot of work, but it's work I'm quite excited about. Bottom line: Current models aren't usable at identifying major flaws in papers.
1
8
76
@gson_AI
arlo_son
6 months
Last but not least, I’d like to thank all coauthors for their help 👍👍👍 @jiwoohong98 @Void13950782 @hazel_heejeong @cartinoe__5930 @sngwonlim @jinyeop_song @GoncaloSPaulo @YoungjaeYu3 @stella
0
0
8
@gson_AI
arlo_son
6 months
🔥 SPOT drives home a crucial point – verification must catch up with generation if AI co-scientists are to earn our trust. It’s time to build smarter error detectors before we rely on AI in labs 🛠️ Check out the paper for more details! https://t.co/ve2eMs0wRd
arxiv.org
Recent advances in large language models (LLMs) have fueled the vision of automated scientific discovery, often called AI Co-Scientists. To date, prior work casts these systems as generative...
1
0
6
@gson_AI
arlo_son
6 months
Our deep-dive case studies show typical AI blind spots: models struggle with long-tail knowledge absent from web data and extremely long contexts; without fully spelled-out derivations, they misinterpret calculations and ignore domain-specific conventions, leading to student-like
2
1
14
@gson_AI
arlo_son
6 months
Models even struggle to be consistent. They rarely rediscover the same error across 8 trials and assign near-zero confidence, making automated review unreliable 🤷
1
1
6
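The consistency problem described above (a model rarely rediscovering the same error across 8 trials) can be sketched as a toy metric. This is an illustrative sketch only, not the paper's actual protocol; the function name and the simple exact-match representation of errors are assumptions.

```python
from collections import Counter

def rediscovery_rate(trial_findings, annotated_errors):
    """Fraction of annotated errors a model finds in every trial vs.
    in at least one trial (toy metric; not SPOT's actual protocol)."""
    n_trials = len(trial_findings)
    hits = Counter()
    for findings in trial_findings:
        for err in annotated_errors:
            if err in findings:
                hits[err] += 1
    in_all = sum(1 for e in annotated_errors if hits[e] == n_trials)
    in_any = sum(1 for e in annotated_errors if hits[e] > 0)
    total = len(annotated_errors)
    return in_all / total, in_any / total

# 8 trials over 2 annotated errors: "eq3" is found in only 3 of 8 trials,
# "fig2" is never found -- so nothing is rediscovered consistently.
trials = [{"eq3"}, set(), {"eq3"}, set(), set(), {"eq3"}, set(), set()]
always, ever = rediscovery_rate(trials, ["eq3", "fig2"])
# always = 0.0, ever = 0.5
```

A large gap between `ever` and `always` is exactly the instability the tweet describes: the model can find an error sometimes, but not reliably.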
@gson_AI
arlo_son
6 months
We benchmarked 10 top models from closed to open source. Results are sobering – recall under 21 percent and precision under 6 percent highlight a massive shortfall 📉
1
1
6
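For readers unfamiliar with how recall and precision apply to error detection, here is a minimal sketch. The set-level exact matching and the example numbers are illustrative assumptions; SPOT's actual matching protocol may differ.

```python
def error_detection_scores(predicted, gold):
    """Set-level precision/recall for flagged errors (illustrative only;
    SPOT's actual matching protocol may differ)."""
    tp = len(set(predicted) & set(gold))
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    return precision, recall

# A model flags 20 candidate errors, but only 1 matches
# any of the 10 author-validated errors in the paper.
p, r = error_detection_scores(
    [f"pred{i}" for i in range(19)] + ["gold0"],
    [f"gold{i}" for i in range(10)],
)
# p = 0.05, r = 0.1
```

Low precision means most flagged "errors" are spurious; low recall means most real errors go unnoticed. The reported numbers suffer on both axes.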
@gson_AI
arlo_son
6 months
To make sure these errors are genuine, we only include those that have been acknowledged by the original authors themselves!
1
0
5
@gson_AI
arlo_son
6 months
1️⃣ SPOT comprises 83 research papers and 91 author-validated errors across 10 STEM fields and 6 error types – equation/proof, figure duplication, data inconsistency, statistical reporting, reagent identity, and experiment setup. 🧮 2️⃣ The dataset spans long documents (avg 12,000
1
0
8
@gson_AI
arlo_son
7 months
I'll be presenting KMMLU, currently the most widely used Korean benchmark among Korean big-tech companies, with @seungonekim today at 2pm!

@AiEleuther
EleutherAI
7 months
If you're at #NAACL25, don't miss @gson_AI presenting our paper on localizing MMLU to Korea! Session C: Oral/Poster 2: 2pm-3.30pm
0
2
11
@seungonekim
Seungone Kim
7 months
@naaclmeeting I'll also be presenting our KMMLU paper with @gson_AI! It is one of the most widely adopted benchmarks used by companies such as @official_naver @LG_AI_Research @kakaocorpglobal that develop Korean LLMs. 📅 Session C: Wednesday April 30th, 14:00-15:30 https://t.co/BMJYsS9fbP
@gson_AI
arlo_son
2 years
🌟 KMMLU 🌟 This benchmark replicates the methodology that produced MMLU, but using examinations common in Korea. We manually annotate a subset of the questions as to whether they require Korea-specific knowledge and also designate a KMMLU-Hard subset that current models find
1
1
5
@TrelisResearch
Trelis Research
10 months
+ GRPO is Poor and for the GPU-Rich + ------------------------------- *A specific GRPO vs SFT video will be out next week, but I'm putting initial results here* I trained Llama 3.2 1B on GSM8K with: 1. SFT 2. ORPO 3. GRPO For SFT and ORPO, I generated training data using Llama
@TrelisResearch
Trelis Research
10 months
++ Reinforcement Learning for LLMs in 2025 ++ === How to elicit improved reasoning from models? - Is reasoning innately present in pre-training datasets, just needing the right examples to bring it out? - Why does GRPO make sense, as opposed to Supervised Fine-tuning with the right
18
58
410
@lifan__yuan
Lifan Yuan
10 months
lessons learned: (1) *capable* (small) base models are good enough to start rl, where (2) reasoning patterns *tailored to each task* just emerge, e.g. self-verification for countdown and decomposition for multiplication. will keep working on demystifying long cot, stay tuned🫡
@jiayi_pirate
Jiayi Pan
10 months
We reproduced DeepSeek R1-Zero in the CountDown game, and it just works. Through RL, the 3B base LM develops self-verification and search abilities all on its own. You can experience the Aha moment yourself for < $30. Code: https://t.co/B2IsN1PrXV Here's what we learned 🧵
6
14
139
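The CountDown task above rewards a model for combining given numbers into an arithmetic expression that hits a target; a rule-based verifier makes RL rewards cheap to compute. Below is a minimal sketch of such a verifier under my own assumptions (each given number usable at most once, exact-match reward); the linked repo's actual reward function may differ.

```python
import ast
import operator

OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
       ast.Mult: operator.mul, ast.Div: operator.truediv}

def countdown_reward(expr, numbers, target):
    """Rule-based reward: 1.0 if the expression uses only the given
    numbers (each at most once) and evaluates to the target, else 0.0."""
    try:
        tree = ast.parse(expr, mode="eval")
        used = []

        def ev(node):
            if isinstance(node, ast.BinOp) and type(node.op) in OPS:
                return OPS[type(node.op)](ev(node.left), ev(node.right))
            if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
                used.append(node.value)
                return node.value
            raise ValueError("disallowed syntax")

        value = ev(tree.body)
    except (ValueError, SyntaxError, ZeroDivisionError):
        return 0.0
    # Each provided number may be consumed at most once.
    pool = list(numbers)
    for n in used:
        if n in pool:
            pool.remove(n)
        else:
            return 0.0
    return 1.0 if abs(value - target) < 1e-6 else 0.0

print(countdown_reward("(25 - 1) * 4", [25, 1, 4, 7], 96))  # 1.0
```

Because the reward is verifiable rather than learned, there is no reward model to hack, which is part of why R1-Zero-style training on tasks like this "just works".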
@TheTuringPost
Ksenia_TuringPost
11 months
10 Free Comprehensive Datasets for Supervised Fine-Tuning: ▪️ Awesome ChatGPT Prompts ▪️ FineWeb from @huggingface ▪️ FineWeb 2 ▪️ OpenO1-SFT ▪️ Cleaned Alpaca Dataset ▪️ LMSYS-Chat-1M ▪️ Dolma from @allen_ai Math datasets: ▪️ FineMath ▪️ QwQ-LongCoT-130K ▪️ GSM8K Save the
3
29
116
@seungonekim
Seungone Kim
1 year
#NLProc Just because GPT-4o is 17 times more expensive than GPT-4o-mini, does that mean it generates synthetic data 17 times better? Introducing AgoraBench, a benchmark for evaluating the data generation capabilities of LMs.
3
54
194
@gson_AI
arlo_son
1 year
Link to Dataset: 📷
huggingface.co
1
0
1
@gson_AI
arlo_son
1 year
Are you fascinated by O1, QwQ, and Deepseek-R1? Why not try training your own⁉️ I'm sharing QwQ-LongCoT-130K, an SFT-style dataset for training O1-like language models, under the Apache 2.0 license. 🔥 Feel free to use it and let me know what you think!
1
3
6
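Using an SFT-style dataset like this typically means converting each record into chat-format messages before applying a model's chat template. The sketch below assumes hypothetical field names (`problem`, `qwq`); check the dataset card on huggingface.co for the actual schema before adapting it.

```python
def to_chat_messages(example):
    """Convert one record into chat-format messages for SFT.
    Field names ('problem', 'qwq') are assumptions about the schema."""
    return [
        {"role": "user", "content": example["problem"]},
        {"role": "assistant", "content": example["qwq"]},
    ]

# Toy record standing in for one row of the dataset.
sample = {"problem": "What is 2 + 2?",
          "qwq": "Let me think step by step... 2 + 2 = 4."}
msgs = to_chat_messages(sample)
```

From here, a trainer such as an SFT loop would render `msgs` with the target model's chat template and compute loss on the assistant turn.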