arlo_son Profile
arlo_son

@gson_AI

Followers
186
Following
2K
Media
24
Statuses
267

Undergraduate @ Yonsei. UIC Economics.

Joined February 2023
@gson_AI
arlo_son
2 months
#NLProc AI Co-Scientists 🤖 can generate ideas, but can they spot mistakes? (Not yet! 🚫) In my recent paper, we introduce SPOT, a dataset of STEM manuscripts (math, materials science, chemistry, physics, etc.), annotated with real errors. SOTA models like o3, gemini-2.5-pro
Tweet media one
4
38
162
@gson_AI
arlo_son
2 months
For more details: paper: project: I'm planning follow-up work, so let me know if you are interested! 🔥
1
1
3
@gson_AI
arlo_son
2 months
Imagine you’re collaborating with an AI co-scientist: you ask it to proofread your manuscript and flag any errors. Which LLM would you choose? 🤔 We evaluated the new Claude 4 models on SPOT. It looks like o3 is still the best model for this.
Tweet media one
2
5
8
@gson_AI
arlo_son
2 months
RT @BlancheMinerva: People are really eager to use AIs "to accelerate science" (whatever that means). Designing meaningful tests tailored t…
0
9
0
@gson_AI
arlo_son
2 months
Last but not least, I’d like to thank all coauthors for their help 👍👍👍 @jiwoohong98 @Void13950782 @hazel_heejeong @cartinoe__5930 @sngwonlim @jinyeop_song @GoncaloSPaulo @YoungjaeYu3 @stella.
0
0
8
@gson_AI
arlo_son
2 months
🔥 SPOT drives home a crucial point: verification must catch up with generation if AI co-scientists are to earn our trust. It’s time to build smarter error detectors before we rely on AI in labs 🛠️ Check out the paper for more details!
1
0
6
@gson_AI
arlo_son
2 months
Our deep-dive case studies show typical AI blind spots: models struggle with long-tail knowledge absent from web data and extremely long contexts; without fully spelled-out derivations, they misinterpret calculations and ignore domain-specific conventions, leading to student-like
Tweet media one
2
1
13
@gson_AI
arlo_son
2 months
Models even struggle to be consistent. They rarely rediscover the same error across 8 trials and assign near-zero confidence, making automated review unreliable 🤷
Tweet media one
1
1
6
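One way to picture the consistency check described above, as a minimal sketch: count how often each gold error is re-flagged across the 8 trials. The data layout and function name here are illustrative assumptions, not SPOT's actual format.

```python
from collections import Counter

def rediscovery_rate(trial_flags, n_trials=8):
    """Fraction of trials in which each gold error was re-flagged.

    trial_flags: list of sets, one per trial, each holding the IDs of
    gold-standard errors the model flagged in that trial.
    (Hypothetical representation; SPOT's real format may differ.)
    """
    counts = Counter()
    for flagged in trial_flags:
        counts.update(flagged)
    return {err: counts[err] / n_trials for err in counts}

# An error rediscovered in only 2 of 8 trials has consistency 0.25:
trials = [{"e1"}, {"e1"}] + [set()] * 6
print(rediscovery_rate(trials))  # {'e1': 0.25}
```

A model whose rediscovery rates cluster near zero is, in effect, guessing, which is why low self-consistency makes automated review hard to trust.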
@gson_AI
arlo_son
2 months
We benchmarked 10 top models from closed to open source. Results are sobering – recall under 21 percent and precision under 6 percent highlight a massive shortfall 📉
Tweet media one
1
1
6
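For intuition on the numbers above, a set-based sketch of precision/recall for error detection (the paper's actual matching of free-text model flags to gold errors is surely more involved; this is only an assumed simplification):

```python
def precision_recall(predicted, gold):
    """Set-based precision/recall for error detection.

    predicted: set of error IDs the model flagged.
    gold: set of author-validated error IDs.
    """
    tp = len(predicted & gold)  # true positives: flags matching gold errors
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    return precision, recall

# 1 correct flag among 20 raised, over 5 gold errors:
p, r = precision_recall(set(range(20)), {0, 100, 101, 102, 103})
print(p, r)  # 0.05 0.2
```

At these levels a model both misses most real errors (low recall) and buries its few hits under false alarms (low precision).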
@gson_AI
arlo_son
2 months
To make sure these errors are genuine, we only include those that have been acknowledged by the original authors themselves!
Tweet media one
1
0
5
@gson_AI
arlo_son
2 months
1️⃣ SPOT comprises 83 research papers and 91 author-validated errors across 10 STEM fields and 6 error types: equation/proof, figure duplication, data inconsistency, statistical reporting, reagent identity, and experiment setup 🧮 2️⃣ The dataset spans long documents (avg 12,000
Tweet media one
1
0
8
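A sketch of what one annotation record in a SPOT-style dataset might look like; the field names and class are hypothetical, not the dataset's actual schema.

```python
from dataclasses import dataclass, field

# The 6 error types named in the thread.
ERROR_TYPES = {
    "equation/proof", "figure duplication", "data inconsistency",
    "statistical reporting", "reagent identity", "experiment setup",
}

@dataclass
class SpotError:
    """Hypothetical record layout for one author-validated error."""
    paper_id: str
    error_type: str          # one of ERROR_TYPES
    location: str            # e.g. a section, table, or equation reference
    description: str         # what is wrong, per the original authors
    author_validated: bool = True

    def __post_init__(self):
        if self.error_type not in ERROR_TYPES:
            raise ValueError(f"unknown error type: {self.error_type}")

err = SpotError("paper-042", "data inconsistency",
                "Table 2", "column totals do not match the reported N")
print(err.error_type)  # data inconsistency
```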
@gson_AI
arlo_son
2 months
I'll be presenting KMMLU, currently the most widely used Korean benchmark among Korean big-tech companies, with @seungonekim today at 2pm!
@AiEleuther
EleutherAI
2 months
If you're at #NAACL25, don't miss @gson_AI presenting our paper on localizing MMLU to Korea! Session C: Oral/Poster 2: 2pm-3:30pm.
0
2
11
@gson_AI
arlo_son
3 months
RT @seungonekim: @naaclmeeting I'll also be presenting our KMMLU paper with @gson_AI! It is one of the most widely adopted benchmarks used…
0
1
0
@gson_AI
arlo_son
5 months
RT @TrelisResearch: + GRPO is Poor and for the GPU-Rich + ------------------------------- *A specific GRPO vs SFT video will be out next w…
0
58
0
@gson_AI
arlo_son
6 months
RT @lifan__yuan: lessons learned: (1) *capable* (small) base models are good enough to start rl, where (2) reasoning patterns *tailored to…
0
14
0
@gson_AI
arlo_son
6 months
RT @TheTuringPost: 10 Free Comprehensive Datasets for Supervised Fine-Tuning: ▪️ Awesome ChatGPT Prompts ▪️ FineWeb from @huggingface ▪️ F…
0
30
0
@gson_AI
arlo_son
7 months
RT @seungonekim: #NLProc Just because GPT-4o is 17 times more expensive than GPT-4o-mini, does that mean it generates synthetic data 17 ti…
0
52
0
@gson_AI
arlo_son
7 months
Link to Dataset: 📷
1
0
1
@gson_AI
arlo_son
7 months
Are you fascinated by O1, QwQ, and Deepseek-R1? Why not try training your own⁉️ I’m sharing QwQ-LongCoT-130K, an SFT-style dataset for training O1-like language models, under the Apache 2.0 license. 🔥 Feel free to use it and let me know what you think!
Tweet media one
1
3
6