arlo_son Profile
arlo_son

@gson_AI

Followers
207
Following
3K
Media
24
Statuses
268

Undergraduate @ Yonsei. UIC Economics.

Joined February 2023
@gson_AI
arlo_son
6 months
#NLProc AI Co-Scientists 🤖 can generate ideas, but can they spot mistakes? (not yet! 🚫) In my recent paper, we introduce SPOT, a dataset of STEM manuscripts (math, materials science, chemistry, physics, etc), annotated with real errors. SOTA models like o3, gemini-2.5-pro
4
38
162
@AiEleuther
EleutherAI
2 months
We are announcing an opportunity for paid question writers to contribute to a new PhD-level math benchmark. Accepted contributors will be paid per question and will be invited to be authors on the resulting dataset paper. Check out the link below for more information!
1
5
24
@gson_AI
arlo_son
6 months
Imagine you’re collaborating with an AI co-scientist: you ask it to proofread your manuscript and flag any errors. Which LLM would you choose? 🤔 We evaluated the new Claude 4 models on SPOT. It looks like o3 is still the best model for this.
2
5
8
@BlancheMinerva
Stella Biderman ✈️ NeurIPS 2025
6 months
People are really eager to use AIs "to accelerate science" (whatever that means). Designing meaningful tests tailored to proposed use-cases is a lot of work, but it's work I'm quite excited about. Bottom line: Current models aren't usable at identifying major flaws in papers.
1
8
76
@gson_AI
arlo_son
6 months
Last but not least, I’d like to thank all coauthors for their help 👍👍👍 @jiwoohong98 @Void13950782 @hazel_heejeong @cartinoe__5930 @sngwonlim @jinyeop_song @GoncaloSPaulo @YoungjaeYu3 @stella
0
0
8
@gson_AI
arlo_son
6 months
🔥 SPOT drives home a crucial point – verification must catch up with generation if AI co-scientists are to earn our trust. It’s time to build smarter error detectors before we rely on AI in labs 🛠️ Check out the paper for more details! https://t.co/ve2eMs0wRd
arxiv.org
Recent advances in large language models (LLMs) have fueled the vision of automated scientific discovery, often called AI Co-Scientists. To date, prior work casts these systems as generative...
1
0
6
@gson_AI
arlo_son
6 months
Our deep-dive case studies show typical AI blind spots: models struggle with long-tail knowledge absent from web data and extremely long contexts; without fully spelled-out derivations, they misinterpret calculations and ignore domain-specific conventions, leading to student-like
2
1
14
@gson_AI
arlo_son
6 months
Models even struggle to be consistent. They rarely rediscover the same error across 8 trials and assign near-zero confidence, making automated review unreliable 🤷
1
1
6
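The consistency problem described above (a model rarely rediscovering the same error across 8 trials) can be sketched as a toy metric. This is an illustrative sketch only, not the paper's actual protocol; the function name and the simple exact-match representation of errors are assumptions.

```python
from collections import Counter

def rediscovery_rate(trial_findings, annotated_errors):
    """Fraction of annotated errors a model finds in every trial vs.
    in at least one trial (toy metric; not SPOT's actual protocol)."""
    n_trials = len(trial_findings)
    hits = Counter()
    for findings in trial_findings:
        for err in annotated_errors:
            if err in findings:
                hits[err] += 1
    in_all = sum(1 for e in annotated_errors if hits[e] == n_trials)
    in_any = sum(1 for e in annotated_errors if hits[e] > 0)
    total = len(annotated_errors)
    return in_all / total, in_any / total

# 8 trials over 2 annotated errors: "eq3" is found in only 3 of 8 trials,
# "fig2" is never found -- so nothing is rediscovered consistently.
trials = [{"eq3"}, set(), {"eq3"}, set(), set(), {"eq3"}, set(), set()]
always, ever = rediscovery_rate(trials, ["eq3", "fig2"])
# always = 0.0, ever = 0.5
```

A large gap between `ever` and `always` is exactly the instability the tweet describes: the model can find an error sometimes, but not reliably.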
@gson_AI
arlo_son
6 months
We benchmarked 10 top models from closed to open source. Results are sobering – recall under 21 percent and precision under 6 percent highlight a massive shortfall 📉
1
1
6
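For readers unfamiliar with how recall and precision apply to error detection, here is a minimal sketch. The set-level exact matching and the example numbers are illustrative assumptions; SPOT's actual matching protocol may differ.

```python
def error_detection_scores(predicted, gold):
    """Set-level precision/recall for flagged errors (illustrative only;
    SPOT's actual matching protocol may differ)."""
    tp = len(set(predicted) & set(gold))
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    return precision, recall

# A model flags 20 candidate errors, but only 1 matches
# any of the 10 author-validated errors in the paper.
p, r = error_detection_scores(
    [f"pred{i}" for i in range(19)] + ["gold0"],
    [f"gold{i}" for i in range(10)],
)
# p = 0.05, r = 0.1
```

Low precision means most flagged "errors" are spurious; low recall means most real errors go unnoticed. The reported numbers suffer on both axes.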
@gson_AI
arlo_son
6 months
To make sure these errors are genuine, we only include those that have been acknowledged by the original authors themselves!
1
0
5
@gson_AI
arlo_son
6 months
1️⃣ SPOT comprises 83 research papers and 91 author-validated errors across 10 STEM fields and 6 error types – equation/proof, figure duplication, data inconsistency, statistical reporting, reagent identity, and experiment setup. 🧮 2️⃣ The dataset spans long documents (avg 12,000
1
0
8
@gson_AI
arlo_son
7 months
I'll be presenting KMMLU, currently the most widely used Korean benchmark among Korean big-tech companies, with @seungonekim today at 2pm!

@AiEleuther
EleutherAI
7 months
If you're at #NAACL25, don't miss @gson_AI presenting our paper on localizing MMLU to Korea! Session C: Oral/Poster 2: 2pm-3.30pm
0
2
11
@seungonekim
Seungone Kim
7 months
@naaclmeeting I'll also be presenting our KMMLU paper with @gson_AI! It is one of the most widely adopted benchmarks used by companies such as @official_naver @LG_AI_Research @kakaocorpglobal that develop Korean LLMs. 📅 Session C: Wednesday April 30th, 14:00-15:30 https://t.co/BMJYsS9fbP
@gson_AI
arlo_son
2 years
🌟 KMMLU 🌟 This benchmark replicates the methodology that produced MMLU, but using examinations common in Korea. We manually annotate a subset of the questions as to whether they require Korea-specific knowledge and also designate a KMMLU-Hard subset that current models find
1
1
5
@TrelisResearch
Trelis Research
10 months
+ GRPO is Poor and for the GPU-Rich + ------------------------------- *A specific GRPO vs SFT video will be out next week, but I'm putting initial results here* I trained Llama 3.2 1B on GSM8K with: 1. SFT 2. ORPO 3. GRPO For SFT and ORPO, I generated training data using Llama
@TrelisResearch
Trelis Research
10 months
++ Reinforcement Learning for LLMs in 2025 ++ === How to elicit improved reasoning from models? - Is reasoning innately present in pre-training datasets, just needing the right examples to bring it out? - Why does GRPO make sense, as opposed to Supervised Fine-tuning with the right
18
58
410
@lifan__yuan
Lifan Yuan
10 months
lessons learned: (1) *capable* (small) base models are good enough to start rl, where (2) reasoning patterns *tailored to each task* just emerge, e.g. self-verification for countdown and decomposition for multiplication. will keep working on demystifying long cot, stay tuned🫡
@jiayi_pirate
Jiayi Pan
10 months
We reproduced DeepSeek R1-Zero in the CountDown game, and it just works. Through RL, the 3B base LM develops self-verification and search abilities all on its own. You can experience the Aha moment yourself for < $30. Code: https://t.co/B2IsN1PrXV Here's what we learned 🧵
6
14
139
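The CountDown task above rewards a model for combining given numbers into an arithmetic expression that hits a target; a rule-based verifier makes RL rewards cheap to compute. Below is a minimal sketch of such a verifier under my own assumptions (each given number usable at most once, exact-match reward); the linked repo's actual reward function may differ.

```python
import ast
import operator

OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
       ast.Mult: operator.mul, ast.Div: operator.truediv}

def countdown_reward(expr, numbers, target):
    """Rule-based reward: 1.0 if the expression uses only the given
    numbers (each at most once) and evaluates to the target, else 0.0."""
    try:
        tree = ast.parse(expr, mode="eval")
        used = []

        def ev(node):
            if isinstance(node, ast.BinOp) and type(node.op) in OPS:
                return OPS[type(node.op)](ev(node.left), ev(node.right))
            if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
                used.append(node.value)
                return node.value
            raise ValueError("disallowed syntax")

        value = ev(tree.body)
    except (ValueError, SyntaxError, ZeroDivisionError):
        return 0.0
    # Each provided number may be consumed at most once.
    pool = list(numbers)
    for n in used:
        if n in pool:
            pool.remove(n)
        else:
            return 0.0
    return 1.0 if abs(value - target) < 1e-6 else 0.0

print(countdown_reward("(25 - 1) * 4", [25, 1, 4, 7], 96))  # 1.0
```

Because the reward is verifiable rather than learned, there is no reward model to hack, which is part of why R1-Zero-style training on tasks like this "just works".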
@TheTuringPost
Ksenia_TuringPost
11 months
10 Free Comprehensive Datasets for Supervised Fine-Tuning: ▪️ Awesome ChatGPT Prompts ▪️ FineWeb from @huggingface ▪️ FineWeb 2 ▪️ OpenO1-SFT ▪️ Cleaned Alpaca Dataset ▪️ LMSYS-Chat-1M ▪️ Dolma from @allen_ai Math datasets: ▪️ FineMath ▪️ QwQ-LongCoT-130K ▪️ GSM8K Save the
3
29
116
@seungonekim
Seungone Kim
1 year
#NLProc Just because GPT-4o is 17 times more expensive than GPT-4o-mini, does that mean it generates synthetic data 17 times better? Introducing AgoraBench, a benchmark for evaluating the data generation capabilities of LMs.
3
54
194
@gson_AI
arlo_son
1 year
Link to Dataset: 📷
huggingface.co
1
0
1
@gson_AI
arlo_son
1 year
Are you fascinated by O1, QwQ, and Deepseek-R1? Why not try training your own⁉️ I'm sharing QwQ-LongCoT-130K, an SFT-style dataset for training O1-like language models, under the Apache 2.0 license. 🔥 Feel free to use it and let me know what you think!
1
3
6
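Using an SFT-style dataset like this typically means converting each record into chat-format messages before applying a model's chat template. The sketch below assumes hypothetical field names (`problem`, `qwq`); check the dataset card on huggingface.co for the actual schema before adapting it.

```python
def to_chat_messages(example):
    """Convert one record into chat-format messages for SFT.
    Field names ('problem', 'qwq') are assumptions about the schema."""
    return [
        {"role": "user", "content": example["problem"]},
        {"role": "assistant", "content": example["qwq"]},
    ]

# Toy record standing in for one row of the dataset.
sample = {"problem": "What is 2 + 2?",
          "qwq": "Let me think step by step... 2 + 2 = 4."}
msgs = to_chat_messages(sample)
```

From here, a trainer such as an SFT loop would render `msgs` with the target model's chat template and compute loss on the assistant turn.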