arlo_son
@gson_AI
Followers 207 · Following 3K · Media 24 · Statuses 268
Undergraduate @ Yonsei. UIC Economics.
Joined February 2023
#NLProc AI Co-Scientists 🤖 can generate ideas, but can they spot mistakes? (not yet! 🚫) In my recent paper, we introduce SPOT, a dataset of STEM manuscripts (math, materials science, chemistry, physics, etc.), annotated with real errors. SOTA models like o3, gemini-2.5-pro
4
38
162
We are announcing an opportunity for paid question writers to contribute to a new PhD-level math benchmark. Accepted contributors will be paid per question and will be invited to be authors on the resulting dataset paper. Check out the link below for more information!
1
5
24
For more details, paper: https://t.co/ve2eMrZZ1F project: https://t.co/AdYUJ5wPkg I'm planning follow-up works, so let me know if you are interested! 🔥
arxiv.org
Recent advances in large language models (LLMs) have fueled the vision of automated scientific discovery, often called AI Co-Scientists. To date, prior work casts these systems as generative...
1
1
3
Imagine you’re collaborating with an AI co-scientist: you ask it to proofread your manuscript and flag any errors. Which LLM would you choose? 🤔 We evaluated the new Claude 4 models on SPOT. It looks like o3 is still the best model for this.
2
5
8
People are really eager to use AIs "to accelerate science" (whatever that means). Designing meaningful tests tailored to proposed use-cases is a lot of work, but it's work I'm quite excited about. Bottom line: current models aren't usable for identifying major flaws in papers.
1
8
76
Last but not least, I’d like to thank all coauthors for their help 👍👍👍 @jiwoohong98 @Void13950782 @hazel_heejeong @cartinoe__5930 @sngwonlim @jinyeop_song @GoncaloSPaulo @YoungjaeYu3 @stella
0
0
8
🔥 SPOT drives home a crucial point – verification must catch up with generation if AI co-scientists are to earn our trust. It’s time to build smarter error detectors before we rely on AI in labs 🛠️ Check out the paper for more details! https://t.co/ve2eMs0wRd
1
0
6
Our deep-dive case studies show typical AI blind spots: models struggle with long-tail knowledge absent from web data and with extremely long contexts; without fully spelled-out derivations, they misinterpret calculations and ignore domain-specific conventions, leading to student-like
2
1
14
Models even struggle to be consistent. They rarely rediscover the same error across 8 trials and assign near-zero confidence, making automated review unreliable 🤷
1
1
6
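The consistency problem above can be made concrete: run the same detector several times and check how often a given error is re-flagged. A minimal sketch, with entirely hypothetical trial data and error identifiers (nothing here comes from the SPOT paper itself):

```python
# Hypothetical sketch: measure how often a model rediscovers the same error
# across repeated trials. Error IDs and trial outputs are made up.

def rediscovery_rate(trials: list[set], error: str) -> float:
    """Fraction of trials in which a given error was flagged."""
    return sum(error in t for t in trials) / len(trials)

# 8 simulated trials; the model flags the sign-flip error in only 2 of them.
trials = [
    {"eq3_sign_flip"}, set(), {"fig2_duplicate"}, set(),
    {"eq3_sign_flip"}, set(), set(), {"table1_stats"},
]
rate = rediscovery_rate(trials, "eq3_sign_flip")  # 2/8 = 0.25
```

A low rediscovery rate like this is exactly what makes single-run automated review untrustworthy: one pass may or may not surface the flaw.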
We benchmarked 10 top models from closed to open source. Results are sobering – recall under 21 percent and precision under 6 percent highlight a massive shortfall 📉
1
1
6
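For readers unfamiliar with how such scores are computed: if each paper's annotated errors and model-predicted errors are reduced to comparable identifiers, recall and precision fall out of simple set overlap. A sketch under that assumption (the identifiers and matching scheme here are illustrative, not the paper's actual protocol):

```python
# Illustrative recall/precision computation for error detection,
# assuming errors can be matched by a shared identifier.

def score_predictions(predicted: set, annotated: set) -> tuple[float, float]:
    """Return (recall, precision) of predicted errors vs. author-validated ones."""
    true_positives = len(predicted & annotated)
    recall = true_positives / len(annotated) if annotated else 0.0
    precision = true_positives / len(predicted) if predicted else 0.0
    return recall, precision

annotated = {"eq3_sign_flip", "fig2_duplicate", "table1_stats"}  # ground truth
predicted = {"eq3_sign_flip", "intro_typo"}                      # model output

recall, precision = score_predictions(predicted, annotated)
# recall = 1/3 (one of three real errors found), precision = 1/2
```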
To make sure these errors are genuine, we only include those that have been acknowledged by the original authors themselves!
1
0
5
1️⃣ SPOT comprises 83 research papers and 91 author-validated errors across 10 STEM fields and 6 error types – equation/proof, figure duplication, data inconsistency, statistical reporting, reagent identity, and experiment setup. 🧮
2️⃣ The dataset spans long documents (avg 12,000
1
0
8
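To make the structure of such an annotation tangible, here is one possible shape for a single record. Every field name and value below is an assumption for illustration only; the actual SPOT schema may differ:

```python
# Hypothetical example of what one author-validated error record might look like.
# Field names and values are illustrative assumptions, not the paper's schema.
record = {
    "field": "materials science",          # one of the 10 STEM fields
    "error_type": "data inconsistency",    # one of the 6 error types
    "location": "Table 2",
    "description": "Reported totals do not match the per-row values.",
    "author_acknowledged": True,           # only acknowledged errors are included
}
```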
I'll be presenting KMMLU, currently the most widely used Korean benchmark among Korean big techs, with @seungonekim today at 2pm!
0
2
11
@naaclmeeting I'll also be presenting our KMMLU paper with @gson_AI! It is one of the most widely adopted benchmarks used by companies such as @official_naver @LG_AI_Research @kakaocorpglobal that develop Korean LLMs. 📅 Session C: Wednesday April 30th, 14:00-15:30 https://t.co/BMJYsS9fbP
🌟 KMMLU 🌟This benchmark replicates the methodology that produced MMLU, but using examinations common in Korea. We manually annotate a subset of the questions as to whether they require Korea-specific knowledge and also designate a KMMLU-Hard subset that current models find
1
1
5
+ GRPO is Poor and for the GPU-Rich +
-------------------------------
*A specific GRPO vs SFT video will be out next week, but I'm putting initial results here*
I trained Llama 3.2 1B on GSM8K with:
1. SFT
2. ORPO
3. GRPO
For SFT and ORPO, I generated training data using Llama
++ Reinforcement Learning for LLMs in 2025 ++
===
How to elicit improved reasoning from models?
- Is reasoning innately in pre-training datasets and just needs the right examples to be brought out?
- Why does GRPO make sense, as opposed to Supervised Fine-tuning with the right
18
58
410
lessons learned: (1) *capable* (small) base models are good enough to start rl, where (2) reasoning patterns *tailored to each task* just emerge, e.g. self-verification for countdown and decomposition for multiplication. will keep working on demystifying long cot, stay tuned🫡
We reproduced DeepSeek R1-Zero in the CountDown game, and it just works. Through RL, the 3B base LM develops self-verification and search abilities all on its own. You can experience the Aha moment yourself for < $30. Code: https://t.co/B2IsN1PrXV Here's what we learned 🧵
6
14
139
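The Countdown setup lends itself to a purely rule-based reward, which is part of why R1-Zero-style RL is cheap to try: no learned reward model is needed. A rough sketch of such a reward function, where all details (signature, scoring values) are assumptions rather than the repo's actual implementation:

```python
# Sketch of a rule-based Countdown reward: the model's expression must use
# exactly the given numbers and evaluate to the target. Details are assumed,
# not taken from the linked codebase.
import ast

def countdown_reward(expr: str, numbers: list[int], target: int) -> float:
    """Return 1.0 for a valid expression hitting the target, else 0.0."""
    try:
        tree = ast.parse(expr, mode="eval")
    except SyntaxError:
        return 0.0
    # Collect the integer literals actually used in the expression.
    used = sorted(
        node.value for node in ast.walk(tree)
        if isinstance(node, ast.Constant) and isinstance(node.value, int)
    )
    if used != sorted(numbers):
        return 0.0  # must use each given number exactly once
    try:
        value = eval(compile(tree, "<expr>", "eval"))
    except ZeroDivisionError:
        return 0.0
    return 1.0 if value == target else 0.0

r = countdown_reward("(25 - 5) * 3", [25, 5, 3], 60)  # valid -> 1.0
```

Because the reward is verifiable, the RL loop only needs rollouts plus this check, which is what keeps the reproduction under $30.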
10 Free Comprehensive Datasets for Supervised Fine-Tuning:
▪️ Awesome ChatGPT Prompts
▪️ FineWeb from @huggingface
▪️ FineWeb 2
▪️ OpenO1-SFT
▪️ Cleaned Alpaca Dataset
▪️ LMSYS-Chat-1M
▪️ Dolma from @allen_ai
Math datasets:
▪️ FineMath
▪️ QwQ-LongCoT-130K
▪️ GSM8K
Save the
3
29
116
#NLProc Just because GPT-4o is 17 times more expensive than GPT-4o-mini, does that mean it generates synthetic data that is 17 times better? Introducing AgoraBench, a benchmark for evaluating the data generation capabilities of LMs.
3
54
194
Are you fascinated by O1, QwQ, and DeepSeek-R1? Why not try training your own⁉️ I'm sharing QwQ-LongCoT-130K, an SFT-style dataset for training O1-like language models, released under the Apache 2.0 license. 🔥 Feel free to use it and let me know what you think!
1
3
6