Simone Balloccu
@simoneballoccu
Followers
319
Following
2K
Media
63
Statuses
742
(he/him) ExpNLP lab leader @TUDarmstadt. Researching AI w.r.t. human evaluation, behaviour change, safety and controllability, expert domains. Opinions my own.
Darmstadt
Joined July 2020
🚨Happy to share that our paper "Leak, Cheat, Repeat: Data Contamination and Evaluation Malpractices in Closed-Source LLMs" has been accepted at #eacl2024.🚨 Huge thanks to my co-authors @PSchmidtova, @LangoMateusz and @tuetschek. Link: https://t.co/wD1HayuGPJ
3
20
82
This release has SO MUCH
• New pretrain corpus, new midtrain data, 380B+ long context tokens
• 7B & 32B, Base, Instruct, Think, RL Zero
• Close to Qwen 3 performance, but fully open!!
Announcing Olmo 3, a leading fully open LM suite built for reasoning, chat, & tool use, and an open model flow—not just the final weights, but the entire training journey. Best fully open 32B reasoning model & best 32B base model. 🧵
22
42
411
I'm seeing disturbing reports about chatbots encouraging children to kill themselves, such as https://t.co/PdgvTaYPHi. Shame that the AI Safety community in general, and the @AISecurityInst in particular, seem to have little interest in this, very disappointing...
bbc.co.uk
In her first UK interview Megan Garcia speaks to Laura Kuenssberg about the death of her teenage son.
1
1
2
Imagine losing first authorship because you got hit by a blue shell on the last lap 💀
LLMs are injective and invertible. In our new paper, we show that different prompts always map to different embeddings, and this property can be used to recover input tokens from individual embeddings in latent space. (1/6)
87
4K
52K
Excited to share that our paper "When LLMs Can’t Help: Real-World Evaluation of LLMs in Nutrition" (with @simoneballoccu, @tuetschek and @EhudReiter) will be at #INLG2025 in Hanoi! Ours is the first (7-week) RCT testing if LLMs can help improve eating habits. 🥕
4
2
5
Life once you start supervising:
9-10: meeting
10-10.30: meeting
10.30-11: meeting
11-11.30: meeting
✨ coffee ✨
13.30-14: meeting
14-15: meeting
15.30-16: meeting
16-17: meeting
0
0
1
New blog: Good diagrams for research papers. I've seen a number of diagrams recently which are too complicated and difficult to understand. I explain some of the problems I see and give advice. https://t.co/4Lp5UWU06g
ehudreiter.com
I've seen a number of diagrams recently which are too complicated and difficult to understand. I explain some of the problems I see and give advice.
0
2
9
As we fall in love with yet another "superintelligent" AGI whatever, let's remind ourselves that text prediction on steroids still is text prediction on steroids
0
0
1
New blog: More on evaluating impact. I got great feedback on my recent paper and talk on eval impact, and summarise some of the suggested papers (including more examples of impact eval) and insightful comments (e.g., about the eval “ecosystem”) I received. https://t.co/zZxsVJBtfD
ehudreiter.com
I recently published a paper and gave a talk about evaluating real-world impact. I got some great feedback from this, and summarise some of the suggested papers (including more examples of impact e…
0
1
10
Motivated by a recent discussion with my group: Ignore subjective statements such as "I find LLMs to be incredibly useful for XX", especially when made by people (such as AI companies or gurus) who have strong biases/incentives/COI.
0
1
2
You have a budget to human-evaluate 100 inputs to your models, but your dataset is 10,000 inputs. Do not just pick 100 randomly!🙅 We can do better. "How to Select Datapoints for Efficient Human Evaluation of NLG Models?" shows how.🕵️ (random is still a devilishly good baseline)
2
16
73
did people get greedy and sloppy and ruin it like with almost everything ever? you tell me
2
3
8
Writing a rebuttal is 30% technical and 70% reviewers' psychology.
13
10
310
Microsoft claims their new AI framework diagnoses 4x better than doctors. I'm a medical doctor and I actually read the paper. Here's my perspective on why this is both impressive AND misleading ... 🧵
275
1K
9K
I love this analysis of the limitations of the experimental setting/design. This is the kind of expert insight and methodological rigor we need when evaluating LLMs!
Microsoft claims their new AI framework diagnoses 4x better than doctors. I'm a medical doctor and I actually read the paper. Here's my perspective on why this is both impressive AND misleading ... 🧵
0
1
4
as a parent, i will never push a career path onto my kids. i would give them full freedom to decide which AI lab they want to join for $100 mil
72
611
11K
Remember my tweet from the other day? Well, this is not what I meant.
1
0
3
We just received some reviews for EMNLP and I'm filled with an immense amount of rage.
2
0
13