Bethge Lab

@bethgelab

Followers: 3K · Following: 319 · Media: 61 · Statuses: 311

Perceiving Neural Networks

Tübingen, Germany
Joined July 2017
@bethgelab
Bethge Lab
17 days
RT @adhiraj_ghosh98: Excited to be in Vienna for #ACL2025🇦🇹! You'll find @sbdzdz and me by our ONEBench poster, so do drop by! 🗓️ Wed, July…
@bethgelab
Bethge Lab
1 month
RT @ori_press: Do language models have algorithmic creativity? To find out, we built AlgoTune, a benchmark challenging agents to optimize…
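The RT above is truncated, but as a rough illustration of how a code-optimization benchmark can score agents, here is a minimal, hypothetical speedup scorer. The interface (`baseline_fn`, `candidate_fn`, a single `problem` instance) and the correctness-gated timing rule are assumptions for illustration, not AlgoTune's actual harness:

```python
import time

def speedup_score(baseline_fn, candidate_fn, problem, repeats=5):
    """Hypothetical scorer: wall-clock speedup of a candidate solver over a
    baseline, with zero credit unless the outputs agree."""
    def best_time(fn):
        result, times = None, []
        for _ in range(repeats):
            start = time.perf_counter()
            result = fn(problem)
            times.append(time.perf_counter() - start)
        return min(times), result

    t_base, expected = best_time(baseline_fn)
    t_cand, actual = best_time(candidate_fn)
    if actual != expected:       # correctness gate before any speed credit
        return 0.0
    return t_base / t_cand       # >1.0 means the candidate is faster
```

Gating on output equality first keeps an agent from scoring by returning fast but wrong answers.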
@bethgelab
Bethge Lab
1 month
🧠🤖 We're hiring a Postdoc in NeuroAI! Join CRC1233 "Robust Vision" (Uni Tübingen) to build benchmarks & evaluation methods for vision models, bridging brain & AI. Work with top faculty & shape vision research. Apply: #NeuroAI #Jobs
@bethgelab
Bethge Lab
4 months
Recent work from our lab asking how to fairly evaluate and measure progress in language model reasoning! Check out the full thread below!
@vishaal_urao
Vishaal Udandarao
4 months
🚀New Paper! Everyone's celebrating rapid progress in math reasoning with RL/SFT. But how real is this progress? We re-evaluated recently released popular reasoning models, and found reported gains often vanish under rigorous testing!! 👀 🧵👇
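One standard form of rigor here is reporting variance across sampling seeds rather than a single run; a minimal sketch of such a check (not necessarily the paper's exact protocol), assuming a hypothetical `run_eval(seed)` callable that returns an accuracy:

```python
import statistics

def accuracy_with_error_bars(run_eval, seeds=range(5)):
    """Re-run an evaluation across several sampling seeds and report
    mean and stdev; `run_eval` is a hypothetical callable mapping a
    seed to an accuracy in [0, 1]."""
    accs = [run_eval(seed) for seed in seeds]
    return statistics.mean(accs), statistics.stdev(accs)

# If two models' means differ by less than their combined spread, the
# apparent "gain" may just be run-to-run noise.
```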
@bethgelab
Bethge Lab
4 months
RT @lukas_thede: 🧠 Keeping LLMs factually up to date is a common motivation for knowledge editing. But what would it actually take to supp…
@bethgelab
Bethge Lab
5 months
RT @CgtyYldz: For our "Automated Assessment of Teaching Quality" project, we are looking for two PhD students: one in educational/cognitive…
@bethgelab
Bethge Lab
6 months
RT @shiven_sinha: AI can generate correct-seeming hypotheses (and papers!). Brandolini's law states BS is harder to refute than generate. C…
@bethgelab
Bethge Lab
6 months
Check out this cool new work from Bethgelab & friends! Falsifying flawed solutions is key to science, but LMs aren't there yet. Even advanced models produce counterexamples for <9% of mistakes, despite solving ~48% of problems. Full thread below:
@shiven_sinha
Shiven Sinha
6 months
AI can generate correct-seeming hypotheses (and papers!). Brandolini's law states BS is harder to refute than generate. Can LMs falsify incorrect solutions? o3-mini (high) scores just 9% on our new benchmark REFUTE. Verification is not necessarily easier than generation 🧵
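A sketch of the kind of falsification test REFUTE's headline number suggests: a proposed counterexample counts only if a flawed solution and a trusted reference disagree on it. The max-subarray pair below is an invented example, not taken from the benchmark:

```python
def reference_max_subarray(xs):
    """Kadane's algorithm: correct even when every entry is negative."""
    best = cur = xs[0]
    for x in xs[1:]:
        cur = max(x, cur + x)
        best = max(best, cur)
    return best

def buggy_max_subarray(xs):
    """A plausible flawed submission: clamps at zero, which is wrong for
    all-negative inputs when empty subarrays are not allowed."""
    return max(0, reference_max_subarray(xs))

def falsifies(candidate_input):
    """A proposed counterexample is valid iff the two solvers disagree."""
    return buggy_max_subarray(candidate_input) != reference_max_subarray(candidate_input)

print(falsifies([1, 2, -1]))  # False: both return 3, not a counterexample
print(falsifies([-3, -1]))    # True: buggy says 0, reference says -1
```

Verifying a counterexample is mechanical; finding one is where the models score under 9%.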
@bethgelab
Bethge Lab
6 months
Check out some cool data-centric analysis on reasoning datasets! More to come from our lab!
@ahochlehnert
Andreas Hochlehnert
6 months
CuratedThoughts: Data Curation for RL Datasets 🚀 Since DeepSeek-R1 introduced reasoning-based RL, datasets like Open-R1 & OpenThoughts emerged for fine-tuning & GRPO. Our deep dive found major flaws: 25% of OpenThoughts had to be eliminated during data curation. Here's why 👇🧵
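As a flavor of what elimination during curation can look like, here is a toy filtering pass. The three rules (exact-duplicate questions, missing ground truth, answers leaked into prompts) are illustrative stand-ins; CuratedThoughts' actual criteria are in the thread:

```python
def curate(examples):
    """Toy curation pass over {"question": ..., "answer": ...} records:
    drop exact-duplicate questions, records with no ground truth, and
    prompts that leak their own answer."""
    seen, kept = set(), []
    for ex in examples:
        key = ex["question"].strip().lower()
        if key in seen:
            continue                      # exact-duplicate question
        if not ex.get("answer"):
            continue                      # nothing to verify against
        if str(ex["answer"]) in ex["question"]:
            continue                      # answer leaked into the prompt
        seen.add(key)
        kept.append(ex)
    return kept
```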
@bethgelab
Bethge Lab
6 months
RT @ahochlehnert: CuratedThoughts: Data Curation for RL Datasets 🚀 Since DeepSeek-R1 introduced reasoning-based RL, datasets like Open-R1…
@bethgelab
Bethge Lab
6 months
RT @ShashwatGoel7: 🚨Great Models Think Alike and this Undermines AI Oversight🚨 New paper quantifies LM similarity. (1) LLM-as-a-judge favor…
@bethgelab
Bethge Lab
8 months
Check out the latest work from our lab on how to merge your multimodal models over time. We find several exciting insights with implications for model merging, continual pretraining, and distributed/federated training! Full thread below:
@vishaal_urao
Vishaal Udandarao
8 months
🚀New Paper. Model merging is all the rage these days: simply fine-tune multiple task-specific models and merge them at the end. Guaranteed perf boost! But wait, what if you get new tasks over time, sequentially? How do you merge your models over time? 🧵👇
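A minimal sketch of the "merge over time" setting, assuming checkpoints that share an architecture: fold each new task-specific checkpoint into a running average instead of re-merging everything from scratch. Uniform averaging is just a baseline here, not the paper's method; the thread covers the broader design space:

```python
import torch

@torch.no_grad()
def fold_in(running_weights, new_weights, num_merged):
    """Online uniform averaging: after fine-tuning on task t, fold the new
    checkpoint into the running merge instead of re-merging every stored
    checkpoint. Assumes both state dicts share keys and shapes."""
    for name, param in running_weights.items():
        param.mul_(num_merged / (num_merged + 1))
        param.add_(new_weights[name], alpha=1 / (num_merged + 1))
    return running_weights
```

The online form only ever stores one merged model, which is what makes the temporal setting different from merging a fixed pool at the end.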
@bethgelab
Bethge Lab
10 months
RT @marcel_binz: Excited to announce Centaur, the first foundation model of human cognition. Centaur can predict and simulate human behav…
huggingface.co
@bethgelab
Bethge Lab
10 months
Great work by all our lab members and other collaborators! @explainableml @GoogleDeepMind @CAML_Lab
@bethgelab
Bethge Lab
10 months
Object segmentation from common fate: motion energy processing enables human-like zero-shot generalization to random dot stimuli. We show striking differences between SoTA optical flow models and human motion perception with respect to generalization, and how to close this gap.
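A minimal sketch of the motion-energy idea, in the spirit of the classic Adelson-Bergen model: quadrature pairs of spatiotemporal filters whose squared, summed responses signal motion regardless of pattern phase, which is why coherently moving random dots drive them just like a drifting grating. The filter parameters below are illustrative, not the paper's:

```python
import numpy as np
from scipy.signal import convolve2d

def motion_energy(stimulus, f_x=0.1, f_t=0.1, size=15):
    """Minimal motion-energy detector over a (time, space) stimulus:
    squared responses of a quadrature pair of spatiotemporal filters
    tuned to rightward drift."""
    t, x = np.meshgrid(np.arange(size) - size // 2,
                       np.arange(size) - size // 2, indexing="ij")
    envelope = np.exp(-(x**2 + t**2) / (2 * (size / 4) ** 2))
    phase = 2 * np.pi * (f_x * x - f_t * t)          # rightward-moving carrier
    even, odd = envelope * np.cos(phase), envelope * np.sin(phase)
    r_even = convolve2d(stimulus, even, mode="same")
    r_odd = convolve2d(stimulus, odd, mode="same")
    return r_even**2 + r_odd**2                      # phase-invariant energy
```

Because the energy is phase-invariant, random dots sharing a common motion produce the same detector response as a structured pattern, which is the common-fate grouping signal.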
@bethgelab
Bethge Lab
10 months
A Practitioner's Guide to Real-World Continual Multimodal Pretraining. We provide practical insights into how to continually pretrain contrastive multimodal models under compute and data constraints. 🧵👇
@vishaal_urao
Vishaal Udandarao
1 year
🚀New Paper: "A Practitioner's Guide to Continual Multimodal Pretraining"! 🌐 Foundation models like CLIP need constant updates to stay relevant. How to do this in the real world? Answer: continual pretraining!! We studied how to effectively do this. 🧵👇
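One knob any such guide must grapple with is rehearsal: mixing fresh data with replayed old data so the model does not drift from earlier distributions. A toy batch mixer, with a placeholder 50/50 ratio that should not be read as the paper's recommendation:

```python
import random

def replay_mix(new_data, replay_buffer, batch_size=8, replay_ratio=0.5):
    """Yield batches that interleave fresh samples with rehearsed old ones.
    The 50/50 ratio is a placeholder, not the paper's tuned value."""
    n_old = int(batch_size * replay_ratio)
    n_new = batch_size - n_old
    for i in range(0, len(new_data), n_new):
        old = random.sample(replay_buffer, min(n_old, len(replay_buffer)))
        yield new_data[i:i + n_new] + old
```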
@bethgelab
Bethge Lab
10 months
Efficient Lifelong Model Evaluation in an Era of Rapid Progress. TLDR: This work addresses the challenge of spiraling evaluation costs through an efficient evaluation framework called Sort & Search (S&S), reducing evaluation costs by 99x. 🧵👇
@vishaal_urao
Vishaal Udandarao
1 year
🚀New Paper Alert! Ever faced the challenge of ML models overfitting to benchmarks? Have computational difficulties evaluating models on large benchmarks? We introduce Lifelong Benchmarks, a dynamic approach for model evaluation with 1000x efficiency!
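The gist of Sort & Search, as a hedged sketch: if benchmark samples are pre-sorted from easy to hard using previous models' results, and a new model's performance is roughly monotone in that ordering, its accuracy can be estimated from O(log n) evaluations instead of n. The binary search below is a simplification of the actual method:

```python
def sort_and_search_eval(model_correct, samples_sorted_easy_to_hard):
    """Estimate accuracy by locating the difficulty at which the model
    starts failing, assuming roughly monotone performance over the
    pre-sorted samples. `model_correct(sample)` returns True/False."""
    lo, hi = 0, len(samples_sorted_easy_to_hard)
    n_evals = 0
    while lo < hi:
        mid = (lo + hi) // 2
        n_evals += 1
        if model_correct(samples_sorted_easy_to_hard[mid]):
            lo = mid + 1   # handles this difficulty; probe harder samples
        else:
            hi = mid       # fails here; probe easier samples
    est_accuracy = lo / len(samples_sorted_easy_to_hard)
    return est_accuracy, n_evals   # n_evals grows like log2(n), not n
```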
@bethgelab
Bethge Lab
10 months
CiteME: Can Language Models Accurately Cite Scientific Claims? TLDR: CiteME is a benchmark designed to test the ability of language models to find the papers cited in scientific texts. 🧵👇
@ori_press
Ori Press
1 year
Can AI help you cite papers? We built the CiteME benchmark to answer that. Given the text "We evaluate our model on [CITATION], a dataset consisting of black and white handwritten digits", the answer is: MNIST. CiteME has 130 questions; our best agent gets just 35.3% acc (1/5) 🧵
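A toy scorer for a CiteME-style setup, to make the task concrete. The normalized-title match and the `agent(excerpt)` interface are assumptions; the benchmark's real protocol may be stricter:

```python
def citeme_accuracy(agent, questions):
    """Toy scorer: the agent sees an excerpt with [CITATION] masked and
    must name the cited work; we match on a normalized title."""
    def norm(title):
        return "".join(ch for ch in title.lower() if ch.isalnum())
    hits = sum(norm(agent(q["excerpt"])) == norm(q["answer"]) for q in questions)
    return hits / len(questions)

# Example item, mirroring the tweet:
# {"excerpt": "We evaluate our model on [CITATION], a dataset consisting
#  of black and white handwritten digits", "answer": "MNIST"}
```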
@bethgelab
Bethge Lab
10 months
(1) No "Zero-Shot" Without Exponential Data. TLDR: We show that multimodal models require exponentially more data on a concept to linearly improve their performance on tasks pertaining to that concept, highlighting extreme sample inefficiency. 🧵👇
@vishaal_urao
Vishaal Udandarao
1 year
🚀New Preprint Alert! 📊 Exploring the notion of "Zero-Shot" generalization in foundation models. Is it all just a myth? Our latest preprint dives deep. Check it out! 🔍
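"Exponentially more data for linear gains" is a log-linear relationship, which a toy fit makes concrete (the frequency and accuracy numbers below are made up for illustration, not the paper's data):

```python
import numpy as np

# Performance roughly linear in log(frequency): each 10x increase in how
# often a concept appears in pretraining buys a fixed accuracy increment.
freq = np.array([1e2, 1e3, 1e4, 1e5, 1e6])       # concept occurrences
perf = np.array([0.22, 0.31, 0.40, 0.51, 0.60])  # hypothetical accuracy

slope, intercept = np.polyfit(np.log10(freq), perf, deg=1)
print(f"~{slope:.2f} accuracy per 10x more data (intercept {intercept:.2f})")
```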
@bethgelab
Bethge Lab
10 months
Excited to announce 5 accepted papers from our lab at #neurips2024!! 🥳🎉 Brief details of each below! Looking forward to an insightful December in Vancouver!