
Bethge Lab (@bethgelab)
Perceiving Neural Networks
Tübingen, Germany · Joined July 2017
3K Followers · 319 Following · 61 Media · 311 Statuses
RT @adhiraj_ghosh98: Excited to be in Vienna for #ACL2025🇦🇹! You'll find @sbdzdz and me by our ONEBench poster, so do drop by! 🗓️Wed, July…
RT @ori_press: Do language models have algorithmic creativity? To find out, we built AlgoTune, a benchmark challenging agents to optimize…
Recent work from our lab asking how to fairly evaluate and measure progress in language model reasoning! Check out the full thread below!
🚀New Paper! Everyone's celebrating rapid progress in math reasoning with RL/SFT. But how real is this progress? We re-evaluated recently released popular reasoning models and found that reported gains often vanish under rigorous testing!! 👀 🧵👇
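For intuition, "rigorous testing" here boils down to variance-aware evaluation: rerun the benchmark under several seeds and report a mean with an uncertainty estimate instead of a single number. A minimal Python sketch, where `evaluate` is a hypothetical stand-in for any accuracy-returning harness:

```python
import statistics

def evaluate_with_ci(evaluate, model, seeds=range(10)):
    """Run a benchmark under several random seeds and report
    mean accuracy with a simple 95% confidence interval."""
    scores = [evaluate(model, seed=s) for s in seeds]
    mean = statistics.mean(scores)
    # Standard error of the mean; 1.96 approximates a 95% CI.
    sem = statistics.stdev(scores) / len(scores) ** 0.5
    return mean, 1.96 * sem
```

Reported single-run gains that fall inside this interval are indistinguishable from noise.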
RT @lukas_thede: 🧠 Keeping LLMs factually up to date is a common motivation for knowledge editing. But what would it actually take to supp…
RT @CgtyYldz: For our "Automated Assessment of Teaching Quality" project, we are looking for two PhD students: one in educational/cognitive…
RT @shiven_sinha: AI can generate correct-seeming hypotheses (and papers!). Brandolini's law states BS is harder to refute than generate. C…
Check out this cool new work from Bethgelab & friends! Falsifying flawed solutions is key to science, but LMs aren't there yet. Even advanced models produce counterexamples for <9% of mistakes, despite solving ~48% of problems. Full thread below:
AI can generate correct-seeming hypotheses (and papers!). Brandolini's law states BS is harder to refute than generate. Can LMs falsify incorrect solutions? o3-mini (high) scores just 9% on our new benchmark REFUTE. Verification is not necessarily easier than generation 🧵
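Falsification in this setting can be operationalized as producing an input on which a flawed program visibly misbehaves. A rough sketch, assuming callable `reference` (trusted) and `candidate` (flawed) solutions; REFUTE's actual harness may well differ:

```python
def is_valid_counterexample(reference, candidate, test_input):
    """A counterexample is an input on which the flawed candidate
    solution misbehaves relative to the trusted reference."""
    expected = reference(test_input)
    try:
        return candidate(test_input) != expected
    except Exception:
        # Crashing on the input also exposes the flaw.
        return True
```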
Check out some cool data-centric analysis of reasoning datasets! More to come from our lab!
CuratedThoughts: Data Curation for RL Datasets 🚀 Since DeepSeek-R1 introduced reasoning-based RL, datasets like Open-R1 & OpenThoughts have emerged for fine-tuning & GRPO. Our deep dive found major flaws: 25% of OpenThoughts had to be eliminated through data curation. Here's why 👇🧵
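As an illustration of what such a curation pass can look like (a sketch, not the paper's actual pipeline), the following drops duplicate prompts and examples whose reference answers fail a check; `verify_answer` is a hypothetical verifier:

```python
def curate(dataset, verify_answer):
    """Keep de-duplicated examples whose reference answers pass a
    verification check (e.g., re-parsing or re-solving)."""
    seen, kept = set(), []
    for example in dataset:
        key = example["prompt"].strip().lower()
        if key in seen:
            continue  # drop exact duplicate prompts
        seen.add(key)
        if verify_answer(example):  # hypothetical verifier callable
            kept.append(example)
    return kept
```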
RT @ahochlehnert: CuratedThoughts: Data Curation for RL Datasets 🚀 Since DeepSeek-R1 introduced reasoning-based RL, datasets like Open-R1…
RT @ShashwatGoel7: 🚨Great Models Think Alike and this Undermines AI Oversight🚨 New paper quantifies LM similarity. (1) LLM-as-a-judge favor…
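One simple way to quantify this kind of similarity (an illustrative metric, not necessarily the one in the paper) is to measure how often two models make the identical mistake:

```python
def same_mistake_rate(preds_a, preds_b, labels):
    """Fraction of examples where two models make the identical
    mistake, among examples where at least one model errs."""
    both_same_error = sum(
        a == b and a != y for a, b, y in zip(preds_a, preds_b, labels)
    )
    any_error = sum(
        a != y or b != y for a, b, y in zip(preds_a, preds_b, labels)
    )
    return both_same_error / any_error if any_error else 0.0
```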
Check out the latest work from our lab on how to merge multimodal models over time. We find several exciting insights with implications for model merging, continual pretraining, and distributed/federated training! Full thread below:
🚀New Paper. Model merging is all the rage these days: simply fine-tune multiple task-specific models and merge them at the end. Guaranteed perf boost! But wait, what if you get new tasks over time, sequentially? How do you merge your models over time? 🧵👇
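For a concrete picture, one baseline for temporal merging is a running exponential moving average over sequential task checkpoints; a bare-bones sketch of that strategy (not necessarily the paper's method):

```python
def merge_over_time(checkpoints, alpha=0.5):
    """Fold each new task's weights into a running average, so the
    merged model tracks the task sequence instead of one snapshot."""
    merged = dict(checkpoints[0])  # state_dict-like: name -> tensor
    for ckpt in checkpoints[1:]:
        for name, weight in ckpt.items():
            merged[name] = (1 - alpha) * merged[name] + alpha * weight
    return merged
```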
RT @marcel_binz: Excited to announce Centaur -- the first foundation model of human cognition. Centaur can predict and simulate human behav…
huggingface.co
Great work by all our lab members and other collaborators! @explainableml @GoogleDeepMind @CAML_Lab
A Practitioner's Guide to Real-World Continual Multimodal Pretraining. We provide practical insights into how to continually pretrain contrastive multimodal models under compute and data constraints. 🧵👇
🚀New Paper: "A Practitioner's Guide to Continual Multimodal Pretraining"! 🌐Foundation models like CLIP need constant updates to stay relevant. How do we do this in the real world? Answer: continual pretraining! We studied how to do it effectively. 🧵👇
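One standard ingredient in such continual pretraining setups is replay: mix a fraction of old data into every batch of new data to limit forgetting. A toy sketch under that assumption (not the paper's exact recipe):

```python
import random

def mixed_batch(new_data, replay_buffer, batch_size=256, replay_frac=0.5):
    """Assemble a batch that interleaves fresh samples with replayed
    old samples, a common way to limit forgetting during updates."""
    n_replay = min(int(batch_size * replay_frac), len(replay_buffer))
    batch = random.sample(new_data, batch_size - n_replay)
    batch += random.sample(replay_buffer, n_replay)
    random.shuffle(batch)
    return batch
```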
Efficient Lifelong Model Evaluation in an Era of Rapid Progress. TLDR: This work tackles spiraling evaluation costs with an efficient evaluation framework, Sort & Search (S&S), that reduces them by 99x. 🧵👇
🚀New Paper Alert! Ever faced the challenge of ML models overfitting to benchmarks? Have computational difficulties evaluating models on large benchmarks? We introduce Lifelong Benchmarks, a dynamic approach to model evaluation with 1000x efficiency!
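The name hints at the mechanism: sort test items by difficulty estimated from previously evaluated models, then binary-search where a new model's solve/fail boundary falls, touching only a handful of items. A rough sketch under that reading; `fail_rate` is a hypothetical difficulty estimate, and the paper's actual algorithm may differ:

```python
def sort_and_search(items, fail_rate, eval_model, probes=10):
    """Rank items by difficulty estimated from past models, then
    binary-search for the new model's solve/fail boundary."""
    ranked = sorted(items, key=fail_rate)  # easiest first
    lo, hi = 0, len(ranked)
    for _ in range(probes):
        if lo >= hi:
            break
        mid = (lo + hi) // 2
        if eval_model(ranked[mid]):  # solves an item of this difficulty
            lo = mid + 1
        else:
            hi = mid
    return lo / len(ranked)  # estimated accuracy along the ranking
```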
CiteME: Can Language Models Accurately Cite Scientific Claims? TLDR: CiteME is a benchmark that tests whether language models can find the papers cited in scientific texts. 🧵👇
Can AI help you cite papers? We built the CiteME benchmark to answer that. Given the text "We evaluate our model on [CITATION], a dataset consisting of black and white handwritten digits", the answer is MNIST. CiteME has 130 questions; our best agent gets just 35.3% accuracy. (1/5) 🧵
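Scoring such a benchmark can be as simple as normalized exact title matching; a sketch with hypothetical field names (`excerpt`, `cited_title`) standing in for whatever CiteME actually uses:

```python
def citeme_accuracy(examples, agent):
    """Score an agent on CiteME-style items: an excerpt with a masked
    [CITATION] must be resolved to the title of the cited paper."""
    def norm(title):
        return " ".join(title.lower().split())
    correct = sum(
        norm(agent(ex["excerpt"])) == norm(ex["cited_title"])
        for ex in examples
    )
    return correct / len(examples)
```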
No "Zero-Shot" Without Exponential Data. TLDR: We show that multimodal models require exponentially more data on a concept to linearly improve their performance on tasks pertaining to that concept, highlighting extreme sample inefficiency. 🧵👇
🚀New Preprint Alert! 📊Exploring the notion of "Zero-Shot" generalization in foundation models. Is it all just a myth? Our latest preprint dives deep. Check it out! 🔍
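"Exponentially more data for linear gains" means performance is roughly linear in the log of a concept's training frequency. A minimal least-squares fit of that shape (illustrative; the paper's analysis is more involved):

```python
import math

def fit_log_linear(freqs, scores):
    """Least-squares fit of score = a * log(freq) + b, the shape of
    the claimed trend: exponentially more data per linear gain."""
    xs = [math.log(f) for f in freqs]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(scores) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, scores))
    slope /= sum((x - mean_x) ** 2 for x in xs)
    return slope, mean_y - slope * mean_x  # slope a, intercept b
```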
Excited to announce 5 accepted papers from our lab at #neurips2024!! 🥳🎉 Brief details of each below! Looking forward to an insightful December in Vancouver!