Benjamin Minixhofer
@bminixhofer
Followers 1K · Following 2K · Media 74 · Statuses 442
PhD Student @CambridgeLTL / Research Associate @allen_ai
Cambridge, England
Joined August 2015
Instruction tuning unlocks incredible skills in LLMs, but at a cost: they become dangerously overconfident. You face a choice: a well-calibrated base model or a capable but unreliable instruct model. What if you didn't have to choose? What if you could navigate the trade-off?
3 · 4 · 14
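For context on what "well-calibrated" means here, a minimal sketch of expected calibration error (ECE), one standard way to quantify overconfidence. The bin count and toy numbers are illustrative and not taken from the tweet or any paper.

import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Bin predictions by confidence and compare average confidence
    to empirical accuracy in each bin (a standard ECE estimate)."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(confidences[mask].mean() - correct[mask].mean())
            ece += mask.mean() * gap
    return ece

# Toy usage: an overconfident model reports high confidence but only
# mediocre accuracy, which inflates the ECE.
print(expected_calibration_error([0.95, 0.9, 0.99, 0.85], [1, 0, 1, 0]))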
@_AndrewZhao That’s where cross-tokenizer distillation comes in handy!
arxiv.org
Distillation has shown remarkable success in transferring knowledge from a Large Language Model (LLM) teacher to a student LLM. However, current distillation methods require similar tokenizers...
0 · 2 · 39
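For context on the "similar tokenizers" requirement mentioned in the abstract above: a minimal sketch of standard token-level KL distillation, which compares teacher and student distributions position by position over the same vocabulary. Shapes and names below are illustrative, not the paper's code.

import torch
import torch.nn.functional as F

def same_tokenizer_kd_loss(student_logits, teacher_logits, temperature=2.0):
    """Token-level KL distillation. Both logits tensors must have shape
    (batch, seq_len, vocab_size) over the *same* vocabulary and the *same*
    token positions -- i.e. a shared (or near-identical) tokenizer."""
    t = temperature
    teacher_probs = F.softmax(teacher_logits / t, dim=-1)
    student_logp = F.log_softmax(student_logits / t, dim=-1)
    # KL(teacher || student), averaged over the batch.
    return F.kl_div(student_logp, teacher_probs, reduction="batchmean") * t * t

# With different tokenizers, seq_len and vocab_size no longer line up,
# so this loss cannot even be evaluated -- the gap that cross-tokenizer
# distillation is meant to close.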
⚠️ Only 2 days remaining to apply for a postdoc at @EdinburghNLP! ⚠️
I am looking for a 2-year 𝗽𝗼𝘀𝘁𝗱𝗼𝗰 to work on efficient foundation models at @InfAtEd and @EPCCed! This is part of the @ARIA_research funding for Scaling Compute: AI at 1/1000th the cost
0 · 5 · 16
📢 New #COLM2025 paper 📢 Standard benchmarks give every LLM the same questions. This is like testing 5th graders and college seniors with *one* exam! 🥴 Meet Fluid Benchmarking, a capability-adaptive eval method delivering lower variance, higher validity, and reduced cost. 🧵
🚀 Introducing Fluid Benchmarking—an adaptive way to evaluate LLMs. Inspired by psychometrics, it tailors which questions to ask based on each model’s capability, making evals more efficient & reliable. 🧵
5 · 41 · 192
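A rough sketch of the psychometrics idea behind capability-adaptive testing, assuming a 2-parameter-logistic IRT model with Fisher-information item selection. This is the generic adaptive-testing recipe, not the Fluid Benchmarking implementation; all parameters and names are illustrative.

import numpy as np

def p_correct(ability, difficulty, discrimination):
    """2PL item response model: probability that a model with a given
    ability answers an item correctly."""
    return 1.0 / (1.0 + np.exp(-discrimination * (ability - difficulty)))

def fisher_information(ability, difficulty, discrimination):
    p = p_correct(ability, difficulty, discrimination)
    return discrimination ** 2 * p * (1.0 - p)

def next_item(ability, difficulties, discriminations, asked):
    """Pick the unasked item that is most informative at the current
    ability estimate (items with difficulty near the ability score
    carry the most information)."""
    info = fisher_information(ability, difficulties, discriminations)
    info[list(asked)] = -np.inf
    return int(np.argmax(info))

# Toy usage: an item bank with pre-calibrated difficulty/discrimination.
rng = np.random.default_rng(0)
difficulties = rng.normal(size=100)
discriminations = rng.uniform(0.5, 2.0, size=100)
print(next_item(ability=0.3, difficulties=difficulties,
                discriminations=discriminations, asked={1, 7}))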
I am looking for a 2-year 𝗽𝗼𝘀𝘁𝗱𝗼𝗰 to work on efficient foundation models at @InfAtEd and @EPCCed! This is part of the @ARIA_research funding for Scaling Compute: AI at 1/1000th the cost
1 · 18 · 30
I've been awarded a Starting Grant from @ERC_Research! As part of AToM-FM ⚛️, I'll study efficient architectures for foundation models with end-to-end tokenisation and adaptive+permanent memory. Building a greener, more democratic AI
📣 The ERC Starting Grant call results are out! Find out which early-career researchers will receive funding, what they will be investigating, where they will be based... plus lots of other #ERCStG facts & figures for 2025! ➡️ https://t.co/cGctMhcJos 🇪🇺 #HorizonEurope
14 · 18 · 142
Evaluating language models is tricky: how do we know if our results are real or due to random chance? We find an answer with two simple metrics: signal, a benchmark’s ability to separate models, and noise, a benchmark’s random variability between training steps 🧵
📢 New paper from Ai2: Signal & Noise asks a simple question—can language model benchmarks detect a true difference in model performance? 🧵
4 · 53 · 236
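A minimal sketch of the two quantities as described above: signal as the spread of scores across models, noise as one model's score variability across nearby training checkpoints. The concrete measures below (range and standard deviation) are assumptions for illustration; the paper's exact definitions may differ.

import numpy as np

def signal(model_scores):
    """How well a benchmark separates models: spread of scores across a
    set of models (here, the max-min range)."""
    scores = np.asarray(model_scores, dtype=float)
    return scores.max() - scores.min()

def noise(checkpoint_scores):
    """Random variability of one model's score across nearby training
    checkpoints (here, the standard deviation)."""
    return float(np.std(np.asarray(checkpoint_scores, dtype=float)))

def signal_to_noise(model_scores, checkpoint_scores):
    return signal(model_scores) / noise(checkpoint_scores)

# Toy usage: a benchmark whose between-model spread dwarfs its
# step-to-step jitter supports more trustworthy comparisons.
print(signal_to_noise([0.41, 0.48, 0.55, 0.62],
                      [0.546, 0.552, 0.549, 0.551]))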
This one is also pretty good
1 · 0 · 11
- Retrofitting Large Language Models with Dynamic Tokenization @licwu @bminixhofer
- Culturally Aware and Adapted NLP: A Taxonomy and a Survey of the State of the Art @ChenLiu47008770
- DARE: Diverse Visual Question Answering with Robustness Evaluation Hannah Sterz, @licwu
1 · 1 · 1
In a weird way, I think this work supports @bminixhofer's idea of tokenizer transfer for a single model by equalizing the log prob of spans of text from a corpus (rather than trying to equalize the log prob of some huge tree of continuations)
New paper & surprising result. LLMs transmit traits to other models via hidden signals in data. Datasets consisting only of 3-digit numbers can transmit a love for owls, or evil tendencies. 🧵
0 · 1 · 2
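A hedged sketch of the span-level idea in the comment above: make a model with a new tokenizer assign the same total log-probability to a piece of text as the original model does under the old tokenizer. The HF-style model(...).logits interface and all names are assumptions, not anyone's actual implementation.

import torch
import torch.nn.functional as F

def span_logprob(model, token_ids):
    """Total log-probability the model assigns to a tokenized span
    (teacher forcing: each token predicted from its prefix).
    Assumes a HF-style forward pass returning .logits."""
    logits = model(token_ids.unsqueeze(0)).logits[0]   # (len, vocab)
    logp = F.log_softmax(logits[:-1], dim=-1)          # predict next token
    targets = token_ids[1:]
    return logp.gather(-1, targets.unsqueeze(-1)).sum()

def span_equalization_loss(teacher, student, old_ids, new_ids):
    """Hypothetical objective: the retrofitted model (new tokenizer) should
    assign the same total log-prob to a text span as the original model
    (old tokenizer) -- matching span-level likelihoods rather than
    token-by-token distributions."""
    with torch.no_grad():
        target = span_logprob(teacher, old_ids)
    return (span_logprob(student, new_ids) - target) ** 2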
New paper & surprising result. LLMs transmit traits to other models via hidden signals in data. Datasets consisting only of 3-digit numbers can transmit a love for owls, or evil tendencies. 🧵
289 · 1K · 8K
We built sparse-frontier — a clean abstraction that lets you focus on your custom sparse attention implementation while automatically inheriting vLLM’s optimizations and model support. As a PhD student, I've learned that sometimes the bottleneck in research isn't ideas — it's
9 · 52 · 321
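The tweet above is about infrastructure rather than the math, but for context, here is one example of the kind of sparsity pattern a "custom sparse attention implementation" might define (a causal sliding window plus a few attention-sink tokens). This is a generic illustration, not sparse-frontier's API.

import torch

def local_plus_sink_mask(seq_len, window=128, n_sink=4):
    """Boolean attention mask: each query attends to the first few
    'sink' tokens plus a sliding window of recent tokens (causal)."""
    q = torch.arange(seq_len).unsqueeze(1)   # query positions
    k = torch.arange(seq_len).unsqueeze(0)   # key positions
    causal = k <= q
    local = (q - k) < window
    sink = k < n_sink
    return causal & (local | sink)

mask = local_plus_sink_mask(seq_len=1024)
print(mask.float().mean())  # fraction of pairs attended vs dense attention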
There are also many other things that didn’t fit in this thread. These include using our method to ensemble heterogeneous models and to improve zero-shot tokenizer transfer. Details are in the paper!
0 · 0 · 2
This allows cheaply creating new byte-level models, instead of needing to expensively train from scratch (where many things can go wrong). It could finally allow us to get rid of subword tokenization. Our paper, models and easy-to-use code are public today.
1 · 0 · 3
The resulting byte-level models perform competitively on benchmarks even though we are using a very short (~300M subword token) training regime.
1 · 0 · 1
Distilling across tokenizers was previously only possible heuristically, and only between very similar tokenizers. We introduce a more capable and principled method that minimizes the likelihood difference between aligned chunks of teacher and student tokens.
1 · 0 · 2
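A hedged sketch of the objective described above, assuming the alignment between teacher and student chunks (which tokens of each model cover the same characters) is already computed: push the student's per-chunk log-likelihoods toward the teacher's. Names, shapes, and the absolute-difference penalty are illustrative, not the paper's exact formulation.

import torch
import torch.nn.functional as F

def chunk_loglik(logits, token_ids, chunk_slices):
    """Log-likelihood of each aligned chunk: sum of log-probs of the
    tokens inside that chunk (teacher-forced). chunk_slices index into
    the next-token positions, i.e. positions 1..seq_len-1."""
    logp = F.log_softmax(logits[:, :-1], dim=-1)
    tok_lp = logp.gather(-1, token_ids[:, 1:].unsqueeze(-1)).squeeze(-1)
    return torch.stack([tok_lp[:, s].sum(dim=-1) for s in chunk_slices], dim=-1)

def cross_tokenizer_distill_loss(teacher_logits, teacher_ids, teacher_chunks,
                                 student_logits, student_ids, student_chunks):
    """Hypothetical loss: chunks cover the same characters under both
    tokenizers, so their likelihoods are comparable even though the
    token sequences are not."""
    with torch.no_grad():
        t_ll = chunk_loglik(teacher_logits, teacher_ids, teacher_chunks)
    s_ll = chunk_loglik(student_logits, student_ids, student_chunks)
    return (s_ll - t_ll).abs().mean()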
BLT showed that it is possible to create a byte-level model by retrofitting an existing subword-based model. However, they simply train via next-byte prediction. This is (as we show) very destructive. We can do much better if we instead *distil* from subwords to bytes.
1 · 0 · 3
We achieved the first instance of successful subword-to-byte distillation in our (just updated) paper. This enables creating byte-level models at a fraction of the cost of what was needed previously. As a proof-of-concept, we created byte-level Gemma2 and Llama3 models. 🧵
1 · 15 · 69
Excited to see our study on linguistic generalization in LLMs featured by @UniofOxford News!
NEW: Large language models (LLMs) – the AI systems behind chatbots like ChatGPT – generalise language patterns in a surprisingly human-like way: through analogy, rather than strict grammatical rules.
0 · 2 · 20