Benjamin Minixhofer
@bminixhofer
Followers 1K · Following 2K · Media 74 · Statuses 442
PhD Student @CambridgeLTL / Research Associate @allen_ai
Cambridge, England
Joined August 2015
Instruction tuning unlocks incredible skills in LLMs, but at a cost: they become dangerously overconfident. You face a choice: a well-calibrated base model or a capable but unreliable instruct model. What if you didn't have to choose? What if you could navigate the trade-off?
3 · 4 · 14
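For context on what "well-calibrated" means here, a minimal sketch of expected calibration error (ECE), one standard way to quantify overconfidence. The bin count and toy numbers are illustrative and not taken from the tweet or any paper.

import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Bin predictions by confidence and compare average confidence
    to empirical accuracy in each bin (a standard ECE estimate)."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(confidences[mask].mean() - correct[mask].mean())
            ece += mask.mean() * gap
    return ece

# Toy usage: an overconfident model reports high confidence but only
# mediocre accuracy, which inflates the ECE.
print(expected_calibration_error([0.95, 0.9, 0.99, 0.85], [1, 0, 1, 0]))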
@_AndrewZhao That’s where cross-tokenizer distillation comes in handy!
arxiv.org
Distillation has shown remarkable success in transferring knowledge from a Large Language Model (LLM) teacher to a student LLM. However, current distillation methods require similar tokenizers...
0 · 2 · 39
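For context on the "similar tokenizers" requirement mentioned in the abstract above: a minimal sketch of standard token-level KL distillation, which compares teacher and student distributions position by position over the same vocabulary. Shapes and names below are illustrative, not the paper's code.

import torch
import torch.nn.functional as F

def same_tokenizer_kd_loss(student_logits, teacher_logits, temperature=2.0):
    """Token-level KL distillation. Both logits tensors must have shape
    (batch, seq_len, vocab_size) over the *same* vocabulary and the *same*
    token positions -- i.e. a shared (or near-identical) tokenizer."""
    t = temperature
    teacher_probs = F.softmax(teacher_logits / t, dim=-1)
    student_logp = F.log_softmax(student_logits / t, dim=-1)
    # KL(teacher || student), averaged over the batch.
    return F.kl_div(student_logp, teacher_probs, reduction="batchmean") * t * t

# With different tokenizers, seq_len and vocab_size no longer line up,
# so this loss cannot even be evaluated -- the gap that cross-tokenizer
# distillation is meant to close.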
⚠️ Only 2 days remaining to apply for a postdoc at @EdinburghNLP! ⚠️
I am looking for a 2-year 𝗽𝗼𝘀𝘁𝗱𝗼𝗰 to work on efficient foundation models at @InfAtEd and @EPCCed! This is part of the @ARIA_research funding for Scaling Compute: AI at 1/1000th the cost
0 · 5 · 16
📢 New #COLM2025 paper 📢 Standard benchmarks give every LLM the same questions. This is like testing 5th graders and college seniors with *one* exam! 🥴 Meet Fluid Benchmarking, a capability-adaptive eval method delivering lower variance, higher validity, and reduced cost. 🧵
🚀 Introducing Fluid Benchmarking—an adaptive way to evaluate LLMs. Inspired by psychometrics, it tailors which questions to ask based on each model’s capability, making evals more efficient & reliable. 🧵
5 · 41 · 192
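A rough sketch of the psychometrics idea behind capability-adaptive testing, assuming a 2-parameter-logistic IRT model with Fisher-information item selection. This is the generic adaptive-testing recipe, not the Fluid Benchmarking implementation; all parameters and names are illustrative.

import numpy as np

def p_correct(ability, difficulty, discrimination):
    """2PL item response model: probability that a model with a given
    ability answers an item correctly."""
    return 1.0 / (1.0 + np.exp(-discrimination * (ability - difficulty)))

def fisher_information(ability, difficulty, discrimination):
    p = p_correct(ability, difficulty, discrimination)
    return discrimination ** 2 * p * (1.0 - p)

def next_item(ability, difficulties, discriminations, asked):
    """Pick the unasked item that is most informative at the current
    ability estimate (items with difficulty near the ability score
    carry the most information)."""
    info = fisher_information(ability, difficulties, discriminations)
    info[list(asked)] = -np.inf
    return int(np.argmax(info))

# Toy usage: an item bank with pre-calibrated difficulty/discrimination.
rng = np.random.default_rng(0)
difficulties = rng.normal(size=100)
discriminations = rng.uniform(0.5, 2.0, size=100)
print(next_item(ability=0.3, difficulties=difficulties,
                discriminations=discriminations, asked={1, 7}))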
I am looking for a 2-year 𝗽𝗼𝘀𝘁𝗱𝗼𝗰 to work on efficient foundation models at @InfAtEd and @EPCCed! This is part of the @ARIA_research funding for Scaling Compute: AI at 1/1000th the cost
1 · 18 · 30
I've been awarded a Starting Grant from @ERC_Research! As part of AToM-FM ⚛️, I'll study efficient architectures for foundation models with end-to-end tokenisation and adaptive+permanent memory. Building a greener, more democratic AI
📣 The ERC Starting Grant call results are out! Find out which early-career researchers will receive funding, what they will be investigating, where they will be based... plus lots of other #ERCStG facts & figures for 2025! ➡️ https://t.co/cGctMhcJos 🇪🇺 #HorizonEurope
14 · 18 · 142
Evaluating language models is tricky: how do we know if our results are real or due to random chance? We find an answer with two simple metrics: signal, a benchmark’s ability to separate models, and noise, a benchmark’s random variability between training steps 🧵
📢 New paper from Ai2: Signal & Noise asks a simple question—can language model benchmarks detect a true difference in model performance? 🧵
4 · 53 · 236
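A minimal sketch of the two quantities as described above: signal as the spread of scores across models, noise as one model's score variability across nearby training checkpoints. The concrete measures below (range and standard deviation) are assumptions for illustration; the paper's exact definitions may differ.

import numpy as np

def signal(model_scores):
    """How well a benchmark separates models: spread of scores across a
    set of models (here, the max-min range)."""
    scores = np.asarray(model_scores, dtype=float)
    return scores.max() - scores.min()

def noise(checkpoint_scores):
    """Random variability of one model's score across nearby training
    checkpoints (here, the standard deviation)."""
    return float(np.std(np.asarray(checkpoint_scores, dtype=float)))

def signal_to_noise(model_scores, checkpoint_scores):
    return signal(model_scores) / noise(checkpoint_scores)

# Toy usage: a benchmark whose between-model spread dwarfs its
# step-to-step jitter supports more trustworthy comparisons.
print(signal_to_noise([0.41, 0.48, 0.55, 0.62],
                      [0.546, 0.552, 0.549, 0.551]))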
This one is also pretty good
1 · 0 · 11
- Retrofitting Large Language Models with Dynamic Tokenization @licwu @bminixhofer
- Culturally Aware and Adapted NLP: A Taxonomy and a Survey of the State of the Art @ChenLiu47008770
- DARE: Diverse Visual Question Answering with Robustness Evaluation Hannah Sterz, @licwu
1 · 1 · 1
In a weird way, I think this work supports @bminixhofer's idea of tokenizer transfer for a single model by equalizing the log prob of spans of text from a corpus (rather than trying to equalize the log prob of some huge tree of continuations)
New paper & surprising result. LLMs transmit traits to other models via hidden signals in data. Datasets consisting only of 3-digit numbers can transmit a love for owls, or evil tendencies. 🧵
0 · 1 · 2
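A hedged sketch of the span-level idea in the comment above: make a model with a new tokenizer assign the same total log-probability to a piece of text as the original model does under the old tokenizer. The HF-style model(...).logits interface and all names are assumptions, not anyone's actual implementation.

import torch
import torch.nn.functional as F

def span_logprob(model, token_ids):
    """Total log-probability the model assigns to a tokenized span
    (teacher forcing: each token predicted from its prefix).
    Assumes a HF-style forward pass returning .logits."""
    logits = model(token_ids.unsqueeze(0)).logits[0]   # (len, vocab)
    logp = F.log_softmax(logits[:-1], dim=-1)          # predict next token
    targets = token_ids[1:]
    return logp.gather(-1, targets.unsqueeze(-1)).sum()

def span_equalization_loss(teacher, student, old_ids, new_ids):
    """Hypothetical objective: the retrofitted model (new tokenizer) should
    assign the same total log-prob to a text span as the original model
    (old tokenizer) -- matching span-level likelihoods rather than
    token-by-token distributions."""
    with torch.no_grad():
        target = span_logprob(teacher, old_ids)
    return (span_logprob(student, new_ids) - target) ** 2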
New paper & surprising result. LLMs transmit traits to other models via hidden signals in data. Datasets consisting only of 3-digit numbers can transmit a love for owls, or evil tendencies. 🧵
289 · 1K · 8K
We built sparse-frontier — a clean abstraction that lets you focus on your custom sparse attention implementation while automatically inheriting vLLM’s optimizations and model support. As a PhD student, I've learned that sometimes the bottleneck in research isn't ideas — it's
9 · 52 · 321
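The tweet above is about infrastructure rather than the math, but for context, here is one example of the kind of sparsity pattern a "custom sparse attention implementation" might define (a causal sliding window plus a few attention-sink tokens). This is a generic illustration, not sparse-frontier's API.

import torch

def local_plus_sink_mask(seq_len, window=128, n_sink=4):
    """Boolean attention mask: each query attends to the first few
    'sink' tokens plus a sliding window of recent tokens (causal)."""
    q = torch.arange(seq_len).unsqueeze(1)   # query positions
    k = torch.arange(seq_len).unsqueeze(0)   # key positions
    causal = k <= q
    local = (q - k) < window
    sink = k < n_sink
    return causal & (local | sink)

mask = local_plus_sink_mask(seq_len=1024)
print(mask.float().mean())  # fraction of pairs attended vs dense attention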
There are also many other things that didn’t fit in this thread. These include using our method to ensemble heterogeneous models and to improve zero-shot tokenizer transfer. Details are in the paper!
0 · 0 · 2
This allows cheaply creating new byte-level models, instead of needing to expensively train from scratch (where many things can go wrong). It could finally allow us to get rid of subword tokenization. Our paper, models and easy-to-use code are public today.
1 · 0 · 3
The resulting byte-level models perform competitively on benchmarks even though we are using a very short (~300M subword token) training regime.
1 · 0 · 1
Distilling across tokenizers was previously only possible heuristically, and only between very similar tokenizers. We introduce a more capable and principled method that minimizes the likelihood difference between aligned chunks of teacher and student tokens.
1 · 0 · 2
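A hedged sketch of the objective described above, assuming the alignment between teacher and student chunks (which tokens of each model cover the same characters) is already computed: push the student's per-chunk log-likelihoods toward the teacher's. Names, shapes, and the absolute-difference penalty are illustrative, not the paper's exact formulation.

import torch
import torch.nn.functional as F

def chunk_loglik(logits, token_ids, chunk_slices):
    """Log-likelihood of each aligned chunk: sum of log-probs of the
    tokens inside that chunk (teacher-forced). chunk_slices index into
    the next-token positions, i.e. positions 1..seq_len-1."""
    logp = F.log_softmax(logits[:, :-1], dim=-1)
    tok_lp = logp.gather(-1, token_ids[:, 1:].unsqueeze(-1)).squeeze(-1)
    return torch.stack([tok_lp[:, s].sum(dim=-1) for s in chunk_slices], dim=-1)

def cross_tokenizer_distill_loss(teacher_logits, teacher_ids, teacher_chunks,
                                 student_logits, student_ids, student_chunks):
    """Hypothetical loss: chunks cover the same characters under both
    tokenizers, so their likelihoods are comparable even though the
    token sequences are not."""
    with torch.no_grad():
        t_ll = chunk_loglik(teacher_logits, teacher_ids, teacher_chunks)
    s_ll = chunk_loglik(student_logits, student_ids, student_chunks)
    return (s_ll - t_ll).abs().mean()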
BLT showed that it is possible to create a byte-level model by retrofitting an existing subword-based model. However, they simply train via next-byte prediction. This is (as we show) very destructive. We can do much better if we instead *distil* from subwords to bytes.
1 · 0 · 3
We achieved the first instance of successful subword-to-byte distillation in our (just updated) paper. This enables creating byte-level models at a fraction of the cost of what was needed previously. As a proof-of-concept, we created byte-level Gemma2 and Llama3 models. 🧵
1 · 15 · 69
Excited to see our study on linguistic generalization in LLMs featured by @UniofOxford News!
NEW: Large language models (LLMs) – the AI systems behind chatbots like ChatGPT – generalise language patterns in a surprisingly human-like way: through analogy, rather than strict grammatical rules.
0 · 2 · 20