Pieter Delobelle
@pieterdelobelle
Followers
502
Following
528
Media
42
Statuses
142
Postdoctoral AI researcher on LLM pretraining, tokenization & safety - Prev: @apple, @aleph__alpha, @milaNLProc, PhD & postdoc @KU_Leuven
Berlin
Joined April 2012
Over the last few weeks I've been working on synthetic datasets, so I made a small LLM scheduler to process large batches reliably, called LLMQ. It's a simple CLI tool that submits jobs (from JSONL or a HF dataset) to RabbitMQ, where multiple workers can take jobs from their queue.
1
1
10
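For illustration, a minimal sketch of the submit/consume pattern described above, written directly against the pika client rather than llmq's own code; the queue name and file handling are assumptions, not llmq's actual API:

```python
# Sketch of the submit/consume pattern, not llmq's actual code.
# Queue name and file are made up; pika is the real RabbitMQ client.
import json
import pika

QUEUE = "llmq-jobs"  # hypothetical queue name

def submit(jsonl_path: str) -> None:
    """Publish every JSONL line as one job message."""
    conn = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
    ch = conn.channel()
    ch.queue_declare(queue=QUEUE, durable=True)   # survive broker restarts
    with open(jsonl_path) as f:
        for line in f:
            ch.basic_publish(exchange="", routing_key=QUEUE, body=line.encode(),
                             properties=pika.BasicProperties(delivery_mode=2))
    conn.close()

def worker() -> None:
    """Take one job at a time; ack only after it has been processed."""
    conn = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
    ch = conn.channel()
    ch.queue_declare(queue=QUEUE, durable=True)
    ch.basic_qos(prefetch_count=1)   # don't hoard jobs while one is running

    def on_message(channel, method, properties, body):
        sample = json.loads(body)
        # ... run the LLM on `sample` here ...
        channel.basic_ack(delivery_tag=method.delivery_tag)

    ch.basic_consume(queue=QUEUE, on_message_callback=on_message)
    ch.start_consuming()
```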
🥳 Our paper "ProbLog4Fairness: A Neurosymbolic Approach to Modeling and Mitigating Bias" is accepted at #AAAI2026! ProbLog4Fairness lets you write down how you think bias enters your dataset using probabilistic logic, then automatically corrects for it during neural network training.
1
2
2
When is a language hard to model? Previous research has suggested that morphological complexity both does and does not play a role, but those studies relate the performance of language models to corpus statistics of words or subword tokens in isolation.
1
4
0
We start with the MLP layer, as that provides the biggest VRAM wins, and also parallelize the attention heads for a good speedup. Full blog post: https://t.co/Tfgwi3bMiM
0
0
1
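As a rough illustration of the split described above (not the blog post's actual code): a Megatron-style tensor-parallel MLP where the first projection is split column-wise and the second row-wise, so each GPU holds one shard and a single all-reduce recombines the outputs. This assumes torch.distributed is already initialized with one process per GPU.

```python
# Minimal Megatron-style tensor-parallel MLP sketch (illustration only).
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.distributed as dist

class TensorParallelMLP(nn.Module):
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        world = dist.get_world_size()
        assert d_ff % world == 0, "hidden dim must divide evenly across GPUs"
        shard = d_ff // world
        # Column-parallel: this rank only holds its slice of the hidden dim.
        self.up = nn.Linear(d_model, shard)
        # Row-parallel: project the local slice back to d_model (no bias, so
        # the all-reduce below doesn't add the bias once per rank).
        self.down = nn.Linear(shard, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        partial = self.down(F.gelu(self.up(x)))
        # The per-rank partial outputs sum to the full MLP output.
        dist.all_reduce(partial, op=dist.ReduceOp.SUM)
        return partial
```

Attention heads can be sharded the same way, with each GPU owning a subset of heads and one all-reduce after the output projection.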
I wrote a new blog post on implementing tensor parallelism for my "LLM inference from scratch" series. Now our inference engine can finally serve models that don't fit on a single GPU.
1
0
6
Breaking: we release a fully synthetic generalist dataset for pretraining, SYNTH, and two new SOTA reasoning models trained exclusively on it. Despite having seen only 200 billion tokens, Baguettotron is currently best-in-class in its size range.
80
151
1K
The gap between training models and serving them efficiently is often underestimated. Most researchers don't realize what goes into production inference. Code and detailed analysis available here: https://t.co/FZEU86AyxF https://t.co/WhYZXHUsid More optimizations
pieter.ai
LLM inference from scratch
0
0
0
But KV caching alone isn't enough. Sampling was taking 1.77ms per step, since we were sorting the entire vocabulary. Fused rejection sampling kernels eliminated this bottleneck. I also captured CUDA graphs for the prefill and decode phases, which helps a lot for smaller batches.
1
0
0
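A minimal sketch of the CUDA-graph part, assuming PyTorch's torch.cuda.CUDAGraph API; the model call, shapes, and function names are placeholders, not the engine's real decode step:

```python
# Capture one fixed-shape decode step into a CUDA graph and replay it.
import torch

@torch.inference_mode()
def build_decode_graph(model, batch_size: int, vocab_size: int):
    device = "cuda"
    # Static buffers: the captured graph always reads from / writes to these.
    static_tokens = torch.zeros(batch_size, 1, dtype=torch.long, device=device)
    static_logits = torch.zeros(batch_size, vocab_size, device=device)

    # Warm up once so one-time CUDA initialization isn't baked into the graph.
    static_logits.copy_(model(static_tokens)[:, -1])
    torch.cuda.synchronize()

    graph = torch.cuda.CUDAGraph()
    with torch.cuda.graph(graph):
        static_logits.copy_(model(static_tokens)[:, -1])

    def decode_step(tokens: torch.Tensor) -> torch.Tensor:
        static_tokens.copy_(tokens)  # copy new inputs into the static buffer
        graph.replay()               # one launch replays all captured kernels
        return static_logits
    return decode_step
```

Replaying a pre-recorded graph removes the per-kernel launch overhead, which is why the gain is largest for small batches where launches dominate.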
The most important performance gain is KV caching. Before: recompute attention for every token (compute-bound). After: cache keys/values (memory-bound). This change enables batch parallelism and is foundational for serving LLMs profitably.
1
0
0
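A toy illustration of the before/after (not the engine's actual code): with a KV cache, each new token computes only its own query, key, and value, and attends over the stored keys and values instead of re-running the whole prefix.

```python
# Toy single-head attention with a KV cache (illustration only).
import torch
import torch.nn.functional as F

class CachedSelfAttention(torch.nn.Module):
    def __init__(self, d: int):
        super().__init__()
        self.qkv = torch.nn.Linear(d, 3 * d)
        self.k_cache = []  # one (batch, 1, d) tensor per generated token
        self.v_cache = []

    def forward(self, x_new: torch.Tensor) -> torch.Tensor:
        # x_new: (batch, 1, d) -- only the newest token, not the whole prefix.
        q, k, v = self.qkv(x_new).chunk(3, dim=-1)
        self.k_cache.append(k)
        self.v_cache.append(v)
        K = torch.cat(self.k_cache, dim=1)   # (batch, seq_so_far, d)
        V = torch.cat(self.v_cache, dim=1)
        att = (q @ K.transpose(1, 2)) / (q.shape[-1] ** 0.5)
        return F.softmax(att, dim=-1) @ V    # (batch, 1, d)
```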
I built an LLM inference engine from scratch to learn what goes into serving models efficiently. Starting from @karpathy's nanoGPT with a simple generate() function, I added KV caching, fused sampling (from Flashinfer), CUDA graphs, etc... Let me share some insights. (1/4 🧵)
2
4
16
I also added support for YAML to configure the pipelines, so processing an entire HF dataset is as easy as:
$ llmq submit -p example-pipeline.yaml epfl-llm/guidelines
Here, each sample will get translated and formatted into markdown. https://t.co/YqsLr3fRzQ
github.com
A Scheduler for Batched LLM Inference. Contribute to iPieter/llmq development by creating an account on GitHub.
0
0
0
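Roughly what submitting a whole HF dataset amounts to under the hood, as a hedged sketch rather than llmq's real implementation; the queue name and job fields are invented:

```python
# Stream a Hub dataset and publish one job message per sample (sketch only).
import json
import pika
from datasets import load_dataset

ds = load_dataset("epfl-llm/guidelines", split="train", streaming=True)

conn = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
ch = conn.channel()
ch.queue_declare(queue="translate-jobs", durable=True)  # invented queue name

for i, sample in enumerate(ds):
    job = {"id": i, "sample": sample}  # dataset fields passed through as-is
    ch.basic_publish(exchange="", routing_key="translate-jobs",
                     body=json.dumps(job).encode())
conn.close()
```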
Just merged pipeline support into LLMQ, my distributed LLM inference scheduler. You can now define multi-stage workflows where each stage can use different models/workers. Results go through queues with independent scaling per stage.
1
0
2
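Not llmq's internals, just the queue-chaining idea sketched with pika: each stage's worker consumes from its own queue and publishes results to the next stage's queue, so every stage can scale its worker count independently. Queue names and the process callback are invented:

```python
# One pipeline stage: consume from this stage's queue, publish to the next.
import json
import pika

def run_stage(in_queue: str, out_queue: str, process) -> None:
    conn = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
    ch = conn.channel()
    ch.queue_declare(queue=in_queue, durable=True)
    ch.queue_declare(queue=out_queue, durable=True)
    ch.basic_qos(prefetch_count=1)

    def on_message(channel, method, properties, body):
        result = process(json.loads(body))        # e.g. translate with model A
        channel.basic_publish(exchange="", routing_key=out_queue,
                              body=json.dumps(result).encode())
        channel.basic_ack(delivery_tag=method.delivery_tag)

    ch.basic_consume(queue=in_queue, on_message_callback=on_message)
    ch.start_consuming()

# Each stage can run a different model and any number of worker processes:
# run_stage("translate-jobs", "format-jobs", translate_fn)
# run_stage("format-jobs", "results", to_markdown_fn)
```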
Finally got around to updating my AI conference deadline tracker. I added some new deadlines and a globe view.
2
5
19
First high-performance inference for hierarchical byte models. @LukasBluebaum and I developed batched inference for tokenizer-free HAT (Hierarchical Autoregressive Transformers) models from @Aleph__Alpha Research. In some settings, we outcompete the baseline Llama. 🧵
2
7
28
Serving an LLM efficiently (=profitably) is highly non-trivial and involves a lot of different design choices. Mixture of experts, as used by DeepSeek, complicates this a lot. I really learned to appreciate this from @tugot17 while I was at @Aleph__Alpha, so check out this deep dive.
What are the profit margins of serving DeepSeek 🐳? @schreiberic and I discuss large-scale MoE inference in depth. Blog post link below
0
1
9
@thomas_wint Thanks to EuroEval for the evaluations. Their dataset is here:
0
0
2
5 year old BERT-style models are still winning for Dutch. I was looking at the EuroEval benchmarks and, to my surprise, the models we trained in 2019 (RobBERT w/ @thomas_wint) are still SOTA. It takes 70x larger generative models (24B+) to match our 355M parameter encoder model.
2
1
13
I also released some synthetic datasets I made with LLMQ by translating FineWeb to Dutch and German, with a permissive license (ODC-by).
🇩🇪 500k rows translated with @Unbabel's Tower+ 72B: https://t.co/rbpIv3aDME
🇳🇱 1.5M rows translated with Tower+ 9B
huggingface.co
0
0
7