Pablo Iyu Guerrero Profile

Pablo Iyu Guerrero (@pabloiyu)
Followers: 32 · Following: 158 · Media: 4 · Statuses: 11

AI Inference Engineer @ AlephAlpha

Joined September 2022
Pablo Iyu Guerrero (@pabloiyu) · 1 day
Amazing read, check it out
Quoting Piotr Mazurek @ NeurIPS 🇺🇸 (@tugot17) · 1 day
RL is cool, but what do you actually need to know about hardware and infra to predict its future? Check out our new piece on tensoreconomics:
0 replies · 0 reposts · 1 like
Pablo Iyu Guerrero (@pabloiyu) · 2 months
Enjoyed the read, check it out!
Quoting Piotr Mazurek @ NeurIPS 🇺🇸 (@tugot17) · 2 months
Ever wondered why embedding models are offered so cheaply? It's because you can process literally billions of tokens a day even on a consumer-grade GPU like the 4090. Check out our new post investigating the economics of embedding model inference. Link in the next tweet
0 replies · 0 reposts · 1 like
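A quick back-of-envelope makes the "billions of tokens a day" claim concrete. The throughput figure below is an illustrative assumption, not a number from the post:

```python
# Hypothetical figure: assume a 4090 embeds ~20k tokens/s with a small
# embedding model (an assumption for illustration, not a measurement).
tokens_per_second = 20_000
seconds_per_day = 24 * 60 * 60                     # 86,400
tokens_per_day = tokens_per_second * seconds_per_day
print(f"{tokens_per_day / 1e9:.2f}B tokens/day")   # ~1.73B
```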
Pablo Iyu Guerrero (@pabloiyu) · 2 months
Very cool read!
Quoting SzymonOzog (NeurIPS) (@SzymonOzog_) · 2 months
“What I cannot create I do not understand”: this is why I started Penny, my own version of NCCL. Today I'm releasing the first part of a worklog of creating it. It explains GPU communication and shows progress on coding a fast AllReduce (inter- and intra-node) algorithm using NVSHMEM 🧵
0 replies · 0 reposts · 0 likes
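For readers new to the collective: AllReduce combines one buffer per rank and leaves the combined result on every rank. A toy sketch of the semantics only; the hypothetical function below says nothing about how a ring or NVSHMEM implementation actually overlaps communication:

```python
import numpy as np

def allreduce_sum(per_rank_buffers):
    """Semantics of AllReduce(sum): every rank ends up holding the
    elementwise sum of all ranks' buffers."""
    total = np.sum(per_rank_buffers, axis=0)
    return [total.copy() for _ in per_rank_buffers]

# Three "ranks", two elements each:
buffers = [np.array([1.0, 2.0]), np.array([10.0, 20.0]), np.array([100.0, 200.0])]
print(allreduce_sum(buffers))  # every rank: [111., 222.]
```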
Pablo Iyu Guerrero (@pabloiyu) · 3 months
Want to learn more?
- On September 9th, we will be presenting at @CologneAIML: https://t.co/SMkh8wF05r
- The code for batched inference can be found here: https://t.co/p5ypE7znoM
- For a deeper dive, an accompanying research paper will be published soon
Link card (github.com): A high-throughput and memory-efficient inference and serving engine for LLMs - Aleph-Alpha/vllm
0 replies · 0 reposts · 2 likes
Pablo Iyu Guerrero (@pabloiyu) · 3 months
HNet (@_albertgu), BLT (@ArtidoroPagnoni), and other byte-hierarchical models face similar batched-inference hurdles, an area we're thrilled to contribute to.
1 reply · 0 reposts · 2 likes
Pablo Iyu Guerrero (@pabloiyu) · 3 months
Integrating HAT into @vllm_project revealed several fundamental architectural challenges unique to hierarchical models, requiring careful adaptation of vLLM's components.
1 reply · 0 reposts · 1 like
Pablo Iyu Guerrero (@pabloiyu) · 3 months
This variability means that sequences within the same batch may require different computational patterns at any given step, some continuing byte-level generation while others are ready for word-level processing.
1 reply · 0 reposts · 1 like
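A minimal sketch of the scheduling consequence (hypothetical names, not the actual vLLM integration): at each step the batch must be partitioned by phase and routed through different modules.

```python
from enum import Enum, auto

class Phase(Enum):
    BYTE = auto()   # still emitting bytes inside the current word
    WORD = auto()   # just hit a word boundary; needs a backbone step

def partition_by_phase(phases):
    """Split batch indices into the two compute paths for this step."""
    byte_idx = [i for i, p in enumerate(phases) if p is Phase.BYTE]
    word_idx = [i for i, p in enumerate(phases) if p is Phase.WORD]
    return byte_idx, word_idx

# Four sequences mid-generation: 0 and 2 are mid-word, 1 and 3 at a boundary.
phases = [Phase.BYTE, Phase.WORD, Phase.BYTE, Phase.WORD]
print(partition_by_phase(phases))  # ([0, 2], [1, 3])
```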
Pablo Iyu Guerrero (@pabloiyu) · 3 months
Unlike traditional autoregressive models that generate one token per step, HAT generates a variable number of bytes before reaching word boundaries, creating synchronisation complexities for batches.
1 reply · 0 reposts · 1 like
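A toy simulation (not the HAT code) of why this complicates batching: if the batch advances in lockstep, every sequence waits for the one with the longest current word, idling the early finishers.

```python
def lockstep_steps(bytes_per_word):
    """Byte steps a lockstep batch spends so every sequence finishes its
    current word; early finishers idle (wasted compute)."""
    steps = max(bytes_per_word)
    wasted = sum(steps - b for b in bytes_per_word)
    return steps, wasted

# Sequences whose current words need 2, 5, and 3 byte steps:
steps, wasted = lockstep_steps([2, 5, 3])
print(steps, wasted)  # 5 steps, 5 idle sequence-steps
```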
Pablo Iyu Guerrero (@pabloiyu) · 3 months
HAT operates on two abstraction layers: a standard Llama-style word-level transformer as the backbone, plus two lightweight byte-level modules, an encoder and a decoder.
1 reply · 0 reposts · 2 likes
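A minimal, hypothetical skeleton of that two-level structure (a toy, not Aleph Alpha's implementation): a byte encoder pools each word's bytes into one embedding, a word-level backbone contextualizes the word sequence, and a byte decoder stand-in maps back to byte logits.

```python
import torch
import torch.nn as nn

class ToyHAT(nn.Module):
    def __init__(self, d=64, n_bytes=256):
        super().__init__()
        self.byte_emb = nn.Embedding(n_bytes, d)
        enc = nn.TransformerEncoderLayer(d, nhead=4, batch_first=True)
        self.byte_encoder = nn.TransformerEncoder(enc, num_layers=1)   # lightweight
        bb = nn.TransformerEncoderLayer(d, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(bb, num_layers=2)        # word-level backbone
        self.byte_decoder = nn.Linear(d, n_bytes)  # stand-in for the byte decoder

    def forward(self, words):
        # words: one tensor of byte ids per word; lengths may differ
        pooled = [self.byte_encoder(self.byte_emb(w)[None]).mean(dim=1)
                  for w in words]                   # mean-pool bytes -> word embedding
        h = self.backbone(torch.cat(pooled)[None])  # (1, n_words, d)
        return self.byte_decoder(h)                 # byte logits per word position

model = ToyHAT()
words = [torch.tensor([72, 105]), torch.tensor([32, 116, 104])]  # bytes of "Hi", " th"
print(model(words).shape)  # torch.Size([1, 2, 256])
```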
Pablo Iyu Guerrero (@pabloiyu) · 3 months
First high-performance inference for hierarchical byte models. @LukasBluebaum and I built batched inference for the tokenizer-free HAT (Hierarchical Autoregressive Transformers) models developed by @Aleph__Alpha Research. In some settings, we outcompete the baseline Llama. 🧵
2 replies · 7 reposts · 28 likes
Pablo Iyu Guerrero (@pabloiyu) · 3 months
Highly recommend this deep-dive by @tugot17 and @schreiberic
Quoting Piotr Mazurek @ NeurIPS 🇺🇸 (@tugot17) · 3 months
What are the profit margins of serving DeepSeek 🐳? @schreiberic and I discuss large-scale MoE inference in depth. Blog post link below
0 replies · 0 reposts · 2 likes