Pablo Iyu Guerrero Profile

Pablo Iyu Guerrero (@pabloiyu)
Followers: 32 · Following: 158 · Media: 4 · Statuses: 11

AI Inference Engineer @ AlephAlpha

Joined September 2022
Pablo Iyu Guerrero (@pabloiyu) · 1 day
Amazing read, check it out
Quoting Piotr Mazurek @ NeurIPS 🇺🇸 (@tugot17) · 1 day
RL is cool, but what do you actually need to know about hardware and infra to predict its future? Check out our new piece on tensoreconomics:
0 replies · 0 reposts · 1 like
Pablo Iyu Guerrero (@pabloiyu) · 2 months
Enjoyed the read, check it out!
Quoting Piotr Mazurek @ NeurIPS 🇺🇸 (@tugot17) · 2 months
Ever wondered why embedding models are offered so cheaply? It's because you can process literally billions of tokens a day even on a consumer-grade GPU like the 4090. Check out our new post investigating the economics of embedding model inference. Link in the next tweet
0 replies · 0 reposts · 1 like
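A quick back-of-envelope makes the "billions of tokens a day" claim concrete. The throughput figure below is an illustrative assumption, not a number from the post:

```python
# Hypothetical figure: assume a 4090 embeds ~20k tokens/s with a small
# embedding model (an assumption for illustration, not a measurement).
tokens_per_second = 20_000
seconds_per_day = 24 * 60 * 60                     # 86,400
tokens_per_day = tokens_per_second * seconds_per_day
print(f"{tokens_per_day / 1e9:.2f}B tokens/day")   # ~1.73B
```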
Pablo Iyu Guerrero (@pabloiyu) · 2 months
Very cool read!
Quoting SzymonOzog (NeurIPS) (@SzymonOzog_) · 2 months
“What I cannot create I do not understand”: this is why I started Penny, my own version of NCCL. Today I'm releasing the first part of a worklog of creating it. It explains GPU communication and shows progress on coding a fast AllReduce (inter- and intra-node) algorithm using NVSHMEM 🧵
0 replies · 0 reposts · 0 likes
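For readers new to the collective: AllReduce combines one buffer per rank and leaves the combined result on every rank. A toy sketch of the semantics only; the hypothetical function below says nothing about how a ring or NVSHMEM implementation actually overlaps communication:

```python
import numpy as np

def allreduce_sum(per_rank_buffers):
    """Semantics of AllReduce(sum): every rank ends up holding the
    elementwise sum of all ranks' buffers."""
    total = np.sum(per_rank_buffers, axis=0)
    return [total.copy() for _ in per_rank_buffers]

# Three "ranks", two elements each:
buffers = [np.array([1.0, 2.0]), np.array([10.0, 20.0]), np.array([100.0, 200.0])]
print(allreduce_sum(buffers))  # every rank: [111., 222.]
```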
Pablo Iyu Guerrero (@pabloiyu) · 3 months
Want to learn more?
- On September 9th, we will be presenting at @CologneAIML: https://t.co/SMkh8wF05r
- The code for batched inference can be found here: https://t.co/p5ypE7znoM
- For a deeper dive, an accompanying research paper will be published soon
Link card (github.com): A high-throughput and memory-efficient inference and serving engine for LLMs - Aleph-Alpha/vllm
0 replies · 0 reposts · 2 likes
Pablo Iyu Guerrero (@pabloiyu) · 3 months
HNet (@_albertgu), BLT (@ArtidoroPagnoni), and other byte-hierarchical models face similar batched-inference hurdles, an area we're thrilled to contribute to.
1 reply · 0 reposts · 2 likes
Pablo Iyu Guerrero (@pabloiyu) · 3 months
Integrating HAT into @vllm_project revealed several fundamental architectural challenges unique to hierarchical models, requiring careful adaptation of vLLM's components.
1 reply · 0 reposts · 1 like
Pablo Iyu Guerrero (@pabloiyu) · 3 months
This variability means that sequences within the same batch may require different computational patterns at any given step, some continuing byte-level generation while others are ready for word-level processing.
1 reply · 0 reposts · 1 like
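A minimal sketch of the scheduling consequence (hypothetical names, not the actual vLLM integration): at each step the batch must be partitioned by phase and routed through different modules.

```python
from enum import Enum, auto

class Phase(Enum):
    BYTE = auto()   # still emitting bytes inside the current word
    WORD = auto()   # just hit a word boundary; needs a backbone step

def partition_by_phase(phases):
    """Split batch indices into the two compute paths for this step."""
    byte_idx = [i for i, p in enumerate(phases) if p is Phase.BYTE]
    word_idx = [i for i, p in enumerate(phases) if p is Phase.WORD]
    return byte_idx, word_idx

# Four sequences mid-generation: 0 and 2 are mid-word, 1 and 3 at a boundary.
phases = [Phase.BYTE, Phase.WORD, Phase.BYTE, Phase.WORD]
print(partition_by_phase(phases))  # ([0, 2], [1, 3])
```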
Pablo Iyu Guerrero (@pabloiyu) · 3 months
Unlike traditional autoregressive models that generate one token per step, HAT generates a variable number of bytes before reaching word boundaries, creating synchronisation complexities for batches.
1 reply · 0 reposts · 1 like
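A toy simulation (not the HAT code) of why this complicates batching: if the batch advances in lockstep, every sequence waits for the one with the longest current word, idling the early finishers.

```python
def lockstep_steps(bytes_per_word):
    """Byte steps a lockstep batch spends so every sequence finishes its
    current word; early finishers idle (wasted compute)."""
    steps = max(bytes_per_word)
    wasted = sum(steps - b for b in bytes_per_word)
    return steps, wasted

# Sequences whose current words need 2, 5, and 3 byte steps:
steps, wasted = lockstep_steps([2, 5, 3])
print(steps, wasted)  # 5 steps, 5 idle sequence-steps
```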
Pablo Iyu Guerrero (@pabloiyu) · 3 months
HAT operates on two abstraction layers: a standard Llama-style word-level transformer as the backbone, plus two lightweight byte-level modules, an encoder and a decoder.
1 reply · 0 reposts · 2 likes
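A minimal, hypothetical skeleton of that two-level structure (a toy, not Aleph Alpha's implementation): a byte encoder pools each word's bytes into one embedding, a word-level backbone contextualizes the word sequence, and a byte decoder stand-in maps back to byte logits.

```python
import torch
import torch.nn as nn

class ToyHAT(nn.Module):
    def __init__(self, d=64, n_bytes=256):
        super().__init__()
        self.byte_emb = nn.Embedding(n_bytes, d)
        enc = nn.TransformerEncoderLayer(d, nhead=4, batch_first=True)
        self.byte_encoder = nn.TransformerEncoder(enc, num_layers=1)   # lightweight
        bb = nn.TransformerEncoderLayer(d, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(bb, num_layers=2)        # word-level backbone
        self.byte_decoder = nn.Linear(d, n_bytes)  # stand-in for the byte decoder

    def forward(self, words):
        # words: one tensor of byte ids per word; lengths may differ
        pooled = [self.byte_encoder(self.byte_emb(w)[None]).mean(dim=1)
                  for w in words]                   # mean-pool bytes -> word embedding
        h = self.backbone(torch.cat(pooled)[None])  # (1, n_words, d)
        return self.byte_decoder(h)                 # byte logits per word position

model = ToyHAT()
words = [torch.tensor([72, 105]), torch.tensor([32, 116, 104])]  # bytes of "Hi", " th"
print(model(words).shape)  # torch.Size([1, 2, 256])
```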
Pablo Iyu Guerrero (@pabloiyu) · 3 months
First high-performance inference for hierarchical byte models. @LukasBluebaum and I built batched inference for the tokenizer-free HAT (Hierarchical Autoregressive Transformers) models developed by @Aleph__Alpha Research. In some settings, we outcompete the baseline Llama. 🧵
2 replies · 7 reposts · 28 likes
Pablo Iyu Guerrero (@pabloiyu) · 3 months
Highly recommend this deep-dive by @tugot17 and @schreiberic
Quoting Piotr Mazurek @ NeurIPS 🇺🇸 (@tugot17) · 3 months
What are the profit margins of serving DeepSeek 🐳? @schreiberic and I discuss large-scale MoE inference in depth. Blog post link below
0 replies · 0 reposts · 2 likes