Pablo Iyu Guerrero
@pabloiyu
Followers: 32 · Following: 158 · Media: 4 · Statuses: 11
AI Inference Engineer @ AlephAlpha
Joined September 2022
Enjoyed the read, check it out!
Ever wondered why embedding models are offered so cheaply? It's because you can process literally billions of tokens a day even on a consumer-grade GPU like the 4090. Check out our new post investigating the economics of embedding model inference. Link in the next tweet.
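A back-of-envelope check of that "billions of tokens a day" claim, using an assumed sustained throughput rather than a measured one; the real figure depends on model size and batching:

```python
# Rough arithmetic behind the claim. The throughput below is an assumption
# for illustration, not a benchmark result from the post.
tokens_per_second = 20_000           # assumed sustained embedding throughput on a 4090
seconds_per_day = 24 * 60 * 60       # 86,400 seconds
tokens_per_day = tokens_per_second * seconds_per_day
print(f"{tokens_per_day:,}")         # 1,728,000,000 -> roughly 1.7B tokens/day
```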
Want to learn more?
- On September 9th, we will be presenting at @CologneAIML: https://t.co/SMkh8wF05r
- The code for batched inference can be found here: https://t.co/p5ypE7znoM
- For a further deep-dive, an accompanying research paper will be published soon.
github.com · Aleph-Alpha/vllm: A high-throughput and memory-efficient inference and serving engine for LLMs
HNet (@_albertgu), BLT (@ArtidoroPagnoni), and other byte-hierarchical models face similar batched-inference hurdles, an area we're thrilled to contribute to.
The integration into @vllm_project revealed several fundamental architectural challenges unique to hierarchical models that required careful adaptation of vLLM's components.
This variability means that sequences within the same batch may require different computational patterns at any given step, some continuing byte-level generation while others are ready for word-level processing.
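A minimal sketch of the bookkeeping this implies; `SeqState`, `run_byte_decoder`, and `run_word_backbone` are invented stand-ins for illustration, not vLLM or HAT APIs:

```python
from dataclasses import dataclass, field

@dataclass
class SeqState:
    """Per-sequence phase tracking within one batch (hypothetical)."""
    seq_id: int
    pending_bytes: list = field(default_factory=list)
    at_word_boundary: bool = False

def run_byte_decoder(seqs):
    # Stub: would emit one byte per sequence and test for a word boundary.
    for s in seqs:
        s.pending_bytes.append(0)
        s.at_word_boundary = len(s.pending_bytes) >= 3  # fake boundary test

def run_word_backbone(seqs):
    # Stub: would consume the completed word and reset the byte buffer.
    for s in seqs:
        s.pending_bytes.clear()
        s.at_word_boundary = False

def engine_step(batch):
    # Route each sequence to the computation its phase requires: some are
    # still mid-word (byte-level generation), others are ready for a
    # word-level backbone pass, all within the same scheduler step.
    byte_phase = [s for s in batch if not s.at_word_boundary]
    word_phase = [s for s in batch if s.at_word_boundary]
    if byte_phase:
        run_byte_decoder(byte_phase)
    if word_phase:
        run_word_backbone(word_phase)
```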
Unlike traditional autoregressive models that generate one token per step, HAT generates a variable number of bytes before reaching a word boundary, creating synchronisation challenges for batched execution.
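A toy version of that variable-length inner loop; `decode_next_byte` is a hypothetical stand-in for HAT's byte decoder:

```python
SPACE = ord(" ")

def generate_word_bytes(decode_next_byte, word_state, max_bytes=32):
    """Emit bytes until a boundary byte appears; the count varies from call
    to call, which is what breaks the one-token-per-step batching assumption."""
    out = []
    for _ in range(max_bytes):
        b = decode_next_byte(word_state, out)
        out.append(b)
        if b == SPACE:  # boundary reached: hand control back to the word backbone
            break
    return out          # anywhere from 1 to max_bytes bytes
```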
HAT operates on two abstraction levels: a standard Llama-style word-level transformer as the backbone, plus two lightweight byte-level modules, an encoder and a decoder.
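A compressed PyTorch sketch of that two-level structure, with invented module sizes and causal masking omitted; the actual HAT implementation differs in detail:

```python
import torch
import torch.nn as nn

class HATSketch(nn.Module):
    def __init__(self, n_bytes=256, byte_dim=256, word_dim=2048):
        super().__init__()
        self.byte_emb = nn.Embedding(n_bytes, byte_dim)
        # Lightweight byte-level encoder: compresses a word's bytes into one vector.
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(byte_dim, nhead=4, batch_first=True),
            num_layers=2,
        )
        self.to_word = nn.Linear(byte_dim, word_dim)
        # Word-level backbone: stands in for the Llama-style transformer.
        self.backbone = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(word_dim, nhead=16, batch_first=True),
            num_layers=8,
        )
        # Lightweight byte-level decoder head: next-byte prediction from word state.
        self.to_byte = nn.Linear(word_dim, byte_dim)
        self.byte_logits = nn.Linear(byte_dim, n_bytes)

    def forward(self, byte_ids):
        # byte_ids: (batch, words, bytes_per_word), pre-split at word boundaries.
        b, w, n = byte_ids.shape
        x = self.byte_emb(byte_ids).view(b * w, n, -1)
        word_vecs = self.encoder(x).mean(dim=1)            # pool bytes -> word vector
        h = self.backbone(self.to_word(word_vecs).view(b, w, -1))
        return self.byte_logits(self.to_byte(h))           # next-byte logits per word
```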
The first high-performance inference for hierarchical byte models. @LukasBluebaum and I developed batched inference for tokenizer-free HAT (Hierarchical Autoregressive Transformer) models created by @Aleph__Alpha Research. In some settings, we outcompete the baseline Llama. 🧵
Highly recommend this deep-dive by @tugot17 and @schreiberic
What are the profit margins of serving DeepSeek 🐳? @schreiberic and I discuss large-scale MoE inference in depth. Blog post link below