Eric

@Ex0byt

Followers
11K
Following
812
Media
145
Statuses
789

Strategic Technologist. Doing my little part to democratize foundational, truly open AI for all. Knowledge is a passion. Lead by example. (Opinions are my own.)

New York, NY
Joined March 2009
@Ex0byt
Eric
4 days
As predicted… there it is. Good stuff, @Prince_Canuma
@Prince_Canuma
Prince Canuma
4 days
Here is a draft PR; there's still a lot to improve and change, and I will get to it later today. If you have a better idea or solution, benchmark it and send us an issue and PR. Enjoy! https://t.co/ctXtw0Syc7
1
0
9
@Ex0byt
Eric
5 days
It's the small acts of kindness that count. Thank you, Hugging Face team 🤗. Received some credits out of the blue to help offset out-of-pocket job-running, research-hosting, and open-model-storage costs. Not a sponsorship, but genuinely appreciated all the same. What an amazing
1
0
14
@Ex0byt
Eric
6 days
Was heads-down this weekend solving for much of the currently noted "disappointments" with the "Flash-MoE" hype, but on an NVIDIA GB10: zero-copy GPU reads direct from the mmap'd NVMe page cache (eliminating CPU trips, 1.94×), and trained/tested a pre-attention expert-prediction model
3
2
41
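The host-side half of the zero-copy idea can be sketched with the standard library. This is a minimal illustration with a stand-in checkpoint file, not the GB10 implementation: the mapping is served from the OS page cache, and a `memoryview` slices it without copying bytes. The actual GPU-direct path (e.g. registering the buffer for DMA) is out of scope here.

```python
import mmap
import os
import tempfile

def map_weights(path):
    # Map the checkpoint read-only: pages are served straight from the OS
    # page cache, with no read() copy into a Python-managed buffer.
    f = open(path, "rb")
    return mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)

def expert_slice(mm, index, nbytes):
    # Zero-copy view of one expert's contiguous block inside the mapping.
    return memoryview(mm)[index * nbytes : (index + 1) * nbytes]

# Demo with a stand-in two-expert "checkpoint" (4 KiB per expert).
fd, path = tempfile.mkstemp()
os.write(fd, b"\x01" * 4096 + b"\x02" * 4096)
os.close(fd)

mm = map_weights(path)
view = expert_slice(mm, 1, 4096)  # no bytes are copied here
print(view[0], len(view))  # 2 4096
```

A GPU framework could pin and register such a view for device reads instead of staging it through an intermediate host buffer.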
@Ex0byt
Eric
7 days
Inspired by all the community support, here's a thank-you and sweetener: the next 100 confirmed donations get access to https://t.co/GCgYkpSqqN ($250.00 value, and you support a good cause). It's our most powerful handcrafted PRISM model to date, with over-refusals, bias, and
huggingface.co
We're on a journey to advance and democratize artificial intelligence through open source and open science.
@0xSero
0xSero
7 days
In 72 hours I got over $100k of value:
1. Lambda gave me $5,000 in compute credits
2. NVIDIA offered me 8x H100s on the cloud ($20/h); idk for how long, but assuming 2 weeks that'd be ~$5,000
3. TNG Technology offered me 2 weeks of B200s, which is something like $12,000 in compute
7
3
45
@Ex0byt
Eric
7 days
Quick Update: currently testing on scitrera/dgx-spark-pytorch-runtime (DGX aarch64 compatibility headaches). Experimenting with an io_uring expert loader for Kimi-K2.5 to accompany the MoE selector. Python's GIL serializes threaded pread, so io_uring should bypass this entirely:
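There is no stdlib io_uring binding in Python, so as a point of comparison here is the threaded `os.pread` pattern the loader is replacing, with a hypothetical shard layout. Note that `os.pread` does release the GIL during the syscall, but each read is still its own kernel round trip; an io_uring loader would batch all of them into a single submission queue.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

EXPERT_BYTES = 4096  # hypothetical fixed size of one expert's shard

def read_expert(fd, expert_id):
    # One positional read per expert: a separate syscall per shard,
    # which io_uring would instead queue in one batched submission.
    return os.pread(fd, EXPERT_BYTES, expert_id * EXPERT_BYTES)

# Stand-in checkpoint holding 4 experts.
fd, path = tempfile.mkstemp()
os.write(fd, b"".join(bytes([i]) * EXPERT_BYTES for i in range(4)))
os.close(fd)

fd = os.open(path, os.O_RDONLY)
with ThreadPoolExecutor(max_workers=4) as pool:
    blocks = list(pool.map(lambda e: read_expert(fd, e), [3, 0, 2]))
os.close(fd)
print([b[0] for b in blocks])  # [3, 0, 2]
```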
@Ex0byt
Eric
12 days
Exciting Experiment Update: we ran StepFun_ai's Step-3.5-Flash (197B MoE) on 6.29 GB of GPU memory! Flat. Zero growth. Same footprint at token 1 as at token 100. The model's weights are ~105 GB in INT4 (394 GB original bf16!), and we're running it on 6.29 GB!! That's 1/16th the weight
4
3
36
@Ex0byt
Eric
7 days
Progress thrives in the open. You had us all worried for a bit. Thank you, MiniMax_AI!
@SkylerMiao7
Skyler Miao
7 days
M2.7 open weights coming in ~2 weeks. Still actively iterating; just updated to a new version yesterday. Noticeably better on OpenClaw.
0
0
33
@Ex0byt
Eric
7 days
@elonmusk, @nvidia, @MichaelDell: one unit. We'll make it count for everyone. In just a few weeks with scrappy hardware, we've shown 1T+ parameter intelligence can run on everyday consumer devices. Imagine what a single Dell Pro Max GB300 could unlock for the open-source
@0xSero
0xSero
10 days
@Ex0byt @sudoingX will be among those with open access. This will be pooled with my 3090s. I am e-begging, but I will make it up to you.
3
7
77
@Ex0byt
Eric
8 days
Qwen3.5 27B is awesome (the entire family above 9B is impressive). You can now try it directly in your browser at SOTA speeds with whatever GPU you have: https://t.co/avWxUd8vNL My previous research in practice: `Intel/Qwen3.5-27B-int4-AutoRound` is particularly good.
huggingface.co
This web app lets you type messages (and optionally add images) and have an AI respond in real time. First pick a model from the list, then enter your prompt and the assistant will generate a reply...
@0xSero
0xSero
8 days
A 27B model is #2 on pinch-bench. You'd need $150,000 in GPU-hours to train this from scratch (base + post-training), basically 1-2 weeks over 256 H100s. That is not unreasonable: you'd need 540B tokens for pre-training and a bit more for post-training. None of this is crazy
33
121
2K
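The estimate above is easy to sanity-check. The rental rate below is my assumption, not a figure from the thread, but at roughly $1.75/GPU-hour the numbers land where the tweet says:

```python
# Back-of-envelope check on the training-cost estimate.
gpus = 256
hours = 24 * 14                 # upper end of "1-2 weeks"
price_per_gpu_hour = 1.75       # hypothetical H100 rental rate, $/GPU-hour

gpu_hours = gpus * hours        # 86,016 GPU-hours
cost = gpu_hours * price_per_gpu_hour
print(gpu_hours, round(cost))   # 86016 150528, close to the $150k figure
```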
@Ex0byt
Eric
8 days
priori:
0
1
9
@ZixuanLi_
Zixuan Li
9 days
Don't panic. GLM-5.1 will be open source.
274
427
8K
@Ex0byt
Eric
10 days
My handcrafted local AI homeboy J.A.R.V.I.S. is gassing us up… go off, little king!
2
4
43
@Ex0byt
Eric
10 days
Get Excited: @0xSero and I are close. A B300 is currently training a tiny (15M-param) side-loaded neural network that helps select, load, and cache the correct MoE experts for Kimi-K2.5 (a 1T-param MoE model running on 25 GB of memory). Once experiments are done, will share
@0xSero
0xSero
10 days
@pierrelezan Yes, @Ex0byt is working on this.
9
22
240
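The 15M-parameter predictor itself isn't published, but the surrounding select/load/cache loop can be sketched in pure Python. Everything here is a stand-in, not the actual implementation: the centroid scoring replaces the learned predictor, and `ExpertCache` with its `loader` is a hypothetical name for the fixed-budget cache the tweet describes.

```python
from collections import OrderedDict

EXPERT_CENTROIDS = {0: (1.0, 0.0), 1: (0.0, 1.0), 2: (0.7, 0.7)}

def predict_experts(hidden, k=2):
    # Stand-in for the tiny pre-attention predictor: score each expert
    # against the hidden state and keep the top-k. The real model learns
    # these scores; here they're dot products against fixed vectors.
    scores = {e: sum(h * w for h, w in zip(hidden, centroid))
              for e, centroid in EXPERT_CENTROIDS.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]

class ExpertCache:
    """LRU cache holding at most `budget` experts in (GPU) memory."""
    def __init__(self, budget, loader):
        self.budget, self.loader = budget, loader
        self.cache = OrderedDict()

    def get(self, expert_id):
        if expert_id in self.cache:
            self.cache.move_to_end(expert_id)    # refresh recency
        else:
            if len(self.cache) >= self.budget:
                self.cache.popitem(last=False)   # evict least recent
            self.cache[expert_id] = self.loader(expert_id)
        return self.cache[expert_id]

cache = ExpertCache(budget=2, loader=lambda e: f"weights[{e}]")
chosen = predict_experts((0.9, 0.1))   # leans toward experts 0 and 2
for e in chosen:
    cache.get(e)
print(chosen, list(cache.cache))  # [0, 2] [0, 2]
```

Predicting experts before attention gives the loader a head start: the shards can be fetched while the attention block is still computing.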
@Ex0byt
Eric
11 days
Kimi-K2.5 (1T-parameter MoE) running coherently on 25GB of GPU memory (on a unified 128 GB machine)!
36
23
563
@Ex0byt
Eric
12 days
Okay, I've had it sitting around since the 13th. I think it's time to get this M5 Max 18-core CPU/40-core GPU, 128GB RAM, 4TB SSD baby monster out of the box and see what it can do?
13
1
109
@Ex0byt
Eric
12 days
Exciting Experiment Update: we ran StepFun_ai's Step-3.5-Flash (197B MoE) on 6.29 GB of GPU memory! Flat. Zero growth. Same footprint at token 1 as at token 100. The model's weights are ~105 GB in INT4 (394 GB original bf16!), and we're running it on 6.29 GB!! That's 1/16th the weight
37
37
517
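A flat footprint implies a fixed resident budget of expert weights rather than paging in the whole checkpoint; the ratio works out as claimed using only the numbers from the post:

```python
# Ratio of checkpoint size to resident GPU memory, using the post's numbers.
total_int4_gb = 105.0   # ~105 GB of INT4 weights
resident_gb = 6.29      # flat GPU footprint observed while decoding

ratio = total_int4_gb / resident_gb
print(round(ratio, 1))  # 16.7: roughly 1/16th of the weights resident at once
```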
@Ex0byt
Eric
13 days
Take my money @MichaelDell
10
1
48
@Ex0byt
Eric
13 days
Y'all know how maniacal I am about speed, efficiency, and OSS. Check this puppy out: ~900 tok/s. Will give it a try and share some thoughts.
@ant_oss
Ant Open Source
13 days
⚡️ 892 tokens/s: our 100B diffusion LLM, LLaDA2.1-flash, is now live on @ZenMuxAI! With Token Editing, LLaDA 2.1 goes from research breakthrough to production-ready speed. Diffusion models just got real. Try it via API or Chat 👇 https://t.co/8ObarWTPio #LLaDA #ZenMux #AI #dLLM
1
2
7
@Ex0byt
Eric
13 days
Everyone is working towards more efficient MoEs. An elegant and practical attention architecture/implementation from Kimi.
@Kimi_Moonshot
Kimi.ai
13 days
Introducing π‘¨π’•π’•π’†π’π’•π’Šπ’π’ π‘Ήπ’†π’”π’Šπ’…π’–π’‚π’π’”: Rethinking depth-wise aggregation. Residual connections have long relied on fixed, uniform accumulation. Inspired by the duality of time and depth, we introduce Attention Residuals, replacing standard depth-wise recurrence with
0
0
4