Harold Benoit
@harold_matmul
Followers
534
Following
2K
Media
26
Statuses
252
Another day of being a researcher in theory but an engineer in practice | tech staff @LiquidAI_
Joined April 2024
The Monarch release from PyTorch is super neat, big fan of clean mesh abstractions. They're using it in their new RL framework, TorchForge.
0
2
8
Model conversion (like operator grafting or swapping the modeling objective here) is a key tool for architecture research nowadays, given the prohibitive cost of pretraining. Thrilled to see a lab focusing on fast feedback loops partially based on model conversions, excited to
Introducing RND1, the most powerful base diffusion language model (DLM) to date. RND1 (Radical Numerics Diffusion) is an experimental DLM with 30B params (3B active) and a sparse MoE architecture. We are making it open source, releasing weights, training details, and code to
0
0
8
It's a good model sir. Very proud of the team, we worked very hard to be on the Pareto frontier of quality and efficiency. Even had the chance to write a CPU-optimized kernel for MoE to squeeze everything from the hardware, and that gave us those sweet throughput results.
Meet LFM2-8B-A1B, our first on-device Mixture-of-Experts (MoE)! > LFM2-8B-A1B is the best on-device MoE in terms of both quality and speed. > Performance of a 3B-4B model class, with up to 5x faster inference profile on CPUs and GPUs. > Quantized variants fit comfortably on
0
5
45
Found a short and clean walkthrough for how MMA on Tensor Cores works at the PTX level
1
0
4
Note that the argument is very similar to the findings by @ZeyuanAllenZhu that motivate the introduction of Canon layers (short conv1d layers) and show that adding them dramatically improves GLA. (This is in Physics of LLM 4.1)
0
0
4
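For reference, a minimal sketch of what such a Canon-style short conv1d can look like in PyTorch (kernel size and the residual placement are my assumptions, not from the paper): it just mixes each token with its few immediate predecessors.

    import torch
    import torch.nn as nn

    class ShortConv(nn.Module):
        def __init__(self, dim: int, kernel_size: int = 4):
            super().__init__()
            # Depthwise conv: each channel only mixes with its own past values.
            self.conv = nn.Conv1d(dim, dim, kernel_size, groups=dim,
                                  padding=kernel_size - 1)  # left-pad for causality

        def forward(self, x):                 # x: (batch, seq, dim)
            y = self.conv(x.transpose(1, 2))  # (batch, dim, seq + k - 1)
            y = y[..., : x.shape[1]]          # drop right overhang to stay causal
            return x + y.transpose(1, 2)      # residual add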
Went through a lengthy derivation of MLA today to understand exactly how we could absorb the KV decompression matrices, as I didn't find much code on it. The trick is to write the proof on a per-head basis to easily derive the absorption. In the end, it does give an enlightening
1
0
2
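Since I didn't find much code on it, here is the core per-head identity as I derived it, in my own notation (a sketch, not necessarily the exact DeepSeek parameterization). With cached latent c_s = x_s W^{DKV}, per-head query q_t^i = x_t W_Q^i, and up-projected key k_s^i = c_s W_{UK}^i:

    % per-head attention logit, with the key up-projection absorbed
    q_t^i \, (k_s^i)^\top
      = x_t W_Q^i \left( c_s W_{UK}^i \right)^\top
      = x_t \underbrace{W_Q^i (W_{UK}^i)^\top}_{\text{absorb into query proj}} \, c_s^\top

so scores can be computed directly against the cached latents c_s without ever materializing per-head keys; symmetrically, the value up-projection W_{UV}^i folds into the output projection W_O^i.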
Interesting take on the role of Short Conv. The argument uses the view that linear attention variants do a form of online learning (i.e. TTT) over the KV pairs (k_1, v_1), ..., (k_t, v_t), such that the update to the state is S_t = S_{t−1} − η_t ∇_{S_{t−1}} L(f(S_{t−1}; k_t), v_t),
Why does linear attention need Short Conv? https://t.co/luUybG3RXj
1
12
98
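To make the online-learning view concrete: with the squared loss L = ½‖S k − v‖² and read-out f(S; k) = S k, the gradient step above reduces to the delta rule. A toy sketch (shapes and names mine):

    import torch

    def state_update(S, k, v, eta):
        # grad of 0.5 * ||S @ k - v||^2 w.r.t. S is (S @ k - v) k^T
        pred = S @ k                               # f(S; k): read-out for key k
        return S - eta * torch.outer(pred - v, k)  # delta-rule update

    d_k, d_v = 16, 16
    S = torch.zeros(d_v, d_k)
    for k_t, v_t in zip(torch.randn(8, d_k), torch.randn(8, d_v)):
        S = state_update(S, k_t, v_t, eta=0.5)

(With η_t = 1, dropping the prediction term recovers the vanilla linear-attention accumulation S_t = S_{t−1} + v_t k_tᵀ.)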
With Sora, Veo, and Seedance coming out, a critical issue in building video generation models is the lack of automatic and interpretable evaluation metrics or reward models. To power the development of video generation models, we built VideoScore2, the SoTA generative video metric
It's time to bring RL to generative video evaluation! Introducing VideoScore2, a model that not only generates scores for generative videos but also produces detailed, high-quality reasoning traces. To build VideoScore2, we curated prompts from 5 sources, covering both
4
17
124
The Dreamer 4 paper is really nice to read. Really appreciate the ablations on how the combinations of learning objectives (shortcut, x-loss, etc.) & architectural tweaks (e.g. hybrid, register tokens) affect speed and quality.
0
0
3
A note on the diffusion learning objective that I hadn't appreciated before, and how tweaking it ensures rollout stability (this may be specific to the case where we use diffusion forcing?). In the Dreamer 4 paper, they predict the clean representation x1 instead of the velocity v = (x1
0
0
3
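A toy sketch of the two parameterizations under the linear interpolation x_t = (1 − t)·x0 + t·x1 (my notation; the paper's exact setup may differ):

    import torch

    def step_v_pred(model, x_t, t, dt):
        # v-prediction: the network outputs the velocity directly.
        return x_t + dt * model(x_t, t)

    def step_x_pred(model, x_t, t, dt):
        # x-prediction: the network outputs the clean sample x1_hat.
        # The implied velocity (x1_hat - x_t) / (1 - t) always points at a
        # plausible clean target, which is the rollout-stability argument.
        x1_hat = model(x_t, t)
        return x_t + dt * (x1_hat - x_t) / (1.0 - t)

    x_t = torch.randn(4, 8)
    model = lambda x, t: torch.zeros_like(x)  # placeholder net for the sketch
    x_next = step_x_pred(model, x_t, t=0.3, dt=0.1)

At the optimum the two coincide, since (x1 − x_t)/(1 − t) = x1 − x0 under this interpolation; they differ in how prediction errors propagate during rollout.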
Look at the cool model we just released, it's super fast! :) One innovation is that it uses both continuous & discrete audio tokens to enable generation without losing understanding capabilities.
Today, we expand our LFM2 family to audio. LFM2-Audio is an end-to-end audio-text omni foundation model, and delivers responsive, real-time conversation on-device at just 1.5B parameters. One model. Seamless multimodal support. No chains. > Speech-to-speech >
1
0
8
small but mighty
Introducing Liquid Nanos, a new family of extremely tiny task-specific models that deliver GPT-4o-class performance while running directly on phones, laptops, cars, embedded devices, and GPUs with the lowest latency and fastest generation speed. > model size: 350M to 2.6B >
0
3
10
The secret sauce is most definitely in the data, given that the architecture is fairly standard: Qwen3 backbone + NaViT SigLIP2 (i.e. it uses packed vision sequences). They use patch_size=16 and pixel_shuffle_scale_factor=2 in order to use fewer image tokens. A 256x256 image will
1/ Introducing Isaac 0.1 β our first perceptive-language model. 2B params, open weights. Matches or beats models significantly larger on core perception. We are pushing the efficient frontier for physical AI. https://t.co/dJ1Wjh2ARK
2
1
18
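The token arithmetic, for the record: pixel shuffle folds each 2x2 neighborhood of patch tokens into channels, so the token count drops by the scale factor squared.

    # patch_size=16, pixel_shuffle_scale_factor=2, 256x256 input
    patch, shuffle, side = 16, 2, 256
    grid = side // patch             # 16 -> 16x16 = 256 patch tokens
    tokens = (grid // shuffle) ** 2  # 2x2 neighborhoods merge into channels
    print(grid * grid, tokens)       # 256 64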
Switched the config system for the experiments to pydantic-settings, and I've never felt better. I have more energy. My skin is clearer. My eyesight has improved.
1
0
3
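A minimal sketch of the kind of config this enables (field names are illustrative):

    from pydantic_settings import BaseSettings, SettingsConfigDict

    class ExperimentConfig(BaseSettings):
        model_config = SettingsConfigDict(env_prefix="EXP_")
        lr: float = 3e-4
        batch_size: int = 32
        seq_len: int = 2048

    cfg = ExperimentConfig()  # typed + validated; EXP_LR=1e-3 overrides via env

Nice design point: defaults live in code, overrides come from the environment or a .env file, and values are validated at startup, so a malformed override fails loudly instead of silently training the wrong run.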
Many people would benefit from learning tensor contraction. Most concepts in ML architecture can be simplified and abstracted through this lens. Things with different names are just contracting, batching, sharding on different dimensions, e.g. TP vs CP, BatchNorm vs LayerNorm,
1
0
7
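One concrete instance of the "same op, different axis" point (shapes mine):

    import torch

    x = torch.randn(8, 128, 64)          # (batch, seq, feature)

    # BatchNorm vs LayerNorm: the same mean-reduction over different dims.
    mu_bn = x.mean(dim=(0, 1))           # per-feature stats (BatchNorm-style)
    mu_ln = x.mean(dim=-1)               # per-token stats (LayerNorm-style)

    # Attention scores as an explicit contraction; sharding the head dim
    # across devices is TP, sharding the seq dim is CP -- same einsum.
    q = k = torch.randn(8, 16, 128, 32)  # (batch, head, seq, head_dim)
    scores = torch.einsum("bhqd,bhkd->bhqk", q, k)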