Harold Benoit

@harold_matmul

Followers
534
Following
2K
Media
26
Statuses
252

Another day of being a researcher in theory but an engineer in practice | tech staff @LiquidAI_

Joined April 2024
@harold_matmul
Harold Benoit
15 days
The Monarch release from PyTorch is super neat, big fan of clean mesh abstractions. They're using it in their new RL framework, TorchForge.
0
2
8
@harold_matmul
Harold Benoit
28 days
Model conversion (like operator grafting or swapping the modeling objective here) is a key tool for architecture research nowadays, given the prohibitive cost of pretraining. Thrilled to see a lab focusing on fast feedback loops partially based on model conversions, excited to
@RadicalNumerics
Radical Numerics
28 days
Introducing RND1, the most powerful base diffusion language model (DLM) to date. RND1 (Radical Numerics Diffusion) is an experimental DLM with 30B params (3B active) with a sparse MoE architecture. We are making it open source, releasing weights, training details, and code to
0
0
8
@harold_matmul
Harold Benoit
1 month
It's a good model sir. Very proud of the team, we worked very hard to be on the Pareto frontier of quality and efficiency. Even had the chance to write a CPU-optimized kernel for MoE to squeeze everything from the hardware, and that gave us those sweet throughput results.
@LiquidAI_
Liquid AI
1 month
Meet LFM2-8B-A1B, our first on-device Mixture-of-Experts (MoE)! 🐘 > LFM2-8B-A1B is the best on-device MoE in terms of both quality and speed. > Performance of a 3B-4B model class, with up to 5x faster inference profile on CPUs and GPUs. > Quantized variants fit comfortably on
0
5
45
@harold_matmul
Harold Benoit
1 month
Source:
ita9naiwa.github.io
0
0
0
@harold_matmul
Harold Benoit
1 month
Found a short and clean walkthrough for how MMA on Tensor Cores works at the PTX level
1
0
4
@harold_matmul
Harold Benoit
1 month
Note that the argument is very similar to the findings by the 🐐 @ZeyuanAllenZhu, which motivate his introduction of Canon layers (which are short conv1d) and prove that adding them dramatically improves GLA. (This is in Physics of LLM 4.1)
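A minimal sketch of such a short conv, as I understand it (module name, kernel size, and placement are my assumptions, not the exact Canon layer from the paper):

import torch
import torch.nn as nn

class ShortConv(nn.Module):
    """Causal depthwise conv1d with a small kernel over the sequence dimension."""
    def __init__(self, dim: int, kernel_size: int = 4):
        super().__init__()
        # groups=dim makes it depthwise; left padding keeps it causal after trimming.
        self.conv = nn.Conv1d(dim, dim, kernel_size, groups=dim, padding=kernel_size - 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, dim)
        y = self.conv(x.transpose(1, 2))             # (batch, dim, seq_len + kernel_size - 1)
        return y[..., : x.shape[1]].transpose(1, 2)  # trim the right side, back to (batch, seq_len, dim)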
0
0
4
@harold_matmul
Harold Benoit
1 month
Full derivation:
0
0
0
@harold_matmul
Harold Benoit
1 month
Went through a lengthy derivation of MLA today to understand exactly how we could absorb the KV decompression matrices, as I didn't find much code on it. The trick is to write the proof on a per-head basis to easily derive the absorption. In the end, it does give an enlightening
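Roughly, the per-head absorption argument goes like this (my notation, ignoring the decoupled RoPE part). Write the attention score through the compressed KV cache and fold the two up-projections together:

q_i^\top k_{i,t} = (W_{UQ,i} c_q)^\top (W_{UK,i} c_{KV,t}) = c_q^\top (W_{UQ,i}^\top W_{UK,i}) c_{KV,t}

so W_{UQ,i}^\top W_{UK,i} can be precomputed into the query path and attention runs directly over the cached c_{KV,t}; the value up-projection W_{UV,i} folds into the per-head output projection the same way.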
1
0
2
@harold_matmul
Harold Benoit
1 month
Interesting take on the role of Short Conv. The argument uses the view that linear attention variants do a form of online learning (i.e. TTT) over the KV pairs (k_1, v_1), ..., (k_t, v_t), such that the update to the state is S_t = S_{t-1} − η_t * ∇_{S_{t-1}} L(f(S_{t-1}; k_t), v_t),
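For concreteness, with a linear readout f(S; k) = S k and a squared loss L = ½ ||S k_t − v_t||², the gradient step reduces to the familiar delta-rule update (a standard special case, assumed here for illustration):

S_t = S_{t-1} - \eta_t (S_{t-1} k_t - v_t) k_t^\top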
@Jianlin_S
jianlin.su
1 month
Why does linear attention need Short Conv? https://t.co/luUybG3RXj
1
12
98
@WenhuChen
Wenhu Chen
1 month
With SoRA, Veo, SeedDance coming out, a critical issue in building video generation models is the lack of automatic and interpretable evaluation metrics or reward models. To power the development of video generation models, we built VideoScore2, the SoTA generative video metric
@DongfuJiang
Dongfu Jiang
1 month
🔥 It's time to bring RL to generative video evaluation! Introducing VideoScore2 — a model that not only generates scores for generative videos but also produces detailed, high-quality reasoning traces. 🚀 To build VideoScore2, we curated prompts from 5 sources, covering both
4
17
124
@harold_matmul
Harold Benoit
1 month
The Dreamer 4 paper is really nice to read. Really appreciate the ablations on how the combinations of learning objectives (shortcut, x-loss, etc.) & architectural tweaks (e.g. hybrid, register tokens) affect speed and quality.
0
0
3
@harold_matmul
Harold Benoit
1 month
A note on the diffusion learning objective that I hadn't realized before, and how tweaking it ensures rollout stability (this may be specific to the case where we use diffusion forcing?). In the Dreamer 4 paper, they predict the clean representation x1 instead of the velocity v = (x1
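For reference, under the usual rectified-flow parameterization (an assumption on my part; the tweet cuts off before the exact formula), the two prediction targets are related by a simple change of variables:

x_t = (1 - t) x_0 + t x_1, \qquad v = x_1 - x_0, \qquad \hat{x}_1 = x_t + (1 - t) \hat{v}

so the difference is mainly which target the network regresses directly.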
0
0
3
@harold_matmul
Harold Benoit
1 month
Look at the cool model we just released, it's super fast! :) One innovation is that it uses both continuous & discrete audio tokens to enable generation without losing understanding capabilities.
@LiquidAI_
Liquid AI
1 month
Today, we expand our LFM2 family to audio. 👂👄 LFM2-Audio is an end-to-end audio-text omni foundation model, and delivers responsive, real-time conversation on-device at just 1.5B parameters. One model. Seamless multimodal support. No chains. > Speech-to-speech >
1
0
8
@harold_matmul
Harold Benoit
1 month
banger just dropped
0
0
3
@harold_matmul
Harold Benoit
1 month
small but mighty
@LiquidAI_
Liquid AI
1 month
Introducing Liquid Nanos ⚛️ — a new family of extremely tiny task-specific models that deliver GPT-4o-class performance while running directly on phones, laptops, cars, embedded devices, and GPUs with the lowest latency and fastest generation speed. > model size: 350M to 2.6B >
0
3
10
@harold_matmul
Harold Benoit
2 months
The secret sauce most definitely is in the data, given that the architecture is fairly standard: Qwen3 backbone + NaViT SigLip2 (i.e. it uses packed vision sequences). They use patch_size=16 and pixel_shuffle_scale_factor=2 in order to use few image tokens. A 256x256 image will
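Back-of-the-envelope token count (my own arithmetic; the tweet is cut off before the final number):

# 256x256 image, 16x16 patches, pixel shuffle with scale factor 2 merges 2x2 patches into one token.
image_size = 256
patch_size = 16
pixel_shuffle_scale_factor = 2

patches_per_side = image_size // patch_size                   # 16
num_patches = patches_per_side ** 2                           # 256
num_tokens = num_patches // pixel_shuffle_scale_factor ** 2   # 64

print(num_patches, num_tokens)  # 256 64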
@perceptroninc
Perceptron AI
2 months
1/ Introducing Isaac 0.1 — our first perceptive-language model. 2B params, open weights. Matches or beats models significantly larger on core perception. We are pushing the efficient frontier for physical AI. https://t.co/dJ1Wjh2ARK
2
1
18
@harold_matmul
Harold Benoit
2 months
S/o @samsja19
0
0
1
@harold_matmul
Harold Benoit
2 months
Switched the config system for the experiments to pydantic-settings, and I've never felt better. I have more energy. My skin is clearer. My eyesight has improved.
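A minimal sketch of this kind of setup (field names here are illustrative, not the actual experiment config):

from pydantic import BaseModel
from pydantic_settings import BaseSettings, SettingsConfigDict

class OptimizerConfig(BaseModel):
    lr: float = 3e-4
    weight_decay: float = 0.1

class ExperimentConfig(BaseSettings):
    # Typed, validated config; fields can be overridden from the environment,
    # e.g. EXP_SEED=123 EXP_OPTIMIZER__LR=1e-3 python train.py
    model_config = SettingsConfigDict(env_prefix="EXP_", env_nested_delimiter="__")

    run_name: str = "baseline"
    seed: int = 0
    optimizer: OptimizerConfig = OptimizerConfig()

cfg = ExperimentConfig()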
1
0
3
@harold_matmul
Harold Benoit
2 months
Many people would benefit from learning tensor contraction. Most concepts in ML architecture can be simplified and abstracted through this lens. Things with different names are just contracting, batching, or sharding over different dimensions, e.g. TP vs CP, BatchNorm vs LayerNorm,
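A toy illustration of the point (shapes and names are just for the example):

import numpy as np

x = np.random.randn(32, 128, 512)   # (batch, seq, dim)

# BatchNorm vs LayerNorm: the same mean/variance reduction, just over different dimensions.
mu_bn = x.mean(axis=(0, 1))              # reduce over batch and seq, keep features -> (512,)
mu_ln = x.mean(axis=-1, keepdims=True)   # reduce over features, keep batch and seq -> (32, 128, 1)

# A linear projection is a contraction over the feature dimension; tensor parallelism
# shards the contracted/output feature dims, context parallelism shards the seq dim.
W = np.random.randn(512, 1024)
y = np.einsum("bsd,df->bsf", x, W)       # (32, 128, 1024)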
1
0
7