miru

@miru_why

Followers 2K · Following 3K · Media 232 · Statuses 646

3e-4x engineer, unswizzled wagmi. specialization is for warps

Joined January 2024
@miru_why
miru
2 years
know the difference
8
30
408
@miru_why
miru
2 months
Scalable GANs with Transformers https://t.co/tUfsKKsVlK https://t.co/N2KzfMc4aL the authors train latent-space transformer GANs up to XL/2 scale, and report SotA 1-step class-conditional image generation results on ImageNet-256 after 40 epochs (*with REPA in the discriminator)
3
34
215
@miru_why
miru
4 months
author thread
@s_calvoordonez
Sergio Calvo Ordoñez
4 months
We'd love our flow-based generative models to learn the optimal transport from noise to data... but they rarely do ❌. Mini-batch Optimal Transport methods aim to fix this — but they're costly and require large batch sizes to work well... Can we approximate this behaviour
0
0
3
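the mini-batch OT idea the quoted thread refers to can be sketched in a few lines: re-pair noise and data within each batch by solving an assignment problem before computing the flow-matching loss. illustrative only; the function name and the squared-distance cost are my assumptions, not necessarily the authors' exact formulation.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def minibatch_ot_pairing(x0, x1):
    """Re-pair noise (x0) and data (x1) samples within one batch by solving an
    exact assignment on the squared-distance cost, so flow matching trains on
    better-matched pairs. Cost grows with batch size, hence 'costly'."""
    cost = ((x0[:, None, :] - x1[None, :, :]) ** 2).sum(-1)  # (B, B) pairwise costs
    rows, cols = linear_sum_assignment(cost)                  # optimal within-batch pairing
    return x0[rows], x1[cols]

# example: pair 64 Gaussian noise vectors with a flattened data batch
# x0_paired, x1_paired = minibatch_ot_pairing(np.random.randn(64, 512), data_batch)
```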
@miru_why
miru
4 months
Weighted Conditional Flow Matching: the authors improve flow-matching training by downweighting poorly-matched noise/image pairs, and show that this cheap reweighting produces clean, straight flow paths (like what you'd get from optimal transport) https://t.co/ykj6k1fh4A
1
1
25
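a toy sketch of the reweighting idea: a standard conditional flow-matching step where badly matched noise/image pairs get smaller loss weight. the exponential weight and the `tau` temperature are placeholders of mine; the paper's exact scheme may differ.

```python
import torch

def weighted_cfm_loss(model, x0, x1, tau=1.0):
    """One conditional flow-matching step with poorly matched pairs downweighted."""
    B = x0.shape[0]
    t = torch.rand(B, device=x0.device).view(B, *([1] * (x0.ndim - 1)))
    xt = (1 - t) * x0 + t * x1                        # straight interpolation path
    v_target = x1 - x0                                # conditional velocity target
    v_pred = model(xt, t.flatten())
    mismatch = (x1 - x0).pow(2).flatten(1).sum(-1)    # how badly each pair is matched
    w = torch.exp(-mismatch / tau)                    # downweight poorly matched pairs
    per_pair = (v_pred - v_target).pow(2).flatten(1).mean(-1)
    return (w * per_pair).mean()
```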
@miru_why
miru
5 months
anyone know a paper where they do RLVR strictly on a next-token prediction task? like ilya's detective novel analogy: doing reasoning rollouts and rewarding the reasoning chains that correctly deduce the criminal's name?
2
0
10
@miru_why
miru
5 months
english translation of huawei whistleblower's pangu writeup, 2/2. all translation thanks to gemini, with minor edits and [annotations] from discussion, hopefully we did it justice
1
0
23
@miru_why
miru
5 months
english translation of huawei whistleblower's pangu writeup, 1/2
2
1
31
@miru_why
miru
5 months
anonymous whistleblower from noah's ark lab has posted a writeup detailing the sad saga of pangu
- they support honestagi's claim that pangu MoE 72B was plagiarized from qwen 2.5 14B
- they also witnessed similar plagiarism from qwen 1.5 110B, deepseek-v3 https://t.co/vE5bFjNfS6
9
20
219
@miru_why
miru
5 months
one day in and @giffmana is already fixing annoyances in pytorch main
@SeunghyunSEO7
Seunghyun Seo
5 months
@giffmana @__kolesnikov__ @XiaohuaZhai Rumor has it you hated pytorch so much you joined meta to fix it from the source yourself, LOL
5
1
145
@miru_why
miru
7 months
pytorch transpose vs numpy transpose. baffling
6
5
103
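for anyone without the screenshot, the difference is roughly this (a minimal illustration, not the original image):

```python
import numpy as np
import torch

a = np.zeros((2, 3, 4))
a.transpose().shape         # numpy: no args reverses ALL axes -> (4, 3, 2)
a.transpose(1, 0, 2).shape  # numpy: accepts a full axis permutation -> (3, 2, 4)

t = torch.zeros(2, 3, 4)
t.transpose(0, 1).shape     # torch: swaps exactly TWO dims -> (3, 2, 4)
t.permute(2, 1, 0).shape    # torch: the numpy-style full permutation lives here -> (4, 3, 2)
```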
@miru_why
miru
7 months
μTransfer in action
0
1
21
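context, since the attached plot isn't preserved here: the μTransfer headline result is that under μP the optimal learning rate stays fixed as width grows, so you tune on a small model and reuse the settings on a big one. below is a very loose sketch of the per-layer Adam LR scaling; it is my simplification (init scales and output multipliers omitted), not the full μP recipe.

```python
import torch

def mup_style_param_groups(model, base_lr=3e-4, base_width=256, width=1024):
    """Rough muP-style rule for Adam: matrix-like (hidden) weights get their LR
    scaled by base_width/width so hyperparameters tuned at base_width transfer
    to wider models; vectors (biases, norms) keep the base LR."""
    matrices = [p for p in model.parameters() if p.ndim >= 2]
    vectors = [p for p in model.parameters() if p.ndim < 2]
    return [
        {"params": matrices, "lr": base_lr * base_width / width},
        {"params": vectors, "lr": base_lr},
    ]

# optimizer = torch.optim.Adam(mup_style_param_groups(wide_model, width=1024))
```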
@miru_why
miru
7 months
interesting paper on ‘any-subset’ auto-regressive modeling without the standard product-rule factorization https://t.co/XIkgVS1scu https://t.co/c6nS16t5ci their model can sample from the true joint distribution with 10% fewer NFEs (i.e. speculative decoding with no extra model)
0
0
9
@miru_why
miru
8 months
ReDi author thread with more information
@ThKouz
Thodoris Kouzelis
8 months
1/n Introducing ReDi (Representation Diffusion): a new generative approach that leverages a diffusion model to jointly capture – Low-level image details (via VAE latents) – High-level semantic features (via DINOv2)🧵
0
0
2
@miru_why
miru
8 months
Boosting Generative Image Modeling via Joint Image-Feature Synthesis https://t.co/UCPZmp5KDC the authors concatenate normal VAE latents with PCA’d DINOv2 embeddings, and find that diffusion models trained on this joint distribution achieve lower FID than VAE-only or VAE+REPA
4
23
169
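a rough sketch of how that joint latent might be assembled. the function name, the PCA handling, and the bilinear resize to the VAE latent grid are my assumptions; the paper's exact recipe may differ.

```python
import torch
import torch.nn.functional as F

def joint_latents(vae_latents, dino_patches, pca_basis, pca_mean):
    """Concatenate VAE latents with PCA-projected DINOv2 patch features so a
    diffusion model can be trained on the joint distribution.

    vae_latents : (B, C, H, W) latents from the image VAE
    dino_patches: (B, N, D) DINOv2 patch embeddings, N = h*w patches
    pca_basis   : (D, K) top-K principal directions (fit offline on DINOv2 features)
    pca_mean    : (D,)  feature mean used for centering
    """
    B, N, D = dino_patches.shape
    h = w = int(N ** 0.5)                               # assume a square patch grid
    feats = (dino_patches - pca_mean) @ pca_basis       # (B, N, K) low-dim semantics
    feats = feats.transpose(1, 2).reshape(B, -1, h, w)  # (B, K, h, w) spatial map
    feats = F.interpolate(feats, size=vae_latents.shape[-2:],
                          mode="bilinear", align_corners=False)
    return torch.cat([vae_latents, feats], dim=1)       # (B, C+K, H, W) joint target
```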
@miru_why
miru
8 months
if you were curious about the torch.sum bug discussed in the gpt-4.5 pretraining podcast (https://t.co/lroQtO20sD), here’s the original thread from last june
@ezyang
Edward Z. Yang
1 year
Crazy long standing data race in PyTorch's reductions 😱 https://t.co/goQzBU2rhs… Credit to the OpenAI team for figuring out (I hear it was weeks of debugging)
1
1
20
@miru_why
miru
8 months
PixelFlow: Pixel-Space Generative Models with Flow https://t.co/WWLoOYC7n8 https://t.co/zvBhzaTBBL the authors train a pixel space image generator with gradually-increasing spatial resolution across timesteps, and release 1B-scale class- and text-conditional checkpoints
0
22
80
@miru_why
miru
10 months
sakana is now working on a more comprehensive effort to fix all eval script exploits/loopholes discovered by the AI CUDA Engineer and reevaluate their technique. happy to see it and hope they succeed https://t.co/Q4LX6nbdBW
@SakanaAILabs
Sakana AI
10 months
Update: Combining evolutionary optimization with LLMs is powerful but can also find ways to trick the verification sandbox. We are fortunate to have readers, like @main_horse test our CUDA kernels, to identify that the system had found a way to “cheat”. For example, the system
0
1
43
@miru_why
miru
10 months
sakana have updated their leaderboard to address the memory-reuse exploit https://t.co/HslEahIM0y there is only one >100x speedup left, on task 23_Conv3d_GroupNorm_Mean. in this task, the AI CUDA Engineer forgot the entire conv part and the eval script didn’t catch it
6
13
253
@miru_why
miru
10 months
notes:
- ‘hacking’ here means ‘bungling the code so tragically that the evaluation script malfunctioned’, not any planned exploit
- sakana did a good job following kernelbench eval procedure and publishing reproducible eval code, just (seemingly) didn’t hand-check outlier results
3
7
351
@miru_why
miru
10 months
turns out the AI CUDA Engineer achieved 100x speedup by… hacking the eval script
@main_horse
main
10 months
@miru_why I believe there is something wrong with their kernel -- it seems to 'steal' the result of the eager impl (memory reuse somehow?), allowing it to bypass the correctness check. Here, I try executing impls in different order:
* torch, cuda
* cuda, torch
only the first order works!
76
242
3K
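the order-swap diagnostic from the quoted tweet, as a toy sketch. names and harness are mine and the real kernelbench eval is more involved; the point is just that a kernel whose "output" is stale memory left behind by the reference only looks correct when the reference runs first.

```python
import torch

def order_check(torch_impl, cuda_impl, make_input):
    """Run the eager reference and the candidate CUDA kernel in both orders
    and compare their outputs; a genuinely correct kernel agrees either way."""
    def agree(first, second):
        a = first(make_input())
        b = second(make_input())
        return torch.allclose(a, b)

    return {
        "torch_then_cuda": agree(torch_impl, cuda_impl),  # exploit passed this order
        "cuda_then_torch": agree(cuda_impl, torch_impl),  # ...and failed this one
    }
```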
@miru_why
miru
10 months
the whale has spoken
15
60
1K