
Ross Wightman
@wightmanr
Followers 22K · Following 4K · Media 132 · Statuses 5K
Computer Vision @ 🤗. Ex-head of Software & Firmware Engineering at a Canadian 🦄. Currently building ML/AI systems, or investing in startups that do it better.
Vancouver, BC
Joined April 2012
RT @ntenenz: The Dayhoff Atlas! Open code. Open weights. Open datasets. Thanks @huggingface for helping to facilitate open science. ht…
huggingface.co
A joint OpenCLIP (3.0.0) and timm (1.0.18) release day today. It's been a quarter since the last OC release, so what's new? PE (Perception Encoder) Core support was the headline feature. Using the timm vision encoder for the PE models, I adapted the weights from @AIatMeta so they…
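A rough sketch of pulling one of the new PE Core towers through timm. The `*pe_core*` wildcard is an assumption about the naming scheme, so the released names are discovered via `timm.list_models` rather than hardcoded:

```python
import timm
import torch

# Discover the PE Core variants actually shipped in this timm version,
# rather than guessing a model name ('*pe_core*' is an assumed pattern).
names = timm.list_models('*pe_core*', pretrained=True)
assert names, "no PE Core weights found in this timm version"

# num_classes=0 drops the classifier head and returns pooled features.
model = timm.create_model(names[0], pretrained=True, num_classes=0)
model.eval()

# Build a dummy input at the model's native resolution.
_, h, w = model.default_cfg['input_size']
x = torch.randn(1, 3, h, w)
with torch.no_grad():
    feats = model(x)  # pooled image embedding from the PE vision tower
print(names[0], feats.shape)
```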
RT @ShivamDuggal4: Compression is the heart of intelligence. From Occam to Kolmogorov, shorter programs = smarter representations. Meet KARL: K…
RT @Thom_Wolf: Thrilled to finally share what we've been working on for months at @huggingface 🤝 @pollenrobotics. Our first robot: Reachy Mi…
RT @aaron_defazio: AdamC and corrected weight decay for other optimizers are now implemented in timm! Try it out if you want better-behaved…
arxiv.org
During long-duration Large Language Model (LLM) training runs, the gradient norm increases rapidly near the end of training. In this short note, we show that this increase is due to an unintended...
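For reference, a sketch of the corrected decay as I read the note: standard AdamW decays parameters by lr * wd each step, and the proposed fix scales that term by lr / max_lr so the decay follows the square of the schedule. This is a paraphrase of the paper's idea, not timm's actual code; check timm's optim docs for how the release exposes it.

```python
import torch

def adamw_step_corrected(param, exp_avg, exp_avg_sq, grad, step,
                         lr, max_lr, betas=(0.9, 0.999),
                         eps=1e-8, weight_decay=0.1):
    """One AdamW step with the corrected ("AdamC"-style) weight decay.

    Sketch of my reading of the paper: the decoupled decay term becomes
    lr * (lr / max_lr) * wd instead of AdamW's lr * wd, so it decays
    with the schedule rather than staying proportional to lr alone.
    """
    beta1, beta2 = betas
    # standard Adam moment updates
    exp_avg.mul_(beta1).add_(grad, alpha=1 - beta1)
    exp_avg_sq.mul_(beta2).addcmul_(grad, grad, value=1 - beta2)

    bias1 = 1 - beta1 ** step
    bias2 = 1 - beta2 ** step
    denom = (exp_avg_sq / bias2).sqrt_().add_(eps)

    # corrected decoupled weight decay
    param.mul_(1 - lr * (lr / max_lr) * weight_decay)
    param.addcdiv_(exp_avg / bias1, denom, value=-lr)
    return param

# toy usage
p = torch.randn(10)
m, v = torch.zeros_like(p), torch.zeros_like(p)
adamw_step_corrected(p, m, v, torch.randn(10), step=1, lr=1e-3, max_lr=1e-3)
```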
RT @PyTorch: torchft + TorchTitan: 1200+ failures, no checkpoints, model convergence. A Llama 3 model was trained across 300 L40S GPUs wit…
RT @ysu_nlp: 📈 Scaling may be hitting a wall in the digital world, but it's only beginning in the biological world! We trained a foundatio…
RT @lschmidt3: Very excited to finally release our paper for OpenThoughts! After DataComp and DCLM, this is the third large open dataset m…
RT @aaron_defazio: Why do gradients increase near the end of training? Read the paper to find out! We also propose a simple fix to AdamW t…
Oh yeah, and a detail for @giffmana: a while back I asked if you scaled the batch size with seq-len. That's the default here, so batch size changes with each seq-len selected to keep utilization high. It works well; there's also loss scaling enabled to scale loss w/ the…
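A toy sketch of the scheme described above: sample a seq-len, then pick the batch size so tokens-per-step stays roughly constant. The token budget and length set here are illustrative, not values from the post, and the loss-scaling rule is cut off in the original, so it isn't reproduced.

```python
# Illustrative token budget per optimizer step (assumed, not from the post).
TOKEN_BUDGET = 524_288
SEQ_LENS = [256, 512, 1024, 2048]

def batch_size_for(seq_len: int, token_budget: int = TOKEN_BUDGET) -> int:
    # Shorter sequences get proportionally larger batches,
    # keeping accelerator utilization high.
    return max(1, token_budget // seq_len)

for sl in SEQ_LENS:
    print(f"seq_len={sl:4d} -> batch_size={batch_size_for(sl)}")
```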