Crafting pixels w/ PhotoRoom after some time in sunny California and happy Copenhagen. Meta (xformers, FairScale, R&D), EyeTribe (acq). Mostly tweeting about AI
@hiphopscypher
@VeritasBP
@adamcbest
@howardfineman
Look at the stats and get back here. I’ll help you: Japan’s homicide rate is around 0.2 per 100k per year, the US is roughly 25 times that. What you *think* you know is insignificant when faced with these stats.
Reading the paper, and as a transformer-parts nerd (xformers 😘), this feels pretty compelling. One of the first "serious" algorithms I came across was of course the Kalman filter, which has a solid mathematical grounding, a good touchstone here (and for S4 in general)
Transformers power most advances in LLMs, but their core attention layer can’t scale to long context.
With
@_albertgu
, we’re releasing Mamba, an SSM architecture that matches/beats Transformers in language modeling, yet with linear scaling and 5x higher inference throughput.
1/
Finally reading Ring Attention (), doesn't look like there's a performant open-source implementation in PyTorch out there, but feels like something where
@d_haziza
and crew would shine...
Great paper, right in Google's garden given the TPUs' strong interconnect. Gemini?
Just finished a first read of DINOv2, feels really significant. I read a fair number of papers, and this is the first one in a while which felt so insightful despite not being about a new arch per se 1/N
This is really big I think. OpenAI Triton is now compatible with NVIDIA TRT & AMD ROCm (on top of the original use case with NVIDIA & Python). A new lingua franca for GPU kernels, well-deserved kudos to Philippe Tillet
One ongoing story I'm really excited about is the Triton compiler, which AMD has been investing a lot into. The end result: you can write 1 Triton kernel, and run it at high perf on NVIDIA or AMD GPUs!
Here's the current (fwd) perf of a Triton FA-2 kernel on A100 vs. MI250:
Not completely new by now, but “Direct Preference Optimization” (Rafailov et al.) is a landmark in tuning models towards a preferred distribution at generation time, I believe. Not obvious to me in the beginning, so mini thread
Hey Twitter, RT appreciated. Let's say we (PhotoRoom) open-source a new text-to-image model, would any research labs be interested? I would "just" need marks of interest, nothing binding or costly. DMs open
Another great recent publication that went under many radars, I think, is SigmaReparam (reparametrizing linear layers with spectral normalization, getting rid of LN and training tricks). Tested across many fields, simplifying, feels sensible, impressive results
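For the curious: spectral normalization boils down to dividing a weight matrix by its largest singular value, usually estimated with power iteration. A minimal toy sketch of that estimator (my own illustration, not the paper's exact parametrization, which also carries a learned scale):

```python
import torch
import torch.nn.functional as F

def spectral_norm_estimate(w, iters=100):
    # power iteration: estimate the largest singular value of w
    u = F.normalize(torch.randn(w.shape[0]), dim=0)
    for _ in range(iters):
        v = F.normalize(w.t() @ u, dim=0)
        u = F.normalize(w @ v, dim=0)
    return torch.dot(u, w @ v).item()

w = torch.randn(64, 32)
sigma = spectral_norm_estimate(w)
w_reparam = w / sigma  # largest singular value of w_reparam is now ~1
print(abs(torch.linalg.svdvals(w_reparam)[0].item() - 1.0) < 1e-2)  # True
```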
This is utterly absurd. The planet is burning and we’re focusing on irrelevant and made up problems (crypto a few years ago, now extinction from AI..).
@EU_Commission
seems really poorly informed here, scientific reasoning and asking experts should be a priority
Mitigating the risk of extinction from AI should be a global priority.
And Europe should lead the way, building a new global AI framework built on three pillars: guardrails, governance and guiding innovation ↓
With a little bit of experience, there’s no way these numbers are true, even if they were measured once. Playing this game you should take the best implementation from team JAX vs. the best from team PyTorch. A 4x difference on SAM makes no sense (same for the “it’s XLA” explanation
@fchollet
)
Attention really is the tree that keeps giving.. cuDNN 9 coming out with nice perf claims + extra flexibility, not Triton-like but close. Nice for vision, where attention is more easily an issue than in LLMs
We are THRILLED to announce a major milestone today: our Series B, raising $43M with Balderton and Aglae
We also announce 6 new GenAI features in Photoroom. More importantly, these features are powered by Photoroom's foundation model, the best model for commerce photography.📸✨
@barf_stepson
@ctatplay
But why don't you US people vote that insanity out? You're aware that other comparable countries (ahem, Europe for instance) let you learn without digging yourself a debt grave, right?
Writing a blog post on our (Photoroom) family of diffusion models, from training to features in the app. Anything specific you would be interested in ? Won’t be able to spill all the beans but I can try
Python is atrocious for parallel work: ProcessPool will never cut it because you're stuck in pickling oblivion and the code becomes an unstable spaghetti plate, asyncio is overrated for anything which is not simple IO; the answer is the GIL-less project from Sam Gross.
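Concretely, the pickling wall: process pools ship work to workers via pickle, and anything like a lambda or a closure simply can't make the trip, which is why non-trivial ProcessPool code rots so fast. Tiny illustration:

```python
import pickle
from concurrent.futures import ProcessPoolExecutor  # workers receive tasks via pickle

double = lambda x: x * 2  # lambdas have no importable qualified name

try:
    pickle.dumps(double)
    picklable = True
except Exception:
    picklable = False

print(picklable)  # False: this callable could never be shipped to a worker process
```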
Having a look at datatrove () from
@huggingface
, nice to see this out. There’s also Fondant, but other than that not much in the open for what’s a key building block for modern ML: data pre-processing.
Interesting new DeepSeek model, still a Transformer but with lots of tweaks, and it seems very competitive vs. Llama 3 70B. "Latent attention" is one new element in a string of compressed-KV proposals; the same happens in vision
One day of training, from scratch (!), on a "big enough" cluster. This was actually a debug run 😅 What a time we live in... Which one is the real one ?
MobileCLIP is a really cool paper from Apple folks,
SOTA on a latency/accuracy basis (Pareto SOTA if you will)
Couple of key axes to get there (ongoing read)
- model arch. Under-explored these days, a shame given NextViT or MaxViT
- data
- training
A bit short on ML news at
@photoroom_app
recently (well, quiet for a month) because we're cooking something big, but this we just released: colored shadows, the model understands transparency ! This comes from the app in 5 seconds
I'm still not over DINOv2.. Abstractions have been a staple of LLMs, but they are all over the training set, already baked in (a car and a bus are both vehicles). DINOv2 shows some being captured by the model, yet it did not get them from a data leak; this should be a bigger deal IMO
@Broun_Dragon
@SmallHandsDon
@RALee85
@AbraxasSpa
Not all leaders steal their country's wealth, whataboutism only goes so far.. Putin is a billionaire in assets after “leading” an impoverished country. Can you think of another western leader with comparable stealing skills? So no, not just like any pilot in any country
In the middle of an AI storm in SF, the Paris AI scene is vibrant and getting things done !
We're still recruiting in the ML team at PhotoRoom, Senior Applied Scientist and Data Scientist (with a big DL flavor). Know somebody interested ?
Gaudi 3 is out, looks like a much beefier version of Gaudi 2 + an emphasis on interconnect, a bit like TPUs. Quoted numbers are fp8, which is non-trivial at the moment, but supposedly PyTorch compatible (native PyTorch ops, or an intermediate like ONNX?)
@kikithehamster
@barf_stepson
@ctatplay
That's just sad.. I understand that everything is not that simple, but frankly the inability of US society to position itself on some subjects which are in much better shape in other OECD countries (school, healthcare, police violence, guns..) is baffling.
Moving our data processing to Ray, a pretty cool framework for orchestrating workloads; it's nice that it removes a lot of the Python quirks (esp. their Actor abstraction).
Not a deal breaker, but serialization looks like it takes a lot of CPU, especially for CUDA tensors. Any tips?
Some of the examples in the blog post are early engineering demos, updating them today. Here's a more recent "erase" example which I think is nuts. Soon in your pocket, kudos to
@mearcoforte
New model incoming, doing much better in situations which were tricky, lots of contrast for instance. Also easier to prompt, which was an issue (looks like we're not the only ones, SD3..).
Other examples in the thread
Did you know that super-resolution is a surprisingly interesting topic? Like, the frequencies you need to add (hallucinate) are super context-dependent, you don't want to super-resolve bokeh. Super proud of the ML team at
@photoroom_app
(pic credits: )
How
@photoroom_app
speeds up
#stablediffusion
using xformers, explained:
Attention still matters.. Matthieu also contributed a PR to
@huggingface
, eventually this should become largely available
Really cool results, even the "broken" part is fascinating, super smart solution ! Interesting parallel with some discussions on autoregressive LLMs being doomed without scratchpads
Vision transformers need registers!
Or at least, it seems they 𝘸𝘢𝘯𝘵 some…
ViTs have artifacts in attention maps. It’s due to the model using these patches as “registers”.
Just add new tokens (“[reg]”):
- no artifacts
- interpretable attention maps 🦖
- improved performances!
GPU poor or GPU rich, always a matter of perspectives..
- one H100 node down,
- 1 used right now (<- GPU poor)
- 29 nodes to go (<- GPU $$$)
Updating this tweet when I get the full 30 nodes back up :) Sharing a recent learning of mine: interconnect is _still_ key to perf
@francoisfleuret
You would risk overshooting? Undershooting is not as bad: it could take you more steps, but you’ll get there. Overshoot and you may never get there.
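A toy picture of that asymmetry, with plain gradient descent on f(x) = x² (my example, nothing from the original thread):

```python
def gd(lr, steps=50, x=1.0):
    # gradient descent on f(x) = x^2, whose gradient is 2x;
    # each step multiplies x by (1 - 2 * lr)
    for _ in range(steps):
        x -= lr * 2 * x
    return abs(x)

# undershooting (small lr) just takes more steps; overshooting (lr > 1) diverges
print(gd(0.4) < 1e-6)  # True: converges
print(gd(1.5) > 1e6)   # True: blows up
```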
Makes no sense to me, focused on anecdata. Scientific progress doesn’t work like this: when comparing trainings & architectures, what merit is there to which data center the model was trained in? The aim is to reproduce or improve, the past is useless. Even
@JeffDean
gets this backwards it seems
I got really excited that the LLaMA paper calculates and reports their carbon footprint! 🦙🌬️🌎
But upon looking at the paper itself, it has this table, which completely misconstrues the emissions of OPT and BLOOM, while not actually reporting LLaMA's own.
How? A thread 🧵
PhotoRoom is hiring for many positions if you’re looking for a lean and fast growing company
You’ll have a big impact for hundreds of millions of entrepreneurs in the world🔥
Classification is not a modern computer vision task, I wish people would stop using it for broad architecture prescriptions. Where is MLPMixer again? The “optimal ViT shape” paper? Right
ConvNets Match Vision Transformers at Scale
abs:
This paper from Google DeepMind pretrains a variety of NFNet models on JFT-4B at various scales and obtains performance on ImageNet similar to ViTs.
> Our work reinforces the bitter lesson. The most
@Eastern_Border
He literally says (typical Macron) "neither follow the most warmongering ones, nor abandon the eastern countries so that they have to act alone", it's convoluted but it's actually supportive of Eastern Europe.
Great thread and insights, some mirror vision indeed and lots I didn’t know. The caveat is that this doesn't improve on BERT; seems trivial but worth keeping in mind. Would be nice to see this with GPT and compare the keepers, looks like
@karpathy
is on that these days
How good of a BERT can one get in ONE DAY on ONE GPU?
With all the recent studies about scaling compute up, this paper takes a refreshing turn and does a deep dive into scaling down compute.
It's well written, chock-full of insights. Here is my summary and my opinions.
🧶 1/N
Something everyone should know, but with an eye on historical perspective I think. Historically attention is IO-bound even more than FLOPs-bound, so the incentive was big for LLM practitioners to pile up on model dim. Flash relaxed that.. and OpenAI moved to 32k context
this chart shows how the FLOPS in a GPT are allocated as the model scales
this model has a fixed context length of 2048 tokens, but the model dim increases by 16x from the smallest to the biggest
Anyone in my torch TL: if you have a "classic" (explicit softmax(QK^T)V) implementation of attention in your codebase, it's _really_ worth moving to torch SDPA (or xformers', or FlashAttention). It's typically only a few lines, more reliable, faster & uses way less RAM
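The migration really is mechanical; here's a minimal before/after (toy shapes, fp32 on CPU):

```python
import torch
import torch.nn.functional as F

def naive_attention(q, k, v):
    # the "classic" way: materializes the full (seq x seq) attention matrix
    scale = q.shape[-1] ** -0.5
    attn = torch.softmax((q @ k.transpose(-2, -1)) * scale, dim=-1)
    return attn @ v

q, k, v = (torch.randn(2, 4, 128, 64) for _ in range(3))

out_naive = naive_attention(q, k, v)
# drop-in replacement: fused kernel, never materializes the big attention matrix
out_sdpa = F.scaled_dot_product_attention(q, k, v)

print(torch.allclose(out_naive, out_sdpa, atol=1e-4))  # True
```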
I personally love
@giffmana
very open takes / opinions + shared recipes and little tricks, not seen that often. Beyond this "style", the track record is very impressive
Here's what our (sub)team in Zürich has done for OSS vision over the past 5y, besides inventing ViT:
1) Make i21k a thing
Release:
2) best CLIP (siglip) by a large margin
3) best i1k ResNet50 ever
4) best pre-trained ResNets
5) >55k ViTs
6) Most efficient JAX/TPU CV code
deets👇
It’s also a key part of the soon-to-be-released PhotoRoom model (not vanilla DiT, but heavily related). I really think gatekeeping reviews are too noisy; something like arXiv + open reviews / test of time feels better to my eyes
The Diffusion Transformer paper, by my former-FAIR-and-current-NYU colleague
@sainingxie
and former-Berkeley-student-and-current-OpenAI engineer William Peebles, was rejected from CVPR 2023 for "lack of novelty", accepted at ICCV 2023, and apparently forms the basis for Sora.
How is the Photoroom backend fast and scalable, with billions of diffusion-based images generated a year, and growing? Re-posting slides from
@MatthieuToulem1
and
@EliotAndres
from latest GTC, for visibility.
All decent trainings get 40 to 50% MFU, there’s no 4x hiding in there; these numbers would mean that PyTorch was at <15%, i.e. that there were correct but very slow ops that any engineer would fix. 15% speedups are possible, not 400%, unless your baseline is broken
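MFU is just achieved model FLOPs/s over the hardware peak; with made-up but plausible numbers for an A100 (312 TFLOPs bf16 peak):

```python
# MFU = achieved model FLOPs per second / peak hardware FLOPs per second
peak_flops = 312e12       # A100 bf16 peak
achieved_flops = 140e12   # hypothetical well-tuned training run

mfu = achieved_flops / peak_flops
print(round(mfu, 2))  # ~0.45: a healthy run; a "4x slower" baseline would imply ~11% MFU
```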
Remember a thread two months ago on how to sell your sneakers () ? Remember this thread from
@matthieurouif
? (). Well, we have new Magic Studio incoming in
@photoroom_app
, now real time. Learning by doing, we're ML artisans :)
PhotoRoom raised $19 million for its AI photo studio, which is already powering millions of resellers and small businesses.
Today, we are launching a new generative AI feature, magic studio, to create stunning marketing images from any product photo
A thread 🧵
@gcvrsa
@silentsara
@brent_bellamy
Honey, check out rear-facing car seats, these are by far the safest until age 4 (and are indeed hard to fit in a smallish car, even if that doesn't justify crazy American pickups)
Interesting results, which completely match residual path assumptions if you think about it. Each new layer adds residual information, so simple concepts are nailed early, complex ones late
Diffusion Lens is a pretty neat new paper, you can see a text-to-image encoder's representation of a giraffe getting less and less abstract with every layer 🦒
To the surprise of nobody in the field, but much easier said than done. Congrats
@pommedeterre33
et al. on Kernl, it was a motivation for xformers at the time and I'm glad that a complete, faster, torch-compatible Transformer happened in the open!
12X faster transformer model, possible?
Yes, with
@OpenAI
Triton kernels!
We release Kernl, a lib to speedup inference of transformer models. It's very fast (sometimes SOTA), 1 LoC to use, and hackable to match most transformer architectures.
🧵
Dear TL, we're recruiting at
@photoroom_app
and more specifically in the ML team, if you're interested in wrangling data at scale (100M+), multiple modalities, custom labelling models & data science challenges: DMs open, or talent@photoroom.com !
Job desc here
Nothing to do with me, but Llama and DINOv2 are two great SOTA papers from FAIR, anchored in different fields, both using specialized parts from xformers (which I didn’t contribute to, again not about me). Optimizing above the torch ops was a key vision, confirmed IMO
I've a huge respect for
@OpenAI
, but this is an incredibly pretentious, self-centered and historically wrong take. Swap AGI for RNA vaccines, CRISPR, X-rays, relativity theory, antibiotics, .. None of them were any better kept under wraps, and nobody can be trusted with these.
Speeding up multimodal ML: we're rolling out a much faster scene suggestion in
@photoroom_app
, which is content-aware, editable to your taste, and now almost instant. The two top scenes are effectively infinitely fine-grained recommendations for this content.
Reacting a bit late, but my take on this (the open part). I've been a follower of SemiAnalysis and
@dylan
for a while, generally impressed. Disagreeing this time.
I think this is based on two premises, largely debatable
1 - winner takes all
2 - bigger is better
Google Gemini Eats The World – Gemini Smashes GPT-4 By 5X
The GPU-Poors, MosaicML, Together, and Hugging face
Broken Open-Source
Compute Resources That Make Everyone Look GPU-Poor
Google Cloud TPU wins
@savvyRL
Word2vec was just about clustering, but here there seems to be a spatial component, which is new, right? Some cities are bound to be written together more often because of non-spatial logic, say a list of capitals for instance. A purely frequentist approach should get the location wrong, no?
We've been using Ruff for a while at Photoroom AI, moving to it for the formatting, pretty incredible tool, we're lucky to have it in the ecosystem (and Rust is the future for low latency tools, that's for sure).
@typewriteralley
There are existing counter-examples to that, take Copenhagen for instance, where the bikes and buses don't have to compete. The phrasing could have been "people on buses are often stuck in traffic because single-occupant cars take all the space". City planning education is rare :(
@finbarrtimbers
It’s hard to get good perf out of it: there are great kernels for this in xformers, but you don’t typically get the speed that you could expect, for instance picking single coefficients doesn’t fit tensor cores. Blocksparse is much better if you want sparse
Of note: there’s an industry standard for inference called MLPerf. Fastest on nvidia GPU is TensorRT, across many models, go check it out.
Benchmarks are only meaningful best-on-best, there’s an order of magnitude perf span in between correct implementations.
AMD dropping on results, then jumping on an AI partnership with MSFT.. interesting that MSFT is actually a credible partner here, given the workload via OpenAI and their Triton-infused stack (NVIDIA for now, but through an IR) 1/2
@Jonathan_Blow
Maybe that’s just an easy way to do layoffs, and people want to stuff their agenda in it ? “People want this not to be true, but it’s true” as the saying goes :)
Some tech news from the
@photoroom_app
team, it’s been a while:
- consistent renderings (in beta on the web already): your gen-AI pics look like they come from the same place, instead of being unrelated
I forgot in the above, but the details on engineering are just great too. Hero number: 2x as fast and 3x less memory than comparable SSL methods, when proper engineering is included. Pretty impactful, and good engineering compounds (it's reusable), a good omen for FAIR 3/N
None of these pics are completely real, but there’s some reality-informed diffusion :) no outgrowing (I believe
@photoroom_app
is the only company nailing that), but we’re improving on some details. Crazy optim on the backend to get to these speeds (seconds) but more to come
Hopefully able to share a bit more in a couple of weeks, we tried to push the walls with this training. It's not a SD_something, another architecture that we believe in for PhotoRoom. Our own data stack, and a metric ton of work even before the diffusion training
To emphasize some key facts:
- we generate billions (with a b) of images a year for our users
- Photoroom is sustainable, this is not a VC-money or ZIRP play
- foundational diffusion model, trained from scratch and powering a host of features
- third one we trained actually (:
It feels like a paper which covers the whole loop to start with: a lot of context around the SSL SOTA, picking the good ideas where they are, detailed explanations of the data pipeline, many insightful ablations, extensive results and plenty of takeaways and surprises 2/N
Why are the CPUs so idle, I hear you say ?
Because we precompute everything we can, that's why. Removes most augmentation options, but that's not really a thing for diffusion. That's how we got these 10k img/s (training !) on 16 A100 nodes that I mentioned in the blog post
The Lenna of generative AI just got an upgrade... Still ~instant rendering, now need to ship this (after internal demo and convincing colleagues 😬). Not quite the final checkpoint
@tunguz
We did that for xformers (); in retrospect that was probably a mistake, and a white paper on arXiv would have helped getting traction. Most people's receptive fields are tuned to arXiv these days (+ derived mailing lists, websites and RSS)
In a similar vein, Mistral or Gemma numbers would have been fairer with things like gpt-fast baked in; after all the comparison includes all of Keras, so surely those 1000 LoC are fair game?
Maximize training throughput using PyTorch FSDP 🔥
The PyTorch teams at IBM and Meta demonstrate the scalability of FSDP with a pre-training exemplar and share various techniques to achieve rapid training speeds.
Read more here:
Liberal depiction of the Mistral naming scheme: so French it’s becoming pretty funny. Joke aside, I think they’re doing a great job at standing out
Vive La Plateforme and Le Chat !
(pic not real, from PhotoRoom own model)
Sneak peek, which one is real ?
@photoroom_app
and instant backgrounds, this is still instant but quality and model understanding is going through the roof. Hold on to your socks and stay tuned
Some practical questions in terms of AI image editing these days: go full ML, or keep the physics/rendering angle? It's not entirely obvious in many places, mini thread
@KerenAnnMusic
@lorde
Does "apartheid state" and "colonies" count, or do you have selective eyesight? Nothing to do with Judaism. With your reasoning, Mandela would have died in prison
@melficexd
@cHHillee
they've been GPGPUs for a while, and GP stands for General Purpose, but I guess most people commenting don't know that either. Tensor Cores in NVIDIA chips have been targeted at AI for a long time
Out of my league again, but the parallel scan here feels like the crown jewel. Flash got us used to streaming all the time, kernel fusion is a known quantity by now, but this not being bound by the recurrence doesn't feel obvious? Tri Dao strikes again, as usual
@rasbt
you can try out "metaformers" (ViTs with some patch embedding layers) on cifar in xformers, super simple script
Defaults bring you to 86% on CIFAR10 (not a world record, I know) within 10 minutes on a laptop, with a 6M-parameter model (half of a ResNet18)
Cringe-y for me to watch, too slow in the beginning but getting a bit better over time: a presentation I made a couple of weeks back on how some of the
@photoroom_app
AI features work behind the scenes. Not too detailed, but questions welcome
A lot of intuition is shared here, which feels great. Definitely putting this selection mechanism to good use (jeez twitter is broken, and sorry for the typos above)
@ID_AA_Carmack
The initial release was not optimized at all, it's getting better these days (fusing layers with nvFuser or TensorRT for inference, better attention kernels from xformers and
@tri_dao
, ..). New major improvements may not be iso-weights from now on (to be able to use tinyNN for instance)
The DPO promise is to cut through this; as they put it, “your LLM is already a reward model”. How that worked in practice, I didn’t get initially though
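For reference, the DPO objective itself is tiny once you have sequence log-probs; here's how I'd sketch it (toy log-prob numbers of my own, β=0.1):

```python
import torch
import torch.nn.functional as F

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    # implicit reward = beta * (policy log-prob - reference log-prob);
    # the loss just asks the chosen completion to out-reward the rejected one
    margin = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    return -F.logsigmoid(margin).mean()

# toy sequence log-probs: the policy already leans towards the chosen answer
loss = dpo_loss(torch.tensor([-5.0]), torch.tensor([-9.0]),
                torch.tensor([-6.0]), torch.tensor([-6.0]))
print(loss.item() < 0.6931)  # True: below -log(0.5), i.e. better than indifferent
```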
SSL method: I’m far from a specialist, but this felt like it reuses a lot of prior insights from this lab + other good ideas from the outside; in particular KoLeo (which encourages a very regular feature spread) looks to be very significant (8% retrieval boost!) 5/N
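My rough understanding of KoLeo in code (a simplified sketch of the Kozachenko-Leonenko-style penalty as I read it, not the exact DINOv2 implementation):

```python
import torch
import torch.nn.functional as F

def koleo_loss(x, eps=1e-8):
    # penalize features for sitting too close to their nearest neighbor,
    # which encourages a regular spread on the unit sphere
    x = F.normalize(x, dim=-1)
    dots = x @ x.t()
    dots.fill_diagonal_(-1)  # exclude self-similarity
    nn_dist = (2 - 2 * dots.max(dim=1).values).clamp_min(eps).sqrt()
    return -nn_dist.log().mean()

spread = torch.randn(16, 64)
collapsed = spread[:1].repeat(16, 1) + 1e-4 * torch.randn(16, 64)
print(koleo_loss(spread).item() < koleo_loss(collapsed).item())  # True: collapse is punished
```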
@BlancheMinerva
@norabelrose
Are you sure you’re reading this right? This just shows the updated algorithm goes higher in FLOPs iso-hardware, and is not affected by sequence length too much, but this is not the attention throughput (else it would be in seq/s or similar).