
Viraj (@tunedgradient)
applied AI & intelligence. notes on how we think, and how our machines learn.
San Francisco · Joined August 2025
46 Followers · 750 Following · 41 Media · 386 Statuses
sft is like lego blocks, rl is like ikea furniture with missing screws. everyone keeps promising plug-n-play, but rl isn’t like sft. it’s not just 'load data, train model.' every env+algo pairing adds new quirks, new knobs to tune, new failure modes. the bright side: every
we’re approaching the end of 2025 and there’s still no plug-n-play RL lib. in the interim:
- i built a shitty version of this (llamagym)
- RL started working (o1)
- oss found out how it worked (r1)
- “RL env” became the new buzzword
- oss RL envs unified around `verifiers`
what i like about detailbench is that it flips the usual framing. it’s not 'can the model follow instructions' but 'can it notice when something’s just a bit off.' catching a wrong digit in the middle of a translation is a very different skill than writing fluent text. most llms
(Re-)launching DetailBench! After a lot of feedback in the comments that LLMs should *always* notify about mistakes, I changed the scoring. Well, let's just say it didn't really help 🙃
sums up llm progress: basically evolutionary search. countless runs branch out, most dead-end, a few checkpoints survive and get refined. evals act as the selection pressure, turning random exploration into structured, compounding progress.
great opportunity to make an impact! building evals right is basically compounding leverage on the whole field.
i'm hiring for a new team @openai: Applied Evals our goal is to build the world's best evals for the economically valuable tasks our customers care about most. we'll execute as a group of high‑taste engineers, combining hands-on, unscalable efforts with systems that others can
not surprised. for non-math folks: the hodge conjecture is one of the clay millennium problems, about which geometric shapes can be described algebraically. and in this case, the grandiosity of this paper itself gives it away, nobody quietly knocks down a clay prize in bullet
exploit prompt format bias, don’t mistake it for magic. llms just inherit a huge prior from oceans of html/xml, so they're good at <open>/<close> delimiters, tolerant of free-form text, and less brittle than json’s quotes/commas. so you may see better results. but if you need
Literally just append this to the prompt. The results are incredible: “Before answering, <think> inside XML tags for at least 15 paragraphs.”
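not from the tweet, just a toy illustration of the claim: a lax regex recovers an <open>/<close> span even when the text inside has quotes and commas, while the same free-form text embedded in json breaks strict parsing (the strings here are invented examples):

```python
import json
import re

# xml-style tags: a lax regex recovers the span despite quotes/commas inside
raw = 'preamble <think>step 1: it\'s "rough", free-form text</think> answer'
m = re.search(r"<think>(.*?)</think>", raw, re.DOTALL)
print(m.group(1))  # the span survives intact

# the same free-form text in json, without escaping the inner quotes, fails
bad = '{"note": "step 1: it\'s "rough", free-form text"}'
try:
    json.loads(bad)
except json.JSONDecodeError as e:
    print("json failed:", e)
```

the point is about tolerance: the tag-delimited format degrades gracefully around messy content, json does not.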
weird realization: ai folks are building h-nets (models that just invent their own tokens from raw bytes). physicists are chasing the muon g-2 with crazy precision. both sounded like they might change the game. instead, we mostly got… cleaner numbers. are we just maxing rigor
Between h-net disappointment and mixed muon experiments, it might not seem like the best of times to be working on model design. Yet as someone primarily on the data side I’m increasingly convinced something is off.
people talk about 'memorization vs understanding' like it’s a clean line. never was. humans memorize patterns & call it intuition. models compress patterns & we call it memorization. so maybe 'understanding' is just the name we give to useful compression that transfers.
A student who truly understands F=ma can solve more novel problems than a Transformer that has memorized every physics textbook ever written.
there's still so much confusion nowadays around parameter count vs perf
This is a myopic answer that doesn't consider hardware or problem type. A 2B parameter model is both EXTREMELY fast when using the proper hardware and EXTREMELY accurate when used on the correct use case. There are no silver bullets. Good engineering is about considering the
great thread on evals. feels like we’ve reached consensus: you can prototype fast without them, but once you’re aiming for reliability (esp in sensitive domains like health or legal), they’re the difference between pilot hell and prod.
Are evals required to build great AI applications? There’s been a ton of discussion on this recently, and I wanted to share my POV working on evals for startups and enterprises at @OpenAI 👇
what if the real bottleneck isn’t image data at all, but reasoning traces? if text-only 'thinking data' can boost generation, maybe the cheapest supervision ends up being the most powerful.
7/ and in a flourish: gpt-4o generates multimodal visuals conditioned on flow fields, e.g. “diver with bubbles along streamlines.” not just pretty pictures, but tied to simulation data. ( https://t.co/TPswyXI8HB)
6/ it also doubles as a tutor. don’t know what a cfl number is? gpt explains and suggests defaults inline, turning advanced cfd setup into an interactive lesson. ( https://t.co/TPswyXI8HB)
5/ post-sim, gpt can auto-generate scripts for “plot instantaneous cd over time” or “slice velocity at y=0.” it edits params, runs analysis, and visualizes results. ( https://t.co/TPswyXI8HB)
4/ three agents, each steered by gpt-4o (5 will be better!):
- preprocessing: builds 3d meshes from text/image (point-e pipeline)
- solver: asks for reynolds, cfl, timestep, writes config files
- postprocessing: scripts plots, drag/lift curves, streamlines, even photo-real
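a minimal sketch of how such a three-stage pipeline could be wired; the function names and the `ask_llm` stub are invented for illustration and are not the paper's actual interfaces:

```python
# hypothetical three-agent cfd pipeline; names and the ask_llm stub are
# invented for illustration, not taken from the cfdagent paper.
def ask_llm(prompt: str) -> str:
    # stand-in for a real gpt-4o call
    return f"[llm output for: {prompt}]"

def preprocess(description: str) -> str:
    # build a 3d mesh from a text/image description (point-e style)
    return ask_llm(f"generate mesh: {description}")

def solve(mesh: str, reynolds: float, cfl: float, dt: float) -> str:
    # write solver config (reynolds, cfl, timestep) and run the ib solver
    return ask_llm(f"run solver on {mesh} (Re={reynolds}, CFL={cfl}, dt={dt})")

def postprocess(result: str) -> str:
    # script plots: drag/lift curves, streamlines, slices
    return ask_llm(f"plot drag/lift curves for: {result}")

report = postprocess(solve(preprocess("cylinder in channel flow"),
                           reynolds=100.0, cfl=0.5, dt=1e-3))
print(report)
```

the design point is just the staging: each agent consumes the previous agent's artifact, so the llm acts as planner and glue between classical tools rather than replacing the solver.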
3/ in cfdagent gpt is the planner, coder, and tutor that makes a classical ib solver usable from natural language. ( https://t.co/TPswyXI8HB)
2/ what ai is changing is not the physics but everything around it. geometry generation, meshing, parameter setup, scripting, visualization. the glue work that used to be the bottleneck. ( https://t.co/TPswyXI8HB)
1/ cfd has always been the domain of specialists with cad licenses, meshing scripts, solver configs, and weeks of hpc time. but now multimodal llms are starting to reshape cfd workflows. ( https://t.co/TPswyXI8HB)