Michael Gordon

@0xMichaelGordon

Followers
1K
Following
2K
Media
12
Statuses
498

CTO @TryBrickroad • I run Dataset News—operator notes on datasets, evals & licensing → https://t.co/VfjAL5uKta • Prev AI R&D @GovUK @JPMorgan • Mentor @GPC_xyz

EU (CET)
Joined July 2021
@0xMichaelGordon
Michael Gordon
4 days
1/ Launching Dataset News — operator notes on datasets, evaluations, and licensing so teams can buy, benchmark, and ship with confidence. Signals > noise. Ops > hype. 👉
1
1
2
@0xMichaelGordon
Michael Gordon
17 hours
RT @TryBrickroad: If you own data, this week is the week to start monetizing it. Sign up to be a supplier at
0
3
0
@0xMichaelGordon
Michael Gordon
4 days
1/ New post: Judge swaps, drifting agent evals. Leaderboards move when the judge changes. Here’s how to keep web-agent evals stable enough to buy, benchmark, and ship against. 👉
0
0
0
@0xMichaelGordon
Michael Gordon
4 days
11/ Read the full post with citations + resources. 👉
0
0
0
@0xMichaelGordon
Michael Gordon
4 days
10/ Been bitten by a silent judge swap — or shipped a fix that worked? Reply with one thing that broke and one change that helped. I’ll compile lessons learned.
1
0
0
@0xMichaelGordon
Michael Gordon
4 days
9/ Open problem: a vendor-neutral judge-ID schema (family, snapshot, template, temperature) with signed metadata so scores are portable and auditable.
1
0
0
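One way the judge-ID record in 9/ could look, as a minimal sketch only. The field names, the canonical-JSON step, and the shared-key HMAC are all assumptions made for illustration; no such schema exists in the thread, and a real vendor-neutral scheme would more likely use public-key signatures so third parties can verify without the secret.

```python
import hashlib
import hmac
import json


def _canonical(meta: dict) -> bytes:
    # Sorted keys + fixed separators so the same metadata always serializes identically.
    return json.dumps(meta, sort_keys=True, separators=(",", ":")).encode("utf-8")


def sign_judge_metadata(meta: dict, key: bytes) -> dict:
    """Wrap judge metadata with an HMAC-SHA256 signature (hypothetical record layout)."""
    sig = hmac.new(key, _canonical(meta), hashlib.sha256).hexdigest()
    return {"judge": meta, "sig": sig}


def verify_judge_metadata(record: dict, key: bytes) -> bool:
    """Recompute the signature and compare in constant time."""
    expected = hmac.new(key, _canonical(record["judge"]), hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, record["sig"])


# Illustrative values only; none of these identifiers come from a real judge.
meta = {
    "family": "example-family",
    "snapshot": "2024-01-01",
    "template": "sha256:abc123",
    "temperature": 0.0,
}
record = sign_judge_metadata(meta, key=b"demo-key")
```

Any change to family, snapshot, template, or temperature after signing makes verification fail, which is what lets a score travel with an auditable judge identity.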
@0xMichaelGordon
Michael Gordon
4 days
8/ Ops checklist (save this):
□ Pin judge ID string
□ Style-controlled prompts
□ Static anchor + live suite
□ Rule+LLM ensemble
□ Monthly re-runs on same + swapped judge
□ URL-level logs
1
0
0
@0xMichaelGordon
Michael Gordon
4 days
7/ Report, don’t hide. Split results by judge family; expect rank flips when the judge swaps. Make evaluator effects visible.
1
0
0
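A small sketch of how the rank flips from 7/ could be surfaced. It assumes you already have per-system scores under two judges; the function name and the dict layout are hypothetical, and ties are treated as "no flip".

```python
from itertools import combinations


def rank_flips(scores_a: dict, scores_b: dict) -> list:
    """Return pairs of systems whose relative order reverses between two judges.

    scores_a / scores_b map system name -> score under judge A / judge B.
    Only systems scored by both judges are compared.
    """
    shared = sorted(set(scores_a) & set(scores_b))
    flips = []
    for x, y in combinations(shared, 2):
        diff_a = scores_a[x] - scores_a[y]
        diff_b = scores_b[x] - scores_b[y]
        if diff_a * diff_b < 0:  # strict reversal; ties are ignored
            flips.append((x, y))
    return flips


# Made-up scores for illustration.
judge_a = {"agent1": 0.80, "agent2": 0.70, "agent3": 0.50}
judge_b = {"agent1": 0.60, "agent2": 0.65, "agent3": 0.40}
flips = rank_flips(judge_a, judge_b)
```

Publishing this list next to the per-judge tables is one way to make evaluator effects visible instead of averaging them away.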
@0xMichaelGordon
Michael Gordon
4 days
6/ Rule + LLM judge. Pair a deterministic rule with an LLM judge. Calibrate on a human-rated slice and publish disagreement rates.
1
0
0
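The calibration step in 6/ could be sketched like this. Both judges are passed in as plain callables, so the LLM judge here is a stand-in that reads a precomputed verdict; the item fields ("final_url", "target_url", "llm_verdict", "human") are assumptions made for the example.

```python
from typing import Callable, Sequence


def disagreement_rates(
    items: Sequence[dict],
    rule_judge: Callable[[dict], bool],
    llm_judge: Callable[[dict], bool],
) -> dict:
    """Disagreement rates between rule judge, LLM judge, and human labels.

    Each item is assumed to carry a boolean "human" label from the rated slice.
    """
    n = len(items)
    if n == 0:
        raise ValueError("calibration slice is empty")
    rule_vs_llm = rule_vs_human = llm_vs_human = 0
    for item in items:
        r, l, h = rule_judge(item), llm_judge(item), item["human"]
        rule_vs_llm += r != l
        rule_vs_human += r != h
        llm_vs_human += l != h
    return {
        "rule_vs_llm": rule_vs_llm / n,
        "rule_vs_human": rule_vs_human / n,
        "llm_vs_human": llm_vs_human / n,
    }


# Tiny made-up slice: the rule checks the final URL, the "LLM" verdict is precomputed.
calibration = [
    {"final_url": "a", "target_url": "a", "llm_verdict": True, "human": True},
    {"final_url": "b", "target_url": "a", "llm_verdict": True, "human": False},
    {"final_url": "a", "target_url": "a", "llm_verdict": False, "human": True},
    {"final_url": "c", "target_url": "a", "llm_verdict": False, "human": False},
]
rates = disagreement_rates(
    calibration,
    rule_judge=lambda it: it["final_url"] == it["target_url"],
    llm_judge=lambda it: it["llm_verdict"],
)
```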
@0xMichaelGordon
Michael Gordon
4 days
5/ Blend static + live. Static pages = trendlines; live sites = reality. Report both. Tag item-level time sensitivity and freeze a monthly static slice.
1
0
0
@0xMichaelGordon
Michael Gordon
4 days
4/ Control style. Verbosity/markdown can inflate scores. Use style-controlled judge prompts; keep a content-only ablation for audits.
1
0
0
@0xMichaelGordon
Michael Gordon
4 days
3/ Pin the judge. Record model family/name, snapshot, temperature, template hash. Publish dual-judge deltas. Fail closed if a judge isn’t pinned.
1
0
0
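A minimal sketch of the pinning rule in 3/, assuming a judge is described by a small config object. The class, its fields, and the ID string format are hypothetical; the point is only that scoring refuses to start when the snapshot or template is missing.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class JudgeConfig:
    """Hypothetical record of everything that identifies a judge."""
    family: str
    name: str
    snapshot: Optional[str]  # dated/versioned snapshot; None means a floating alias
    temperature: float
    template: str            # full judge prompt template text

    @property
    def template_hash(self) -> str:
        # Short SHA-256 prefix of the template, enough to detect silent edits.
        return hashlib.sha256(self.template.encode("utf-8")).hexdigest()[:12]

    def judge_id(self) -> str:
        """Build the pinned ID string; fail closed if anything is unpinned."""
        if not self.snapshot:
            raise ValueError("judge snapshot is not pinned; refusing to score")
        if not self.template.strip():
            raise ValueError("judge template is empty; refusing to score")
        return (
            f"{self.family}/{self.name}@{self.snapshot}"
            f":t={self.temperature}:tpl={self.template_hash}"
        )


# Illustrative values only.
cfg = JudgeConfig("example-family", "judge-x", "2024-01-01", 0.0, "Rate the answer: {answer}")
pinned_id = cfg.judge_id()
```

Logging `pinned_id` with every score row makes a later judge swap show up as a changed string rather than an unexplained leaderboard move.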
@0xMichaelGordon
Michael Gordon
4 days
2/ A single score hides a system: task design, environment realism, judge prompt+snapshot. Style sensitivity (verbosity/markdown) and time sensitivity (live vs. static) nudge rankings.
1
0
0
@0xMichaelGordon
Michael Gordon
4 days
1/ New post: Judge swaps, drifting agent evals. Leaderboards move when the judge changes. Here’s how to keep web-agent evals stable enough to buy, benchmark, and ship against. 👉
1
1
2
@0xMichaelGordon
Michael Gordon
4 days
7/ First issue is live, thread below 👇
1
0
0
@0xMichaelGordon
Michael Gordon
4 days
6/ What to expect: one anchor post/week + short, tactical takes in between. No hype, just “try this, avoid that.”
1
0
0
@0xMichaelGordon
Michael Gordon
4 days
5/ My lens: I’m cofounder & CTO @TryBrickroad. We build agents that discover/evaluate/purchase datasets and broker data deals with top labs.
1
0
1
@0xMichaelGordon
Michael Gordon
4 days
4/ Who it’s for: dataset researchers, evaluators, and buyers building with agents/models. If your roadmap depends on scores or licenses, this is you.
1
0
0
@0xMichaelGordon
Michael Gordon
4 days
3/ The promise:
• Curated releases & benchmarks
• Risk/rights context (what you can actually use)
• What to run, pin, and report next
1
0
0
@0xMichaelGordon
Michael Gordon
4 days
2/ Why now: release velocity is insane, licensing is fragmented, benchmarks get gamed, and policies shift weekly. You need a clean yes/no on use and a stable way to compare.
1
0
0