Michael Gordon @0xMichaelGordon X Profile

Michael Gordon

@0xMichaelGordon

Followers

1K

Following

2K

Media

12

Statuses

498

CTO @TryBrickroad • I run Dataset News—operator notes on datasets, evals & licensing → https://t.co/VfjAL5uKta • Prev AI R&D @GovUK @JPMorgan • Mentor @GPC_xyz

EU (CET)

Joined July 2021

Michael Gordon

@0xMichaelGordon

4 days

1/ Launching Dataset News — operator notes on datasets, evaluations, and licensing so teams can buy, benchmark, and ship with confidence. Signals > noise. Ops > hype. 👉

1

2

Michael Gordon

@0xMichaelGordon

17 hours

RT @TryBrickroad: If you own data, this week is the week to start monetizing it. Sign up to be a supplier at

0

3

0

Michael Gordon

@0xMichaelGordon

4 days

11/ Read the full post with citations + resources. 👉

0

Michael Gordon

@0xMichaelGordon

4 days

10/ Been bitten by a silent judge swap — or shipped a fix that worked? Reply with one thing that broke and one change that helped. I’ll compile lessons learned.

1

0

Michael Gordon

@0xMichaelGordon

4 days

9/ Open problem: a vendor-neutral judge-ID schema (family, snapshot, template, temperature) with signed metadata so scores are portable and auditable.

1

0
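One way the judge-ID record in 9/ could look, as a rough sketch only. The four field names come from the post; the JSON layout, the HMAC signing, and the demo key are assumptions for illustration. A portable scheme would more likely use public-key signatures so third parties can verify without a shared secret.

```python
import hashlib
import hmac
import json


def canonical(record: dict) -> bytes:
    # Sorted keys and fixed separators so the same record always yields the same bytes.
    return json.dumps(record, sort_keys=True, separators=(",", ":")).encode("utf-8")


def sign(record: dict, key: bytes) -> str:
    # HMAC-SHA256 over the canonical bytes (shared-secret signing, for illustration only).
    return hmac.new(key, canonical(record), hashlib.sha256).hexdigest()


def verify(record: dict, signature: str, key: bytes) -> bool:
    # Constant-time comparison of the recomputed and the claimed signature.
    return hmac.compare_digest(sign(record, key), signature)


# Hypothetical values; none of these identify a real judge.
KEY = b"demo-key-not-for-production"
record = {
    "family": "example-family",
    "snapshot": "2025-01-01",
    "template": "sha256:0123abcd",
    "temperature": 0.0,
}
signature = sign(record, KEY)
tampered = dict(record, temperature=0.7)
print(verify(record, signature, KEY), verify(tampered, signature, KEY))
```

Any change to a field, such as a different temperature, invalidates the signature, which is what makes a reported score auditable against the judge that produced it.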

Michael Gordon

@0xMichaelGordon

4 days

8/ Ops checklist (save this):
□ Pin judge ID string
□ Style-controlled prompts
□ Static anchor + live suite
□ Rule + LLM ensemble
□ Monthly re-runs on same + swapped judge
□ URL-level logs

1

0

Michael Gordon

@0xMichaelGordon

4 days

7/ Report, don’t hide. Split results by judge family; expect rank flips when the judge swaps. Make evaluator effects visible.

1

0
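A small sketch of the per-judge reporting in 7/: rank the same agents under two judges and list the pairs whose order flips. The agent names and scores are made up for illustration, and `rank_flips` is a hypothetical helper, not something from the post.

```python
from itertools import combinations


def ranking(scores: dict) -> list:
    # Agents ordered from highest to lowest score.
    return sorted(scores, key=scores.get, reverse=True)


def rank_flips(scores_a: dict, scores_b: dict) -> list:
    # Pairs of agents whose relative order differs between the two judges.
    common = sorted(set(scores_a) & set(scores_b))
    flips = []
    for x, y in combinations(common, 2):
        diff_a = scores_a[x] - scores_a[y]
        diff_b = scores_b[x] - scores_b[y]
        if diff_a * diff_b < 0:  # opposite signs: the pair swapped places
            flips.append((x, y))
    return flips


# Illustrative numbers only.
judge_a = {"agent_x": 0.71, "agent_y": 0.68, "agent_z": 0.50}
judge_b = {"agent_x": 0.64, "agent_y": 0.69, "agent_z": 0.48}

print(ranking(judge_a))              # ['agent_x', 'agent_y', 'agent_z']
print(ranking(judge_b))              # ['agent_y', 'agent_x', 'agent_z']
print(rank_flips(judge_a, judge_b))  # [('agent_x', 'agent_y')]
```

Publishing both rankings plus the flipped pairs keeps the evaluator effect visible instead of folding it into one number.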

Michael Gordon

@0xMichaelGordon

4 days

6/ Rule + LLM judge. Pair a deterministic rule with an LLM judge. Calibrate on a human-rated slice and publish disagreement rates.

1

0
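The disagreement rates in 6/ could be computed roughly like this. The verdict lists are invented for illustration, and treating each verdict as a pass/fail boolean is an assumption; the post does not specify a format.

```python
def disagreement_rate(a: list, b: list) -> float:
    # Fraction of items on which two sets of verdicts differ.
    if len(a) != len(b) or not a:
        raise ValueError("verdict lists must be non-empty and the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


# Illustrative pass/fail verdicts on a four-item human-rated slice.
rule_verdicts = [True, True, False, False]
llm_verdicts = [True, False, False, True]
human_verdicts = [True, True, False, True]

print(disagreement_rate(rule_verdicts, llm_verdicts))    # 0.5
print(disagreement_rate(rule_verdicts, human_verdicts))  # 0.25
print(disagreement_rate(llm_verdicts, human_verdicts))   # 0.25
```

Reporting all three numbers shows how often the two automated judges split and how far each sits from the human ratings.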

Michael Gordon

@0xMichaelGordon

4 days

5/ Blend static + live. Static pages = trendlines; live sites = reality. Report both. Tag item-level time sensitivity and freeze a monthly static slice.

1

0
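A minimal sketch of the tagging and freezing in 5/, assuming each eval item carries an `id` and a boolean `time_sensitive` tag. Both field names and the manifest layout are hypothetical.

```python
import hashlib
import json


def freeze_static_slice(items: list, month: str) -> dict:
    # Keep only items not tagged as time sensitive, in a stable order.
    ids = sorted(item["id"] for item in items if not item["time_sensitive"])
    # A digest of the ID list lets later runs confirm they used the same slice.
    digest = hashlib.sha256(json.dumps(ids).encode("utf-8")).hexdigest()
    return {"month": month, "item_ids": ids, "sha256": digest}


# Illustrative items only.
items = [
    {"id": "task-003", "time_sensitive": True},   # depends on a live site
    {"id": "task-002", "time_sensitive": False},
    {"id": "task-001", "time_sensitive": False},
]
manifest = freeze_static_slice(items, "2025-01")
print(manifest["item_ids"])  # ['task-001', 'task-002']
```

Scores on the frozen slice give the trendline; scores on the time-sensitive items are reported separately as the live view.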

Michael Gordon

@0xMichaelGordon

4 days

4/ Control style. Verbosity/markdown can inflate scores. Use style-controlled judge prompts; keep a content-only ablation for audits.

1

0

Michael Gordon

@0xMichaelGordon

4 days

3/ Pin the judge. Record model family/name, snapshot, temperature, template hash. Publish dual-judge deltas. Fail closed if a judge isn’t pinned.

1

0
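A minimal sketch of the pinning in 3/, assuming a judge is described by the four fields the post names. `JudgeConfig`, `judge_id`, and the ID string layout are hypothetical, not an established format.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class JudgeConfig:
    family: Optional[str] = None
    snapshot: Optional[str] = None
    temperature: Optional[float] = None
    template: Optional[str] = None


def judge_id(cfg: JudgeConfig) -> str:
    # Fail closed: refuse to produce an ID (and so a score) for an unpinned judge.
    missing = [name for name in ("family", "snapshot", "temperature", "template")
               if getattr(cfg, name) is None]
    if missing:
        raise ValueError("judge not pinned, missing: " + ", ".join(missing))
    # Hash the prompt template so wording changes show up in the ID.
    template_hash = hashlib.sha256(cfg.template.encode("utf-8")).hexdigest()[:12]
    return f"{cfg.family}/{cfg.snapshot}/t={cfg.temperature}/tpl={template_hash}"


# Hypothetical values for illustration.
pinned = JudgeConfig("example-family", "2025-01-01", 0.0, "Score the answer from 1 to 5.")
print(judge_id(pinned))
```

Calling `judge_id(JudgeConfig(family="example-family"))` raises `ValueError`, so a run with an unpinned judge stops before any score is recorded.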

Michael Gordon

@0xMichaelGordon

4 days

2/ A single score hides a system: task design, environment realism, judge prompt+snapshot. Style sensitivity (verbosity/markdown) and time sensitivity (live vs. static) nudge rankings.

1

0

Michael Gordon

@0xMichaelGordon

4 days

1/ New post: Judge swaps, drifting agent evals. Leaderboards move when the judge changes. Here’s how to keep web-agent evals stable enough to buy, benchmark, and ship against. 👉

1

2

Michael Gordon

@0xMichaelGordon

4 days

7/ First issue is live: thread below 👇

1

0

Michael Gordon

@0xMichaelGordon

4 days

6/ What to expect: one anchor post/week + short, tactical takes in between. No hype, just “try this, avoid that.”

1

0

Michael Gordon

@0xMichaelGordon

4 days

5/ My lens: I’m cofounder & CTO @TryBrickroad. We build agents that discover/evaluate/purchase datasets and broker data deals with top labs.

1

0

1

Michael Gordon

@0xMichaelGordon

4 days

4/ Who it’s for: dataset researchers, evaluators, and buyers building with agents/models. If your roadmap depends on scores or licenses, this is you.

1

0

Michael Gordon

@0xMichaelGordon

4 days

3/ The promise:
• Curated releases & benchmarks
• Risk/rights context (what you can actually use)
• What to run, pin, and report next

1

0

Michael Gordon

@0xMichaelGordon

4 days

2/ Why now: release velocity is insane, licensing is fragmented, benchmarks get gamed, and policies shift weekly. You need a clean yes/no on use and a stable way to compare.

1

0