Sutro
@sutro_sh
Followers: 11 · Following: 4 · Media: 2 · Statuses: 14
High-throughput batch inference engine for AI teams. Synthetic data generation, model evals, and unstructured data curation - one simple Python SDK.
pip install sutro
Joined July 2025
Earlier this year we partnered with SynthLabs (https://t.co/Y0R2ObW4wT), a post-training research lab, to generate a 351-billion-token synthetic dataset 10x faster and 80% cheaper. Read more in our case study:
sutro.sh
By partnering with Sutro, SynthLabs generated a 351 billion-token dataset with 10x greater speed and 80% lower costs, turning complex research ideas into production-grade results without infrastruc...
Takeaway: Offline LLM-as-a-Judge ensembles are practical.

Useful for:
- Model selection & benchmarking
- Prompt/parameter sweeps
- Agent trace/trajectory evaluations

Full guide w/ implementation details + dataset:
docs.sutro.sh
Learn three LLM-as-a-judge techniques to improve models and agents without human feedback
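For a rough picture of what one offline judging pass looks like, here is a minimal sketch assuming an OpenAI-compatible endpoint serving the judge model; the endpoint URL, rubric wording, and 1-5 scale are illustrative, not the exact setup from the guide.

```python
# Minimal offline LLM-as-a-Judge pass (illustrative, not the guide's exact setup).
# Assumes an OpenAI-compatible endpoint serving the judge model.
import re
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # hypothetical endpoint

JUDGE_PROMPT = """You are grading an ELI5 explanation of a research abstract.
Rate faithfulness and clarity on a 1-5 scale. Reply with the number only.

Abstract:
{abstract}

Explanation:
{explanation}
"""

def judge(abstract: str, explanation: str, judge_model: str) -> int:
    resp = client.chat.completions.create(
        model=judge_model,
        messages=[{"role": "user",
                   "content": JUDGE_PROMPT.format(abstract=abstract, explanation=explanation)}],
        temperature=0.0,   # deterministic scoring
        max_tokens=4,      # force a terse numeric reply
    )
    match = re.search(r"[1-5]", resp.choices[0].message.content)
    return int(match.group()) if match else 0  # 0 = unparseable, filter downstream
```

Scoring at temperature 0 with a forced numeric reply keeps each judge call cheap and trivially parseable at batch scale.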
Efficiency:
- 100k samples × 16 eval jobs (4 generator models, each scored by a cross-family jury) = 1.6M evals
- A few hours of wall-clock time
- ~$100 total cost

Comparable human annotation would be orders of magnitude slower + more expensive.
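Back-of-envelope check on those numbers, reading the 16 eval jobs as 4 generators × 4 judges over the same 100k samples (that split is inferred from the setup, not stated outright):

```python
# Sanity-check the eval-job arithmetic and the implied unit cost.
samples = 100_000
generators, judges = 4, 4          # judge count inferred from 16 jobs / 4 generators
eval_jobs = generators * judges    # 16 eval jobs
total_evals = samples * eval_jobs  # 1,600,000 judge calls
cost_per_eval = 100 / total_evals  # ~$0.00006 per eval at ~$100 total
print(eval_jobs, total_evals, round(cost_per_eval, 7))
```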
Findings:
- GPT-OSS 20B emerged as the most consistent generator across judges
- Llama 3.1 8B performed worst
- Judge harshness varied: GPT-OSS 120B strictest, Llama 70B most lenient
Why ensemble? A single evaluator model can carry bias (family similarity, RLHF skew, etc.). Cross-family “jury” scoring mitigates this, approximating more robust evals.
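A minimal sketch of combining per-judge scores into a jury verdict; the mean/median aggregation, 1-5 scale, and judge names below are assumptions, not necessarily the guide's exact method.

```python
# Aggregate scores from judges in different model families.
# The median damps any single overly strict or lenient judge.
from statistics import mean, median

def jury_score(scores_by_judge: dict[str, int]) -> dict[str, float]:
    scores = list(scores_by_judge.values())
    return {
        "mean": mean(scores),      # sensitive to outlier judges
        "median": median(scores),  # robust to one harsh/lenient judge
    }

# Example: one sample scored by a cross-family jury (hypothetical scores).
print(jury_score({"gpt-oss-120b": 3, "llama-3.1-70b": 5, "qwen-72b": 4, "gemma-27b": 4}))
```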
Setup:
- Task: generate ELI5 explanations of 100k arXiv abstracts
- Models: Llama, Qwen, Gemma, GPT-OSS (smaller models as generators)
- Evaluators: larger models from other families, forming an ensemble “jury”
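As a sketch of how that generation grid could be laid out (the model IDs and abstracts below are placeholders; the actual runs are submitted as Sutro batch jobs):

```python
# Pair every abstract with every generator model to form the generation grid.
from itertools import product

GENERATORS = ["llama-3.1-8b", "qwen2.5-7b", "gemma-2-9b", "gpt-oss-20b"]  # assumed IDs
abstracts = ["We study ...", "We propose ..."]  # stand-in for the 100k arXiv abstracts

jobs = [
    {"model": model,
     "prompt": f"Explain this arXiv abstract like I'm five:\n\n{abstract}"}
    for abstract, model in product(abstracts, GENERATORS)
]
print(len(jobs))  # at full scale: 100k abstracts x 4 generators = 400k generations
```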
Evaluation options are limited:
- Human annotation is slow, expensive, inconsistent
- Online user feedback is valuable, but not available pre-scale
- Vibes don't cut it

Offline LLM-as-a-Judge provides a reproducible, scalable alternative.
Evaluating LLMs, apps, and agents is notoriously hard: subjective outputs, no ground truth, and costly feedback loops. We ran a large-scale test of LLM-as-a-Judge (ensemble style) on 100k samples, across 4 model families, for ~$100. Results + code 👇
New guide up: Large Scale Embedding Generation with Qwen3 0.6B

Build a semantic search engine over Apple's patent corpus (4M+ chunks) in 44 minutes for $14.80 using the Sutro SDK. Full guide:
docs.sutro.sh
Easily (and inexpensively) create a semantic search index of over 4M document chunks from Apple's patent literature, using Sutro
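At toy scale, the same idea can be sketched with sentence-transformers; the checkpoint ID is the public Qwen3-Embedding-0.6B release, the chunks and query are stand-ins, and the guide's 4M-chunk run goes through the Sutro SDK rather than this local loop.

```python
# Embed chunks with Qwen3-Embedding-0.6B, then rank by cosine similarity.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Qwen/Qwen3-Embedding-0.6B")

chunks = [
    "A hinge mechanism for a foldable display housing.",
    "A method for low-latency on-device speech recognition.",
]
chunk_vecs = model.encode(chunks, normalize_embeddings=True)

query_vec = model.encode(["foldable phone hinge"], normalize_embeddings=True)
scores = (chunk_vecs @ query_vec.T).ravel()   # cosine similarity (unit vectors)
best = int(np.argmax(scores))
print(chunks[best], float(scores[best]))
```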
Generate 20,000 synthetic product reviews in under an hour ⏱️
✅ Few dozen lines of code
✅ <$2 total cost
✅ No infra setup

Dataset: https://t.co/IhrGVayd3R
Full guide 👇
huggingface.co
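The prompt side of a run like this is simple enough to sketch; the product catalog, star scale, and wording below are made up, and the actual batching/inference is what the guide hands off to Sutro.

```python
# Build 20,000 synthetic-review prompts to submit as one batch job.
import random

random.seed(0)
PRODUCTS = ["wireless earbuds", "standing desk", "espresso machine"]  # made-up catalog
STARS = [1, 2, 3, 4, 5]

def review_prompt(product: str, stars: int) -> str:
    return (f"Write a realistic {stars}-star customer review for a {product}. "
            "Keep it under 80 words and mention one concrete detail.")

batch = [review_prompt(random.choice(PRODUCTS), random.choice(STARS))
         for _ in range(20_000)]
print(len(batch), batch[0])
```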