Sutro
@sutro_sh
Followers: 11 · Following: 4 · Media: 2 · Statuses: 14
High-throughput batch inference engine for AI teams. Synthetic data generation, model evals, and unstructured data curation - one simple Python SDK.
pip install sutro
Joined July 2025
Earlier this year we partnered with SynthLabs (https://t.co/Y0R2ObW4wT), a post-training research lab, to generate a 351-billion-token synthetic dataset 10x faster and 80% cheaper. Read more in our case study:
sutro.sh
By partnering with Sutro, SynthLabs generated a 351 billion-token dataset with 10x greater speed and 80% lower costs, turning complex research ideas into production-grade results without infrastruc...
Takeaway: Offline LLM-as-a-Judge ensembles are practical.

Useful for:
- Model selection & benchmarking
- Prompt/parameter sweeps
- Agent trace/trajectory evaluations

Full guide w/ implementation details + dataset:
docs.sutro.sh
Learn three LLM-as-a-judge techniques to improve models and agents without human feedback
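For a rough picture of what one offline judging pass looks like, here is a minimal sketch assuming an OpenAI-compatible endpoint serving the judge model; the endpoint URL, rubric wording, and 1-5 scale are illustrative, not the exact setup from the guide.

```python
# Minimal offline LLM-as-a-Judge pass (illustrative, not the guide's exact setup).
# Assumes an OpenAI-compatible endpoint serving the judge model.
import re
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # hypothetical endpoint

JUDGE_PROMPT = """You are grading an ELI5 explanation of a research abstract.
Rate faithfulness and clarity on a 1-5 scale. Reply with the number only.

Abstract:
{abstract}

Explanation:
{explanation}
"""

def judge(abstract: str, explanation: str, judge_model: str) -> int:
    resp = client.chat.completions.create(
        model=judge_model,
        messages=[{"role": "user",
                   "content": JUDGE_PROMPT.format(abstract=abstract, explanation=explanation)}],
        temperature=0.0,   # deterministic scoring
        max_tokens=4,      # force a terse numeric reply
    )
    match = re.search(r"[1-5]", resp.choices[0].message.content)
    return int(match.group()) if match else 0  # 0 = unparseable, filter downstream
```

Scoring at temperature 0 with a forced numeric reply keeps each judge call cheap and trivially parseable at batch scale.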
Efficiency:
- 100k samples × 16 eval jobs (4 generator models, each scored by a cross-family jury) = 1.6M evals
- A few hours of wall-clock time
- ~$100 total cost

Comparable human annotation would be orders of magnitude slower + more expensive.
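Back-of-envelope check on those numbers, reading the 16 eval jobs as 4 generators × 4 judges over the same 100k samples (that split is inferred from the setup, not stated outright):

```python
# Sanity-check the eval-job arithmetic and the implied unit cost.
samples = 100_000
generators, judges = 4, 4          # judge count inferred from 16 jobs / 4 generators
eval_jobs = generators * judges    # 16 eval jobs
total_evals = samples * eval_jobs  # 1,600,000 judge calls
cost_per_eval = 100 / total_evals  # ~$0.00006 per eval at ~$100 total
print(eval_jobs, total_evals, round(cost_per_eval, 7))
```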
Findings:
- GPT-OSS 20B emerged as the most consistent generator across judges
- Llama 3.1 8B performed worst
- Judge harshness varied: GPT-OSS 120B strictest, Llama 70B most lenient
Why ensemble? A single evaluator model can carry bias (family similarity, RLHF skew, etc.). Cross-family “jury” scoring mitigates this, approximating more robust evals.
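A minimal sketch of combining per-judge scores into a jury verdict; the mean/median aggregation, 1-5 scale, and judge names below are assumptions, not necessarily the guide's exact method.

```python
# Aggregate scores from judges in different model families.
# The median damps any single overly strict or lenient judge.
from statistics import mean, median

def jury_score(scores_by_judge: dict[str, int]) -> dict[str, float]:
    scores = list(scores_by_judge.values())
    return {
        "mean": mean(scores),      # sensitive to outlier judges
        "median": median(scores),  # robust to one harsh/lenient judge
    }

# Example: one sample scored by a cross-family jury (hypothetical scores).
print(jury_score({"gpt-oss-120b": 3, "llama-3.1-70b": 5, "qwen-72b": 4, "gemma-27b": 4}))
```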
Setup:
- Task: generate ELI5 explanations of 100k arXiv abstracts
- Models: Llama, Qwen, Gemma, GPT-OSS (smaller models as generators)
- Evaluators: larger models from other families, forming an ensemble “jury”
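As a sketch of how that generation grid could be laid out (the model IDs and abstracts below are placeholders; the actual runs are submitted as Sutro batch jobs):

```python
# Pair every abstract with every generator model to form the generation grid.
from itertools import product

GENERATORS = ["llama-3.1-8b", "qwen2.5-7b", "gemma-2-9b", "gpt-oss-20b"]  # assumed IDs
abstracts = ["We study ...", "We propose ..."]  # stand-in for the 100k arXiv abstracts

jobs = [
    {"model": model,
     "prompt": f"Explain this arXiv abstract like I'm five:\n\n{abstract}"}
    for abstract, model in product(abstracts, GENERATORS)
]
print(len(jobs))  # at full scale: 100k abstracts x 4 generators = 400k generations
```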
Evaluation options are limited:
- Human annotation is slow, expensive, inconsistent
- Online user feedback is valuable, but not available pre-scale
- Vibes don't cut it

Offline LLM-as-a-Judge provides a reproducible, scalable alternative.
Evaluating LLMs, apps, and agents is notoriously hard: subjective outputs, no ground truth, and costly feedback loops. We ran a large-scale test of LLM-as-a-Judge (ensemble style) on 100k samples, across 4 model families, for ~$100. Results + code 👇
New guide up: Large Scale Embedding Generation with Qwen3 0.6B

Build a semantic search engine over Apple's patent corpus (4M+ chunks) in 44 minutes for $14.80 using the Sutro SDK. Full guide:
docs.sutro.sh
Easily (and inexpensively) create a semantic search index of over 4M document chunks from Apple's patent literature, using Sutro
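At toy scale, the same idea can be sketched with sentence-transformers; the checkpoint ID is the public Qwen3-Embedding-0.6B release, the chunks and query are stand-ins, and the guide's 4M-chunk run goes through the Sutro SDK rather than this local loop.

```python
# Embed chunks with Qwen3-Embedding-0.6B, then rank by cosine similarity.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Qwen/Qwen3-Embedding-0.6B")

chunks = [
    "A hinge mechanism for a foldable display housing.",
    "A method for low-latency on-device speech recognition.",
]
chunk_vecs = model.encode(chunks, normalize_embeddings=True)

query_vec = model.encode(["foldable phone hinge"], normalize_embeddings=True)
scores = (chunk_vecs @ query_vec.T).ravel()   # cosine similarity (unit vectors)
best = int(np.argmax(scores))
print(chunks[best], float(scores[best]))
```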
Generate 20,000 synthetic product reviews in under an hour ⏱️
✅ Few dozen lines of code
✅ <$2 total cost
✅ No infra setup

Dataset: https://t.co/IhrGVayd3R
Full guide 👇
huggingface.co
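The prompt side of a run like this is simple enough to sketch; the product catalog, star scale, and wording below are made up, and the actual batching/inference is what the guide hands off to Sutro.

```python
# Build 20,000 synthetic-review prompts to submit as one batch job.
import random

random.seed(0)
PRODUCTS = ["wireless earbuds", "standing desk", "espresso machine"]  # made-up catalog
STARS = [1, 2, 3, 4, 5]

def review_prompt(product: str, stars: int) -> str:
    return (f"Write a realistic {stars}-star customer review for a {product}. "
            "Keep it under 80 words and mention one concrete detail.")

batch = [review_prompt(random.choice(PRODUCTS), random.choice(STARS))
         for _ in range(20_000)]
print(len(batch), batch[0])
```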