Boris Shaposhnikov
@borisshapa
Followers: 56 · Following: 7 · Media: 17 · Statuses: 58
10/ 🙌 cc @yule_gan, @ericmitchellai, @winniethexu, @_lewtun, @krasul, @lvwerra, @tomekkorbak, @archit_sharma97, @sea_snell, @rm_rafailov, @Learnius, @yumeng0818, @jiwoohong98, @KarelDoostrlnck, @chenhuay17, @TengX6
9/ TL;DR: ESSA turns alignment into a simple, stable, highly parallel evaluation loop. Competitive quality, dramatically lower engineering overhead, and much faster time-to-quality. Paper:
8/ 🧪 Broad evaluation. ESSA spans 7 advanced math benchmarks (MATH500, MinervaMath, OlympiadBench, AIME’24/’25, AMC’23) on Qwen2.5-32B/72B and, in the paper, also non-math tasks like IFEval & HelpSteer with richer reward signals.
7/ 🔢 Precision robustness: BF16 → INT8 → INT4 costs <1% accuracy for ESSA (e.g. 0.847 → 0.844 → 0.838 for Qwen2.5-32B on PRM800K).
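A rough illustration of what those three precisions look like in practice, using Hugging Face transformers + bitsandbytes; the checkpoint name and quantization settings here are my assumptions, not the paper's exact recipe:

```python
# Illustrative loading configs for the three precisions mentioned above
# (transformers + bitsandbytes). Checkpoint name is an assumption.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

model_id = "Qwen/Qwen2.5-32B-Instruct"  # assumed checkpoint, for illustration only

bf16 = dict(torch_dtype=torch.bfloat16)
int8 = dict(quantization_config=BitsAndBytesConfig(load_in_8bit=True))
int4 = dict(quantization_config=BitsAndBytesConfig(
    load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16))

# ESSA only needs forward passes, so the same search loop can run on any of these.
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", **int8)
```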
6/ ⏱️ Scaling wins: On Qwen2.5-32B / PRM800K, ESSA reaches near-optimal accuracy 2× faster on 16 GPUs and 6× faster on 128 GPUs than GRPO — minimal comms (just seeds + rewards) makes parallelism easy.
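A minimal sketch of why the communication stays tiny, in the spirit of classic seed-sharing parallel ES (the exact ESSA/CMA-ES protocol may differ): each worker rebuilds its perturbation from a shared seed, so only (seed, reward) pairs cross the network.

```python
# Seed-sharing ES sketch: workers exchange (seed, reward) pairs, never weight tensors,
# because every node can regenerate the same perturbation from the seed.
import numpy as np

DIM, SIGMA = 16, 0.05  # e.g. a rank-16 vector of singular values

def worker(theta: np.ndarray, seed: int, evaluate) -> tuple[int, float]:
    noise = np.random.default_rng(seed).standard_normal(DIM)
    return seed, evaluate(theta + SIGMA * noise)  # a few bytes go back over the wire

def server_update(theta: np.ndarray, results: list[tuple[int, float]], lr: float = 0.02) -> np.ndarray:
    rewards = np.array([r for _, r in results])
    rewards = (rewards - rewards.mean()) / (rewards.std() + 1e-8)  # normalize rewards
    grad = np.zeros(DIM)
    for (seed, _), r in zip(results, rewards):
        grad += r * np.random.default_rng(seed).standard_normal(DIM)  # replay noise from seed
    return theta + lr / (len(results) * SIGMA) * grad
```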
5/ 📊 Headline results (vs GRPO): • Qwen2.5-Math-7B: +12.6% on GSM8K, +14.8% on PRM800K • LLaMA-3.1-8B: +22.5% on IFEval
4/ 🔬 As @johnschulman2 & @thinkymachines showed, LoRA can match Full Fine-Tuning in post-training — a strong, efficient baseline. ESSA goes further: it tunes only the singular values — smaller, faster, fully gradient-free.
3/ 💡 Core idea: optimize only the singular values of LoRA adapters (SVD-LoRA) with CMA-ES. That shrinks the search space by >1000× while keeping the stability and gradient-free nature of Evolution Strategies.
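A minimal sketch of that idea, assuming a placeholder reward and random SVD bases standing in for a real LoRA delta; the `cma` package provides the CMA-ES loop:

```python
# Sketch: freeze the SVD bases of a LoRA delta and let CMA-ES search only over its
# singular values. `reward` is a placeholder; in ESSA it would be forward-only
# generation plus a verifier/reward score.
import numpy as np
import cma

rank = 16
U = np.random.randn(4096, rank)        # stand-ins for the SVD bases of a LoRA delta
Vt = np.random.randn(rank, 4096)
s0 = np.abs(np.random.randn(rank))     # initial singular values

def reward(s: np.ndarray) -> float:
    delta = U @ np.diag(s) @ Vt
    # ... patch `delta` into the frozen base model, generate, score the outputs ...
    return float(-np.linalg.norm(delta))  # toy objective, for illustration only

es = cma.CMAEvolutionStrategy(s0, 0.1, {"maxiter": 50})  # search space: just `rank` numbers
while not es.stop():
    candidates = es.ask()                                # propose singular-value vectors
    es.tell(candidates, [-reward(np.asarray(c)) for c in candidates])  # CMA-ES minimizes
best_singular_values = es.result.xbest
```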
2/ ⚙️ Why this matters: PPO/GRPO pipelines are complex (actor/critic, long rollouts, sync, memory). ESSA uses only forward passes + black-box optimization. No backprop. Works even in INT8/INT4.
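A hedged sketch of the evaluation-only inner loop this implies: generate with a frozen model under `no_grad` and return a scalar reward. Here `model`, `tok`, and `reward_fn` are assumed to come from elsewhere (e.g. the loading snippet above).

```python
# Evaluation-only inner loop: no optimizer state, no critic, no backward pass.
import torch

@torch.no_grad()  # nothing in the loop ever calls .backward()
def score_candidate(prompt: str, model, tok, reward_fn) -> float:
    inputs = tok(prompt, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=256, do_sample=True, temperature=0.7)
    completion = tok.decode(out[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True)
    return float(reward_fn(prompt, completion))  # black-box scalar feedback, nothing else
```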
1/ 🚀 We’re releasing ESSA: Evolutionary Strategies for Scalable Alignment — a gradient-free, inference-only alternative to RLHF that makes aligning LLMs faster, simpler, and cheaper.👇
1/ @johnschulman2 and @thinkymachines showed that LoRA can match full fine-tuning in many post-training regimes. In our earlier paper, we went even tighter: we trained only steering vectors. That's 131K extra params on Llama3.1-8B-Instruct, and it matches full fine-tuning on 6/7 of the models we studied.
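A hedged sketch of that steering-vector setup: one trainable vector per transformer layer, added to the residual stream, with everything else frozen (32 layers × 4096 dims = 131,072 params on Llama-3.1-8B). Hook placement and training details are assumptions, not the paper's exact recipe.

```python
# One trainable steering vector per layer, added to the residual stream via forward hooks.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
model.requires_grad_(False)  # base model stays frozen

steering = torch.nn.ParameterList(
    torch.nn.Parameter(torch.zeros(model.config.hidden_size))
    for _ in range(model.config.num_hidden_layers)
)

def make_hook(vec):
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + vec  # shift the residual stream
        return (hidden, *output[1:]) if isinstance(output, tuple) else hidden
    return hook

for layer, vec in zip(model.model.layers, steering):
    layer.register_forward_hook(make_hook(vec))

print(sum(p.numel() for p in steering))  # 32 * 4096 = 131072 extra parameters
```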
[1/9] VLMs caption well, but no simple RL recipe trains them in multi-step sims and shows gains. VL-DAC: a lightweight RL algorithm on top of a VLM + cheap sims yields agents that finish long quests and transfer to skill-specific benchmarks with no tuning. HF link: https://t.co/HIGDN7NisA
9/ All our findings and full experimental results are available in the paper. If you’re using offline alignment methods, it’s important to consider how your chosen objective interacts with prompt bias. 🔗
arxiv.org
Direct Alignment Algorithms (DAAs) offer a simpler way to language model alignment than traditional RLHF by directly optimizing policies. While DAAs differ in their use of SFT (one-stage vs....
8/ Takeaway: Pairwise objectives outperform pointwise on tasks where the model's capacity is just enough to “unlearn” prompt bias, but not enough to tackle harder examples. This nuance explains why previous “best method” claims depend on overlooked details like setup and bias.
7/ So removing prompt bias isn't inherently “good” or “bad”; it depends on what you want your model to do. Depending on the task, you might deliberately want to keep the bias or remove it.
6/ We confirmed this with new controlled experiments: * In data with no prompt bias, all methods converge similarly. * In biased setups, pointwise methods reduce bias more, but this can come at the cost of less robust alignment on harder prompts.
5/ * Pairwise objectives just require that the preferred output ranks higher, without flattening the whole probability landscape, so the model can still focus on genuinely tricky prompts.
4/ Here’s why: * Pointwise methods try to “unlearn” prompt bias by forcing probabilities for preferred responses toward 1, and rejected ones toward 0, for each prompt. This “zeroing out” eats up capacity that could be spent on harder cases.
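A toy contrast of the two objective families, on per-sequence log-probs; the function names and exact forms are illustrative, not any specific paper's implementation.

```python
# Toy pointwise vs pairwise losses on per-sequence log-probs (logp_* from the policy,
# ref_* from a frozen reference model). Forms are illustrative only.
import torch
import torch.nn.functional as F

def pointwise_loss(logp_chosen, logp_rejected):
    # Pushes P(chosen) toward 1 and P(rejected) toward 0 for every prompt,
    # which also spends capacity on flattening the prompt-conditional distribution.
    return -(logp_chosen + torch.log1p(-torch.exp(logp_rejected))).mean()

def pairwise_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected, beta=0.1):
    # DPO-style: only asks that the chosen response outranks the rejected one
    # relative to the reference, leaving the rest of the landscape alone.
    margin = (logp_chosen - ref_chosen) - (logp_rejected - ref_rejected)
    return -F.logsigmoid(beta * margin).mean()
```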