Boris Shaposhnikov (@borisshapa)
Followers: 56 · Following: 7 · Media: 17 · Statuses: 58

ai research

Joined February 2022
@borisshapa
Boris Shaposhnikov
3 months
9/ 🧭 TL;DR: ESSA turns alignment into a simple, stable, highly parallel evaluation loop. Competitive quality, dramatically lower engineering overhead, and much faster time-to-quality. Paper:
@borisshapa
Boris Shaposhnikov
3 months
8/ 🧪 Broad evaluation. ESSA spans 7 advanced math benchmarks (MATH500, MinervaMath, OlympiadBench, AIME’24/’25, AMC’23) on Qwen2.5-32B/72B and, in the paper, also non-math tasks like IFEval & HelpSteer with richer reward signals.
@borisshapa
Boris Shaposhnikov
3 months
7/ 🔢 Precision robustness: BF16 → INT8 → INT4 costs <1% accuracy for ESSA (e.g. 0.847 → 0.844 → 0.838 for Qwen2.5-32B on PRM800K).
@borisshapa
Boris Shaposhnikov
3 months
6/ ⏱️ Scaling wins: On Qwen2.5-32B / PRM800K, ESSA reaches near-optimal accuracy 2× faster on 16 GPUs and 6× faster on 128 GPUs than GRPO — minimal comms (just seeds + rewards) makes parallelism easy.
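A toy illustration of the "just seeds + rewards" communication pattern in 6/: each worker rebuilds its perturbation locally from an integer seed and sends back a single float, so nothing parameter-sized ever crosses the network. The update rule below is a plain OpenAI-style ES step rather than the CMA-ES used in ESSA, and the reward function and dimensions are made-up stand-ins.

```python
# Toy sketch of seed-only ES communication: the coordinator ships integers (seeds),
# workers ship back floats (rewards); parameter tensors never leave the machine.
# The reward function and dimensions are stand-ins, not the paper's setup.
import numpy as np

DIM, SIGMA, LR, POP = 16, 0.1, 0.05, 8
theta = np.zeros(DIM)                                    # shared parameters (e.g. singular values)

def reward(params):
    return -np.square(params - 1.0).sum()                # stand-in objective, optimum at all-ones

def worker(theta, seed):
    """Would run on a remote GPU: rebuild the perturbation from the seed, return one float."""
    eps = np.random.default_rng(seed).standard_normal(DIM)
    return reward(theta + SIGMA * eps)

for step in range(300):
    seeds = range(step * POP, (step + 1) * POP)          # all the coordinator broadcasts
    rewards = np.array([worker(theta, s) for s in seeds])  # all that comes back: POP floats
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    # the coordinator re-derives each worker's perturbation from its seed
    grad_est = sum(a * np.random.default_rng(s).standard_normal(DIM)
                   for a, s in zip(adv, seeds)) / (POP * SIGMA)
    theta += LR * grad_est                               # gradient-free w.r.t. the model itself
print("distance to optimum:", float(np.linalg.norm(theta - 1.0)))
```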
@borisshapa
Boris Shaposhnikov
3 months
5/ 📊 Headline results (vs GRPO): • Qwen2.5-Math-7B: +12.6% on GSM8K, +14.8% on PRM800K • LLaMA-3.1-8B: +22.5% on IFEval
@borisshapa
Boris Shaposhnikov
3 months
4/ 🔬 As @johnschulman2 & @thinkymachines showed, LoRA can match Full Fine-Tuning in post-training — a strong, efficient baseline. ESSA goes further: it tunes only the singular values — smaller, faster, fully gradient-free.
@borisshapa
Boris Shaposhnikov
3 months
3/ 💡 Core idea: optimize only the singular values of LoRA adapters (SVD-LoRA) with CMA-ES. That shrinks the search space by >1000× while keeping the stability and gradient-free nature of Evolution Strategies.
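A minimal, self-contained sketch of the SVD-LoRA + CMA-ES idea from 3/: factor a low-rank update as U diag(s) V^T, freeze U and V, and let CMA-ES search only over the singular values s. The shapes, base weight, and reward below are toy stand-ins rather than the paper's models or benchmarks, and the `cma` package is assumed.

```python
# Toy sketch (not the paper's code): CMA-ES over the singular values of a LoRA-style
# low-rank update dW = U diag(s) V^T, with U and V frozen after one SVD.
import numpy as np
import cma  # pip install cma

rng = np.random.default_rng(0)
d_out, d_in, rank = 64, 64, 8
W0 = rng.normal(size=(d_out, d_in)) * 0.02             # frozen base weight
B = rng.normal(size=(d_out, rank)) * 0.02              # LoRA factors, frozen after the SVD
A = rng.normal(size=(rank, d_in)) * 0.02
U, s0, Vh = np.linalg.svd(B @ A, full_matrices=False)  # only the top `rank` values are non-trivial
U, Vh = U[:, :rank], Vh[:rank, :]

x_eval = rng.normal(size=(d_in,))                       # stand-in "evaluation prompt"

def reward(s):
    """Forward pass with W0 + U diag(s) Vh; no gradients anywhere."""
    y = np.tanh((W0 + U @ np.diag(s) @ Vh) @ x_eval)
    return -np.square(y - 0.5).mean()                   # pretend task: push outputs toward 0.5

es = cma.CMAEvolutionStrategy(s0[:rank].tolist(), 0.05)
for _ in range(50):
    cands = es.ask()                                    # sample candidate singular-value vectors
    es.tell(cands, [-reward(np.asarray(c)) for c in cands])  # CMA-ES minimizes, so negate
print("best singular values:", np.round(es.result.xbest, 4))
```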
@borisshapa
Boris Shaposhnikov
3 months
2/ ⚙️ Why this matters: PPO/GRPO pipelines are complex (actor/critic, long rollouts, sync, memory). ESSA uses only forward passes + black-box optimization. No backprop. Works even in INT8/INT4.
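Because the loop in 2/ never needs a backward pass, candidates can be scored on a quantized model. Below is a rough sketch of one candidate evaluation on an 8-bit checkpoint; the model name, prompts, and exact-match reward are illustrative assumptions, not the paper's setup.

```python
# Rough sketch: scoring one candidate with forward passes only, on an INT8 model.
# Model name, prompts, and the exact-match reward are assumptions for illustration.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

name = "Qwen/Qwen2.5-32B-Instruct"                      # assumed checkpoint
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(
    name,
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),  # INT8 weights suffice
    device_map="auto",
)

@torch.no_grad()                                        # never any backprop
def evaluate_candidate(prompts, gold_answers):
    correct = 0
    for prompt, gold in zip(prompts, gold_answers):
        ids = tok(prompt, return_tensors="pt").to(model.device)
        out = model.generate(**ids, max_new_tokens=256, do_sample=False)
        text = tok.decode(out[0, ids["input_ids"].shape[1]:], skip_special_tokens=True)
        correct += gold in text                         # toy exact-match reward
    return correct / len(prompts)                       # one scalar back to the optimizer
```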
@borisshapa
Boris Shaposhnikov
3 months
1/ 🚀 We’re releasing ESSA: Evolutionary Strategies for Scalable Alignment — a gradient-free, inference-only alternative to RLHF that makes aligning LLMs faster, simpler, and cheaper.👇
@ummagumm_a
Viacheslav Sinii
3 months
1/ @johnschulman2 and @thinkymachines showed that LoRA can match full fine-tuning in many post-training regimes. In our earlier paper, we went even tighter — training steering vectors. That's 131K extra params on Llama3.1-8B-Instruct, and it matches full fine-tuning on 6/7 models we studied.
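One accounting consistent with the 131K figure above (an assumption about the setup, not a claim about the paper's exact recipe): a single additive steering vector per decoder layer of Llama-3.1-8B-Instruct gives 32 layers × 4096 hidden dims = 131,072 trainable parameters. A sketch of that parametrization via forward hooks:

```python
# Sketch under an assumed setup: one learnable additive steering vector per decoder
# layer, injected into that layer's output hidden states with a forward hook.
# 32 layers * 4096 hidden dim = 131,072 trainable parameters on Llama-3.1-8B-Instruct.
import torch
import torch.nn as nn

class SteeringVectors(nn.Module):
    def __init__(self, num_layers=32, hidden=4096):
        super().__init__()
        self.vectors = nn.Parameter(torch.zeros(num_layers, hidden))  # the only trainable params

    def attach(self, decoder_layers):
        """Register hooks that add vector i to layer i's output hidden states."""
        handles = []
        for i, layer in enumerate(decoder_layers):
            def hook(module, inputs, output, i=i):
                hidden = output[0] if isinstance(output, tuple) else output
                steered = hidden + self.vectors[i]        # shift the residual stream
                return (steered, *output[1:]) if isinstance(output, tuple) else steered
            handles.append(layer.register_forward_hook(hook))
        return handles

sv = SteeringVectors()
print(sum(p.numel() for p in sv.parameters()))            # 131072, i.e. ~131K
```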
@BredisGeorge
George Bredis
5 months
[1/9] VLMs caption well, but no simple RL method trains them in multi-step sims and shows gains. VL-DAC: a lightweight RL algorithm on top of a VLM + cheap sims yields agents that finish long quests and transfer to skill-specific benchmarks with no tuning. HF link: https://t.co/HIGDN7NisA
@borisshapa
Boris Shaposhnikov
8 months
10/ Huge thanks to my amazing colleagues. This work wouldn’t have been possible without you @AMyashka @kefirski
@borisshapa
Boris Shaposhnikov
8 months
9/ All our findings and full experimental results are available in the paper. If you’re using offline alignment methods, it’s important to consider how your chosen objective interacts with prompt bias. 🔗
arxiv.org
Direct Alignment Algorithms (DAAs) offer a simpler way to language model alignment than traditional RLHF by directly optimizing policies. While DAAs differ in their use of SFT (one-stage vs....
@borisshapa
Boris Shaposhnikov
8 months
8/ Takeaway: Pairwise objectives outperform pointwise on tasks where the model's capacity is just enough to “unlearn” prompt bias, but not enough to tackle harder examples. This nuance explains why previous “best method” claims depend on overlooked details like setup and bias.
@borisshapa
Boris Shaposhnikov
8 months
7/ So, removing prompt bias isn't inherently “good” or “bad”; it depends on what you want your model to do. For some tasks you might deliberately keep the bias, for others remove it.
@borisshapa
Boris Shaposhnikov
8 months
6/ We confirmed this with new controlled experiments: * In data with no prompt bias, all methods converge similarly. * In biased setups, pointwise methods reduce bias more, but this can come at the cost of less robust alignment on harder prompts.
@borisshapa
Boris Shaposhnikov
8 months
5/ * Pairwise objectives just require that the preferred output ranks higher, without flattening the whole probability landscape, so the model can still focus on genuinely tricky prompts.
@borisshapa
Boris Shaposhnikov
8 months
4/ Here’s why: * Pointwise methods try to “unlearn” prompt bias by forcing probabilities for preferred responses toward 1, and rejected ones toward 0, for each prompt. This “zeroing out” eats up capacity that could be spent on harder cases.
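A minimal sketch contrasting the two families described in 4/ and 5/, written over summed log-probabilities of chosen and rejected responses under the policy and a frozen reference. The specific losses (an independent push per response vs. a DPO-style margin) are illustrative stand-ins, not necessarily the exact objectives compared in the paper.

```python
# Illustrative pointwise vs. pairwise preference losses over sequence log-probs.
# logp_* are log pi(y|x) for chosen (w) / rejected (l) responses; ref_* are the same
# under a frozen reference model. These are stand-in objectives, not the paper's exact DAAs.
import torch
import torch.nn.functional as F

def implicit_reward(logp, ref_logp, beta=0.1):
    return beta * (logp - ref_logp)                     # how far the policy moved from the reference

def pointwise_loss(logp_w, ref_w, logp_l, ref_l):
    """Each response is pushed independently: chosen up, rejected down.
    This is the 'force probabilities toward 1 / toward 0' behavior from 4/."""
    r_w, r_l = implicit_reward(logp_w, ref_w), implicit_reward(logp_l, ref_l)
    return (F.softplus(-r_w) + F.softplus(r_l)).mean()  # softplus(-x) == -log sigmoid(x)

def pairwise_loss(logp_w, ref_w, logp_l, ref_l):
    """Only the margin matters: the chosen response must outrank the rejected one
    (DPO-style), without flattening absolute probabilities, as in 5/."""
    margin = implicit_reward(logp_w, ref_w) - implicit_reward(logp_l, ref_l)
    return F.softplus(-margin).mean()

# toy batch of summed log-probs
logp_w, logp_l = torch.tensor([-12.0, -30.0]), torch.tensor([-15.0, -31.0])
ref_w, ref_l = torch.tensor([-13.0, -29.0]), torch.tensor([-16.0, -30.0])
print(pointwise_loss(logp_w, ref_w, logp_l, ref_l), pairwise_loss(logp_w, ref_w, logp_l, ref_l))
```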