Boris Shaposhnikov
@borisshapa
Followers: 56 · Following: 7 · Media: 17 · Statuses: 58
10/ 🙌 cc @yule_gan, @ericmitchellai, @winniethexu, @_lewtun, @krasul, @lvwerra, @tomekkorbak, @archit_sharma97, @sea_snell, @rm_rafailov, @Learnius, @yumeng0818, @jiwoohong98, @KarelDoostrlnck, @chenhuay17, @TengX6
9/ TL;DR: ESSA turns alignment into a simple, stable, highly parallel evaluation loop. Competitive quality, dramatically lower engineering overhead, and much faster time-to-quality. Paper:
8/ 🧪 Broad evaluation. ESSA spans 7 advanced math benchmarks (MATH500, MinervaMath, OlympiadBench, AIME’24/’25, AMC’23) on Qwen2.5-32B/72B and, in the paper, also non-math tasks like IFEval & HelpSteer with richer reward signals.
7/ 🔢 Precision robustness: BF16 → INT8 → INT4 costs <1% accuracy for ESSA (e.g. 0.847 → 0.844 → 0.838 for Qwen2.5-32B on PRM800K).
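A rough illustration of what those three precisions look like in practice, using Hugging Face transformers + bitsandbytes; the checkpoint name and quantization settings here are my assumptions, not the paper's exact recipe:

```python
# Illustrative loading configs for the three precisions mentioned above
# (transformers + bitsandbytes). Checkpoint name is an assumption.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

model_id = "Qwen/Qwen2.5-32B-Instruct"  # assumed checkpoint, for illustration only

bf16 = dict(torch_dtype=torch.bfloat16)
int8 = dict(quantization_config=BitsAndBytesConfig(load_in_8bit=True))
int4 = dict(quantization_config=BitsAndBytesConfig(
    load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16))

# ESSA only needs forward passes, so the same search loop can run on any of these.
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", **int8)
```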
6/ ⏱️ Scaling wins: On Qwen2.5-32B / PRM800K, ESSA reaches near-optimal accuracy 2× faster on 16 GPUs and 6× faster on 128 GPUs than GRPO — minimal comms (just seeds + rewards) makes parallelism easy.
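A minimal sketch of why the communication stays tiny, in the spirit of classic seed-sharing parallel ES (the exact ESSA/CMA-ES protocol may differ): each worker rebuilds its perturbation from a shared seed, so only (seed, reward) pairs cross the network.

```python
# Seed-sharing ES sketch: workers exchange (seed, reward) pairs, never weight tensors,
# because every node can regenerate the same perturbation from the seed.
import numpy as np

DIM, SIGMA = 16, 0.05  # e.g. a rank-16 vector of singular values

def worker(theta: np.ndarray, seed: int, evaluate) -> tuple[int, float]:
    noise = np.random.default_rng(seed).standard_normal(DIM)
    return seed, evaluate(theta + SIGMA * noise)  # a few bytes go back over the wire

def server_update(theta: np.ndarray, results: list[tuple[int, float]], lr: float = 0.02) -> np.ndarray:
    rewards = np.array([r for _, r in results])
    rewards = (rewards - rewards.mean()) / (rewards.std() + 1e-8)  # normalize rewards
    grad = np.zeros(DIM)
    for (seed, _), r in zip(results, rewards):
        grad += r * np.random.default_rng(seed).standard_normal(DIM)  # replay noise from seed
    return theta + lr / (len(results) * SIGMA) * grad
```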
5/ 📊 Headline results (vs GRPO): • Qwen2.5-Math-7B: +12.6% on GSM8K, +14.8% on PRM800K • LLaMA-3.1-8B: +22.5% on IFEval
4/ 🔬 As @johnschulman2 & @thinkymachines showed, LoRA can match Full Fine-Tuning in post-training — a strong, efficient baseline. ESSA goes further: it tunes only the singular values — smaller, faster, fully gradient-free.
3/ 💡 Core idea: optimize only the singular values of LoRA adapters (SVD-LoRA) with CMA-ES. That shrinks the search space by >1000× while keeping the stability and gradient-free nature of Evolution Strategies.
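A minimal sketch of that idea, assuming a placeholder reward and random SVD bases standing in for a real LoRA delta; the `cma` package provides the CMA-ES loop:

```python
# Sketch: freeze the SVD bases of a LoRA delta and let CMA-ES search only over its
# singular values. `reward` is a placeholder; in ESSA it would be forward-only
# generation plus a verifier/reward score.
import numpy as np
import cma

rank = 16
U = np.random.randn(4096, rank)        # stand-ins for the SVD bases of a LoRA delta
Vt = np.random.randn(rank, 4096)
s0 = np.abs(np.random.randn(rank))     # initial singular values

def reward(s: np.ndarray) -> float:
    delta = U @ np.diag(s) @ Vt
    # ... patch `delta` into the frozen base model, generate, score the outputs ...
    return float(-np.linalg.norm(delta))  # toy objective, for illustration only

es = cma.CMAEvolutionStrategy(s0, 0.1, {"maxiter": 50})  # search space: just `rank` numbers
while not es.stop():
    candidates = es.ask()                                # propose singular-value vectors
    es.tell(candidates, [-reward(np.asarray(c)) for c in candidates])  # CMA-ES minimizes
best_singular_values = es.result.xbest
```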
2/ ⚙️ Why this matters: PPO/GRPO pipelines are complex (actor/critic, long rollouts, sync, memory). ESSA uses only forward passes + black-box optimization. No backprop. Works even in INT8/INT4.
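A hedged sketch of the evaluation-only inner loop this implies: generate with a frozen model under `no_grad` and return a scalar reward. Here `model`, `tok`, and `reward_fn` are assumed to come from elsewhere (e.g. the loading snippet above).

```python
# Evaluation-only inner loop: no optimizer state, no critic, no backward pass.
import torch

@torch.no_grad()  # nothing in the loop ever calls .backward()
def score_candidate(prompt: str, model, tok, reward_fn) -> float:
    inputs = tok(prompt, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=256, do_sample=True, temperature=0.7)
    completion = tok.decode(out[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True)
    return float(reward_fn(prompt, completion))  # black-box scalar feedback, nothing else
```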
1/ 🚀 We’re releasing ESSA: Evolutionary Strategies for Scalable Alignment — a gradient-free, inference-only alternative to RLHF that makes aligning LLMs faster, simpler, and cheaper.👇
1/ @johnschulman2 and @thinkymachines showed that LoRA can match full fine-tuning in many post-training regimes. In our earlier paper, we went even tighter: we trained only steering vectors. That's 131K extra params on Llama3.1-8B-Instruct, and it matches full fine-tuning on 6/7 of the models we studied.
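A hedged sketch of that steering-vector setup: one trainable vector per transformer layer, added to the residual stream, with everything else frozen (32 layers × 4096 dims = 131,072 params on Llama-3.1-8B). Hook placement and training details are assumptions, not the paper's exact recipe.

```python
# One trainable steering vector per layer, added to the residual stream via forward hooks.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
model.requires_grad_(False)  # base model stays frozen

steering = torch.nn.ParameterList(
    torch.nn.Parameter(torch.zeros(model.config.hidden_size))
    for _ in range(model.config.num_hidden_layers)
)

def make_hook(vec):
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + vec  # shift the residual stream
        return (hidden, *output[1:]) if isinstance(output, tuple) else hidden
    return hook

for layer, vec in zip(model.model.layers, steering):
    layer.register_forward_hook(make_hook(vec))

print(sum(p.numel() for p in steering))  # 32 * 4096 = 131072 extra parameters
```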
[1/9] VLMs caption well, but no simple RL recipe trains them in multi-step sims and shows gains. VL-DAC: a lightweight RL algorithm on top of a VLM + cheap sims yields agents that finish long quests and transfer to skill-specific benchmarks with no tuning. HF link: https://t.co/HIGDN7NisA
9/ All our findings and full experimental results are available in the paper. If you’re using offline alignment methods, it’s important to consider how your chosen objective interacts with prompt bias. 🔗
arxiv.org
Direct Alignment Algorithms (DAAs) offer a simpler way to language model alignment than traditional RLHF by directly optimizing policies. While DAAs differ in their use of SFT (one-stage vs....
8/ Takeaway: Pairwise objectives outperform pointwise on tasks where the model's capacity is just enough to “unlearn” prompt bias, but not enough to tackle harder examples. This nuance explains why previous “best method” claims depend on overlooked details like setup and bias.
7/ So removing prompt bias isn't inherently “good” or “bad”; it depends on what you want your model to do. Depending on the task, you might deliberately want to keep the bias or remove it.
6/ We confirmed this with new controlled experiments: * In data with no prompt bias, all methods converge similarly. * In biased setups, pointwise methods reduce bias more, but this can come at the cost of less robust alignment on harder prompts.
5/ * Pairwise objectives just require that the preferred output ranks higher, without flattening the whole probability landscape, so the model can still focus on genuinely tricky prompts.
4/ Here’s why: * Pointwise methods try to “unlearn” prompt bias by forcing probabilities for preferred responses toward 1, and rejected ones toward 0, for each prompt. This “zeroing out” eats up capacity that could be spent on harder cases.
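A toy contrast of the two objective families, on per-sequence log-probs; the function names and exact forms are illustrative, not any specific paper's implementation.

```python
# Toy pointwise vs pairwise losses on per-sequence log-probs (logp_* from the policy,
# ref_* from a frozen reference model). Forms are illustrative only.
import torch
import torch.nn.functional as F

def pointwise_loss(logp_chosen, logp_rejected):
    # Pushes P(chosen) toward 1 and P(rejected) toward 0 for every prompt,
    # which also spends capacity on flattening the prompt-conditional distribution.
    return -(logp_chosen + torch.log1p(-torch.exp(logp_rejected))).mean()

def pairwise_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected, beta=0.1):
    # DPO-style: only asks that the chosen response outranks the rejected one
    # relative to the reference, leaving the rest of the landscape alone.
    margin = (logp_chosen - ref_chosen) - (logp_rejected - ref_rejected)
    return -F.logsigmoid(beta * margin).mean()
```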