Talor Abramovich @AbramovichTalor X Profile

Talor Abramovich

@AbramovichTalor

Followers

83

Following

88

Media

10

Statuses

37

Israel

Joined December 2020

Talor Abramovich

@AbramovichTalor

1 day

8/8. You’re welcome to read more on our project website and in the paper, or to start exploring the benchmark! 🔗 Website: 📄 Paper: 📊 Benchmark: 💻 Code:

github.com

AblationBench is an evaluation framework for language models on ablation planning in empirical AI research - ai-scientist-bench/ablation-bench

0

1

Talor Abramovich

@AbramovichTalor

1 day

7/8. Special thanks to the teams behind CSR-Bench, SUPER-Expert, and PaperBench for open-sourcing their benchmarks: @YIJIA_XIAO_ @ben_bogin @OpenAI. AblationBench builds on their work collecting high-quality papers for AI coscientist tasks. Thanks to @GalChechik for supervising.

1

0

2

Talor Abramovich

@AbramovichTalor

1 day

6/8. We created two baselines to test our benchmark: LM-Planner using CoT prompting, and Agent-Planner based on SWE-agent. LM-Planner currently outperforms Agent-Planner, and is also ~3x more cost-effective.

1

0

Talor Abramovich

@AbramovichTalor

1 day

5/8. To automate evaluation, we built LM-based judges for both tasks and created separate human-annotated datasets to assess them. Our judges achieved F1 scores of 0.81 for AuthorAblation and 0.70 for ReviewerAblation.

1

0

Talor Abramovich

@AbramovichTalor

1 day

4/8. For the second task, we created ReviewerAblation, a dataset of 350 ICLR submissions from 2023 to 2025, paired with reviewer comments suggesting new ablations. The task is to find missing ablations in the original paper submission.

1

0

Talor Abramovich

@AbramovichTalor

1 day

3/8. For the first task, we built AuthorAblation. We collected 230 human-annotated ablations from 83 papers across 14 top AI conferences. The task is to generate an ablation plan using only the sections up to (and including) the method section.

1

0

Talor Abramovich

@AbramovichTalor

1 day

2/8. We look into two ablation planning tasks in empirical AI research: 1️⃣ Helping authors plan ablations given a written method section. 2️⃣ Helping reviewers spot missing ablations in a given paper. This dual perspective captures the complementary sides of ablation planning.

1

0

Talor Abramovich

@AbramovichTalor

1 day

Can AI coscientists help automate ablation planning? To test this, we created AblationBench, a benchmark suite to evaluate models on ablation planning in empirical AI research. The results? Even the best current models recover only 29% of the original ablations on average ⬇️

1

8

12

Talor Abramovich

@AbramovichTalor

5 days

Incredible to see the progress in Offensive Cybersecurity benchmarks!

Terry Yue Zhuo @ SF 🏖️

@terryyuezhuo

7 days

Training Agents without Runtime? Yes, and it works well on Offensive Cybersecurity! Introducing Cyber-Zero, the first approach that trains top-tier open-source cybersecurity agents that achieve comparable accuracy on 300+ CTFs like DeepSeek-V3 and Claude-3.5-Sonnet. What makes

0

1

5

Talor Abramovich

@AbramovichTalor

7 days

RT @ori_press: We just benchmarked Qwen 3 Coder and GLM 4.5 on AlgoTune, and they manage to beat Claude Opus 4! We're excited to see if the…

0

3

0

Talor Abramovich

@AbramovichTalor

26 days

We’re at @icmlconf, Hall B2-B3, Poster W-101, right now to talk about our agent for discovering new cybersec vulns. Come chat with us!

1

13

Talor Abramovich

@AbramovichTalor

1 month

Thanks to my collaborators from NYU and Princeton who made it possible: @Udiboy1209 @klieret @jyangballin @_carlosejimenez @moyix @karthik_r_n @ofirpress. Read more here:

enigma-agent.com

This is the landing and main page of EnIGMA

0

6

Talor Abramovich

@AbramovichTalor

1 month

Come hear about LM agents for the cybersecurity domain, and what we learned about data contamination in LM evaluations. Poster session will be held on July 17th, 11am–1:30pm PST at the Vancouver Convention Centre.

1

0

5

Talor Abramovich

@AbramovichTalor

1 month

Join me next week at #ICML25, where I will be presenting my first first-author paper: EnIGMA. EnIGMA, an LM agent for cybersecurity, uses interactive tools for server connection and debugging, achieving state-of-the-art on 3 CTF benchmarks.

3

5

25

Talor Abramovich

@AbramovichTalor

1 month

A new benchmark about algorithmic efficiency - AlgoTune! Super exciting to see how we progress from benchmarks about writing code to benchmarks about improving existing algorithms for hard problems such as AES encryption and prime factorization.

Ori Press

@ori_press

1 month

Do language models have algorithmic creativity? To find out, we built AlgoTune, a benchmark challenging agents to optimize 100+ algorithms like gzip compression, AES encryption and PCA. Frontier models struggle, finding only surface-level wins. Lots of headroom here!🧵⬇️

0

2

Talor Abramovich

@AbramovichTalor

3 months

RT @jyangballin: 40% with just 1 try per task: SWE-agent-LM-32B is the new #1 open source model on SWE-bench Verified. We built it by synt…

0

136

0

Talor Abramovich

@AbramovichTalor

4 months

RT @KLieret: Maps. Diagrams. UI glitches. SWE-bench Multimodal benchmarks AI agents on real-world frontend issues and they struggle. Poster…

0

3

0

Talor Abramovich

@AbramovichTalor

5 months

RT @yairshp: 🚀Introducing SISO – a plug-and-play approach for image personalization using just one image!

0

22

0

Talor Abramovich

@AbramovichTalor

6 months

RT @KLieret: Setting a new SoTA among open source systems on SWE-bench Verified with SWE-agent 1.0 + Claude 3.7!

0

5

0

Talor Abramovich

@AbramovichTalor

6 months

RT @Gidonrosenberg: Today, Avinatan marks his Hebrew birthday. 32 birthdays, of which two birthdays have been spent in Hamas captivity. Av…

0

40

0