Talor Abramovich Profile
Talor Abramovich

@AbramovichTalor

Followers: 83
Following: 88
Media: 10
Statuses: 37

Israel
Joined December 2020
@AbramovichTalor
Talor Abramovich
1 day
8/8. You’re welcome to read more on our project website and the paper, or start exploring the benchmark! 🔗 Website: 📄 Paper: 📊 Benchmark: 💻 Code:
github.com
AblationBench is an evaluation framework for language models on ablation planning in empirical AI research - ai-scientist-bench/ablation-bench
0
0
1
@AbramovichTalor
Talor Abramovich
1 day
7/8. Special thanks to the teams behind CSR-Bench, SUPER-Expert, and PaperBench for open-sourcing their benchmarks: @YIJIA_XIAO_ @ben_bogin @OpenAI. AblationBench builds on their work collecting high-quality papers for AI coscientist tasks. Thanks to @GalChechik for supervising.
1
0
2
@AbramovichTalor
Talor Abramovich
1 day
6/8. We created two baselines to test our benchmark: LM-Planner using CoT prompting, and Agent-Planner based on SWE-agent. LM-Planner currently outperforms Agent-Planner, and is also ~3x more cost-effective.
1
0
0
@AbramovichTalor
Talor Abramovich
1 day
5/8. To automate evaluation, we built LM-based judges for both tasks and created separate human-annotated datasets to assess them. Our judges achieved F1 scores of 0.81 for AuthorAblation and 0.70 for ReviewerAblation.
1
0
0
@AbramovichTalor
Talor Abramovich
1 day
4/8. For the second task, we created ReviewerAblation, a dataset of 350 ICLR submissions from 2023 to 2025, paired with reviewer comments suggesting new ablations. The task is to find missing ablations in the original paper submission.
1
0
0
@AbramovichTalor
Talor Abramovich
1 day
3/8. For the first task, we built AuthorAblation. We collected 230 human-annotated ablations from 83 papers across 14 top AI conferences. The task is to generate an ablation plan using only the sections up to (and including) the method section.
1
0
0
@AbramovichTalor
Talor Abramovich
1 day
2/8. We look into two ablation planning tasks in empirical AI research: 1️⃣ Helping authors plan ablations given a written method section. 2️⃣ Helping reviewers spot missing ablations in a given paper. This dual perspective captures the complementary sides of ablation planning.
1
0
0
@AbramovichTalor
Talor Abramovich
1 day
Can AI coscientists help automate ablation planning? To test this, we created AblationBench, a benchmark suite to evaluate models on ablation planning in empirical AI research. The results? Even the best current models recover only 29% of the original ablations on average ⬇️
1
8
12
@AbramovichTalor
Talor Abramovich
5 days
Incredible to see the progress in Offensive Cybersecurity benchmarks!
@terryyuezhuo
Terry Yue Zhuo @ SF 🏖️
7 days
Training Agents without Runtime? Yes, and it works well on Offensive Cybersecurity! Introducing Cyber-Zero, the first approach that trains top-tier open-source cybersecurity agents that achieves comparable accuracy on 300+ CTFs like DeepSeek-V3 and Claude-3.5-Sonnet. What makes
0
1
5
@AbramovichTalor
Talor Abramovich
7 days
RT @ori_press: We just benchmarked Qwen 3 Coder and GLM 4.5 on AlgoTune, and they manage to beat Claude Opus 4! We're excited to see if the…
0
3
0
@AbramovichTalor
Talor Abramovich
26 days
We’re @icmlconf Hall B2-B3, Poster W-101 right now to talk about our agent for discovering new cybersec vulns. Come chat with us!
1
1
13
@AbramovichTalor
Talor Abramovich
1 month
Thanks to my collaborators from NYU and Princeton who made it possible: @Udiboy1209 @klieret @jyangballin @_carlosejimenez @moyix @karthik_r_n @ofirpress. Read more here:
enigma-agent.com
This is the landing and main page of EnIGMA
0
0
6
@AbramovichTalor
Talor Abramovich
1 month
Come hear about LM agents for the cybersecurity domain, and what we learned about data contamination in LM evaluations. Poster session will be held on July 17th, 11am–1:30pm PST at the Vancouver Convention Centre.
1
0
5
@AbramovichTalor
Talor Abramovich
1 month
Join me next week at #ICML25, where I will be presenting my first first-author paper, EnIGMA. EnIGMA, an LM agent for cybersecurity, uses interactive tools for server connection and debugging, achieving state-of-the-art results on 3 CTF benchmarks.
3
5
25
@AbramovichTalor
Talor Abramovich
1 month
A new benchmark about algorithmic efficiency - AlgoTune! Super exciting to see how we progress from benchmarks about writing code to benchmarks about improving existing algorithms for hard problems such as AES encryption and prime factorization.
@ori_press
Ori Press
1 month
Do language models have algorithmic creativity? To find out, we built AlgoTune, a benchmark challenging agents to optimize 100+ algorithms like gzip compression, AES encryption and PCA. Frontier models struggle, finding only surface-level wins. Lots of headroom here!🧵⬇️
0
0
2
@AbramovichTalor
Talor Abramovich
3 months
RT @jyangballin: 40% with just 1 try per task: SWE-agent-LM-32B is the new #1 open source model on SWE-bench Verified. We built it by synt…
0
136
0
@AbramovichTalor
Talor Abramovich
4 months
RT @KLieret: Maps. Diagrams. UI glitches. SWE-bench Multimodal benchmarks AI agents on real-world frontend issues and they struggle. Poster…
0
3
0
@AbramovichTalor
Talor Abramovich
5 months
RT @yairshp: 🚀Introducing SISO – a plug-and-play approach for image personalization using just one image!
0
22
0
@AbramovichTalor
Talor Abramovich
6 months
RT @KLieret: Setting a new SoTA among open source systems on SWE-bench Verified with SWE-agent 1.0 + Claude 3.7!
0
5
0
@AbramovichTalor
Talor Abramovich
6 months
RT @Gidonrosenberg: Today, Avinatan marks his Hebrew birthday. 32 birthdays, of which two birthdays have been spent in Hamas captivity. Av…
0
40
0