
Talor Abramovich
@AbramovichTalor
Followers
83
Following
88
Media
10
Statuses
37
8/8. You’re welcome to read more on our project website and the paper, or start exploring the benchmark! 🔗 Website: 📄 Paper: 📊 Benchmark: 💻 Code:
github.com
AblationBench is an evaluation framework for language models on ablation planning in empirical AI research - ai-scientist-bench/ablation-bench
0
0
1
7/8. Special thanks to the teams behind CSR-Bench, SUPER-Expert, and PaperBench for open-sourcing their benchmarks: @YIJIA_XIAO_ @ben_bogin @OpenAI. AblationBench builds on their work collecting high-quality papers for AI co-scientist tasks. Thanks to @GalChechik for supervising.
1
0
2
Incredible to see the progress in offensive cybersecurity benchmarks!
Training Agents without Runtime? Yes, and it works well on Offensive Cybersecurity! Introducing Cyber-Zero, the first approach that trains top-tier open-source cybersecurity agents that achieve accuracy comparable to DeepSeek-V3 and Claude-3.5-Sonnet on 300+ CTFs. What makes
0
1
5
RT @ori_press: We just benchmarked Qwen 3 Coder and GLM 4.5 on AlgoTune, and they manage to beat Claude Opus 4! We're excited to see if the…
0
3
0
We’re @icmlconf Hall B2-B3, Poster W-101 right now to talk about our agent for discovering new cybersec vulns. Come chat with us!
1
1
13
Thanks to my collaborators from NYU and Princeton who made it possible: @Udiboy1209 @klieret @jyangballin @_carlosejimenez @moyix @karthik_r_n @ofirpress. Read more here:
enigma-agent.com
This is the landing and main page of EnIGMA
0
0
6
Join me next week at #ICML25, where I will be presenting my first first-author paper: EnIGMA. EnIGMA, an LM agent for cybersecurity, uses interactive tools for server connection and debugging, achieving state-of-the-art results on 3 CTF benchmarks.
3
5
25
A new benchmark about algorithmic efficiency: AlgoTune! Super exciting to see how we progress from benchmarks about writing code to benchmarks about improving existing algorithms for hard problems such as AES encryption and prime factorization.
Do language models have algorithmic creativity? To find out, we built AlgoTune, a benchmark challenging agents to optimize 100+ algorithms like gzip compression, AES encryption and PCA. Frontier models struggle, finding only surface-level wins. Lots of headroom here!🧵⬇️
0
0
2
RT @jyangballin: 40% with just 1 try per task: SWE-agent-LM-32B is the new #1 open source model on SWE-bench Verified. We built it by synt…
0
136
0
RT @KLieret: Maps. Diagrams. UI glitches. SWE-bench Multimodal benchmarks AI agents on real-world frontend issues and they struggle. Poster…
0
3
0
RT @yairshp: 🚀Introducing SISO – a plug-and-play approach for image personalization using just one image!
0
22
0
RT @KLieret: Setting a new SoTA among open source systems on SWE-bench Verified with SWE-agent 1.0 + Claude 3.7!
0
5
0
RT @Gidonrosenberg: Today, Avinatan marks his Hebrew birthday. 32 birthdays, of which two birthdays have been spent in Hamas captivity. Av…
0
40
0