Piotr Piękos
@PiotrPiekosAI
Followers
235
Following
327
Media
11
Statuses
54
PhD student with @SchmidhuberAI at @KAUST. Interested in systematic generalization and reasoning.
Joined April 2012
If you’re not happy with the quality of the reviews you got, remember that with current trends the reviews at ICML will be even worse
0
0
2
Our Huxley-Gödel Machine learns to rewrite its own code, estimating its own long-term self-improvement potential. It generalizes to new tasks (SWE-Bench Lite), matching the best officially checked human-engineered agents. arXiv: 2510.21614 With @Wenyi_AI_Wang, @PiotrPiekosAI,
56
154
1K
Authors: Wenyi Wang, @PiotrPiekosAI, @nbl_ai, Firas Laakom, @Beastlyprime, @MatOstasze, @MingchenZhuge, @SchmidhuberAI arXiv: https://t.co/jpQ1JQjhO4 GitHub:
github.com: metauto-ai/HGM (🧬 The Huxley-Gödel Machine)
1
2
14
✅ HGM’s improvements transfer across datasets and LLMs. Optimized on SWE-Bench Verified with GPT-5 mini and evaluated on SWE-Bench Lite with GPT-5, HGM reaches human-level performance 👩‍💻🤖, matching the best officially checked human-engineered coding agent.
1
0
9
⚡️ HGM selects agents for self-improvement based on CMP estimates. On SWE-Bench Verified, HGM outperforms prior self-improving coding systems while using 🕐 less wall-clock time.
1
0
10
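A hypothetical sketch of the selection idea above: pick the next agent to self-modify by its estimated CMP rather than by its current benchmark score. The function name, the greedy argmax rule, and the example numbers are illustrative assumptions, not HGM's actual estimator or scheduling policy.

```python
# Hypothetical stand-in for CMP-guided selection (not HGM's actual policy):
# expand the agent whose clade is estimated to be most productive,
# instead of the agent with the best current score.
def select_agent_to_modify(cmp_estimates: dict[str, float]) -> str:
    """Return the id of the agent with the highest estimated CMP."""
    return max(cmp_estimates, key=cmp_estimates.get)


# Illustrative (made-up) numbers: "v2" has the best current score, but "v3"'s
# clade is estimated to be more productive, so "v3" is selected for self-modification.
current_scores = {"v1": 0.31, "v2": 0.45, "v3": 0.40}
cmp_estimates = {"v1": 0.33, "v2": 0.41, "v3": 0.47}
print(select_agent_to_modify(cmp_estimates))  # -> "v3"
```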
💡Introducing Clade Metaproductivity (CMP): A metric that measures an agent’s true potential for self-improvement via its descendants. We prove: under certain assumptions, access to true CMP lets a coding agent simulate the Gödel Machine - the optimal self-improving machine.🧩
1
0
11
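A minimal sketch of the clade intuition behind CMP, under stated assumptions: an agent's clade is the agent itself plus all of its self-modification descendants, and CMP aggregates measured performance over that clade. The tree representation and the use of a plain mean as the aggregator are illustrative choices, not the paper's exact definition.

```python
# Sketch only: clade = an agent plus all of its self-modification descendants;
# CMP aggregates performance over the clade (a mean is used here as a stand-in).
from dataclasses import dataclass, field
from statistics import mean


@dataclass
class AgentNode:
    performance: float                              # e.g. benchmark score of this agent version
    children: list["AgentNode"] = field(default_factory=list)


def clade(node: AgentNode) -> list[AgentNode]:
    """All agents in the clade rooted at `node` (the node and its descendants)."""
    out = [node]
    for child in node.children:
        out.extend(clade(child))
    return out


def cmp_estimate(node: AgentNode) -> float:
    """Aggregate performance over the clade; the true aggregator may differ."""
    return mean(a.performance for a in clade(node))


# Usage: a weak agent whose descendants improve can have a higher CMP
# than a stronger agent whose clade stagnates.
root = AgentNode(0.30, [AgentNode(0.35, [AgentNode(0.45)]), AgentNode(0.40)])
print(cmp_estimate(root))
```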
Most self-improving coding systems ⚙️ pick their better-performing agents to self-modify. But ❌ higher immediate scores ≠ long-term self-improvement potential. We call this the Metaproductivity–Performance Mismatch 📉
1
0
11
🚨 Time to let agents code themselves! Meet the Huxley-Gödel Machine (HGM), a game changer in coding agent development 🚨 [🤖 vs. 🧑‍💻] HGM evolves through self-rewrites and matches the best officially checked human-engineered agents on SWE-Bench Lite, despite being optimized on a different dataset.
1
11
60
Next time you complain about the bidding system, be grateful that you at least had a chance to bid.
0
0
2
[LG] Mixture of Sparse Attention: Content-Based Learnable Sparse Attention via Expert-Choice Routing P Piękos, R Csordás, J Schmidhuber [KAUST & Stanford University] (2025) https://t.co/kR2NuFNa5P
0
4
20
Intuitive, reduces complexity, and measurably effective. That's what you're going for.
In MoSA, each attention head selects k tokens to process using Expert-Choice Routing. This creates a learnable, context-based sparse attention with reduced complexity from O(T^2) to O(T+k^2). The saved compute can be used to create a large set of specialized, sparse heads.
0
1
2
What if instead of a couple of dense attention heads, we use lots of sparse heads, each learning to select its own set of tokens to process? Introducing Mixture of Sparse Attention (MoSA)
1
4
19
More results can be found in the paper. Huge thanks to my amazing collaborators: @robert_csordas, @SchmidhuberAI Paper: https://t.co/20toWkvjyL Code:
github.com: user-friendly implementation of Mixture of Sparse Attention (MoSA). MoSA selects distinct tokens for each head with expert-choice routing, providing a content-based sparse attention mechanism.
0
0
1
MoSA’s advantages also hold on longer sequences. Preliminary results on lengths up to 8192 show that MoSA consistently maintains its advantage over the compared sparse baselines.
1
0
2
MoSA improves resource utilization in the practical setting as well. Even with a pure PyTorch implementation, when perplexity-matched to a dense baseline, it runs faster in wall-clock time, uses less memory, and needs a significantly smaller KV cache.
1
0
2
The sparsity induced by MoSA helps even on standard-length sequences, reducing perplexity by up to 27% in the IsoFLOP setting compared to dense attention. Moreover, MoSA is the only sparse attention variant we tested that reduces perplexity at standard lengths.
1
0
2
In MoSA, each attention head selects k tokens to process using Expert-Choice Routing. This creates a learnable, context-based sparse attention with reduced complexity from O(T^2) to O(T+k^2). The saved compute can be used to create a large set of specialized, sparse heads.
1
1
6
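A minimal, self-contained sketch of the mechanism described in the tweet above: one sparse head whose router scores all T tokens (O(T)), keeps its top-k, and attends only among those k tokens (O(k^2)). This is a simplified single-head, batchless, non-causal illustration with assumed names (MoSAHead, d_model, d_head, k); it is not the paper's or the repository's implementation.

```python
import torch
import torch.nn.functional as F


class MoSAHead(torch.nn.Module):
    """One sparse head: expert-choice token selection + attention among the selected tokens."""

    def __init__(self, d_model: int, d_head: int, k: int):
        super().__init__()
        self.k = k
        self.router = torch.nn.Linear(d_model, 1)    # expert-choice scores over tokens, O(T)
        self.q = torch.nn.Linear(d_model, d_head)
        self.kv = torch.nn.Linear(d_model, 2 * d_head)
        self.out = torch.nn.Linear(d_head, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (T, d_model); causal masking and batching omitted for brevity.
        scores = self.router(x).squeeze(-1)          # (T,) routing scores
        top = torch.topk(scores, self.k).indices     # this head picks its own k tokens
        sel = x[top]                                 # (k, d_model)
        q = self.q(sel)
        key, val = self.kv(sel).chunk(2, dim=-1)
        # Attention only among the k selected tokens: O(k^2) instead of O(T^2).
        attn = F.softmax(q @ key.T / q.shape[-1] ** 0.5, dim=-1)
        mixed = (attn @ val) * torch.sigmoid(scores[top]).unsqueeze(-1)  # keeps the router differentiable
        y = torch.zeros_like(x)
        y[top] = self.out(mixed)                     # scatter outputs back to the selected positions
        return y


# Usage: 16 tokens, this head processes only 4 of them.
head = MoSAHead(d_model=32, d_head=16, k=4)
print(head(torch.randn(16, 32)).shape)  # torch.Size([16, 32])
```

Because each head only attends over its own k tokens, the compute saved relative to a dense head can be spent on many such specialized sparse heads, which is the trade-off the thread describes.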
What if instead of a couple of dense attention heads, we use lots of sparse heads, each learning to select its own set of tokens to process? Introducing Mixture of Sparse Attention (MoSA)
1
4
19
One of my rejected ICML 2025 (@icmlconf) papers. Can anyone spot any criticism in the metareview? What a joke 🙃
26
25
518
Come visit our poster "SwitchHead: Accelerating Transformers with Mixture-of-Experts Attention" on Thursday at 11 am in East Exhibit Hall A-C at #NeurIPS2024. With @PiotrPiekosAI, Kazuki Irie, and @SchmidhuberAI.
2
20
51