Piotr Piękos Profile
Piotr Piękos

@PiotrPiekosAI

Followers 235 · Following 327 · Media 11 · Statuses 54

PhD student with @SchmidhuberAI at @KAUST. Interested in systematic generalization and reasoning.

Joined April 2012
@PiotrPiekosAI
Piotr Piękos
13 days
If you’re not happy with the quality of the reviews you got, remember that with current trends the reviews at ICML will be even worse
0
0
2
@SchmidhuberAI
Jürgen Schmidhuber
1 month
Our Huxley-Gödel Machine learns to rewrite its own code, estimating its own long-term self-improvement potential. It generalizes to new tasks (SWE-Bench Lite), matching the best officially checked human-engineered agents. arXiv: 2510.21614. With @Wenyi_AI_Wang, @PiotrPiekosAI,
56
154
1K
@PiotrPiekosAI
Piotr Piękos
1 month
✅HGM’s improvements transfer across datasets and LLMs. Optimized on SWE-Bench Verified with GPT-5 mini, evaluated on SWE-Bench Lite with GPT-5, HGM reaches human-level performance 👩‍💻🤖, matching the best officially checked human-engineered coding agent.
1
0
9
@PiotrPiekosAI
Piotr Piękos
1 month
⚡️HGM selects agents for self-improvement based on CMP estimates. On SWE-Bench Verified, HGM outperforms prior self-improving coding systems while using 🕐 less wall-clock time.
1
0
10
@PiotrPiekosAI
Piotr Piękos
1 month
💡Introducing Clade Metaproductivity (CMP): a metric that measures an agent’s true potential for self-improvement via its descendants. We prove that, under certain assumptions, access to the true CMP lets a coding agent simulate the Gödel Machine, the optimal self-improving machine. 🧩
1
0
11
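A minimal Python sketch of the idea in the two tweets above, assuming a simple greedy scheme: CMP is estimated as an aggregate of benchmark scores over an agent’s clade (the agent plus every agent derived from it by self-rewrites), and the next agent to self-modify is the one with the highest estimate. All names here (`Agent`, `estimate_cmp`, `expansion_step`) are hypothetical, and HGM’s actual estimator and scheduling are more involved than this average-and-argmax loop.

```python
# Hypothetical sketch, not HGM's implementation: CMP estimated as the mean
# benchmark score over an agent's clade, plus a greedy selection step that
# self-modifies the agent with the highest CMP estimate.
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class Agent:
    score: float                           # benchmark score of this agent alone
    children: List["Agent"] = field(default_factory=list)

def clade_scores(agent: Agent) -> List[float]:
    """Scores of the agent and of every descendant produced by self-rewrites."""
    scores = [agent.score]
    for child in agent.children:
        scores += clade_scores(child)
    return scores

def estimate_cmp(agent: Agent) -> float:
    """Toy CMP estimate: mean performance over the whole clade."""
    s = clade_scores(agent)
    return sum(s) / len(s)

def expansion_step(pool: List[Agent],
                   self_modify: Callable[[Agent], Agent],
                   evaluate: Callable[[Agent], float]) -> Agent:
    """Expand the agent whose clade currently looks most metaproductive."""
    parent = max(pool, key=estimate_cmp)
    child = self_modify(parent)            # the agent rewrites its own code
    child.score = evaluate(child)          # e.g. a software-engineering benchmark
    parent.children.append(child)
    pool.append(child)
    return child
```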
@PiotrPiekosAI
Piotr Piękos
1 month
Most self-improving coding systems ⚙️ self-modify the better-performing agents. But ❌ higher immediate scores ≠ long-term self-improvement. We call this the Metaproductivity–Performance Mismatch 📉
1
0
11
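A toy illustration of the mismatch in the tweet above, with invented numbers: the agent that scores best right now is not the one whose clade improves the most, so greedy selection by immediate score and CMP-guided selection expand different agents.

```python
# Invented toy numbers, purely to illustrate the Metaproductivity-Performance
# Mismatch: the higher-scoring agent has the less productive clade.
immediate_score = {"agent_A": 0.62, "agent_B": 0.55}   # score right now
clade_average   = {"agent_A": 0.48, "agent_B": 0.71}   # avg over descendants

greedy_pick = max(immediate_score, key=immediate_score.get)   # -> "agent_A"
cmp_pick    = max(clade_average, key=clade_average.get)       # -> "agent_B"
print(greedy_pick, cmp_pick)   # greedy expands A; CMP-guided expands B
```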
@PiotrPiekosAI
Piotr Piękos
1 month
🚨Time to let agents code themselves! Meet the Huxley-Gödel Machine (HGM), a game changer in coding-agent development🚨 [🤖 vs. 🧑‍💻] HGM evolves through self-rewrites to match the best officially checked human-engineered agents on SWE-Bench Lite, despite being optimized on a different dataset.
1
11
60
@PiotrPiekosAI
Piotr Piękos
7 months
Next time you complain about the bidding system, be grateful that you at least had a chance to bid.
0
0
2
@fly51fly
fly51fly
7 months
[LG] Mixture of Sparse Attention: Content-Based Learnable Sparse Attention via Expert-Choice Routing P Piękos, R Csordás, J Schmidhuber [KAUST & Stanford University] (2025) https://t.co/kR2NuFNa5P
0
4
20
@TheGrizztronic
Josh Cason
7 months
Intuitive and reduces complexity. Measurable effectiveness. That's what you're going for.
@PiotrPiekosAI
Piotr Piękos
7 months
MoSA’s advantages also hold on longer sequences. Our preliminary results on lengths up to 8192 show that MoSA consistently maintains its advantage over the compared sparse baselines.
1
0
2
@PiotrPiekosAI
Piotr Piękos
7 months
MoSA improves resource utilization in the practical setting as well. Even with a pure Torch implementation, when perplexity-matched to a dense baseline, it runs faster in wall-clock time, uses less memory, and needs a significantly smaller KV cache.
1
0
2
@PiotrPiekosAI
Piotr Piękos
7 months
The sparsity induced by MoSA helps even on standard-length sequences, reducing perplexity by up to 27% in the IsoFLOP setting compared to dense attention. Moreover, MoSA is the only sparse attention variant we tested that reduces perplexity in the standard-length setting.
1
0
2
@PiotrPiekosAI
Piotr Piękos
7 months
In MoSA, each attention head selects k tokens to process using Expert-Choice Routing. This creates a learnable, context-based sparse attention with reduced complexity from O(T^2) to O(T+k^2). The saved compute can be used to create a large set of specialized, sparse heads.
1
1
6
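A hedged single-head PyTorch sketch of the mechanism in the tweet above, under simplifying assumptions (sigmoid gating, one head, class and parameter names invented here): a small router scores all T tokens in O(T), the head keeps its top-k tokens via expert-choice selection, attends only among those k tokens in O(k^2) with a causal mask based on their original positions, and scatters gated outputs back to those positions. This is not the authors’ implementation; per the tweet, the saved compute goes into many such specialized sparse heads.

```python
# Minimal, hypothetical single-head sketch of MoSA-style sparse attention
# with expert-choice token selection (simplified, not the paper's code).
import torch

class SparseAttentionHead(torch.nn.Module):
    def __init__(self, d_model: int, d_head: int, k: int):
        super().__init__()
        self.k = k
        self.router = torch.nn.Linear(d_model, 1)    # scores every token: O(T)
        self.q = torch.nn.Linear(d_model, d_head)
        self.kproj = torch.nn.Linear(d_model, d_head)
        self.v = torch.nn.Linear(d_model, d_head)
        self.out = torch.nn.Linear(d_head, d_model)

    def forward(self, x):                             # x: (B, T, d_model)
        D = x.size(-1)
        scores = self.router(x).squeeze(-1)           # (B, T)
        top = scores.topk(self.k, dim=-1)             # the head picks its k tokens
        idx = top.indices                             # (B, k) original positions
        gate = torch.sigmoid(top.values).unsqueeze(-1)    # (B, k, 1), assumed gating

        sel = torch.gather(x, 1, idx.unsqueeze(-1).expand(-1, -1, D))
        q, kk, v = self.q(sel), self.kproj(sel), self.v(sel)

        # Causal mask based on the selected tokens' ORIGINAL positions.
        causal = idx.unsqueeze(-1) >= idx.unsqueeze(-2)           # (B, k, k)
        att = (q @ kk.transpose(-1, -2)) / q.size(-1) ** 0.5      # O(k^2) per head
        att = att.masked_fill(~causal, float("-inf")).softmax(-1)
        y = self.out(att @ v) * gate                              # (B, k, d_model)

        # Scatter the k outputs back to their original positions; the rest stay 0.
        full = torch.zeros_like(x)
        full.scatter_(1, idx.unsqueeze(-1).expand(-1, -1, D), y)
        return full

# Usage sketch: many such heads (with different routers) would run in parallel.
head = SparseAttentionHead(d_model=256, d_head=64, k=32)
y = head(torch.randn(2, 512, 256))                    # -> (2, 512, 256)
```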
@PiotrPiekosAI
Piotr Piękos
7 months
What if instead of a couple of dense attention heads, we use lots of sparse heads, each learning to select its own set of tokens to process? Introducing Mixture of Sparse Attention (MoSA)
1
4
19
@peter_richtarik
Peter Richtarik
7 months
One of my rejected ICML 2025 (@icmlconf) papers. Can anyone spot any criticism in the metareview? What a joke 🙃
26
25
518
@robert_csordas
Csordás Róbert
1 year
Come visit our poster "SwitchHead: Accelerating Transformers with Mixture-of-Experts Attention" on Thursday at 11 am in East Exhibit Hall A-C at #NeurIPS2024. With @PiotrPiekosAI, Kazuki Irie and @SchmidhuberAI.
2
20
51