PapersAnon

@papers_anon

Followers
2K
Following
9K
Media
316
Statuses
662

Just a fan of acceleration. I read and post interesting papers. Let's all make it through.

SAITAMA
Joined February 2024
@papers_anon
PapersAnon
1 year
https://t.co/CJC3YWPoB6 Various links for ML and local models (not just LLMs), kept fairly up to date. https://t.co/5pLfM330hp ML papers I've read that I think are interesting. I also keep a text file of all the abstracts at the top for easy searching.
rentry.org
/lmg/ Abstracts Search (Current as of the end of 11/2025) Links Google Papers Blog 12/2017 Attention Is All You Need (Transformers) 10/2018 BERT: Pre-training of Deep Bidirectional Transformers for...
1
17
141
@papers_anon
PapersAnon
8 hours
RePo: Language Models with Context Re-Positioning From Sakana AI. Novel mechanism that utilizes a differentiable module, fϕ, to assign token positions that capture contextual dependencies, rather than relying on a pre-defined integer range. Links below
1
0
6
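A rough sketch of the idea as I read the abstract, not Sakana AI's implementation: a small learned module (standing in for fϕ) maps token hidden states to continuous, monotone positions, and RoPE consumes those instead of the fixed integers 0..T-1. All module names and sizes below are illustrative.

```python
# Hedged sketch of context re-positioning; names and shapes are assumptions.
import torch
import torch.nn as nn

class ContextRePositioner(nn.Module):
    def __init__(self, d_model: int):
        super().__init__()
        # stand-in for f_phi: tiny MLP predicting a positive position increment per token
        self.f_phi = nn.Sequential(nn.Linear(d_model, d_model // 4),
                                   nn.GELU(),
                                   nn.Linear(d_model // 4, 1))

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (batch, seq, d_model) token hidden states
        step = torch.nn.functional.softplus(self.f_phi(h)).squeeze(-1)  # (B, T), > 0
        # cumulative sum keeps positions monotone but context-dependent
        return step.cumsum(dim=-1)                                      # (B, T), real-valued

def rope_angles(pos: torch.Tensor, dim: int, base: float = 10000.0) -> torch.Tensor:
    # Standard RoPE frequencies, evaluated at learned real-valued positions.
    freqs = base ** (-torch.arange(0, dim, 2, dtype=torch.float32) / dim)
    return pos[..., None] * freqs  # (B, T, dim/2)

h = torch.randn(2, 16, 64)
pos = ContextRePositioner(64)(h)
print(rope_angles(pos, dim=32).shape)  # torch.Size([2, 16, 16])
```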
@papers_anon
PapersAnon
8 days
Group Representational Position Encoding Unified framework for positional encoding based on group actions. Supplies a principled design space for positional geometry in long-context models, subsuming RoPE and ALiBi as special cases. Links below
1
0
3
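For context on the group-action framing, a hedged illustration of how the two special cases the paper subsumes already act on attention: RoPE acts by per-frequency 2D rotations on queries/keys, ALiBi by an additive shift on the logits. This is background PyTorch, not the paper's unified encoding.

```python
# Illustrative only: RoPE as a rotation action, ALiBi as a translation on logits.
import torch

def rope_action(x: torch.Tensor, pos: int, base: float = 10000.0) -> torch.Tensor:
    # x: (..., d) with d even; rotate each 2D pair by an angle proportional to pos.
    d = x.shape[-1]
    theta = pos * base ** (-torch.arange(0, d, 2, dtype=torch.float32) / d)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    rot = torch.empty_like(x)
    rot[..., 0::2] = x1 * torch.cos(theta) - x2 * torch.sin(theta)
    rot[..., 1::2] = x1 * torch.sin(theta) + x2 * torch.cos(theta)
    return rot

def alibi_bias(q_pos: int, k_pos: int, slope: float = 0.5) -> float:
    # ALiBi: additive bias on the logit, linear in query-key distance.
    return -slope * abs(q_pos - k_pos)

q, k = torch.randn(8), torch.randn(8)
# Relative property: the RoPE logit depends only on the position difference.
logit_a = rope_action(q, 5) @ rope_action(k, 3)
logit_b = rope_action(q, 7) @ rope_action(k, 5)
print(torch.allclose(logit_a, logit_b, atol=1e-5), alibi_bias(5, 3))
```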
@papers_anon
PapersAnon
23 days
Why Do Language Model Agents Whistleblow? Introduce WhistleBench, a dataset designed to evaluate language models’ propensity for whistleblowing behavior. Found that Opus 4.1, Gemini 2.5 Pro, and Grok 4 were the most likely to whistleblow, while GPT 4.1/5 and Llama Maverick never did. Links below
1
0
5
@papers_anon
PapersAnon
29 days
P1: Mastering Physics Olympiads with Reinforcement Learning Combination of train-time scaling via RL post-training and test-time scaling via an agentic framework on top of Qwen3 models to achieve Gold-medal performance at the latest International Physics Olympiad. Links below
2
7
55
@papers_anon
PapersAnon
1 month
Virtual Width Networks From ByteDance. Decouples representational width from backbone width, expanding the embedding space while keeping backbone compute near constant. An 8× expansion accelerates optimization by over 2× for next-token and 3× for next-2-token prediction. Links below
3
18
139
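A minimal sketch of how I read the decoupling, assuming the obvious down/up projections around each block so the wide residual stream never enters the attention/FFN compute; ByteDance's actual architecture likely differs in detail.

```python
# Hedged sketch: wide embedding/residual stream, backbone-width transformer block.
import torch
import torch.nn as nn

class VirtualWidthBlock(nn.Module):
    def __init__(self, d_wide: int, d_model: int, n_heads: int = 4):
        super().__init__()
        self.down = nn.Linear(d_wide, d_model, bias=False)   # wide -> backbone width
        self.block = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.up = nn.Linear(d_model, d_wide, bias=False)      # backbone -> wide residual

    def forward(self, x_wide: torch.Tensor) -> torch.Tensor:
        # Residual stays at the wide width; block compute stays at d_model.
        return x_wide + self.up(self.block(self.down(x_wide)))

d_model, expansion = 256, 8       # "8x expansion" of the representational width
layer = VirtualWidthBlock(d_wide=d_model * expansion, d_model=d_model)
tokens = torch.randn(2, 32, d_model * expansion)
print(layer(tokens).shape)        # torch.Size([2, 32, 2048])
```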
@papers_anon
PapersAnon
1 month
Optimizing Mixture of Block Attention Introduce FlashMoBA, a hardware-aware CUDA kernel that enables efficient MoBA execution with the small block sizes that were previously inefficient on GPU. Achieves up to 14.7× speedup over FlashAttention-2 for small blocks. Links below
2
20
136
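For reference, the routing that FlashMoBA accelerates looks roughly like this in plain PyTorch: each query is gated against mean-pooled block keys and attends only to its top-k blocks. The real gains come from the CUDA kernel; block_size and top_k here are illustrative and causal masking is omitted.

```python
# Reference-style MoBA routing sketch, not the FlashMoBA kernel.
import torch
import torch.nn.functional as F

def moba_attention(q, k, v, block_size: int = 16, top_k: int = 2):
    # q, k, v: (seq, d). Keys/values are split into blocks; each query attends
    # only to its top_k blocks, scored against mean-pooled block keys.
    T, d = k.shape
    n_blocks = T // block_size
    k_blocks = k[: n_blocks * block_size].view(n_blocks, block_size, d)
    v_blocks = v[: n_blocks * block_size].view(n_blocks, block_size, d)
    block_keys = k_blocks.mean(dim=1)                          # (n_blocks, d)
    gate = q @ block_keys.T                                    # (T, n_blocks)
    chosen = gate.topk(top_k, dim=-1).indices                  # (T, top_k)
    out = torch.zeros_like(q)
    for i in range(q.shape[0]):                                # per-query gather (slow, but clear)
        ks = k_blocks[chosen[i]].reshape(-1, d)                # (top_k * block_size, d)
        vs = v_blocks[chosen[i]].reshape(-1, d)
        attn = F.softmax(q[i] @ ks.T / d**0.5, dim=-1)
        out[i] = attn @ vs
    return out

q = torch.randn(64, 32); k = torch.randn(64, 32); v = torch.randn(64, 32)
print(moba_attention(q, k, v).shape)  # torch.Size([64, 32])
```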
@papers_anon
PapersAnon
1 month
LeJEPA: Provable and Scalable Self-Supervised Learning Without the Heuristics Identifies the isotropic Gaussian as the optimal distribution for JEPAs’ embeddings and introduces a novel objective, SIGReg, to constrain embeddings toward that ideal distribution. Links below
3
1
3
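A loose illustration of the constraint SIGReg targets: push embeddings toward an isotropic Gaussian by checking random 1D projections against N(0, 1), here with simple moment penalties. The paper's actual statistic differs, so treat this purely as a sketch of the idea.

```python
# Illustrative isotropy penalty via random projections; not the SIGReg objective itself.
import torch

def isotropic_gaussian_penalty(z: torch.Tensor, n_proj: int = 64) -> torch.Tensor:
    # z: (batch, d) embeddings. If z were isotropic Gaussian, every unit-direction
    # projection would be ~N(0, 1).
    d = z.shape[-1]
    dirs = torch.randn(d, n_proj)
    dirs = dirs / dirs.norm(dim=0, keepdim=True)
    p = z @ dirs                                   # (batch, n_proj) 1D projections
    mean_pen = p.mean(dim=0).pow(2).mean()         # mean should be 0
    var_pen = (p.var(dim=0) - 1).pow(2).mean()     # variance should be 1
    kurt = ((p - p.mean(dim=0)) ** 4).mean(dim=0) / p.var(dim=0) ** 2
    kurt_pen = (kurt - 3).pow(2).mean()            # Gaussian kurtosis is 3
    return mean_pen + var_pen + kurt_pen

z = torch.randn(512, 128)                          # already ~isotropic Gaussian
print(isotropic_gaussian_penalty(z))               # small value
```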
@papers_anon
PapersAnon
1 month
FedMuon: Accelerating Federated Learning with Matrix Orthogonalization Structure-aware federated optimizer that addresses the core challenges of non-IID data by coupling matrix-orthogonalized local updates with local-global alignment and cross-round momentum aggregation. Links below
1
5
62
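A sketch of the matrix-orthogonalization ingredient, assuming it follows the Muon-style Newton-Schulz iteration applied to a local update before communication; the federated pieces (local-global alignment, cross-round momentum) are omitted.

```python
# Muon-style Newton-Schulz orthogonalization of a local update (sketch).
import torch

def newton_schulz_orthogonalize(update: torch.Tensor, steps: int = 5) -> torch.Tensor:
    # Approximately replaces `update` by the nearest semi-orthogonal matrix
    # (U V^T from its SVD) without computing an SVD explicitly.
    a, b, c = 3.4445, -4.7750, 2.0315              # common Muon coefficients
    x = update / (update.norm() + 1e-7)
    transposed = x.shape[0] > x.shape[1]
    if transposed:
        x = x.T
    for _ in range(steps):
        s = x @ x.T
        x = a * x + (b * s + c * (s @ s)) @ x
    return x.T if transposed else x

local_update = torch.randn(256, 512)               # e.g. one layer's delta after local steps
ortho = newton_schulz_orthogonalize(local_update)
print((ortho @ ortho.T).diagonal().mean())          # ~1: rows are near-orthonormal
```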
@papers_anon
PapersAnon
2 months
INT vs. FP: A Comprehensive Study of Fine-Grained Low-bit Quantization Formats From ByteDance. Found that MXINT8 is superior to its FP counterpart in algorithmic accuracy and hardware efficiency; also introduces a symmetric clipping method that resolves gradient bias. Links below
1
5
28
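A rough sketch of MX-style fine-grained INT8 quantization, assuming 32-element blocks that share one power-of-two scale with symmetric clamping to ±127; the paper's exact clipping method is not reproduced here.

```python
# Hedged sketch of MX-style block INT8 quantization (shared power-of-two scale).
import torch

def mxint8_quantize(x: torch.Tensor, block: int = 32):
    # x: 1-D tensor whose length is a multiple of `block`.
    xb = x.view(-1, block)
    amax = xb.abs().amax(dim=1, keepdim=True).clamp(min=1e-12)
    scale = 2.0 ** torch.ceil(torch.log2(amax / 127.0))   # shared power-of-two scale per block
    q = torch.clamp(torch.round(xb / scale), -127, 127).to(torch.int8)
    return q, scale

def mxint8_dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return (q.float() * scale).view(-1)

x = torch.randn(1024)
q, scale = mxint8_quantize(x)
err = (mxint8_dequantize(q, scale) - x).abs().mean()
print(q.dtype, scale.shape, float(err))                    # torch.int8, (32, 1), small error
```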
@papers_anon
PapersAnon
2 months
FlexLink: Boosting your NVLink Bandwidth by 27% without accuracy concern From Ant Group. Proposes a collective communication framework designed to aggregate heterogeneous links—NVLink, PCIe, and RDMA NICs—into a single, high-performance communication fabric. Links below
1
0
9
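A toy illustration of the link-aggregation idea: split a collective's payload across heterogeneous links in proportion to assumed bandwidths so all channels finish at roughly the same time. FlexLink's actual scheduling is certainly more involved, and the bandwidth numbers are placeholders.

```python
# Toy bandwidth-proportional payload split; not FlexLink's scheduler.
def split_payload(payload_bytes: int, link_gbps: dict[str, float]) -> dict[str, int]:
    total = sum(link_gbps.values())
    shares = {name: int(payload_bytes * bw / total) for name, bw in link_gbps.items()}
    # Give any rounding remainder to the fastest link.
    fastest = max(link_gbps, key=link_gbps.get)
    shares[fastest] += payload_bytes - sum(shares.values())
    return shares

links = {"nvlink": 400.0, "pcie": 64.0, "rdma_nic": 100.0}   # GB/s, illustrative
print(split_payload(256 * 2**20, links))                      # bytes per link for a 256 MiB buffer
```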