
Xinyi Wu
@XinyiWu98
Followers 171
Following 72
Media 6
Statuses 13
PhD @mit. Making sense of the attention mechanism.
Cambridge, MA
Joined October 2024
I’ll be presenting our poster tomorrow (July 17 at 11am-1:30pm) at East Hall 3407! If you’re interested in the inductive biases of attention, natural language (and other types of data), and how their alignment enables learning and generalization, come by and let’s chat! Also feel…
🔍 Key finding #4: Even without explicit positional encodings, NoPE (no positional encoding) can still develop a sense of position. But it’s not the same as known explicit encodings like sinPE or RoPE: In controlled settings where training data is biased toward both the beginning…
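A toy sketch of how this can happen, assuming uniform attention over i.i.d. random values (an illustration, not the paper's analysis; the sizes n, d and the whole setup are made up): with a causal mask and no positional encoding, position i averages a prefix of length i + 1, so the output norm alone is already a usable position signal.

```python
# Toy NoPE demo (illustration only): uniform causal attention over i.i.d. values.
import numpy as np

rng = np.random.default_rng(0)
n, d = 32, 64                          # arbitrary sequence length and width
values = rng.standard_normal((n, d))   # no positional encoding anywhere

A = np.tril(np.ones((n, n)))           # causal mask: position i sees 0..i
A /= A.sum(axis=1, keepdims=True)      # uniform attention over the prefix
out = A @ values

# The norm at position i decays roughly like sqrt(d / (i + 1)), so the
# representation implicitly carries the token's position.
print(np.round(np.linalg.norm(out, axis=1), 2))
```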
🔍 Key finding #3: The architecture alone can induce position bias. In a controlled setup with no positional bias in the data, we still see position biases arising solely from the model’s inductive bias. Residual connections also affect the biases in nontrivial ways. (6/8)
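A rough way to poke at this claim (a sketch under assumptions, not the paper's controlled experiments: the tiny model, the sizes, and the gradient-norm probe are all made up here): feed position-symmetric random inputs through a randomly initialized causal attention stack and measure how strongly the last token's output depends on each input position, with and without residual connections.

```python
# Illustrative probe (not the paper's setup): per-position sensitivity of the
# last token under a randomly initialized causal attention stack.
import torch

def last_token_sensitivity(use_residual, d=32, n=16, num_layers=4):
    torch.manual_seed(0)  # same weights and inputs for both settings
    layers = [torch.nn.MultiheadAttention(d, num_heads=4, batch_first=True)
              for _ in range(num_layers)]
    causal = torch.triu(torch.ones(n, n, dtype=torch.bool), diagonal=1)
    x = torch.randn(1, n, d, requires_grad=True)  # no positional structure in the data
    h = x
    for attn in layers:
        out, _ = attn(h, h, h, attn_mask=causal, need_weights=False)
        h = h + out if use_residual else out
    # Gradient norm of the last token's output w.r.t. each input position.
    grad = torch.autograd.grad(h[0, -1].sum(), x)[0][0]
    return grad.norm(dim=-1)

print(last_token_sensitivity(use_residual=True))
print(last_token_sensitivity(use_residual=False))
```

Any systematic variation across positions here has to come from the architecture, since the inputs are exchangeable random vectors.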
🔍 Key finding #2: Relative positional encodings (RoPE, decay masks) compete with the causal mask. They add distance-based decay to attention. But across layers, early tokens still gain dominance due to repeated accumulation. (5/8)
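A small numpy sketch of the competition (illustration only; it uses an ALiBi-style linear distance penalty with a made-up slope rather than RoPE, and the sizes are arbitrary): one decayed layer favors the most recent tokens, but composing the same attention across layers still hands the first token a growing share.

```python
# Distance decay vs. depth (illustration): ALiBi-style penalty -lam * (i - j).
import numpy as np

def decayed_causal_attention(n, lam):
    """Row-stochastic causal attention with logits -lam * (i - j) for j <= i."""
    i, j = np.indices((n, n))
    logits = np.where(j <= i, -lam * (i - j), -np.inf)
    w = np.exp(logits - logits.max(axis=1, keepdims=True))
    return w / w.sum(axis=1, keepdims=True)

A = decayed_causal_attention(n=16, lam=0.5)
for depth in (1, 4, 16):
    last_row = np.linalg.matrix_power(A, depth)[-1]
    print(f"depth {depth:2d}: first token {last_row[0]:.3f}, newest token {last_row[-1]:.3f}")
```

At depth 1 the decay wins and most mass sits on the newest tokens; as layers stack, the first token's share keeps growing, which is the accumulation effect described in the tweet.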
🔍 Key finding #1: Causal masking alone induces strong bias toward earlier tokens. Why? It imposes a directional flow among tokens: deeper layers attend to increasingly contextualized versions of earlier tokens. This causes the first token in the sequence to act as a center node…
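To make the accumulation concrete, a minimal numpy sketch (an illustration, assuming perfectly uniform attention weights, which real models won't have): stacking uniform causal-attention layers concentrates the last token's dependence on position 0.

```python
# Uniform causal attention stacked over depth (illustration of the accumulation).
import numpy as np

n = 8
A = np.tril(np.ones((n, n)))           # causal mask: position i sees 0..i
A /= A.sum(axis=1, keepdims=True)      # uniform attention over the prefix

for depth in (1, 2, 4, 8):
    M = np.linalg.matrix_power(A, depth)
    print(depth, np.round(M[-1], 3))   # influence of each position on the last token
```

Position 0 only ever attends to itself, so under repeated averaging it behaves like an absorbing state; that is one way to read the "center node" role above.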
🚨Our new #ICML2025 paper is featured by MIT News! Why are LLMs lost in the middle? Why do transformers systematically favor tokens at certain positions in a sequence? We had vague intuitions but no rigorous understanding, until now! We develop a…
news.mit.edu
MIT researchers discovered the underlying cause of position bias, a phenomenon that causes large language models to overemphasize the beginning or end of a document or conversation, while neglecting...