Xinyi Wu Profile
Xinyi Wu

@XinyiWu98

Followers: 171 · Following: 72 · Media: 6 · Statuses: 13

PhD @mit. Making sense of attention mechanism.

Cambridge, MA
Joined October 2024
@XinyiWu98
Xinyi Wu
2 months
I’ll be presenting our poster tomorrow (July 17, 11am-1:30pm) at East Hall 3407! If you’re interested in the inductive biases of attention, natural language (and other types of data), and how their alignment enables learning and generalization, come by and let’s chat! Also feel…
@XinyiWu98
Xinyi Wu
3 months
There are many other interesting discussions in the paper. For example, attention sinks tend to emerge at specific positions: they are actually center nodes in the graph induced by the mask. It’s exciting to see how our perspective helps connect and unify many intriguing findings.
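A minimal numpy sketch of this "center node" reading (my illustration, not the paper's code): treat the mask as a directed graph and score each position by how many multi-hop attention paths terminate there. The sliding-window-plus-first-token mask is an assumed example of a sink-style mask.

```python
import numpy as np

n, window = 8, 3
# mask[i, j] = 1 iff query i may attend to key j: causal sliding window,
# with token 0 kept visible to every query (a sink-style masking rule)
mask = np.zeros((n, n))
for i in range(n):
    mask[i, max(0, i - window + 1):i + 1] = 1
    mask[i, 0] = 1

# matrix_power(mask, L)[i, j] counts length-L directed walks from query i
# to key j; column sums give a crude centrality score per position
L = 4
centrality = np.linalg.matrix_power(mask, L).sum(axis=0)
print(centrality)  # position 0 dominates: a center node of the graph
```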
@XinyiWu98
Xinyi Wu
3 months
🔍 Key finding #4: Even without explicit positional encodings, NoPE (no positional encoding) can still develop a sense of position. But it’s not the same as known explicit encodings like sinPE or RoPE: in controlled settings where training data is biased toward both the beginning…
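A hedged sketch of how NoPE can leak position (my toy model, not the paper's experiment): under causal uniform attention, token i averages i+1 values, so the spread of its hidden state shrinks with position and position becomes decodable without any explicit encoding.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, trials = 32, 64, 200
norms = np.zeros((trials, n))
for t in range(trials):
    x = rng.normal(size=(n, d))  # i.i.d. token values, no positional encoding
    # one causal uniform-attention layer: h_i = mean of x_0 ... x_i
    h = np.cumsum(x, axis=0) / np.arange(1, n + 1)[:, None]
    norms[t] = np.linalg.norm(h, axis=1)

# The average norm decays roughly like 1/sqrt(i + 1): position is
# recoverable from the representation, but via a different mechanism
# than sinPE or RoPE.
print(np.round(norms.mean(axis=0)[:8], 2))
```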
@XinyiWu98
Xinyi Wu
3 months
🔍 Key finding #3: The architecture alone can induce position bias. In a controlled setup with no positional bias in the data, we still see position biases arising solely from the model’s inductive bias. Residual connections also affect the biases in nontrivial ways. (6/8)
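A toy linear sketch of the residual effect (an assumed setup, not the paper's): with uniform causal attention each layer is a fixed matrix A, so stacking L layers without residuals gives A^L and with averaged residuals gives ((A + I)/2)^L; comparing the two shows residuals reshaping the bias even on position-free data.

```python
import numpy as np

n, L = 8, 6
A = np.tril(np.ones((n, n)))
A /= A.sum(axis=1, keepdims=True)  # uniform attention over visible tokens

no_res = np.linalg.matrix_power(A, L)[-1]
res = np.linalg.matrix_power((A + np.eye(n)) / 2, L)[-1]

# Influence of each input position on the last token's representation:
print("no residual:", np.round(no_res, 3))  # mass piles up on token 0
print("residual:   ", np.round(res, 3))     # more weight stays on recent tokens
```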
@XinyiWu98
Xinyi Wu
3 months
🔍 Key finding #2: Relative positional encodings (RoPE, decay masks) compete with the causal mask. They add distance-based decay to attention. But across layers, early tokens still gain dominance due to repeated accumulation. (5/8)
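A sketch of that competition, using an ALiBi-style linear decay as a stand-in for relative encodings (my example, not the paper's): within one layer the decay makes attention local, but composing layers still funnels influence toward early tokens.

```python
import numpy as np

def decay_causal_attention(n, slope=0.5):
    i, j = np.indices((n, n))
    # distance-based decay on visible (causal) positions, -inf elsewhere
    scores = np.where(j <= i, -slope * (i - j), -np.inf)
    w = np.exp(scores - scores.max(axis=1, keepdims=True))
    return w / w.sum(axis=1, keepdims=True)

n, L = 8, 6
A = decay_causal_attention(n)
print(np.round(A[-1], 3))                             # one layer: local
print(np.round(np.linalg.matrix_power(A, L)[-1], 3))  # L layers: early tokens gain
```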
@XinyiWu98
Xinyi Wu
3 months
🔍 Key finding #1: Causal masking alone induces strong bias toward earlier tokens. Why? It imposes a directional flow among tokens: deeper layers attend to increasingly contextualized versions of earlier tokens. This causes the first token in the sequence to act as a center node.
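A minimal sketch of the compounding effect (uniform attention is my simplifying assumption): track how much of the last token's representation traces back to token 0 as depth grows.

```python
import numpy as np

n = 8
A = np.tril(np.ones((n, n)))
A /= A.sum(axis=1, keepdims=True)  # uniform causal attention

for L in (1, 2, 4, 8):
    share = np.linalg.matrix_power(A, L)[-1, 0]
    print(f"L={L}: token 0 share = {share:.3f}")
# The share grows with depth: every multi-hop attention path can route
# through token 0, which only ever attends to itself.
```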
@XinyiWu98
Xinyi Wu
3 months
We use graphs to reason about position bias! 🧠 Transformers = message-passing GNNs over tokens. Yet in practice, not all tokens can attend to all other tokens. To take this constraint into account, we treat the attention mask as a directed graph: an edge from token j to i means…
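A tiny sketch of this viewpoint (my code, not the paper's): read the mask as a directed graph with an edge j -> i whenever query i may attend to key j, so information flows from j to i and one attention layer is one round of message passing.

```python
import numpy as np

n = 4
mask = np.tril(np.ones((n, n), dtype=bool))  # causal: i attends to j <= i
edges = [(j, i) for i in range(n) for j in range(n) if mask[i, j]]
print(edges)
# [(0, 0), (0, 1), (1, 1), (0, 2), (1, 2), (2, 2), (0, 3), (1, 3), (2, 3), (3, 3)]
# A depth-L transformer moves information along length-L directed paths
# of this graph.
```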
@XinyiWu98
Xinyi Wu
3 months
Transformers are supposed to be permutation-equivariant, yet LLMs show strong position bias:
– “Lost in the middle”
– Order sensitivity in in-context learning / LLM-as-a-judge
– Attention sinks
Why does this matter? LLMs are black boxes. Aside from satisfying our scientific…
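A quick numerical check of the equivariance claim (a standard construction, not from the paper): a single unmasked self-attention layer with no positional encoding commutes with row permutations of the input.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 6, 16
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

def attention(x):
    s = (x @ Wq) @ (x @ Wk).T / np.sqrt(d)
    w = np.exp(s - s.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)  # row-wise softmax
    return w @ (x @ Wv)

x = rng.normal(size=(n, d))
perm = rng.permutation(n)
print(np.allclose(attention(x[perm]), attention(x)[perm]))  # True
# Adding a causal mask (or positional encoding) breaks this symmetry,
# which is exactly where position bias can enter.
```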
@XinyiWu98
Xinyi Wu
3 months
🚨 Our new #ICML2025 paper is featured by MIT News! Why are LLMs lost in the middle? Why do transformers systematically favor tokens at certain positions in a sequence? We had vague intuitions but no rigorous understanding, until now! We develop a…
news.mit.edu
MIT researchers discovered the underlying cause of position bias, a phenomenon that causes large language models to overemphasize the beginning or end of a document or conversation, while neglecting...