Xinyi Wu Profile
Xinyi Wu

@XinyiWu98

Followers: 171 · Following: 72 · Media: 6 · Statuses: 13

PhD @mit. Making sense of attention mechanism.

Cambridge, MA
Joined October 2024
@XinyiWu98
Xinyi Wu
2 months
I’ll be presenting our poster tomorrow (July 17, 11am-1:30pm) at East Hall 3407! If you’re interested in the inductive biases of attention, natural language (and other types of data), and how their alignment enables learning and generalization, come by and let’s chat! Also feel…
@XinyiWu98
Xinyi Wu
3 months
There are many other interesting discussions in the paper. For example, attention sinks tend to emerge at specific positions: they are actually center nodes in the graph induced by the mask. It’s exciting to see how our perspective helps connect and unify many intriguing findings.
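A minimal numpy sketch of this "center node" reading (my illustration, not the paper's code): treat the mask as a directed graph and score each position by how many multi-hop attention paths terminate there. The sliding-window-plus-first-token mask is an assumed example of a sink-style mask.

```python
import numpy as np

n, window = 8, 3
# mask[i, j] = 1 iff query i may attend to key j: causal sliding window,
# with token 0 kept visible to every query (a sink-style masking rule)
mask = np.zeros((n, n))
for i in range(n):
    mask[i, max(0, i - window + 1):i + 1] = 1
    mask[i, 0] = 1

# matrix_power(mask, L)[i, j] counts length-L directed walks from query i
# to key j; column sums give a crude centrality score per position
L = 4
centrality = np.linalg.matrix_power(mask, L).sum(axis=0)
print(centrality)  # position 0 dominates: a center node of the graph
```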
@XinyiWu98
Xinyi Wu
3 months
🔍 Key finding #4: Even without explicit positional encodings, NoPE (no positional encoding) can still develop a sense of position. But it’s not the same as known explicit encodings like sinPE or RoPE: in controlled settings where training data is biased toward both the beginning…
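A hedged sketch of how NoPE can leak position (my toy model, not the paper's experiment): under causal uniform attention, token i averages i+1 values, so the spread of its hidden state shrinks with position and position becomes decodable without any explicit encoding.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, trials = 32, 64, 200
norms = np.zeros((trials, n))
for t in range(trials):
    x = rng.normal(size=(n, d))  # i.i.d. token values, no positional encoding
    # one causal uniform-attention layer: h_i = mean of x_0 ... x_i
    h = np.cumsum(x, axis=0) / np.arange(1, n + 1)[:, None]
    norms[t] = np.linalg.norm(h, axis=1)

# The average norm decays roughly like 1/sqrt(i + 1): position is
# recoverable from the representation, but via a different mechanism
# than sinPE or RoPE.
print(np.round(norms.mean(axis=0)[:8], 2))
```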
@XinyiWu98
Xinyi Wu
3 months
🔍 Key finding #3: The architecture alone can induce position bias. In a controlled setup with no positional bias in the data, we still see position biases arising solely from the model’s inductive bias. Residual connections also affect the biases in nontrivial ways. (6/8)
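A toy linear sketch of the residual effect (an assumed setup, not the paper's): with uniform causal attention each layer is a fixed matrix A, so stacking L layers without residuals gives A^L and with averaged residuals gives ((A + I)/2)^L; comparing the two shows residuals reshaping the bias even on position-free data.

```python
import numpy as np

n, L = 8, 6
A = np.tril(np.ones((n, n)))
A /= A.sum(axis=1, keepdims=True)  # uniform attention over visible tokens

no_res = np.linalg.matrix_power(A, L)[-1]
res = np.linalg.matrix_power((A + np.eye(n)) / 2, L)[-1]

# Influence of each input position on the last token's representation:
print("no residual:", np.round(no_res, 3))  # mass piles up on token 0
print("residual:   ", np.round(res, 3))     # more weight stays on recent tokens
```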
@XinyiWu98
Xinyi Wu
3 months
🔍 Key finding #2: Relative positional encodings (RoPE, decay masks) compete with the causal mask. They add distance-based decay to attention. But across layers, early tokens still gain dominance due to repeated accumulation. (5/8)
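A sketch of that competition, using an ALiBi-style linear decay as a stand-in for relative encodings (my example, not the paper's): within one layer the decay makes attention local, but composing layers still funnels influence toward early tokens.

```python
import numpy as np

def decay_causal_attention(n, slope=0.5):
    i, j = np.indices((n, n))
    # distance-based decay on visible (causal) positions, -inf elsewhere
    scores = np.where(j <= i, -slope * (i - j), -np.inf)
    w = np.exp(scores - scores.max(axis=1, keepdims=True))
    return w / w.sum(axis=1, keepdims=True)

n, L = 8, 6
A = decay_causal_attention(n)
print(np.round(A[-1], 3))                             # one layer: local
print(np.round(np.linalg.matrix_power(A, L)[-1], 3))  # L layers: early tokens gain
```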
@XinyiWu98
Xinyi Wu
3 months
🔍 Key finding #1: Causal masking alone induces strong bias toward earlier tokens. Why? It imposes a directional flow among tokens: deeper layers attend to increasingly contextualized versions of earlier tokens. This causes the first token in the sequence to act as a center node.
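A minimal sketch of the compounding effect (uniform attention is my simplifying assumption): track how much of the last token's representation traces back to token 0 as depth grows.

```python
import numpy as np

n = 8
A = np.tril(np.ones((n, n)))
A /= A.sum(axis=1, keepdims=True)  # uniform causal attention

for L in (1, 2, 4, 8):
    share = np.linalg.matrix_power(A, L)[-1, 0]
    print(f"L={L}: token 0 share = {share:.3f}")
# The share grows with depth: every multi-hop attention path can route
# through token 0, which only ever attends to itself.
```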
@XinyiWu98
Xinyi Wu
3 months
We use graphs to reason about position bias! 🧠 Transformers = message-passing GNNs over tokens. Yet in practice, not all tokens can attend to all other tokens. To take this constraint into account, we treat the attention mask as a directed graph: an edge from token j to i means…
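A tiny sketch of this viewpoint (my code, not the paper's): read the mask as a directed graph with an edge j -> i whenever query i may attend to key j, so information flows from j to i and one attention layer is one round of message passing.

```python
import numpy as np

n = 4
mask = np.tril(np.ones((n, n), dtype=bool))  # causal: i attends to j <= i
edges = [(j, i) for i in range(n) for j in range(n) if mask[i, j]]
print(edges)
# [(0, 0), (0, 1), (1, 1), (0, 2), (1, 2), (2, 2), (0, 3), (1, 3), (2, 3), (3, 3)]
# A depth-L transformer moves information along length-L directed paths
# of this graph.
```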
@XinyiWu98
Xinyi Wu
3 months
Transformers are supposed to be permutation-equivariant, yet LLMs show strong position bias:
– “Lost in the middle”
– Order sensitivity in in-context learning / LLM-as-a-judge
– Attention sinks
Why does this matter? LLMs are black boxes. Aside from satisfying our scientific…
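A quick numerical check of the equivariance claim (a standard construction, not from the paper): a single unmasked self-attention layer with no positional encoding commutes with row permutations of the input.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 6, 16
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

def attention(x):
    s = (x @ Wq) @ (x @ Wk).T / np.sqrt(d)
    w = np.exp(s - s.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)  # row-wise softmax
    return w @ (x @ Wv)

x = rng.normal(size=(n, d))
perm = rng.permutation(n)
print(np.allclose(attention(x[perm]), attention(x)[perm]))  # True
# Adding a causal mask (or positional encoding) breaks this symmetry,
# which is exactly where position bias can enter.
```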
@XinyiWu98
Xinyi Wu
3 months
🚨 Our new #ICML2025 paper is featured by MIT News! Why are LLMs lost in the middle? Why do transformers systematically favor tokens at certain positions in a sequence? We had vague intuitions but no rigorous understanding, until now! We develop a…
news.mit.edu
MIT researchers discovered the underlying cause of position bias, a phenomenon that causes large language models to overemphasize the beginning or end of a document or conversation, while neglecting...