Tilde @tilderesearch X Profile

Tilde

@tilderesearch

Followers

2K

Following

95

Media

17

Statuses

44

Doing cool things.

Joined July 2024

Tilde

@tilderesearch

10 days

Sparse attention (MoBA/NSA) trains faster & beats full attention in key tasks. But we’ve had no idea how they truly work…until now. 🔍 We reverse-engineered them to uncover:
- Novel attention patterns
- Hidden "attention sinks"
- Better performance
- And more
A 🧵… ~1/8~

5

80

404

Tilde

@tilderesearch

10 days

Read the full post here:

0

1

22

Tilde

@tilderesearch

10 days

~8/8~ We release our NSA kernel for experimentation and research here: At Tilde, we believe interpretability is the path towards building better models. If that sounds cool, reach out!

1

2

30

Tilde

@tilderesearch

10 days

~7/8~ We analyzed the gating distributions for NSA models and found we can ablate many branches without compromising model performance! Our principled ablations enabled massive gains in throughput without losses in performance.

2

3

38

Tilde

@tilderesearch

10 days

~6/8~ We investigated the key geometry for different attention models, and also found that KV-sharing through GQA strongly influences the resulting query/key manifolds. This offered a clue into how sparse attention could be beating base attention → by removing cross-head

1

0

29

Tilde

@tilderesearch

10 days

~5/8~ We identified the novel mechanisms of attention sinking in sparse attention models. We found MoBA models periodically re-inject attention sinks and that these sinks can be easily identified using value norm and attention score heuristics.

1

25
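The value-norm and attention-score heuristics mentioned in ~5/8~ can be sketched roughly as follows. This is a minimal illustration, not Tilde's actual method: the function name `find_attention_sinks`, both thresholds, and the rule "receives a lot of attention while carrying a small value vector" are assumptions made for the example.

```python
import numpy as np

def find_attention_sinks(attn, values, attn_thresh=0.3, norm_ratio=0.5):
    """Flag key positions that look like attention sinks.

    attn:   (T, T) causal attention weights for one head, rows sum to 1.
    values: (T, d) value vectors for the same head.
    Both thresholds are illustrative assumptions, not values from the post.
    """
    T = attn.shape[0]
    # Under a causal mask, key j is visible to queries j..T-1.
    n_visible = T - np.arange(T)
    # Average attention each key receives from the queries that can see it.
    mean_received = attn.sum(axis=0) / n_visible
    # Heuristic: sink tokens tend to have unusually small value norms.
    value_norms = np.linalg.norm(values, axis=1)
    small_norm = value_norms < norm_ratio * np.median(value_norms)
    return np.flatnonzero((mean_received > attn_thresh) & small_norm)
```

A key is flagged only when both conditions hold, so a token that is heavily attended but carries a normal-sized value vector is not reported.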

Tilde

@tilderesearch

10 days

~4/8~ We visualized the first-ever long-context attention maps for sparse attention, revealing fascinating patterns and attention circuits.

1

0

24

Tilde

@tilderesearch

10 days

~3/8~ We trained dozens of sparse attention models and poked around in their brains. Sparse attention models boast superior long-context generalization out of the box, even with 80% sparsity in attention scores.

1

0

27

Tilde

@tilderesearch

10 days

📖 Read the full post here: ~2/8~ Sparse attention exploits inherent sparsity in model attention patterns to dramatically accelerate sequence mixing. Natively trainable approaches, such as Kimi’s MoBA and DeepSeek’s NSA, expand the Pareto frontier by

1

5

41
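To make the idea in ~2/8~ concrete, here is a minimal sketch of block-sparse attention for a single query, loosely in the spirit of MoBA-style block selection. It is an illustration under stated assumptions, not the MoBA or NSA implementation: the function name `block_sparse_attention`, the mean-pooled block score, and the defaults `block_size=4` and `top_k=2` are all hypothetical choices for the example.

```python
import numpy as np

def block_sparse_attention(q, K, V, block_size=4, top_k=2):
    """Attend from one query to only the top_k highest-scoring key blocks.

    q: (d,) query for the final position; K, V: (T, d) with T divisible
    by block_size. Blocks are scored with mean-pooled keys (an assumed,
    MoBA-like gate), and the query's own block is always kept.
    """
    T, d = K.shape
    n_blocks = T // block_size
    # One representative key per block: the mean of its keys.
    pooled = K.reshape(n_blocks, block_size, d).mean(axis=1)
    scores = pooled @ q
    scores[-1] = np.inf  # always keep the query's own (last) block
    chosen = np.sort(np.argsort(scores)[-top_k:])
    # Expand the chosen block ids into token indices.
    idx = (chosen[:, None] * block_size + np.arange(block_size)).ravel()
    # Ordinary softmax attention, restricted to the selected tokens.
    logits = K[idx] @ q / np.sqrt(d)
    w = np.exp(logits - logits.max())
    w /= w.sum()
    return w @ V[idx], chosen
```

With `top_k` equal to the number of blocks this reduces to dense attention for that query; smaller `top_k` skips whole blocks of keys, which is where the compute savings come from.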

Tilde

@tilderesearch

4 months

~6/6~ We would like to credit this LessWrong post for inspiration. Big shoutout to the @NebiusAI solutions team for their assistance in testing on their platform.

0

12

Tilde

@tilderesearch

4 months

~5/6~ We open-source our code, which can be found below. We encourage PRs and active maintainers, as Activault is a community tool.

1

0

12

Tilde

@tilderesearch

4 months

~4/6~ Here’s how Activault compares to other common methods of handling activation data. Activault achieves cost-efficiency without compromising on performance.

1

0

6

Tilde

@tilderesearch

4 months

~3/6~ Activault provides a flexible and efficient interface for managing activation data on shared object storage solutions (S3). S3 storage is cheap and easily shareable, which promotes reproducibility in the field.

1

0

7

Tilde

@tilderesearch

4 months

~2/6~ Training interpreter models requires tens of billions of activations, constituting several petabytes of data that must be dynamically generated, stored, and read. As such, interpretability-at-scale on frontier open-weight LMs remains out of reach for the majority of

1

0

7

Tilde

@tilderesearch

4 months

Today, we open-source Activault, a simple, high-throughput, and cost-effective solution to activation data management for accelerating interpretability research on frontier models. A 🧵… ~1/6~

8

14

87

Tilde

@tilderesearch

4 months

And if you don’t like graph theory, but do like interpretability, we have plenty of other fun problems, so feel free to email us at join@tilderesearch.com. We are doing a lot of applied interpretability work like this: which was the first application of

0

1

4

Tilde

@tilderesearch

4 months

Over the past few weeks, we've been using this graph theory problem in interviews and figured we'd open it up to everyone here! If you solve it, we’ll move you directly to the last rounds of our process!

4

39

Tilde

@tilderesearch

7 months

Thank you to @ArthurConmy, @NeelNanda5, and @StephenLCasper for their comments and suggestions during the drafting process. This blog post is joint work with @a_karvonen, and the task is derived from Benchify.

1

0

22

Tilde

@tilderesearch

7 months

~8/8~ In the spirit of open science and reproducibility, we release every component involved in our study (model, data, evaluation results, etc.) and an extensive set of analyses and visualizations with our blog post. Make sure to check it out!

1

0

20

Tilde

@tilderesearch

7 months

~7/8~ Though there remains much for future work, our results show that the gap between theoretical interpretability research and practical applications is closing rapidly. We are excited to release further developments toward fine-grained levers for model control enabled by.

1

0

12