Tilde

@tilderesearch

2K Followers · 95 Following · 17 Media · 44 Statuses

Doing cool things.

Joined July 2024
@tilderesearch
Tilde
10 days
Sparse attention (MoBA/NSA) trains faster & beats full attention on key tasks. But we’ve had no idea how they truly work… until now. 🔍 We reverse-engineered them to uncover:
- Novel attention patterns
- Hidden "attention sinks"
- Better performance
- And more
A 🧵… ~1/8~
@tilderesearch
Tilde
10 days
Read the full post here:
@tilderesearch
Tilde
10 days
~8/8~ We release our NSA kernel for experimentation and research here: At Tilde, we believe interpretability is the path towards building better models. If that sounds cool, reach out!
@tilderesearch
Tilde
10 days
~7/8~ We analyzed the gating distributions for NSA models and found we can ablate many branches without compromising model performance! These principled ablations enabled massive gains in throughput.
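A minimal sketch of this kind of gate-based branch ablation, assuming per-branch gate activations have already been averaged over a calibration set; the threshold, shapes, and example numbers below are illustrative assumptions, not Tilde's released code:

```python
import torch

def branch_keep_mask(gate_means: torch.Tensor, threshold: float = 0.05) -> torch.Tensor:
    """Decide which attention branches to keep, per layer.

    gate_means: [num_layers, num_branches] mean gate activation per branch,
    accumulated over a calibration set. Branches whose mean gate falls below
    `threshold` are marked for ablation (skipped at inference for throughput).
    """
    return gate_means >= threshold

# Hypothetical gate statistics for NSA's three branches
# (compressed, selected, sliding-window), two layers shown:
gate_means = torch.tensor([
    [0.61, 0.35, 0.04],   # layer 0: sliding-window branch barely used
    [0.02, 0.55, 0.43],   # layer 1: compressed branch barely used
])
print(branch_keep_mask(gate_means))
# tensor([[ True,  True, False],
#         [False,  True,  True]])
```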
@tilderesearch
Tilde
10 days
~6/8~ We investigated the key geometry of different attention models and found that KV-sharing through GQA strongly influences the resulting query/key manifolds. This offered a clue into how sparse attention could be beating base attention → by removing cross-head
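For reference, a minimal sketch of the KV-sharing mentioned above: in GQA, each key/value head is reused by a whole group of query heads, so those query heads see identical key geometry (generic PyTorch, not tied to any particular model here):

```python
import torch

def gqa_expand_kv(k: torch.Tensor, v: torch.Tensor, num_query_heads: int):
    """Expand shared K/V heads so each query head gets a key/value tensor.

    k, v: [num_kv_heads, seq, head_dim]. num_query_heads must be a multiple
    of num_kv_heads; each K/V head is repeated across its query-head group.
    """
    group = num_query_heads // k.shape[0]
    return k.repeat_interleave(group, dim=0), v.repeat_interleave(group, dim=0)

# e.g. 8 query heads sharing 2 KV heads -> groups of 4 query heads per KV head
k, v = torch.randn(2, 128, 64), torch.randn(2, 128, 64)
k_full, v_full = gqa_expand_kv(k, v, num_query_heads=8)
print(k_full.shape)  # torch.Size([8, 128, 64])
```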
@tilderesearch
Tilde
10 days
~5/8~ We identified novel attention-sinking mechanisms in sparse attention models. We found that MoBA models periodically re-inject attention sinks and that these sinks can be easily identified using value-norm and attention-score heuristics.
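A rough sketch of a value-norm and attention-score heuristic of the kind described above; the thresholds and the way the two signals are combined are assumptions, not the exact rule from the post:

```python
import torch

def find_attention_sinks(attn: torch.Tensor, values: torch.Tensor,
                         score_thresh: float = 0.2,
                         vnorm_ratio: float = 0.5) -> torch.Tensor:
    """Heuristically flag sink tokens in one layer.

    attn:   [num_heads, seq, seq] attention probabilities.
    values: [num_heads, seq, head_dim] value vectors.
    A token is flagged as a sink if it absorbs a large share of attention
    (mean attention received above score_thresh) while carrying little
    content (value norm well below the sequence median).
    """
    received = attn.mean(dim=(0, 1))                      # attention each key receives
    vnorms = values.norm(dim=-1).mean(dim=0)              # mean value norm per token
    low_content = vnorms < vnorm_ratio * vnorms.median()
    return (received > score_thresh) & low_content
```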
@tilderesearch
Tilde
10 days
~4/8~ We visualized the first-ever long-context attention maps for sparse attention, revealing fascinating patterns and attention circuits.
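A generic recipe for rendering such a map, assuming a dense [seq, seq] matrix of attention probabilities for one head; the log-scaling and colormap are common choices, not necessarily the ones used for the post's figures:

```python
import matplotlib.pyplot as plt
import torch

def plot_attention_map(attn: torch.Tensor, title: str = "attention map") -> None:
    """attn: [seq, seq] attention probabilities for a single head.
    Log-scaling keeps faint off-diagonal structure visible at long context."""
    plt.imshow(torch.log(attn + 1e-9).numpy(), cmap="viridis", aspect="auto")
    plt.xlabel("key position")
    plt.ylabel("query position")
    plt.title(title)
    plt.colorbar()
    plt.show()
```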
@tilderesearch
Tilde
10 days
~3/8~ We trained dozens of sparse attention models and poked around in their brains. Sparse attention models boast superior long-context generalization out of the box, even with 80% sparsity in attention scores.
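One way to quantify the 80% figure, assuming "sparsity" here means the fraction of causally valid query-key pairs the sparse pattern drops (a plausible reading, not a definition taken from the post):

```python
import torch

def attention_sparsity(mask: torch.Tensor) -> float:
    """Fraction of causally valid (query, key) pairs a sparse pattern drops.

    mask: [seq, seq] boolean, True where the sparse pattern allows attention.
    """
    seq = mask.shape[0]
    causal = torch.tril(torch.ones(seq, seq, dtype=torch.bool))
    kept = (mask & causal).sum().item()
    return 1.0 - kept / causal.sum().item()
```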
@tilderesearch
Tilde
10 days
📖 Read the full post here: ~2/8~ Sparse attention exploits inherent sparsity in model attention patterns to dramatically accelerate sequence mixing. Natively trainable approaches, such as Kimi’s MoBA and DeepSeek’s NSA, expand the Pareto frontier by
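To make the mechanism concrete, here is a minimal single-head sketch of MoBA-style block selection: each query scores key blocks by their mean key, keeps its top-k blocks plus its own block, and attends only there under a causal mask. This is an illustrative reimplementation, not the released MoBA or NSA kernels:

```python
import torch
import torch.nn.functional as F

def moba_style_attention(q, k, v, block_size=64, top_k=3):
    """Block-sparse attention sketch (single head, seq divisible by block_size).

    q, k, v: [seq, dim]. Each query attends only to keys inside its top_k
    highest-scoring blocks (by mean key) plus its own block.
    """
    seq, dim = q.shape
    num_blocks = seq // block_size
    block_means = k.view(num_blocks, block_size, dim).mean(dim=1)      # [blocks, dim]

    block_scores = q @ block_means.T                                   # [seq, blocks]
    keep = block_scores.topk(min(top_k, num_blocks), dim=-1).indices   # [seq, top_k]

    block_id = torch.arange(seq) // block_size                         # block of each position
    allowed = (keep.unsqueeze(-1) == block_id.view(1, 1, -1)).any(1)   # [seq, seq]
    own_block = block_id.view(-1, 1) == block_id.view(1, -1)           # always keep own block
    causal = torch.tril(torch.ones(seq, seq, dtype=torch.bool))
    mask = (allowed | own_block) & causal

    scores = (q @ k.T) / dim ** 0.5
    scores = scores.masked_fill(~mask, float("-inf"))
    return F.softmax(scores, dim=-1) @ v

out = moba_style_attention(torch.randn(256, 64), torch.randn(256, 64), torch.randn(256, 64))
print(out.shape)  # torch.Size([256, 64])
```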
@tilderesearch
Tilde
4 months
~6/6~ We would like to credit this LessWrong post for inspiration. Big shoutout to the @NebiusAI solutions team for their assistance in testing on their platform.
@tilderesearch
Tilde
4 months
~5/6~ We open-source our code which can be found below. We encourage PRs and active maintainers, as Activault is a community tool.
@tilderesearch
Tilde
4 months
~4/6~ Here’s how Activault compares to other common methods of handling activation data. Activault achieves cost-efficiency without compromising on performance.
@tilderesearch
Tilde
4 months
~3/6~ Activault provides a flexible and efficient interface for managing activation data on shared object storage solutions (S3). S3 data is cheap and easily shareable - promoting reproducibility in the field.
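The general pattern looks roughly like the hypothetical sketch below: serialize activation shards and push/pull them through S3 with boto3. The bucket name and helper functions are made up for illustration; Activault's actual interface is defined in the released repo:

```python
import io
import boto3
import numpy as np

s3 = boto3.client("s3")
BUCKET = "my-activation-store"   # hypothetical bucket name

def put_shard(acts: np.ndarray, key: str) -> None:
    """Serialize one shard of activations and upload it to S3."""
    buf = io.BytesIO()
    np.save(buf, acts)
    s3.put_object(Bucket=BUCKET, Key=key, Body=buf.getvalue())

def get_shard(key: str) -> np.ndarray:
    """Stream one shard back down for interpreter-model training."""
    body = s3.get_object(Bucket=BUCKET, Key=key)["Body"].read()
    return np.load(io.BytesIO(body))
```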
@tilderesearch
Tilde
4 months
~2/6~ Training interpreter models requires dozens of billions of activations - constituting several petabytes of data that must be dynamically generated, stored, and read. As such, interpretability-at-scale on frontier open-weight LMs remains out of reach for the majority of
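A back-of-the-envelope estimate of where the petabytes come from; the token count, hidden width, dtype, and number of hookpoints below are illustrative assumptions, not figures from the post:

```python
# Rough storage estimate for residual-stream activations.
tokens        = 40e9    # "dozens of billions" of activations
hidden_dim    = 8192    # hidden width of a large open-weight LM (assumed)
bytes_per_val = 2       # bf16
hookpoints    = 4       # layers/sites captured per token (assumed)

total_bytes = tokens * hidden_dim * bytes_per_val * hookpoints
print(f"{total_bytes / 1e15:.1f} PB")   # -> 2.6 PB
```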
@tilderesearch
Tilde
4 months
Today, we open-source Activault, a simple, high-throughput, and cost-effective solution to activation data management for accelerating interpretability research on frontier models. A 🧵… ~1/6~
@tilderesearch
Tilde
4 months
And if you don’t like graph theory but do like interpretability, we have plenty of other fun problems, so feel free to email us at join@tilderesearch.com. We are doing a lot of applied interpretability work like this: which was the first application of.
@tilderesearch
Tilde
4 months
Over the past few weeks, we've been using this graph theory problem in interviews and figured we'd open it up to everyone here! If you solve it, we’ll move you directly to the last rounds of our process!
@tilderesearch
Tilde
7 months
Thank you to @ArthurConmy, @NeelNanda5, and @StephenLCasper for their comments and suggestions during the drafting process. This blog post is joint work with @a_karvonen, and the task is derived from Benchify.
@tilderesearch
Tilde
7 months
~8/8~ In the spirit of open science and reproducibility, we release every component involved in our study (model, data, evaluation results, etc.) and an extensive set of analyses and visualizations with our blog post. Make sure to check it out!
@tilderesearch
Tilde
7 months
~7/8~ Though there remains much for future work, our results show that the gap between theoretical interpretability research and practical applications is closing rapidly. We are excited to release further developments toward fine-grained levers for model control enabled by.