Nick Jiang

@nickhjiang

Followers: 750 · Following: 1K · Media: 33 · Statuses: 224

interpreting neural networks @berkeley_ai // cs + philosophy @ucberkeley // prev @briskteaching @watershed

Berkeley, CA
Joined July 2019
@nickhjiang
Nick Jiang
7 days
What makes LLMs like Grok-4 unique? We use sparse autoencoders (SAEs) to tackle queries like these and apply them to four data analysis tasks: data diffing, correlations, targeted clustering, and retrieval. By analyzing model outputs, SAEs find novel insights into model behavior!
[image]
6
16
156
@nickhjiang
Nick Jiang
7 days
RT @BaldassarreFe: Say hello to DINOv3 🦖🦖🦖. A major release that raises the bar of self-supervised vision foundation models. With stunning….
0
277
0
@nickhjiang
Nick Jiang
7 days
RT @lilysun004: New blog post on SAEs for data-centric interpretability!
0
2
0
@nickhjiang
Nick Jiang
7 days
RT @NeelNanda5: I'm excited about our vision of data-centric interpretability! Even if you can't use a model's internals, there's a lot of….
0
9
0
@nickhjiang
Nick Jiang
7 days
Work done with @lilysun004*, Lewis Smith, and @NeelNanda5. Thank you to @GoodfireAI and MATS for compute support! Blog post:
lesswrong.com
Nick and Lily are co-first authors on this project. Lewis and Neel jointly supervised this project. …
0
0
10
@nickhjiang
Nick Jiang
7 days
Our results suggest that SAEs can systematically analyze large datasets for insights. We are particularly excited about their application to data-centric interpretability, which we believe is an under-invested but crucial approach to understanding models.
1
0
2
@nickhjiang
Nick Jiang
7 days
Finally, we use SAEs for data retrieval and find they excel at property-based queries (e.g., “repeating text”). We construct a benchmark of such queries and find that SAEs often outperform semantic embeddings, which are tuned for semantic queries like fact-finding.
[image]
1
0
5
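A minimal sketch of how property-based retrieval might work, assuming a precomputed (num_docs, num_features) matrix of max-pooled SAE activations (described later in the thread) and a list of per-feature text descriptions; the keyword matcher and function names are illustrative assumptions, not the authors' implementation:

import numpy as np

def retrieve_by_property(query: str, doc_labels: np.ndarray,
                         feature_descriptions: list[str], top_k: int = 10) -> list[int]:
    """Rank documents for a property-based query via matching SAE features.

    doc_labels: (num_docs, num_features) max-pooled SAE activations.
    Toy matcher: keep features whose description shares a word with the query;
    in practice an LLM or embedding model could select the relevant features.
    """
    query_words = set(query.lower().split())
    keep = [i for i, desc in enumerate(feature_descriptions)
            if query_words & set(desc.lower().split())]
    if not keep:
        return []
    scores = doc_labels[:, keep].max(axis=1)     # best-matching feature per document
    return np.argsort(-scores)[:top_k].tolist()

# e.g. retrieve_by_property("repeating text", doc_labels, feature_descriptions)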
@nickhjiang
Nick Jiang
7 days
SAE features can also cluster data in ways different from semantic embeddings, which group data by similar topics. By filtering features (e.g., tone-related ones), we can group data by shared properties. For instance, we can cluster GSM8K responses into different reasoning approaches.
[image]
1
0
3
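A small sketch of property-based clustering under the same assumptions (a (num_docs, num_features) matrix of max-pooled SAE activations plus per-feature descriptions); the keyword filter and the scikit-learn KMeans choice are illustrative, not the authors' exact pipeline:

import numpy as np
from sklearn.cluster import KMeans

def cluster_by_property(doc_labels: np.ndarray, feature_descriptions: list[str],
                        keyword: str, n_clusters: int = 5) -> np.ndarray:
    """Cluster documents using only SAE features whose description mentions `keyword`.

    Restricting to, e.g., keyword="reasoning" or "tone" groups documents by that
    property instead of by overall topic, as a semantic embedding would.
    """
    keep = [i for i, desc in enumerate(feature_descriptions)
            if keyword.lower() in desc.lower()]
    filtered = doc_labels[:, keep]                      # (num_docs, num_kept_features)
    return KMeans(n_clusters=n_clusters, n_init="auto").fit_predict(filtered)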
@nickhjiang
Nick Jiang
7 days
We can also detect correlations in datasets, for example where all French samples happen to talk only about dogs. By finding features that co-occur often across data samples, we find that offensive topics in Chatbot Arena tend to correlate with narrative stories.
[image]
1
0
3
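A rough sketch of the co-occurrence step, again assuming a max-pooled activation matrix; ranking pairs by lift (joint frequency over the product of marginals) is one simple choice, not necessarily the metric used in the post:

import numpy as np
from itertools import combinations

def top_cooccurring_pairs(doc_labels: np.ndarray, feature_ids: list[int],
                          min_count: int = 5, top_k: int = 10):
    """Rank feature pairs by how much more often they fire together than chance."""
    fired = doc_labels[:, feature_ids] > 0              # (num_docs, k) binary firing matrix
    n = fired.shape[0]
    marginal = fired.mean(axis=0)                       # P(feature fires)
    joint = (fired.T.astype(float) @ fired) / n         # P(both features fire)
    pairs = []
    for a, b in combinations(range(len(feature_ids)), 2):
        if joint[a, b] * n < min_count:                 # skip rare pairs
            continue
        lift = joint[a, b] / (marginal[a] * marginal[b] + 1e-12)
        pairs.append((float(lift), feature_ids[a], feature_ids[b]))
    return sorted(pairs, reverse=True)[:top_k]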
@nickhjiang
Nick Jiang
7 days
For example, to “diff” two datasets, we can subtract the frequencies at which features fire. Diffing model outputs to the same prompts lets us compare models like Grok-4, with the top feature differences surfacing stronger contrasts between models at 5× lower cost than our LLM baseline.
[image]
1
0
4
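A minimal sketch of the diffing step, assuming each dataset has been reduced to a (num_docs, num_features) matrix of max-pooled SAE activations (described in the tweet that follows); the helper names and threshold are illustrative assumptions:

import numpy as np

def feature_firing_frequencies(doc_labels: np.ndarray, threshold: float = 0.0) -> np.ndarray:
    """Fraction of documents in which each SAE feature fires above `threshold`."""
    return (doc_labels > threshold).mean(axis=0)        # (num_features,)

def diff_datasets(labels_a: np.ndarray, labels_b: np.ndarray, top_k: int = 20) -> np.ndarray:
    """Indices of features that fire much more often in dataset A than in dataset B."""
    delta = feature_firing_frequencies(labels_a) - feature_firing_frequencies(labels_b)
    return np.argsort(-delta)[:top_k]

# Reading off the descriptions of the returned features summarizes what A does that B doesn't.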
@nickhjiang
Nick Jiang
7 days
LLM outputs & training data hold valuable insights into how models behave and work. While analysis has typically relied on LLMs 🤑🤑, we show that SAEs can interpret large datasets more efficiently by using max-pooled feature activations as a large set of labels for our data.
[image]
1
0
9
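A minimal sketch of the max-pooling step, assuming a hypothetical SAE encoder that returns per-token feature activations; `sae.encode` and the shapes are assumptions, not the authors' released API:

import numpy as np

def document_feature_labels(token_activations: np.ndarray) -> np.ndarray:
    """Max-pool per-token SAE activations into one label vector per document.

    token_activations: (num_tokens, num_features) from an SAE encoder.
    Returns a (num_features,) vector; entries above zero mark which features
    (i.e. which interpretable properties) are present in the document.
    """
    return token_activations.max(axis=0)

# Hypothetical usage:
# acts = sae.encode(tokens)                  # (num_tokens, num_features)
# doc_labels = document_feature_labels(acts)
# Stacking doc_labels across documents gives the (num_docs, num_features) matrix
# used for the diffing, correlation, clustering, and retrieval sketches above.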
@nickhjiang
Nick Jiang
27 days
man when u visit a new country and this is the first meal you’re handed
[image]
1
0
8
@nickhjiang
Nick Jiang
1 month
So much of research centers on hope and faith that ppl who don’t take leaps of faith are unlikely to enjoy research.
3
0
10
@nickhjiang
Nick Jiang
2 months
Sorry, this is the correct link for LLaVA-Llama-8B with test-time registers:
huggingface.co
0
0
0
@nickhjiang
Nick Jiang
2 months
RT @nickhjiang: Updated paper!. Our main new finding: by creating attention biases at test time—without extra tokens—we remove high-norm ou….
0
35
0
@nickhjiang
Nick Jiang
2 months
Additionally, we release a LLaVA-Llama 8b model (CLIP-L encoder) configured with a test-time register. We have also updated references and added new experiments to the appendix! LLaVA-Llama 8b: Paper: Repo:
huggingface.co
1
0
5
@nickhjiang
Nick Jiang
2 months
These findings are preliminary and limited to a base OpenCLIP model, where outliers are small (norm < 500). If they extend to the language domain, however, they offer a promising way to manage outliers, a challenge for quantization, without engineering hacks. See Appendix A.11.
1
0
1
@nickhjiang
Nick Jiang
2 months
Zeroing out the activations of register neurons removes outliers and drops classification performance by ~20%, but adding an attention bias recovers this drop. Our results suggest that outliers primarily act as attention biases in ViTs.
[image]
1
0
2
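A sketch of the kind of intervention described above, using a PyTorch forward hook to zero a chosen set of register neurons; the module path and neuron indices are placeholders, not values from the paper:

import torch

def zero_register_neurons(module: torch.nn.Module, neuron_ids: list[int]):
    """Zero the given output channels of `module` (e.g. an MLP layer holding register neurons)."""
    def hook(_module, _inputs, output):
        output = output.clone()
        output[..., neuron_ids] = 0.0    # kill the register-neuron activations
        return output                    # returned tensor replaces the module's output
    return module.register_forward_hook(hook)

# Hypothetical usage on a ViT block:
# handle = zero_register_neurons(vit.blocks[20].mlp.fc1, neuron_ids=[42, 1337])
# ...evaluate: outliers disappear, accuracy drops unless an attention bias is added back...
# handle.remove()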
@nickhjiang
Nick Jiang
2 months
Sun et al. (“Massive Activations”) proposed adding an attention bias during training to mitigate high-norm outliers. But we can do this training-free! Specifically, for each attention head, we set v’ and k’ to the value and key vectors of a test-time register averaged over images.
[image]
2
0
3
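A sketch of that per-head attention bias, implemented here as one extra key/value slot appended at test time; the tensor layout and the use of scaled_dot_product_attention are assumptions about how one might wire this up, not the paper's exact code:

import torch
import torch.nn.functional as F

def attention_with_test_time_register(q, k, v, k_reg, v_reg):
    """Attention with one extra key/value slot per head acting as a register.

    q, k, v: (batch, heads, tokens, head_dim).
    k_reg, v_reg: (heads, head_dim), e.g. the key/value vectors of a register
    token averaged over a set of images, as described in the tweet above.
    """
    b = q.shape[0]
    k_extra = k_reg[None, :, None, :].expand(b, -1, 1, -1)   # (batch, heads, 1, head_dim)
    v_extra = v_reg[None, :, None, :].expand(b, -1, 1, -1)
    k_aug = torch.cat([k, k_extra], dim=2)                   # append the register slot
    v_aug = torch.cat([v, v_extra], dim=2)
    # No extra query is added, so no register token appears in the output sequence.
    return F.scaled_dot_product_attention(q, k_aug, v_aug)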