David Chiang
@davidweichiang
2K Followers · 503 Following · 35 Media · 794 Statuses
Associate Professor of Computer Science and Engineering at University of Notre Dame. Natural language processing, formal grammars, machine learning
South Bend, IN
Joined September 2012
I'd be glad to discuss in person at NeurIPS in December.
Always glad to discuss in the FLaNN Discord: https://t.co/U8omHLu1KL
Read the paper: https://t.co/iFTPgBEDuM
And apply to work with David for a PhD!
arxiv.org
It has been observed that transformers with greater depth (that is, more layers) have more capabilities, but can we establish formally which capabilities are gained? We answer this question with a...
0 · 1 · 3
Our updated theorems show this depth separation holds even when the transformers incorporate positional information, like RoPE and ALiBi. As a fun side quest, our results also imply depth separations in extremely uniform subclasses of linear TC^0.
1 · 1 · 3
L_k consists of k alternating blocks of symbols, e.g. L_3 = {aba, aabbaa, aaabbbbbaaaaa, ...}, and each L_k requires more depth to express. The updated experiments show this theory very closely predicts what depth transformers need to learn these languages!
1 · 1 · 0
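For anyone who wants to poke at the definition, here is a minimal membership checker (my own sketch, reading L_k as "strings over {a, b} made of exactly k maximal blocks, starting with a", which matches the L_3 examples above):

```python
# A minimal sketch, assuming L_k = "strings over {a, b} consisting of exactly
# k maximal blocks of identical symbols, starting with a" (consistent with the
# L_3 examples aba, aabbaa, aaabbbbbaaaaa above). Not code from the paper.
def in_L_k(s: str, k: int) -> bool:
    if not s or s[0] != "a" or set(s) - {"a", "b"}:
        return False
    # Count maximal runs; a new block starts wherever the symbol changes,
    # so the blocks alternate between a's and b's automatically.
    blocks = 1 + sum(1 for prev, cur in zip(s, s[1:]) if cur != prev)
    return blocks == k

assert in_L_k("aba", 3) and in_L_k("aabbaa", 3) and in_L_k("aaabbbbbaaaaa", 3)
assert not in_L_k("abab", 3) and not in_L_k("ab", 3)
```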
Then, by leveraging lower-bound techniques developed for majority logic with two variables, we prove a depth hierarchy for C-RASP. That is, we find a family of languages L_k such that a program of depth k can express L_k, but no program of depth k-1 can.
1 · 1 · 0
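In symbols, the separation claimed above reads roughly as follows (my paraphrase; the depth-indexed class notation is mine, not the paper's):

```latex
% Paraphrase of the hierarchy claim; C-RASP_d stands for the languages
% expressible by C-RASP programs of depth d (notation invented here).
\forall k \ge 1 \;\; \exists L_k :\quad
L_k \in \mathrm{C\text{-}RASP}_{k}
\quad\text{and}\quad
L_k \notin \mathrm{C\text{-}RASP}_{k-1}.
```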
C-RASP is a programming language which extends and refines @gail_w's RASP. We prove transformers are expressively equivalent to C-RASP programs under a particular fixed-precision set-up.
1 · 1 · 1
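To give a flavor of what such a program manipulates, here is a tiny plain-Python emulation of the counting-style primitives described above (my reading of the description, not actual C-RASP syntax): prefix counts of a Boolean condition, plus comparisons between counts.

```python
# Not C-RASP syntax: just a plain-Python emulation of the counting flavor
# described above, as I read it -- prefix counts of a Boolean condition,
# plus comparisons between counts. The example accepts strings with equally
# many a's and b's, a typical counting property for this kind of program.
def prefix_count(condition):
    """out[i] = number of positions j <= i where the condition holds."""
    total, out = 0, []
    for holds in condition:
        total += int(holds)
        out.append(total)
    return out

def equal_as_and_bs(w: str) -> bool:
    if not w:
        return True  # zero a's, zero b's
    count_a = prefix_count(c == "a" for c in w)
    count_b = prefix_count(c == "b" for c in w)
    # The accepting condition compares the two counts at the last position.
    return count_a[-1] == count_b[-1]

assert equal_as_and_bs("abba") and not equal_as_and_bs("aab")
```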
For those stumbling on my page, here's a research update. Earlier, Michaël Cadilhac, @davidweichiang, and I proved a depth hierarchy in C-RASP which aligns with learnability in transformers of a given depth. Now with new theory on positional encodings and new experiments :)
1 · 1 · 3
I am recruiting a PhD student to work with me, Peter Cholak, Anand Pillay, and Andy Yang @pentagonalize on transformers and logic/model theory (or related topics). If you are interested, please email me with "FLaNN" in the subject line!
9 · 68 · 272
Read the cookbook: https://t.co/ymBPgfwGxa
Join us for weekly seminars on formal language theory, ML, NLP, and more:
arxiv.org
We present the transformer cookbook: a collection of techniques for directly encoding algorithms into a transformer's parameters. This work addresses the steep learning curve of such endeavors, a...
0 · 5 · 24
Thanks to all the chefs: Chris Watson, @AntonXue, @satwik1729, Jose Llarena, @lambdaviking, Emile Dos Santos Ferreira, @AnejSvete, @davidweichiang!
1 · 1 · 8
There is no better way to understand what transformers can do than to get your hands dirty and construct them, weight-by-weight. The Transformer Cookbook provides a guide for anyone aiming to understand the expressive power of transformers on such a formal level.
1 · 1 · 5
We present The Transformer Cookbook: a collection of recipes for programming algorithms directly into transformers! Hungry for an induction head? Craving a Dyck language recognizer? We show you step-by-step how to cook up transformers for these algorithms and many more!
1 · 13 · 40
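In that weight-by-weight spirit, here is a small hand-set attention head (my own NumPy sketch, not a recipe copied from the cookbook): zeroing the query and key matrices makes causal softmax attention uniform over each prefix, and a ±1 value map turns the head's output into the running bracket balance divided by the prefix length, which suffices to check Dyck-1.

```python
# My own sketch of the "program it into the weights" idea; the construction
# and names here are illustrative, not taken from the cookbook.
import numpy as np

def dyck1_head(tokens: str) -> bool:
    if not tokens:
        return True  # the empty string is balanced
    # One-hot embeddings over the alphabet {'(', ')'}.
    X = np.array([[1.0, 0.0] if t == "(" else [0.0, 1.0] for t in tokens])
    W_Q = np.zeros((2, 2))             # zero queries and keys: every logit is 0,
    W_K = np.zeros((2, 2))             # so causal softmax attention is uniform over the prefix
    W_V = np.array([[1.0], [-1.0]])    # value map: '(' -> +1, ')' -> -1

    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    n = len(tokens)
    logits = Q @ K.T
    logits[np.triu(np.ones((n, n), dtype=bool), k=1)] = -np.inf   # causal mask
    A = np.exp(logits - logits.max(axis=1, keepdims=True))
    A /= A.sum(axis=1, keepdims=True)
    out = (A @ V).ravel()   # out[i] = (opens - closes in tokens[:i+1]) / (i + 1)

    # Dyck-1: every prefix balance is >= 0 and the final balance is exactly 0.
    return bool(np.all(out >= -1e-9) and abs(out[-1]) < 1e-9)

assert dyck1_head("(()())") and not dyck1_head("())(")
```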
@ND_CSE is hiring a tenure-track professor at @NotreDame, with computer vision, software systems for robotics, and quantum computing as priority search areas. Apply and join the Notre Dame Computer Science and Engineering Department! ☘️ Apply Now:
0 · 5 · 6
📢 We're hiring open-rank TT CS faculty at Notre Dame!! All areas are welcome, with computer vision, software systems for robotics, and quantum computing being of particular interest. ♥️ Come and be my colleague! It's a fantastic dept. to be a part of.
0 · 20 · 44
Dear NeurIPS 2030 reviewers: We have not yet received your final final final justification in response to the authors' final final final remarks.
0 · 0 · 41
By @pentagonalize, Lena Strobl, Dana Angluin, and me, on arXiv:
arxiv.org
We study conditions under which transformers using soft attention can simulate hard attention, that is, effectively focus all attention on a subset of positions. First, we examine several...
0 · 0 · 1
If position embeddings have poly(n) magnitude and the 1st- and 2nd-place attention weights are separated by a constant-size gap, then the required scale is O(log n). If the gap is 1/n^k, then the required scale is O(n^k log n).
1 · 0 · 1
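A back-of-the-envelope check of why those scales suffice (my own union-bound sketch; the function and parameter names are invented, not from the paper): a non-maximal position gets softmax weight at most e^(-c·gap), so a scale of c = log((n-1)/eps)/gap caps the total mass leaked off the argmax at eps, i.e. O(log n) for a constant gap and O(n^k log n) for a 1/n^k gap.

```python
# My own numeric sketch of the union-bound intuition, not the paper's proof.
# Names (soft_vs_hard, delta, eps) are invented for illustration.
import numpy as np

def soft_vs_hard(n: int, delta: float, eps: float = 0.01):
    """Worst case for the bound: one logit ahead by `delta`, the other n-1 tied."""
    logits = np.zeros(n)
    logits[0] = delta
    c = np.log((n - 1) / eps) / delta      # scale suggested by the union bound
    w = np.exp(c * (logits - logits.max()))
    w /= w.sum()
    return c, 1.0 - w[0]                   # (scale used, softmax mass off the argmax)

# Constant gap: a scale of O(log n) keeps the leaked mass below eps.
print(soft_vs_hard(n=10_000, delta=0.5))     # c ~ 28, leakage ~ 0.0099
# Gap of 1/n: the same guarantee needs a scale of roughly n log n.
print(soft_vs_hard(n=10_000, delta=1e-4))    # c ~ 1.4e5, leakage ~ 0.0099
```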
It's known that any average-hard attention transformer can be simulated by a softmax-attention transformer by scaling attention logits. We give a new bound on how much they need to be scaled by, and this bound now works for any average-hard attention transformer.
1 · 0 · 0
We updated our paper on soft attention simulating hard attention with a more general result. Many theoretical constructions of transformers use hard attention, but what does that say about actual transformers, which use soft attention?
1 · 0 · 3
Andy Yang @pentagonalize drove the conceptualization, theory, and experiments of this work. I was just the checker and editor!
0 · 0 · 5
Very excited about this work: deep results from logic shedding light on Transformers and the benefit of depth
New on arXiv: Knee-Deep in C-RASP, by @pentagonalize, Michael Cadilhac and me. The solid stepped line is our theoretical prediction based on what problems C-RASP can solve, and the numbers/colors are what transformers (no position embedding) can learn.
0 · 3 · 13