Eric J. Michaud

@ericjmichaud_

3K Followers · 3K Following · 46 Media · 271 Statuses

PhD student at MIT. Trying to make deep neural networks among the best understood objects in the universe. 💻🤖🧠👽🔭🚀

SF
Joined February 2015
@ericjmichaud_
Eric J. Michaud
2 years
Understanding the origin of neural scaling laws and the emergence of new capabilities with scale is key to understanding what deep neural networks are learning. In our new paper, @tegmark, @ZimingLiu11, @uzpg_ and I develop a theory of neural scaling. 🧵:
arxiv.org — We propose the Quantization Model of neural scaling laws, explaining both the observed power law dropoff of loss with model and data size, and also the sudden emergence of new capabilities with...
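As a rough, hedged sketch of the scaling argument (the notation below is illustrative, not necessarily the paper's): suppose a model's knowledge decomposes into discrete "quanta", the k-th most useful quantum is needed on a fraction p_k ∝ k^{-(α+1)} of samples (a Zipf-like law), and a model that has learned the n most frequent quanta pays an O(1) loss b only on samples whose quantum it is missing. Then

$$
L(n) \approx b \sum_{k>n} p_k = \frac{b}{\zeta(\alpha+1)} \sum_{k>n} k^{-(\alpha+1)} \sim \frac{b}{\alpha\,\zeta(\alpha+1)}\, n^{-\alpha},
$$

i.e. loss falls as a power law in the number of quanta learned; if n in turn grows as a power of parameters or data, power-law scaling in model and data size follows, while any individual quantum is acquired abruptly, which is the "emergence" side of the picture.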
@ericjmichaud_
Eric J. Michaud
3 months
I've moved to SF and am working at @GoodfireAI this summer! Excited to be here and to spend time with many friends, old and new.
@ericjmichaud_
Eric J. Michaud
3 months
RT @asher5772: It's been a pleasure working on this with @ericjmichaud_ and @tegmark, and I'm excited that it's finally out! In this work, …
@ericjmichaud_
Eric J. Michaud
3 months
However, if this way of thinking about intelligence, as just an ensemble of narrower skills, is wrong, and the intelligence of our models instead results from something more unified and irreducible, then the basic prospect of creating narrow systems may be a challenge.
@ericjmichaud_
Eric J. Michaud
3 months
If the apparent generality of our systems today results from their having learned a very large number of specialized circuits, then we should be able to transfer some of those circuits into smaller, specialized models, and still get good performance.
@ericjmichaud_
Eric J. Michaud
3 months
Zooming way out though, I think this problem of whether it is possible to create narrow AI hinges on a basic question about the intelligence of our AI models.
@ericjmichaud_
Eric J. Michaud
3 months
We also have some nice experiments on creating narrow networks on real-world data:
1) MNIST: only classifying even digits
2) LLMs: next-token prediction on only Python code
Tentatively, pruning general models works better than training new models from scratch on each of these.
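For concreteness, here is a minimal sketch of the narrow MNIST setting described above, i.e. an even-digits-only dataset; this is my own illustrative PyTorch/torchvision code, not the paper's.

```python
# Build the "even digits only" narrow distribution from MNIST (illustrative).
from torch.utils.data import Subset, DataLoader
from torchvision import datasets, transforms

mnist = datasets.MNIST("data/", train=True, download=True,
                       transform=transforms.ToTensor())

# Keep only samples whose label is an even digit (0, 2, 4, 6, 8).
even_idx = [i for i, y in enumerate(mnist.targets.tolist()) if y % 2 == 0]
even_loader = DataLoader(Subset(mnist, even_idx), batch_size=128, shuffle=True)

# A narrow classifier only needs 5 outputs; e.g. relabel y -> y // 2.
images, labels = next(iter(even_loader))
print(images.shape, (labels // 2).unique())
```

Training a small model from scratch on this subset is the baseline against which pruning a general MNIST model down to a narrow one can be compared.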
@ericjmichaud_
Eric J. Michaud
3 months
It's as if there is some simple circuit hidden in that mess of connectivity, and we can move that circuit into alignment with a very small subset of the neurons while cleaning away the rest. Neat!
@ericjmichaud_
Eric J. Michaud
3 months
This gif shows the full cycle. First, we train on a broad distribution of atomic and compositional samples. Next, we train just on a compositional subtask, while regularizing the weights. This unlearns all the other tasks, while also making the network very sparse.
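A minimal sketch of that second phase, assuming a group-L1 (group-lasso) penalty on each neuron's incoming weights to encourage structured sparsity; the penalty form, optimizer, and hyperparameters are my own illustrative choices, not necessarily what the paper uses.

```python
# Phase 2 (illustrative): fine-tune a trained general model on only the narrow
# subtask while penalizing weights so most of the network can later be pruned.
import torch
import torch.nn as nn

def group_l1(model: nn.Module):
    """Group-lasso penalty: sum of L2 norms of each neuron's incoming weight row.
    Pushes entire rows (neurons) toward exactly zero, i.e. structured sparsity."""
    return sum(m.weight.norm(dim=1).sum()
               for m in model.modules() if isinstance(m, nn.Linear))

def narrow_finetune(model, narrow_loader, lam=1e-3, epochs=10, lr=1e-3):
    """Train only on the narrow subtask's data, with the sparsity penalty added."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for x, y in narrow_loader:            # narrow-distribution batches only
            opt.zero_grad()
            loss = loss_fn(model(x), y) + lam * group_l1(model)
            loss.backward()
            opt.step()
    return model
```

After this phase, neurons whose incoming weights have collapsed to (near) zero can be dropped outright, leaving a much smaller narrow network.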
@ericjmichaud_
Eric J. Michaud
3 months
On our networks trained on CMSP, we find that this simple approach works: perform some additional training on only the narrow distribution you care about, while also imposing a regularization penalty on the weights that encourages structured sparsity.
@ericjmichaud_
Eric J. Michaud
3 months
However, distinct skills are often not localized to a distinct set of model components. Instead, it seems like mechanisms are distributed across overlapping model components (perhaps due to feature superposition). This means we must do more work than just naively pruning.
@ericjmichaud_
Eric J. Michaud
3 months
If the mechanisms for different skills were localized to different model components (like neurons), then we could make general models narrow just by pruning the components which implemented the skills we wanted to unlearn.
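To make that naive, localization-based idea concrete, here is a hedged sketch (my own illustration, not the paper's procedure): score each hidden unit by how much more it activates on the skill to unlearn than on the skill to keep, then zero out the top-scoring units. As noted above, this tends to break down when mechanisms are distributed across overlapping components.

```python
# Illustrative "naive pruning": ablate the neurons most implicated in the skill
# we want to drop. Scoring rule and the fraction pruned are arbitrary choices.
import torch
import torch.nn as nn

@torch.no_grad()
def prune_by_activation(layer: nn.Linear, keep_in: torch.Tensor,
                        drop_in: torch.Tensor, frac: float = 0.2):
    """keep_in / drop_in are batches of this layer's inputs under each skill.
    Zeroes the frac of neurons most active on drop_in relative to keep_in."""
    act_keep = layer(keep_in).abs().mean(dim=0)   # mean |pre-activation| per neuron
    act_drop = layer(drop_in).abs().mean(dim=0)
    score = act_drop - act_keep                   # high => implicated in the drop skill
    k = max(1, int(frac * layer.out_features))
    idx = score.topk(k).indices
    layer.weight[idx] = 0.0                       # ablate those neurons
    if layer.bias is not None:
        layer.bias[idx] = 0.0
    return idx
```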
@ericjmichaud_
Eric J. Michaud
3 months
If it is sometimes necessary to train networks on a broader distribution first, it is natural to next ask: how can we turn big general networks into small narrow ones?
@ericjmichaud_
Eric J. Michaud
3 months
CMSP is also a nice generalization of the multitask sparse parity task we studied in our "quanta hypothesis" neural scaling work. There, tasks were independent, but our tasks here have a kind of hierarchical dependence structure, like a "skill tree".
arxiv.org — We propose the Quantization Model of neural scaling laws, explaining both the observed power law dropoff of loss with model and data size, and also the sudden emergence of new capabilities with...
@ericjmichaud_
Eric J. Michaud
3 months
This is a nice existence proof that, for data with certain structure, it can be necessary to train networks across a broader distribution of tasks in order to learn certain tasks within that distribution.
@ericjmichaud_
Eric J. Michaud
3 months
If one wants to efficiently learn to classify compositional samples, it is necessary to *also* train on a broader distribution of atomic samples. Here, it takes at least 10x the samples to learn the composite task on its own vs. when atomic samples are included (bottom left):
@ericjmichaud_
Eric J. Michaud
3 months
When more control bits are ON, the label is the parity of more of the task bits. We call samples "atomic" if only one control bit is ON, and "compositional" if multiple control bits are ON. We find that networks trained on CMSP show extremely strong curriculum learning effects.
@ericjmichaud_
Eric J. Michaud
3 months
To study this, we describe a synthetic task which we call "compositional multitask sparse parity" (CMSP). This is a binary classification problem on binary strings. Based on the pattern of the "control bits" of a sample, the label is the parity of some subset of its "task bits".
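As a hedged sketch of what such a task could look like, here is a toy CMSP-style data generator built from the description above; the bit counts, number of subtasks, and the rule mapping control patterns to task-bit subsets are my own illustrative choices, not the paper's exact construction.

```python
# Toy "compositional multitask sparse parity" (CMSP)-style generator (illustrative).
# Each control bit "owns" a small subset of task bits; the label is the parity of
# the union of the subsets owned by whichever control bits are ON.
import numpy as np

rng = np.random.default_rng(0)
N_CONTROL, N_TASK, K = 8, 64, 3        # illustrative sizes, not the paper's
SUBSETS = [rng.choice(N_TASK, size=K, replace=False) for _ in range(N_CONTROL)]

def sample(n_on: int, batch: int = 4):
    """Draw `batch` samples with exactly `n_on` control bits ON.
    n_on == 1 gives "atomic" samples; n_on > 1 gives "compositional" ones."""
    ctrl = np.zeros((batch, N_CONTROL), dtype=np.int64)
    for row in ctrl:
        row[rng.choice(N_CONTROL, size=n_on, replace=False)] = 1
    task = rng.integers(0, 2, size=(batch, N_TASK))
    labels = np.empty(batch, dtype=np.int64)
    for b in range(batch):
        active = np.flatnonzero(ctrl[b])
        bits = np.unique(np.concatenate([SUBSETS[i] for i in active]))
        labels[b] = task[b, bits].sum() % 2       # parity over the relevant task bits
    return np.concatenate([ctrl, task], axis=1), labels

x_atomic, y_atomic = sample(n_on=1)               # one control bit ON
x_comp, y_comp = sample(n_on=3)                   # several control bits ON
```

In this toy version the compositional label is just the parity over the union of the owned subsets, so learning the atomic pieces genuinely helps with the composite samples.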
@ericjmichaud_
Eric J. Michaud
3 months
Let's start with the following question: when can we train models from scratch, just on a narrow task of interest, and achieve good performance on that task?
@ericjmichaud_
Eric J. Michaud
3 months
Today, the most competent AI systems in almost *any* domain (math, coding, etc.) are broadly knowledgeable across almost *every* domain. Does it have to be this way, or can we create truly narrow AI systems? In a new preprint, we explore some questions relevant to this goal.