Eric J. Michaud

@ericjmichaud_

3K Followers · 3K Following · 46 Media · 271 Statuses

PhD student at MIT. Trying to make deep neural networks among the best understood objects in the universe. 💻🤖🧠👽🔭🚀

SF
Joined February 2015
@ericjmichaud_
Eric J. Michaud
2 years
Understanding the origin of neural scaling laws and the emergence of new capabilities with scale is key to understanding what deep neural networks are learning. In our new paper, @tegmark, @ZimingLiu11, @uzpg_ and I develop a theory of neural scaling. 🧵:
arxiv.org — We propose the Quantization Model of neural scaling laws, explaining both the observed power law dropoff of loss with model and data size, and also the sudden emergence of new capabilities with...
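As a rough, hedged sketch of the scaling argument (the notation below is illustrative, not necessarily the paper's): suppose a model's knowledge decomposes into discrete "quanta", the k-th most useful quantum is needed on a fraction p_k ∝ k^{-(α+1)} of samples (a Zipf-like law), and a model that has learned the n most frequent quanta pays an O(1) loss b only on samples whose quantum it is missing. Then

$$
L(n) \approx b \sum_{k>n} p_k = \frac{b}{\zeta(\alpha+1)} \sum_{k>n} k^{-(\alpha+1)} \sim \frac{b}{\alpha\,\zeta(\alpha+1)}\, n^{-\alpha},
$$

i.e. loss falls as a power law in the number of quanta learned; if n in turn grows as a power of parameters or data, power-law scaling in model and data size follows, while any individual quantum is acquired abruptly, which is the "emergence" side of the picture.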
@ericjmichaud_
Eric J. Michaud
3 months
I've moved to SF and am working at @GoodfireAI this summer! Excited to be here and to spend time with many friends, old and new.
@ericjmichaud_
Eric J. Michaud
3 months
RT @asher5772: It's been a pleasure working on this with @ericjmichaud_ and @tegmark, and I'm excited that it's finally out! In this work, …
@ericjmichaud_
Eric J. Michaud
3 months
However, if this way of thinking about intelligence, as just an ensemble of narrower skills, is wrong, and the intelligence of our models instead results from something more unified and irreducible, then the basic prospect of creating narrow systems may be a challenge.
@ericjmichaud_
Eric J. Michaud
3 months
If the apparent generality of our systems today results from their having learned a very large number of specialized circuits, then we should be able to transfer some of those circuits into smaller, specialized models, and still get good performance.
@ericjmichaud_
Eric J. Michaud
3 months
Zooming way out though, I think this problem of whether it is possible to create narrow AI hinges on a basic question about the intelligence of our AI models.
@ericjmichaud_
Eric J. Michaud
3 months
We also have some nice experiments on creating narrow networks on real-world data:
1) MNIST: only classifying even digits
2) LLMs: next-token prediction on only Python code
Tentatively, pruning general models works better than training new models from scratch on each of these.
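For concreteness, here is a minimal sketch of the narrow MNIST setting described above, i.e. an even-digits-only dataset; this is my own illustrative PyTorch/torchvision code, not the paper's.

```python
# Build the "even digits only" narrow distribution from MNIST (illustrative).
from torch.utils.data import Subset, DataLoader
from torchvision import datasets, transforms

mnist = datasets.MNIST("data/", train=True, download=True,
                       transform=transforms.ToTensor())

# Keep only samples whose label is an even digit (0, 2, 4, 6, 8).
even_idx = [i for i, y in enumerate(mnist.targets.tolist()) if y % 2 == 0]
even_loader = DataLoader(Subset(mnist, even_idx), batch_size=128, shuffle=True)

# A narrow classifier only needs 5 outputs; e.g. relabel y -> y // 2.
images, labels = next(iter(even_loader))
print(images.shape, (labels // 2).unique())
```

Training a small model from scratch on this subset is the baseline against which pruning a general MNIST model down to a narrow one can be compared.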
@ericjmichaud_
Eric J. Michaud
3 months
It's as if there is some simple circuit hidden in that mess of connectivity, and we can move that circuit into alignment with a very small subset of the neurons while cleaning away the rest. Neat!
@ericjmichaud_
Eric J. Michaud
3 months
This gif shows the full cycle. First, we train on a broad distribution of atomic and compositional samples. Next, we train just on a compositional subtask, while regularizing the weights. This unlearns all the other tasks, while also making the network very sparse.
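A minimal sketch of that second phase, assuming a group-L1 (group-lasso) penalty on each neuron's incoming weights to encourage structured sparsity; the penalty form, optimizer, and hyperparameters are my own illustrative choices, not necessarily what the paper uses.

```python
# Phase 2 (illustrative): fine-tune a trained general model on only the narrow
# subtask while penalizing weights so most of the network can later be pruned.
import torch
import torch.nn as nn

def group_l1(model: nn.Module):
    """Group-lasso penalty: sum of L2 norms of each neuron's incoming weight row.
    Pushes entire rows (neurons) toward exactly zero, i.e. structured sparsity."""
    return sum(m.weight.norm(dim=1).sum()
               for m in model.modules() if isinstance(m, nn.Linear))

def narrow_finetune(model, narrow_loader, lam=1e-3, epochs=10, lr=1e-3):
    """Train only on the narrow subtask's data, with the sparsity penalty added."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for x, y in narrow_loader:            # narrow-distribution batches only
            opt.zero_grad()
            loss = loss_fn(model(x), y) + lam * group_l1(model)
            loss.backward()
            opt.step()
    return model
```

After this phase, neurons whose incoming weights have collapsed to (near) zero can be dropped outright, leaving a much smaller narrow network.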
@ericjmichaud_
Eric J. Michaud
3 months
On our networks trained on CMSP, we find that this simple approach works: perform some additional training on only the narrow distribution you care about, while also imposing a regularization penalty on the weights that encourages structured sparsity.
@ericjmichaud_
Eric J. Michaud
3 months
However, distinct skills are often not localized to a distinct set of model components. Instead, it seems like mechanisms are distributed across overlapping model components (perhaps due to feature superposition). This means we must do more work than just naively pruning.
@ericjmichaud_
Eric J. Michaud
3 months
If the mechanisms for different skills were localized to different model components (like neurons), then we could make general models narrow just by pruning the components which implemented the skills we wanted to unlearn.
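To make that naive, localization-based idea concrete, here is a hedged sketch (my own illustration, not the paper's procedure): score each hidden unit by how much more it activates on the skill to unlearn than on the skill to keep, then zero out the top-scoring units. As noted above, this tends to break down when mechanisms are distributed across overlapping components.

```python
# Illustrative "naive pruning": ablate the neurons most implicated in the skill
# we want to drop. Scoring rule and the fraction pruned are arbitrary choices.
import torch
import torch.nn as nn

@torch.no_grad()
def prune_by_activation(layer: nn.Linear, keep_in: torch.Tensor,
                        drop_in: torch.Tensor, frac: float = 0.2):
    """keep_in / drop_in are batches of this layer's inputs under each skill.
    Zeroes the frac of neurons most active on drop_in relative to keep_in."""
    act_keep = layer(keep_in).abs().mean(dim=0)   # mean |pre-activation| per neuron
    act_drop = layer(drop_in).abs().mean(dim=0)
    score = act_drop - act_keep                   # high => implicated in the drop skill
    k = max(1, int(frac * layer.out_features))
    idx = score.topk(k).indices
    layer.weight[idx] = 0.0                       # ablate those neurons
    if layer.bias is not None:
        layer.bias[idx] = 0.0
    return idx
```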
@ericjmichaud_
Eric J. Michaud
3 months
If it is sometimes necessary to train networks on a broader distribution first, it is natural to next ask: how can we turn big general networks into small narrow ones?
@ericjmichaud_
Eric J. Michaud
3 months
CMSP is also a nice generalization of the multitask sparse parity task we studied in our "quanta hypothesis" neural scaling work. There, tasks were independent, but our tasks here have a kind of hierarchical dependence structure, like a "skill tree".
arxiv.org — We propose the Quantization Model of neural scaling laws, explaining both the observed power law dropoff of loss with model and data size, and also the sudden emergence of new capabilities with...
@ericjmichaud_
Eric J. Michaud
3 months
This is a nice existence proof that, for data with certain structure, it can be necessary to train networks across a broader distribution of tasks in order to learn certain tasks within that distribution.
@ericjmichaud_
Eric J. Michaud
3 months
If one wants to efficiently learn to classify compositional samples, it is necessary to *also* train on a broader distribution of atomic samples. Here, it takes at least 10x the samples to learn the composite task on its own vs. when atomic samples are included (bottom left):
@ericjmichaud_
Eric J. Michaud
3 months
When more control bits are ON, the label is the parity of more of the task bits. We call samples "atomic" if only one control bit is ON, and "compositional" if multiple control bits are ON. We find that networks trained on CMSP show extremely strong curriculum learning effects.
@ericjmichaud_
Eric J. Michaud
3 months
To study this, we describe a synthetic task which we call "compositional multitask sparse parity" (CMSP). This is a binary classification problem on binary strings. Based on the pattern of the "control bits" of a sample, the label is the parity of some subset of its "task bits".
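As a hedged sketch of what such a task could look like, here is a toy CMSP-style data generator built from the description above; the bit counts, number of subtasks, and the rule mapping control patterns to task-bit subsets are my own illustrative choices, not the paper's exact construction.

```python
# Toy "compositional multitask sparse parity" (CMSP)-style generator (illustrative).
# Each control bit "owns" a small subset of task bits; the label is the parity of
# the union of the subsets owned by whichever control bits are ON.
import numpy as np

rng = np.random.default_rng(0)
N_CONTROL, N_TASK, K = 8, 64, 3        # illustrative sizes, not the paper's
SUBSETS = [rng.choice(N_TASK, size=K, replace=False) for _ in range(N_CONTROL)]

def sample(n_on: int, batch: int = 4):
    """Draw `batch` samples with exactly `n_on` control bits ON.
    n_on == 1 gives "atomic" samples; n_on > 1 gives "compositional" ones."""
    ctrl = np.zeros((batch, N_CONTROL), dtype=np.int64)
    for row in ctrl:
        row[rng.choice(N_CONTROL, size=n_on, replace=False)] = 1
    task = rng.integers(0, 2, size=(batch, N_TASK))
    labels = np.empty(batch, dtype=np.int64)
    for b in range(batch):
        active = np.flatnonzero(ctrl[b])
        bits = np.unique(np.concatenate([SUBSETS[i] for i in active]))
        labels[b] = task[b, bits].sum() % 2       # parity over the relevant task bits
    return np.concatenate([ctrl, task], axis=1), labels

x_atomic, y_atomic = sample(n_on=1)               # one control bit ON
x_comp, y_comp = sample(n_on=3)                   # several control bits ON
```

In this toy version the compositional label is just the parity over the union of the owned subsets, so learning the atomic pieces genuinely helps with the composite samples.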
@ericjmichaud_
Eric J. Michaud
3 months
Let's start with the following question: when can we train models from scratch, just on a narrow task of interest, and achieve good performance on that task?
@ericjmichaud_
Eric J. Michaud
3 months
Today, the most competent AI systems in almost *any* domain (math, coding, etc.) are broadly knowledgeable across almost *every* domain. Does it have to be this way, or can we create truly narrow AI systems? In a new preprint, we explore some questions relevant to this goal.