Damek

@damekdavis

Followers: 5K
Following: 7K
Media: 498
Statuses: 2K

Optimization and Machine Learning Assoc Prof @Wharton stats https://t.co/bfOIEx0lHj (not quite a) blog: https://t.co/RFKUB4qDKF

Joined May 2022
@damekdavis
Damek
6 months
taught my first class at penn today. topic: optimization in PyTorch
Tweet media one
21
87
1K
@damekdavis
Damek
4 hours
Tweet media one
0
0
3
@damekdavis
Damek
11 hours
o3 has a new `turn2file` token, interesting.
Tweet media one
1
0
3
@damekdavis
Damek
11 hours
i'm in section 7 of the assignment now, where you start to run everything on an h100. since every run takes so gd long, most of my time has been spent thinking about how i set up wandb, checkpointing, training configs/ablations, and downloading the tokenized data. much overhead.
1
0
3
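A minimal sketch of the kind of plumbing described above: wandb logging plus periodic checkpointing around a training loop. The project name, config fields, and checkpoint path are placeholders, not the assignment's actual setup.

```python
# Hypothetical experiment plumbing: wandb logging + periodic checkpointing.
import torch
import wandb

config = {"lr": 3e-4, "batch_size": 64, "total_steps": 10_000}  # illustrative values
wandb.init(project="assignment-ablations", config=config)       # placeholder project name

def save_checkpoint(model, optimizer, step, path="ckpt.pt"):
    torch.save(
        {"model": model.state_dict(), "optimizer": optimizer.state_dict(), "step": step},
        path,
    )

# inside the training loop (model, optimizer, loss assumed to exist):
#   wandb.log({"train/loss": loss.item()}, step=step)
#   if step % 1000 == 0:
#       save_checkpoint(model, optimizer, step)
```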
@damekdavis
Damek
1 day
torch.compile with device=mps seems to have some critical bugs.
Tweet media one
2
0
5
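For reference, a minimal way to exercise torch.compile on the MPS backend and compare against eager execution; the toy module is illustrative, not a reproduction of the bug in the screenshot.

```python
# Minimal torch.compile-on-MPS smoke test (illustrative; not the bug repro above).
import torch

device = "mps" if torch.backends.mps.is_available() else "cpu"

model = torch.nn.Sequential(torch.nn.Linear(16, 16), torch.nn.GELU()).to(device)
compiled = torch.compile(model)

x = torch.randn(4, 16, device=device)
print(torch.allclose(model(x), compiled(x), atol=1e-5))  # compiled vs. eager output
```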
@damekdavis
Damek
2 days
lol why are people cloning this
Tweet media one
2
0
5
@damekdavis
Damek
2 days
> Below is a “starter kit” I give grad students. thank you o3
Tweet media one
1
0
2
@damekdavis
Damek
4 days
A few interesting numbers on a single 80 GB A100: 1. to train gpt2-xl with AdamW, the max batch size that can fit on the gpu is 4. 2. for the same model, assuming 50% MFU, running 400k steps of AdamW with batch size 1024 will take around 7 years.
Tweet media one
Tweet media two
1
0
9
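A back-of-the-envelope version of the second number. The 6N FLOPs-per-token rule of thumb (forward + backward), the 19.5 TFLOP/s float32 peak, and the rough 1.5B parameter count are assumptions on my part; the assignment's exact accounting and peak-throughput figure move the answer by a small constant factor (these particular assumptions land closer to 12 years), but the "years, not days" conclusion is the same.

```python
# Rough training-time estimate (assumed: 6*N FLOPs/token fwd+bwd, fp32 peak, 50% MFU).
n_params = 1.5e9                   # gpt2-xl, roughly
tokens_per_step = 1024 * 1024      # batch size 1024, context length 1024
flops_per_step = 6 * n_params * tokens_per_step

peak_flops = 19.5e12               # assumed A100 float32 peak
achieved_flops = 0.5 * peak_flops  # 50% MFU

seconds = 400_000 * flops_per_step / achieved_flops
print(f"~{seconds / 86400 / 365:.1f} years")  # same order of magnitude as the estimate above
```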
@damekdavis
Damek
4 days
Part 3 of assignment 1 done. > Implement AdamW, calculate total activation memory and FLOPs, 'how long will it take to train gpt2-xl on an A100?' Probably the simplest part of the assignment. 1. AdamW: 2. Accounting:
1
0
6
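For reference, the update AdamW performs, written out as a bare loop over plain tensors rather than the torch.optim.Optimizer subclass the assignment asks for. Hyperparameters are the usual defaults; `t` is the 1-indexed step count.

```python
# One AdamW step written out directly: Adam with decoupled weight decay
# (decay applied to the weights, not folded into the gradient).
import torch

def adamw_step(params, grads, m, v, t, lr=1e-3, betas=(0.9, 0.999), eps=1e-8, weight_decay=0.01):
    b1, b2 = betas
    for p, g, m_i, v_i in zip(params, grads, m, v):
        m_i.mul_(b1).add_(g, alpha=1 - b1)           # first-moment estimate
        v_i.mul_(b2).addcmul_(g, g, value=1 - b2)    # second-moment estimate
        m_hat = m_i / (1 - b1 ** t)                  # bias correction, t starts at 1
        v_hat = v_i / (1 - b2 ** t)
        p.mul_(1 - lr * weight_decay)                # decoupled weight decay
        p.add_(m_hat / (v_hat.sqrt() + eps), alpha=-lr)
```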
@damekdavis
Damek
6 days
part 2 of assignment 1 done. > build a minimal transformer implementation extending torch.nn.Module. it is essentially an exercise in 'einsum' and 'rearrange.' nice to do once in your life.
1
0
9
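A taste of the einsum/rearrange style the exercise pushes you toward, shown here only for the attention scores; the names and shapes are illustrative, not the assignment's exact module.

```python
# Scaled attention scores via einops.rearrange + torch.einsum (illustrative shapes).
import torch
from einops import rearrange

def attention_scores(q, k, num_heads):
    # q, k: (batch, seq, d_model) -> (batch, heads, seq, head_dim)
    q = rearrange(q, "b s (h d) -> b h s d", h=num_heads)
    k = rearrange(k, "b s (h d) -> b h s d", h=num_heads)
    scale = q.shape[-1] ** -0.5
    return torch.einsum("b h i d, b h j d -> b h i j", q, k) * scale

scores = attention_scores(torch.randn(2, 16, 64), torch.randn(2, 16, 64), num_heads=8)
print(scores.shape)  # torch.Size([2, 8, 16, 16])
```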
@damekdavis
Damek
7 days
in this one, you're asked to implement the entire figure 1 transformer architecture, but the expected output does not include the softmax at the end. mostly posting these so i don't forget to open an issue with the repo later.
Tweet media one
Tweet media two
1
0
6
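On the missing softmax: the forward pass stops at logits, and the softmax is folded into the loss, e.g. via cross_entropy. A minimal illustration with made-up shapes:

```python
# The forward pass ends at logits; the softmax lives inside the loss.
import torch
import torch.nn.functional as F

logits = torch.randn(4, 16, 50257)           # (batch, seq, vocab), unnormalized
targets = torch.randint(0, 50257, (4, 16))   # next-token ids

loss = F.cross_entropy(logits.flatten(0, 1), targets.flatten())  # log-softmax applied internally
print(loss.item())
```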
@damekdavis
Damek
7 days
in this one, you should really be running the adapter, 'run_multihead_self_attention_with_rope'
Tweet media one
1
0
3
@damekdavis
Damek
12 days
funny thing is this will leak into training data.
@AnthropicAI
Anthropic
13 days
New Anthropic Research: Project Vend. We had Claude run a small shop in our office lunchroom. Here’s how it went.
Tweet media one
2
0
29
@damekdavis
Damek
12 days
warning: there is a typo in this exercise. should be k in {0, ..., d/2 - 1}.
Tweet media one
1
0
4
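For context, k here indexes the RoPE rotation frequencies: with head dimension d there is one frequency per coordinate pair, θ_k = Θ^(-2k/d) for k in {0, ..., d/2 - 1}. A small sketch of just the frequency table, assuming the usual Θ = 10000 default:

```python
# RoPE frequency table: one frequency per coordinate pair, k = 0, ..., d/2 - 1.
import torch

def rope_freqs(d, theta=10000.0):
    k = torch.arange(d // 2)        # k in {0, ..., d/2 - 1}
    return theta ** (-2 * k / d)    # shape (d/2,)

print(rope_freqs(8))  # 4 frequencies for an 8-dimensional head
```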
@damekdavis
Damek
13 days
moving to transformer implementation.
1
0
3
@damekdavis
Damek
13 days
tokenizer done. very nice exercise! wrote like 10 versions of increasing complexity. could have saved a lot of time by avoiding premature optimization. it's faster than i would have expected. probably lots of code bloat from refactoring.
1
0
4
@damekdavis
Damek
14 days
0
0
5
@damekdavis
Damek
14 days
Same group and this quanta article
Tweet media one
Tweet media two
1
0
9
@damekdavis
Damek
14 days
1
1
9
@damekdavis
Damek
14 days
deepmind trying to solve millennium problem
Tweet media one
6
11
278
@damekdavis
Damek
14 days
counterintuitive thing about BPE training: later iterations are faster because there are fewer possible merges. i spent so much time optimizing my implementation based on the first few dozen iterations. i projected it would take 8 hours on openwebtext. it took 38 mins.
1
0
9
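A deliberately naive BPE merge loop to make the point concrete: every merge shrinks the symbol sequences, so the pair-counting work per iteration drops over time. This is a toy sketch, not the optimized implementation from the thread.

```python
# Naive BPE training loop; the total sequence length (and hence the pair-counting work)
# shrinks after every merge, which is why later iterations get cheaper.
from collections import Counter

def train_bpe(words, num_merges):
    seqs = [list(w) for w in words]           # each word as a list of symbols
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for s in seqs:
            pairs.update(zip(s, s[1:]))       # count adjacent symbol pairs
        if not pairs:
            break
        best = max(pairs, key=pairs.get)      # most frequent pair
        merges.append(best)
        merged = best[0] + best[1]
        for i, s in enumerate(seqs):          # apply the merge everywhere
            j, out = 0, []
            while j < len(s):
                if j + 1 < len(s) and (s[j], s[j + 1]) == best:
                    out.append(merged); j += 2
                else:
                    out.append(s[j]); j += 1
            seqs[i] = out
    return merges

print(train_bpe(["low", "lower", "lowest"] * 100, num_merges=5))
```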