Damek

@damekdavis

Followers: 5K
Following: 7K
Media: 498
Statuses: 2K

Optimization and Machine Learning Assoc Prof @Wharton stats https://t.co/bfOIEx0lHj (not quite a) blog: https://t.co/RFKUB4qDKF

Joined May 2022
@damekdavis
Damek
6 months
taught my first class at penn today. topic: optimization in PyTorch
Tweet media one
21
87
1K
@damekdavis
Damek
4 hours
Tweet media one
0
0
3
@damekdavis
Damek
11 hours
o3 has a new `turn2file` token, interesting.
Tweet media one
1
0
3
@damekdavis
Damek
11 hours
i'm in section 7 of the assignment now, where you start to run everything on an h100. since every run takes so gd long, most of my time has been spent thinking about how i set up wandb, checkpointing, training configs/ablations, and downloading the tokenized data. much overhead.
1
0
3
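A minimal sketch of the kind of plumbing described above: wandb logging plus periodic checkpointing around a training loop. The project name, config fields, and checkpoint path are placeholders, not the assignment's actual setup.

```python
# Hypothetical experiment plumbing: wandb logging + periodic checkpointing.
import torch
import wandb

config = {"lr": 3e-4, "batch_size": 64, "total_steps": 10_000}  # illustrative values
wandb.init(project="assignment-ablations", config=config)       # placeholder project name

def save_checkpoint(model, optimizer, step, path="ckpt.pt"):
    torch.save(
        {"model": model.state_dict(), "optimizer": optimizer.state_dict(), "step": step},
        path,
    )

# inside the training loop (model, optimizer, loss assumed to exist):
#   wandb.log({"train/loss": loss.item()}, step=step)
#   if step % 1000 == 0:
#       save_checkpoint(model, optimizer, step)
```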
@damekdavis
Damek
1 day
torch.compile with device=mps seems to have some critical bugs.
Tweet media one
2
0
5
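For reference, a minimal way to exercise torch.compile on the MPS backend and compare against eager execution; the toy module is illustrative, not a reproduction of the bug in the screenshot.

```python
# Minimal torch.compile-on-MPS smoke test (illustrative; not the bug repro above).
import torch

device = "mps" if torch.backends.mps.is_available() else "cpu"

model = torch.nn.Sequential(torch.nn.Linear(16, 16), torch.nn.GELU()).to(device)
compiled = torch.compile(model)

x = torch.randn(4, 16, device=device)
print(torch.allclose(model(x), compiled(x), atol=1e-5))  # compiled vs. eager output
```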
@damekdavis
Damek
2 days
lol why are people cloning this
Tweet media one
2
0
5
@damekdavis
Damek
2 days
> Below is a “starter kit” I give grad students. thank you o3
Tweet media one
1
0
2
@damekdavis
Damek
4 days
A few interesting numbers on a single 80 GB A100: 1. to train gpt2-xl with AdamW, the max batch size that can fit on the gpu is 4. 2. for the same model, assuming 50% MFU, running 400k steps of AdamW with batch size 1024 will take around 7 years.
Tweet media one
Tweet media two
1
0
9
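A back-of-the-envelope version of the second number. The 6N FLOPs-per-token rule of thumb (forward + backward), the 19.5 TFLOP/s float32 peak, and the rough 1.5B parameter count are assumptions on my part; the assignment's exact accounting and peak-throughput figure move the answer by a small constant factor (these particular assumptions land closer to 12 years), but the "years, not days" conclusion is the same.

```python
# Rough training-time estimate (assumed: 6*N FLOPs/token fwd+bwd, fp32 peak, 50% MFU).
n_params = 1.5e9                   # gpt2-xl, roughly
tokens_per_step = 1024 * 1024      # batch size 1024, context length 1024
flops_per_step = 6 * n_params * tokens_per_step

peak_flops = 19.5e12               # assumed A100 float32 peak
achieved_flops = 0.5 * peak_flops  # 50% MFU

seconds = 400_000 * flops_per_step / achieved_flops
print(f"~{seconds / 86400 / 365:.1f} years")  # same order of magnitude as the estimate above
```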
@damekdavis
Damek
4 days
Part 3 of assignment 1 done. > Implement AdamW, calculate total activation memory and FLOPs, 'how long will it take to train gpt2-xl on an A100?' Probably the simplest part of the assignment. 1. AdamW: 2. Accounting:
1
0
6
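For reference, the update AdamW performs, written out as a bare loop over plain tensors rather than the torch.optim.Optimizer subclass the assignment asks for. Hyperparameters are the usual defaults; `t` is the 1-indexed step count.

```python
# One AdamW step written out directly: Adam with decoupled weight decay
# (decay applied to the weights, not folded into the gradient).
import torch

def adamw_step(params, grads, m, v, t, lr=1e-3, betas=(0.9, 0.999), eps=1e-8, weight_decay=0.01):
    b1, b2 = betas
    for p, g, m_i, v_i in zip(params, grads, m, v):
        m_i.mul_(b1).add_(g, alpha=1 - b1)           # first-moment estimate
        v_i.mul_(b2).addcmul_(g, g, value=1 - b2)    # second-moment estimate
        m_hat = m_i / (1 - b1 ** t)                  # bias correction, t starts at 1
        v_hat = v_i / (1 - b2 ** t)
        p.mul_(1 - lr * weight_decay)                # decoupled weight decay
        p.add_(m_hat / (v_hat.sqrt() + eps), alpha=-lr)
```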
@damekdavis
Damek
6 days
part 2 of assignment 1 done. > build a minimal transformer implementation extending torch.nn.Module. it is essentially an exercise in 'einsum' and 'rearrange.' nice to do once in your life.
1
0
9
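A taste of the einsum/rearrange style the exercise pushes you toward, shown here only for the attention scores; the names and shapes are illustrative, not the assignment's exact module.

```python
# Scaled attention scores via einops.rearrange + torch.einsum (illustrative shapes).
import torch
from einops import rearrange

def attention_scores(q, k, num_heads):
    # q, k: (batch, seq, d_model) -> (batch, heads, seq, head_dim)
    q = rearrange(q, "b s (h d) -> b h s d", h=num_heads)
    k = rearrange(k, "b s (h d) -> b h s d", h=num_heads)
    scale = q.shape[-1] ** -0.5
    return torch.einsum("b h i d, b h j d -> b h i j", q, k) * scale

scores = attention_scores(torch.randn(2, 16, 64), torch.randn(2, 16, 64), num_heads=8)
print(scores.shape)  # torch.Size([2, 8, 16, 16])
```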
@damekdavis
Damek
7 days
in this one, you're asked to implement the entire figure 1 transformer architecture, but the expected output does not include the softmax at the end. mostly posting these so i don't forget to open an issue with the repo later.
Tweet media one
Tweet media two
1
0
6
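On the missing softmax: the forward pass stops at logits, and the softmax is folded into the loss, e.g. via cross_entropy. A minimal illustration with made-up shapes:

```python
# The forward pass ends at logits; the softmax lives inside the loss.
import torch
import torch.nn.functional as F

logits = torch.randn(4, 16, 50257)           # (batch, seq, vocab), unnormalized
targets = torch.randint(0, 50257, (4, 16))   # next-token ids

loss = F.cross_entropy(logits.flatten(0, 1), targets.flatten())  # log-softmax applied internally
print(loss.item())
```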
@damekdavis
Damek
7 days
in this one, you should really be running the adapter, 'run_multihead_self_attention_with_rope'
Tweet media one
1
0
3
@damekdavis
Damek
12 days
funny thing is this will leak into training data.
@AnthropicAI
Anthropic
13 days
New Anthropic Research: Project Vend. We had Claude run a small shop in our office lunchroom. Here’s how it went.
Tweet media one
2
0
29
@damekdavis
Damek
12 days
warning: there is a typo in this exercise. should be k in {0, ..., d/2 - 1}.
Tweet media one
1
0
4
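For context, k here indexes the RoPE rotation frequencies: with head dimension d there is one frequency per coordinate pair, θ_k = Θ^(-2k/d) for k in {0, ..., d/2 - 1}. A small sketch of just the frequency table, assuming the usual Θ = 10000 default:

```python
# RoPE frequency table: one frequency per coordinate pair, k = 0, ..., d/2 - 1.
import torch

def rope_freqs(d, theta=10000.0):
    k = torch.arange(d // 2)        # k in {0, ..., d/2 - 1}
    return theta ** (-2 * k / d)    # shape (d/2,)

print(rope_freqs(8))  # 4 frequencies for an 8-dimensional head
```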
@damekdavis
Damek
13 days
moving to transformer implementation.
1
0
3
@damekdavis
Damek
13 days
tokenizer done. very nice exercise! wrote like 10 versions of increasing complexity. could have saved a lot of time by avoiding premature optimization. it's faster than i would have expected. probably lots of code bloat from refactoring.
1
0
4
@damekdavis
Damek
14 days
0
0
5
@damekdavis
Damek
14 days
Same group and this quanta article
Tweet media one
Tweet media two
1
0
9
@damekdavis
Damek
14 days
1
1
9
@damekdavis
Damek
14 days
deepmind trying to solve millennium problem
Tweet media one
6
11
278
@damekdavis
Damek
14 days
counterintuitive thing about BPE training: later iterations are faster because there are fewer possible merges. i spent so much time optimizing my implementation based on the first few dozen iterations. i projected it would take 8 hours on openwebtext. it took 38 mins.
1
0
9
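A deliberately naive BPE merge loop to make the point concrete: every merge shrinks the symbol sequences, so the pair-counting work per iteration drops over time. This is a toy sketch, not the optimized implementation from the thread.

```python
# Naive BPE training loop; the total sequence length (and hence the pair-counting work)
# shrinks after every merge, which is why later iterations get cheaper.
from collections import Counter

def train_bpe(words, num_merges):
    seqs = [list(w) for w in words]           # each word as a list of symbols
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for s in seqs:
            pairs.update(zip(s, s[1:]))       # count adjacent symbol pairs
        if not pairs:
            break
        best = max(pairs, key=pairs.get)      # most frequent pair
        merges.append(best)
        merged = best[0] + best[1]
        for i, s in enumerate(seqs):          # apply the merge everywhere
            j, out = 0, []
            while j < len(s):
                if j + 1 < len(s) and (s[j], s[j + 1]) == best:
                    out.append(merged); j += 2
                else:
                    out.append(s[j]); j += 1
            seqs[i] = out
    return merges

print(train_bpe(["low", "lower", "lowest"] * 100, num_merges=5))
```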