Adam P. Goucher @apgox X Profile

Adam P. Goucher

@apgox

Followers

1K

Following

6K

Media

204

Statuses

3K

Algorithmist

https://t.co/JSd9D7DadK

Cambridge

Joined September 2014

ncklr

@n1ckler

2 months

We just published "Hash-based signatures for Bitcoin," a new analysis of post-quantum schemes by @kudinov_mikhail and myself at @blksresearch. This paper serves as a gentle intro to hash-based schemes and explores how to optimize them specifically for application in Bitcoin. 🧵

49

253

1K

Stanislav Fort

@stanislavfort

4 months

I worked with Sam extensively at DeepMind (for example on arxiv.org/abs/2105.13343 = multiple augmentations of the same data point in the batch => better & faster training) and this seems like an amazing opportunity to join a great team & mentor at the same time!

Samuel L Smith

@SamuelMLSmith

4 months

The Training team @OpenAI is hiring researchers in London 🚀 Our twin missions are to train better LLMs, and serve them more cheaply. Get in touch if you are excited to collaborate on architecture design, reliable scaling, and faster optimization

0

1

15

Adam P. Goucher

@apgox

5 months

@Leik0w0 @AltmejdAdam @itsclivetime Scott wrote his own SASS assembler so that he could get his matmul kernel to 98% of theoretical throughput (ptxas could only get 70%). https://t.co/HazorS0gns I learned so much of what I know about writing efficient CUDA from reading Scott’s sgemm walkthrough!

github.com

Assembler for NVIDIA Maxwell architecture. Contribute to NervanaSystems/maxas development by creating an account on GitHub.

3

7

83

Adam P. Goucher

@apgox

8 months

Potentially worth stockpiling 4090s: the new Blackwell GPUs don’t natively support single-bit matrix multiply accumulate.

Lucas Beyer (bl16)

@giffmana

8 months

A friend pointed out I could've just bought a 4090 instead, so got curious and... They actually roughly doubled in price over MSRP, wow!

1

0

6

Adam P. Goucher

@apgox

9 months

This was the culmination of a long story! The origins of these ideas came from writing AVX assembly back in 2019 for transposing bitmatrices, and now finally have come to fruition as a general framework for choosing register layouts on SIMD architectures:

typedfemale

@typedfemale

9 months

a paper describing triton's linear layouts is out!

0

22

Adam P. Goucher

@apgox

10 months

Here's the code that we (mostly GPT-4.5 rather than me) wrote together: https://t.co/qelwXcGnzY

github.com

This avoids pathological edgecases (often involving tl.sort) where the ttgir graph grows exponentially as a result of repeated rematerialisations. Tested on internal benchmarks and there are no obs...

0

2

Adam P. Goucher

@apgox

10 months

I was pleasantly surprised with how well GPT-4.5 writes code: my prior experience with LLMs is that they do things pretty naively (usually with atrocious performance), but GPT-4.5 used memoization and performant data structures ab initio.

1

0

2

Adam P. Goucher

@apgox

10 months

To get GPT-4.5 to be maximally helpful, I resorted to prompting it as follows:
-- stating high-level intent first;
-- pasting the entire ~1500-line source file;
and then had an interactive conversation with it.

1

0

2

Adam P. Goucher

@apgox

10 months

As a result, in certain rare cases (especially when using the bitonic sort operation which triggers this behaviour) you would get many successive duplications resulting in an exponential increase in intermediate IR size (and thus compilation time).

1

0

2

Adam P. Goucher

@apgox

10 months

Essentially, the problem arose from how eagerly Triton's backwardMaterialization pass would duplicate parts of the IR graph to avoid layout conversions: if anything consisted purely of 'cheap arithmetic' it would get duplicated, irrespective of the amount of arithmetic.

1

0

3
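The blow-up described in the two posts above can be illustrated with a toy model. This is only a sketch and not Triton's actual pass: the tuple-based node format and the names `build_chain`, `clone_naive` and `clone_memo` are hypothetical. It assumes a chain of cheap ops in which each op uses the previous result twice, so duplicating once per use doubles the work at every level, whereas caching each clone keeps it linear.

```python
def build_chain(depth):
    """Build a toy IR: a 'load' followed by `depth` ops, each using its predecessor twice."""
    node = ("load", ())
    for i in range(depth):
        node = (f"op{i}", (node, node))
    return node


def clone_naive(node, counter):
    """Duplicate a node and, recursively, its operands once per use (no sharing)."""
    counter[0] += 1
    name, operands = node
    return (name, tuple(clone_naive(op, counter) for op in operands))


def clone_memo(node, cache):
    """Duplicate a node, reusing the clone of any operand already visited."""
    key = id(node)
    if key not in cache:
        name, operands = node
        cache[key] = (name, tuple(clone_memo(op, cache) for op in operands))
    return cache[key]


def count_clones(depth):
    """Return (naive clone count, memoised clone count) for a chain of the given depth."""
    root = build_chain(depth)
    counter = [0]
    clone_naive(root, counter)
    cache = {}
    clone_memo(root, cache)
    return counter[0], len(cache)
```

For a chain of depth d, the naive version creates 2^(d+1) - 1 clones while the memoised version creates d + 1.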

Adam P. Goucher

@apgox

10 months

I've had my first successful experience of 'vibe-coding' today: using GPT-4.5 (which is far more au fait with LLVM/MLIR than I am) to modify a Triton compiler pass to avoid certain edge-cases with exponential compilation times. 🧵

3

0

9

Adam P. Goucher

@apgox

11 months

I've written up what I know about a heavily customised PDP-3 computer built in 1960 by Charles Corderman and collaborators (originally for military applications, and later used for recreational mathematics): https://t.co/6KBR8nxLks

cp4space.hatsya.com

This is an atypical post, being chiefly about the history of a rather obscure computer that was built in 1960 out of repurposed PDP parts, but it needs to be written somewhere lest it be forgotten.…

0

3

Adam P. Goucher

@apgox

1 year

Not at all! You can generate and store hashes of all strings within a Levenshtein distance of r of the password and see whether there are collisions between those two radius-r balls, determining whether they’re within a distance of 2r. (r=1 is very practical.)

Sebastian Aaltonen

@SebAaltonen

1 year

Just a reminder to everybody: If a website compares your new password with your old one, the web browser has to send a plain text password instead of a hashed one. This is a security issue.

1

0

11
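The radius-r ball idea in the reply above can be sketched for r = 1 as follows. This is a minimal illustration under stated assumptions: the alphabet, the salt handling and the names `ball1`, `hashed_ball` and `too_similar` are hypothetical, and SHA-256 is used only to keep the example short (a real system would use a dedicated password-hashing function).

```python
import hashlib
import string

# Assumed character set; both passwords must be drawn from it.
ALPHABET = string.ascii_lowercase + string.digits


def ball1(s, alphabet=ALPHABET):
    """All strings within Levenshtein distance 1 of s, including s itself."""
    out = {s}
    for i in range(len(s) + 1):
        for c in alphabet:
            out.add(s[:i] + c + s[i:])          # insertion at position i
        if i < len(s):
            out.add(s[:i] + s[i + 1:])          # deletion of s[i]
            for c in alphabet:
                out.add(s[:i] + c + s[i + 1:])  # substitution of s[i]
    return out


def hashed_ball(password, salt):
    """Hashes of every string in the radius-1 ball around the password."""
    return {hashlib.sha256(salt + w.encode()).hexdigest() for w in ball1(password)}


def too_similar(stored_hashes, new_password, salt):
    """True if the two radius-1 balls share a hash, i.e. the distance is at most 2."""
    return not stored_hashes.isdisjoint(hashed_ball(new_password, salt))
```

The server would store `hashed_ball(old_password, salt)` when the old password is set, and later call `too_similar` on the new password without ever keeping the old one in plain text.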

Sam Altman

@sama

1 year

it is hard to overstate how much alec radford has contributed to the field, and how much of everyone's current progress traces back to his work. i believe he is a genius at the level of einstein, and also he is one of my favorite people ever--hard to imagine a nicer, warmer, or

297

392

8K

Adam P. Goucher

@apgox

1 year

AGI has been achieved internally

ARC Prize

@arcprize

1 year

New verified ARC-AGI-Pub SoTA! @OpenAI o3 has scored a breakthrough 75.7% on the ARC-AGI Semi-Private Evaluation. And a high-compute o3 configuration (not eligible for ARC-AGI-Pub) scored 87.5% on the Semi-Private Eval. 1/4

0

1

11

Jade-Amanda Laporte

@ladyjadeamanda

1 year

Haven’t been on here for over a year but a special exception has to be made to send a huge thanks & to express my sheer admiration of @tessybarton for creating this work of art & gift wrapping with such TLC (and a chip bow!) (and ofc @apgox for such a unique surprise) #gpu #bag

1

3

17

Adam P. Goucher

@apgox

1 year

Somehow this Montgomery trick has managed to replace this with cheap arithmetic plus a one-time preprocessing step, where remarkably the preprocessing step only involves reducing a 64-bit integer. This feels impossible!

3

0

Adam P. Goucher

@apgox

1 year

Suppose that you wanted to compute ab (mod N) without this Montgomery trick. If N is a 64-bit integer, then the product ab would require 128 bits to store the result, so you'd need to reduce a 128-bit integer modulo a 64-bit integer, usually requiring an expensive function call!

1

0
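The trick discussed in this thread can be sketched as a textbook Montgomery multiplication with R = 2^64. This is an illustration rather than the code the thread links to: the names `neg_inverse`, `redc`, `mont_mul` and `to_mont` are hypothetical, and Python's unbounded integers stand in for 64-bit and 128-bit machine words.

```python
R_BITS = 64
R = 1 << R_BITS
MASK = R - 1


def neg_inverse(n):
    """Return -n^(-1) mod 2^64 for odd n, using Newton iteration."""
    inv = n  # n * n = 1 (mod 8) for odd n, so this is correct to 3 bits
    for _ in range(5):  # precision doubles each step: 6, 12, 24, 48, 96 bits
        inv = (inv * (2 - n * inv)) & MASK
    return (-inv) & MASK


def redc(t, n, n_prime):
    """Montgomery reduction: return t * R^(-1) mod n, for 0 <= t < n * R."""
    m = ((t & MASK) * n_prime) & MASK  # chosen so that t + m * n is divisible by R
    u = (t + m * n) >> R_BITS          # exact division by R, done as a shift
    return u - n if u >= n else u      # u < 2n, so one conditional subtraction suffices


def mont_mul(a_bar, b_bar, n, n_prime):
    """Multiply two Montgomery representatives; the result is also a representative."""
    return redc(a_bar * b_bar, n, n_prime)


def to_mont(a, n):
    """Convert to Montgomery form. Uses a wide % for brevity in this demo only."""
    return (a * R) % n
```

Apart from the demo-only conversion in `to_mont`, the 128-bit product is reduced using only multiplications, additions, a mask and a shift.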

Adam P. Goucher

@apgox

1 year

Why do I find this remarkable? Well, it means that you can do any amount of arbitrary ring arithmetic mod N (any odd constant fitting in a machine word) just by using cheap processor instructions together with a preprocessing step involving 1 machine-word-sized modular reduction.

1

0

Adam P. Goucher

@apgox

1 year

The Montgomery representative of 2 is just 2R (mod N), which you can obtain in the following way:
- compute (R/2) mod N with a single machine-word modulo instruction (the only time we ever use this!);
- double it twice to get 2R mod N.
See the top of

2

0

1
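The two steps in the post above can be sketched as follows, assuming R = 2^64 and an odd modulus N that fits in a machine word. The names `double_mod` and `mont_two` are hypothetical. Python integers do not overflow, so fixed-width code would need extra care with the doubling when N exceeds 2^63.

```python
R_BITS = 64  # assumed machine-word size, so R = 2^64


def double_mod(x, n):
    """Return 2x mod n for 0 <= x < n, using one addition and one conditional subtraction."""
    x += x
    return x - n if x >= n else x


def mont_two(n):
    """Return 2R mod n, the Montgomery representative of 2."""
    half_r = (1 << (R_BITS - 1)) % n  # R/2 = 2^63 fits in a word: the single modulo
    r_mod_n = double_mod(half_r, n)   # R mod n, the Montgomery representative of 1
    return double_mod(r_mod_n, n)     # 2R mod n
```

The intermediate value `r_mod_n` is itself useful, since it is the Montgomery representative of 1.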