Adam P. Goucher

@apgox

Followers
1K
Following
6K
Media
204
Statuses
3K

Algorithmist

Cambridge
Joined September 2014
@n1ckler
ncklr
2 months
We just published "Hash-based signatures for Bitcoin," a new analysis of post-quantum schemes by @kudinov_mikhail and myself at @blksresearch. This paper serves as a gentle intro to hash-based schemes and explores how to optimize them specifically for application in Bitcoin. 🧵
49
253
1K
@stanislavfort
Stanislav Fort
4 months
I worked with Sam extensively at DeepMind (for example on arxiv.org/abs/2105.13343 = multiple augmentations of the same data point in the batch => better & faster training) and this seems like an amazing opportunity to join a great team & mentor at the same time!
@SamuelMLSmith
Samuel L Smith
4 months
The Training team @OpenAI is hiring researchers in London 🚀 Our twin missions are to train better LLMs, and serve them more cheaply. Get in touch if you are excited to collaborate on architecture design, reliable scaling, and faster optimization
0
1
15
@apgox
Adam P. Goucher
5 months
@Leik0w0 @AltmejdAdam @itsclivetime Scott wrote his own SASS assembler so that he could get his matmul kernel to 98% of theoretical throughput (ptxas could only get 70%). https://t.co/HazorS0gns I learned so much of what I know about writing efficient CUDA from reading Scott’s sgemm walkthrough!
github.com
Assembler for NVIDIA Maxwell architecture. Contribute to NervanaSystems/maxas development by creating an account on GitHub.
3
7
83
@apgox
Adam P. Goucher
8 months
Potentially worth stockpiling 4090s: the new Blackwell GPUs don’t natively support single-bit matrix multiply accumulate.
@giffmana
Lucas Beyer (bl16)
8 months
A friend pointed out I could've just bought a 4090 instead, so got curious and... They actually roughly doubled in price over MSRP, wow!
1
0
6
@apgox
Adam P. Goucher
9 months
This was the culmination of a long story! The origins of these ideas came from writing AVX assembly back in 2019 for transposing bitmatrices, and now finally have come to fruition as a general framework for choosing register layouts on SIMD architectures:
@typedfemale
typedfemale
9 months
a paper describing triton's linear layouts is out!
0
0
22
@apgox
Adam P. Goucher
10 months
I was pleasantly surprised with how well GPT-4.5 writes code: my prior experience with LLMs is that they do things pretty naively (usually with atrocious performance), but GPT-4.5 used memoization and performant data structures ab initio.
1
0
2
@apgox
Adam P. Goucher
10 months
To get GPT-4.5 to be maximally helpful, I resorted to prompting it as follows:
-- stating high-level intent first;
-- pasting the entire ~1500-line source file;
and then had an interactive conversation with it.
1
0
2
@apgox
Adam P. Goucher
10 months
As a result, in certain rare cases (especially when using the bitonic sort operation which triggers this behaviour) you would get many successive duplications resulting in an exponential increase in intermediate IR size (and thus compilation time).
1
0
2
@apgox
Adam P. Goucher
10 months
Essentially, the problem arose from how eagerly Triton's backwardMaterialization pass would duplicate parts of the IR graph to avoid layout conversions: if anything consisted purely of 'cheap arithmetic' it would get duplicated, irrespective of the amount of arithmetic.
1
0
3
@apgox
Adam P. Goucher
10 months
I've had my first successful experience of 'vibe-coding' today: using GPT-4.5 (which is far more au fait with LLVM/MLIR than I am) to modify a Triton compiler pass to avoid certain edge-cases with exponential compilation times. 🧵
3
0
9
@apgox
Adam P. Goucher
11 months
I've written up what I know about a heavily customised PDP-3 computer built in 1960 by Charles Corderman and collaborators (originally for military applications, and later used for recreational mathematics): https://t.co/6KBR8nxLks
cp4space.hatsya.com
This is an atypical post, being chiefly about the history of a rather obscure computer that was built in 1960 out of repurposed PDP parts, but it needs to be written somewhere lest it be forgotten.…
0
0
3
@apgox
Adam P. Goucher
1 year
Not at all! You can generate and store hashes of all strings within a Levenshtein distance of r of the password and see whether there are collisions between those two radius-r balls, determining whether they’re within a distance of 2r. (r=1 is very practical.)
@SebAaltonen
Sebastian Aaltonen
1 year
Just a reminder to everybody: If a website compares your new password with your old one, the web browser has to send a plain text password instead of a hashed one. This is a security issue.
1
0
11
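A minimal sketch of the radius-1 idea from the reply above, under stated assumptions: the names `ball1`, `hashed_ball` and `within_distance_2`, the fixed `ALPHABET`, and the use of salted SHA-256 are all illustrative choices rather than anything from the tweet (a real system would presumably use a slow password hash). Two radius-1 balls intersect exactly when the strings are within Levenshtein distance 2, provided both strings use only characters from the alphabet and ignoring hash collisions.

```python
import hashlib

# Assumed alphabet for edits; passwords are assumed to use only these characters.
ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789"


def ball1(s, alphabet=ALPHABET):
    """All strings within Levenshtein distance 1 of s (including s itself)."""
    out = {s}
    for i in range(len(s) + 1):
        for c in alphabet:
            out.add(s[:i] + c + s[i:])          # insertion
    for i in range(len(s)):
        out.add(s[:i] + s[i + 1:])              # deletion
        for c in alphabet:
            out.add(s[:i] + c + s[i + 1:])      # substitution
    return out


def hashed_ball(s, salt):
    """Hashes of every string in the radius-1 ball around s."""
    return {hashlib.sha256(salt + t.encode()).digest() for t in ball1(s)}


def within_distance_2(old_hashes, new_password, salt):
    """True if the new password's ball collides with the stored ball."""
    return not old_hashes.isdisjoint(hashed_ball(new_password, salt))


salt = b"example-salt"
stored = hashed_ball("hunter2", salt)  # computed once, when the old password is set
print(within_distance_2(stored, "hunter34", salt))
print(within_distance_2(stored, "correcthorse", salt))
```

Only hashes are stored and compared, so the old password never has to be kept in plain text for the similarity check.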
@sama
Sam Altman
1 year
it is hard to overstate how much alec radford has contributed to the field, and how much of everyone's current progress traces back to his work. i believe he is a genius at the level of einstein, and also he is one of my favorite people ever--hard to imagine a nicer, warmer, or
297
392
8K
@apgox
Adam P. Goucher
1 year
AGI has been achieved internally
@arcprize
ARC Prize
1 year
New verified ARC-AGI-Pub SoTA! @OpenAI o3 has scored a breakthrough 75.7% on the ARC-AGI Semi-Private Evaluation. And a high-compute o3 configuration (not eligible for ARC-AGI-Pub) scored 87.5% on the Semi-Private Eval. 1/4
0
1
11
@ladyjadeamanda
Jade-Amanda Laporte
1 year
Haven’t been on here for over a year but a special exception has to be made to send a huge thanks & to express my sheer admiration of @tessybarton for creating this work of art & gift wrapping with such TLC (and a chip bow!) (and ofc @apgox for such a unique surprise) #gpu #bag
1
3
17
@apgox
Adam P. Goucher
1 year
Somehow this Montgomery trick has managed to replace this with cheap arithmetic plus a one-time preprocessing step, where remarkably the preprocessing step only involves reducing a 64-bit integer. This feels impossible!
3
0
0
@apgox
Adam P. Goucher
1 year
Suppose that you wanted to compute ab (mod N) without this Montgomery trick. If N is a 64-bit integer, then the product ab would require 128 bits to store the result, so you'd need to reduce a 128-bit integer modulo a 64-bit integer, usually requiring an expensive function call!
1
0
0
@apgox
Adam P. Goucher
1 year
Why do I find this remarkable? Well, it means that you can do any amount of arbitrary ring arithmetic mod N (any odd constant fitting in a machine word) just by using cheap processor instructions together with a preprocessing step involving 1 machine-word-sized modular reduction.
1
0
0
@apgox
Adam P. Goucher
1 year
The Montgomery representative of 2 is just 2R (mod N), which you can obtain in the following way:
- compute (R/2) mod N with a single machine-word modulo instruction (the only time we ever use this!);
- double it twice to get 2R mod N.
See the top of
2
0
1
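The thread above can be sketched in Python, assuming R = 2^64 and an odd modulus N below 2^64. The function names (`neg_inverse`, `double_mod`, `redc`, `pow2_mod`) are hypothetical, and computing a power of 2 is just one illustrative use of the representative of 2, not necessarily what the thread's linked code does. Python integers are unbounded, so masking with `MASK` stands in for 64-bit wraparound; the only genuine modular reduction is the single `%` on R/2.

```python
MASK = (1 << 64) - 1  # R - 1, with R = 2**64 (assumed word size)


def neg_inverse(n):
    """Return -n^{-1} mod 2**64 for odd n, via Newton iteration (no division)."""
    x = n  # n*n == 1 (mod 8) for odd n, so x is already correct to 3 bits
    for _ in range(5):  # each step doubles the number of correct bits
        x = (x * (2 - n * x)) & MASK
    return (-x) & MASK


def double_mod(x, n):
    """Return 2x mod n for 0 <= x < n, using only an add and a compare."""
    y = x + x
    return y - n if y >= n else y


def redc(t, n, n_neg_inv):
    """Montgomery reduction: return t * R^{-1} mod n, for 0 <= t < n * R."""
    m = ((t & MASK) * n_neg_inv) & MASK
    u = (t + m * n) >> 64  # exact division by R
    return u - n if u >= n else u


def pow2_mod(e, n):
    """Compute 2**e mod n using Montgomery arithmetic throughout."""
    n_neg_inv = neg_inverse(n)
    half_r = (1 << 63) % n         # (R/2) mod n: the one real modulo
    one = double_mod(half_r, n)    # R mod n, the representative of 1
    two = double_mod(one, n)       # 2R mod n, the representative of 2
    acc, base = one, two
    while e:
        if e & 1:
            acc = redc(acc * base, n, n_neg_inv)
        base = redc(base * base, n, n_neg_inv)
        e >>= 1
    return redc(acc, n, n_neg_inv)  # convert out of Montgomery form


N = 0xFFFFFFFFFFFFFFC5  # an odd 64-bit modulus chosen for illustration
print(pow2_mod(1000003, N) == pow(2, 1000003, N))
```

Each `redc` call uses only multiplications, an add, a shift and a conditional subtract, which is the "cheap arithmetic" the thread contrasts with reducing a 128-bit product modulo a 64-bit integer.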