Dylan Lim
@dylan__lim
260 Followers · 55 Following · 0 Media · 8 Statuses
cs @stanford | prev @jumptrading
Stanford, CA
Joined April 2024
One kernel. One llama. Mega results. Proud to share our fully fused Llama-1B megakernel! Check out the code and blog below!
(1/5) We’ve never enjoyed watching people chop Llamas into tiny pieces. So, we’re excited to be releasing our Low-Latency-Llama Megakernel! We run the whole forward pass in a single kernel. Megakernels are faster & more humane. Here’s how to treat your Llamas ethically: (Joint
1 reply · 1 repost · 19 likes
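For a sense of what "one kernel" buys, here is a toy CUDA sketch of the fusion idea (illustrative only, not the released megakernel): two back-to-back launches force the intermediate tensor through global memory and put a device-wide synchronization at the kernel boundary, while a single fused launch keeps the intermediate in a register.

```cuda
#include <cuda_runtime.h>

// Baseline: two separate launches. The intermediate `tmp` round-trips
// through global memory, and the launch boundary acts as a device-wide sync.
__global__ void scale_kernel(const float* x, float* tmp, float s, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) tmp[i] = x[i] * s;
}
__global__ void bias_kernel(const float* tmp, float* y, float b, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = tmp[i] + b;
}

// Fused "megakernel" version: one launch, intermediate stays in a register.
__global__ void fused_kernel(const float* x, float* y, float s, float b, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float t = x[i] * s;  // never written to global memory
        y[i] = t + b;
    }
}

int main() {
    const int n = 1 << 20;
    float *x, *tmp, *y;
    cudaMalloc(&x, n * sizeof(float));
    cudaMalloc(&tmp, n * sizeof(float));
    cudaMalloc(&y, n * sizeof(float));

    dim3 block(256), grid((n + 255) / 256);
    // Unfused: two launches with an implicit sync boundary between them.
    scale_kernel<<<grid, block>>>(x, tmp, 2.0f, n);
    bias_kernel<<<grid, block>>>(tmp, y, 1.0f, n);
    // Fused: one launch, no intermediate memory traffic.
    fused_kernel<<<grid, block>>>(x, y, 2.0f, 1.0f, n);
    cudaDeviceSynchronize();

    cudaFree(x); cudaFree(tmp); cudaFree(y);
    return 0;
}
```

At batch size 1 a forward pass is dominated by exactly this kind of launch overhead and memory traffic, which is why fusing the entire model into one kernel pays off.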
Megakernels continue with our 8-GPU Llama-70B release! Check it out below!
(1/8) We’re releasing an 8-GPU Llama-70B inference engine megakernel! Our megakernel supports arbitrary batch sizes, mixed prefill+decode, a paged KV cache, instruction pipelining, dynamic scheduling, interleaved communication, and more! On ShareGPT it’s 22% faster than SGLang.
0 replies · 0 reposts · 7 likes
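The thread doesn't spell out the engine's memory layout, but a paged KV cache generally works like virtual memory: the cache is a pool of fixed-size pages, and a per-sequence page table maps logical token positions onto physical pages, so a sequence's cache need not be contiguous. A hypothetical CUDA sketch of the lookup (PAGE_SIZE, HEAD_DIM, and all names here are assumptions, not the release's actual layout):

```cuda
#include <cuda_runtime.h>

// Hypothetical paged-KV layout: kv_pool is [num_pages][PAGE_SIZE][HEAD_DIM];
// page_table maps a sequence's logical block index to a physical page index.
constexpr int PAGE_SIZE = 16;  // tokens per page (assumed)
constexpr int HEAD_DIM  = 64;  // per-head dimension (assumed)

// Resolve the key vector for logical token position `pos` of one sequence.
__device__ const float* lookup_key(const float* kv_pool,
                                   const int* page_table,
                                   int pos) {
    int page   = page_table[pos / PAGE_SIZE];  // which physical page
    int offset = pos % PAGE_SIZE;              // slot within that page
    return kv_pool + ((size_t)page * PAGE_SIZE + offset) * HEAD_DIM;
}

// Toy attention-score kernel: one thread per cached token position.
__global__ void qk_scores(const float* q,        // [HEAD_DIM] query vector
                          const float* kv_pool,
                          const int* page_table,
                          float* scores, int seq_len) {
    int pos = blockIdx.x * blockDim.x + threadIdx.x;
    if (pos >= seq_len) return;
    const float* k = lookup_key(kv_pool, page_table, pos);
    float dot = 0.0f;
    for (int d = 0; d < HEAD_DIM; ++d) dot += q[d] * k[d];
    scores[pos] = dot;
}
```

Paging is part of what makes arbitrary batch sizes and mixed prefill+decode tractable: sequences grow page by page instead of each reserving a contiguous max-length buffer.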
Happy to announce that multi-GPU ThunderKittens is finally here! Help your GPUs meow better by checking out the blog below!
(1/6) We’re happy to share that ThunderKittens now supports writing multi-GPU kernels, with the same programming model and full compatibility with PyTorch + torchrun. We’re also releasing collective ops and fused multi-GPU GEMM kernels, up to 2.6x faster than PyTorch + NCCL.
0 replies · 0 reposts · 6 likes
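The post doesn't show ThunderKittens' multi-GPU API, so the sketch below sticks to the plain CUDA mechanism that multi-GPU kernels build on: once peer access is enabled, a kernel running on one GPU can directly dereference memory allocated on another, rather than staging every exchange through a separate communication call.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Kernel on GPU 0 reads a buffer that physically lives on GPU 1.
// With peer access enabled, the remote pointer is directly dereferenceable;
// the loads travel over NVLink or PCIe.
__global__ void add_from_peer(float* local, const float* remote, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) local[i] += remote[i];
}

int main() {
    const int n = 1 << 20;
    float *buf0, *buf1;

    // One buffer on each device.
    cudaSetDevice(0); cudaMalloc(&buf0, n * sizeof(float));
    cudaSetDevice(1); cudaMalloc(&buf1, n * sizeof(float));

    // Check and enable device 0's access to device 1's memory.
    int can = 0;
    cudaDeviceCanAccessPeer(&can, 0, 1);
    if (!can) { printf("no P2P path between GPU 0 and GPU 1\n"); return 1; }
    cudaSetDevice(0);
    cudaDeviceEnablePeerAccess(1, 0);

    // Kernel on device 0 pulls device 1's data directly.
    add_from_peer<<<(n + 255) / 256, 256>>>(buf0, buf1, n);
    cudaDeviceSynchronize();

    cudaFree(buf0);
    cudaSetDevice(1); cudaFree(buf1);
    return 0;
}
```

This direct load/store path is what allows communication to be fused into compute kernels instead of alternating compute with separate NCCL launches.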
Happy Throughput Thursday! We’re excited to release Tokasaurus: an LLM inference engine designed from the ground up for high-throughput workloads with large and small models. (Joint work with @achakravarthy01, @ryansehrlich, @EyubogluSabri, @brad19brown, @jshetaye,
7 replies · 48 reposts · 207 likes
So so so cool. Llama-1B batch-size-1 inference in a single CUDA kernel, deleting the synchronization boundaries imposed by breaking the computation into a series of kernels called in sequence. The *optimal* orchestration of compute and memory is only achievable this way.
(1/5) We’ve never enjoyed watching people chop Llamas into tiny pieces. So, we’re excited to be releasing our Low-Latency-Llama Megakernel! We run the whole forward pass in a single kernel. Megakernels are faster & more humane. Here’s how to treat your Llamas ethically: (Joint
63 replies · 262 reposts · 2K likes
Excited to share LayoutVLM—leveraging VLMs for spatial reasoning in 3D layout generation!
Spatial reasoning is a major challenge for foundation models today, even in simple tasks like arranging objects in 3D space. #CVPR2025 Introducing LayoutVLM, a differentiable optimization framework that uses VLMs to reason spatially about diverse scene layouts from unlabeled
0 replies · 0 reposts · 1 like
Had a super fun time building this out - always love working on distributed ML systems. Big thanks to @pearvc for awarding us the best startup prize at Stanford TreeHacks!
(1/5) @CKT_Conner, @dill_pkl, @emilyzsh, and I are excited to introduce Shard - a proof-of-concept for an infinitely scalable distributed system composed of consumer hardware for training and running ML models! Features: - Data + Pipeline Parallel for handling arbitrarily large
1 reply · 1 repost · 11 likes
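Shard's implementation isn't shown in the thread, so here is only a toy illustration of the pipeline-parallel idea under assumed details: GPU 0 runs stage 0, ships activations to GPU 1 with an async peer copy, and per-micro-batch events let stage 1 of one micro-batch overlap with stage 0 of the next.

```cuda
#include <cuda_runtime.h>

// Stand-in for one pipeline stage's compute (e.g., a block of layers).
__global__ void stage_kernel(float* act, int n, float w) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) act[i] = act[i] * w;
}

int main() {
    const int n = 1 << 18;
    const int MB = 4;                 // micro-batches in flight
    float *a0[MB], *a1[MB];           // per-micro-batch activation buffers
    cudaStream_t s0, s1;
    cudaEvent_t sent[MB];

    cudaSetDevice(0);
    cudaStreamCreate(&s0);
    for (int m = 0; m < MB; ++m) {
        cudaMalloc(&a0[m], n * sizeof(float));
        cudaEventCreateWithFlags(&sent[m], cudaEventDisableTiming);
    }
    cudaSetDevice(1);
    cudaStreamCreate(&s1);
    for (int m = 0; m < MB; ++m) cudaMalloc(&a1[m], n * sizeof(float));

    dim3 block(256), grid((n + 255) / 256);
    for (int m = 0; m < MB; ++m) {
        // Stage 0 on GPU 0, then ship this micro-batch's activations to GPU 1.
        cudaSetDevice(0);
        stage_kernel<<<grid, block, 0, s0>>>(a0[m], n, 2.0f);
        cudaMemcpyPeerAsync(a1[m], 1, a0[m], 0, n * sizeof(float), s0);
        cudaEventRecord(sent[m], s0);

        // Stage 1 on GPU 1 waits only for its own micro-batch's transfer,
        // so it runs concurrently with later micro-batches' stage 0 work.
        cudaSetDevice(1);
        cudaStreamWaitEvent(s1, sent[m], 0);
        stage_kernel<<<grid, block, 0, s1>>>(a1[m], n, 3.0f);
    }
    cudaSetDevice(1);
    cudaStreamSynchronize(s1);
    return 0;
}
```

Data parallelism then composes on top by replicating the whole pipeline and synchronizing gradients across replicas.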