
Georgi Gerganov
@ggerganov
Followers: 50K · Following: 3K · Media: 269 · Statuses: 2K
24th at the Electrica puzzle challenge | https://t.co/baTQS2bdia
Joined May 2015
sam.cpp 👀: inference of Meta's Segment Anything Model on the CPU. Project by @YavorGI, powered by
34 replies · 271 retweets · 2K likes
The future of on-device inference is ggml + Apple Silicon. You heard it here first!
Watching llama.cpp do 40 tok/s inference of the 7B model on my M2 Max, with 0% CPU usage, and using all 38 GPU cores. Congratulations @ggerganov! This is a triumph.
38 replies · 177 retweets · 2K likes
ggml will soon run on a billion devices. @apple don't sleep on it 🙃
60 replies · 124 retweets · 1K likes
Just added support for all LLaMA models. I'm out of disk space, so if someone can give this a try for 33B and 65B that would be great 😄. See updated instructions in the README. Here is LLaMA-13B at ~10 tokens/s
I think I can make 4-bit LLaMA-65B inference run on a 64 GB M1 Pro 🤔. Speed should be somewhere around 2 tokens/sec. Is this useful for anything?
26 replies · 135 retweets · 991 likes
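For scale, here is the napkin math behind that claim (my arithmetic, not from the tweet): at 4 bits per weight, the 65B parameters alone take

$$
65 \times 10^{9}\ \text{params} \times \frac{4\ \text{bits}}{8\ \text{bits/byte}} \approx 32.5\ \text{GB},
$$

leaving roughly half of the 64 GB for the KV cache, activations, and the OS.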
llama.cpp just got access to the new Copilot for Pull Requests technical preview by @github. Just add tags like "copilot:all" / "copilot:summary" / "copilot:walkthrough" to your PR comment and the magic happens 🪄
15 replies · 95 retweets · 971 likes
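Purely for illustration (the placement is an assumption based on the tweet, and the description line is hypothetical), a PR comment using the preview might look like:

```text
This PR adds mmap-based model loading.

copilot:summary

copilot:walkthrough
```

Per the tweet, the tags in the PR comment then get expanded automatically.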
llama2.c running in a web page. Compiled with Emscripten, with the code modified to predict one token per render pass. The page auto-loads 50MB of model data - sorry about that 😄
My fun weekend hack: llama2.c 🦙🤠. Lets you train a baby Llama 2 model in PyTorch, then inference it with one 500-line file with no dependencies, in pure C. My pretrained model (on TinyStories) samples stories in fp32 at 18 tok/s on my MacBook Air M1 CPU.
16 replies · 142 retweets · 876 likes
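The "one token per render pass" trick maps naturally onto Emscripten's main-loop API. A minimal sketch (not the actual port; `sample_next_token` is a hypothetical stand-in for llama2.c's forward pass + sampler):

```c
#include <emscripten.h>
#include <stdio.h>

// Hypothetical stand-in for llama2.c's forward() + sampling step.
static int sample_next_token(void) {
    static int t = 0;
    return t++;
}

// Called once per browser render pass.
static void step(void) {
    // Generating exactly one token per pass keeps the page responsive:
    // the browser repaints between calls instead of blocking on a long loop.
    int token = sample_next_token();
    printf("token %d\n", token); // the real page appends decoded text to the DOM
}

int main(void) {
    // fps = 0: let the browser drive the loop via requestAnimationFrame
    emscripten_set_main_loop(step, 0, 1);
    return 0;
}
```

Build with `emcc demo.c -o demo.html` and the loop runs inside the page.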
llama.cpp now supports distributed inference across multiple devices via MPI. This is possible thanks to @EvMill's work. Looking for people to give this a try and attempt to run a 65B LLaMA on a cluster of Raspberry Pis 🙃
17 replies · 135 retweets · 845 likes
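For intuition, pipeline inference over MPI looks roughly like this (a toy sketch, not llama.cpp's actual MPI code; each rank owns a slice of layers and forwards the hidden state to the next rank):

```c
#include <mpi.h>
#include <stdio.h>

#define N_EMBD 8 // toy hidden-state size

// Hypothetical stand-in for running this rank's slice of transformer layers.
static void process_local_layers(float *h, int n, int rank) {
    for (int i = 0; i < n; i++) h[i] += (float) rank;
}

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    float h[N_EMBD] = {0}; // hidden state for the current token

    // receive activations from the previous pipeline stage
    if (rank > 0) {
        MPI_Recv(h, N_EMBD, MPI_FLOAT, rank - 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }

    process_local_layers(h, N_EMBD, rank);

    // forward the result to the next stage
    if (rank < size - 1) {
        MPI_Send(h, N_EMBD, MPI_FLOAT, rank + 1, 0, MPI_COMM_WORLD);
    } else {
        printf("rank %d (last stage): h[0] = %.1f\n", rank, h[0]);
    }

    MPI_Finalize();
    return 0;
}
```

Run with e.g. `mpirun -np 4 ./pipeline`; each Pi in a cluster would play the role of one rank.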
llama.cpp releases now ship with pre-built macOS binaries. This should reduce the entry barrier for llama.cpp on Apple devices. Thanks to @huggingface for the friendly support 🙏
16 replies · 67 retweets · 719 likes
Here is the most cost-effective way to deploy (the real) R1 in the cloud - use llama.cpp-powered inference endpoints on @huggingface. The @UnslothAI quantizations fit neatly in 4x L40S (max 8192 ctx for now). Use the following link:
29 replies · 101 retweets · 726 likes
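A quick feasibility check (my numbers, not from the tweet: R1 has ~671B parameters, the Unsloth dynamic quants average roughly 1.6 bits/weight, and an L40S has 48 GB of VRAM):

$$
671 \times 10^{9} \times \frac{1.6\ \text{bits}}{8\ \text{bits/byte}} \approx 134\ \text{GB} \;<\; 4 \times 48\ \text{GB} = 192\ \text{GB},
$$

which leaves some headroom for the (limited) 8192-token KV cache.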
llama.cpp is standing its ground against the behemoths. The CUDA backend is contained in a single C++ file, which allows for very easy deployment and custom modifications. (pp - prefill, tg - text gen)
Trying out the new TensorRT-LLM framework and getting some pretty good performance out of the box with 3090s. 107 tokens/sec int8 and 54 tok/sec bf16 for llama-2 7B models (not much work to set up either). Get 160+ tokens/sec on 2x3090s (these are just batch_size=1)
12 replies · 47 retweets · 568 likes
The GGUF file format is a great example of the cool things that an open-source community can achieve. Props to @philpax_ and everyone else involved in the design and implementation of the format. I'm thankful and happy to see it finding adoption in ML.
At @huggingface, we are adding more support for GGUF (model format by @ggerganov). The number of GGUF models on the Hub has been exploding & doesn't look like it is gonna slow down 🔥. See more at:
11 replies · 63 retweets · 496 likes
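For reference, the fixed-size part of a GGUF file starts like this (per the public GGUF spec, v2 and later; little-endian, followed by the metadata key/value pairs and the tensor infos):

```c
#include <stdint.h>
#include <stdio.h>

#define GGUF_MAGIC 0x46554747u // the bytes "GGUF", read little-endian

struct gguf_header {
    uint32_t magic;     // must equal GGUF_MAGIC
    uint32_t version;   // format version (3 at the time of writing)
    uint64_t n_tensors; // number of tensor-info records after the KV section
    uint64_t n_kv;      // number of metadata key/value pairs
};

int main(void) {
    FILE *f = fopen("model.gguf", "rb"); // example path
    if (!f) return 1;

    struct gguf_header hdr;
    // fields are naturally aligned (4,4,8,8), so a direct read is safe on common ABIs
    if (fread(&hdr, sizeof hdr, 1, f) != 1 || hdr.magic != GGUF_MAGIC) {
        fprintf(stderr, "not a GGUF file\n");
        fclose(f);
        return 1;
    }
    printf("GGUF v%u: %llu tensors, %llu metadata KVs\n",
           hdr.version,
           (unsigned long long) hdr.n_tensors,
           (unsigned long long) hdr.n_kv);
    fclose(f);
    return 0;
}
```

The self-describing KV section is what lets a single file carry the tokenizer, hyperparameters, and quantization info alongside the tensors.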
Retweet if you want llama.cpp added here 👇
In preview, Copilot Pro and Copilot Free users can now bring your own key (BYOK) for popular providers such as Anthropic, Gemini, Ollama, and Open Router. This allows you to use new models that aren’t supported natively by Copilot the very first day that they’re released.
16 replies · 222 retweets · 560 likes
ggml inference tech making its way into this week's @apple M4 announcements is a great testament to this. IMO, Apple Silicon continues to be the best consumer-grade hardware for local AI applications. For next year, they should move Copilot on-device.
15 replies · 47 retweets · 542 likes
The ggml roadmap is progressing as expected, with a lot of infrastructural development already completed. We now enter the more interesting phase of the project - applying the framework to practical problems and doing cool stuff on the Edge.
Took the time to prepare a ggml development roadmap in the form of a GitHub Project. This sets the priorities for the short/mid term and will offer a good way for everyone to keep track of the progress that is being made across related projects.
7 replies · 41 retweets · 519 likes
whisper.cpp now supports @akashmjn's tinydiarize models. These fine-tuned models offer experimental support for speaker segmentation by introducing special tokens for marking speaker changes.
16 replies · 62 retweets · 496 likes
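From the C API side, enabling tinydiarize is essentially a one-flag change. A sketch (API names are from the tinydiarize-era whisper.h and may differ in newer versions; the audio buffer is a silent stand-in):

```c
#include "whisper.h"
#include <stdbool.h>
#include <stdio.h>

static float pcm[16000]; // 1 s of 16 kHz mono audio; stand-in for real samples

int main(void) {
    struct whisper_context *ctx =
        whisper_init_from_file("models/ggml-small.en-tdrz.bin"); // tdrz fine-tune
    if (!ctx) return 1;

    struct whisper_full_params p = whisper_full_default_params(WHISPER_SAMPLING_GREEDY);
    p.tdrz_enable = true; // turn on tinydiarize speaker-turn detection

    if (whisper_full(ctx, p, pcm, 16000) != 0) return 1;

    // print each segment, marking where the model predicts a speaker change
    for (int i = 0; i < whisper_full_n_segments(ctx); i++) {
        printf("%s%s\n",
               whisper_full_get_segment_text(ctx, i),
               whisper_full_get_segment_speaker_turn_next(ctx, i)
                   ? " [SPEAKER_TURN]" : "");
    }

    whisper_free(ctx);
    return 0;
}
```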
Son has been doing an outstanding job at maintaining the llama-server implementation and now bringing full-blown vision input support to llama.cpp! Massive kudos and thanks for your valuable contributions to the project!
9 replies · 46 retweets · 498 likes
Here is what a properly built llama.cpp looks like. Running 7B on a 2-year-old Pixel 5 at 1 token/sec. Would be interesting to see what an interactive session feels like.
10 replies · 66 retweets · 441 likes
GGUF My Repo by @huggingface. Create quantized GGUF models fully online - quickly and securely. Thanks to @reach_vb, @pcuenq and team for creating this HF space! In the video below I give it a try and create a quantized 8-bit model of Gemma 2B - it took about
23 replies · 85 retweets · 449 likes
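Rough size check (my assumptions: Gemma 2B has ~2.5B parameters, and llama.cpp's Q8_0 stores 8-bit weights plus a per-block fp16 scale, so ~8.5 bits/weight):

$$
2.5 \times 10^{9} \times \frac{8.5\ \text{bits}}{8\ \text{bits/byte}} \approx 2.7\ \text{GB},
$$

versus roughly 5 GB for the same weights in fp16.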
Very cool experiment by @chillgates_. Distributed MPI inference using llama.cpp with 6 Raspberry Pis - each one with 8GB RAM "sees" 1/6 of the entire 65B model. Inference starts around ~1:10. Follow the progress here:
Yeah. I have ChatGPT at home. Not a silly 7B model. A full-on 65B model that runs on my Pi cluster. Watch how the model gets loaded across the cluster with mmap and does round-robin inferencing 🫡 (10 seconds/token, sped up 16x)
11 replies · 75 retweets · 429 likes
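The mmap part is the key trick here: mapping the model file read-only means the OS faults pages in on demand, so each node only ever touches the slice of weights its pipeline stage actually reads. A minimal sketch (the model path is an example):

```c
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void) {
    int fd = open("ggml-model-q4_0.bin", O_RDONLY); // example model path
    if (fd < 0) return 1;

    struct stat st;
    if (fstat(fd, &st) != 0) return 1;

    // PROT_READ + MAP_PRIVATE: nothing is copied up front; the kernel pages
    // weights in lazily and can evict them again under memory pressure.
    void *weights = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (weights == MAP_FAILED) return 1;

    printf("mapped %lld bytes at %p\n", (long long) st.st_size, weights);

    munmap(weights, st.st_size);
    close(fd);
    return 0;
}
```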
napkin math ahead:
- buy 8 Mac minis (200 GB/s, ~$1.2k each)
- run LLAMA_METAL=1 LLAMA_MPI=1 for interleaved pipeline inference
- deploy on-premise, serve up to 8 clients in parallel at 25 t/s / 4-bit / 7B
is this cost efficient? energy-wise? thanks to @stanimirovb for the idea.
24 replies · 25 retweets · 406 likes
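Working the napkin math through (my arithmetic, using only the tweet's numbers): a 4-bit 7B model is about $7 \times 10^{9} \times 0.5 \approx 3.5$ GB of weights, and batch-1 text generation is roughly memory-bandwidth-bound because every token reads all the weights, so

$$
\frac{200\ \text{GB/s}}{3.5\ \text{GB/token}} \approx 57\ \text{tokens/s}
$$

is the per-machine ceiling; the quoted 25 t/s sits within about 2x of it. Hardware cost: $8 \times \$1.2\text{k} \approx \$9.6\text{k}$.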
Powered by: ggml / whisper.cpp / llama.cpp / Core ML
STT: Whisper Small
LLM: 13B LLaMA
TTS: @elevenlabsio
The Whisper Encoder is running on the Apple Neural Engine. Everything else is optimized via ARM NEON and Apple Accelerate.
10 replies · 18 retweets · 362 likes
Playing some chess using voice. WASM whisper.cpp with a quantized tiny model + grammar sampling (by @ejones). Runs locally in the browser. Not perfect, but I think pretty good overall! Try it here:
8 replies · 43 retweets · 356 likes
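Grammar sampling here means constraining the decoder with a GBNF grammar (the format used by the llama.cpp/whisper.cpp grammar support) so it can only emit legal move strings. An illustrative grammar, not the exact one from the demo, for long algebraic moves like "e2e4" or "Ng1f3":

```
# illustrative GBNF, llama.cpp grammar format
root   ::= move
move   ::= castle | piece? square square promo?
piece  ::= [KQRBN]
square ::= [a-h] [1-8]
promo  ::= "=" [QRBN]
castle ::= "O-O-O" | "O-O"
```

At each decoding step the sampler masks out every token that cannot continue a string matched by `root`, which is how even a quantized tiny model stays on the rails.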