I was really excited when Mojo became publicly available and started thinking about which project I could implement to learn Mojo concepts.
Since I had already ported llama2.c to pure Python, I decided: why not try porting it to Mojo now? 😀
And here is what I got...
The out-of-the-box features of
@Modular_AI
's Mojo are just incredible. We applied loop unrolling, and now llama2.🔥 outperforms
@ggerganov
's llama.cpp by almost 20% in CPU inference speed.
BREAKING: I've implemented a prototype of a cutting-edge Q-Learning algorithm in Mojo 🔥, and it's now running 35,000x faster than any existing (!) implementation. Thanks to Mojo's incredible feature that lets you transparently import any Python module!
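For context, the heart of tabular Q-Learning is a one-line Bellman update; a minimal Python sketch (the toy state/action spaces and hyperparameters here are illustrative):

```python
def q_learning_step(Q, state, action, reward, next_state, alpha=0.1, gamma=0.99):
    """One tabular update: Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    best_next = max(Q[next_state].values())
    Q[state][action] += alpha * (reward + gamma * best_next - Q[state][action])

# Toy 2-state, 2-action table, purely illustrative
Q = {s: {a: 0.0 for a in ("left", "right")} for s in (0, 1)}
q_learning_step(Q, state=0, action="right", reward=1.0, next_state=1)
print(Q[0]["right"])  # 0.1 after one update
```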
I really hope this
llama2 inference in pure Mojo 🔥
I found Mojo's SIMD primitives a really interesting feature, since they helped improve the pretty awful performance of the Python solution by almost 250x.
Internally I used vectorisation helpers for matmul, so the Mojo solution can now beat the original llama2.c by
@karpathy
by 20%.
I think there is still some room for further improvement; a sketch of the vectorization idea follows.
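The gist of those helpers: the transformer forward pass is dominated by matrix-vector products, and computing each row's dot product in wide SIMD lanes instead of one scalar at a time is where the win comes from. A minimal Python/NumPy illustration of the same idea (Mojo expresses it with SIMD types and vectorize; the dimensions here are illustrative):

```python
import numpy as np

def matvec_scalar(W, x):
    """Naive scalar matvec: one multiply-add at a time, roughly what pure Python does."""
    out = [0.0] * len(W)
    for i, row in enumerate(W):
        acc = 0.0
        for j, w in enumerate(row):
            acc += w * x[j]
        out[i] = acc
    return out

def matvec_vectorized(W, x):
    """Vectorized matvec: each row's dot product runs over wide SIMD lanes."""
    return W @ x

W = np.random.rand(288, 288).astype(np.float32)  # 288 = stories15M's hidden dim
x = np.random.rand(288).astype(np.float32)
assert np.allclose(matvec_scalar(W.tolist(), x.tolist()), matvec_vectorized(W, x), atol=1e-2)
```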
Now it achieves over 1000 tokens per second for inference on M1 Max! 🚀
Big shoutout to our contributor Michael Kowalski for helping make this possible:
I've got early access to the Mojo SDK 🔥 for Mac from
@Modular_AI
. And of course, I've always wanted to run baby llama inference in pure Mojo on Apple Silicon… True story, not only on Mojo 😉
So far, results are mind-blowing! Here are some benchmarks
Mojo 🔥 0.5.0 is released! 🚀 Even more epic updates unleashed! 😱 Check out the highlights ⬇️ or read the full changelog here & happy weekend hacking! 👩🏼💻 ➡️
GitHub's integration of GPU enabled M1 Apple Silicon hosts for action runners may have flown under the radar, but its implications are vast. It's a strong indicator of Apple Silicon's rising adoption among developers, hinting at a future where M2 and M3 become central to the
Wow! This is exciting! Thanks
@Modular_AI
for appreciating my efforts & congrats on the public release of Mojo! Seems that my port of llama2 inference is truly the "first proper AI written in Mojo" 😀
The first crack at llama2.🔥 is here 🚀
A Mojo 🔥 community member - Mojician - did a simple port from Python to Mojo, and shows it's already 20% faster than Karpathy's llama2.c implementation 😱 How much faster can it go? 📈
@Modular_AI
Exciting!
I got early access to the Mojo SDK (Mac) a week ago and compared its performance on baby-llama inference: Mojo vs Rust, C, C++, Go, Zig, and Julia. In total, 12 implementations across 7 languages × 3 models × 30 rounds. Check this out
I have the honor of authoring the first-ever guest post on the Modular AI blog, about my journey porting
#llama2
inference into the
#mojo
lang
Kudos to
@shshnkp
for the incredible cooperation and support with preparing this article!
New blog post by Mojician 🔥 and guest contributor
@tairov
💯⬇️
Aydyn discusses his journey from discovering Mojo 🔥 to implementing llama2.🔥, which has over 1.2k stars 🤩 on GitHub! 🚀
Thanks to a PR from a Modular team member, parallelize now works in llama2.🔥
Why not compare parallel execution with llama2.c?
And... llama2.c strikes back, now with OMP...
👀
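Under the hood the two approaches share the same idea: the big matmuls are embarrassingly parallel across output rows, so Mojo's parallelize and llama2.c's OMP pragmas both just split the rows across cores. A rough Python sketch of the pattern (thread count and chunking are illustrative):

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def matvec_parallel(W, x, workers=6):
    """Split output rows into chunks; compute each chunk on its own thread.
    NumPy releases the GIL inside the dot product, so threads genuinely overlap."""
    chunks = np.array_split(np.arange(W.shape[0]), workers)
    out = np.empty(W.shape[0], dtype=W.dtype)
    def run(rows):
        out[rows] = W[rows] @ x  # disjoint row ranges, so writes never collide
    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(run, chunks))
    return out

W = np.random.rand(512, 512).astype(np.float32)
x = np.random.rand(512).astype(np.float32)
assert np.allclose(matvec_parallel(W, x), W @ x)
```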
"MLX has a Python API which closely follows NumPy. MLX also has a fully featured C++ API which closely mirrors the Python API"
Yet another attempt to fix ML model usability by implementing Python libs written in C++.
Now from Apple research
Just in time for the holidays, we are releasing some new software today from Apple machine learning research.
MLX is an efficient machine learning framework specifically designed for Apple silicon (i.e. your laptop!)
Code:
Docs:
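To give a flavor of how closely it follows NumPy, here's a minimal sketch (assumes `pip install mlx` on an Apple silicon Mac):

```python
import mlx.core as mx

# Arrays live in unified memory; operations are lazy until evaluated
a = mx.random.normal((1024, 1024))
b = mx.random.normal((1024, 1024))
c = a @ b    # reads exactly like NumPy
mx.eval(c)   # forces the lazy graph to actually execute
print(c.shape)
```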
From slowest to fastest — my Python and Mojo 🔥 ports of
@karpathy
's llama2.c interestingly went to opposite ends of the perf spectrum. Recently I got a PR for
#python
using pypy & codon compilation on llama2-py that boosted it ~50x!
Despite my best efforts to attend
#ModCon
onsite this year, I sadly couldn't make it happen. But the event can't truly go on without the first ever Mojician v-attendance! 😅 Wishing everyone an insightful conference!
I tried to optimize all the
#llama2c
ports for max performance. Some don't support multithreading so the comparisons aren't completely apples-to-apples. But it's clear Mojo is here to stay
It turns out that Mistral's team is literally not using Mojo to speed up training and inference 35,000x, and they raised €120M+
Investors what are you doing 😢 this is horrendous!
Mistral's team literally just used learned weights over data instead of programmed rules and raised 120M+.
Investors what are you doing 😢 this is horrendous
I secured early access to the Mojo SDK on Mac before general release. Put all
#llama2c
ports through extensive benchmarks across 7 languages and 12 variations. Crafted a custom benchmarking framework to test performance. Quite an intriguing battle on the M1 Mac; the results are telling 👇
I built a test env to benchmark all
@karpathy
's
#llama2c
ports, including Mojo, Zig, Julia, Rust &
#llamacpp
converter by
@ggerganov
. Ran inference across 3 baby llama models in 30 rounds (multi/single threaded). Check out the full report
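Conceptually the harness is simple: invoke each port's binary with the same model for N rounds, parse the reported tok/s, and aggregate. A stripped-down Python sketch (the commands and the output format being parsed are hypothetical; the real framework is in the linked report):

```python
import re
import statistics
import subprocess

def bench(cmd, rounds=30):
    """Run one llama2 port `rounds` times and aggregate its reported tok/s."""
    scores = []
    for _ in range(rounds):
        out = subprocess.run(cmd, capture_output=True, text=True).stdout
        m = re.search(r"([\d.]+)\s*tok/s", out)  # hypothetical output format
        if m:
            scores.append(float(m.group(1)))
    return statistics.mean(scores), statistics.stdev(scores)

# Hypothetical invocations for two of the ports
for name, cmd in {
    "llama2.c":    ["./run", "stories15M.bin"],
    "llama2.mojo": ["mojo", "llama2.mojo", "stories15M.bin"],
}.items():
    print(name, bench(cmd))
```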
It seems that the Mistral team is setting a new trend in open-source LLMs.
MoE, mixture of experts. If I understand correctly, GPT-4's impressive performance largely stems from a similar technique. In the backend, it is said to feature eight 'heads' or 'experts', each around 250 billion parameters.
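Mechanically, MoE puts a small router in front of parallel FFN "experts": a gating network scores the experts per token and only the top-k actually run. A toy Python sketch of the routing step (dimensions, expert count, and k are illustrative):

```python
import numpy as np

def moe_forward(x, experts, gate_W, k=2):
    """Route a token through the top-k experts, weighted by softmax gate scores."""
    logits = gate_W @ x                      # one score per expert
    top = np.argsort(logits)[-k:]            # indices of the k best experts
    w = np.exp(logits[top] - logits[top].max())
    w /= w.sum()                             # softmax over the selected experts only
    return sum(wi * experts[i](x) for wi, i in zip(w, top))

dim, n_experts = 16, 8
experts = [lambda x, W=np.random.randn(dim, dim): W @ x for _ in range(n_experts)]
gate_W = np.random.randn(n_experts, dim)
y = moe_forward(np.random.randn(dim), experts, gate_W)
print(y.shape)  # (16,)
```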
@aniketvartak
@Modular_AI
@karpathy
The thing is, llama2.mojo was also implemented for understanding Mojo concepts and to have a real-world example. However, that doesn't mean both projects couldn't evolve further to squeeze everything you can out of the hardware while trying to maintain brevity
Kudos to the
@ziglang
community for improving and benchmarking llama2 inference in Zig on Apple M1/M2! The llama2.zig implementation has solid single-threaded performance; it may be the fastest single-threaded inference of tiny-llama models so far on Macs. Surprisingly, no
@justthisguy
@Modular_AI
@ggerganov
llama2.🔥 is a port of
@karpathy
's llama2.c; the biggest supported model so far is TinyLlama with 1.1B parameters. I hope we can soon run bigger quantized models as well
Project
#2
: LLM Visualization
So I created a web-page to visualize a small LLM, of the sort that's behind ChatGPT. Rendered in 3D, it shows all the steps to run a single token inference. (link in bio)
@clattner_llvm
This is what happens when you launch a new exciting technology that many enthusiasts have been eager to try out right before the weekend 👨💻
I think it's worthwhile to share some interim results we got on the llama2.🔥 speedup. Our incredible GitHub contributors baked a draft PR. And here is what we have 👇
It shows additional details on which loops were vectorized by the gcc compiler.
So far it seems that the very first llama2.c vs Mojo comparison was fair:
gcc is aggressively vectorizing all the loops it can find 😀
Who remembers this?
From 2008, the Google Search Appliance led the way in on-prem search solutions for enterprises deployed as a rack form factor black box device.
It was discontinued post-2018
I discovered podcasts on X. Today I was invited to a really nice one! Thanks
@altryne
for the opportunity to share highlights & my experience with the early release of the Mojo SDK on Mac from
@Modular_AI
. PS: My moment of fame starts at the 59th minute 🔊
T-minus 2 hours for
@thursdai_pod
live recording, and as always, if you can't make it to the live one, make sure you're subscribed to receive the episode in newsletter and podcast form
Mojo 🔥 is coming to Mac 💻 very soon 😱
Here’s a little sneak peek of us testing llama2.🔥 out of the box by
@tairov
. Look for this to drop in the next couple of weeks 💯🚀
@ggerganov
Sounds cool, but I think it might be hard to compete with AWS on its own territory 😀 from the cost perspective. They already have the AWS Bedrock service rolling out; it's kind of an API to many LLM models where you pay for the tokens used
Epic work, the thing I love about this is how small and clean the code is - literally reimplementing everything down to the metal instead of depending on thick layers of magic.
Gemini, the avant-garde and trailblazing multi-modal virtuoso of language models, state-of-the-art titan, infused with wit and wisdom far beyond its digital peers. It's an inventive, quick-witted behemoth, eclipsing predecessors with its sterling adaptability!
Gemini:
There were some debates regarding the fairness of the C vs Mojo comparison. I was in doubt whether it was fair or not, since in Mojo I deliberately introduced SIMD operations.
After some research I found an interesting gcc switch, `-fopt-info-vec`, which reports which loops the compiler managed to vectorize
@cpavel866
@Modular_AI
@ggerganov
I wouldn't say llama.cpp is 10x faster on M1 Metal. It probably has a 2x boost. I'm eager to benchmark Mojo with GPU support once it's released.
@4evaBehindSOTA
Hey
@4evaBehindSOTA
! Wanna give it "Round 3"? :)
We added unrolling improvements; now it hits 1000 tok/s for stories15M. Pull the latest changes and use -j 6
seems that with threads = 6 it works even better
We've reached another milestone with support for the 1.1B TinyLlama, which can now generate advanced responses, like an explanation of the Pythagorean theorem or Python code for calculating the Fibonacci sequence. Impressive performance for a 4GB model!
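For a sense of what that looks like, here's the kind of snippet it produces for the Fibonacci prompt (illustrative, not verbatim model output):

```python
def fibonacci(n):
    """Return the first n Fibonacci numbers."""
    seq = [0, 1]
    while len(seq) < n:
        seq.append(seq[-1] + seq[-2])
    return seq[:n]

print(fibonacci(10))  # [0, 1, 1, 2, 3, 5, 8, 13, 21, 34]
```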
I'm eager to benchmark it against any other reference Q-Learning implementation, as soon as one is available 🤓 This is probably the only opinionated prototype so far. I'm afraid the competitors don't stand a chance either way
Here is real value for whoever comes to the comments:
Modular is giving away free tickets to ModCon 2023 + swag 🎁. It seems it's still wide open and the competition is low! I see it as a prime opportunity to implement some classic algo in Mojo for a solid chance at a win. I'd love to
It's absolutely crazy how cheap the computing power is if you stay outside of the Cloud 🤯
The usage of iximiuz Labs keeps growing, so I'm upgrading my bare metal servers. And I just doubled the fleet's CPU capacity with the price going from $44 to $53 per server per month.
Gemini must be a beast. Performing better than 85% of participants in a typical Codeforces competition is wild! It's like solving 4-5 medium/hard LeetCode problems within 2.5 hours.
So excited to share what the team and I have been working on these last months!
#AlphaCode
2 is powered by Gemini and performs better than 85% of competition participants in 12 contests on Codeforces! More details at
@GoogleDeepMind
@tairov
@karpathy
Thank you Aydyn! I love the idea of a pure Python implementation with no dependence on external packages. I started from your version and introduced numpy: ... Using the standard interpreter, we go from 10x slower to 2x slower compared to the C version.
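The change is essentially mechanical: the pure-Python version's hot loop becomes a single NumPy call. Schematically (a sketch of the idea, not the actual PR diff; llama2.py stores weights as a flat row-major list):

```python
import numpy as np

# Before: pure-Python matvec, O(n*d) interpreted bytecode per token
def matmul_py(out, x, w, n, d):
    for i in range(d):
        out[i] = sum(w[i * n + j] * x[j] for j in range(n))

# After: the same flat weights viewed as a matrix, one C-speed call
def matmul_np(x, w, n, d):
    return w.reshape(d, n) @ x
```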
Meanwhile AMD is also presenting something, obviously for AI. I'm exhausted; I have no time to keep up with everything
Hey Grok, maybe your Qdrant-based vector search can help summarise this? 😀
Context is not all you need.
As this research highlights, LLMs struggle with basic contextual understanding as the reasoning context grows more complex. Without a framework firmly grounding symbols in reality, model performance degrades.
As it was demonstrated on OpenAI
Claude 2.1 (200K Tokens) - Pressure Testing Long Context Recall
We all love increasing context lengths - but what's performance like?
Anthropic reached out with early access to Claude 2.1 so I repeated the “needle in a haystack” analysis I did on GPT-4
Here's what I found:
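The "needle in a haystack" methodology is easy to replicate: bury one out-of-place fact at a controlled depth in filler text, ask the model to retrieve it, and sweep depth against context length. A schematic Python sketch (ask_model is a hypothetical stand-in for the Claude/GPT-4 API client):

```python
def make_haystack(filler, needle, depth, total_chars):
    """Insert `needle` at a relative depth (0.0 = start, 1.0 = end) of the context."""
    body = (filler * (total_chars // len(filler) + 1))[:total_chars]
    pos = int(depth * len(body))
    return body[:pos] + " " + needle + " " + body[pos:]

def recall_grid(ask_model, needle_answer, depths, lengths, filler, needle, question):
    """Sweep depth x context length; record whether the model retrieves the needle."""
    results = {}
    for n in lengths:
        for d in depths:
            prompt = make_haystack(filler, needle, d, n) + "\n\n" + question
            results[(n, d)] = needle_answer in ask_model(prompt)  # hypothetical call
    return results
```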
@Modular_AI
Hi
@ylecun
! 🙌 I've been diving deep into the new Mojo lang by implementing
#Llama2
inference on it. We'd love to hear your insights on Mojo and its stated capabilities
Ideal explanation of all aspects of LLMs, including the security concerns, in such a condensed form.
It's brilliant how succinctly the information is conveyed.
@karpathy
's videos are examples of almost perfect compression of huge ML topics into an accessible form.
New YouTube video: 1hr general-audience introduction to Large Language Models
Based on a 30min talk I gave recently; it tries to be a non-technical intro and covers mental models for LLM inference, training, finetuning, the emerging LLM OS, and LLM security.
Now, on average we're performing slightly better than multithreaded
#llama2c
. We're able to further improve the vectorization/parallelization of the transformer forward pass
You don't have to train from scratch when developing a smaller model of an existing model family.
Sharing our latest work - “Initializing Models with Larger Ones”
arxiv preprint:
code:
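As I understand the paper, the core trick is "weight selection": initialize each layer of the small model from a subset of the corresponding pretrained larger layer instead of from random noise. A rough Python sketch (the selection is simplified here to taking the leading sub-block):

```python
import numpy as np

def init_from_larger(W_large, d_out, d_in):
    """Initialize a (d_out, d_in) layer by selecting a sub-block of a larger
    pretrained layer's weights, instead of drawing them at random."""
    assert W_large.shape[0] >= d_out and W_large.shape[1] >= d_in
    return W_large[:d_out, :d_in].copy()

W_teacher = np.random.randn(1024, 1024)   # stands in for a pretrained layer
W_student = init_from_larger(W_teacher, 256, 256)
print(W_student.shape)  # (256, 256)
```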