And here you have it! The cover for Hands-On Large Language Models.
And the animal is:
*drum roll*
The Red Kangaroo!
Why the Red Kangaroo? The process of choosing cover animals is a closely guarded secret held deep within the legendary halls of
@OReillyMedia
.
@MaartenGr
and I
How GPT3 works. A visual thread.
A trained language model generates text.
We can optionally pass it some text as input, which influences its output.
The output is generated from what the model "learned" during its training period where it scanned vast amounts of text.
1/n
pip install scikit-learn
It's easy to take for granted, but this single command gives you functionality I'd value at hundreds of thousands of dollars, if not more.
Not to mention amazing documentation that beautifully weaves guides and references.
Hats off to
@scikit_learn
A 🧵looking at DeepMind's Retro Transformer, which at 7.5B parameters is on par with GPT3 and models 25X its size in knowledge-intensive tasks.
A big moment for Large Language Models (LLMs) for reasons I'll mention in this thread.
Presenting the Explainable AI Cheat Sheet:
Video:
Cheat Sheet:
A high-level map to major categories of ML Explainability. Informed by excellent work by
@ChristophMolnar
@IAugenstein
@sameer_
and others. Plenty of links!
AI image generation is the most recent mind-blowing AI capability.
#StableDiffusion
is a clear milestone in this development because it made a high-performance model available to the masses.
This is how it works.
1/n
Interfaces for Explaining Transformer Language Models
A new blog post (with interactive explorables) to make transformers more transparent. It shows input saliency for generated text, and (VASTLY more interesting) neuron activations
What makes LLM tokenizers different from each other? GPT4 vs. FlanT5 vs. Starcoder vs. BERT and more
Tokenizers are one of the key components of Large Language Models (LLMs). One of the best ways to understand what they do is to compare the behavior of different tokenizers.
In
Our new short course, "Large Language Models with Semantic Search", is now live! In it, you'll learn how to use LLMs to build the next generation of search systems using concepts like embedding and reranking. Hope you enjoy it!
What an incredible honor to
We just released "Large Language Models with Semantic Search", built with
@cohere
, and taught by
@JayAlammar
and
@SerranoAcademy
. Search is a key part of many applications. Say, you need to retrieve documents or products in response to a user query; how can LLMs help? You’ll
If the rise of LLMs caught you by surprise, here's your chance to get a preview of what's likely to be the next monumental jump in AI capabilities:
LLM-backed agents that use software tools
In this video, I'll walk you through the concepts and code of building an LLM-backed
Probing Classifiers: A Gentle Intro (Explainable AI for Deep Learning)
New video!
Probing Classifiers are an Explainable AI tool used to make sense of the representations that deep neural networks learn for their inputs.
Finetuning Text Embedding Models
Achieving peak performance in tasks like text classification and semantic search often requires finetuning an embedding model.
This is one of the key intuitions one needs to build when using Large Language Models.
We're launching
@CohereAI
Sandbox – open-source libraries to help developers experiment with language AI
I've been working on topic modeling using LLMs:
-1-
Intro to Basic Semantic Search
A gentle guide to building simple semantic search features that go beyond keyword search. Uses sentence embeddings and Annoy to build a "similar questions" feature.
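A rough sketch of the idea behind a "similar questions" feature, assuming each question already has an embedding. The three-dimensional vectors below are made up for illustration (real sentence embeddings have hundreds of dimensions), and the brute-force scan stands in for an approximate nearest-neighbor index such as Annoy.

```python
import math

def cosine_similarity(a, b):
    # Cosine of the angle between two vectors: 1.0 means same direction.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def most_similar(query_vec, corpus, top_n=2):
    # Score every stored question against the query, highest similarity first.
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in corpus.items()]
    return [text for _, text in sorted(scored, reverse=True)[:top_n]]

# Hypothetical embeddings, invented for this example.
corpus = {
    "How do I reset my password?": [0.9, 0.1, 0.0],
    "I forgot my login credentials": [0.8, 0.2, 0.1],
    "What is the refund policy?": [0.0, 0.1, 0.9],
}
query = [0.85, 0.15, 0.05]  # pretend embedding of "can't sign in"
results = most_similar(query, corpus)
```

The two login-related questions come back even though the query shares no keywords with them, which is the point of going beyond keyword search.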
Ecco – See what your NLP language model is “thinking”
Ecstatic to release my first open-source project! Interactive visualizations in jupyter for
@huggingface
GPT2-based language models.
Github:
HN:
Big update to "The Illustrated Stable Diffusion" post
14 new and updated visuals.
The biggest update is that forward diffusion is more precisely explained -- not as a process of steps (that are easy to confuse with de-noising steps).
-1-
From the various tools that enable building solutions with large language models (LLMs), DSPy stands out to me as one of the most promising tools for building LLM pipelines.
I got to speak to
@lateinteraction
and ask him to introduce DSPy and what he envisions for its future.
ChatGPT has Never Seen a Single Word (Despite Reading All of The Internet). Glance at LLM Tokenizers.
New Video!
It's fascinating that the actual input to language models is not exactly the text we pass them! Learn more about tokenizers, a key component of LLMs.
Link in reply
When training binary classifiers in
@PyTorch
, make sure to use the correct binary loss for your network structure.
BCEWithLogitsLoss improves numeric stability, but make sure you pass the actual logit output because it will apply the sigmoid itself.
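To see why the raw logit matters, here is a minimal pure-Python sketch of the math. The function names are mine and this is not PyTorch's implementation; it only mirrors the standard numerically stable formulation of binary cross-entropy on logits.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def bce_from_probability(p, y):
    # Plain binary cross-entropy; expects a probability strictly between 0 and 1.
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def bce_with_logits(z, y):
    # Stable form: max(z, 0) - z*y + log(1 + exp(-|z|)).
    # The sigmoid is folded into the formula, so z must be the raw logit.
    return max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))

logit, target = 2.0, 1.0
correct = bce_with_logits(logit, target)                  # raw logit in
reference = bce_from_probability(sigmoid(logit), target)  # same value, two steps
double_sigmoid = bce_with_logits(sigmoid(logit), target)  # the bug: sigmoid applied twice
extreme = bce_with_logits(1000.0, 0.0)                    # stays finite for huge logits
```

`correct` and `reference` agree, while `double_sigmoid` is a different (larger) loss. The two-step version would fail on the extreme logit because `sigmoid(1000.0)` rounds to exactly 1.0 and `log(1 - 1.0)` is undefined.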
How GPT3 Works - Visualizations and Animations
A compilation of my threads explaining GPT3. I'll still post early drafts here on Twitter, but that post is the proper & final home for them all.
1/n of the second thread
The Illustrated Retrieval Transformer
New video!
Language models are improved by giving them the ability to query a database or search the web for information.
Here's a look at one way of doing that.
Tokenizers and self-attention both lie at the heart of the LLM boom.
Learn about them and more in the most recent post on the newsletter.
LLM Tokenizers, Semantic Search Course, And Book Update
#2
The update on attention is a teaser to a chapter
Software is eating the world.
Machine learning is eating software.
Transformers are eating machine learning.
Oversimplifications, to be sure, but this trail of utility to economic value is evident and we don't yet understand how drastically it will shift economic value.
1/n
Intro to Large Language Models with Cohere
A high-level look at large language models and some of their applications for language processing. It covers text generation models (like GPT) and representation models (like BERT).
Inspecting Neural Networks with Canonical Correlation Analysis - A gentle Intro
New Video!
Methods like CKA, PWCCA, and SVCCA serve as similarity measures that give us insight into how a neural network processes its inputs.
Despite the Generative AI craze, one of the most exciting and reliably useful areas of AI is not generative at all.
It is search.
Learn about Neural Search from
@Nils_Reimers
, creator of Sentence Transformers, and
@CohereAI
director of ML/embeddings
Einsum is a key method in summing and multiplying tensors. It's implemented in
@numpy_team
,
@TensorFlow
, AND
@PyTorch
. Here's a visual intro to Einstein summation functions.
1/n
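A small sketch of what two common einsum strings mean, written as plain Python loops so the index bookkeeping is visible. The helper names are made up for this example; in NumPy the first one corresponds to `np.einsum("ij,jk->ik", a, b)`.

```python
def einsum_ij_jk_ik(a, b):
    # "ij,jk->ik": j appears in both inputs but not in the output,
    # so it is multiplied and summed away (matrix multiplication).
    rows, inner, cols = len(a), len(b), len(b[0])
    out = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        for k in range(cols):
            for j in range(inner):
                out[i][k] += a[i][j] * b[j][k]
    return out

def einsum_ij_i(a):
    # "ij->i": j is missing from the output, so each row is summed.
    return [sum(row) for row in a]

a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
product = einsum_ij_jk_ik(a, b)
row_sums = einsum_ij_i(a)
```

The rule of thumb: any index that is dropped from the right-hand side of the arrow gets summed over.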
How GPT-3 Works - Easily Explained with Animations
A gentle and visual look at how the API/model works under the hood -- including how the model is trained, and how it calculates its predictions.
New Video!
This Intro to Deep Unsupervised Learning is excellent. It's presented by Alec Radford, the first author of papers including GPT, GPT2, DCGAN, and CLIP.
Covers word2vec, GloVe, RNNs, ELMo, BERT, T5, Electra, and more.
A Visual Guide to Prompt Engineering
Large GPT language models are rising in prominence as language processing and generation tools. They can write, paraphrase, and summarize, but they can also classify.
This is a gentle starting guide to prompts.
How does BERT answer questions?
In this explorable,
@betty_v_a
shows how the layers of BERT successively mutate the representations of input words (question and context) so the correct answer ("bathroom") ends up isolated enough for the model to pick
Scatterplots are amazing for exploration. We use them all the time for text (using embeddings). It's the first time I get to explore a music scatter plot -- each point is 3 seconds of music.
Fascinating work by
@philtgun
at
Entity Extraction with Large Language Models
In this article and notebook,
@nickfrosst
and I walk you through extracting movie names from r/movies posts using a generative language model.
So many exciting things happening in ML these days.
DeepMind's Gato is the direction I'm excited about the most.
One small-ish model that learns text, images, playing video games, robotic sensors and control.
Everything is a sequence!
Let's work out how:
1/n
Two years in the making by a talented, collaborative, and fun team, and with enormous help and support from many others at
@DeepMind
. No better place to be! Congrats
@scott_e_reed
on this step.
The next generation of RAG applications will
1) include a query rewriting step
2) provide citations for its sources.
This is an incredible visual guide on how to build it end-to-end.
Colab:
The Chat endpoint with RAG is easy to use, but it's also customizable.
In document mode, the endpoint is highly modular. In this LLM University chapter, learn how to build a RAG-powered chatbot with the Chat, Embed, and Rerank endpoints.
New model alert!
@CohereAI
's new embedding model supports 100+ languages and delivers 3X better performance than existing open-source models.
See the post by
@Nils_Reimers
and
@amrmkayid
:
I'm writing an updated version of The Illustrated Transformer for the upcoming Hands-On LLMs book I'm co-writing with
@MaartenGr
.
What updates/developments in the past 5 years do you feel should be a definitive addition to an intro to the architecture? Lots of additions to
I caught up with
@abertsch72
at
#NeurIPS2023
, who was presenting Unlimiformer, a retrieval-augmentation method for encoder-decoder models allowing unlimited length inputs.
Paper: Unlimiformer: Long-Range Transformers with Unlimited Length Input
Work with
@urialon1
@gneubig
, and
Language Processing with BERT: The 3 Minute Intro (Deep learning for NLP)
New video! A brief and highly accessible intro to BERT, where you have used it, and the various applications it powers.
This week, we launched Command-R, which crowns
@cohere
's stack of RAG-optimized language models.
Join me in any of these upcoming dates as I break down this advanced-RAG, multilingual stack:
March 12, 3PM:
#SXSW2024
, Austin, Texas. Hacks / Hackers (Sign up:
Large Language Models for Real-World Applications - A Gentle Intro
My talk from
@PyData
London is now online! It covers three top LLM use cases we see at
@CohereAI
(classification, semantic search, text generation).
Here are the five main slides:
Remaking Old Computer Graphics With AI Image Generation
New post!
I take Dream Studio, Midjourney, and DALL-E for a test drive: recreating an old video game cinematic.
In the end, I share my current impression of these services.
Ecstatic to see "Machine learning research communication via illustrated and interactive web articles" published at
@rethinkmlpapers
workshop at
#ICLR2021
In it, I describe my workflow for communicating ML to millions of readers.
Paper:
1/5
Just published!
My "Visual Intro to Machine Learning and Deep Learning" talk at QCon 2020. A gentle intro to ML for software engineers where I go over 10 foundational concepts, 4 applications, and 3 tools to get you started on your journey.
The Narrated Transformer Language Model
A new video!
A high-level overview of transformer language models. It addresses both the transformer architecture and language modeling (as that makes a simpler intro than machine translation)
A Gentle Intro to Transformer language models and how makes them more transparent
My talk at
@PydataKhobar
is now live! Thanks to the organizers.
Colab:
Seeing Voices: 1 - Intro to Spectrograms
New video!
I have been captivated by this method that visualizes sound. It's used in ML for speech recognition, but is also opening the door to better understand animal communication and intelligence.
If you're a visual learner, be sure to check out
@MeorAmer1
's Visual Intro to Deep Learning. Meor's ability to create visual language explaining ML concepts is absolutely remarkable.
I'm going to be honest. I hyperventilated a little when I saw this dataset internally.
All of Wikipedia. Embedded. Passage by passage. Not only English, but 9 other languages as well.
Ecstatic to get to put it in your hands
What could you build if you had the embeddings of ALL of wikipedia?
The Embedding Archives: Millions of Wikipedia Article Embeddings in Many Languages
We’re publishing ~100 million embedding vectors, covering Wikipedia in 10 languages. Get them now!
Ecco v0.1.0 is out! Massive update.
- Support for T5, T0, DeBERTa, and ability to add other/local models
- Feature attribution via Integrated Gradients and many other methods
- Support for Beam Search generation
Good morning
#ACL2023NLP
!
Excited for my first ACL since Gathertown. Would love to say hi if you're here!
I'll be tweeting my experience in this thread over the next few days.
AI Agents will take the abilities of LLMs to a whole new level.
Here's how to build a simple agent that can use software tools like searching the web or writing and running python code (LLMs love to write
@matplotlib
code for you).
Automate your enterprise workflows with Cohere's multi-step tool use. Our generative model Command R+ excels at leveraging external tools to execute complex tasks to streamline business operations.
Get started today!
Behavioral Testing of ML Models (Unit tests for machine learning)
New video!
Creating unit tests for ML models gives us higher resolution understanding of model performance -- allowing us to better compare models and observe degradation.
Applying massive language models in the real world with
@CohereAI
This is a round up of some of my recent writings and collaborations on applying large language models at Cohere. They contain a bunch of intuitions for problem solving with LLMs.
What's the big deal with Generative AI? Is it the future or the present?
New post!
This is part 1 of reflections on how best to think of the current state of AI products and features, & avoid pitfalls people tend to fall into with new tech.
Four main points:
The scale of
#NeurIPS2023
is staggering. This is a look at just one of the poster sessions.
If only AI could help us explore / understand / browse / better search all this knowledge..
Let's look at different tokenizers in action -- explaining so much of how an LLM "sees" text.
New Video! (link in response)
We have carefully crafted a piece of text that reveals so much about how an LLM parses its input. We pass it to BERT, GPT4, GPT2, Galactica, Starcoder,
LLM-backed agents have been some of the most futuristic LLM directions in 2023. The Voyager paper, presented here by coauthor
@yuqi_xie5
at a
#neurips2023
workshop, was certainly one of the most fascinating.
With the right framing, a (text+code only) LLM can successfully
Oversimplified example of self-attention, the concept behind a lot of the current progress in AI/ML.
Say a model needs to process the sentence:
"A robot must obey the orders given 𝗶𝘁 by human beings"
Self-attention helps the model resolve which word "𝗶𝘁" refers to.
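An equally oversimplified sketch of the mechanics, assuming made-up two-dimensional vectors for three of the words and skipping the learned query/key/value projections a real transformer uses. It is scaled dot-product attention with the word vectors reused as queries, keys, and values.

```python
import math

def softmax(scores):
    # Subtract the max before exponentiating for numerical stability.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    # Score the query against every key, scale by sqrt(dimension),
    # turn scores into weights, then take the weighted sum of the values.
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    weights = softmax(scores)
    output = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(len(values[0]))]
    return weights, output

# Hypothetical vectors: "it" is deliberately placed close to "robot".
tokens = ["robot", "orders", "it"]
vectors = {"robot": [1.0, 0.0], "orders": [0.0, 1.0], "it": [0.9, 0.1]}
keys = [vectors[t] for t in tokens]
weights, output = attend(vectors["it"], keys, keys)
```

With these toy numbers the largest weight for "it" lands on "robot", so the updated representation of "it" is pulled toward the word it refers to.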
Favorite AI/ML Books: Intro to ML with Python (Book Review)
New Video!
I go over the awesome "Intro to ML with Python" by
@amuellerml
and Sarah Guido. A book that helped me understand many applied ML methods.
What are inductive biases? Can models make different predictions when trained on the same data?
@RTomMcCoy
distills the concept incredibly well in this one graphic.
More:
Video:
Top-k and Top-p are key parameters for controlling the output of GPT models. They are two possible decoding strategies (or let's call them 'token picking methods').
This is a visual look at how they work as the last step in GPT text generation.
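A minimal sketch of the two filters, assuming the model has already produced a probability for each candidate token. The four-token distribution is invented for the example; a real model scores tens of thousands of tokens.

```python
def top_k_filter(probs, k):
    # Keep the k most likely tokens and renormalize so they sum to 1.
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    total = sum(p for _, p in ranked)
    return {tok: p / total for tok, p in ranked}

def top_p_filter(probs, p):
    # Keep the smallest set of top tokens whose cumulative probability
    # reaches p (nucleus sampling), then renormalize.
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, cumulative = [], 0.0
    for tok, prob in ranked:
        kept.append((tok, prob))
        cumulative += prob
        if cumulative >= p:
            break
    total = sum(prob for _, prob in kept)
    return {tok: prob / total for tok, prob in kept}

probs = {"the": 0.5, "a": 0.3, "robot": 0.15, "zebra": 0.05}
top_k = top_k_filter(probs, 2)    # fixed number of candidates
top_p = top_p_filter(probs, 0.9)  # candidate count adapts to the distribution
```

The next token is then sampled from the filtered distribution, for example with `random.choices(list(top_k), weights=list(top_k.values()))`.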
LLMs are finally breaking free from short context lengths using methods like Ring Attention.
Don't miss this visual explainer by
@khshind
@simonguozirui
@bonniesjli
How do state-of-the-art LLMs like Gemini 1.5 and Claude 3 scale to long context windows beyond 1M tokens?
Well, Ring Attention by
@haoliuhl
presents a way to split attention calculation across GPUs while hiding the communication overhead in a ring, enabling zero overhead scaling
Jay's Visual Intro to AI
I made a video introducing AI and some of its key business applications. I talk about the motivation of using AI, and the simple trick that lies at the heart of the majority of AI/ML applications in the real world.
I like this graphic from a
@huggingface
notebook on tokenization.
It shows three tokenization schemes with examples, and how vocabulary size increases across different schemes.
GPT's tokenization is similar to the one in the middle.
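A toy comparison of the three schemes, assuming a tiny hand-written subword vocabulary. The greedy longest-match splitter below is only a stand-in for illustration; GPT models use byte-pair encoding with a learned vocabulary, which this does not reproduce.

```python
def char_tokens(text):
    # Character-level: tiny vocabulary, very long sequences.
    return list(text)

def word_tokens(text):
    # Word-level: short sequences, but every distinct word needs a vocabulary entry.
    return text.split()

def subword_tokens(text, vocab):
    # Greedy longest match against a subword vocabulary,
    # falling back to single characters when nothing matches.
    tokens = []
    for word in text.split():
        start = 0
        while start < len(word):
            end = len(word)
            while end > start + 1 and word[start:end] not in vocab:
                end -= 1
            tokens.append(word[start:end])
            start = end
    return tokens

text = "tokenizers tokenize text"
vocab = {"token", "izers", "ize", "text"}  # hypothetical vocabulary
chars = char_tokens(text)
subwords = subword_tokens(text, vocab)
words = word_tokens(text)
```

The subword split reuses "token" across both words, which is how the middle scheme keeps the vocabulary moderate without falling back to single characters.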
One of the best investments you can make in your AI Engineering skillset is to be comfortable with the ideas of using language models for search.
In "Using LLMs for Search with Dense Retrieval and Reranking",
@SerranoAcademy
and I give you the key intuitions for building this
In our latest blog post, Cohere's Head of Developer Relations
@SerranoAcademy
and Engineering Director
@JayAlammar
provide a comprehensive overview of how to use LLMs to power state-of-the-art search.
Guess the animal on the cover of our upcoming Hands-On Large Language Models book for a chance to win a free copy!
There's a secret method that assigns the animals of
@OReillyMedia
books. Even
@MaartenGr
and I, as authors, don't know what the animal is until it is assigned.
Favorite python books: Effective Python
New video!
I go over
@haxor
's excellent advanced Python book with recommendations on how to make your code more Pythonic.
Finding the Words to Say: Hidden State Visualizations for Language Models
New post! Visualizations glancing at the "thought process" of language models & how it evolves between layers. Builds on awesome work by
@nostalgebraist
@lena_voita
@tallinzen
. 1/n
AI Agents are some of the most drastic technological changes on the horizon. I asked CMU professor
@gneubig
about how best to define the current crop of AI agents and where he sees them going.
Links to our full conversation are in a reply. We discussed LLM evaluations, new
So many fascinating ideas at yesterday's
#blackboxNLP
workshop at
#emnlp2020
. Too many bookmarked papers. Some takeaways:
1- There's more room to adopt input saliency methods in NLP, with Grad*input and Integrated Gradients being key gradient-based methods.
Self-attention is an important component of the transformer, but not the only one. Some might misunderstand
"Attention is all you need" to mean that all the key computation happens in attention layers.
In reality, it's more like "Attention can replace recurrence/convolutions"
Combing For Insight in 10,000 Hacker News Posts With Text Clustering
New blog post!
I embedded and clustered the top HN posts looking for insight on personal/career development. I built an interactive map and found ~700 posts that fit the bill.
1/n
AI Art Explained: How AI Generates Images
New video!
If you want to know how AI generation works and how it's trained, this video is for you! With tens of original figures explaining the internal mechanics of diffusion models.
I just learned that the creator of the excellent sklearn cheat sheet is
@amuellerml
. This comes a day after I shot a video about his excellent ML Intro book which REALLY helped me learn ML when I started out. Technical communication wizard.
Coming up next on the YouTube channel
The covariance matrix is an essential tool for analyzing relationships in data. In numpy, you can use the np.cov() function to calculate it.
Here's a shot at visualizing the elements of the covariance matrix and what they mean:
1/5
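A hand-rolled sketch of what each element of the matrix holds, written without NumPy so the arithmetic is visible. It assumes one list per variable and the sample (n - 1) denominator, which matches the defaults of `np.cov`. The height and weight numbers are made up.

```python
def covariance_matrix(rows):
    # rows: one list of observations per variable.
    n = len(rows[0])
    means = [sum(r) / n for r in rows]

    def cov(a, mean_a, b, mean_b):
        # Average product of deviations from the mean, with n - 1 in the denominator.
        return sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / (n - 1)

    return [[cov(ra, ma, rb, mb) for rb, mb in zip(rows, means)]
            for ra, ma in zip(rows, means)]

height = [1.0, 2.0, 3.0, 4.0]
weight = [2.0, 4.0, 6.0, 8.0]
matrix = covariance_matrix([height, weight])
# matrix[0][0]: variance of height   matrix[0][1]: covariance of height and weight
# matrix[1][0]: same covariance      matrix[1][1]: variance of weight
```

The diagonal holds each variable's variance and the off-diagonal entries hold the pairwise covariances, so the matrix is always symmetric.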
A Generalist Agent (Gato) - DeepMind's single model learns 600+ tasks
New video!
Gato's tokenization method maps tasks from text, vision, and control to token sequences learned by a single 1.18B param GPT model.
On the transformer side of
#acl2020nlp
, three works stood out to me as relevant if you've followed the Illustrated Transformer/BERT series on my blog:
1- SpanBERT
2- BART
3- Quantifying Attention Flow
(1/n)
I had the pleasure of hosting
@MaartenGr
to speak about BERTopic, and discuss topic modeling, visualization, API design, modularity, and other topics.
Watch it now! Episode
#1
of Talking Language AI:
Overview blogpost:
Ecstatic and honored that Ecco was published as an
#ACL2021NLP
demo paper!
Ecco: An Open Source Library for the Explainability of Transformer Language Models
v0.0.15 is out now!
A language model thinks this Dune review is negative:
"I have a well-documented weakness for sci-fi and expected Dune to feed my soul. I didn't expect it to entirely blow my mind."
Which input words lead to this prediction?
These. Darker is more important.
Be sure to check the awesome NLP course by
@lena_voita
. It's highly visual, well animated, and even has interactive explorables (scroll down to 'Sampling with temperature' to get the intuition for the 'temperature' parameter in language models).
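For a quick numeric feel of the temperature parameter, here is a minimal sketch assuming three made-up logits. The logits are divided by the temperature before the softmax, which is the usual formulation.

```python
import math

def softmax_with_temperature(logits, temperature):
    # Lower temperature sharpens the distribution, higher temperature flattens it.
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]  # hypothetical scores for three candidate tokens
sharp = softmax_with_temperature(logits, 0.5)
neutral = softmax_with_temperature(logits, 1.0)
flat = softmax_with_temperature(logits, 2.0)
```

The top token's probability is highest in `sharp` and lowest in `flat`, which is why low temperatures give more predictable text and high temperatures give more varied text.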
LLM Developers loved Command R, some called it the RAG King, well, hang on till you meet Command R+. Out now!
Open weights. Much, much more capable:
- Multi-hop RAG: It takes RAG capabilities to a whole new level. When dealing with complex questions, it’s able to search for
⌘R+
Welcoming Command R+, our latest model focused on scalability, RAG, and Tool Use. Like last time, we're releasing the weights for research use, we hope they're useful to everyone!
I've been enjoying learning the Trax deep learning library. I've created an intro notebook to the Transformer Language Model (on which GPT is based):
It's a great way to start learning how transformer models are built. 1/n
We live in an AWESOME age of enlightenment.
Oh, you wanna learn about SVD? Have
@luis_likes_math
break it down for you[1]. Or have
@3blue1brown
show you how to bend space with your mind (& linear algebra) [2]. Or just attend the whole MIT course [3].
We're incredibly blessed.
The Unreasonable Effectiveness of RNNs (Article and Visualization Commentary)
New Video! I comment on one of my favorite ML articles which helped me break into ML and NLP. We take a look at its visualizations of neuron firings.