Excited to share a new model with
@ContextualAI
that tops the AlpacaEval 2.0 leaderboard!
How did we manage to rank higher than models like GPT4, Claude 3 and Mistral Medium? Enter iterative alignment… 🧵
Introducing the Multi-Game Decision Transformer: Learn how it trains an agent that can play 41 Atari games, can be quickly adapted to new games via fine-tuning, and significantly improves upon the few alternatives for training multi-game agents →
At Contextual AI, one of the biggest pain points for our customers doing LLM alignment is getting the preference data that current methods need.
Think about all the pain you’ve been through trying to collect + label training data — now imagine doing that at 100x the scale. 🧵 1/
Most of my time in undergrad was spent running research experiments instead of doing homework. I'd say it was pretty worth it 🤠
Grateful to have these goons still in my life!
Before Cohere For AI
@forai_ml
there was . In 2017, a group of students and a dropout responded to a Slack message to do research together. ✨
Sometimes, embarking on a research idea can change your entire path in life.
Excited to share this work on Multi-Game Decision Transformers (MGT)! For large-scale applications, traditional RL methods have struggled with effectively leveraging optimal policies learned from data of multiple modalities. MGT is one step towards achieving a generalist agent:
How can we effectively train generalist multi-environment agents? We trained a single Decision Transformer model to play many Atari games simultaneously and compared it to alternative approaches:
Not convinced? We’re releasing Archangel: the largest-ever suite of feedback-aligned LLMs: 56 models in total from 1B to 30B aligned with 8 different methods (including KTO!) on a combination of SHP, Open Assistant, and Anthropic HH.
Try
@huggingface
: 4/
Pleased to announce we've trained a 100000B parameter LLM!
JUST KIDDING (:
Check out the new version of our Goldiprox paper (active selection of high information datapoints with proxy models)! Maybe we can think more deeply about the problem of data efficiency at large scale...
Tired of waiting 💤 while your model trains? Try skipping points that are already learned, not learnable, or not worth learning! This robustly reduces the training steps 🏎 needed to reach the same accuracy on big web-scraped data by >10x!
📜ICML 2022 paper:
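The selection rule from the thread above can be sketched in a few lines: score each candidate point by its current training loss minus the loss of a small proxy model, so already-learned points (low train loss) and noisy, unlearnable points (high loss for both models) get skipped. This is a toy illustration of the idea, not the paper's implementation; the proxy losses are made-up stand-ins for a separately trained proxy model.

```python
import numpy as np

def select_batch(train_losses, proxy_losses, k):
    """Pick the k points with the largest 'reducible' loss:
    high current training loss (not yet learned) minus proxy loss
    (which discounts noisy / unlearnable points)."""
    scores = np.asarray(train_losses) - np.asarray(proxy_losses)
    return np.argsort(-scores)[:k]

# Toy example: point 0 is already learned (low train loss),
# point 2 is label noise (high loss for both models),
# point 1 is learnable and not yet learned -> selected first.
train = [0.1, 2.0, 3.0]
proxy = [0.1, 0.2, 2.9]
print(select_batch(train, proxy, 1))  # -> [1]
```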
Excited to share some co:ol news! Today
@CohereAI
is entering the scene with the most up-to-date, scalable, and responsible language models that help businesses + computers understand and leverage the rich information of our world
Happy to release our work on Language Model Cascades. Read on to learn how we can unify existing methods for interacting models (scratchpad/chain of thought, verifiers, tool-use, …) in the language of probabilistic programming.
paper:
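The cascades idea — chaining model calls as a probabilistic program over strings — can be sketched with a stub in place of a real LLM. The `lm` function below is a deterministic placeholder, purely illustrative; a real cascade would sample strings from a language model.

```python
def lm(prompt):
    # Stand-in for a language model call; a real cascade would
    # sample a continuation string from an LLM given the prompt.
    return f"<continuation of: {prompt!r}>"

def chain_of_thought(question):
    # A two-step cascade: sample a reasoning string ("thought"),
    # then sample an answer conditioned on question + thought.
    # Verifiers, tool-use, etc. compose the same way.
    thought = lm(f"Q: {question}\nLet's think step by step.")
    answer = lm(f"Q: {question}\nThought: {thought}\nA:")
    return thought, answer

thought, answer = chain_of_thought("What is 2 + 2?")
print(answer)
```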
Super excited to be mentoring up 'n coming young researchers, and collaborating with those from before! This was how I got to learn about research back when I had zero ML experience and couldn't get into any labs at my university... Let's continue the legacy 💪
How can we make information retrieval optimal in settings where factuality is equally important as good generation?
Excited to share the latest from
@ContextualAI
! RAG2.0 jointly optimizes the key components of typical RAG systems reaching SOTA on many key benchmarks. Read more:
Our first set of RAG 2.0 models, Contextual Language Models (CLMs), significantly improve performance over current systems across axes critical for enterprise work: open-domain question answering, faithfulness, and freshness.
Introducing TF-Coder 🖥
A program synthesis tool that helps you write tensor manipulations in TensorFlow. Simply provide an input/output example of the desired behavior, and leave the rest to TF-Coder!
Learn more →
Try it out →
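TF-Coder's interface is spec-by-example: you give input/output tensors and it searches for a TensorFlow expression that matches them. The snippet below only illustrates the kind of spec you'd provide and verifies one candidate expression with NumPy; the example tensors and the candidate are hypothetical, and the actual tool does the search for you.

```python
import numpy as np

# Input/output example of the desired tensor manipulation:
# here, we want each row rescaled so that it sums to 1.
example_input = np.array([[1.0, 3.0], [2.0, 2.0]])
example_output = np.array([[0.25, 0.75], [0.5, 0.5]])

# A candidate expression (the kind of thing a synthesizer
# might return, written with NumPy for illustration):
candidate = example_input / example_input.sum(axis=1, keepdims=True)

# The spec is satisfied iff the candidate reproduces the output.
assert np.allclose(candidate, example_output)
print("candidate matches the input/output spec")
```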
We will be presenting our poster (
#431
) this morning!
Come chat with me and
@MichaelPoli6
about generative models, compression, and implicit representations!
When I asked my language model for math advice, I guess I got some pretty pragmatic gardening tips: "one plus two is the right amount of organic fertilizer that will support your plants."
@ContextualAI
Starting with Mistral 7b Instruct as the base, we progressively aligned with 3 rounds of Kahneman-Tversky Optimization (KTO) on 3 disjoint partitions of the PairRM DPO dataset that utilized prompts only from UltraFeedback without external LLMs. 2/
Isaac Gym -
@NVIDIAAI
physics simulation environment for reinforcement learning research (Preview Release)
- End-to-End GPU accelerated
- Isaac Gym tensor-based APIs for massively parallel sim
Also get in touch for potential internships to flex in Gym!
When I first learned about neural nets, back when I was a hardcore genetics student, I interpreted them as mathematics turned into biology - yes, not the other way around. Apparently I'm now a science sellout (according to my med friends), so maybe there's more to my theory after all!
No words will ever do justice to how much I have learned and grown as a researcher and person these past few months. May the fourth forever be with this kind & brilliant team ❤️
Honored and humbled to be able to learn from and do great research with my amazing team of mentors and co-authors, especially
@kuanghueilee
and
@IMordatch
(:
@Massastrello
@MichaelPoli6
Many patterns in nature exhibit self-similarity, meaning they can be efficiently described via self-referential transformations. We propose to represent data as the parameters of a contractive, iterative map defined on partitions of the data space. This is a Collage Operator. 3/
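The key property of a contractive map is that iterating it from any starting point converges to the same fixed point, so the map's parameters alone encode the data. A minimal 1-D illustration of that fixed-point principle (not the paper's Collage Operator, and the map parameters are arbitrary):

```python
import numpy as np

def contractive_map(x, a=0.5, b=np.array([1.0, -2.0, 3.0])):
    # Affine map x -> a*x + b with |a| < 1 is a contraction;
    # its unique fixed point is b / (1 - a).
    return a * x + b

def decode(iters=50, x0=None):
    # "Decoding" = iterating the map until it converges.
    x = np.zeros(3) if x0 is None else x0
    for _ in range(iters):
        x = contractive_map(x)
    return x

# Two very different initializations converge to the same point,
# so only the map's parameters (a, b) need to be stored.
x_from_zeros = decode()
x_from_noise = decode(x0=np.array([100.0, -50.0, 7.0]))
assert np.allclose(x_from_zeros, x_from_noise)
print(x_from_zeros)  # approaches b / (1 - a) = [2, -4, 6]
```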
@ContextualAI
It is becoming increasingly valuable in the enterprise setting to have high quality models that can leverage specialized data. With methods like KTO, base models can be easily fine-tuned and aligned without the need for human-annotated preference datasets. 3/
"You have to love <dancing> to stick to it. It gives you nothing back, no manuscripts to store away, no paintings to show on walls and maybe hang in museums, no poems to be printed and sold, nothing but that single fleeting moment when you feel alive." - Merce Cunningham
And... the first-ever ICLR test-of-time award goes to "Auto-Encoding Variational Bayes" by Kingma & Welling.
Runner-up: "Intriguing properties of neural networks", Szegedy et al.
The awards will be presented at 2:15pm tomorrow; there will be retrospective talks. Please attend!
📢The problem in model alignment no one talks about — the need for preference data, which costs $$$ and time!
Enter Kahneman-Tversky Optimization (KTO), which matches or exceeds DPO without paired preferences.
And with it, the largest-ever suite of feedback-aligned LLMs. 🧵
Join us for the second edition of the
#NeurIPS2022
workshop "The Symbiosis of Deep Learning and Differential Equations"🌀
We're looking for your AI <> DE ideas: neural diff. eqs., neural operators, diffusion models and novel applications!
website:
So it looks like we can stick with the basic MLE/ELBO objectives (=compression!), as long as we combine it with the right kind of data augmentation.
Also, diffusion models have an interpretation as VAEs, so we can now again claim that VAEs are SOTA image generation models... 😅✌️
@Massastrello
@MichaelPoli6
In generative applications, a nice property of representing data implicitly as collage parameters is that they are resolution agnostic. This means that each unique, corresponding data sample can be decoded at any desired resolution (resources allowing) w/o the need to re-train 5/
@ContextualAI
I’m excited about a few more potential ways to incorporate intermediate and continuous signals into our alignment objectives, moving even further past binary preferences. Look out for more cool stuff coming soon! 😉 5/
@ContextualAI
The alignment protocol is KEY to achieving good results. We can see that tuning smaller base LLMs can be made more effective with richer signals such as rewards, curricula, or rankings. Methods like KTO allow us to incrementally improve with more nuanced data and more rounds. 4/
Introducing Imagen, a new text-to-image synthesis model that can generate high-fidelity, photorealistic images from a deep level of language understanding. Learn more and check out some examples of
#imagen
at
Why? KTO only needs to know if an output is desirable or undesirable — not which of two outputs is better.
If you have a customer interaction that turned into a sale, that can turn into a datapoint for KTO. This is not the case for traditional preference optimization methods. 3/
Now we can do LLM alignment without all that hassle, all because of a new method for aligning LLMs called KTO. It’s as good as state-of-the-art methods like DPO while not needing pairs of preference data.
This makes it much much easier to use in the real world. 2/
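A rough sketch of why unpaired data suffices: each example carries only a policy-vs-reference log-probability ratio plus a binary desirable/undesirable label. The loss shape below follows the prospect-theoretic value function described in the KTO paper, but the reference-point term is simplified to a constant here, so treat it as illustrative rather than the paper's exact objective.

```python
import math

def kto_style_loss(logp_policy, logp_ref, desirable, beta=0.1, z_ref=0.0):
    """Per-example loss that needs only a binary label, not a pair.
    r = beta * (log pi(y|x) - log pi_ref(y|x)) is the implied reward;
    desirable outputs are pushed above the reference point z_ref,
    undesirable ones below it. (z_ref is simplified to a constant.)"""
    sigmoid = lambda t: 1.0 / (1.0 + math.exp(-t))
    r = beta * (logp_policy - logp_ref)
    value = sigmoid(r - z_ref) if desirable else sigmoid(z_ref - r)
    return 1.0 - value  # minimized as value -> 1

# A desirable output the policy already prefers over the reference
# gets a low loss; the same margin on an undesirable output is penalized.
good = kto_style_loss(-1.0, -3.0, desirable=True)   # r = 0.2 > z_ref
bad = kto_style_loss(-1.0, -3.0, desirable=False)
assert good < bad
print(round(good, 3), round(bad, 3))
```

So a single logged outcome (sale / no sale, thumbs up / thumbs down) maps directly to one training example, with no paired comparison needed.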
⚡️ Accelerating TensorFlow 2 performance on Mac
@Apple
’s new Mac-optimized TensorFlow 2.4 fork lets you speed up training on Macs, resulting in up to 7x faster performance on platforms with the new M1 chip!
Learn how ↓
@Teknium1
@ContextualAI
yea basically 😅 all just dataloader changes and then alignment with KTO. i’m working on another version that leverages richer signals next, let’s see
@chhaviyadav_
@NeurIPSConf
In order to not reschedule my exams last year, I had brought my suitcase to the exam center and booked it to the airport right after!
So the real point of contention: which MOOC was better - Andrew Ng or Geoffrey Hinton?
Had a blast speaking on the panel today, thank you AI Squared Forum,
@UTMIST1
, and
@UofT
!
When generalized to higher-level value contexts and just country names, performance plummeted to 60% and below for the best models. Larger models improved with different training paradigms, but exhibited greater skew towards English/Western cultures. [6/n]
Elucidating the Design Space of Diffusion-Based Generative Models
abs:
improve efficiency and quality obtainable with pre-trained score networks from previous work, including improving the FID of an existing ImageNet-64 model from 2.07 to near-SOTA 1.55
Check out our results comparing supervised DT against online/offline RL algorithms. I'm certainly looking forward to exploring more ways to combine this paradigm of supervised learning and sequence modeling to tackle more problems in control / unstructured envs
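The supervised DT setup conditions action prediction on returns-to-go. Computing those conditioning targets from a reward trajectory is the simple part, sketched here; the transformer itself is omitted.

```python
def returns_to_go(rewards):
    # rtg[t] = sum of rewards from step t to the end of the episode;
    # the DT input interleaves (rtg, state, action) tokens, and at
    # inference time you condition on a high target return.
    rtg = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running += rewards[t]
        rtg[t] = running
    return rtg

print(returns_to_go([1.0, 0.0, 2.0]))  # -> [3.0, 2.0, 2.0]
```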
@lovetheusers
@dmdohan
@jekbradbury
@ylecun
you can compose complex probabilistic programs (over string random variables in our case) and then apply various existing inference algorithms by sampling via the cascade ppl
@huggingface
So if you’ve spent all your money on compute resources and don’t have any left over for annotating preferences, don’t fret. You can remove the critical P in DPO and still effectively custom align models 😈 5/
Comparing length normalized performance is a good idea, and no evaluation is perfect. Looking forward to exploring ways of getting our LLMs to generate better shorter responses :)
That said, notice the difference in length between the different models - this leaderboard (like any other) does not tell the full story. We need better evaluations!
Amazing job by
@winniethexu
!
@Teknium1
@huggingface
Enjoyed seeing all the sweeps! I've found similar things with increasing beta to get better perf on MT-Bench. If comparing between methods, I'm not sure you can compare via beta, since for some methods there's likely a corresponding beta that reaches the same loss
Delving into Orca-Math's training strategy: It hinges on a strategic iterative process!🧠🔄 We kicked off with supervised finetuning (SFT), followed by two rounds of KTO. This SFT-KTO-KTO sequence proved more effective than a straight SFT series and surpassed DPO too.
@ericssunLeon
@gblazex
@ContextualAI
The new length adjustment method uses a simple regression model to estimate the counterfactual "what would the win rate be if the length was X" where X=len(baseline). Still not a perfect metric: it makes very simplified assumptions about certain variables + a fixed relationship
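That counterfactual adjustment can be sketched as regressing win outcome on response length, then evaluating the fit at the baseline's length. This is a deliberately simplified linear version on made-up numbers, not AlpacaEval's actual estimator.

```python
import numpy as np

def length_adjusted_win_rate(lengths, wins, baseline_length):
    # Fit win ~ a * length + b by least squares, then predict the
    # win rate the model *would* have at the baseline's length.
    A = np.stack([np.asarray(lengths, float),
                  np.ones(len(lengths))], axis=1)
    coef, *_ = np.linalg.lstsq(A, np.asarray(wins, float), rcond=None)
    return float(coef[0] * baseline_length + coef[1])

# Made-up data where longer responses win more often; adjusting to
# the (shorter) baseline length lowers the raw win rate.
lengths = [100, 200, 300, 400]
wins = [0.0, 0.0, 1.0, 1.0]
raw = float(np.mean(wins))  # 0.5 unadjusted
adjusted = length_adjusted_win_rate(lengths, wins, baseline_length=150)
assert adjusted < raw
print(round(adjusted, 3))
```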
Happy to support our frens Abraham AI for those of you who were asking about inpainting and
#StableDiffusion
.
There is a *lot* to come.
Very excited - as we release this open source, it will unleash a wave of innovation and creativity globally, beyond what we seed.
Want to understand and/or play with variational diffusion models?
- See for a simple stand-alone implementation and explanation. (Thanks
@alemi
and
@poolio
for making this)!
- See for an even more basic implementation on 2D data.