Nice new read on tokenization!
You've heard about the SolidGoldMagikarp token, which breaks GPT-2 because it was present in the tokenizer's training data but not in the data the LLM was later trained on.
This paper digs in with a lot more depth and detail, across a lot more models, discovering a less…
Super excited to announce that I'm now
@CohereAI
! 🥳 I'm convinced that LLMs will create tremendous value in the next few years, and Cohere is a fantastic place to be to help contribute (and the people are awesome)! 😄
Excited to start as a research scientist intern
@DeepMind
today. Looking forward to working with
@huangposen
&
@Johannes_Welbl
! I'm fortunate enough to be there in person so ping me if you're around and want to chat or grab a coffee 🚀
Data collection is slow and expensive, so we give annotators a little help 🤝. Introducing Generative Annotation Assistants (GAAs) to make data collection more efficient and effective 🚀. Work will be presented at
#NAACL2022
in Seattle!
Paper:
[1/n]
This Valentine's day, to celebrate our love for dynamic adversarial data, the DADC (
@NAACLmeeting
'22) workshop is announcing our first call for papers. We would love for you to join us: ❤️
Incredibly proud of our world-class team driving continuous improvements
@cohere
and delivering a best-in-class RAG-powered, long-context-capable, multilingual LLM available to the research and dev communities. Try it out at & 🚀
⌘-R
Introducing Command-R, a model focused on scalability, RAG, and Tool Use. We've also released the weights for research use; we hope they're useful to the community!
Beat the AI 🤔🆚🤖 Investigating Adversarial Human Annotation for Reading Comprehension (in TACL, ) w/
@ARoberts9
@Johannes_Welbl
@riedelcastro
& Pontus Stenetorp will be presented at
#emnlp2020
. Data & leaderboard also available: 1/N
Just 3.5wks after launching Command R, we are excited to release Command R+. It is bigger, better, bolder and goes where no Command model has gone before. It is the result of months of hard work by the incredible team
@Cohere
, and we're releasing model weights for you to use at…
⌘R+
Welcoming Command R+, our latest model focused on scalability, RAG, and Tool Use. Like last time, we're releasing the weights for research use; we hope they're useful to everyone!
Building best-in-class LLMs requires a rare combination of intuition, resources and an insanely talented team. Intuition can guide you, but you never really know where you're going to end up. So stoked to see how well our models have been received. Thank you for your support! 🙏
Announcing what we hope will be one of the best AI/NLP competitions in 2022: the DADC Shared Task ()
@NAACLmeeting
in Seattle 🏆! We have 3 awesome tracks for you to participate in (two data-centric & one model-centric). Details below 👇
[1/n]
📢 Glitch Tokens Detected 📢
Tokens are the building blocks of LLMs -- but there's a problem! Tokenizers and LLMs aren't trained on perfectly identical or static corpora, meaning that tokenizers and models are often out of sync, leading to unseen 'glitch tokens' that can make…
Our paper about reliably finding under-trained or 'glitch' tokens is out!
We find up to thousands of these tokens in some
#LLMs
, and give examples for most popular models.
More in 🧵
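(For a feel of how under-trained tokens can be surfaced, here's a minimal sketch, not the paper's exact method, that ranks tokens by input-embedding norm with Hugging Face transformers; the model choice and the low-norm heuristic are illustrative assumptions.)

```python
# Sketch: flag candidate "glitch" tokens by unusually small input-embedding norms.
# Assumption: low-norm embeddings often correspond to under-trained tokens;
# this is a heuristic for illustration, not the paper's exact detection method.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # illustrative choice; any causal LM works similarly
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

with torch.no_grad():
    embeddings = model.get_input_embeddings().weight  # (vocab_size, hidden_dim)
    norms = embeddings.norm(dim=-1)

# Tokens with the smallest norms are candidates for being under-trained.
candidate_ids = torch.argsort(norms)[:20]
for token_id in candidate_ids.tolist():
    print(token_id, repr(tokenizer.decode([token_id])), float(norms[token_id]))
```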
@douwekiela
from
@huggingface
will be with us in-person speaking about "Improving Multimodal Evaluation and Exploring Foundational Language and Vision Alignment" for the
@ucl_nlp
meetup at
@ai_ucl
on Wed, Jun 29th at 18:30. Join us: 🚀
Super excited to announce that our proposal has been accepted and The First Workshop on Dynamic Adversarial Data Collection (DADC) will take place at
#NAACL2022
@aclmeeting
in Seattle 🇺🇸!
#NLProc
Stay tuned, this is going to be fun! 🚀
Really excited to hear about the interest in our work on Improving QA Model Robustness with Synthetic Adversarial Data Generation () both from the research community and industry. To facilitate this, we're sharing our synthetic data & question generators!🥳
📢 Come delve into the details of human feedback with
@tomhosking
and me at
#ICLR2024
poster session
#130
at 10:45am. Your (hopefully human) feedback will be greatly appreciated!
🌐
Extremely insightful work by
@tomhosking
digging into what human feedback actually measures, accepted at
@iclr_conf
'24. 🚀
DM me if you're interested in learning more or if you're excited about exploring the limits of what we know about LLMs with a research internship
@cohere
!
"Human Feedback is not Gold Standard" was accepted at ICLR 2024 🥳
I'd love to chat about the limits of human feedback wrt LLM alignment (and about
@cohere
) if you're going to be at the conference! 🇦🇹
Thanks again to
@max_nlp
for making it an awesome internship experience ❤️
I'm sorry, but someone needs to say this. DROP was one of the most thoughtfully created, insightful and novel datasets of its time. Anybody using a dataset for eval, particularly with a new family of models, is responsible for basic postprocessing. Don't blame the dataset.
⚠️ We are removing DROP from the Open LLM Leaderboard!
With leaderboard evaluation data openly shared on 2000+ models, we did a deep dive with our friends
@AiEleuther
and
@try_zeno
, & found out that its original implementation is unfair to many models 😱
Human preference is complex, multi-dimensional and personal. This work is a treasure-trove of information. An absolute must read (at least twice) for anyone working with LLMs or generative AI systems that rely on human feedback 🤖🙋
Today we're launching PRISM, a new resource to diversify the voices contributing to alignment. We asked 1500 people around the world for their stated preferences over LLM behaviours, then we observed their contextual preferences in 8000 convos with 21 LLMs
Today, we launch Aya 23, a state-of-the-art multilingual 8B and 35B open weights release.
Aya 23 pairs a highly performant pre-trained model with the recent Aya dataset, making multilingual generative AI breakthroughs accessible to the research community. 🌍
It's been 1yr
@Cohere
already! Among the many exciting things we're working on, one I'm particularly excited about is using models in the loop for better data. If you're not using models in the loop to make your data collection efforts more effective, you're missing out!
Proud to announce that
@BloomsburyAI
's large-scale question answering system is available
#opensource
at . Looking forward to seeing what interesting projects it will be used for!
Made it to Seattle 🇺🇸 for
#NAACL2022
! Looking forward to my first in-person conference since 2019 🇮🇹. Ping me if you want to chat research or anything else! 🚀
Just noticed that
#AdversarialQA
is the 4th most downloaded QA dataset at 🤗 with nearly 25k downloads. Super excited to see what everyone's been working on! 🥳
The Aya Dataset paper coming soon to an ACL near you! 🥳 Massive congrats to all collaborators and the fantastic community who contributed to this open resource powering SOTA multilingual capabilities 🔥
I discovered at ICLR 2024 that a lot of what I take for granted about LLM evaluation is actually not that widely known...
So I made a blog!
- how do we currently do LLM evaluation? ⚖️
- most importantly, what is it actually useful for? 🤔
Call for papers for our
#NeurIPS2020
workshop HAMLETS: Human And Model in the Loop Evaluation & Training Strategies is now live! Speakers include
@jennwvaughan
,
@ajratner
,
@EmmaBrunskill
, Sanjoy Dasgupta, Dan Weld, Kristen Grauman & Finale Doshi-Velez. See
Introducing Rerank 3: our newest foundation model purpose built to enhance enterprise search and Retrieval Augmented Generation (RAG) systems, enabling accurate retrieval of multi-aspect and semi-structured data in 100+ languages.
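(For context, a reranker takes a query plus candidate documents and re-orders them by relevance. A minimal sketch with the Cohere Python SDK could look like the following; the model identifier and response fields are assumptions, so check the current API reference.)

```python
# Sketch: rerank candidate passages for a query with a hosted reranking model.
# Assumptions: the "rerank-english-v3.0" model name and the response structure
# may differ; consult the current Cohere API docs before relying on this.
import cohere

co = cohere.Client("YOUR_API_KEY")  # placeholder key

documents = [
    "Cohere builds large language models for enterprise use.",
    "The capital of France is Paris.",
    "RAG combines retrieval with generation to ground model outputs.",
]

results = co.rerank(
    model="rerank-english-v3.0",  # assumed model identifier
    query="How does retrieval augmented generation work?",
    documents=documents,
    top_n=2,
)

for hit in results.results:
    print(hit.index, hit.relevance_score)
```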
Check out our recent work on undersensitivity in Reading Comprehension and investigation of generalisable adversarially-robust training, achieving new SOTA on AddSent and AddOneSent w/
@Johannes_Welbl
@PMinervini
@riedelcastro
Q: What adversarial failure mode do reading comprehension chopsticks suffer from? A: Undersensitivity! (confidence: 99.7%) 🙂🥢 -- more in our "Undersensitivity in Neural Reading Comprehension" (), by
@Johannes_Welbl
@max_nlp
Pontus and
@riedelcastro
Fun story from our internal testing on Claude 3 Opus. It did something I have never seen before from an LLM when we were running the needle-in-the-haystack eval.
For background, this tests a model’s recall ability by inserting a target sentence (the "needle") into a corpus of…
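(For anyone new to this eval, here's a rough sketch of the setup; the commented-out `generate` call is a hypothetical stand-in for whichever model API you use.)

```python
# Sketch of a needle-in-a-haystack recall check: bury a target sentence in filler
# text and see whether the model can retrieve it when asked.

def needle_in_haystack_prompt(filler_sentences, needle, insert_position):
    """Insert the 'needle' into the filler corpus and ask for it back."""
    haystack = filler_sentences[:insert_position] + [needle] + filler_sentences[insert_position:]
    context = " ".join(haystack)
    question = "What is the most fun thing to do in San Francisco?"  # query tied to the needle
    return f"{context}\n\nAnswer based only on the text above: {question}"

filler = ["Filler sentence number {}.".format(i) for i in range(2000)]
needle = "The most fun thing to do in San Francisco is to eat a sandwich in Dolores Park."

prompt = needle_in_haystack_prompt(filler, needle, insert_position=1000)
# answer = generate(prompt)  # hypothetical model call; check whether it recovers the needle
```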
We are excited to announce that our Chat API with Retrieval-Augmented Generation (RAG) is now available in a public beta.
The API is powered by Command, Cohere’s flagship generative LLM.
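(As a rough idea of what grounded generation looks like, here's a minimal sketch with the Cohere Python SDK; parameter and response field names are assumptions based on the public beta and may have changed.)

```python
# Sketch: retrieval-augmented chat, grounding the model on caller-supplied documents.
# Assumption: parameter/field names follow the public-beta Chat API and may have changed.
import cohere

co = cohere.Client("YOUR_API_KEY")  # placeholder key

response = co.chat(
    message="What does Command do?",
    documents=[
        {"title": "Command overview", "snippet": "Command is Cohere's flagship generative LLM."},
        {"title": "RAG", "snippet": "The Chat API can ground answers in supplied documents."},
    ],
)

print(response.text)       # grounded answer
print(response.citations)  # spans linking the answer back to the documents (if returned)
```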
Grateful that
@sleepinyourhat
could visit us
@ucl_nlp
yesterday to share insights on navigating hype in
#NLProc
. I was curious whether the Table 1 () trends held for our best
@DynabenchAI
QA models, so I ran some experiments. Turns out that...
[1/n]
You'll sometimes see the meme that NLP is solved. That's hype, and it's doing harm in the real world. But it's worth thinking about what it'd look like to actually achieve what we're aiming for. (📄 paper, thread 🧵)
Azure will be the first cloud to offer Cohere's latest LLM, as we build on our commitment to offer customers the broadest selection of state of the art and open source models.
Performance also != Chatbot Arena Elo. But a massive improvement over previous plots! 🔥 The main takeaway for me here is that we need significantly better evals that reflect the real-world value created by LLMs
#EMNLP2021
Want to know more about our work on Improving QA Model Robustness with Synthetic Adversarial Data Generation
@ucl_nlp
&
@facebookai
? Come say hi on !
If you're doing any synthetic data gen, I'd encourage you to explore synthetic adversarial data generation (e.g. ). Synth data can help fill gaps around existing data (~fancy paraphrasing) but it can also help elevate model capabilities!
Honoured that our work investigating the sensitivity of Large Language Models to prompt sample ordering has been selected as an
#ACL2022
outstanding paper! We also manage to find better orderings automatically without relying on held-out examples.➡️ 🚀
Excited to receive an ACL outstanding paper award, with
@max_nlp
@latticecut
@riedelcastro
@ucl_nlp
! TL;DR If prompting is not working, change the order, the performance may jump from random-guess to SOTA. How to find fantastically ordered prompts? Here➡️
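(To make the ordering effect concrete, here's a toy sketch that builds a prompt for every permutation of a few demonstrations; `score_prompt` is a hypothetical stand-in for running the prompt through a model, and the paper's actual method finds good orderings without held-out labels.)

```python
# Sketch: how much few-shot performance can swing with example order alone.
# `score_prompt` is a hypothetical stand-in for a model call plus accuracy computation.
from itertools import permutations

few_shot_examples = [
    ("The movie was wonderful.", "positive"),
    ("I hated every minute.", "negative"),
    ("An instant classic.", "positive"),
    ("A complete waste of time.", "negative"),
]

def build_prompt(examples, query):
    demos = "\n".join(f"Review: {text}\nSentiment: {label}" for text, label in examples)
    return f"{demos}\nReview: {query}\nSentiment:"

def score_prompt(prompt):
    raise NotImplementedError("replace with a call to your model + accuracy computation")

query = "Surprisingly good, I would watch it again."
for order in permutations(few_shot_examples):
    prompt = build_prompt(order, query)
    # score_prompt(prompt)  # accuracy can vary wildly across orderings
```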
Exciting work further confirming that adversarial data collection leads to harder examples, and highlighting important points on annotation quality and the potential benefits of using models in the dataset creation loop 🤔🤖
NLP benchmarks are increasingly saturated, making it difficult to measure further improvements in models.
What if we used adversarial filtering to identify the most challenging *evaluation* examples, and build benchmarks based on them?
🧵1/x
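(As a rough illustration of the idea, here's a sketch that keeps only the examples a pool of reference models gets wrong; `predict` is a hypothetical stand-in, and real benchmark construction is more involved.)

```python
# Sketch: adversarial filtering, keeping only examples that fool a pool of reference models.
# `predict(model, example)` is a hypothetical stand-in; real pipelines also control for
# annotation quality and distributional effects.

def adversarially_filter(examples, models, predict, max_solvers=0):
    """Keep examples that at most `max_solvers` reference models answer correctly."""
    hard_examples = []
    for example in examples:
        n_correct = sum(predict(model, example) == example["answer"] for model in models)
        if n_correct <= max_solvers:
            hard_examples.append(example)
    return hard_examples
```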
Less than 24 hours after release, C4AI Command-R claims the
#1
spot on the Hugging Face leaderboard!
We launched with the goal of making generative AI breakthroughs accessible to the research community - so exciting to see such a positive response. 🔥
So the people have lost interest in asking "draw an ASCII unicorn"-style questions? Would love to see a much deeper and more granular analysis of Chatbot Arena prompts and what real-world utility these correlate with
It's interesting that GPT-4o's Elo is lower, at 1287, than its initial score of 1310.
On coding, it regressed by even more absolute points, from 1369 to 1307.
Command R topping the Open Arabic LLM Leaderboard. Maltese is heavily influenced by Arabic, so I'm particularly excited to see progress towards models that will eventually speak my language! 🤩🇲🇹
New on the hub: Arabic LLM Leaderboard!
Arabic has at least 380M speakers & is one of the most spoken languages... but how good are LLMs at it?
@alielfilali01
contacted
@TIIuae
and
@huggingface
to find out, and collaborated on a new leaderboard!
All the SOTA multimodal models we tested perform poorly on the specifically-constructed Winoground evaluation set for visio-linguistic compositional reasoning. Check it out! 👇
Happy to announce our new CVPR paper - Winoground: Probing Vision and Language Models for Visio-Linguistic Compositionality.
All tested SOTA multimodal models perform very poorly on our new vision-language eval dataset.
Paper:
#CVPR2022
,
#NLProc
1/5
QA in
#NLProc
is not yet solved! Can you come up with questions that an AI finds challenging while contributing to what may become the go-to QA benchmark eval set? All for just 100 questions (~1.5hrs)! Sign up at
P.S. Prizes available! 🏆 Please share! 🚀
Track 1 of our Shared Task kicks off on Mon 2nd May. Do you have a knack for coming up with challenging QA examples? Prove it while competing against the AI and your peers! We will also have prizes thanks to
@commons_ml
. Sign up at 🚀
#NLProc
#NAACL2022
What. An. Event! 🎇 Massive thanks to our speakers, panelists, authors, shared task participants and a huge heartfelt thanks to our fantastic organising team and sponsors for making it all possible! ❤️
We had an absolute blast at our social, big up to
@RaphiRaph_
for the venue....truly inspirational 😍😍 But our
#NAACL
22 journey has come to an end so we'll be signing out until next year 🥹🥹 To all the
#DADC
fans, you've been dynamic, adversarial and awesome xox♥️
Today, we’re unlocking
@DynabenchAI
, a first-of-its-kind platform for dynamic AI benchmarking.
AI researchers can now create their own custom tasks to better evaluate the performance of
#NLP
models in more dynamic & realistic settings, for free.
Everyone is adding models to the MMLU vs activated params plot, so here is a super quick one with more models.
Everyone seems to forget about those not trained in the US/Europe: 01-ai Yi, InternLM, Qwen, and DeepSeek.
(btw just use to compare MMLU)
Cohere is making it dramatically easier to build applications using RAG. We've released code that makes connecting LLMs to your private sources of knowledge seamless.
Here's how to give your model access to 100 sources like Google Drive, Slack, GitHub, Pinecone, and more. 🧵
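(As a sketch of what this looks like in code: the Chat API accepts a list of connectors that fetch grounding documents at query time. The connector IDs below are placeholders and the interface may have changed.)

```python
# Sketch: letting the model pull grounding documents from registered connectors at query time.
# Assumptions: "web-search" is a built-in connector; the custom connector ID is a placeholder
# for one registered with your own account.
import cohere

co = cohere.Client("YOUR_API_KEY")  # placeholder key

response = co.chat(
    message="Summarise the latest design discussion from our team.",
    connectors=[
        {"id": "web-search"},             # built-in connector (assumed)
        {"id": "your-gdrive-connector"},  # hypothetical custom connector
    ],
)

print(response.text)
print(response.documents)  # documents retrieved by the connectors (if returned)
```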
Interested in doing a PhD at
@UCL_DARK
?
@egrefen
and I are looking for strong & diverse applicants for UCL scholarships. Please email CV, personal statement, and research proposal to ucl-dark-phd-2020@googlegroups.com by Dec 1. Interviews Dec 7-11.
Lab site:
Really proud to be able to contribute to this exciting project changing the way we approach benchmarking for
#NLP
#NLProc
. Feel free to reach out if you'd like to learn more!
The Dynabench paper, accepted at
#NAACL2021
, is out! The paper introduces our unified research platform for dynamic benchmarking on (so far) four initial NLU tasks.
We also address some potential concerns and talk about future plans.
(1/4)
Do LLMs learn impossible languages (that humans wouldn’t be able to acquire) just as well as they learn possible human languages?
We find evidence that they don’t! Check out our new paper…
💥 Mission: Impossible Language Models 💥
ArXiv:
🧵
🚨 New paper 🚨
I’m excited to share the findings from my internship at
@cohere
with
@max_nlp
tl;dr Human feedback under-represents the factuality of LLM output, and annotators are less likely to spot factual errors in more assertive outputs!
Trying out a few examples for the
@DADCworkshop
shared task () and I'm blown away. The AI should not be THIS good! Think you can do better? Try it out at
The
#ShARC
🦈 dataset from our
#emnlp18
paper () and CodaLab challenge are now LIVE at ! How will your models perform on this task involving rule interpretation, reasoning, question generation and
#qa
?
#NLP
#NLProc
Misinformation is a global problem. Fascinating to hear what's being done about it around the world from such a diverse panel of global experts ranging from Argentina to Africa to India and beyond. Thanks
@TTOConference
#TTOCon
Also a massive thanks to Dirk Groeneveld,
@soldni
and
@natolambert
from
@allen_ai
, and
@BlancheMinerva
from
@AiEleuther
for the extremely valuable discussions and feedback (and for their commitment to developing open models that make such investigations possible)!
We will be presenting a few papers at
@emnlp2020
in November (7 in the main conference, 2 in Findings), together with some amazing collaborators! 🤖 We are looking forward to discussing our research with you 🙂
#EMNLP2020
1/N
Here are the most popular LLM providers used with Kong's AI Gateway since its release last month,
@OpenAI
taking half the cake :) We are about to ship exciting new AI infrastructure capabilities to simplify building AI applications and managing them at scale.
You can…
Alright, strap in. Support for Command-R+ was merged into llama.cpp exactly 4 hours ago. We're going to start talking to a GPT-4 level model on local hardware without a GPU. If you have 64GB of RAM, feel free to follow along 🧵
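(If you'd rather follow along in Python than with the llama.cpp CLI, a minimal sketch with the llama-cpp-python bindings might look like this; the GGUF filename and settings are placeholders, and your build needs the freshly merged Command-R+ support.)

```python
# Sketch: running a quantised Command-R+ GGUF locally via the llama-cpp-python bindings.
# Assumptions: the GGUF path/quantisation are placeholders, and your llama.cpp build
# must already include the Command-R+ support merged upstream.
from llama_cpp import Llama

llm = Llama(
    model_path="command-r-plus-Q4_K_M.gguf",  # placeholder filename for a quantised checkpoint
    n_ctx=4096,    # context window; raise if you have the RAM for it
    n_threads=8,   # CPU threads, tune for your machine
)

output = llm(
    "Explain retrieval-augmented generation in two sentences.",
    max_tokens=128,
)
print(output["choices"][0]["text"])
```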
🚨Life update🚨 After 4 wonderful years, I’ve decided it’s time for me to move on from FAIR, and today is my first day at
@cohereAI
! Super excited for the next chapter and to work with Cohere's world-class team, working out of the beautiful London office in Soho! (1/3)
RAG in 4 minutes on an iPad with Command R+ 🤩 During a Q&A session yesterday I was asked why we chose to make model weights available. This is why. Let the builders build! 🚀
Ever wanted to learn how to set up RAG with an LLM?
Sounds intimidating so you’ve avoided it?
It’s SO easy now because of amazing dev tooling.
Here’s how I did it in 4 minutes, on my iPad, while I was waiting for my train from DC back to Connecticut:
1. Signed up for a free…
We know that large language models are very sensitive to prompts in few-shot learning, but
@maxbartolo
pointed out to me that the ground truth labels don’t actually matter all that much! Check out this example for BLOOM, with opposing labels--is this something that’s well-known?
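(To see what "opposing labels" means here, a toy sketch: build the same few-shot prompt with gold labels and with flipped labels, then compare predictions; `generate` is a hypothetical stand-in for the model call.)

```python
# Sketch: few-shot prompting with gold vs. deliberately flipped demonstration labels.
# `generate(prompt)` is a hypothetical stand-in for the model call (e.g. BLOOM via an API).

demos = [
    ("The food was delicious.", "positive"),
    ("Terrible service, never again.", "negative"),
]

def build_prompt(examples, query):
    lines = [f"Review: {t}\nSentiment: {l}" for t, l in examples]
    lines.append(f"Review: {query}\nSentiment:")
    return "\n".join(lines)

def flip(label):
    return "negative" if label == "positive" else "positive"

query = "Absolutely loved it."
gold_prompt = build_prompt(demos, query)
flipped_prompt = build_prompt([(t, flip(l)) for t, l in demos], query)

# predictions = generate(gold_prompt), generate(flipped_prompt)
# Surprisingly often, both prompts yield similar predictions for the query.
```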
Following extensions by other NAACL workshops, we have also decided to extend the submission deadline for papers from April 8 to April 15 (AoE). We look forward to your submissions!
Details:
Paper Submission:
#NAACL2022
#NLProc
I'll be discussing recent work on adversarial human annotation in collaboration with industry during the
@AI_UCL
session at
#TheAlgo2020
conference (Nov. 12th
@UCL
Online). Register here:
1 [Agenda thread]/
#Algo2020
conference!
12 Nov.
@UCL
Online -Free & Open to all
#AI
enthusiasts!
Register:
One-day, multi-stakeholder conference on AI & other Disruptive Tech.
Check out the agenda in this thread & here:
I'm looking forward to presenting this at
@iclr_conf
in Vienna next week! 🇦🇹
If you'd like to discuss the paper, human feedback, discrete representations for NLP or
@cohere
come and find me and
@max_nlp
in poster session
#3
on Wednesday @ 0945h local time! 🎉
@BlackHC
@huggingface
@CohereForAI
@cohere
Not sure if this affects anything, but are these the results for CohereForAI/c4ai-command-r-plus or CohereForAI/c4ai-command-r-plus-4bit? Also, out of curiosity, are the numbers reported the strict-match or flexible-extract results from the Eleuther LM Eval Harness?
This is a fantastic opportunity to participate in a shared task and compete with researchers from around the world. Can YOU beat the AI?? 🤔🆚🤖 All it takes is 100 examples (track 1)! Sign up today:
You now have until May 15th (the end of the Track 1 example creation window) to register your team for the DADC shared task (). Join the 10 awesome teams from around the world who have already signed up! We'll also have prizes thanks to
@commons_ml
!! 🏆🤩
Exciting new work on using data generation to mitigate spurious correlations, along with de-biased versions of popular NLI datasets being made available!
🚨New ACL2022 paper!🚨“Generating Data to Mitigate Spurious Correlations in Natural Language Inference Datasets”. Read the paper here: , and check out the thread below
w/
@nlpmattg
, Pontus Stenetorp,
@pdasigi
@ai2_allennlp
🧵1/N
Check out this blog post for an intro to dynamic adversarial data collection. And don't forget to join us at
#NAACL2022
for the
@DADCworkshop
on July 14th!
DADC improves AI accuracy through higher quality and more diverse data collected with human and AI collaboration. We believe this will help the community build robust ML, which is why we are sponsoring
@DADCWorkshop
@DynabenchAI
at
#NAACL2022
"
@DynabenchAI
relies on crowdworkers"...
But it doesn't have to! The
@DADCworkshop
shared task is a fantastic opportunity for the wider
#NLProc
community to contribute -- and it's currently underway. Can YOU beat the AI?: 🚀