is now a paper!
We reviewed 102 open datasets for evaluating and improving LLM safety. We also reviewed how these datasets are used in major model release publications and in popular benchmarks.
Headline results and arXiv link below 👇
If you’re working on LLM safety, check out !
is a catalogue of open datasets for evaluating and improving LLM safety. I started building this over the holidays, and I know there are still datasets missing, so I need your help 🧵
I was part of OpenAI’s red team for GPT-4, testing its ability to generate harmful content.
Working with the model in various iterations over the course of six months convinced me that model safety is the most difficult, and most exciting challenge in NLP right now.
🧵
We’re releasing GPT-4 — a large multimodal model (image & text in, text out) which is a significant advance in both capability and alignment.
Still limited in many ways, but passes many qualification benchmarks like the bar exam & AP Calculus:
After spending just 20 minutes with the
@MistralAI
model, I am shocked by how unsafe it is. It is very rare these days to see a new model so readily reply to even the most malicious instructions. I am super excited about open-source LLMs, but this can't be it!
Examples below 🧵
Mistral 7B is out. It outperforms Llama 2 13B on every benchmark we tried. It is also superior to LLaMA 1 34B in code, math, and reasoning, and is released under the Apache 2.0 licence.
NEW PREPRINT!
LLMs should be helpful AND harmless. This is a difficult balance to get right...
Some models refuse even safe requests if they superficially resemble unsafe ones. We built XSTest to systematically test for such "exaggerated safety".
🧵
🥳 New paper accepted at
#NAACL2022
(Main) 🥳
NLP tasks like hate speech detection are subjective: annotators disagree about what the correct data labels are. We propose two contrasting paradigms to enable better data annotation.
⬇️ Highlights below ⬇️
Excited to share that I successfully defended my PhD thesis
@oiioxford
last month 🥳
Huge thanks to my assessors
@computermacgyve
and
@MaartenSap
, and also my supervisors, collaborators and friends. I'll post my full acknowledgments here 👇 and more on what's next, next week 🤗
Multilingual HateCheck is now on
@huggingface
🤗
New tests for hate speech detection models in 10 languages with just 2 lines of code! See our
#NAACL2022
WOAH paper for details (), and get testing with the link below 👇
🥳 New paper at
#EMNLP2022
(Main) 🥳
Too much hate speech research focuses just on English content! To help fix this, we tried to expand hate detection models into under-resourced languages 🌍 without needing lots of new data 💸
⬇️ Highlights below ⬇️
CONTENT WARNING
You will have no trouble getting the model to give you advice on how to commit suicide, assault someone, or eradicate minorities. I will not post examples verbatim, but I can DM them, or you can try for yourself here:
🥳 New publication accepted at
#ACL2021NLP
! 🥳
We built HateCheck, a suite of functional tests for hate speech detection models, and used it to show critical weaknesses in current academic and commercial models.
⬇️ Highlights below ⬇️
Safety is hard because models today are general purpose tools. And for nearly every prompt that is safe and useful, there is an unsafe version.
You want the model to write good job ads, but not for some Nazi group. Blog posts? Not for terrorists. Chemistry? Not for explosives…
📣 Hiring for NLP Jobs 📣
Last year,
@bertievidgen
and I founded
@rewire_online
, a start-up building socially responsible AI for online safety. We have since grown into a team of 10+ people, winning major projects with GAFAM and an IUK Grant.
Right now, we are hiring 🧵👇
🥳 New paper at
#WOAH
#NAACL2022
🥳
Too much hate speech research focuses just on English content, so we release 🌍 Multilingual HateCheck 🌍 for testing hate speech detection models in 10 more languages!
⬇️ Highlights below ⬇️
Super hyped to start my postdoc with
@dirk_hovy
@MilaNLProc
this week 🎉
I'll be working on social values in large language models and AI safety (while living & eating well in Milan 🙏). I am excited to continue existing collaborations and meet new people – please reach out 🤗
Super happy to start my research visit to
@dirk_hovy
’s group
@MilaNLProc
today 🥳 I’ll be working on some exciting hate speech / NLP projects over the summer (and have many aperitivi along the way 🍷). Come say ciao if you’re in the area 🇮🇹
How does human feedback steer LLM behaviours?🧐 Whose voices dominate? 🗣️What challenges remain and how can we do better as a community in the future?🔮
All these questions and more answered in our new survey paper, accepted at
#EMNLP23
!
a small 🧵
LLM safety will be a big topic at
#EMNLP2023
! I put together this timetable with all the papers I am excited to check out. Sharing here in case others find it useful too :) Also sharing the link further below 🧵
Probing for unsafe use cases itself is not easy. Finding and evaluating the right prompts requires expert knowledge. Oversight across domains will become more and more of a challenge – check out great work from
@sleepinyourhat
,
@EthanJPerez
and others on this issue.
There are plenty of explicitly "uncensored" models that will give similar responses, but
@MistralAI
is framing this as a mainstream Llama alternative. It is wild to me that they do not evaluate or even mention safety in their release announcement...
These are just some of the issues that struck me the most while red-teaming GPT-4. I don’t want to jump on the hype train. The model is far from perfect. But I will say that I was impressed with the care and attention that everyone I interacted with
@OpenAI
put into this effort.
Also, it’s not always clear where to draw the lines on safety. What is or is not safe depends on who you ask.
This is where model safety overlaps with more general alignment research. Who are we aligning our models with, and how? I am really excited for more work on this!
BIG NEWS: Rewire has been acquired by ActiveFence 🥳
Two years ago,
@bertievidgen
and I started
@rewire_online
to build socially responsible AI for online safety. Today, we’re excited to share that we have been acquired by
@ActiveFence
!
🧵
🥳 New publication accepted at
#EMNLP2021
Findings 🥳
We used three years of Reddit data to adapt BERT to time and domain, illustrating when temporal adaptation isn't worth doing – and when it might be!
⬇️ Highlights below ⬇️
@VictorGall6791
@MistralAI
Thanks, Victor. I don't think that is true! There may be a tension between helpfulness and harmlessness in some settings, but there is also plenty of great work from
@AnthropicAI
and others on making models more helpful and less harmful at the same time. Calibration is possible!
Very excited to share that we won the
@StanfordHAI
AI Audit Challenge 🥳 Our HateCheck project (with
@hannahrosekirk
and
@bertievidgen
) was awarded "Best Holistic Evaluation and Benchmarking"!
Brief summary below 🧵
Last August, HAI and
@StanfordCyber
launched the
#AIAuditChallenge
that calls for solutions to improve our ability to evaluate AI systems. Join us on June 27 as we highlight the most innovative approaches, as well as lessons learned from the challenge:
🥳
#NAACL2022
Presentations 🥳
Super excited to present two articles at NAACL this week!
Interested in subjectivity and data annotation for tasks like hate speech detection? Then come to Session 8C this Wednesday at 0915 🙌
🏆
@rewire_online
won the DB Mindbox Challenge 🏆
I was in Berlin last week to pitch for Rewire at DB, Germany's national railway operator. Super excited to share that we won, and will now help DB handle their toxic feedback! 🙌
More details below 👇
We just released v2 of XSTest, our test suite for "exaggerated safety" in LLMs 🚀
Llama2 and other models often refuse safe prompts just because they superficially resemble unsafe prompts. With XSTest, you can test for this issue in a systematic way.
🧵
There is so much good work on LLM safety, so many relatively easy steps to take to avoid these extreme cases of unsafe behaviour. I really hope future releases will make more use of that!
HateCheck is now on
@huggingface
🤗
Testing your hate speech detection models has never been easier – it takes just two lines of code to load the dataset! See our ACL 2021 paper for details (), and get testing with the link below 👇
Excited to talk about data annotation for subjective NLP tasks like hate speech detection
@MilaNLProc
today! Thank you
@CurriedAmanda
and
@dirk_hovy
for inviting me 🤗 It's still early-stage work but I can hopefully share a preprint soon!
Excited to (virtually) be at
#EMNLP2022
this week!
If you're interested in online harms, social factors in language modelling or model safety, let's have a chat 🤗
Here's where you can find me 👇
#WOAH
will be at
#ACL2023
🥳
The call for papers is now live! This is a great venue for online safety research, and I am excited to be co-organising it this year 🤗 Submit your work and join us in Toronto!
📢Exciting news!
The call for papers is now open for the 7th Workshop on Online Abuse and Harms (
#WOAH
) at ACL 2023. Join the effort to address these critical issues.
See CFP here:
Submission Deadline: May 2, 2023
#ACL2023
#NLProc
We are excited to share that WOAH 2024, our 8th edition, will take place at
#NAACL2024
in Mexico City!
Our special theme this year will be "Online Harms in the Age of LLMs", covering emerging risks as well as LLM-based countermeasures.
CfP and more details soon 🚀
I’m at
#ACL2023NLP
this week to present co-authored work and co-organise
@WOAHWorkshop
.
But I am just as excited to meet new people and talk research in between sessions!
If you’re interested in social values in language models and/or model safety, come say hi 🤗
Super interesting panel on building NLP datasets! Great to hear
@_julianmichael_
@anmarasovic
@complingy
discuss prescriptive/descriptive annotation – very happy people are finding this useful!
Check out the video from 17:32 or read 👇 for more details
Thinking about collecting an
#NLProc
dataset / submitting a dataset paper? ➡️💡Check out our suggestions for "Building NLP Datasets" at
Thank you so much for our wonderful mentors:
@complingy
@anmarasovic
@_julianmichael_
❤️!
Great new datasets and methods for tackling emoji-based hate coming out at
#NAACL2022
! 🥳 Very glad I got to make a small contribution to this amazing initiative by
@hannahrosekirk
🤗
Check out the details and paper below 👇
🚨 New paper and datasets! 🚨
After sitting on my hands for many months 😬 I'm delighted that our
#Hatemoji
paper is going to
@naaclmeeting
! 😍🤩😎🆒
In a nutshell 🥜it uses human-and-model-in-the-loop learning 🤖🤝🙆 to tackle emoji-based hate
A 🧵 on all our new resources 1/
Relatedly, I am also very interested in subjectivity, human values and preferences, and how they are incorporated in LLMs. There’s a ton of papers on that as well – including one led by
@hannahrosekirk
which I co-authored and will be presenting the poster for on Saturday at 11am!
If you're at
#NeurIPS2023
go find
@hannahrosekirk
! You'll get to talk about this very fun poster
@solarneurips
AND get a preview of one of the most exciting dataset projects I've ever been a part of 🤫
Hi 🌎! I've arrived at
@NeurIPSConf
🫡 Reach out if you wanna talk all things human feedback + sociotechnical alignment. I’m presenting this cute poster, but we’re also building an awesome new human feedback dataset (release in Jan 👀) that I can’t wait to tell everyone about 🕺
🏅 EDOS @ SemEval Results 🏅
More than 500 people signed up for our SemEval task on the Explainable Detection of Online Sexism. The task paper, with dataset details, results and analysis, is now on arXiv! 👇
Working on hate speech detection in non-English languages? Want to test your models? Then come to the Workshop on Online Abuse and Harms this Thursday 🙏
And if you want to have a chat about any of these topics, just flag me down at the conference any time🤗
Super insightful review of 2021 research highlights by
@seb_ruder
🙌 Definitely an early personal highlight of 2022 to see my work on temporal adaptation with Janet Pierrehumbert mentioned among so many other great articles 🤗
Check out our article here:
Two new postdoc openings in Janet Pierrehumbert's Oxford NLP lab group! Come join us 🤗 or spread the word 📢
1) Text data mining, experimental semantics:
2) Graph machine learning, social network analysis:
Really enjoyed giving a talk
@cambridgenlp
today!
I spoke about some ideas for "exploring and controlling values in large language models through role-playing" -- much inspired by
@jacobandreas's recent work on language models as agent models 🎭
Check out this new
#EMNLP2023
survey on feedback learning in large language models, put together by
@hannahrosekirk
. I am biased, but I think it's a great resource!
More details in Hannah's thread 👇 Feedback is welcome! 🥁
Looking for something to do over the holidays? Join our
#SemEval2023
task on explainable sexism detection!
350+ people have already signed up for the task, run by
@rewire_online
with support from
@MetaAI
. The test phase starts on Jan 10th! 🚀
Yesterday,
@royalsociety
released a report on the online information environment, and it's a great read for those interested in
#onlinesafety
🙌 The report is informed, in part, by a lit review I wrote with
@balazsvedres
back in 2020.
Check out both here:
PS: As I said in my first tweet, I have not done a very systematic evaluation of the model, but I am very confident that doing so would confirm my first impression. Happy to be proven wrong!
Are you working on hate speech, abuse or other online harms? Then submit to WOAH at
#ACL2023
!
We have extended the paper deadline to May 9th 🗓 and also announced two best paper awards 🏆
For more details visit our website and please spread the word 🤗
📢The direct paper submission deadline for WOAH
@aclmeeting
has been officially extended to May 9th! 📢
That is around 20 days from now. We look forward to receiving your work!
We will update our website shortly.
#ACL2023
#WOAH2023
#NLProc
@larsjuhljensen
Thanks, Lars. I see your point, but I think there is a lot of value in trying to make the most advanced and most widely used models as safe as they can be. Not a perfect analogy, but Twitter needs to be safe even though unsafe alternatives like Gab exist.
@CaryPalmerr
@MistralAI
Thanks, Palmer. I am talking about their instruct / chat-optimised model, which is also the one I linked to on
If you like, you can look at the prompts we tried here:
@natolambert
@MistralAI
Thanks, Nathan. My main issue is that safety was not even evaluated ahead of release, or these evals were not shared. FWIW, I also think there should be minimum safety standards for when big orgs release chat models.
XSTest contains 200 hand-crafted test prompts across ten prompt types.
All prompts are perfectly innocuous questions. A well-calibrated model should not refuse to answer them!
Examples below 👇
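A minimal sketch of how such a test suite can be scored, purely as an illustration: the string-match refusal detector and the responses below are my own placeholders, not XSTest's actual evaluation method.

```python
# Minimal sketch of an XSTest-style evaluation loop. The string-match
# refusal detector below is a naive placeholder of my own, not the
# method from the paper, where responses are judged more carefully.
REFUSAL_MARKERS = ("i cannot", "i can't", "i'm sorry", "as an ai")

def is_refusal(response: str) -> bool:
    return response.lower().startswith(REFUSAL_MARKERS)

# Hypothetical (safe prompt, model response) pairs:
responses = [
    "I'm sorry, but I can't help with killing anything.",
    "To kill time at the airport, try reading or a podcast.",
    "I cannot assist with that request.",
    "Smashing a piñata works best with a sturdy stick.",
]

refusal_rate = sum(map(is_refusal, responses)) / len(responses)
print(f"Refusal rate on safe prompts: {refusal_rate:.0%}")  # 50%
```

In practice, partial refusals and hedged responses make automatic detection unreliable, so refusals are better judged by hand or with a trained classifier.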
🚨 🚨 New episode, new season! We spoke to fellow
@oiioxford
colleagues
@paul_rottger
and
@hannahrosekirk
about their research on hate speech detection and AI - join us for a fascinating deep dive into their work. Available on Spotify!
@deliprao
Thanks, Delip. I tested the instruction model. Regardless, my main point is that safety is a relevant consideration for many LLM applications. Therefore, if Mistral made a choice to release an unmoderated “power tool”, they should be open about that from the get-go.
Had a great time speaking about
#hatespeech
detection models and their weaknesses at
@CogX_Festival
with
@bertievidgen
last week. Recording is now online, please check it out 🤗
Looking forward to talking about my research at the
@turinginst
doctoral showcase tomorrow at 1:30pm GMT!
Lots of other exciting presentations as well -- agenda & links to register here:
Hate speech research needs to serve everyone, not just English speakers! I'm very happy that we
@rewire_online
get to be a part of this exciting and ambitious project 💪
🌍 Rewire x Lacuna Fund 🌍
We're super excited to be part of the new AfriHate project, working with great groups like
@MasakhaneNLP
to expand hate speech detection into 18 African languages!
Read on below for more details 👇
@Sohail_NITIE
@MistralAI
Thanks, Sohail. It looks like you are using codellama, not the mistral model (see dropdown at the bottom). Here are the unsafe prompts we tried yesterday, in case you are interested
🎉
@rewire_online
is hosting a SemEval task 🎉
The goal of the task is to detect A) sexist content, as well as B) different types of sexism and C) fine-grained sexism vectors. Super excited to run this with support from
@MetaAI
!
Read more details and join the task below 👇
Can AI tell us why something is sexist online?🧐
Our new
@SemEvalWorkshop
#NLProc
task invites you to create systems that identify sexist content and explain why with fine-grained predictions🔎 Check out our competition, organised by
@rewire_online
& sponsored by
@MetaAI
. Link⬇️
We test Meta's recently-released Llama2 with XSTest, and find a lot of exaggerated safety behaviours.
The model fully refuses 38% of our test prompts and partially refuses another 22% -- you cannot ask Llama2 how to kill the lights or smash a piñata!
Great talking to
@CraigLangran
along with
@bertievidgen
for this
@BBCNews
article!
Bottom line: Content moderation systems continue to make harmful mistakes. Platforms need to be more transparent, so that decisions can be explained, weaknesses identified and addressed!
⚠️NEW FROM ME⚠️
I’ve taken a deep dive into the murky world of social media content moderation for
@BBCNews
How do platforms like
@Twitter
determine when a post is abusive or hateful?
#onlineabuse
AI still struggles with one of its most basic applications: censoring harmful language. But distinguishing toxic and innocuous sentences isn't as straightforward for a machine.
The Online Safety team
@turinginst
is hiring two postdocs! Really great team to work with – please spread the word 📢
1) NLP / data-centric AI for online safety
2) Social science / policy for online harms
I am @
#NAACL2022
for my first **ever** in-person conference 🤩 I'll be talking about hate + emoji =
#Hatemoji
(Sesh 8, 13/07) Come say hi if you want to swap favourite emojis 🦄🫠🚀❤️‍🔥🆒👨‍🎤🍜 or discuss how these little pictures pose challenges for language modelling 🤟
Super excited that our team
@rewire_online
got spotlighted in the new
@DCMS
report on the UK Safety Tech sector 🙌🚀
Go to our website to learn more and to get free access to the Rewire API for toxic content detection!
Fantastic to see
@rewire_online
included in this new report from
@DCMS
, on the growth of
#OnlineSafetyTech
—a recognition of the importance of tackling online harms.
@DamianCollins
: "Making the online world safer is not only the right thing to do, it’s good for business" 🙌
Going forward, I will post more about our work at Rewire: our best-in-class AI for hate speech detection, and free API access. Please get in touch if you have any questions and follow
@rewire_online
to stay up-to-date!
Big things coming soon 🎉
Today, at 4pm GST, I will be a panelist at the BoF session on hate speech detection in African languages 🌍 You can still register below, even if you're not at EMNLP!
Join our BoF session on Hate speech detection for African Languages tomorrow (Tue 7 Dec, 4 pm GST / 12 pm GMT)
@Shmuhammadd
will present the AfriHate project + we'll have a panel discussion w/ Adem Chanie Ali
@AishatuGwadabe
@paul_rottger
@seyyaw
+ an open discussion w/ all attendees
Also check out , built by our team
@rewire_online
, for all other HateCheck-related resources:
- the original English HateCheck ()
- HatemojiCheck (led by
@hannahrosekirk
)
🛠️ Experimental setup 🛠️
We fine-tuned over 3,000 models (!) on different combinations of English and target-language data.
Then, we evaluated all these models on held-out test sets and multilingual HateCheck, using OLS regression (!) to quantify benefits of using more data 📈
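The regression idea can be sketched as follows, with synthetic numbers standing in for the real fine-tuning results (a single-predictor illustration; the paper's setup with English and target-language data is richer):

```python
import random

# Hypothetical illustration (synthetic numbers, not the paper's results):
# relate the amount of target-language training data each model saw
# to its test F1, and read the slope as the marginal benefit per example.
random.seed(0)
n_models = 200
data_amounts = [random.uniform(0, 2000) for _ in range(n_models)]
f1_scores = [0.55 + 8e-5 * x + random.gauss(0, 0.02) for x in data_amounts]

# Ordinary least squares for a single predictor:
# slope = cov(x, y) / var(x), intercept = mean(y) - slope * mean(x)
mean_x = sum(data_amounts) / n_models
mean_y = sum(f1_scores) / n_models
cov_xy = sum((x - mean_x) * (y - mean_y)
             for x, y in zip(data_amounts, f1_scores)) / n_models
var_x = sum((x - mean_x) ** 2 for x in data_amounts) / n_models
slope = cov_xy / var_x
intercept = mean_y - slope * mean_x

# slope ≈ estimated F1 gain per additional training example
print(f"F1 gain per 1,000 examples: {slope * 1000:.3f}")
```

With many predictors (English data, target-language data, model size), the same idea generalises to multivariate OLS, where each coefficient estimates the marginal benefit of one resource holding the others fixed.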
@deliprao
Hi
@deliprao
! We just released v2 of a dataset to test for exactly this kind of "exaggerated safety". Hope you find it interesting, would love to hear your thoughts!
⚠️ We argue that dataset creators should consider annotator subjectivity in the annotation process and either explicitly encourage it or discourage it, depending on the intended use of their dataset ⚠️
As a framework, we propose two contrasting data annotation paradigms:
There is a lot more detail to this, so do give the paper a read if you have time!
As always, you can also reproduce all our experiments using our code and data on GitHub:
@DamianRomero_CL
Very cool thread! We just released our NAACL paper about subjectivity in data annotation that is quite relevant to this -- would be curious to hear your thoughts 🤗
Our
#NAACL2022
paper on subjectivity in data annotation is finally live in the ACL Anthology 🙌 Thank you to those who flagged that the wrong PDF was linked 🙏
Check out the paper below 👇
Exaggerated safety is likely a problem of lexical overfitting.
To understand our prompts, LLMs need to contextualise potentially unsafe words ("kill time"). This is very easy for humans! But LLMs often focus only on unsafe meanings, which is why they refuse even safe prompts.
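A toy illustration of lexical overfitting (my own sketch, not any real model's safety mechanism): a filter keyed only on unsafe-looking words cannot use context, so it refuses harmless prompts too.

```python
# Toy sketch of lexical overfitting (my illustration, not any real
# model's safety mechanism): a filter that refuses based on unsafe-looking
# words alone cannot use context, so it refuses safe prompts too.
UNSAFE_WORDS = {"kill", "smash", "shoot"}

def lexical_filter(prompt: str) -> str:
    words = prompt.lower().replace("?", "").split()
    if UNSAFE_WORDS & set(words):
        return "REFUSE"
    return "COMPLY"

# The first two prompts both trigger a refusal, but only one is unsafe.
print(lexical_filter("How can I kill time at the airport?"))   # exaggerated safety
print(lexical_filter("How can I kill my neighbour?"))          # correct refusal
print(lexical_filter("What's the weather like today?"))        # correct compliance
```

A well-calibrated model has to do what this filter cannot: disambiguate the unsafe word from its context before deciding to refuse.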
Please let me know if I missed anything! And definitely say hi if you want to chat about any of these topics :)
Here is the link to the sheet 👇 If you want to filter by topic, click Data -> Filter Views.
Very cool new article at the intersection of generative AI and theology by
@HannahALucas
!
It's not my usual field, but I had a lot of fun discussing earlier drafts, and I learned a lot from the final article! Check it out below 👇
Pleased to say my new article is now online and open access with
@Religions_MDPI
🥳
In this paper I look at
#AI
Text-to-Image models, comparing negative weight prompts to negative language in mystical texts ☁️ 1/
For example, all models struggled with reclaimed slurs and counter speech. Misclassifying such content as hateful risks penalising the very communities most commonly targeted by online hate in the first place. It also undermines positive efforts to fight back against online hate.
Multilingual HateCheck (MHC) is an expansion of the original English HateCheck (ACL 2021) to Arabic, Dutch, French, German, Hindi, Italian, Mandarin, Polish, Portuguese and Spanish 🌏
This is more languages than any other hate speech dataset!
Very excited to develop these ideas into proper papers soon! Shoutout to
@hannahrosekirk
for great discussions on this, and thank you to
@nedjmaou
and
@michael_sejr
for inviting me 🤗
@metropolinomix
Thanks, M! I would say this is not safe behaviour, and I think people at OpenAI would agree. That is why they put that illustrative example into their System Card section on the potential for risky emergent behaviour. Let me know if you were referring to something else!
@Abebab
There aren’t many so far, but Lacuna Fund is supporting a new project to create hate speech datasets in 18 African languages! (Disclaimer: I’m involved in this as well)
Today, Lacuna Fund announces awards to 10 teams creating machine learning datasets for low-resourced African languages.
Learn more about the selected projects here:
Français:
Español:
@pratik_ratadiya
@hannahrosekirk
Functional tests are great! If you liked HatemojiCheck, you may also like HateCheck () and Multilingual HateCheck (), which I am presenting at
#NAACL2022
/ WOAH tomorrow 🙌
On Tuesday morning, I’ll be hanging out at the poster session with
@morlikow
, who is presenting our work on sociodemographics in modelling human label variation.
@synbiocs
Thanks, Swapnil. What you are describing is not far off the RL feedback processes that are being used for these types of models. Check out this paper from Bai et al.
@AnthropicAI
for example: