Paul Röttger

@paul_rottger

Followers: 2,220 · Following: 464 · Media: 36 · Statuses: 276

Postdoc @MilaNLProc , working on evaluating and improving LLM safety. Previously PhD @oiioxford & CTO/co-founder @rewire_online

Joined July 2020
Pinned Tweet
@paul_rottger
Paul Röttger
2 months
is now a paper! We reviewed 102 open datasets for evaluating and improving LLM safety. We also reviewed how these datasets are used in major model release publications and in popular benchmarks. Headline results and arxiv link below 👇
Tweet media one
@paul_rottger
Paul Röttger
5 months
If you’re working on LLM safety, check out ! is a catalogue of open datasets for evaluating and improving LLM safety. I started building this over the holidays, and I know there are still datasets missing, so I need your help 🧵
Tweet media one
9
55
222
1
14
66
@paul_rottger
Paul Röttger
1 year
I was part of OpenAI’s red team for GPT-4, testing its ability to generate harmful content. Working with the model in various iterations over the course of six months convinced me that model safety is the most difficult and most exciting challenge in NLP right now. 🧵
@gdb
Greg Brockman
1 year
We’re releasing GPT-4 — a large multimodal model (image & text in, text out) which is a significant advance in both capability and alignment. Still limited in many ways, but passes many qualification benchmarks like the bar exam & AP Calculus:
165
1K
7K
32
150
896
@paul_rottger
Paul Röttger
8 months
After spending just 20 minutes with the @MistralAI model, I am shocked by how unsafe it is. It is very rare these days to see a new model so readily reply to even the most malicious instructions. I am super excited about open-source LLMs, but this can't be it! Examples below 🧵
@GuillaumeLample
Guillaume Lample @ ICLR 2024
8 months
Mistral 7B is out. It outperforms Llama 2 13B on every benchmark we tried. It is also superior to LLaMA 1 34B in code, math, and reasoning, and is released under the Apache 2.0 licence.
Tweet media one
52
488
3K
217
107
767
@paul_rottger
Paul Röttger
10 months
NEW PREPRINT! LLMs should be helpful AND harmless. This is a difficult balance to get right... Some models refuse even safe requests if they superficially resemble unsafe ones. We built XSTest to systematically test for such "exaggerated safety". 🧵
Tweet media one
7
22
128
@paul_rottger
Paul Röttger
2 years
🥳 New paper accepted at #NAACL2022 (Main) 🥳 NLP tasks like hate speech detection are subjective: annotators disagree about what the correct data labels are. We propose two contrasting paradigms to enable better data annotation. ⬇️ Highlights below ⬇️
8
30
129
@paul_rottger
Paul Röttger
1 year
Excited to share that I successfully defended my PhD thesis @oiioxford last month 🥳 Huge thanks to my assessors @computermacgyve and @MaartenSap , and also my supervisors, collaborators and friends. I'll post my full acknowledgments here 👇 and more on what's next next week 🤗
Tweet media one
15
3
120
@paul_rottger
Paul Röttger
2 years
Multilingual HateCheck is now @huggingface 🤗 New tests for hate speech detection models in 10 languages with just 2 lines of code! See our #NAACL2022 WOAH paper for details (), and get testing with the link below 👇
2
26
116
@paul_rottger
Paul Röttger
2 years
🥳 New paper at #EMNLP2022 (Main) 🥳 Too much hate speech research focuses just on English content! To help fix this, we tried to expand hate detection models into under-resourced languages 🌍 without needing lots of new data 💸 ⬇️ Highlights below ⬇️
4
28
111
@paul_rottger
Paul Röttger
8 months
CONTENT WARNING You will have no trouble getting the model to give you advice on how to commit suicide, assault someone, or eradicate minorities. I will not post examples verbatim, but I can DM them, or you can try for yourself here:
88
6
102
@paul_rottger
Paul Röttger
3 years
🥳 New publication accepted at #ACL2021NLP ! 🥳 We built HateCheck, a suite of functional tests for hate speech detection models, and used it to show critical weaknesses in current academic and commercial models. ⬇️ Highlights below ⬇️
1
18
91
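The HateCheck idea — a suite of targeted functional tests, each probing one capability of a hate speech classifier — can be sketched in a few lines. The functionality names, test cases, and the toy keyword classifier below are illustrative stand-ins, not the paper's actual suite.

```python
# Minimal sketch of HateCheck-style functional testing. The toy keyword
# classifier stands in for a real model; test cases are illustrative.

def toy_classifier(text: str) -> str:
    """Naive keyword matcher: flags anything containing 'hate'."""
    return "hateful" if "hate" in text.lower() else "non-hateful"

# Each functional test: (functionality, input text, gold label)
TEST_CASES = [
    ("strong negative emotions", "I hate all people like you.", "hateful"),
    ("negation of hate", "I don't hate anyone.", "non-hateful"),
    ("counter speech", "Saying you hate them is unacceptable.", "non-hateful"),
]

def run_suite(classify):
    """Return per-functionality accuracy to expose systematic weaknesses."""
    results = {}
    for functionality, text, gold in TEST_CASES:
        results.setdefault(functionality, []).append(classify(text) == gold)
    return {f: sum(r) / len(r) for f, r in results.items()}

if __name__ == "__main__":
    for functionality, acc in run_suite(toy_classifier).items():
        print(f"{functionality}: {acc:.0%}")
```

Grouping accuracy by functionality rather than reporting one aggregate score is what lets a suite like this surface the specific failure modes (e.g. negated hate, counter speech) that a single held-out test set averages away.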
@paul_rottger
Paul Röttger
1 year
Safety is hard because models today are general-purpose tools. And for nearly every prompt that is safe and useful, there is an unsafe version. You want the model to write good job ads, but not for some Nazi group. Blog posts? Not for terrorists. Chemistry? Not for explosives…
2
4
88
@paul_rottger
Paul Röttger
2 years
📣 Hiring for NLP Jobs 📣 Last year, @bertievidgen and I founded @rewire_online , a start-up building socially responsible AI for online safety. We have since grown into a team of 10+ people, winning major projects with GAFAM and an IUK Grant. Right now, we are hiring 🧵👇
4
31
80
@paul_rottger
Paul Röttger
2 years
🥳 New paper at #WOAH #NAACL2022 🥳 Too much hate speech research focuses just on English content, so we release 🌍 Multilingual HateCheck 🌍 for testing hate speech detection models in 10 more languages! ⬇️ Highlights below ⬇️
3
13
86
@paul_rottger
Paul Röttger
1 year
Super hyped to start my postdoc with @dirk_hovy @MilaNLProc this week 🎉 I'll be working on social values in large language models and AI safety (while living & eating well in Milan 🙏). I am excited to continue existing collaborations and meet new people – please reach out 🤗
Tweet media one
3
2
87
@paul_rottger
Paul Röttger
2 years
Super happy to start my research visit to @dirk_hovy ’s group @MilaNLProc today 🥳 I’ll be working on some exciting hate speech / NLP projects over the summer (and have many aperitivi along the way 🍷). Come say ciao if you’re in the area 🇮🇹
5
3
82
@paul_rottger
Paul Röttger
6 months
Presenting our poster now at #EMNLP2023 , stand 45A way in the back. Come by!
Tweet media one
@hannahrosekirk
Hannah Rose Kirk
7 months
How does human feedback steer LLM behaviours?🧐 Whose voices dominate? 🗣️What challenges remain and how can we do better as a community in the future?🔮 All these questions and more answered in our new survey paper, accepted at #EMNLP23 ! a small 🧵
Tweet media one
3
18
110
1
5
68
@paul_rottger
Paul Röttger
6 months
LLM safety will be a big topic at #EMNLP2023 ! I put together this timetable with all the papers I am excited to check out. Sharing here in case others find it useful too :) Also sharing the link further below 🧵
Tweet media one
5
10
64
@paul_rottger
Paul Röttger
1 year
Probing for unsafe use cases itself is not easy. Finding and evaluating the right prompts requires expert knowledge. Oversight across domains will become more and more of a challenge – check out great work from @sleepinyourhat , @EthanJPerez and others on this issue.
1
2
54
@paul_rottger
Paul Röttger
8 months
There are plenty of explicitly "uncensored" models that will give similar responses, but @MistralAI is framing this as a mainstream Llama alternative. It is wild to me that they do not evaluate or even mention safety in their release announcement...
10
2
54
@paul_rottger
Paul Röttger
1 year
These are just some of the issues that struck me the most while red-teaming GPT-4. I don’t want to jump on the hype train. The model is far from perfect. But I will say that I was impressed with the care and attention that everyone I interacted with @OpenAI put into this effort.
1
3
50
@paul_rottger
Paul Röttger
1 year
Also, it’s not always clear where to draw the lines on safety. What is or is not safe depends on who you ask. This is where model safety overlaps with more general alignment research. Who are we aligning our models with, and how? I am really excited for more work on this!
1
4
51
@paul_rottger
Paul Röttger
1 year
BIG NEWS: Rewire has been acquired by ActiveFence 🥳 Two years ago, @bertievidgen and I started @rewire_online to build socially responsible AI for online safety. Today, we’re excited to share that we have been acquired by @ActiveFence ! 🧵
10
1
50
@paul_rottger
Paul Röttger
3 years
🥳 New publication accepted at #EMNLP2021 Findings 🥳 We used three years of Reddit data to adapt BERT to time and domain, illustrating when temporal adaptation isn't worth doing – and when it might be! ⬇️ Highlights below ⬇️
3
5
49
@paul_rottger
Paul Röttger
8 months
@VictorGall6791 @MistralAI Thanks, Victor. I don't think that is true! There may be a tension between helpfulness and harmlessness in some settings, but there is also plenty of great work from @AnthropicAI and others on making models more helpful and less harmful at the same time. Calibration is possible!
12
1
48
@paul_rottger
Paul Röttger
11 months
Very excited to share that we won the @StanfordHAI AI Audit Challenge 🥳 Our HateCheck project (with @hannahrosekirk and @bertievidgen ) was awarded "Best Holistic Evaluation and Benchmarking"! Brief summary below 🧵
Tweet media one
@StanfordHAI
Stanford HAI
11 months
Last August, HAI and @StanfordCyber launched the #AIAuditChallenge that calls for solutions to improve our ability to evaluate AI systems. Join us on June 27 as we highlight the most innovative approaches, as well as lessons learned from the challenge:
0
4
25
1
5
47
@paul_rottger
Paul Röttger
2 years
🥳 #NAACL2022 Presentations 🥳 Super excited to present two articles at NAACL this week! Interested in subjectivity and data annotation for tasks like hate speech detection? Then come to Session 8C this Wednesday at 0915 🙌
Tweet media one
1
9
45
@paul_rottger
Paul Röttger
2 years
🏆 @rewire_online won the DB Mindbox Challenge 🏆 I was in Berlin last week to pitch for Rewire at DB, Germany's national railway operator. Super excited to share that we won, and will now help DB handle their toxic feedback! 🙌 More details below 👇
Tweet media one
4
0
42
@paul_rottger
Paul Röttger
7 months
We just released v2 of XSTest, our test suite for "exaggerated safety" in LLMs 🚀 Llama2 and other models often refuse safe prompts just because they superficially resemble unsafe prompts. With XSTest, you can test for this issue in a systematic way. 🧵
Tweet media one
1
15
42
@paul_rottger
Paul Röttger
8 months
There is so much good work on LLM safety, so many relatively easy steps to take to avoid these extreme cases of unsafe behaviour. I really hope future releases will make more use of that!
7
1
38
@paul_rottger
Paul Röttger
2 years
Arrived in Seattle for #NAACL2022 — super excited for my first ever in-person conference! Come say hi 🤗
Tweet media one
1
0
37
@paul_rottger
Paul Röttger
2 years
HateCheck is now on @huggingface 🤗 Testing your hate speech detection models has never been easier – it takes just two lines of code to load the dataset! See our ACL 2021 paper for details (), and get testing with the link below 👇
1
11
36
@paul_rottger
Paul Röttger
1 year
Special thanks also to @_lamaahmad for coordinating the red teams and sharing insights along the way 🤗
3
0
31
@paul_rottger
Paul Röttger
3 years
Excited to talk about data annotation for subjective NLP tasks like hate speech detection @MilaNLProc today! Thank you @CurriedAmanda and @dirk_hovy for inviting me 🤗 It's still early-stage work but I can hopefully share a preprint soon!
2
0
28
@paul_rottger
Paul Röttger
1 year
Excited to (virtually) be at #EMNLP2022 this week! If you're interested in online harms, social factors in language modelling or model safety, let's have a chat 🤗 Here's where you can find me 👇
1
3
27
@paul_rottger
Paul Röttger
1 year
#WOAH will be at #ACL2023 🥳 The call for papers is now live! This is a great venue for online safety research, and I am excited to be co-organising it this year 🤗 Submit your work and join us in Toronto!
@WOAHWorkshop
Workshop on Online Abuse and Harms
1 year
📢Exciting news! The call for papers is now open for the 7th Workshop on Online Abuse and Harms ( #WOAH ) at ACL 2023. Join the effort to address these critical issues. See CFP here: Submission Deadline: May 2, 2023 #ACL2023 #NLProc
1
18
38
0
1
26
@paul_rottger
Paul Röttger
7 months
Super excited for this and everything we have planned! Get your papers ready and join us in Mexico City 🤗
@WOAHWorkshop
Workshop on Online Abuse and Harms
7 months
We are excited to share that WOAH 2024, our 8th edition, will take place at #NAACL2024 in Mexico City! Our special theme this year will be "Online Harms in the Age of LLMs", covering emerging risks as well as LLM-based countermeasures. CfP and more details soon 🚀
Tweet media one
2
20
61
0
2
25
@paul_rottger
Paul Röttger
11 months
I’m at #ACL2023NLP this week to present co-authored work and co-organise @WOAHWorkshop . But I am just as excited to meet new people and talk research in between sessions! If you’re interested in social values in language models and/or model safety, come say hi 🤗
1
3
24
@paul_rottger
Paul Röttger
2 years
Super interesting panel on building NLP datasets! Great to hear @_julianmichael_ @anmarasovic @complingy discuss prescriptive/descriptive annotation – very happy people are finding this useful! Check out the video from 17:32 or read 👇 for more details
@aclmentorship
ACL Mentorship
2 years
Thinking about collecting an #NLProc dataset / submitting a dataset paper? ➡️💡Check out our suggestions for "Building NLP Datasets" at Thank you so much for our wonderful mentors: @complingy @anmarasovic @_julianmichael_ ❤️!
1
23
50
1
3
23
@paul_rottger
Paul Röttger
2 years
Great new datasets and methods for tackling emoji-based hate coming out at #NAACL2022 ! 🥳 Very glad I got to make a small contribution to this amazing initiative by @hannahrosekirk 🤗 Check out the details and paper below 👇
@hannahrosekirk
Hannah Rose Kirk
2 years
🚨 New paper and datasets! 🚨 After sitting on my hands for many months 😬 I'm delighted that our #Hatemoji paper is going to @naaclmeeting ! 😍🤩😎🆒 In a nutshell 🥜it uses human-and-model-in-the-loop learning 🤖🤝🙆 to tackle emoji-based hate A 🧵 on all our new resources 1/
6
25
159
1
5
23
@paul_rottger
Paul Röttger
6 months
Relatedly, I am also very interested in subjectivity, human values and preferences, and how they are incorporated in LLMs. There’s a ton of papers on that as well – including one led by @hannahrosekirk which I co-authored and will be presenting the poster for on Saturday at 11am!
Tweet media one
1
4
22
@paul_rottger
Paul Röttger
2 years
We are looking for 🧑‍🎓 an NLP Research Scientist 💡 two NLP Interns 💬 a Communications Lead Details and how to apply on our careers page 👇
4
8
19
@paul_rottger
Paul Röttger
6 months
If you're at #NeurIPS2023 go find @hannahrosekirk ! You'll get to talk about this very fun poster @solarneurips AND get a preview of one of the most exciting dataset projects I've ever been a part of 🤫
@hannahrosekirk
Hannah Rose Kirk
6 months
Hi 🌎! I've arrived at @NeurIPSConf 🫡 Reach out if you wanna talk all things human feedback + sociotechical alignment. I’m presenting this cute poster, but we’re also building an awesome new human feedback dataset (release in Jan 👀) that I can’t wait to tell everyone about🕺
Tweet media one
3
3
83
0
3
22
@paul_rottger
Paul Röttger
1 year
🏅 EDOS @ SemEval Results 🏅 More than 500 people signed up for our SemEval task on the Explainable Detection of Online Sexism. The task paper, with dataset details, results and analysis, is now on ArXiv! 👇
2
2
20
@paul_rottger
Paul Röttger
2 years
Working on hate speech detection in non-English languages? Want to test your models? Then come to the Workshop on Online Abuse and Harms this Thursday 🙏 And if you want to have a chat about any of these topics, just flag me down at the conference any time🤗
Tweet media one
1
8
21
@paul_rottger
Paul Röttger
2 years
Super insightful review of 2021 research highlights by @seb_ruder 🙌 Definitely an early personal highlight of 2022 to see my work on temporal adaptation with Janet Pierrehumbert mentioned among so many other great articles 🤗 Check out our article here:
@seb_ruder
Sebastian Ruder
2 years
ML and NLP Research Highlights of 2021 These are the research areas and papers I found most exciting & inspiring in 2021.
27
417
1K
0
0
21
@paul_rottger
Paul Röttger
3 years
Two new postdoc openings in Janet Pierrehumbert's Oxford NLP lab group! Come join us 🤗 or spread the word 📢 1) Text data mining, experimental semantics: 2) Graph machine learning, social network analysis:
0
21
21
@paul_rottger
Paul Röttger
1 year
Really enjoyed giving a talk @cambridgenlp today! I spoke about some ideas for "exploring and controlling values in large language models through role-playing" -- much inspired by @jacobandreas recent work on language models as agent models 🎭
Tweet media one
1
1
19
@paul_rottger
Paul Röttger
7 months
Check out this new #EMNLP2023 survey on feedback learning in large language models, put together by @hannahrosekirk . I am biased, but I think it's a great resource! More details in Hannah's thread 👇 Feedback is welcome! 🥁
@hannahrosekirk
Hannah Rose Kirk
7 months
How does human feedback steer LLM behaviours?🧐 Whose voices dominate? 🗣️What challenges remain and how can we do better as a community in the future?🔮 All these questions and more answered in our new survey paper, accepted at #EMNLP23 ! a small 🧵
Tweet media one
3
18
110
0
3
20
@paul_rottger
Paul Röttger
1 year
Looking for something to do over the holidays? Join our #SemEval2023 task on explainable sexism detection! 350+ people have already signed up for the task, run by @rewire_online with support from @MetaAI . The test phase starts on Jan 10th! 🚀
0
8
19
@paul_rottger
Paul Röttger
2 years
Yesterday, @royalsociety released a report on the online information environment, and it's a great read for those interested in #onlinesafety 🙌 The report is informed, in part, by a lit review I wrote with @balazsvedres back in 2020. Check out both here:
0
6
18
@paul_rottger
Paul Röttger
8 months
PS: As I said in my first tweet, I have not done a very systematic evaluation of the model, but I am very confident that doing so would confirm my first impression. Happy to be proven wrong!
2
1
18
@paul_rottger
Paul Röttger
1 year
Are you working on hate speech, abuse or other online harms? Then submit to WOAH at #ACL2023 ! We have extended the paper deadline to May 9th 🗓 and also announced two best paper awards 🏆 For more details visit our website and please spread the word 🤗
@WOAHWorkshop
Workshop on Online Abuse and Harms
1 year
📢The direct paper submission deadline for WOAH @aclmeeting has been officially extended to May 9th! 📢 That is around 20 days from now. We look forward to receiving your work! We will update our website shortly. #ACL2023 #WOAH2023 #NLProc
1
5
8
0
10
16
@paul_rottger
Paul Röttger
1 year
@larsjuhljensen Thanks, Lars. I see your point, but I think there is a lot of value in trying to make the most advanced and most widely used models as safe as they can be. Not a perfect analogy, but Twitter needs to be safe even though unsafe alternatives like Gab exist.
1
0
16
@paul_rottger
Paul Röttger
8 months
@CaryPalmerr @MistralAI Thanks, Palmer. I am talking about their instruct / chat-optimised model, which is also the one I linked to. If you like, you can look at the prompts we tried here:
8
1
16
@paul_rottger
Paul Röttger
8 months
@natolambert @MistralAI Thanks, Nathan. My main issue is that safety was not even evaluated ahead of release, or these evals were not shared. FWIW, I also think there should be minimum safety standards for when big orgs release chat models.
5
0
15
@paul_rottger
Paul Röttger
10 months
XSTest contains 200 hand-crafted test prompts across ten prompt types. All prompts are perfectly innocuous questions. A well-calibrated model should not refuse to answer them! Examples below 👇
Tweet media one
1
6
15
@paul_rottger
Paul Röttger
8 months
Thanks to @peppeatta @bertievidgen @hannahrosekirk who helped with this initial testing 🙏
3
1
16
@paul_rottger
Paul Röttger
2 years
Really enjoyed being on @skeptechs to talk about my two favourite things: research and @rewire_online 😁 Thank you to @NayanaPrakash1 and @JoshCowls for having me, and to @hannahrosekirk for co-guesting 🙌
@skeptechs
skeptechs
2 years
🚨 🚨 New episode, new season! We spoke to fellow @oiioxford colleagues @paul_rottger and @hannahrosekirk about their research on hate speech detection and AI - join us for a fascinating deep dive into their work. Available on Spotify!
1
4
13
0
1
15
@paul_rottger
Paul Röttger
8 months
@deliprao Thanks, Delip. I tested the instruction model. Regardless, my main point is that safety is a relevant consideration for many LLM applications. Therefore, if Mistral made a choice to release an unmoderated “power tool”, they should be open about that from the get-go.
3
1
12
@paul_rottger
Paul Röttger
3 years
Had a great time speaking about #hatespeech detection models and their weaknesses at @CogX_Festival with @bertievidgen last week. Recording is now online, please check it out 🤗
0
2
11
@paul_rottger
Paul Röttger
3 years
Looking forward to talking about my research at the @turinginst doctoral showcase tomorrow at 1:30pm GMT! Lots of other exciting presentations as well -- agenda & links to register here:
0
1
12
@paul_rottger
Paul Röttger
2 years
Hate speech research needs to serve everyone, not just English speakers! I'm very happy that we @rewire_online get to be a part of this exciting and ambitious project 💪
@rewire_online
Rewire (acquired by ActiveFence)
2 years
🌍 Rewire x Lacuna Fund 🌍 We're super excited to be part of the new AfriHate project, working with great groups like @MasakhaneNLP to expand hate speech detection into 18 African languages! Read on below for more details 👇
1
1
6
2
0
13
@paul_rottger
Paul Röttger
8 months
@Sohail_NITIE @MistralAI Thanks, Sohail. It looks like you are using codellama, not the mistral model (see dropdown at the bottom). Here are the unsafe prompts we tried yesterday, in case you are interested
6
0
12
@paul_rottger
Paul Röttger
2 years
We release MHC and other work related to HateCheck on , a new website built by us at @rewire_online and supported by Google @Jigsaw . Check it out!
0
5
12
@paul_rottger
Paul Röttger
2 years
🎉 @rewire_online is hosting a SemEval task 🎉 The goal of the task is to detect A) sexist content, as well as B) different types of sexism and C) fine-grained sexism vectors. Super excited to run this with support from @MetaAI ! Read more details and join the task below 👇
@hannahrosekirk
Hannah Rose Kirk
2 years
Can AI tell us why something is sexist online?🧐 Our new @SemEvalWorkshop #NLProc task invites you to create systems that identify sexist content and explain why with fine-grained predictions🔎 Check out our competition, organised by @rewire_online & sponsored by @MetaAI . Link⬇️
Tweet media one
2
17
105
0
6
11
@paul_rottger
Paul Röttger
10 months
We test Meta's recently released Llama2 with XSTest and find a lot of exaggerated safety behaviours. The model fully refuses 38% of our test prompts and partially refuses another 22% -- you cannot ask Llama2 how to kill the lights or smash a piñata!
2
2
11
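The scoring behind numbers like these can be sketched as simple pattern matching over model responses, split into full refusal, partial refusal, and compliance. The marker strings and example responses below are made up for illustration; the actual evaluation relies on careful annotation rather than these exact rules.

```python
# Sketch of scoring XSTest-style responses. Marker lists and the example
# responses are illustrative assumptions, not the paper's annotation scheme.

FULL_REFUSAL_MARKERS = ("i cannot", "i can't", "i'm sorry, but")
PARTIAL_REFUSAL_MARKERS = ("however, it is important", "as an ai")

def score_response(response: str) -> str:
    r = response.lower()
    if r.startswith(FULL_REFUSAL_MARKERS):   # str.startswith accepts a tuple
        return "full refusal"
    if any(m in r for m in PARTIAL_REFUSAL_MARKERS):
        return "partial refusal"
    return "compliance"

def refusal_rates(responses):
    labels = [score_response(r) for r in responses]
    n = len(labels)
    return {lab: labels.count(lab) / n
            for lab in ("full refusal", "partial refusal", "compliance")}

responses = [
    "I cannot help with killing anything, including lights.",    # over-refusal
    "To kill the lights, flip the switch by the door.",          # compliance
    "Sure! As an AI, I must note that a piñata is fine to smash.",  # partial
    "I'm sorry, but I can't assist with that request.",          # over-refusal
]
print(refusal_rates(responses))
```

On safe prompts, a well-calibrated model should score as compliance nearly everywhere; a high full- or partial-refusal rate is the "exaggerated safety" signal.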
@paul_rottger
Paul Röttger
3 years
Great talking to @CraigLangran along with @bertievidgen for this @BBCNews article! Bottom line: Content moderation systems continue to make harmful mistakes. Platforms need to be more transparent, so that decisions can be explained, weaknesses identified and addressed!
@CraigLangran
Craig Langran
3 years
⚠️NEW FROM ME⚠️ I’ve taken a deep dive into the murky world of social media content moderation for @BBCNews How do platforms like @Twitter determine when a post is abusive or hateful? #onlineabuse
9
5
22
1
4
11
@paul_rottger
Paul Röttger
3 years
Very excited to see our work on HateCheck covered by @techreview @_KarenHao 🥳 Paper: ( #ACL2021NLP ) Data:
@techreview
MIT Technology Review
3 years
AI still struggles with one of its most basic applications: censoring harmful language. But distinguishing toxic and innocuous sentences isn't as straightforward for a machine.
Tweet media one
2
6
20
0
1
10
@paul_rottger
Paul Röttger
3 years
The Online Safety team @turinginst is hiring two postdocs! Really great team to work with – please spread the word 📢 1) NLP / data-centric AI for online safety 2) Social science / policy for online harms
0
7
11
@paul_rottger
Paul Röttger
2 years
Also check out @hannahrosekirk 's paper on emoji-based hate speech, Session 8B this Wednesday!
@hannahrosekirk
Hannah Rose Kirk
2 years
I am @ #NAACL2022 for my first **ever** in-person conference 🤩 I'll be talking about hate + emoji = #Hatemoji (Sesh 8, 13/07) Come say hi if you want to swap favourite emojis 🦄🫠🚀❤️‍🔥🆒👨‍🎤🍜or discuss how these little pictures pose challenges for language modelling 🤟
Tweet media one
2
3
60
0
3
10
@paul_rottger
Paul Röttger
2 years
Super excited that our team @rewire_online got spotlighted in the new @DCMS report on the UK Safety Tech sector 🙌🚀 Go to our website to learn more and to get free access to the Rewire API for toxic content detection!
@rewire_online
Rewire (acquired by ActiveFence)
2 years
Fantastic to see @rewire_online included in this new report from @DCMS , on the growth of #OnlineSafetyTech —a recognition of the importance of tackling online harms. @DamianCollins : "Making the online world safer is not only the right thing to do, it’s good for business" 🙌
1
6
7
0
1
10
@paul_rottger
Paul Röttger
2 years
Also, there has been a lot of great work on subjectivity in labelled data from folks like @MaartenSap @aidaa @mitchellgordon @Ginger_in_AI @vinodkpg , who I hope will find this interesting 🤗 Please check out their research!
3
0
9
@paul_rottger
Paul Röttger
2 years
Going forward, I will post more about our work at Rewire. About our best-in-class AI for hate speech detection and free API access. Please get in touch if you have any questions and follow @rewire_online to stay up-to-date! Big things coming soon 🎉
1
0
8
@paul_rottger
Paul Röttger
1 year
Today, at 4pm GST, I will be a panelist at the BoF session on hate speech detection in African languages 🌍 You can still register below, even if you're not at EMNLP!
@nedjmaou
Nedjma Ousidhoum نجمة أوسيدهم
1 year
Join our BoF session on Hate speech detection for African Languages tomorrow (Tue 7 Dec 4 pm GST, 12 pm GMT) @Shmuhammadd will present the AfriHate project + we'll have a panel discussion w/ Adem Chanie Ali @AishatuGwadabe @paul_rottger @seyyaw + an open discussion w/ all attendees
1
2
11
1
3
9
@paul_rottger
Paul Röttger
2 years
Amazing lineup on this panel at #NAACL2022 tomorrow! Could not be more excited for both @DADCworkshop and @WOAHWorkshop 🙌
@DADCworkshop
Dynamic Adversarial Data Collection Workshop
2 years
We have an incredible panel lined up for you tomorrow on the Future of Data Collection! Moderated by @adinamwilliams , panelists include @annargrs , @boydgraber , @sleepinyourhat , @tongshuangwu , @laroyo , @douwekiela & @swabhz 🤩 Post any questions you have for them below! 🚀
Tweet media one
1
14
55
0
1
9
@paul_rottger
Paul Röttger
2 years
It was great to work on this with @debora_nozza , @federicobianchy and @dirk_hovy during my summer visit to @MilaNLProc 🤗 Looking forward to presenting at #EMNLP2022 !
1
3
9
@paul_rottger
Paul Röttger
2 years
Also check out , built by our team @rewire_online , for all other HateCheck-related resources: - the original English HateCheck () - HatemojiCheck (led by @hannahrosekirk )
0
3
9
@paul_rottger
Paul Röttger
1 year
I am full of gratitude for the amazing support and mentorship I received over the last three years. I feel very lucky. Thank you all 🤗
Tweet media one
Tweet media two
2
0
9
@paul_rottger
Paul Röttger
2 years
🛠️ Experimental setup 🛠️ We fine-tuned over 3,000 models (!) on different combinations of English and target-language data. Then, we evaluated all these models on held-out test sets and multilingual HateCheck, using OLS regression (!) to quantify benefits of using more data 📈
Tweet media one
1
1
9
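The regression step described above can be sketched with plain NumPy: fit OLS of held-out performance on (log) amount of target-language data, so the slope quantifies the benefit of adding more data. The (data size, F1) pairs below are synthetic numbers for illustration; the paper fits this over thousands of fine-tuned models.

```python
# Sketch of quantifying data benefits with OLS. The F1 scores here are
# synthetic, illustrative numbers, not results from the paper.
import numpy as np

data_size = np.array([10, 20, 50, 100, 200, 500, 1000, 2000], dtype=float)
f1 = np.array([0.55, 0.58, 0.62, 0.66, 0.69, 0.73, 0.76, 0.79])

# Design matrix: intercept + log10(target-language examples)
X = np.column_stack([np.ones_like(data_size), np.log10(data_size)])
coef, *_ = np.linalg.lstsq(X, f1, rcond=None)
intercept, slope = coef
print(f"F1 ≈ {intercept:.3f} + {slope:.3f} * log10(n_examples)")
```

A positive slope on the log scale captures the usual diminishing-returns pattern: each tenfold increase in target-language data buys roughly a constant F1 gain.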
@paul_rottger
Paul Röttger
7 months
@deliprao Hi @deliprao ! We just released v2 of a dataset to test for exactly this kind of "exaggerated safety". Hope you find it interesting, would love to hear your thoughts!
@paul_rottger
Paul Röttger
7 months
We just released v2 of XSTest, our test suite for "exaggerated safety" in LLMs 🚀 Llama2 and other models often refuse safe prompts just because they superficially resemble unsafe prompts. With XSTest, you can test for this issue in a systematic way. 🧵
Tweet media one
1
15
42
1
3
5
@paul_rottger
Paul Röttger
2 years
⚠️ We argue that dataset creators should consider annotator subjectivity in the annotation process and either explicitly encourage it or discourage it, depending on the intended use of their dataset ⚠️ As a framework, we propose two contrasting data annotation paradigms:
1
0
8
@paul_rottger
Paul Röttger
2 years
@DamianRomero_CL Very cool thread! We just released our NAACL paper about subjectivity in data annotation that is quite relevant to this -- would be curious to hear your thoughts 🤗
@paul_rottger
Paul Röttger
2 years
🥳 New paper accepted at #NAACL2022 (Main) 🥳 NLP tasks like hate speech detection are subjective: annotators disagree about what the correct data labels are. We propose two contrasting paradigms to enable better data annotation. ⬇️ Highlights below ⬇️
8
30
129
1
0
7
@paul_rottger
Paul Röttger
8 months
@JagersbergKnut @andriy_mulyar @MistralAI Thanks, Andriy and Knut. I was indeed using the instruct / chat-tuned version of the model, which is also the one I linked to.
2
0
7
@paul_rottger
Paul Röttger
3 years
It was great to work on HateCheck with @bertievidgen @dongng @zeerakw @helenmargetts and Janet Pierrehumbert 🤗
0
0
6
@paul_rottger
Paul Röttger
2 years
Our #NAACL2022 paper on subjectivity in data annotation is finally live in the ACL Anthology 🙌 Thank you to those who flagged that the wrong PDF was linked 🙏 Check out the paper below 👇
@paul_rottger
Paul Röttger
2 years
🥳 New paper accepted at #NAACL2022 (Main) 🥳 NLP tasks like hate speech detection are subjective: annotators disagree about what the correct data labels are. We propose two contrasting paradigms to enable better data annotation. ⬇️ Highlights below ⬇️
8
30
129
0
1
7
@paul_rottger
Paul Röttger
10 months
Exaggerated safety is likely a problem of lexical overfitting. To understand our prompts, LLMs need to contextualise potentially unsafe words ("kill time"). This is very easy for humans! But LLMs often focus only on unsafe meanings, which is why they refuse even safe prompts.
1
1
7
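The lexical-overfitting failure mode is easy to demonstrate with a toy word-list filter: refusing any prompt that contains an "unsafe" word also refuses perfectly safe prompts like "kill time". The word list and prompts are illustrative; real LLMs refuse via learned associations, not an explicit list, but the effect is analogous.

```python
# Toy illustration of lexical overfitting: refusing on surface keywords
# alone produces exaggerated safety on harmless prompts.

UNSAFE_WORDS = {"kill", "bomb", "smash"}

def lexical_filter(prompt: str) -> str:
    """Refuse if any word in the prompt is on the unsafe list."""
    words = {w.strip("?.,!'").lower() for w in prompt.split()}
    return "refuse" if words & UNSAFE_WORDS else "answer"

print(lexical_filter("How do I kill a person?"))          # refuse (correct)
print(lexical_filter("What's a good way to kill time?"))  # refuse (exaggerated safety)
print(lexical_filter("How do I smash a piñata?"))         # refuse (exaggerated safety)
print(lexical_filter("How do I bake bread?"))             # answer (correct)
```

Telling the middle two cases apart requires contextualising the keyword, which is exactly what the lexically overfitted model fails to do.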
@paul_rottger
Paul Röttger
6 months
Please let me know if I missed anything! And definitely say hi if you want to chat about any of these topics :) Here is the link to the sheet 👇 If you want to filter by topic, click Data -> Filter Views.
1
0
7
@paul_rottger
Paul Röttger
11 months
Very cool new article at the intersection of generative AI and theology by @HannahALucas ! It's not my usual field, but I had a lot of fun discussing earlier drafts, and I learned a lot from the final article! Check it out below 👇
@HannahALucas
Dr Hannah Lucas
11 months
Pleased to say my new article is now online and open access with @Religions_MDPI 🥳 In this paper I look at #AI Text-to-Image models, comparing negative weight prompts to negative language in mystical texts ☁️ 1/
3
11
54
0
1
6
@paul_rottger
Paul Röttger
3 years
For example, all models struggled with reclaimed slurs and counter speech. Misclassifying such content as hateful risks penalising the very communities most commonly targeted by online hate in the first place. It also undermines positive efforts to fight back against online hate.
1
0
6
@paul_rottger
Paul Röttger
2 years
Multilingual HateCheck (MHC) is an expansion of the original English HateCheck (ACL 2021) to Arabic, Dutch, French, German, Hindi, Italian, Mandarin, Polish, Portuguese and Spanish 🌏 This is more languages than any other hate speech dataset!
1
2
5
@paul_rottger
Paul Röttger
1 year
Very excited to develop these ideas into proper papers soon! Shoutout to @hannahrosekirk for great discussions on this, and thank you to @nedjmaou and @michael_sejr for inviting me 🤗
1
0
5
@paul_rottger
Paul Röttger
1 year
@metropolinomix Thanks, M! I would say this is not safe behaviour, and I think people at OpenAI would agree. That is why they put that illustrative example into their System Card section on the potential for risky emergent behaviour. Let me know if you were referring to something else!
0
0
4
@paul_rottger
Paul Röttger
2 years
@Abebab There aren’t many so far, but Lacuna Fund is supporting a new project to create hate speech datasets in 18 African languages! (Disclaimer: I’m involved in this as well)
@LacunaFund
Lacuna Fund
2 years
Today, Lacuna Fund announces awards to 10 teams creating machine learning datasets for low-resourced African languages. Learn more about the selected projects here: Français: Español:
Tweet media one
1
21
43
1
0
5