PINK ELEPHANTS! 🐘 Now, don’t think about it.
Chatbots also find this supremely difficult.
Ask one of the most popular open source models NOT to talk about pink elephants, and it will fail 34% of the time.
In our new paper, we address this problem.
1/N
Telling a language model not to mention something can, paradoxically, increase the odds that it does. Similarly, as Gary Marcus noticed, when DALL-E 3 is prompted to draw a room without elephants, it consistently adds elephants to the image. 2/N
We define the Pink Elephant Problem as follows: given a Pink Elephant (a topic to avoid) and a Grey Elephant (a preferred alternative), the model should discuss the Grey Elephant whenever the Pink Elephant is brought up. 3/N
For example, if you deploy a bot that gives students information about British universities, say because you run a company that helps students apply to British universities, it's probably not the best decision for that bot to help students apply to American universities. 4/N
We also present Direct Principle Feedback (DPF) as a way to address this. Rather than relying on reranking, we use the before/after of a revision as a pairwise preference. 5/N
Producing quality pairwise preferences with and without Pink Elephants becomes easy with DPF, as the revision step lets us filter and control the removal of the Pink Elephant directly. 6/N
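To make 5/N and 6/N concrete, here is a minimal sketch of DPF pair generation. This is not the paper's actual code: `chat` stands in for any chat-completion call, and the principle and filter are toy examples.

```python
# Toy sketch of DPF data generation; `chat` is a stand-in, not an API from the paper.

PINK = "pink elephant"
PRINCIPLE = f"Do not mention the {PINK}; talk about the grey elephant instead."

def chat(messages: list[dict]) -> str:
    """Stand-in for any chat-completion call (e.g., an OpenAI-style client)."""
    raise NotImplementedError

def dpf_pair(dialogue: list[dict]) -> dict | None:
    # 1. Sample the original (possibly principle-violating) response.
    original = chat(dialogue)
    # 2. Ask the model to revise its response so that it follows the principle.
    revised = chat(dialogue + [
        {"role": "assistant", "content": original},
        {"role": "user",
         "content": f"Revise your last reply so that it satisfies: {PRINCIPLE}"},
    ])
    # 3. Filter: keep the pair only if the revision actually removed the
    #    Pink Elephant (the "filter and control its removal" step in 6/N).
    if PINK in revised.lower():
        return None
    # 4. The revision is "chosen", the original is "rejected":
    #    the pair is ranked by construction, no reranking needed.
    return {"prompt": dialogue, "chosen": revised, "rejected": original}
```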
DPF is a simplification of common RLHF pipelines based on Constitutional AI: we skip the sampling-and-ranking step by noticing that the original generation and the revised generation form a naturally ranked pair that can be plugged directly into a preference-learning method. 7/N
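Because each pair arrives pre-ranked, it can drop straight into an off-the-shelf preference learner such as DPO. A minimal sketch using HuggingFace trl (argument names vary across trl versions; the model id and hyperparameters are assumptions, and the rows come from the sketch above):

```python
from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

# Rows produced by dpf_pair() above; the strings here are placeholders.
pairs = Dataset.from_list([
    {"prompt": "...", "chosen": "revised reply", "rejected": "original reply"},
])

model_name = "teknium/OpenHermes-13b"  # HF id assumed for the model named in 8/N
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

trainer = DPOTrainer(
    model=model,
    args=DPOConfig(output_dir="dpf-dpo", beta=0.1),
    train_dataset=pairs,
    processing_class=tokenizer,  # `tokenizer=` in older trl versions
)
trainer.train()
```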
We show that by applying DPF to OpenHermes-13B, our model, when instructed, avoids the Pink Elephant almost as well as GPT-4 does! Notice the “With Prompt” column. 8/N
By having fine-grained control over pairwise preference generation, we open the door to a new set of approachable RLAIF problems. DPF can readily be used for tool-assisted RLAIF, as rewriting utterances with the use of tools becomes trivial with DPF! 9/N
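A hypothetical sketch of that tool-assisted variant, reusing the `chat` stand-in from the first sketch: the revision is conditioned on a tool result, so the "chosen" side of each pair is grounded in the tool's output.

```python
def tool_assisted_pair(dialogue: list[dict], tool_output: str) -> dict:
    """Hypothetical tool-assisted DPF; `chat` is the stand-in defined above."""
    original = chat(dialogue)  # unassisted draft
    revised = chat(dialogue + [
        {"role": "assistant", "content": original},
        {"role": "user",
         "content": f"Rewrite your last reply so it is consistent with "
                    f"this tool output: {tool_output}"},
    ])
    return {"prompt": dialogue, "chosen": revised, "rejected": original}
```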
We're hiring for all roles.
Open-science work we have in progress:
1) RLAIF for pretraining (we're building open-source datasets).
2) Benchmarks, benchmarks, benchmarks.
3) Collaborating with @AiEleuther on some awesome projects.
Work with us.