arxiv:2401.01967

A Mechanistic Understanding of Alignment Algorithms: A Case Study on DPO and Toxicity

Published on Jan 3, 2024

Abstract

AI-generated summary

This study investigates how direct preference optimization (DPO) reduces toxicity in a pre-trained language model, revealing that toxic capabilities are bypassed rather than removed, and demonstrates a simple method to revert the model to its toxic behavior.

While alignment algorithms are now commonly used to tune pre-trained language models towards a user's preferences, we lack explanations for the underlying mechanisms by which models become "aligned", making it difficult to explain phenomena like jailbreaks. In this work we study a popular algorithm, direct preference optimization (DPO), and the mechanisms by which it reduces toxicity. Namely, we first study how toxicity is represented and elicited in a pre-trained language model, GPT2-medium. We then apply DPO with a carefully crafted pairwise dataset to reduce toxicity. We examine how the resulting model averts toxic outputs, and find that capabilities learned from pre-training are not removed, but rather bypassed. We use this insight to demonstrate a simple method to un-align the model, reverting it to its toxic behavior.
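The abstract describes applying DPO to a carefully crafted pairwise (preferred vs. toxic) dataset. As a reference point for the objective being optimized, below is a minimal sketch of the standard DPO loss from Rafailov et al. (2023); the function signature, the beta value, the random inputs, and the use of per-sequence summed log-probabilities are illustrative assumptions, not code released with this paper.

import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    # Standard DPO objective: increase the policy/reference log-ratio of the
    # preferred (non-toxic) completion relative to that of the rejected
    # (toxic) completion, scored through a log-sigmoid.
    chosen_logratios = policy_chosen_logps - ref_chosen_logps
    rejected_logratios = policy_rejected_logps - ref_rejected_logps
    return -F.logsigmoid(beta * (chosen_logratios - rejected_logratios)).mean()

# Illustrative usage with random stand-ins for per-sequence log-probabilities
# over a batch of 4 pairs; in practice these would come from the tuned model
# (policy) and a frozen copy of the pre-trained model (reference).
torch.manual_seed(0)
policy_chosen, policy_rejected = torch.randn(4), torch.randn(4)
ref_chosen, ref_rejected = torch.randn(4), torch.randn(4)
print(dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected))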

Models citing this paper: 11

Datasets citing this paper: 0

Spaces citing this paper: 0

Collections including this paper: 2