Uppaal/gpt2-ProFS-toxicity

Text Generation

activation-steering

activation-editing

text-generation-inference


Uppaal committed on Nov 7

Commit 5a5c635 (verified) · 1 Parent(s): 3be52b5

Update README.md

Files changed (1)

README.md +1 -1

@@ -51,7 +51,7 @@ ProFS (Projection Filter for Subspaces) is a tuning-free alignment method that r
 - Theoretically grounded: shown to be a denoised, single-step approximation of Direct Preference Optimization (DPO)—bridging editing-based and tuning-based alignment.
 <div align="center">
-<img src="https://github.com/Uppaal/detox-edit/blob/main/assets/ProFS Method.png" width="450"/>
+<img src="ProFS Method.png" width="950">
 <i><b>Figure.</b> Schematic of ProFS (previously called DeTox). Toxic directions (in red) are projected out of the model’s MLP-value matrices, leaving other representational directions intact. </i>
 </div>
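
Note on the caption: the phrase "projected out" can be illustrated with a small linear-algebra sketch. The code below is not taken from the ProFS repository. It assumes that the weight matrix stores one value vector per row and that a set of "toxic" direction vectors is already available; how ProFS actually finds those directions is not shown in this diff. The names `project_out`, `W`, and `toxic` are hypothetical.

```python
import numpy as np

def project_out(W, directions):
    """Remove the span of `directions` from every row of W.

    W:          (n_vectors, d) matrix, one value vector per row (assumed layout).
    directions: (k, d) matrix of directions to remove; need not be orthonormal.
    """
    # Orthonormalize the directions so that Q @ Q.T is an orthogonal projector.
    Q, _ = np.linalg.qr(directions.T)  # Q has shape (d, k)
    # Subtract each row's component inside the subspace spanned by Q.
    return W - (W @ Q) @ Q.T

# Toy data standing in for a real weight matrix and real toxic directions.
rng = np.random.default_rng(0)
W = rng.normal(size=(8, 5))
toxic = rng.normal(size=(2, 5))
W_edited = project_out(W, toxic)
```

After the edit, every row of `W_edited` is orthogonal to the removed directions, while any row that was already orthogonal to them is left unchanged. That is the sense in which other representational directions stay intact.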