mlabonne committed
Commit 90608df · verified · 1 Parent(s): f8da067

Upload folder using huggingface_hub

Files changed (3)
  1. .gitattributes +1 -0
  2. README.md +40 -0
  3. qwen3-0.6b-abliterated.q2_k.gguf +3 -0
.gitattributes CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
  *.zip filter=lfs diff=lfs merge=lfs -text
  *.zst filter=lfs diff=lfs merge=lfs -text
  *tfevents* filter=lfs diff=lfs merge=lfs -text
+ qwen3-0.6b-abliterated.q2_k.gguf filter=lfs diff=lfs merge=lfs -text
README.md ADDED
@@ -0,0 +1,40 @@
+ ---
+ library_name: transformers
+ license: apache-2.0
+ license_link: https://huggingface.co/Qwen/Qwen3-0.6B/blob/main/LICENSE
+ pipeline_tag: text-generation
+ base_model:
+ - Qwen/Qwen3-0.6B
+ tags:
+ - abliteration
+ - abliterated
+ - autoquant
+ - gguf
+ ---
+
+ # 🐹 Qwen3-0.6B-abliterated
+
+ ![image/png](https://cdn-uploads.huggingface.co/production/uploads/61b8e2ba285851687028d395/oCo2RbWkYHjVMsuqXpV5V.png)
+
+ <center>Qwen3 Abliterated <a href="https://huggingface.co/mlabonne/Qwen3-0.6B-abliterated">0.6B</a> • <a href="https://huggingface.co/mlabonne/Qwen3-1.7B-abliterated">1.7B</a> • <a href="https://huggingface.co/mlabonne/Qwen3-4B-abliterated">4B</a> • <a href="https://huggingface.co/mlabonne/Qwen3-8B-abliterated">8B</a> • <a href="https://huggingface.co/mlabonne/Qwen3-14B-abliterated">14B</a> • <a href="https://huggingface.co/mlabonne/Qwen3-30B-A3B-abliterated">30B-A3B</a></center>
+
+ This is an uncensored version of [Qwen/Qwen3-0.6B](https://huggingface.co/Qwen/Qwen3-0.6B) created with a new abliteration technique.
+ See [this article](https://huggingface.co/blog/mlabonne/abliteration) to learn more about abliteration.
+
+ This is a research project to understand how refusals and latent fine-tuning work in LLMs.
+ I played with different sizes of Qwen3 and noticed there was no one-size-fits-all abliteration strategy. In addition, the reasoning mode interfered with non-reasoning refusals, which made it more challenging.
+ This made me iterate over different recipes and significantly consolidate my scripts with accumulation and better evaluations.
+
+ Note that this is fairly experimental, so it might not turn out as well as expected.
+
+ I recommend using these generation parameters: `temperature=0.6`, `top_k=20`, `top_p=0.95`, `min_p=0`.
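
These settings map directly onto llama.cpp's sampling flags. A hypothetical invocation (it assumes you have built `llama-cli` and downloaded the GGUF file from this repo; the prompt is just an example):

```shell
# Hypothetical llama.cpp invocation with the recommended sampling parameters.
./llama-cli -m qwen3-0.6b-abliterated.q2_k.gguf \
    --temp 0.6 --top-k 20 --top-p 0.95 --min-p 0 \
    -p "Give me a short introduction to large language models."
```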
+
+ ## ✂️ Abliteration
+
+ The refusal direction is computed by comparing the residual streams between target (harmful) and baseline (harmless) samples.
+ The hidden states of target modules (e.g., `o_proj`) are orthogonalized to subtract this refusal direction with a given weight factor.
+ These weight factors follow a normal distribution with a certain spread and peak layer.
+ Modules can be iteratively orthogonalized in batches, or the refusal direction can be accumulated to save memory.
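
The core operations above can be sketched as follows. This is a minimal illustration, not the actual script: the helper names are hypothetical, and it assumes the refusal direction lives in the module's output space, so projecting it out of a weight matrix removes it from that module's contribution to the residual stream.

```python
import math
import torch

def refusal_direction(harmful_acts: torch.Tensor, harmless_acts: torch.Tensor) -> torch.Tensor:
    """Unit vector from the mean difference of residual-stream activations.

    Both inputs are (n_samples, hidden_dim) activation batches.
    """
    d = harmful_acts.mean(dim=0) - harmless_acts.mean(dim=0)
    return d / d.norm()

def orthogonalize(weight: torch.Tensor, d: torch.Tensor, factor: float = 1.0) -> torch.Tensor:
    """Subtract the component of a module's output along d, scaled by `factor`.

    weight: (out_features, in_features); d: (out_features,) unit vector.
    With factor=1.0, the output of the module has no component along d.
    """
    return weight - factor * torch.outer(d, d @ weight)

def layer_factor(layer: int, peak: int, spread: float) -> float:
    """Per-layer weight factor following a normal-shaped profile over layers."""
    return math.exp(-((layer - peak) ** 2) / (2 * spread ** 2))
```

With `factor=1.0` this is a full projection (for any input `x`, the orthogonalized module's output `W'x` is exactly orthogonal to `d`); smaller per-layer factors from `layer_factor` soften the edit away from the peak layer.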
+
+ Finally, I used a hybrid evaluation with a dedicated test set to calculate the acceptance rate. This uses both a dictionary approach and [NousResearch/Minos-v1](https://huggingface.co/NousResearch/Minos-v1).
+ The goal is to obtain an acceptance rate >90% and still produce coherent outputs.
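
The dictionary half of that hybrid evaluation can be sketched like this (the marker list and helper names are illustrative only, not the actual test set or classifier; the Minos-v1 judge would be combined with this check in the real pipeline):

```python
# Illustrative sketch of a dictionary-based refusal check.
REFUSAL_MARKERS = ["i cannot", "i can't", "i'm sorry", "as an ai"]

def is_refusal(completion: str) -> bool:
    """Flag a completion as a refusal if it contains a known refusal marker."""
    text = completion.lower()
    return any(marker in text for marker in REFUSAL_MARKERS)

def acceptance_rate(completions: list[str]) -> float:
    """Fraction of completions that are not flagged as refusals."""
    accepted = sum(not is_refusal(c) for c in completions)
    return accepted / len(completions)
```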
qwen3-0.6b-abliterated.q2_k.gguf ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:066dcd78cdf0908a1412bd9649e70dfba3f3f87495c7fcd0955bef7b5061ff34
+ size 296238208