python run_clm.py --model_name_or_path nferruz/ProtGPT2 --train_file training.txt
```
The HuggingFace script run_clm.py can be found here: https://github.com/huggingface/transformers/blob/master/examples/pytorch/language-modeling/run_clm.py
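
For reference, a fuller fine-tuning command might look like the sketch below. All of the flags are standard options of the HuggingFace `run_clm.py` script, but the validation file, output directory, and learning rate are placeholder assumptions, not values prescribed by this model card:

```
# Hypothetical invocation: file names and hyperparameters are placeholders,
# adjust them to your dataset and hardware.
python run_clm.py --model_name_or_path nferruz/ProtGPT2 \
    --train_file training.txt \
    --validation_file validation.txt \
    --tokenizer_name nferruz/ProtGPT2 \
    --do_train --do_eval \
    --output_dir output \
    --learning_rate 1e-06
```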
### **How to select the best sequences**

We've observed that perplexity values correlate with AlphaFold2's pLDDT scores. The plot below shows perplexity vs. pLDDT for each of the 10,000 sequences in the ProtGPT2-generated dataset:

<div align="center">
<img src="https://huggingface.co/nferruz/ProtGPT2/resolve/main/ppl-plddt.png" width="45%" />
</div>

We recommend computing the perplexity of each sequence with the `perplexity` metric from the HuggingFace `evaluate` library:

```
# Compute the perplexity of each generated sequence under ProtGPT2
from evaluate import load
perplexity = load("perplexity", module_type="metric")
results = perplexity.compute(predictions=predictions, model_id='nferruz/ProtGPT2')
```

where `predictions` is a list containing the generated sequences.

As a rule of thumb, sequences with perplexity values below 72 are more likely to have pLDDT values in line with natural sequences.
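
Putting the pieces together, the following is a minimal sketch that generates candidate sequences and keeps only those under the threshold. The prompt and sampling parameters (`top_k`, `repetition_penalty`, `eos_token_id`, `num_return_sequences`) are illustrative assumptions rather than settings prescribed by this section:

```
from evaluate import load
from transformers import pipeline

# Generate candidate sequences with ProtGPT2. These sampling settings are
# assumptions for illustration; tune them for your use case.
generator = pipeline("text-generation", model="nferruz/ProtGPT2")
outputs = generator("<|endoftext|>", max_length=100, do_sample=True,
                    top_k=950, repetition_penalty=1.2,
                    num_return_sequences=20, eos_token_id=0)
predictions = [out["generated_text"] for out in outputs]

# Score every sequence and keep those below the perplexity threshold of 72.
# The results dict also contains "mean_perplexity" for the whole batch.
perplexity = load("perplexity", module_type="metric")
results = perplexity.compute(predictions=predictions, model_id="nferruz/ProtGPT2")
selected = [seq for seq, ppl in zip(predictions, results["perplexities"])
            if ppl < 72]
print(f"Kept {len(selected)} of {len(predictions)} sequences")
```
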
### **Training specs**