python run_clm.py --model_name_or_path nferruz/ProtGPT2 --train_file training.txt
```
The HuggingFace script run_clm.py can be found here: https://github.com/huggingface/transformers/blob/master/examples/pytorch/language-modeling/run_clm.py
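
For reference, a fuller fine-tuning command might look like the sketch below. All of the flags are standard options of the HuggingFace `run_clm.py` script, but the validation file, output directory, and learning rate are placeholder assumptions, not values prescribed by this model card:

```
# Hypothetical invocation: file names and hyperparameters are placeholders,
# adjust them to your dataset and hardware.
python run_clm.py --model_name_or_path nferruz/ProtGPT2 \
    --train_file training.txt \
    --validation_file validation.txt \
    --tokenizer_name nferruz/ProtGPT2 \
    --do_train --do_eval \
    --output_dir output \
    --learning_rate 1e-06
```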
### **How to select the best sequences**

We've observed that perplexity values correlate with AlphaFold2's pLDDT scores. The plot below shows perplexity vs. pLDDT for each of the 10,000 sequences in the ProtGPT2-generated dataset:

<div align="center">
<img src="https://huggingface.co/nferruz/ProtGPT2/resolve/main/ppl-plddt.png" width="45%" />
</div>

We recommend computing the perplexity of each sequence with the `perplexity` metric from the HuggingFace `evaluate` library:

```
# Compute the perplexity of each generated sequence under ProtGPT2
from evaluate import load
perplexity = load("perplexity", module_type="metric")
results = perplexity.compute(predictions=predictions, model_id='nferruz/ProtGPT2')
```

where `predictions` is a list containing the generated sequences.

As a rule of thumb, sequences with perplexity values below 72 are more likely to have pLDDT values in line with natural sequences.
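
Putting the pieces together, the following is a minimal sketch that generates candidate sequences and keeps only those under the threshold. The prompt and sampling parameters (`top_k`, `repetition_penalty`, `eos_token_id`, `num_return_sequences`) are illustrative assumptions rather than settings prescribed by this section:

```
from evaluate import load
from transformers import pipeline

# Generate candidate sequences with ProtGPT2. These sampling settings are
# assumptions for illustration; tune them for your use case.
generator = pipeline("text-generation", model="nferruz/ProtGPT2")
outputs = generator("<|endoftext|>", max_length=100, do_sample=True,
                    top_k=950, repetition_penalty=1.2,
                    num_return_sequences=20, eos_token_id=0)
predictions = [out["generated_text"] for out in outputs]

# Score every sequence and keep those below the perplexity threshold of 72.
# The results dict also contains "mean_perplexity" for the whole batch.
perplexity = load("perplexity", module_type="metric")
results = perplexity.compute(predictions=predictions, model_id="nferruz/ProtGPT2")
selected = [seq for seq, ppl in zip(predictions, results["perplexities"])
            if ppl < 72]
print(f"Kept {len(selected)} of {len(predictions)} sequences")
```
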
### **Training specs**