ProLigGPT

A generative drug design model based on GPT-2.

🚩 Introduction

ProLigGPT presents a ligand design strategy based on the autoregressive GPT model, focusing on exploration of chemical space and the discovery of ligands for specific proteins. Deep learning language models have shown significant potential across domains including protein design and biomedical text analysis, motivating the development of ProLigGPT.

In this study, we train the ProLigGPT model on a large corpus of protein-ligand binding data, aiming to discover novel molecules that bind specific proteins. This strategy not only significantly improves the efficiency of ligand design but also offers a fast and effective route through the drug development process, opening new possibilities for the pharmaceutical domain.

πŸ“₯ Deployment

  1. Clone
    git clone https://github.com/LIYUESEN/ProLigGPT.git
    cd ProLigGPT
    
    Alternatively, visit our GitHub repo and click Code > Download ZIP to download this repo.
  2. Create virtual environment
    conda create -n proliggpt python=3.8
    conda activate proliggpt
    
  3. Install Python dependencies
    pip install torch --index-url https://download.pytorch.org/whl/cu117
    pip install datasets==3.1.0 transformers==4.46.3 scipy==1.10.1 scikit-learn==1.3.2 psutil==7.0.0
    conda install conda-forge/label/cf202003::openbabel
    

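After installation, you can confirm that the required packages are importable with a short check. This is a minimal sketch, not part of ProLigGPT itself; the package list simply mirrors the pip/conda commands above (note that scikit-learn is imported as `sklearn`):

```python
import importlib.util

def find_missing(packages):
    """Return the subset of package names that cannot be imported."""
    return [name for name in packages if importlib.util.find_spec(name) is None]

# Import names corresponding to the install commands above.
required = ["torch", "datasets", "transformers", "scipy", "sklearn", "psutil", "openbabel"]
missing = find_missing(required)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All dependencies found.")
```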
πŸ— How to use

Use proliggpt.py

Required parameters:

  • -p | --pro_seq: Input a protein amino acid sequence.

  • -f | --fasta: Input a FASTA file containing one protein amino acid sequence.

    Exactly one of -p and -f must be specified.

  • -l | --ligand_prompt: Input a ligand prompt (a partial SMILES string for generation to continue from).

  • -n | --number: The expected number of molecules to be generated.

  • -d | --device: Hardware device to be used. Default is 'cuda'.

  • -o | --output: Output directory for generated molecules. Default is './ligand_output/'.

  • -b | --batch_size: The number of molecules generated per batch. Reduce this value if you run out of memory. Default is 16.

  • -t | --temperature: Adjusts the randomness of text generation; higher values produce more diverse outputs. Default is 1.0.

  • --top_k: The number of highest probability tokens to be considered for top-k sampling. Default is 9.

  • --top_p: The cumulative probability threshold (0.0 - 1.0) for top-p (nucleus) sampling. Only the smallest set of tokens whose cumulative probability exceeds this threshold is considered for sampling. Default is 0.9.

  • --min_atoms: Minimum number of non-H atoms allowed for generation. Default is None.

  • --max_atoms: Maximum number of non-H atoms allowed for generation. Default is 35.

  • --no_limit: Disable the default max atoms limit.

    If the -l | --ligand_prompt option is used, the --max_atoms and --min_atoms parameters will be disregarded.
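The sampling parameters interact in sequence: logits are scaled by temperature, restricted to the top-k most probable tokens, then further restricted to the smallest nucleus whose cumulative probability reaches top_p. The following stdlib sketch illustrates this filtering order; it is for explanation only (ProLigGPT's actual sampling is handled by the transformers library):

```python
import math
import random

def sample_token(logits, temperature=1.0, top_k=9, top_p=0.9, rng=random):
    """Sample one token index using temperature, then top-k, then top-p filtering."""
    # Temperature scaling: higher temperature flattens the distribution.
    scaled = [l / temperature for l in logits]
    # Softmax over the scaled logits (shifted by the max for numerical stability).
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [(i, e / total) for i, e in enumerate(exps)]
    # Top-k: keep only the k most probable tokens.
    probs.sort(key=lambda ip: ip[1], reverse=True)
    probs = probs[:top_k]
    # Top-p (nucleus): keep the smallest prefix whose cumulative mass reaches top_p.
    kept, cum = [], 0.0
    for i, p in probs:
        kept.append((i, p))
        cum += p
        if cum >= top_p:
            break
    # Renormalize over the surviving tokens and draw one.
    mass = sum(p for _, p in kept)
    r = rng.random() * mass
    for i, p in kept:
        r -= p
        if r <= 0:
            return i
    return kept[-1][0]
```

With top_k=1 or a very small top_p, sampling becomes effectively greedy; raising temperature spreads probability mass across more tokens, producing more diverse molecules.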

🌎 Run in Google Colab

Open in Colab

The Colab notebook contains detailed usage instructions.

πŸ”¬ Example usage

  • If you want to input a protein FASTA file

    python proliggpt.py -f PK3CA.fasta -n 50
    
  • If you want to input the amino acid sequence of the protein

    python proliggpt.py -p MPPRPSSGELWGIHLMPPRILVECLLPNGMIVTLECLREATLITIKHELFKEARKYPLHQLLQDESSYIFVSVTQEAEREEFFDETRRLCDLRLFQPFLKVIEPVGNREEKILNREIGFAIGMPVCEFDMVKDPEVQDFRRNILNVCKEAVDLRDLNSPHSRAMYVYPPNVESSPELPKHIYNKLDKGQIIVVIWVIVSPNNDKQKYTLKINHDCVPEQVIAEAIRKKTRSMLLSSEQLKLCVLEYQGKYILKVCGCDEYFLEKYPLSQYKYIRSCIMLGRMPNLMLMAKESLYSQLPMDCFTMPSYSRRISTATPYMNGETSTKSLWVINSALRIKILCATYVNVNIRDIDKIYVRTGIYHGGEPLCDNVNTQRVPCSNPRWNEWLNYDIYIPDLPRAARLCLSICSVKGRKGAKEEHCPLAWGNINLFDYTDTLVSGKMALNLWPVPHGLEDLLNPIGVTGSNPNKETPCLELEFDWFSSVVKFPDMSVIEEHANWSVSREAGFSYSHAGLSNRLARDNELRENDKEQLKAISTRDPLSEITEQEKDFLWSHRHYCVTIPEILPKLLLSVKWNSRDEVAQMYCLVKDWPPIKPEQAMELLDCNYPDPMVRGFAVRCLEKYLTDDKLSQYLIQLVQVLKYEQYLDNLLVRFLLKKALTNQRIGHFFFWHLKSEMHNKTVSQRFGLLLESYCRACGMYLKHLNRQVEAMEKLINLTDILKQEKKDETQKVQMKFLVEQMRRPDFMDALQGFLSPLNPAHQLGNLRLEECRIMSSAKRPLWLNWENPDIMSELLFQNNEIIFKNGDDLRQDMLTLQIIRIMENIWQNQGLDLRMLPYGCLSIGDCVGLIEVVRNSHTIMQIQCKGGLKGALQFNSHTLHQWLKDKNKGEIYDAAIDLFTRSCAGYCVATFILGIGDRHNSNIMVKDDGQLFHIDFGHFLDHKKKKFGYKRERVPFVLTQDFLIVISKGAQECTKTREFERFQEMCYKAYLAIRQHANLFINLFSMMLGSGMPELQSFDDIAYIRKTLALDKTEQEALEYFMKQMNDAHHGGWTTKMDWIFHTIKQHALN -n 50
    
  • If you want to provide a prompt for the ligand

    python proliggpt.py -f PK3CA.fasta -l CC1=C(SC(=N1)NC(=O)N2CCCC2C(=O)N) -n 50
    
  • Note: If you are running in a Linux shell, enclose the ligand prompt in single quotes, since characters such as parentheses in SMILES strings are otherwise interpreted by the shell.

    python proliggpt.py -f PK3CA.fasta -l 'CC1=C(SC(=N1)NC(=O)N2CCCC2C(=O)N)' -n 50
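Before passing a ligand prompt, a quick sanity check on balanced parentheses, brackets, and ring-closure digits can catch shell-quoting mistakes that truncate the SMILES string. This is a rough stdlib heuristic of our own, not full SMILES validation (use Open Babel or RDKit for that):

```python
def looks_like_valid_smiles(s):
    """Rough heuristic: branches/brackets balanced and ring-closure digits paired."""
    depth = 0    # '(' ... ')' branch nesting
    bracket = 0  # '[' ... ']' atom blocks
    ring_counts = {}
    for ch in s:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False
        elif ch == "[":
            bracket += 1
        elif ch == "]":
            bracket -= 1
            if bracket < 0:
                return False
        elif ch.isdigit() and bracket == 0:
            # Ring-closure digits outside brackets must appear in open/close pairs.
            ring_counts[ch] = ring_counts.get(ch, 0) + 1
    return depth == 0 and bracket == 0 and all(n % 2 == 0 for n in ring_counts.values())
```

For example, the prompt from the command above passes, while a string cut short by a mis-quoted shell argument fails.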
    

πŸ“ How to reference this work

TODO

βš– License

GNU General Public License v3.0
