Healdette: Secure Multi-Ethnic Antibody Sequence Generation Pipeline

https://doi.org/10.5281/zenodo.17213886

Healdette is a secure and flexible computational pipeline for generating and validating antibody sequences with multi-ethnic support. It integrates ProtGPT2 for sequence generation with BioPython for structural analysis, and uses multi-ethnic HLA frequency data for immunogenicity assessment, with optimizations for population-specific binding motifs.

Features

Core Functionality

Antibody sequence generation using ProtGPT2 with template-based constraints
Multi-ethnic binding motif optimization with population-specific parameters
Comprehensive validation and analysis pipeline

Multi-Interface Support

Modern web interface for easy configuration management
Command-line interface for automation and scripting
Python API for programmatic access

Security Features

Comprehensive input validation and sanitization
CSRF protection and rate limiting
Secure file operations with integrity checks
Detailed security and audit logging
Automated backup system with validation
Population-specific sequence validation parameters:
Celtic:
- Aromatic content: 15-27%
- Hydrophobic content: 35-45%
- Net charge: +5 to +15
Asian:
- Aromatic content: 12-25%
- Hydrophobic content: 30-40%
- Net charge: +3 to +12
Mediterranean:
- Aromatic content: 18-30%
- Hydrophobic content: 32-42%
- Net charge: +4 to +14
Population-specific immunogenicity assessment using HLA frequency data
Biophysical property analysis using BioPython
Structured output in JSON format with detailed analysis results
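
As an illustration of how the population-specific ranges listed above could be applied, the following minimal sketch computes the three metrics for a sequence and checks them against a population's ranges. The residue classes (which amino acids count as aromatic, hydrophobic, or charged), the integer net-charge estimate, and the function names are assumptions made for this example; Healdette's own definitions may differ.

```python
# Residue classes are assumptions for illustration only.
AROMATIC = set("FWY")
HYDROPHOBIC = set("AVILMFWY")
POSITIVE = set("KR")
NEGATIVE = set("DE")

# Ranges copied from the feature list (percent, percent, charge units).
POPULATION_RANGES = {
    "celtic": {"aromatic": (15, 27), "hydrophobic": (35, 45), "net_charge": (5, 15)},
    "asian": {"aromatic": (12, 25), "hydrophobic": (30, 40), "net_charge": (3, 12)},
    "mediterranean": {"aromatic": (18, 30), "hydrophobic": (32, 42), "net_charge": (4, 14)},
}


def sequence_metrics(sequence):
    """Return aromatic %, hydrophobic %, and a simple integer net charge."""
    seq = sequence.upper()
    if not seq:
        raise ValueError("sequence must not be empty")
    n = len(seq)
    return {
        "aromatic": 100.0 * sum(aa in AROMATIC for aa in seq) / n,
        "hydrophobic": 100.0 * sum(aa in HYDROPHOBIC for aa in seq) / n,
        # Count of K/R minus count of D/E; ignores pH and histidine.
        "net_charge": sum(aa in POSITIVE for aa in seq) - sum(aa in NEGATIVE for aa in seq),
    }


def check_population(sequence, population):
    """Map each metric name to True if it falls inside the population's range."""
    metrics = sequence_metrics(sequence)
    ranges = POPULATION_RANGES[population]
    return {name: lo <= metrics[name] <= hi for name, (lo, hi) in ranges.items()}
```

Calling check_population(sequence, "celtic") returns one boolean per metric, which makes it easy to see which range a candidate misses.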

Requirements

Python 3.8 or higher
CUDA-capable GPU (recommended for ProtGPT2)
Required Python packages listed in requirements.txt

Installation

Clone the repository:

git clone https://github.com/Raiff1982/healdette.git
cd healdette

Create and activate a virtual environment:

python -m venv .venv
# On Windows:
.venv\Scripts\activate
# On Unix/MacOS:
source .venv/bin/activate

Install dependencies:

pip install -r requirements.txt

Multi-Ethnic Configuration

Healdette now supports ancestry-weighted validation for multiple ethnic populations. The system uses:

Population-specific binding motifs and parameters
Ancestry weights from genetic analysis
HLA frequency data for immunogenicity assessment

Configuration Structure

Configuration files follow this structure:

{
    "global_params": {
        "sequence_length": {
            "min": 40,
            "max": 70
        },
        "structural_params": {
            "helix_propensity": {
                "min": 20,
                "max": 50
            },
            "sheet_propensity": {
                "min": 10,
                "max": 40
            }
        },
        "homopolymer_threshold": 4
    },
    "populations": {
        "french_german": {
            "ancestry_weight": 0.298,
            "binding_motifs": ["WY", "RF", "KH", "YF"],
            "biophysical_params": {
                "aromatic_content": {
                    "min": 16,
                    "max": 28
                },
                "hydrophobic_content": {
                    "min": 33,
                    "max": 43
                },
                "net_charge": {
                    "min": 4,
                    "max": 13
                }
            },
            "hla_frequencies": {
                "hla_a": {},
                "hla_b": {},
                "hla_c": {}
            }
        }
    }
}
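
Before running the pipeline, a configuration with this structure can be sanity-checked with a few lines of Python. The sketch below is not the project's ConfigValidator; the function name check_config and the rules it applies (weights between 0 and 1, min not above max, at least one population) are assumptions chosen for illustration.

```python
def check_config(config):
    """Return a list of human-readable problems found in a config dict."""
    problems = []

    # Global sequence length bounds, if present, must be ordered.
    length = config.get("global_params", {}).get("sequence_length", {})
    if length and length.get("min", 0) > length.get("max", float("inf")):
        problems.append("global_params.sequence_length: min exceeds max")

    populations = config.get("populations", {})
    if not populations:
        problems.append("populations: at least one population is required")

    for name, pop in populations.items():
        # Assumed rule: ancestry weights are fractions in [0, 1].
        weight = pop.get("ancestry_weight")
        if not isinstance(weight, (int, float)) or not 0 <= weight <= 1:
            problems.append(f"{name}.ancestry_weight: expected a number in [0, 1]")

        # Every biophysical range must have min <= max.
        for param, bounds in pop.get("biophysical_params", {}).items():
            if "min" in bounds and "max" in bounds and bounds["min"] > bounds["max"]:
                problems.append(f"{name}.{param}: min exceeds max")

    return problems
```

An empty list means the sketch found nothing to complain about; it says nothing about checks the real validator may perform.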

Ancestry-Weighted Validation

The validation system considers:

Ancestry Weights: Each population's contribution is weighted by ancestry percentage
Blended Parameters: Biophysical parameters are blended based on ancestry weights
Multiple Binding Motifs: Scores binding motifs from all relevant populations
HLA Compatibility: Considers population-specific HLA frequencies
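
One plausible reading of "blended parameters" is a weighted average of each population's range, with the ancestry weights normalized to sum to 1. The sketch below shows that interpretation only; the function name blend_ranges and the normalization step are assumptions, and the actual blending rule in Healdette may differ.

```python
def blend_ranges(populations, param):
    """Weighted average of a parameter's min/max across populations.

    Assumes each population dict has "ancestry_weight" and an optional
    "biophysical_params" mapping, as in the configuration structure.
    Weights are normalized over the populations that define the parameter.
    """
    entries = [
        (pop["ancestry_weight"], pop["biophysical_params"][param])
        for pop in populations.values()
        if param in pop.get("biophysical_params", {})
    ]
    total = sum(weight for weight, _ in entries)
    if total <= 0:
        raise ValueError(f"no positive ancestry weight for {param!r}")
    return {
        "min": sum(weight * bounds["min"] for weight, bounds in entries) / total,
        "max": sum(weight * bounds["max"] for weight, bounds in entries) / total,
    }
```

With weights 0.298 (range 16-28) and 0.057 (range 14-26), this yields a blended aromatic content range of roughly 15.7 to 27.7, pulled toward the population with the larger weight.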

Population-Specific Parameters

Each population can define:

Binding Motifs: Amino acid pairs crucial for binding
Biophysical Parameters:
- Aromatic content ranges
- Hydrophobic content ranges
- Net charge requirements
HLA Frequencies: Population-specific HLA allele distributions
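
To make the binding-motif parameter concrete, here is a minimal sketch that scores a sequence as the fraction of a population's motifs it contains. This scoring rule and the name motif_score are assumptions for illustration; the source does not specify how Healdette scores motifs.

```python
def motif_score(sequence, motifs):
    """Fraction of the given motifs that occur at least once in the sequence."""
    if not motifs:
        return 0.0
    seq = sequence.upper()
    return sum(motif in seq for motif in motifs) / len(motifs)
```

Under this rule, a sequence containing three of the four motifs "WY", "RF", "KH", "YF" scores 0.75.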

Usage

Create a configuration file following the schema (see examples/ directory):

{
    "global_params": {
        "sequence_length": {
            "min": 40,
            "max": 70
        }
    },
    "populations": {
        "french_german": {
            "ancestry_weight": 0.298,
            "binding_motifs": ["WY", "RF", "KH", "YF"],
            "biophysical_params": {
                "aromatic_content": {
                    "min": 16,
                    "max": 28
                }
            }
        },
        "finnish": {
            "ancestry_weight": 0.057,
            "binding_motifs": ["WH", "RF", "KY", "FF"],
            "biophysical_params": {
                "aromatic_content": {
                    "min": 14,
                    "max": 26
                }
            }
        }
    }
}

Validate sequences using the weighted validator:

from modules.weighted_validator import WeightedSequenceValidator
from modules.config_validator import ConfigValidator

# Sequence to validate (placeholder; replace with a real candidate sequence)
sequence = "YOUR_ANTIBODY_SEQUENCE"

# Load and validate configuration
config_validator = ConfigValidator()
config = "path/to/config.json"
if config_validator.validate_file(config)['valid']:
    # Create validator with ancestry-weighted parameters
    validator = WeightedSequenceValidator(sequence, config)
    
    # Get detailed validation results
    results = validator.validate_sequence()
    
    # Check population-specific scores
    pop_scores = results['population_scores']
    for pop, score in pop_scores.items():
        print(f"{pop}: {score['score']} (weight: {score['weight']})")

Run the pipeline with multi-ethnic configuration:

python main.py config.json output.json --num-candidates 15

Example Configurations

Complete example configurations are available in the examples/ directory:

european_populations_config.json: Configuration for European population clusters
multi_ethnic_config.json: General multi-ethnic configuration template
celtic_test_input.json: Celtic-specific test configuration

Understanding Validation Results

The weighted validator provides detailed results:

{
    "valid": true,
    "warnings": [],
    "metrics": {
        "aromatic_content": 22.5,
        "hydrophobic_content": 38.2,
        "binding_motifs": {
            "scores": {
                "french_german": {"score": 0.75, "weighted_score": 0.223},
                "finnish": {"score": 0.5, "weighted_score": 0.029}
            },
            "total_score": 0.252
        }
    },
    "population_scores": {
        "french_german": {
            "score": 0.8,
            "weight": 0.298
        },
        "finnish": {
            "score": 0.6,
            "weight": 0.057
        }
    }
}
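
The numbers in the binding_motifs block of this example are consistent with each weighted_score being the raw score multiplied by the population's ancestry weight, and total_score being their sum. The sketch below reproduces that arithmetic; the relationship is inferred from the example values rather than confirmed by the source, and the function name is hypothetical.

```python
def weighted_motif_scores(raw_scores, weights):
    """Weight each population's raw motif score by its ancestry weight."""
    per_population = {
        pop: {"score": score, "weighted_score": score * weights[pop]}
        for pop, score in raw_scores.items()
    }
    # Assumed rule: the total is the plain sum of the weighted scores.
    total = sum(entry["weighted_score"] for entry in per_population.values())
    return {"scores": per_population, "total_score": total}


result = weighted_motif_scores(
    {"french_german": 0.75, "finnish": 0.5},
    {"french_german": 0.298, "finnish": 0.057},
)
print(round(result["total_score"], 3))
```

The unrounded weighted scores are 0.2235 and 0.0285, which sum to the total_score of 0.252 shown in the example.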

Excerpt from an input configuration (global validation settings):

{
    ...
    "num_sequences": 10,
    "global_validation_params": {
        "min_sequence_length": 40,
        "max_sequence_length": 70,
        "allow_homopolymers": false,
        "structure_requirements": {
            "helix_propensity": {
                "min": 0.2,
                "max": 0.5
            },
            "sheet_propensity": {
                "min": 0.1,
                "max": 0.4
            }
        }
    }
}

Run the pipeline:

python main.py --config input_config.json

Output Files

The pipeline generates two types of output files in the output directory:

Detailed JSON output (antibody_designs_{timestamp}.json):
- Generated antibody sequences with framework and CDR regions
- Celtic binding motif analysis
- Biophysical properties (hydrophobicity, charge, stability)
- Aromatic content and distribution
- Population-specific immunogenicity scores
- Validation results against therapeutic antibodies
Summary report (antibody_summary_{timestamp}.txt):
- Key metrics for each generated sequence
- Celtic motif occurrence statistics
- Population coverage statistics
- Validation summary
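
For scripting against these outputs, the following sketch finds the most recent detailed JSON file and loads it. The file name pattern comes from the description above; the helper names are hypothetical, the timestamp format is not assumed (files are ordered by modification time instead), and nothing is assumed about the structure inside the JSON.

```python
import json
from pathlib import Path


def latest_design_file(output_dir):
    """Return the most recently modified antibody_designs_*.json, or None."""
    candidates = sorted(
        Path(output_dir).glob("antibody_designs_*.json"),
        key=lambda path: path.stat().st_mtime,
    )
    return candidates[-1] if candidates else None


def load_latest_designs(output_dir):
    """Load the newest detailed output file as parsed JSON."""
    path = latest_design_file(output_dir)
    if path is None:
        raise FileNotFoundError(f"no antibody_designs_*.json in {output_dir}")
    with path.open(encoding="utf-8") as handle:
        return json.load(handle)
```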

Reproducibility

To reproduce the results:

Use the same random seed for ProtGPT2:

import torch
torch.manual_seed(42)

Ensure consistent data sources:
- HLA frequency data: NetMHCpan 4.1 database
- Therapeutic antibody dataset: THERAb database v2.0
- Framework templates: IMGT database
- Celtic binding motif templates: Custom database
Run validation tests:

python -m unittest discover tests
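
The seeding step above covers PyTorch only. If other random number generators are involved in a run, a helper along the following lines seeds them together. This is a sketch, not part of Healdette: the name seed_everything is hypothetical, and NumPy and PyTorch are seeded only when they are installed.

```python
import random


def seed_everything(seed=42):
    """Seed Python's RNG, plus NumPy and PyTorch when they are available."""
    random.seed(seed)

    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass  # NumPy not installed; nothing to seed.

    try:
        import torch
        torch.manual_seed(seed)
        if torch.cuda.is_available():
            torch.cuda.manual_seed_all(seed)
    except ImportError:
        pass  # PyTorch not installed; nothing to seed.

    return seed
```

Call seed_everything(42) once at the start of a run, before any sequences are generated. Identical seeds do not guarantee identical output across different hardware or library versions.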

License

MIT License. See LICENSE file for details.

Citation

If you use this software in your research, please cite:

@software{healdette2025,
  title = {Healdette: Celtic-Optimized Antibody Generation Pipeline},
  author = {Harrison, Jonathan and others},
  year = {2025},
  version = {1.0.0},
  url = {https://github.com/Raiff1982/healdette}
}

Harrison, J. (2025). Healdette: A Population-Aware Antibody Design Pipeline. GitHub repository: https://github.com/Raiff1982/healdette


Author

Jonathan Harrison (Raiff1982)
