Healdette: Secure Multi-Ethnic Antibody Sequence Generation Pipeline
https://doi.org/10.5281/zenodo.17213886
A secure and flexible computational pipeline for generating and validating antibody sequences with multi-ethnic support. The pipeline integrates ProtGPT2 for sequence generation with BioPython for structural analysis and includes multi-ethnic HLA frequency data for immunogenicity assessment, with optimizations for various population-specific binding motifs.
Features
Core Functionality
- Antibody sequence generation using ProtGPT2 with template-based constraints
- Multi-ethnic binding motif optimization with population-specific parameters
- Comprehensive validation and analysis pipeline
Multi-Interface Support
- Modern web interface for easy configuration management
- Command-line interface for automation and scripting
- Python API for programmatic access
Security Features
- Comprehensive input validation and sanitization
- CSRF protection and rate limiting
- Secure file operations with integrity checks
- Detailed security and audit logging
- Automated backup system with validation
- Population-specific sequence validation parameters (see the sketch after this list):
Celtic:
- Aromatic content: 15-27%
- Hydrophobic content: 35-45%
- Net charge: +5 to +15
Asian:
- Aromatic content: 12-25%
- Hydrophobic content: 30-40%
- Net charge: +3 to +12
Mediterranean:
- Aromatic content: 18-30%
- Hydrophobic content: 32-42%
- Net charge: +4 to +14
- Population-specific immunogenicity assessment using HLA frequency data
- Biophysical property analysis using BioPython
- Structured output in JSON format with detailed analysis results
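As a rough illustration of how these composition ranges can be checked, the following minimal sketch (not the pipeline's actual validator) computes aromatic content, hydrophobic content, and an approximate net charge with BioPython and tests them against the Celtic ranges listed above. The hydrophobic residue set and the count-based charge estimate are simplifying assumptions.

from Bio.SeqUtils.ProtParam import ProteinAnalysis

# Celtic ranges from the list above (percent, percent, charge units).
CELTIC_RANGES = {"aromatic": (15, 27), "hydrophobic": (35, 45), "net_charge": (5, 15)}

AROMATIC = set("FWY")
HYDROPHOBIC = set("AVLIMFWC")  # assumed hydrophobic residue set

def biophysical_profile(seq):
    """Compute simple composition metrics for an amino-acid sequence."""
    percents = ProteinAnalysis(seq).get_amino_acids_percent()  # fractions per residue
    aromatic = 100 * sum(percents.get(aa, 0.0) for aa in AROMATIC)
    hydrophobic = 100 * sum(percents.get(aa, 0.0) for aa in HYDROPHOBIC)
    # Crude net charge at neutral pH: positives (K, R) minus negatives (D, E).
    net_charge = seq.count("K") + seq.count("R") - seq.count("D") - seq.count("E")
    return {"aromatic": aromatic, "hydrophobic": hydrophobic, "net_charge": net_charge}

def within_ranges(seq, ranges=CELTIC_RANGES):
    """Return True if every metric falls inside its configured range."""
    profile = biophysical_profile(seq)
    return all(lo <= profile[name] <= hi for name, (lo, hi) in ranges.items())

if __name__ == "__main__":
    candidate = "EVQLVESGGGLVQPGGSLRLSCAASGFTFSRYWMSWVRQAPGKGLEWVK"  # arbitrary example
    print(biophysical_profile(candidate))
    print("within Celtic ranges:", within_ranges(candidate))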
Requirements
- Python 3.8 or higher
- CUDA-capable GPU (recommended for ProtGPT2)
- Required Python packages listed in requirements.txt
Installation
- Clone the repository:
git clone https://github.com/Raiff1982/healdette.git
cd healdette
- Create and activate a virtual environment:
python -m venv .venv
# On Windows:
.venv\Scripts\activate
# On Unix/MacOS:
source .venv/bin/activate
- Install dependencies:
pip install -r requirements.txt
Multi-Ethnic Configuration
Healdette now supports ancestry-weighted validation for multiple ethnic populations. The system uses:
- Population-specific binding motifs and parameters
- Ancestry weights from genetic analysis
- HLA frequency data for immunogenicity assessment
Configuration Structure
Configuration files follow this structure:
{
  "global_params": {
    "sequence_length": {
      "min": 40,
      "max": 70
    },
    "structural_params": {
      "helix_propensity": {
        "min": 20,
        "max": 50
      },
      "sheet_propensity": {
        "min": 10,
        "max": 40
      }
    },
    "homopolymer_threshold": 4
  },
  "populations": {
    "french_german": {
      "ancestry_weight": 0.298,
      "binding_motifs": ["WY", "RF", "KH", "YF"],
      "biophysical_params": {
        "aromatic_content": {
          "min": 16,
          "max": 28
        },
        "hydrophobic_content": {
          "min": 33,
          "max": 43
        },
        "net_charge": {
          "min": 4,
          "max": 13
        }
      },
      "hla_frequencies": {
        "hla_a": {},
        "hla_b": {},
        "hla_c": {}
      }
    }
  }
}
Ancestry-Weighted Validation
The validation system considers:
- Ancestry Weights: Each population's contribution is weighted by ancestry percentage
- Blended Parameters: Biophysical parameters are blended based on ancestry weights
- Multiple Binding Motifs: Scores binding motifs from all relevant populations
- HLA Compatibility: Considers population-specific HLA frequencies
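To make the blending idea concrete, here is a minimal sketch. The populations, weights, and ranges mirror the configuration examples in this README; the blending rule itself is an illustrative assumption, not the exact formula used by the validator.

# Ancestry-weighted blending of a biophysical range across two populations.
populations = {
    "french_german": {"weight": 0.298, "aromatic_content": (16, 28)},
    "finnish": {"weight": 0.057, "aromatic_content": (14, 26)},
}

# Normalise the ancestry weights so they sum to 1 before blending.
total_weight = sum(p["weight"] for p in populations.values())

blended_min = sum(p["weight"] * p["aromatic_content"][0] for p in populations.values()) / total_weight
blended_max = sum(p["weight"] * p["aromatic_content"][1] for p in populations.values()) / total_weight

print(f"Blended aromatic content range: {blended_min:.1f}-{blended_max:.1f}%")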
Population-Specific Parameters
Each population can define:
- Binding Motifs: Amino acid pairs crucial for binding
- Biophysical Parameters:
- Aromatic content ranges
- Hydrophobic content ranges
- Net charge requirements
- HLA Frequencies: Population-specific HLA allele distributions
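For the motif component, one simple way to picture the scoring (the real scoring function may differ) is to take the fraction of a population's motif pairs found in the candidate sequence and scale it by that population's ancestry weight. With the french_german motifs and weight used in the examples in this README, the sketch below yields the 0.75 motif score shown in the validation results example later on.

# Illustrative motif scoring: fraction of a population's motifs present in the
# sequence, scaled by its ancestry weight. Not the validator's exact formula.
def motif_score(sequence, motifs, ancestry_weight):
    hits = sum(1 for motif in motifs if motif in sequence)
    score = hits / len(motifs) if motifs else 0.0
    return {"score": round(score, 3), "weighted_score": round(score * ancestry_weight, 3)}

seq = "EVQLWYSGGGRFLVQPGKHSLRLSC"  # arbitrary example sequence
print(motif_score(seq, ["WY", "RF", "KH", "YF"], 0.298))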
Usage
- Create a configuration file following the schema (see the examples/ directory):
{
  "global_params": {
    "sequence_length": {
      "min": 40,
      "max": 70
    }
  },
  "populations": {
    "french_german": {
      "ancestry_weight": 0.298,
      "binding_motifs": ["WY", "RF", "KH", "YF"],
      "biophysical_params": {
        "aromatic_content": {
          "min": 16,
          "max": 28
        }
      }
    },
    "finnish": {
      "ancestry_weight": 0.057,
      "binding_motifs": ["WH", "RF", "KY", "FF"],
      "biophysical_params": {
        "aromatic_content": {
          "min": 14,
          "max": 26
        }
      }
    }
  }
}
- Validate sequences using the weighted validator:
from modules.weighted_validator import WeightedSequenceValidator
from modules.config_validator import ConfigValidator

# Candidate antibody sequence to validate (placeholder example)
sequence = "EVQLVESGGGLVQPGGSLRLSCAASGFTFSRYWMSWVRQAPGKGLEWVK"

# Load and validate the configuration file
config_validator = ConfigValidator()
config = "path/to/config.json"

if config_validator.validate_file(config)['valid']:
    # Create a validator with ancestry-weighted parameters
    validator = WeightedSequenceValidator(sequence, config)

    # Get detailed validation results
    results = validator.validate_sequence()

    # Check population-specific scores
    pop_scores = results['population_scores']
    for pop, score in pop_scores.items():
        print(f"{pop}: {score['score']} (weight: {score['weight']})")
- Run the pipeline with multi-ethnic configuration:
python main.py config.json output.json --num-candidates 15
Example Configurations
Complete example configurations are available in the examples/ directory:
- european_populations_config.json: Configuration for European population clusters
- multi_ethnic_config.json: General multi-ethnic configuration template
- celtic_test_input.json: Celtic-specific test configuration
Understanding Validation Results
The weighted validator provides detailed per-sequence results along with global counts and validation parameters (only the first entry of the results array is shown):
{
  "sequences": [
    {
      "valid": true,
      "warnings": [],
      "metrics": {
        "aromatic_content": 22.5,
        "hydrophobic_content": 38.2,
        "binding_motifs": {
          "scores": {
            "french_german": {"score": 0.75, "weighted_score": 0.223},
            "finnish": {"score": 0.5, "weighted_score": 0.029}
          },
          "total_score": 0.252
        }
      },
      "population_scores": {
        "french_german": {
          "score": 0.8,
          "weight": 0.298
        },
        "finnish": {
          "score": 0.6,
          "weight": 0.057
        }
      }
    }
  ],
  "num_sequences": 10,
  "global_validation_params": {
    "min_sequence_length": 40,
    "max_sequence_length": 70,
    "allow_homopolymers": false,
    "structure_requirements": {
      "helix_propensity": {
        "min": 0.2,
        "max": 0.5
      },
      "sheet_propensity": {
        "min": 0.1,
        "max": 0.4
      }
    }
  }
}
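A small helper along these lines can load such a results file and summarise the weighted population scores per sequence; the field names follow the example above, and the exact layout of the real output may differ:

import json

def summarise_results(path):
    """Print weighted population scores for each validated sequence."""
    with open(path) as fh:
        results = json.load(fh)

    for i, entry in enumerate(results.get("sequences", []), start=1):
        status = "valid" if entry.get("valid") else "invalid"
        print(f"Sequence {i}: {status}")
        for pop, scores in entry.get("population_scores", {}).items():
            print(f"  {pop}: score={scores['score']} weight={scores['weight']}")

summarise_results("output.json")  # output path from the pipeline run above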
- Run the pipeline:
python main.py --config input_config.json
Output Files
The pipeline generates two types of output files in the output directory:
Detailed JSON output (antibody_designs_{timestamp}.json):
- Generated antibody sequences with framework and CDR regions
- Celtic binding motif analysis
- Biophysical properties (hydrophobicity, charge, stability)
- Aromatic content and distribution
- Population-specific immunogenicity scores
- Validation results against therapeutic antibodies
Summary report (antibody_summary_{timestamp}.txt):
- Key metrics for each generated sequence
- Celtic motif occurrence statistics
- Population coverage statistics
- Validation summary
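Because the filenames are timestamped, a short sketch like this can pick up the most recent detailed output; the output/ directory name is an assumption based on the description above:

import glob
import json
import os

# Find the newest detailed designs file; adjust the directory if your
# pipeline writes its results elsewhere.
paths = glob.glob(os.path.join("output", "antibody_designs_*.json"))
if paths:
    latest = max(paths, key=os.path.getmtime)
    with open(latest) as fh:
        designs = json.load(fh)
    print(f"Loaded {latest}")
else:
    print("No antibody_designs_*.json files found under output/")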
Reproducibility
To reproduce the results:
- Use the same random seed for ProtGPT2 (a full generation sketch follows this list):
import torch
torch.manual_seed(42)
- Ensure consistent data sources:
- HLA frequency data: NetMHCpan 4.1 database
- Therapeutic antibody dataset: THERAb database v2.0
- Framework templates: IMGT database
- Celtic binding motif templates: Custom database
- Run validation tests:
python -m unittest discover tests
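Putting the seeding step together with generation, a minimal sketch of reproducible ProtGPT2 sampling through the Hugging Face transformers pipeline might look like this; the decoding parameters are illustrative, not Healdette's actual generation settings, and the template-based constraints described above are applied on top of the raw model output:

import torch
from transformers import pipeline, set_seed

# Fix both the torch RNG and the transformers helper seed for reproducibility.
torch.manual_seed(42)
set_seed(42)

# Load the ProtGPT2 base model as a standard text-generation pipeline.
generator = pipeline("text-generation", model="nferruz/ProtGPT2")

# Illustrative sampling settings; adjust to match your configuration.
outputs = generator(
    "<|endoftext|>",
    max_length=100,
    do_sample=True,
    top_k=950,
    repetition_penalty=1.2,
    num_return_sequences=3,
)

for i, out in enumerate(outputs, start=1):
    print(f">generated_{i}")
    print(out["generated_text"].replace("\n", ""))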
License
MIT License. See LICENSE file for details.
Citation
If you use this software in your research, please cite:
@software{healdette2025,
  title   = {Healdette: Celtic-Optimized Antibody Generation Pipeline},
  author  = {Harrison, Jonathan},
  year    = {2025},
  version = {1.0.0},
  doi     = {10.5281/zenodo.17213886},
  url     = {https://github.com/Raiff1982/healdette}
}
Harrison, J. (2025). Healdette: A Population-Aware Antibody Design Pipeline. GitHub repository: https://github.com/Raiff1982/healdette
## Author
Jonathan Harrison (Raiff1982)