A Python project to generate hierarchical trees of all PANGO lineages of SARS-CoV-2 using data from the official PANGO designation repository.
- Fetches up-to-date PANGO lineage data from GitHub
- Uses the pango_aliasor package to determine parent-child relationships
- Builds hierarchical relationships between all SARS-CoV-2 lineages
- Generates phylogenetic trees in Newick format
- Creates separate trees for recombinant lineages (containing 'X')
- Visualizes trees using Biopython's Phylo package and matplotlib
- Saves all data and results in organized directories
- Python 3.7+
- pango_aliasor
- biopython
- requests
- matplotlib
pip install pango_aliasor biopython requests matplotlib
pip install -r requirements.txt
cd pango_tree_project
python src/main.py
# Show help
python src/main.py --help
# Disable visualization plots (faster execution)
python src/main.py --no-plots
# Custom output directory
python src/main.py --output-dir my_results
# Custom data directory
python src/main.py --data-dir raw_data
# Verbose output
python src/main.py --verbose
# Custom DPI for plots
python src/main.py --dpi 150
# Limit number of recombinant plots
python src/main.py --max-recombinant-plots 3
# Combined example
python src/main.py --output-dir results --no-plots --verbose
- With plots: 1-3 minutes (depending on system and number of recombinant plots)
- Without plots (--no-plots): 30-60 seconds
- Runtime depends on internet connection speed and system performance
pango_tree_project/
├── data/
│   ├── lineage_notes.txt  # Raw lineage data from GitHub
│   └── alias_key.json  # Alias mapping data
├── output/
│   ├── A_group/  # Group A lineages
│   │   ├── A_tree.newick  # Phylogenetic tree for group A
│   │   └── A_lineages.txt  # List of all A group lineages
│   ├── B_group/  # Group B lineages (includes Omicron)
│   │   ├── B_tree.newick  # Phylogenetic tree for group B
│   │   └── B_lineages.txt  # List of all B group lineages
│   ├── XBB_group/  # XBB recombinant group
│   │   ├── XBB_tree.newick  # Phylogenetic tree for XBB group
│   │   └── XBB_lineages.txt  # List of all XBB group lineages
│   ├── ... (176 other recombinant groups)
│   ├── global_tree/  # Integrated global tree
│   │   ├── global_tree.newick  # Global tree with A, B, and integrated recombinants
│   │   └── recombinant_attachments.txt  # Attachment points for each recombinant group
│   └── visualizations/  # PNG visualizations of trees
│       ├── A_group.png  # Visualization of A group tree
│       ├── B_group.png  # Visualization of B group tree
│       ├── XBB_group.png  # Visualization of XBB group tree
│       └── global_tree.png  # Visualization of integrated global tree
└── src/
    └── main.py  # Main script
The project now organizes lineages into biologically meaningful groups:
A Group (output/A_group/)
- Contains all lineages descending from the original A lineage
- Includes 34 lineages with proper hierarchical relationships
- Represents early SARS-CoV-2 diversity
B Group (output/B_group/)
- Contains all lineages descending from the original B lineage
- Includes 3,736 lineages including all major variants
- Includes Alpha, Beta, Gamma, Delta, and Omicron (BA.1, BA.2, etc.)
- Most comprehensive group with complex hierarchies
Recombinant Groups (output/X*/)
- 179 separate directories, one per recombinant lineage group
- Each contains a phylogenetic tree and lineage list
- Examples: XBB (941 lineages), XDV (215 lineages), XFG (289 lineages)
- Proper hierarchical organization within each group
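The grouping described above can be sketched as a split on the top-level component of each uncompressed lineage name. This is a minimal illustration, not the project's actual code: the function name `group_by_root` is hypothetical, and it assumes names have already been uncompressed (for example "B.1.1.529.1" rather than "BA.1").

```python
from collections import defaultdict


def group_by_root(uncompressed_names):
    """Group lineage names by their top-level component.

    Hypothetical helper: assumes every name is already uncompressed,
    so the text before the first "." is the root (A, B, XBB, ...).
    """
    groups = defaultdict(list)
    for name in uncompressed_names:
        root = name.split(".", 1)[0]
        # Directory-style key matching the output layout, e.g. "XBB_group"
        groups[f"{root}_group"].append(name)
    return dict(groups)


names = ["A", "A.1", "B.1.1.529.1", "XBB", "XBB.1.5"]
groups = group_by_root(names)
for group_name, members in groups.items():
    print(group_name, len(members))
```

Grouping on the uncompressed root means that aliased descendants of a recombinant land in the same group as the recombinant itself.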
NEW FEATURE: Integrated global phylogenetic tree that combines:
- Main lineages A and B as primary branches
- 159 recombinant groups attached to their biological origins
- Scientifically meaningful organization based on parental relationships
- Root node "SARS-CoV-2" for comprehensive representation
global_tree.newick: Comprehensive tree in Newick format
recombinant_attachments.txt: Documentation of where each recombinant group is attached
A_group.png: Visualization of A group tree
B_group.png: Visualization of B group tree (very large!)
XBB_group.png: Visualization of XBB recombinant group
global_tree.png: Visualization of integrated global tree
- High-resolution PNG images (configurable DPI)
- Suitable for publication or presentation
The script fetches data from two primary sources:
- Lineage Notes: https://raw.githubusercontent.com/cov-lineages/pango-designation/refs/heads/master/lineage_notes.txt
  - Contains all officially designated PANGO lineages
  - Includes descriptions and metadata
- Alias Key (alias_key.json)
  - Maps alias names to full lineage names
  - Used for proper lineage identification
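As a rough sketch of the download-and-parse step, the snippet below fetches a text file and turns lineage notes into a dictionary. It makes several assumptions that are not confirmed by this README: the function names are hypothetical, the file is taken to be tab-separated with one header line, and names prefixed with "*" are taken to mark withdrawn lineages. It uses the standard library's urllib so that it stays self-contained, whereas the project itself lists requests as a dependency.

```python
from urllib.request import urlopen

LINEAGE_NOTES_URL = (
    "https://raw.githubusercontent.com/cov-lineages/pango-designation/"
    "refs/heads/master/lineage_notes.txt"
)


def fetch_text(url, timeout=30):
    """Download a text resource and return it as a string."""
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")


def parse_lineage_notes(text):
    """Return {lineage: description} from lineage notes text.

    Assumed format: a header line, then "<lineage><TAB><description>"
    per line. Names starting with "*" are assumed to be withdrawn
    lineages and are skipped.
    """
    lineages = {}
    for line in text.splitlines()[1:]:  # skip the header line
        if not line.strip():
            continue
        name, _, description = line.partition("\t")
        name = name.strip()
        if not name or name.startswith("*"):
            continue
        lineages[name] = description.strip()
    return lineages
```

A typical call would be `parse_lineage_notes(fetch_text(LINEAGE_NOTES_URL))`. Keeping the parser separate from the download makes it possible to test the parsing on a small in-memory sample without network access.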
The script provides several command-line options for customization:
| Option | Description | Default |
|---|---|---|
| --output-dir | Directory for all output files | output |
| --data-dir | Directory for downloaded data files | data |
| --no-plots | Disable visualization generation | False |
| --max-recombinant-plots | Number of recombinant plots to generate | 5 |
| --dpi | DPI for plot images | 300 |
| --verbose | Enable detailed output | False |
| --help | Show help message | - |
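For readers who want to see how such options are typically wired up, here is a minimal argparse sketch that mirrors the flags and defaults in the table. It is an assumption that src/main.py uses argparse at all. The function name `build_parser` and the help strings are illustrative only.

```python
import argparse


def build_parser():
    """Build a parser mirroring the documented options (illustrative only)."""
    parser = argparse.ArgumentParser(
        description="Generate hierarchical trees of PANGO lineages."
    )
    parser.add_argument("--output-dir", default="output",
                        help="Directory for all output files")
    parser.add_argument("--data-dir", default="data",
                        help="Directory for downloaded data files")
    parser.add_argument("--no-plots", action="store_true",
                        help="Disable visualization generation")
    parser.add_argument("--max-recombinant-plots", type=int, default=5,
                        help="Number of recombinant plots to generate")
    parser.add_argument("--dpi", type=int, default=300,
                        help="DPI for plot images")
    parser.add_argument("--verbose", action="store_true",
                        help="Enable detailed output")
    return parser  # argparse adds --help automatically


# Parse an explicit argument list instead of sys.argv for demonstration
args = build_parser().parse_args(["--no-plots", "--dpi", "150"])
print(args.no_plots, args.dpi, args.output_dir)  # True 150 output
```

Flags declared with `action="store_true"` default to False, which matches the defaults listed for --no-plots and --verbose.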
For faster execution on systems with limited resources:
python src/main.py --no-plots --max-recombinant-plots 0
This skips all visualization steps and runs in ~30 seconds.
For automated pipelines, use custom directories to avoid conflicts:
python src/main.py --output-dir batch_$(date +%Y%m%d) --no-plots
- Uses pango_aliasor.Aliasor().parent() to determine parent-child relationships
- Handles both compressed (e.g., "BA.1") and uncompressed (e.g., "B.1.1.529.1") lineage names
- Automatically identifies recombinant lineages by presence of 'X' in the name
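To make the alias handling concrete, the sketch below shows the idea behind uncompressing a name and finding its parent. It is a standard-library stand-in, not the pango_aliasor API: the small `ALIAS_KEY` table and the helper names are hypothetical, and unlike this sketch the real package may return parents in compressed form. The recombinant check here looks at the uncompressed name, which is one possible reading of "presence of 'X' in the name".

```python
# Tiny illustrative alias table; the real alias_key.json is far larger.
ALIAS_KEY = {
    "A": "",
    "B": "",
    "BA": "B.1.1.529",
    "EG": "XBB.1.9.2",
    "XBB": ["BJ.1", "BM.1.1.1"],  # recombinants map to a list of parents
}


def uncompress(name, alias_key=ALIAS_KEY):
    """Expand an aliased name, e.g. "BA.1" -> "B.1.1.529.1"."""
    prefix, _, rest = name.partition(".")
    target = alias_key.get(prefix, "")
    # Root lineages map to "" and recombinants to a list: leave both as written
    if not isinstance(target, str) or not target:
        return name
    return f"{target}.{rest}" if rest else target


def parent(name, alias_key=ALIAS_KEY):
    """Return the uncompressed parent, or None for a top-level lineage."""
    full = uncompress(name, alias_key)
    if "." not in full:
        return None
    return full.rsplit(".", 1)[0]


def is_recombinant(name, alias_key=ALIAS_KEY):
    """True if the uncompressed name starts with "X"."""
    return uncompress(name, alias_key).startswith("X")


print(uncompress("BA.1"))      # B.1.1.529.1
print(parent("BA.1"))          # B.1.1.529
print(is_recombinant("EG.5"))  # True
```

Checking the uncompressed form means an alias such as "EG.5", which carries no "X" in its compressed name, is still recognised as belonging to a recombinant group.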
- Creates Newick format strings recursively
- Uses Biopython's Phylo.read() and Phylo.write() for tree I/O
- Generates separate trees for recombinant lineages due to their complex origins
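The recursive Newick construction mentioned above can be illustrated with a short, self-contained sketch. The helper names `build_children` and `to_newick` are hypothetical, the input is assumed to be a mapping from each lineage to its parent, and branch lengths are omitted for brevity.

```python
from collections import defaultdict


def build_children(parent_of):
    """Invert a {child: parent} mapping into {parent: [children]}."""
    children = defaultdict(list)
    for child, par in parent_of.items():
        if par is not None:
            children[par].append(child)
    return children


def to_newick(node, children):
    """Recursively render the subtree rooted at `node` (no trailing ";")."""
    kids = sorted(children.get(node, []))
    if not kids:
        return node  # leaf: just the label
    inner = ",".join(to_newick(kid, children) for kid in kids)
    return f"({inner}){node}"  # internal node: children first, then label


parent_of = {"B": None, "B.1": "B", "B.1.1": "B.1", "B.2": "B"}
newick = to_newick("B", build_children(parent_of)) + ";"
print(newick)  # ((B.1.1)B.1,B.2)B;
```

A string in this form should be readable with Biopython via `Phylo.read(io.StringIO(newick), "newick")`. For very deep hierarchies, plain recursion may approach Python's recursion limit, in which case an iterative traversal would be a safer choice.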
- Uses matplotlib for high-quality PNG output
- Main tree visualization uses a 20x10 inch figure for readability
- Recombinant visualizations use 10x5 inch figures
- All visualizations include proper titles and tight layout
- Recombinant lineages are treated as individual trees rather than integrated into the main tree
- Tree branch lengths are not biologically meaningful (all set to 0.0)
- Visualizations of the largest trees (for example the B group) may be too large to read comfortably
- Integrate recombinant lineages into main tree with proper parent relationships
- Add biological meaning to branch lengths using mutation data
- Implement interactive visualization using Plotly or similar
- Add command-line arguments for customization
- Implement caching to avoid re-downloading data
- Add progress bars for long-running operations
This project is open source and available under the MIT License.
- COVID-19 Genomics UK (COG-UK) consortium for PANGO lineage system
- Developers of pango_aliasor, Biopython, and other dependencies
- GitHub for hosting the PANGO designation data