neherlab/pango-tree


PANGO Lineage Tree Generator

A Python project to generate hierarchical trees of all PANGO lineages of SARS-CoV-2 using data from the official PANGO designation repository.

Features

  • Fetches up-to-date PANGO lineage data from GitHub
  • Uses pango_aliasor package to determine parent-child relationships
  • Builds hierarchical relationships between all SARS-CoV-2 lineages
  • Generates phylogenetic trees in Newick format
  • Creates separate trees for recombinant lineages (containing 'X')
  • Visualizes trees using Biopython's Phylo package and matplotlib
  • Saves all data and results in organized directories

Requirements

  • Python 3.7+
  • pango_aliasor
  • biopython
  • requests
  • matplotlib

Installation

Using pip

pip install pango_aliasor biopython requests matplotlib

Using the provided requirements file

pip install -r requirements.txt

Usage

Basic usage

cd pango_tree_project
python src/main.py

Command-line options

# Show help
python src/main.py --help

# Disable visualization plots (faster execution)
python src/main.py --no-plots

# Custom output directory
python src/main.py --output-dir my_results

# Custom data directory
python src/main.py --data-dir raw_data

# Verbose output
python src/main.py --verbose

# Custom DPI for plots
python src/main.py --dpi 150

# Limit number of recombinant plots
python src/main.py --max-recombinant-plots 3

# Combined example
python src/main.py --output-dir results --no-plots --verbose

Expected runtime

  • With plots: 1-3 minutes (depending on system and number of recombinant plots)
  • Without plots (--no-plots): 30-60 seconds
  • Runtime depends on internet connection speed and system performance

Output Structure

pango_tree_project/
├── data/
│   ├── lineage_notes.txt          # Raw lineage data from GitHub
│   └── alias_key.json             # Alias mapping data
├── output/
│   ├── A_group/                  # Group A lineages
│   │   ├── A_tree.newick          # Phylogenetic tree for group A
│   │   └── A_lineages.txt         # List of all A group lineages
│   ├── B_group/                  # Group B lineages (includes Omicron)
│   │   ├── B_tree.newick          # Phylogenetic tree for group B
│   │   └── B_lineages.txt         # List of all B group lineages
│   ├── XBB_group/                # XBB recombinant group
│   │   ├── XBB_tree.newick        # Phylogenetic tree for XBB group
│   │   └── XBB_lineages.txt       # List of all XBB group lineages
│   ├── ... (176 other recombinant groups)
│   ├── global_tree/              # Integrated global tree
│   │   ├── global_tree.newick     # Global tree with A, B, and integrated recombinants
│   │   └── recombinant_attachments.txt  # Attachment points for each recombinant group
│   └── visualizations/            # PNG visualizations of trees
│       ├── A_group.png           # Visualization of A group tree
│       ├── B_group.png           # Visualization of B group tree
│       ├── XBB_group.png         # Visualization of XBB group tree
│       └── global_tree.png       # Visualization of integrated global tree
└── src/
    └── main.py                    # Main script

Output Details

Group Trees

The project now organizes lineages into biologically meaningful groups:

A Group (output/A_group/)

  • Contains all lineages descending from the original A lineage
  • Includes 34 lineages with proper hierarchical relationships
  • Represents early SARS-CoV-2 diversity

B Group (output/B_group/)

  • Contains all lineages descending from the original B lineage
  • Includes 3,736 lineages including all major variants
  • Includes Alpha, Beta, Gamma, Delta, and Omicron (BA.1, BA.2, etc.)
  • Most comprehensive group with complex hierarchies

Recombinant Groups (output/X*/)

  • 179 separate directories for each recombinant lineage group
  • Each contains a phylogenetic tree and lineage list
  • Examples: XBB (941 lineages), XDV (215 lineages), XFG (289 lineages)
  • Proper hierarchical organization within each group

Global Tree (output/global_tree/)

NEW FEATURE: Integrated global phylogenetic tree that combines:

  • Main lineages A and B as primary branches
  • 159 recombinant groups attached to their biological origins
  • Scientifically meaningful organization based on parental relationships
  • Root node "SARS-CoV-2" for comprehensive representation

  • global_tree.newick: Comprehensive tree in Newick format
  • recombinant_attachments.txt: Documentation of where each recombinant group is attached

Visualizations (output/visualizations/)

  • A_group.png: Visualization of A group tree
  • B_group.png: Visualization of B group tree (very large!)
  • XBB_group.png: Visualization of XBB recombinant group
  • global_tree.png: Visualization of integrated global tree
  • High-resolution PNG images (configurable DPI)
  • Suitable for publication or presentation

Data Sources

The script fetches data from two primary sources:

  1. Lineage Notes: https://raw.githubusercontent.com/cov-lineages/pango-designation/refs/heads/master/lineage_notes.txt
    • Contains all officially designated PANGO lineages
    • Includes descriptions and metadata
  2. Alias Key: https://raw.githubusercontent.com/cov-lineages/pango-designation/refs/heads/master/pango_designation/alias_key.json
    • Maps alias names to full lineage names
    • Used for proper lineage identification
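The download-and-parse step described above could look roughly like the following sketch. It is not the code in src/main.py: the helper names (http_get, download_sources, parse_lineage_notes) are hypothetical, urllib from the standard library is swapped in for requests to keep the example dependency-free, and the assumed layout of lineage_notes.txt (a tab-separated "Lineage / Description" file in which withdrawn lineages carry a leading "*") should be checked against the current file.

```python
from pathlib import Path
from urllib.request import urlopen

BASE = "https://raw.githubusercontent.com/cov-lineages/pango-designation/refs/heads/master"
SOURCES = {
    "lineage_notes.txt": f"{BASE}/lineage_notes.txt",
    "alias_key.json": f"{BASE}/pango_designation/alias_key.json",
}


def http_get(url):
    """Fetch a URL and return its body as text (stdlib stand-in for requests.get)."""
    with urlopen(url, timeout=30) as response:
        return response.read().decode("utf-8")


def download_sources(data_dir, fetch=http_get):
    """Save both source files into data_dir and return their paths by filename.

    `fetch` is injectable so the function can be exercised without network access.
    """
    out = Path(data_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = {}
    for filename, url in SOURCES.items():
        path = out / filename
        path.write_text(fetch(url), encoding="utf-8")
        paths[filename] = path
    return paths


def parse_lineage_notes(text):
    """Return {lineage: description}, skipping the header row and withdrawn entries.

    Assumption: each line is 'name<TAB>description' and withdrawn names start with '*'.
    """
    lineages = {}
    for line in text.splitlines():
        name, _, description = line.partition("\t")
        name = name.strip()
        if not name or name == "Lineage" or name.startswith("*"):
            continue
        lineages[name] = description.strip()
    return lineages
```

Calling download_sources("data") would write both files under data/, matching the layout shown in Output Structure.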

Command-Line Interface

The script provides several command-line options for customization:

Main Options

Option                    Description                                Default
--output-dir              Directory for all output files             output
--data-dir                Directory for downloaded data files        data
--no-plots                Disable visualization generation           False
--max-recombinant-plots   Number of recombinant plots to generate    5
--dpi                     DPI for plot images                        300
--verbose                 Enable detailed output                     False
--help                    Show help message                          -
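For orientation, the options in the table could be declared with argparse along these lines. This is a sketch under the assumption that src/main.py uses argparse; the function name build_parser and the help strings are illustrative, while the flag names and defaults are taken from the table.

```python
import argparse


def build_parser():
    """Build a parser mirroring the documented options (--help is added by argparse)."""
    parser = argparse.ArgumentParser(description="Generate PANGO lineage trees")
    parser.add_argument("--output-dir", default="output",
                        help="Directory for all output files")
    parser.add_argument("--data-dir", default="data",
                        help="Directory for downloaded data files")
    parser.add_argument("--no-plots", action="store_true",
                        help="Disable visualization generation")
    parser.add_argument("--max-recombinant-plots", type=int, default=5,
                        help="Number of recombinant plots to generate")
    parser.add_argument("--dpi", type=int, default=300,
                        help="DPI for plot images")
    parser.add_argument("--verbose", action="store_true",
                        help="Enable detailed output")
    return parser
```

argparse converts dashes to underscores, so the parsed values are available as args.output_dir, args.no_plots, args.max_recombinant_plots, and so on.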

Performance Optimization

For faster execution on systems with limited resources:

python src/main.py --no-plots --max-recombinant-plots 0

This skips all visualization steps; expect a runtime of roughly 30-60 seconds, depending on connection speed.

Batch Processing

For automated pipelines, use custom directories to avoid conflicts:

python src/main.py --output-dir batch_$(date +%Y%m%d) --no-plots

Technical Details

Hierarchy Construction

  • Uses pango_aliasor.Aliasor().parent() to determine parent-child relationships
  • Handles both compressed (e.g., "BA.1") and uncompressed (e.g., "B.1.1.529.1") lineage names
  • Automatically identifies recombinant lineages by presence of 'X' in the name
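To illustrate the idea behind parent lookup, the sketch below expands an alias and drops the last numeric component. It is a simplified stand-in, not the pango_aliasor implementation: the functions uncompress and parent here are hypothetical, the alias map is assumed to follow the alias_key.json convention (plain strings for aliases, empty strings for A and B, lists of parents for recombinants), and the parent is returned in uncompressed form, which may differ from what the package returns.

```python
from typing import Dict, List, Optional, Union

AliasKey = Dict[str, Union[str, List[str]]]


def uncompress(name: str, alias_key: AliasKey) -> str:
    """Expand the leading alias of a lineage name, e.g. 'BA.1' -> 'B.1.1.529.1'."""
    head, _, rest = name.partition(".")
    target = alias_key.get(head)
    # Roots (A, B) map to "" and recombinants map to a list of parents:
    # in both cases the name is already in its full form.
    if not isinstance(target, str) or not target:
        return name
    return f"{target}.{rest}" if rest else target


def parent(name: str, alias_key: AliasKey) -> Optional[str]:
    """Return the uncompressed parent lineage, or None for a root such as 'B' or 'XBB'."""
    full = uncompress(name, alias_key)
    if "." not in full:
        return None
    return full.rsplit(".", 1)[0]
```

With an alias map containing "BA": "B.1.1.529", parent("BA.1", alias_key) yields "B.1.1.529", and a recombinant root such as "XBB" yields None, which is where a separate group tree would start.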

Tree Generation

  • Creates Newick format strings recursively
  • Uses Biopython's Phylo.read() and Phylo.write() for tree I/O
  • Generates separate trees for recombinant lineages due to their complex origins
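A recursive Newick serializer of the kind described above can be sketched in a few lines. The function name to_newick and the children mapping (parent name to list of child names) are assumptions for illustration; branch lengths are omitted here, whereas the project sets them to 0.0.

```python
from typing import Dict, List


def to_newick(node: str, children: Dict[str, List[str]]) -> str:
    """Serialize the subtree rooted at `node` as a Newick fragment (no trailing ';')."""
    kids = sorted(children.get(node, []))
    if not kids:
        return node  # leaf: just the label
    inner = ",".join(to_newick(kid, children) for kid in kids)
    return f"({inner}){node}"  # internal node: children in parentheses, then the label


children = {"B": ["B.1"], "B.1": ["B.1.2", "B.1.1"]}
newick = to_newick("B", children) + ";"
print(newick)  # ((B.1.1,B.1.2)B.1)B;
```

The resulting string could then be handed to Biopython, for example via Phylo.read(io.StringIO(newick), "newick"), for writing or plotting.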

Visualization

  • Uses matplotlib for high-quality PNG output
  • Main tree visualization uses a 20x10 inch figure for readability
  • Recombinant visualizations use 10x5 inch figures
  • All visualizations include proper titles and tight layout

Limitations and Future Improvements

Current Limitations

  • Recombinant groups are still written as separate per-group trees; they are combined with A and B only in the integrated global tree (output/global_tree/)
  • Tree branch lengths are not biologically meaningful (all set to 0.0)
  • Visualizations of the largest trees (for example the B group, with 3,736 lineages) may be too large to read comfortably as a single image

Potential Future Enhancements

  • Integrate recombinant lineages into main tree with proper parent relationships
  • Add biological meaning to branch lengths using mutation data
  • Implement interactive visualization using Plotly or similar
  • Add further command-line arguments for customization
  • Implement caching to avoid re-downloading data
  • Add progress bars for long-running operations

License

This project is open source and available under the MIT License.

Acknowledgments

  • COVID-19 Genomics UK (COG-UK) consortium for the PANGO lineage system
  • Developers of pango_aliasor, Biopython, and other dependencies
  • GitHub for hosting the PANGO designation data
