A Python application for automating the transfer of sequencing data.
Dataflow Transfer monitors sequencing run directories and orchestrates the transfer of sequencing data via rsync. It supports multiple sequencer types (Illumina, Oxford Nanopore and Element), tracks transfer progress in a CouchDB-based status database, and handles both the continuous transfer phase while sequencing is ongoing and the final transfer phase after sequencing completes.
- Illumina: NextSeq, MiSeqi100, NovaSeqXPlus (WIP), MiSeq (WIP)
- Oxford Nanopore (ONT): PromethION (WIP), MinION (WIP)
- Element: AVITI (WIP)
- Python 3.11+
- Dependencies listed in requirements.txt:
  - PyYAML
  - click
  - xmltodict
  - ibmcloudant
  - run-one
- Clone the repository:
    git clone <repository-url>
    cd dataflow_transfer
- Install the package:
    pip install -e .
  Or with development dependencies:
    pip install -e ".[dev]"

dataflow_transfer [OPTIONS]
- -c, --config-file PATH: Path to configuration YAML file. Defaults to ~/.df_transfer/df_transfer.yaml. Can also be set via the TRANSFER_CONFIG environment variable.
- -r, --run RUN_ID: Transfer a specific run (e.g., 20250528_LH00217_0219_A22TT52LT4). Requires --sequencer.
- -s, --sequencer TYPE: Sequencer type of the run (e.g., NovaSeqXPlus, MiSeq, AVITI). Required with --run.
- --version: Show version and exit.

# Transfer all runs (uses configuration for sequencing directories)
dataflow_transfer
# Transfer a specific run
dataflow_transfer --run 20250528_LH00217_0219_A22TT52LT4 --sequencer NovaSeqXPlus
# Use a custom config file
dataflow_transfer --config-file /path/to/config.yaml

Create a YAML configuration file with the following structure:
log:
  file: /path/to/dataflow_transfer.log
run_one_path: /usr/bin/run-one
transfer_details:
  user: username
  host: remote.host.com
statusdb:
  username: couchdb_user
  password: couchdb_password
  url: couchdb.host.com
  database: sequencing_runs
sequencers:
  NovaSeqXPlus:
    sequencing_path: /sequencing/NovaSeqXPlus
    miarka_destination: /Illumina/NovaSeqXPlus
    metadata_for_statusdb:
      - RunInfo.xml
      - RunParameters.xml
    ignore_folders:
      - nosync
    rsync_options:
      - --chmod=Dg+s,g+rw
  # ... additional sequencer configurations

- Discovery: Scans configured sequencing directories for run folders
- Validation: Confirms run ID matches expected format for the sequencer type
- Transfer Phases:
  - Sequencing Phase: Starts a continuous background rsync transfer while sequencing is ongoing (that is, while the final sequencing file does not exist). Uploads status and metadata files (specified for each sequencer type in the config with metadata_for_statusdb) to the database.
  - Final Transfer: After sequencing completes (the final sequencing file appears), initiates the final rsync transfer and captures its exit code.
- Completion: Updates the database when the transfer was successful.
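
The steps above can be sketched in a few lines of Python. This is a minimal illustration, not the package's actual code: the function names, the use of ignore_folders to skip folders during discovery, and the rsync -a flag are all assumptions. Only the config keys (rsync_options, miarka_destination, user, host) come from the example configuration.

```python
from pathlib import Path


def discover_runs(sequencing_path: Path, ignore_folders: list) -> list:
    """Return run folders under sequencing_path, skipping ignored names.

    Assumption: ignore_folders names subfolders to skip during discovery.
    """
    return sorted(
        p for p in sequencing_path.iterdir()
        if p.is_dir() and p.name not in ignore_folders
    )


def choose_phase(run_dir: Path, final_file_name: str) -> str:
    """Return 'final' once the sequencer's final file exists, else 'sequencing'."""
    return "final" if (run_dir / final_file_name).exists() else "sequencing"


def build_rsync_command(run_dir: Path, sequencer_cfg: dict, transfer_cfg: dict) -> list:
    """Assemble an rsync argument list for one run (hypothetical helper).

    The '-a' flag is an assumption; extra flags come from rsync_options.
    """
    remote = "{}@{}:{}".format(
        transfer_cfg["user"],
        transfer_cfg["host"],
        sequencer_cfg["miarka_destination"],
    )
    return ["rsync", "-a", *sequencer_cfg.get("rsync_options", []), str(run_dir), remote]
```

Returning the command as a list keeps it safe to pass to subprocess without a shell, so run names never need quoting.
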
Run status is tracked in CouchDB with events including:
| Status | Meaning | Occurs when |
|---|---|---|
| sequencing_started | Sequencing is ongoing | A run folder exists but the final sequencing file has not been created yet |
| transfer_started | Intermediate transfer was initiated | Sequencing is ongoing and an rsync has been started |
| sequencing_finished | Sequencing has completed | A run folder exists and the final sequencing file has been created |
| final_transfer_started | Final sync has started | A run folder exists and the final sequencing file has been created, but the final rsync exit code file has not yet been created or contains a non-zero exit code |
| transferred_to_hpc | Transfer completed successfully | A run folder exists, the final sequencing file has been created, and the final rsync exit code file contains a 0 exit code |
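
To make the conditions in the table concrete, here is a rough sketch of how the most recent status could be derived from the files in a run folder. It is a simplification under stated assumptions: the function names are hypothetical, the exit code file is assumed to live inside the run folder, and the rsync_started flag stands in for whatever the tool uses to know that an rsync has been launched.

```python
from pathlib import Path
from typing import Optional


def read_exit_code(exit_code_file: Path) -> Optional[int]:
    """Return the recorded rsync exit code, or None if missing or unreadable."""
    try:
        return int(exit_code_file.read_text().strip())
    except (FileNotFoundError, ValueError):
        return None


def derive_status(
    run_dir: Path,
    final_file_name: str,
    exit_code_file_name: str = "final_rsync_exitcode",
    rsync_started: bool = False,
) -> str:
    """Map the files present in run_dir to the latest status from the table."""
    if not (run_dir / final_file_name).exists():
        # Sequencing is still ongoing.
        return "transfer_started" if rsync_started else "sequencing_started"
    if read_exit_code(run_dir / exit_code_file_name) == 0:
        return "transferred_to_hpc"
    # Final file exists, but there is no successful exit code yet.
    return "final_transfer_started" if rsync_started else "sequencing_finished"
```

A missing, empty, or non-zero exit code file is treated the same way, which matches the table: the run stays in the final transfer state until a 0 is recorded.
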
- dataflow_transfer/run_classes/: Sequencer-specific run classes
- dataflow_transfer/utils/filesystem.py: File operations and rsync handling
- dataflow_transfer/utils/statusdb.py: CouchDB session management with retry logic
- dataflow_transfer/cli.py: Command-line interface

- Run directories are named according to sequencer-specific ID formats (defined in run classes)
- Final completion is indicated by the presence of a sequencer-specific final file (e.g., RTAComplete.txt for Illumina)
- Remote storage is accessible via rsync over SSH
- CouchDB is accessible and the database exists
- Metadata files (e.g., RunInfo.xml) are present in run directories for status database updates
The logic of the script relies on the following status files:
- run.final_file - The final file written by each sequencing machine. Used to indicate when the sequencing has completed.
- final_rsync_exitcode - Used to indicate when the final rsync is done, so that the final rsync can be run in the background. This is especially useful for restarts after long pauses of the cronjob.
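
As an illustration of the exit code file mechanism, the sketch below runs a command and records its exit code in a file. It is not the tool's implementation: run_and_record_exit_code is a hypothetical helper, and it runs the command synchronously for simplicity, whereas the real transfer is described as running in the background.

```python
import subprocess
from pathlib import Path


def run_and_record_exit_code(command: list, exit_code_file: Path) -> int:
    """Run command to completion and write its exit code to exit_code_file.

    A later invocation (for example the next cron run) can read the file
    to learn whether the final transfer finished and whether it succeeded.
    """
    result = subprocess.run(command, check=False)
    exit_code_file.write_text(f"{result.returncode}\n")
    return result.returncode
```

Because the outcome is persisted on disk rather than held in memory, a restarted process does not need to know anything about the process that launched the rsync.
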
pytest

With coverage:
pytest --cov --cov-branch

Run linting and formatting checks:
ruff check .
ruff format --check .

dataflow_transfer/
├── cli.py # Command-line interface
├── dataflow_transfer.py # Main transfer orchestration
├── log/ # Logging utilities
├── run_classes/ # Sequencer-specific run classes
├── utils/ # Utility modules (filesystem, statusdb)
└── tests/ # Unit tests
To add support for a new sequencer, add the following to dataflow_transfer:
- Add a new class for the sequencer in one of the run classes files below. Make sure it inherits from the manufacturer class (IlluminaRun, ElementRun, ONTRun)
  - dataflow_transfer/run_classes/illumina_runs.py
  - dataflow_transfer/run_classes/element_runs.py
  - dataflow_transfer/run_classes/ont_runs.py
- Import the new class in dataflow_transfer/run_classes/__init__.py
- Add a test fixture for the new run in dataflow_transfer/tests/test_run_classes.py and include it in the relevant tests
- Add a section for the sequencer in the config file
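
The first step might look roughly like the sketch below. Everything here is illustrative: the IlluminaRun shown is a stand-in so the example runs on its own (the real base class lives in illumina_runs.py and may differ), and ExampleSeqRun, run_type, run_id_format, and confirm_run_type are hypothetical names chosen for the example.

```python
import re
from pathlib import Path


class IlluminaRun:
    """Stand-in for the manufacturer base class (assumed interface)."""

    run_id_format = r""                # overridden by each sequencer class
    final_file = "RTAComplete.txt"     # final file named in this README

    def __init__(self, run_dir: str):
        self.run_dir = Path(run_dir)
        self.run_id = self.run_dir.name

    def confirm_run_type(self) -> bool:
        """Check that the run folder name matches the expected ID format."""
        return re.fullmatch(self.run_id_format, self.run_id) is not None


class ExampleSeqRun(IlluminaRun):
    """Hypothetical new sequencer: only the type name and ID format differ."""

    run_type = "ExampleSeq"
    # Assumed pattern: date, instrument, run counter, flowcell.
    run_id_format = r"\d{8}_[A-Z0-9]+_\d{4}_[A-Z0-9]+"
```

Keeping the ID format as a class attribute means a new sequencer usually only needs to declare data, while the shared validation logic stays in the manufacturer class.
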