diff --git a/.github/workflows/nf-test.yml b/.github/workflows/nf-test.yml index 3a48a15..6e96bc0 100644 --- a/.github/workflows/nf-test.yml +++ b/.github/workflows/nf-test.yml @@ -76,7 +76,7 @@ jobs: - isMain: false profile: "singularity" NXF_VER: - - "25.10.4" + - "25.10.2" - "latest-everything" env: NXF_ANSI_LOG: false diff --git a/CLAUDE.md b/CLAUDE.md index 3538185..832e8f4 100644 --- a/CLAUDE.md +++ b/CLAUDE.md @@ -4,61 +4,70 @@ This file provides guidance to Claude Code (claude.ai/code) when working with co ## Project overview -DomainBenchmark is a Nextflow pipeline for benchmarking domain-domain interaction (DDIs) methods with protein data. It runs ML classifiers and graph-based models across multiple database splits, then produces a combined MultiQC evaluation report. - -- Root (`main.nf` / `wrapper.nf`) — training, graph models, evaluation. +DomainBenchmark is a Nextflow DSL2 pipeline for benchmarking domain-domain interaction (DDI) prediction methods. Built from the `nf-core/tools 4.0.2` template. For each database split it runs feature extraction → ML classifiers (RF, NN) → graph-based models (KGIDDI, DDIParsimony) → MultiQC evaluation, then aggregates across splits. ## Common commands ```bash -# full run across all db splits in nextflow.config, then combined eval -bash wrapper.sh +# full run across all database splits in the samplesheet +nextflow run . --input assets/samplesheet.csv -profile slurm,singularity -resume -# single-database run (results in results/) -nextflow run main.nf +# stub run (smoke test) +nextflow run . -profile test,singularity -stub-run -# combined evaluation only (after multiple main.nf runs) -nextflow run wrapper.nf --report_list --out_dir results +# single-database run via direct param +nextflow run . --input assets/samplesheet.csv -profile slurm,singularity --skip kgiddi,ddiparsimony -# profiles: standard (local, default), slurm -nextflow run main.nf -profile slurm -resume +# lint +nf-core pipelines lint --dir . +# nf-test +nf-test test tests/default.nf.test ``` -`wrapper.sh` reads `params.db_list` and `params.out_dir` from `nextflow.config`, calls `main.nf` for each database, then runs `wrapper.nf` to combine reports. Supported CLI overrides: `-profile`, `-c`, `-resume`, `--skip`, `--out_dir`. +Samplesheet schema (`assets/schema_input.json`): array of `{id, db_path}` rows. `db_path` must be a directory containing `train.sqlite3`, `test.sqlite3`, `optimization.sqlite3`. Skip stages via `--skip aacomp,kgiddi` (comma-separated, matches feature or graph model names). -No test suite, no linter config, no `pyproject.toml` / `requirements.txt`. Python deps managed via conda (`fopra.yml` top-level, per-module `environment.yml` files). +Python deps managed via conda — `environments/general.yml` (extraction/RF/graph/eval) and `environments/ml.yml` (PyTorch CU128 + cuML for NN training). No `pyproject.toml` / `requirements.txt`. ## Architecture ### Top-level layout -- `main.nf` — orchestrates per-database workflow. Includes modules for feature extraction, ML training, random forest, graph models (KGIDDI, DDI parsimony), DDI extraction, data loading, evaluation. Parses model JSON configs from `assets/` at runtime. -- `wrapper.nf` / `wrapper.sh` — iterate database splits, then aggregate evaluation across them. -- `nextflow.config` — single source of truth for `db_list`, `graph_models`, `machine_learning_features`, `skip`, `out_dir`, and executor profiles. -- `assets/.json` — per-model hyperparameter grid and search config. Filename **must** match `model_name` field and the Python script in `bin/`. -- `modules/local//main.nf` — Nextflow process definitions. Each stage may ship its own `environment.yml`. -- `bin/` — Python scripts invoked by modules (`run_models.py`, `random_forest.py`, `run_graph_models.py`, `kgiddi.py`, `ddiparsimony.py`, `extract_features.py`, `eval_multiqc.py`, `combine_eval.py`, `load_data_gm.py`, etc.). Must be executable and on `PATH` (Nextflow handles this from `bin/`). -- `bin/features/` — feature encoding implementations (`aacomp`, `aaencode`, `protdcal`, `embeddings`, `esm3_*`, `esmc_*`, `prott5_*`). New feature = new file here + entry in `params.machine_learning_features`. -- `environments/general.yml`, `fopra.yml`, `tower.yml` — conda / Tower configs. -- `docker/` — container definitions. +- `main.nf` — entry. Defines `DOMAINBENCHMARK` workflow (MultiQC + versions/methods boilerplate) and `DAISYBIO_DOMAINBENCHMARK` (the science workflow). +- `workflows/domainbenchmark.nf` — wires sample channel → `PER_DB_BENCHMARK` (scattered per DB) → `AGGREGATE_EVAL`. +- `subworkflows/local/per_db_benchmark/main.nf` — scatter: `DDI_EXTRACTION` → `FEATURE_EXTRACTION` (fan-out feature × split) → `NEURAL_NETWORK` + `RANDOM_FOREST` (per-feature singletons + one all-feature concat run, gated by `params.machine_learning_models`) + `GRAPH_MODEL` → `EVAL_ONE` (per-prediction) → `EVALUATION` (per-DB MultiQC reduce). +- `subworkflows/local/aggregate_eval/main.nf` — runs `COMBINE_EVAL` across per-DB reports to produce `results/evaluation/ddi_report.html`. +- `subworkflows/local/utils_nfcore_domainbenchmark_pipeline/main.nf` — nf-core boilerplate (initialise, completion, citations). +- `nextflow.config` — single source of truth for `db_list` (legacy), `graph_models`, `machine_learning_models`, `machine_learning_features`, `large_features`, `max_protein_combinations_per_ddi`, `skip`, `out_dir`, profiles. +- `conf/{base,slurm,test,test_full,modules}.config` — layered config. `conf/base.config` carries retry strategy and per-label resources. +- `assets/.json` — per-model hyperparameter grid + search config. Filename must match `model_name` and the Python script in `bin/`. +- `modules/local//main.nf` — Nextflow process defs (`ddi_extraction`, `feature_extraction`, `neural_network`, `random_forest`, `graph_model`, `evaluation`). +- `bin/` — Python entrypoints invoked by modules (`run_models.py`, `random_forest.py`, `run_graph_models.py`, `kgiddi.py`, `ddiparsimony.py`, `extract_features.py`, `eval_one.py`, `eval_multiqc.py`, `combine_eval.py`, `load_data_gm.py`). Auto on `PATH` from Nextflow. +- `bin/features/` — feature encoders (`aacomp`, `aaencode`, `dummy`, `embeddings`, `protdcal`, `esm3_*`, `esmc_*`, `prott5_*`). New feature = new file here + entry in `params.machine_learning_features`. Heavy ones go in `params.large_features` → routed to `process_gpu_large`. +- `docker/`, `containers_{docker,singularity,conda_lock}_{amd64,arm64}.config` — container/lock matrices. ### Data flow -1. Input: database split directory with `train.sqlite3`, `test.sqlite3`, `optimization.sqlite3` (tables: DDI, DGO, PD, DomSeq, PPI, PGO, Embeddings). -2. `feature_extraction` → writes per-feature `train/test/optimization.h5` under `results//data//`. -3. `machine_learning` / `random_forest` consume `.h5` features, grid-search via the model JSON, emit predictions to `results//ml_output/`. -4. `graph_model` stages (KGIDDI, DDI parsimony) run independently against the sqlite splits, output under `results//graph_models//`. -5. `evaluation` (MultiQC) combines everything into `results//evaluation/evaluation.html`; `wrapper.nf` merges across DBs into `results/evaluation/ddi_report.html`. +1. Input: samplesheet of `{id, db_path}`. Each `db_path` contains `train/test/optimization.sqlite3` (tables: DDI, DGO, PD, DomSeq, PPI, PGO, Embeddings). +2. `DDI_EXTRACTION` → SQL → CSV per split. +3. `FEATURE_EXTRACTION` (fan-out per feature × split) → per-feature `train/test/optimization.h5` under `results//data//`. +4. `NEURAL_NETWORK` / `RANDOM_FOREST` consume `.h5`, grid-search via model JSON, predictions to `results//nn_output/` and `results//rf_output/`. +5. `GRAPH_MODEL` (KGIDDI, DDIParsimony, KGIDDI_RANDOM) runs independently against sqlite splits → `results//graph_models//`. +6. `EVAL_ONE` per-prediction → `EVALUATION` per-DB MultiQC → `results//evaluation/evaluation.html`. +7. `AGGREGATE_EVAL` / `COMBINE_EVAL` → `results/evaluation/ddi_report.html`. + +The scatter design (`EVAL_ONE` → `EVALUATION` reduce) replaced a monolithic evaluation that hit 300 GB OOM. See comment in `modules/local/evaluation/main.nf`. ### Adding things -- **New ML model:** add `assets/.json` (must include `model_name`, `data`, `search_parameters`, `model_parameters`) and matching logic in the ML module. Name is auto-picked up by `main.nf`. -- **New feature encoding:** add `bin/features/.py` and append `` to `params.machine_learning_features` in `nextflow.config`. -- **Skip stages:** set `--skip aacomp,kgiddi` (comma-sep, matches feature or graph model names). +- **New ML model:** add `assets/.json` (must include `model_name`, `data`, `search_parameters`, `model_parameters`) + matching Python file in `bin/`. Picked up automatically. +- **New feature encoding:** add `bin/features/.py` and append `` to `params.machine_learning_features` in `nextflow.config`. Append to `params.large_features` if it needs GPU/big memory. +- **Skip stages:** `--skip aacomp,kgiddi` (comma-separated; matches feature or graph model names). ### Profiles - `standard`: local executor, conda enabled. -- `slurm`: slurm executor, 8 cpus / 200 GB / 48h per process, singularity cache at `/nfs/scratch/singularity_cache`. +- `slurm`: slurm executor, per-label resources via `conf/slurm.config`, singularity cache at `/nfs/scratch/singularity_cache`. +- `test` / `test_full`: minimal SQLite triplet under `tests/data/`, single feature. +- `daisybio`: site-specific defaults. -Default paths in `nextflow.config` point at `/nfs/data/CoBiNet_Masterpraktikum/databases/...` — override with `--db` / `--db_list` for local runs. +Default DB paths in `nextflow.config` point at `/nfs/data/CoBiNet_Masterpraktikum/databases/...` — override via samplesheet for local runs. ## MCP Tools: code-review-graph diff --git a/README.md b/README.md index 1b3e782..8fcba7f 100644 --- a/README.md +++ b/README.md @@ -2,10 +2,13 @@ [![Open in GitHub Codespaces](https://img.shields.io/badge/Open_In_GitHub_Codespaces-black?labelColor=grey&logo=github)](https://github.com/codespaces/new/daisybio/domainbenchmark) [![GitHub Actions CI Status](https://github.com/daisybio/domainbenchmark/actions/workflows/nf-test.yml/badge.svg)](https://github.com/daisybio/domainbenchmark/actions/workflows/nf-test.yml) -[![GitHub Actions Linting Status](https://github.com/daisybio/domainbenchmark/actions/workflows/linting.yml/badge.svg)](https://github.com/daisybio/domainbenchmark/actions/workflows/linting.yml)[![Cite with Zenodo](http://img.shields.io/badge/DOI-10.5281/zenodo.XXXXXXX-1073c8?labelColor=000000)](https://doi.org/10.5281/zenodo.XXXXXXX) +[![GitHub Actions Linting Status](https://github.com/daisybio/domainbenchmark/actions/workflows/linting.yml/badge.svg)](https://github.com/daisybio/domainbenchmark/actions/workflows/linting.yml) [![nf-test](https://img.shields.io/badge/unit_tests-nf--test-337ab7.svg)](https://www.nf-test.com) -[![Nextflow](https://img.shields.io/badge/version-%E2%89%A525.10.4-green?style=flat&logo=nextflow&logoColor=white&color=%230DC09D&link=https%3A%2F%2Fnextflow.io)](https://www.nextflow.io/) + + + +[![Nextflow](https://img.shields.io/badge/version-%E2%89%A525.10.2-green?style=flat&logo=nextflow&logoColor=white&color=%230DC09D&link=https%3A%2F%2Fnextflow.io)](https://www.nextflow.io/) [![nf-core template version](https://img.shields.io/badge/nf--core_template-4.0.2-green?style=flat&logo=nfcore&logoColor=white&color=%2324B064&link=https%3A%2F%2Fnf-co.re)](https://github.com/nf-core/tools/releases/tag/4.0.2) [![run with conda](http://img.shields.io/badge/run%20with-conda-3EB049?labelColor=000000&logo=anaconda)](https://docs.conda.io/en/latest/) [![run with docker](https://img.shields.io/badge/run%20with-docker-0db7ed?labelColor=000000&logo=docker)](https://www.docker.com/) @@ -27,9 +30,9 @@ The pipeline runs the following stages: 2. **Feature extraction** for every requested encoding (`aacomp`, `aaencode`, ProtT5 / ESM-3 / ESM-C protein and domain embeddings) — parallelized per `(feature × split)`. -3. **ML classifiers** trained on every feature combination up to - `--max_machine_learning_features`: - - `MACHINE_LEARNING` — neural network (PyTorch + skorch). +3. **ML classifiers** trained on each feature individually plus one + all-feature concatenation run (gated by `--machine_learning_models`): + - `NEURAL_NETWORK` — neural network (PyTorch + skorch). - `RANDOM_FOREST` — RAPIDS cuML random forest on GPU. 4. **Graph models** (`GRAPH_MODEL`): KGIDDI, DDI parsimony, KGIDDI-random. 5. **Per-prediction evaluation** (`EVAL_ONE`) → tiny per-model JSONs. @@ -41,61 +44,55 @@ flowchart LR subgraph "per database" ddi[DDI_EXTRACTION] feat[FEATURE_EXTRACTION] - ml[MACHINE_LEARNING] + nn[NEURAL_NETWORK] rf[RANDOM_FOREST] gm[GRAPH_MODEL] eo[EVAL_ONE] ev[EVALUATION] end db[(db_list)] --> ddi - db --> feat --> ml & rf + db --> feat --> nn & rf db --> gm - ddi --> ml & rf - ml & rf & gm --> eo --> ev + ddi --> nn & rf + nn & rf & gm --> eo --> ev ev --> agg[COMBINE_EVAL] --> report[ddi_report.html] ``` - ## Usage > [!NOTE] > If you are new to Nextflow and nf-core, please refer to [this page](https://nf-co.re/docs/get_started/environment_setup/overview) on how to set-up Nextflow. Make sure to [test your setup](https://nf-co.re/docs/get_started/run-your-first-pipeline) with `-profile test` before running the workflow on actual data. -### Single database +### Samplesheet (recommended entry) ```bash -<<<<<<< HEAD -nextflow run main.nf \ - -profile \ - --db /path/to/database_split \ +nextflow run . \ + -profile ,slurm \ + --input assets/samplesheet.csv \ --outdir results -======= -nextflow run daisybio/domainbenchmark \ - -profile \ - --input samplesheet.csv \ - --outdir ->>>>>>> TEMPLATE ``` -### Multiple databases (scatter + combined evaluation) +The samplesheet is a CSV with one row per database split: -```bash -nextflow run main.nf \ - -profile \ - --db_list "/path/to/db1,/path/to/db2,/path/to/db3" \ - --outdir results +```csv +id,db_path +random_denoise,/path/to/random_denoise +random_ddi,/path/to/random_ddi ``` +Schema in `assets/schema_input.json`. Each `db_path` must contain +`train.sqlite3`, `test.sqlite3`, `optimization.sqlite3`. + ### Cluster (Slurm + Singularity) ```bash -nextflow run main.nf \ +nextflow run . \ -profile slurm,singularity \ - --db_list "/nfs/data/CoBiNet_Masterpraktikum/databases/random_denoise,..." \ - --outdir /nfs/scratch/cobinet/results + --input assets/samplesheet.csv \ + --outdir /nfs/scratch/cobinet/results \ + -resume ``` ### Skipping stages @@ -105,18 +102,19 @@ names. For example, to skip the heavy embedding-based features and the two parsimony graph models: ```bash -nextflow run main.nf \ +nextflow run . \ -profile slurm,singularity \ + --input assets/samplesheet.csv \ --skip "kgiddi,ddiparsimony,prott5_protein_domain_embeddings,esm3_protein_domain_embeddings" ``` ### Test profile (in-repo fixture) ```bash -nextflow run main.nf -profile test,singularity --outdir results-test +nextflow run . -profile test,singularity --outdir results-test ``` -The `test` profile points `--db` at a tiny in-repo SQLite triple under +The `test` profile points at a tiny in-repo SQLite triple under `tests/data/` and disables every heavy feature except `aacomp`. Used by CI and by `nf-test`. @@ -124,20 +122,20 @@ CI and by `nf-test`. | Parameter | Default | Description | |---|---|---| -| `--input` | `null` | Optional samplesheet CSV (one row per database split). | -| `--db` | `null` | Path to a single database split directory. | -| `--db_list` | `null` | Comma-separated list of database splits. | +| `--input` | `null` | **Required.** Samplesheet CSV (one row per database split). | | `--outdir` | `./results` | Output directory. | | `--modeljson` | `${projectDir}/assets` | Directory of model hyperparameter JSONs. | | `--skip` | `''` | Comma-separated feature/model names to skip. | -| `--graph_models` | `kgiddi, ddiparsimony, kgiddi_random` | Graph models to run. | -| `--machine_learning_features` | `aacomp, aaencode, prott5_*, esm3_*, esmc_*` | Feature encodings to compute. | -| `--max_machine_learning_features` | `2` | Max features combined per ML run. | +| `--graph_models` | `kgiddi,ddiparsimony,kgiddi_random` | Graph models to run. | +| `--machine_learning_features` | `aacomp,aaencode,prott5_*,esm3_*,esmc_*` | Feature encodings to compute. | +| `--large_features` | `prott5_*,esm3_*,esmc_*` | Features routed to `process_gpu_large`. | +| `--machine_learning_models` | `neural_network,random_forest` | ML models to run. | +| `--max_protein_combinations_per_ddi` | `null` | Cap on protein-pair instantiations per DDI pair (sampled without replacement). Null = use all. | | `--seed` | `42` | Global RNG seed. | | `--publish_dir_mode` | `'copy'` | Nextflow `publishDir` mode. | Full schema with defaults, types, and descriptions: `nextflow_schema.json`. -Run `nextflow run main.nf --help` for a CLI summary. +Run `nextflow run . --help` for a CLI summary. ## Pipeline output @@ -150,8 +148,8 @@ For each database split processed, a subdirectory under `--outdir//`: │ ├── train.h5 │ ├── test.h5 │ └── optimization.h5 -├── ml_output/ -│ └── / +├── nn_output/ +│ └── neural_network_/ │ ├── predictions.parquet │ └── model/ ├── rf_output/ @@ -166,8 +164,8 @@ For each database split processed, a subdirectory under `--outdir//`: └── multiqc_report.html ``` -When `--db_list` is set, a top-level cross-database report is also written: -`/evaluation/ddi_report.html`. +When the samplesheet contains more than one database, a top-level +cross-database report is also written: `/evaluation/ddi_report.html`. A full description of each output is in [`docs/output.md`](docs/output.md). @@ -185,11 +183,27 @@ A full description of each output is in [`docs/output.md`](docs/output.md). ## Adding new components -- **New ML model:** add `assets/.json` (with `model_name`, `data`, - `search_parameters`, `model_parameters`) and the matching script in - `bin/`. The pipeline picks up the JSON automatically. -- **New feature encoding:** add `bin/features/.py` and append - `` to `params.machine_learning_features` in `nextflow.config`. +### New feature encoding + +Copy the template and implement your feature computation: + +1. Copy `bin/features/new_feature.py` to `bin/features/.py` +2. Implement `extract_features(conn, out_file)` — read from SQLite, write feature vectors to HDF5 +3. Append `` to `params.machine_learning_features` in `nextflow.config` +4. If it needs GPU or large memory, also add it to `params.large_features` + +See `bin/features/new_feature.py` for the full contract and examples. + +### New ML model + +Copy the template and implement your training/prediction logic: + +1. Copy `bin/new_model.py` to `bin/.py` +2. Subclass `DDIModelTrainer` and implement the required methods +3. Create `assets/.json` with hyperparameter grid (must include `model_name`, `data`, `search_parameters`, `model_parameters`) +4. Add a Nextflow process in `modules/local//main.nf` and wire it into `subworkflows/local/per_db_benchmark/main.nf` + +See `bin/new_model.py` for the full API and optional overrides. ## Credits @@ -205,10 +219,10 @@ If you would like to contribute to this pipeline, please see the [contributing g ## Citations - - - -An extensive list of references for the tools used by the pipeline can be found in the [`CITATIONS.md`](CITATIONS.md) file. +A Zenodo DOI for daisybio/domainbenchmark will be issued at the v1.0.0 +release; this section will be updated then. An extensive list of +references for the tools used by the pipeline can be found in the +[`CITATIONS.md`](CITATIONS.md) file. This pipeline uses code and infrastructure developed and maintained by the [nf-core](https://nf-co.re) community, reused here under the [MIT license](https://github.com/nf-core/tools/blob/main/LICENSE). diff --git a/assets/NeuralNetwork.json b/assets/NeuralNetwork.json index 76628d3..e955071 100644 --- a/assets/NeuralNetwork.json +++ b/assets/NeuralNetwork.json @@ -9,12 +9,10 @@ "models_to_evaluate": 100, "grid_search_epochs": 20, "retrain_epochs": 200, - "protein_sample_per_ddi_opt_set": 1, "balance_positive_and_negative_interactions_opt_set": true }, "model_parameters": { - "balance_positive_and_negative_interactions_train_set": [true], - "protein_sample_per_ddi_train_set": [1,2,5,10], + "balance_method": ["downsample"], "module__hidden_layer_sizes": [ [32, 32, 32], [128, 64, 32], diff --git a/assets/RandomForest.json b/assets/RandomForest.json index 0b813a4..9685d4d 100644 --- a/assets/RandomForest.json +++ b/assets/RandomForest.json @@ -4,13 +4,11 @@ "data": [], "jobs": 5, "search_parameters": { - "models_to_evaluate": 100, - "protein_sample_per_ddi_opt_set": 1, + "models_to_evaluate": 200, "balance_positive_and_negative_interactions_opt_set": true }, "model_parameters": { - "balance_positive_and_negative_interactions_train_set": [true, false], - "protein_sample_per_ddi_train_set": [1,2,5,10], + "balance_method": ["none", "downsample", "oversample"], "n_estimators": [100, 200, 500], "max_depth": [10, 20, 50], "min_samples_split": [2, 5, 10], diff --git a/assets/kgiddi.json b/assets/kgiddi.json index c17007d..e3173c7 100644 --- a/assets/kgiddi.json +++ b/assets/kgiddi.json @@ -5,7 +5,7 @@ "chi_square_cutoff": [0.1, 0.3, 0.5, 0.7, 0.9, 1.0], "bicluster_cutoff": [0.5, 0.6, 0.7, 0.8], "threshold": 3, - "go_graph": "/nfs/home/students/c.thomas/CoBiNet/go-basic.obo" + "go_graph": "/nfs/data/CoBiNet_Masterpraktikum/go-basic.obo" }, "permutation": false, "optimized": { diff --git a/assets/kgiddi_random.json b/assets/kgiddi_random.json index db305f0..5970d53 100644 --- a/assets/kgiddi_random.json +++ b/assets/kgiddi_random.json @@ -5,7 +5,7 @@ "chi_square_cutoff": [0.1, 0.3, 0.5, 0.7, 0.9, 1.0], "bicluster_cutoff": [0.5, 0.6, 0.7, 0.8], "threshold": 3, - "go_graph": "/nfs/home/students/c.thomas/CoBiNet/go-basic.obo" + "go_graph": "/nfs/data/CoBiNet_Masterpraktikum/go-basic.obo" }, "permutation": true, "optimized": { diff --git a/assets/samplesheet.csv b/assets/samplesheet.csv index 5f653ab..9dbed6a 100644 --- a/assets/samplesheet.csv +++ b/assets/samplesheet.csv @@ -1,3 +1,2 @@ -sample,fastq_1,fastq_2 -SAMPLE_PAIRED_END,/path/to/fastq/files/AEG588A1_S1_L002_R1_001.fastq.gz,/path/to/fastq/files/AEG588A1_S1_L002_R2_001.fastq.gz -SAMPLE_SINGLE_END,/path/to/fastq/files/AEG588A4_S4_L003_R1_001.fastq.gz, +id,db_path +minimal_leakage,/nfs/data/CoBiNet_Masterpraktikum/databases/minimal_leakage_domain diff --git a/assets/samplesheet_full.csv b/assets/samplesheet_full.csv new file mode 100644 index 0000000..bd5f139 --- /dev/null +++ b/assets/samplesheet_full.csv @@ -0,0 +1,5 @@ +id,db_path +minimal_leakage_domain,/nfs/data/CoBiNet_Masterpraktikum/databases/minimal_leakage_domain +random_ddi,/nfs/data/CoBiNet_Masterpraktikum/databases/random_ddi +random_denoise,/nfs/data/CoBiNet_Masterpraktikum/databases/random_denoise +random_discovery,/nfs/data/CoBiNet_Masterpraktikum/databases/random_discovery diff --git a/assets/samplesheet_test.csv b/assets/samplesheet_test.csv new file mode 100644 index 0000000..4a11561 --- /dev/null +++ b/assets/samplesheet_test.csv @@ -0,0 +1,2 @@ +id,db_path +db_minimal,tests/data/db_minimal diff --git a/bin/combine_eval.py b/bin/combine_eval.py index 461f778..a3e93aa 100755 --- a/bin/combine_eval.py +++ b/bin/combine_eval.py @@ -56,12 +56,6 @@ def parse_arguments(): return p.parse_args() -def create_section_header(section_id, section_name, outdir): - block = {"id": section_id, "section_name": section_name} - with open(os.path.join(outdir, f"{section_id}_mqc.json"), "w") as f: - json.dump(block, f, indent=2) - - def write_multiqc_config(outdir) -> str: # Write MultiQC config file to specify module order # @outdir: output directory where MultiQC JSON files are located @@ -101,7 +95,9 @@ def pick(regex): roc_curves_block = pick(r"combined_roc_curves") pr_curves_block = pick(r"combined_pr_curves") roc_heatmap_block = pick(r"model_performance_heatmap_roc$") + roc_heatmap_ci = pick(r"model_performance_heatmap_roc_ci$") pr_heatmep_block = pick(r"model_performance_heatmap_pr$") + pr_heatmap_ci = pick(r"model_performance_heatmap_pr_ci$") db_blocks = pick(r"database_analysis") degree_blocks = pick(r"_degree_distribution") @@ -119,7 +115,9 @@ def pick(regex): + clustering_blocks + metric_blocks + roc_heatmap_block + + roc_heatmap_ci + pr_heatmep_block + + pr_heatmap_ci + roc_curves_block + pr_curves_block ) @@ -400,103 +398,121 @@ def combined_pr_curves(pr_data, outdir): json.dump(pr_block, f, indent=2) -def combined_metrics_heatmap(metrics_data, outdir): - - # Heatmap: rows: models/databases, columns: metrics (precision, recall, f1), values: metric scorFoes - # Ordered by model name on y-axis, metric name on x-axis - # Create data structure for heatmap: {model_name: {metric_name: value}} - # Ids are: combined_metrics_table_{db_name} - - # Split the metrics in two parts, for the samples, tp, tn, fp, fn, and for the metrics (Accuracy, Recall, Specificity, Precision, Balanced Accuracy, F1 Score) - # Calculate the percentages for the sample sizes - - # For the metrics, create a - - data = {} - for block_id, block in metrics_data.items(): - db_name = block_id.replace("combined_metrics_table_", "") - metrics = block.get("data", {}) - # Each model has a list of metrices - for model in metrics.keys(): - model_name = f"{model} ({db_name})" - data[model_name] = metrics[model] - - # force ordering by model_name - ordered_model_names = sorted(data.keys()) - data = {model_name: data[model_name] for model_name in ordered_model_names} - - # Order data by model_name - data = {model_name: data[model_name] for model_name in sorted(data.keys())} - - metrics_heatmap_block = { - "id": "combined_metrics_heatmap", - "section_name": "Combined Metrics Heatmap", - "plot_type": "heatmap", - "pconfig": { - "id": "combined_metrics_heatmap", - "title": "Combined Metrics Heatmap", - "xlab": "Model", - "ylab": "Metric", - }, - "data": data, - } - with open(os.path.join(outdir, "combined_metrics_heatmap_mqc.json"), "w") as f: - json.dump(metrics_heatmap_block, f, indent=2) - - def model_performance_heatmap(combined_metrics, outdir): + # Heatmap colour + clustering use the numeric mean (float). The + # bootstrap CI lives in a sibling "<...> CI" string column emitted by + # eval_multiqc.py and is rendered as a companion label-table next to the + # heatmap so the CI string is visible alongside the colour cell. + + def _to_float(v): + if isinstance(v, (int, float)): + return float(v) + if isinstance(v, str): + # backwards-compat: "0.948 [0.938, 0.956]" or "0.948" + try: + return float(v.split()[0].strip("[],")) + except Exception: + return None + return None - roc_data = {} - pr_data = {} + roc_mean, roc_ci = {}, {} + pr_mean, pr_ci = {}, {} for block_id, block in combined_metrics.items(): db_name = block_id.replace("combined_metrics_table_", "") metrics = block.get("data", {}) - # Each model has a list of metrices - for model in metrics.keys(): - for metric_name, metric_value in metrics[model].items(): - model_name = f"{model}" - if "roc" in metric_name.lower(): - roc_data.setdefault(model_name, {})[db_name] = metric_value - if "pr" in metric_name.lower(): - pr_data.setdefault(model_name, {})[db_name] = metric_value + for model, row in metrics.items(): + rv = _to_float(row.get("ROC AUC")) + pv = _to_float(row.get("PR AP")) + if rv is not None: + roc_mean.setdefault(model, {})[db_name] = rv + if pv is not None: + pr_mean.setdefault(model, {})[db_name] = pv + rc = row.get("ROC AUC CI") or "" + pc = row.get("PR AP CI") or "" + roc_ci.setdefault(model, {})[db_name] = ( + f"{rv:.3f} {rc}".strip() if rv is not None else (rc or "") + ) + pr_ci.setdefault(model, {})[db_name] = ( + f"{pv:.3f} {pc}".strip() if pv is not None else (pc or "") + ) - # Order data by model_name - pr_data = {model_name: pr_data[model_name] for model_name in sorted(pr_data.keys())} - roc_data = { - model_name: roc_data[model_name] for model_name in sorted(roc_data.keys()) - } + # Sort rows + roc_mean = {m: roc_mean[m] for m in sorted(roc_mean)} + pr_mean = {m: pr_mean[m] for m in sorted(pr_mean)} + roc_ci = {m: roc_ci[m] for m in sorted(roc_ci)} + pr_ci = {m: pr_ci[m] for m in sorted(pr_ci)} - model_eval_block = { - "id": "model_performance_heatmap_roc", - "section_name": "AUC Heatmap", - "plot_type": "heatmap", - "pconfig": { - "id": "model_performance_heatmap_roc", - "title": "AUC Heatmap", - "xlab": "Database", - "ylab": "Model", - "colstops": colstops, - }, - "data": roc_data, - } - with open(os.path.join(outdir, "model_performance_heatmap_roc_mqc.json"), "w") as f: - json.dump(model_eval_block, f, indent=2) + def _write_heatmap(block_id, title, data, fname): + block = { + "id": block_id, + "section_name": title, + "plot_type": "heatmap", + "pconfig": { + "id": block_id, + "title": title, + "xlab": "Database", + "ylab": "Model", + "colstops": colstops, + "min": 0.0, + "max": 1.0, + "decimalPlaces": 3, + "display_values": True, + }, + "data": data, + } + with open(os.path.join(outdir, fname), "w") as f: + json.dump(block, f, indent=2) + + def _write_ci_table(block_id, title, data, fname): + # Table sharing the same model/db grid; cells = "mean [lo, hi]" strings. + # Acts as the CI label panel rendered directly below the heatmap. + db_cols = sorted({db for row in data.values() for db in row}) + headers = { + db: {"title": db, "scale": False, "description": f"{title} ({db})"} + for db in db_cols + } + block = { + "id": block_id, + "section_name": f"{title} — 95% CI labels", + "plot_type": "table", + "pconfig": { + "id": block_id, + "title": f"{title} (95% CI)", + "col1_header": "Model", + "no_violin": True, + "sortRows": False, + }, + "headers": headers, + "data": data, + } + with open(os.path.join(outdir, fname), "w") as f: + json.dump(block, f, indent=2) + + _write_heatmap( + "model_performance_heatmap_roc", + "AUC Heatmap", + roc_mean, + "model_performance_heatmap_roc_mqc.json", + ) + _write_ci_table( + "model_performance_heatmap_roc_ci", + "AUC Heatmap", + roc_ci, + "model_performance_heatmap_roc_ci_mqc.json", + ) - model_eval_block = { - "id": "model_performance_heatmap_pr", - "section_name": "Average Precision Heatmap", - "plot_type": "heatmap", - "pconfig": { - "id": "model_performance_heatmap_pr", - "title": "Average Precision Heatmap", - "xlab": "Database", - "ylab": "Model", - "colstops": colstops, - }, - "data": pr_data, - } - with open(os.path.join(outdir, "model_performance_heatmap_pr_mqc.json"), "w") as f: - json.dump(model_eval_block, f, indent=2) + _write_heatmap( + "model_performance_heatmap_pr", + "Average Precision Heatmap", + pr_mean, + "model_performance_heatmap_pr_mqc.json", + ) + _write_ci_table( + "model_performance_heatmap_pr_ci", + "Average Precision Heatmap", + pr_ci, + "model_performance_heatmap_pr_ci_mqc.json", + ) def combine_db_analysis(db_blocks, outdir): diff --git a/bin/create_networks.py b/bin/create_networks.py deleted file mode 100644 index fc10cc9..0000000 --- a/bin/create_networks.py +++ /dev/null @@ -1,184 +0,0 @@ -#! /usr/bin/env python3 - -import matplotlib.pyplot as plt -import networkx as nx -import random -import itertools -import os -import numpy as np -from pathlib import Path - -from load_data_gm import load_ddi, load_ppi - - -def add_random_edges(G, percent_increase=0.1, seed=None): - """ - Add random edges to a NetworkX graph, increasing the number of edges by percent_increase. - The relative degree distribution is preserved. - """ - num_edges = G.number_of_edges() - num_new_edges = int(percent_increase * num_edges) - nodes = list(G.nodes()) - existing_edges = set(G.edges()) - all_possible_edges = set(itertools.combinations(nodes, 2)) - possible_edges = list(all_possible_edges - existing_edges) - if num_new_edges > len(possible_edges): - num_new_edges = len(possible_edges) - random.seed(seed) - new_edges = random.sample(possible_edges, num_new_edges) - G.add_edges_from(new_edges) - return G - - -def remove_random_edges(G, percent_decrease=0.1, seed=None): - """ - Remove random edges from a NetworkX graph, decreasing the number of edges by percent_decrease. - """ - num_edges = G.number_of_edges() - num_remove = int(percent_decrease * num_edges) - edges = list(G.edges()) - if num_remove > len(edges): - num_remove = len(edges) - random.seed(seed) - edges_to_remove = random.sample(edges, num_remove) - G.remove_edges_from(edges_to_remove) - return G - - -def plot_degree_distribution(G, plot_dir, name_suffix=""): - degrees = [d for n, d in G.degree()] - plt.figure(figsize=(8, 6)) - plt.hist( - degrees, - bins=range(min(degrees), max(degrees) + 2), - align="left", - color="steelblue", - edgecolor="black", - ) - plt.xlabel("Node Degree") - plt.ylabel("Frequency") - plt.title("Node Degree Distribution" + (f" ({name_suffix})" if name_suffix else "")) - plt.tight_layout() - file_path = os.path.join(plot_dir, f"degree_distribution{name_suffix}.png") - plt.savefig(file_path) - plt.close() - print(f"Degree distribution plot saved to {file_path}") - - -def plot_degree_change(G_before, G_after, plot_dir, name_suffix=""): - nodes = list(G_before.nodes()) - degree_before = dict(G_before.degree()) - degree_after = dict(G_after.degree()) - # Compute relative change, handle division by zero - degree_rel_change = [] - for n in nodes: - before = degree_before[n] - after = degree_after[n] - if before == 0: - # If node had degree 0, use np.nan or skip - degree_rel_change.append(float("nan")) - else: - degree_rel_change.append((after - before) / before) - # Remove NaNs for plotting - degree_rel_change = [x for x in degree_rel_change if not np.isnan(x)] - # Add mean/median and std lines - mean_dc = float(np.mean(degree_rel_change)) - median_dc = float(np.median(degree_rel_change)) - std_dc = float(np.std(degree_rel_change)) - plt.figure(figsize=(8, 6)) - plt.hist(degree_rel_change, bins=50, color="lightblue", edgecolor="black") - plt.axvline( - mean_dc, - color="red", - linestyle="dashed", - linewidth=1.5, - label=f"Mean: {mean_dc:.4f}", - ) - plt.axvline( - median_dc, - color="green", - linestyle="dashed", - linewidth=1.5, - label=f"Median: {median_dc:.4f}", - ) - plt.axvline( - mean_dc + std_dc, - color="orange", - linestyle="dashed", - linewidth=1.5, - label=f"Mean + 1 SD: {mean_dc + std_dc:.4f}", - ) - plt.axvline( - mean_dc - std_dc, - color="orange", - linestyle="dashed", - linewidth=1.5, - label=f"Mean - 1 SD: {mean_dc - std_dc:.4f}", - ) - plt.legend() - plt.xlabel("Relative Degree Change per Node") - plt.ylabel("Frequency") - plt.title( - "Node Relative Degree Change Distribution" - + (f" ({name_suffix})" if name_suffix else "") - ) - plt.tight_layout() - file_path = os.path.join( - plot_dir, f"degree_rel_change_distribution{name_suffix}.png" - ) - plt.savefig(file_path) - plt.close() - print(f"Relative degree change distribution plot saved to {file_path}") - - -if __name__ == "__main__": - # Example usage - plot_dir = "tmp/plots" - os.makedirs(plot_dir, exist_ok=True) - # Load data from data_test/ - - train_db_path = Path("data_test/train.sqlite3") - test_db_path = Path("data_test/test.sqlite3") - ddi_dt_train = load_ddi(train_db_path) - ppi_dt_train = load_ppi(train_db_path) - ddi_dt_test = load_ddi(test_db_path) - ppi_dt_test = load_ppi(test_db_path) - - # Convert dataframes to NetworkX graphs - G_ddi_train = nx.from_pandas_edgelist(ddi_dt_train, "domain_a", "domain_b") - G_ppi_train = nx.from_pandas_edgelist(ppi_dt_train, "protein_1", "protein_2") - G_ddi_test = nx.from_pandas_edgelist(ddi_dt_test, "domain_a", "domain_b") - G_ppi_test = nx.from_pandas_edgelist(ppi_dt_test, "protein_1", "protein_2") - - name_list = ["G_ddi_train", "G_ppi_train", "G_ddi_test", "G_ppi_test"] - graph_list = [G_ddi_train, G_ppi_train, G_ddi_test, G_ppi_test] - for i, G in enumerate(graph_list): - print(f"Graph has {G.number_of_nodes()} nodes and {G.number_of_edges()} edges.") - G_augmented = add_random_edges(G.copy(), percent_increase=0.2, seed=42) - G_reduced = remove_random_edges(G.copy(), percent_decrease=0.1, seed=42) - - plot_degree_distribution(G, plot_dir=plot_dir, name_suffix=f"_{name_list[i]}") - plot_degree_distribution( - G_augmented, plot_dir=plot_dir, name_suffix=f"_{name_list[i]}_augmented" - ) - plot_degree_distribution( - G_reduced, plot_dir=plot_dir, name_suffix=f"_{name_list[i]}_reduced" - ) - - plot_degree_change( - G, G_augmented, plot_dir=plot_dir, name_suffix=f"_{name_list[i]}_augmented" - ) - plot_degree_change( - G, G_reduced, plot_dir=plot_dir, name_suffix=f"_{name_list[i]}_reduced" - ) - - # G = nx.erdos_renyi_graph(5000, 0.05) - # G_augmented = add_random_edges(G.copy(), percent_increase=0.2, seed=42) - # G_reduced = remove_random_edges(G.copy(), percent_decrease=0.1, seed=42) - - # plot_degree_distribution(G, plot_dir=plot_dir, name_suffix="_example") - # plot_degree_distribution(G_augmented, plot_dir=plot_dir, name_suffix="_augmented") - # plot_degree_distribution(G_reduced, plot_dir=plot_dir, name_suffix="_reduced") - - # plot_degree_change(G, G_augmented, plot_dir=plot_dir, name_suffix="_augmented") - # plot_degree_change(G, G_reduced, plot_dir=plot_dir, name_suffix="_reduced") diff --git a/bin/ddiparsimony.py b/bin/ddiparsimony.py index 7019dcb..181f811 100644 --- a/bin/ddiparsimony.py +++ b/bin/ddiparsimony.py @@ -138,7 +138,7 @@ def evaluate_reliability_test( def preprocessing( - db_path: Path, output_dir: Path + db_path: Path, output_dir: Path, threads: int = 1 ) -> Tuple[ List[Tuple[str, str]], np.ndarray, @@ -246,7 +246,7 @@ def preprocessing( proteins, ddi_score, num_iterations=1000, - max_workers=os.cpu_count() or 4, + max_workers=threads, ) np.save(rxm_path, random_x_matrix) @@ -262,7 +262,7 @@ def preprocessing( def run_ddiparsimony( - database: str, params_file: str, out_dir: str, out_predictions: str + database: str, params_file: str, out_dir: str, out_predictions: str, threads: int = 1 ): db_train = Path(os.path.join(database, "train.sqlite3")) db_test = Path(os.path.join(database, "test.sqlite3")) @@ -306,7 +306,7 @@ def run_ddiparsimony( protein_domains, ppi_list, ddi_score, - ) = preprocessing(db_train, Path(out_dir)) + ) = preprocessing(db_train, Path(out_dir), threads) import gc domain_pair_to_idx = {pair: idx for idx, pair in enumerate(domain_pairs)} @@ -327,7 +327,7 @@ def run_ddiparsimony( ] train_logs = [] - with ProcessPoolExecutor() as executor: + with ProcessPoolExecutor(max_workers=threads) as executor: results = list( tqdm( executor.map(evaluate_reliability_train, param_grid), @@ -409,7 +409,7 @@ def run_ddiparsimony( ) # Preprocessing on test data domain_pairs, random_x_matrix, ddi_dict, protein_domains, ppi_list, ddi_score = ( - preprocessing(db_test, Path(out_dir)) + preprocessing(db_test, Path(out_dir), threads) ) import gc diff --git a/bin/ddiparsimony_functions.py b/bin/ddiparsimony_functions.py index a687d11..72d0101 100644 --- a/bin/ddiparsimony_functions.py +++ b/bin/ddiparsimony_functions.py @@ -225,7 +225,7 @@ def compute_random_x_matrix_parallel( proteins, ddi_score, num_iterations=1000, - max_workers=8, + max_workers=1, ): random_x_matrix = np.zeros((num_iterations, len(domain_pairs))) with ProcessPoolExecutor(max_workers=max_workers) as executor: diff --git a/bin/domain_split.py b/bin/domain_split.py deleted file mode 100644 index 08b99b2..0000000 --- a/bin/domain_split.py +++ /dev/null @@ -1,151 +0,0 @@ -import numpy as np -import pandas as pd -import random -import sqlite3 -from functools import partial -from tqdm import tqdm - -ANNEALING_STEPS = 1 - -annealing_step_fractions = [ - (i + 1) / (ANNEALING_STEPS + 1) for i in range(ANNEALING_STEPS) -] - -split_fractions = {"train": 0.8, "val": 0.1, "test": 0.1} - -conn = sqlite3.connect("cobinet.sqlite3") -# Load the protein-domain mapping into a DataFrame (not used in the current implementation, but may be useful for future extensions) -# pd_mapping = pd.read_sql(""" -# SELECT domain_id, protein_id FROM domain_protein_map -# WHERE domain_id IS NOT NULL AND protein_id IS NOT NULL LIMIT 20; -# """, conn) - -ddi = pd.read_sql( - """ - SELECT id AS ddi_id, domain_id_a, domain_id_b FROM domain_domain_interaction; -""", - conn, -) - -# cluster sequences are defined as - -cluster_df = pd.read_csv( - "domain_clusters.tsv", sep="\t", header=None, names=["centroid", "member"] -) - -# strip the protein_id from the sequence ids in the cluster dataframe -cluster_df = cluster_df.map(lambda x: x.split("-")[0]).map(int) - -# Aggregate member domains for each centroid -domain_clusters = cluster_df.groupby("centroid")["member"].apply(set).tolist() - -split_domains = {split_name: set() for split_name in split_fractions.keys()} -split_ddis = {split_name: set() for split_name in split_fractions.keys()} - - -def get_new_ddis_for_domains(domains_in_split, new_domains): - prefiltered_ddi = ddi[ - (ddi["domain_id_a"].isin(new_domains)) | (ddi["domain_id_b"].isin(new_domains)) - ] - new_ddis = { - ddi_id - for ddi_id, domain_a, domain_b in prefiltered_ddi.itertuples(index=False) - if (domain_a in domains_in_split or domain_b in domains_in_split) - } - return new_ddis - - -def get_split_ddis(split_name, new_domains=None): - if new_domains is None: - # compute from scratch (more expensive) - return { - ddi_id - for ddi_id, domain_a, domain_b in ddi.itertuples(index=False) - if domain_a in split_domains[split_name] - and domain_b in split_domains[split_name] - } - else: - # compute incrementally (more efficient) - new_ddis = get_new_ddis_for_domains(split_domains[split_name], new_domains) - return split_ddis[split_name].union(new_ddis) - - -def cluster_rank_function(split_name, cluster_domains): - # Calculate the total number of interactions that would be added to the split if this cluster were added - # no need to normalize by split size since we are comparing clusters for the same split - # return len(get_split_ddis(split_name, new_domains=cluster_domains)) - - return len(get_new_ddis_for_domains(split_domains[split_name], cluster_domains)) - - -def split(list, fraction): - split_index = int(len(list) * fraction) - return list[:split_index], list[split_index:] - - -# as initialization, assign clusters to splits according to the annealing fraction, then iteratively assign remaining clusters to the split that would gain the most interactions (normalized by split size) -for (split_name, fraction), clusters in zip( - split_fractions.items(), np.array_split(domain_clusters, len(split_fractions)) -): - split_domains[split_name] = set.union(*clusters) - -for annealing_step, annealing_fraction in enumerate(annealing_step_fractions): - print( - f"Starting annealing step {annealing_step + 1}/{ANNEALING_STEPS} with fraction {annealing_fraction:.2f}" - ) - print("Initial DDI counts: ") - for split_name in split_fractions.keys(): - split_ddis[split_name] = get_split_ddis(split_name) - print(f"\t{split_name}: {len(split_ddis[split_name])} interactions") - - random.shuffle(domain_clusters) - clusters_to_keep, clusters_queue = split(domain_clusters, annealing_fraction) - domains_to_keep = set.union(*clusters_to_keep) - - # Remove domains from splits that are not in the clusters to keep - for split_name in split_fractions.keys(): - split_domains[split_name] = split_domains[split_name].intersection( - domains_to_keep - ) - split_ddis[split_name] = get_split_ddis(split_name) - - print("DDI counts after initialization:") - for split_name in split_fractions.keys(): - print(f"\t{split_name}: {len(split_ddis[split_name])} interactions") - - tqdm_ = tqdm( - total=len(domain_clusters), - initial=len(clusters_to_keep), - desc=f"Annealing step {annealing_step + 1}/{ANNEALING_STEPS}", - ) - while clusters_queue: - # take the split that has the least number of interactions (normalized by split size) - current_split_name = min( - split_fractions.keys(), - key=lambda split_name: ( - len(split_ddis[split_name]) / split_fractions[split_name] - ), - ) - - # rank the clusters by the number of interactions they would add to the split - cluster = max( - clusters_queue, key=partial(cluster_rank_function, current_split_name) - ) - clusters_queue.remove(cluster) - - split_domains[current_split_name].update(cluster) - split_ddis[current_split_name] = get_split_ddis( - current_split_name, new_domains=cluster - ) - - tqdm_.update(1) - -print("Final DDI counts:") -for split_name in split_fractions.keys(): - print(f"\t{split_name}: {len(split_ddis[split_name])} interactions") - - # write out the domain ids in the split to a file - with open(split_name, "w") as f: - f.write("ddi_id\n") - for ddi_id in split_ddis[split_name]: - f.write(ddi_id) - f.write("\n") diff --git a/bin/eval_multiqc.py b/bin/eval_multiqc.py index 58870e9..aaec828 100755 --- a/bin/eval_multiqc.py +++ b/bin/eval_multiqc.py @@ -22,6 +22,7 @@ calc_curves_roc_pr, analyse_database, aggregate_per_model_metrics, + paired_bootstrap_diff, ) import logging @@ -88,13 +89,6 @@ def add_db_name_to_block(block, db_name): return block -def load_old_multiqc_data(old_report_dir): - # @old_report_dir: directory containing old MultiQC report - # Returns loaded MultiQC object with parsed logs - search_path = f"{old_report_dir}/{REPORT_NAME}_data/" - multiqc.parse_logs(search_path, preserve_module_raw_data=True) - return multiqc - def copy_old_report_blocks(old_report_dir, out_dir): # @old_report_dir: directory containing old MultiQC report @@ -130,6 +124,29 @@ def merge_data(old, new): return merged +def _write_distribution_block(data, metric, label, interaction_type, db_name, outdir): + block_id = f"{DB_PREFIX}{interaction_type}_{metric}_distribution" + block = { + "id": block_id, + "section_name": f"{interaction_type.upper()} {label} Distribution", + "plot_type": "box", + "pconfig": { + "id": block_id, + "title": f"{interaction_type.upper()} {label} Distribution", + "xlab": "Database", + "ylab": label, + }, + "data": data, + "raw_data": data, + } + block = add_db_name_to_block(block, db_name) + with open( + os.path.join(outdir, f"{interaction_type}_{metric}_distribution_{db_name}_db_mqc.json"), + "w", + ) as f: + json.dump(block, f, indent=2) + + def write_multiqc_json_database_analysis(db_analysis, outdir, db_name) -> None: # Write MultiQC JSON blocks for database analysis results # @db_analysis: dict with database analysis results @@ -198,79 +215,19 @@ def write_multiqc_json_database_analysis(db_analysis, outdir, db_name) -> None: f"{network_type}_network_data" ]["clustering_coefficient"] - # Write MultiQC JSON blocks for box plots, each model_type/interaction_type combination gets its own block - # for db_type in degree_distributions: for interaction_type in ["ppi", "ddi"]: - # Degree Distribution Block - degree_block = { - "id": f"{DB_PREFIX}{interaction_type}_degree_distribution", - "section_name": f"{interaction_type.upper()} Degree Distribution", - "plot_type": "box", - "pconfig": { - "id": f"{DB_PREFIX}{interaction_type}_degree_distribution", - "title": f"{interaction_type.upper()} Degree Distribution", - "xlab": "Database", - "ylab": "Degree", - }, - "data": degree_distributions[interaction_type], - "raw_data": degree_distributions[interaction_type], - } - degree_block = add_db_name_to_block(degree_block, db_name) - with open( - os.path.join( - outdir, f"{interaction_type}_degree_distribution_{db_name}_db_mqc.json" - ), - "w", - ) as f: - json.dump(degree_block, f, indent=2) - - # Betweenness Centrality Block - betweenness_block = { - "id": f"{DB_PREFIX}{interaction_type}_betweenness_distribution", - "section_name": f"{interaction_type.upper()} Betweenness Centrality Distribution", - "plot_type": "box", - "pconfig": { - "id": f"{DB_PREFIX}{interaction_type}_betweenness_distribution", - "title": f"{interaction_type.upper()} Betweenness Centrality Distribution", - "xlab": "Database", - "ylab": "Betweenness Centrality", - }, - "data": betweenness_distributions[interaction_type], - "raw_data": betweenness_distributions[interaction_type], - } - betweenness_block = add_db_name_to_block(betweenness_block, db_name) - with open( - os.path.join( - outdir, - f"{interaction_type}_betweenness_distribution_{db_name}_db_mqc.json", - ), - "w", - ) as f: - json.dump(betweenness_block, f, indent=2) - - # Clustering Coefficient Block - clustering_block = { - "id": f"{DB_PREFIX}{interaction_type}_clustering_distribution", - "section_name": f"{interaction_type.upper()} Clustering Coefficient Distribution", - "plot_type": "box", - "pconfig": { - "id": f"{DB_PREFIX}{interaction_type}_clustering_distribution", - "title": f"{interaction_type.upper()} Clustering Coefficient Distribution", - "xlab": "Database", - "ylab": "Clustering Coefficient", - }, - "data": clustering_distributions[interaction_type], - "raw_data": clustering_distributions[interaction_type], - } - clustering_block = add_db_name_to_block(clustering_block, db_name) - with open( - os.path.join( - outdir, - f"{interaction_type}_clustering_distribution_{db_name}_db_mqc.json", - ), - "w", - ) as f: - json.dump(clustering_block, f, indent=2) + _write_distribution_block( + degree_distributions[interaction_type], "degree", "Degree", + interaction_type, db_name, outdir, + ) + _write_distribution_block( + betweenness_distributions[interaction_type], "betweenness", "Betweenness Centrality", + interaction_type, db_name, outdir, + ) + _write_distribution_block( + clustering_distributions[interaction_type], "clustering", "Clustering Coefficient", + interaction_type, db_name, outdir, + ) def load_old_json_block(old_report_dir, file_name_suffix, block_id): @@ -294,7 +251,7 @@ def load_old_json_block(old_report_dir, file_name_suffix, block_id): return None -def write_multiqc_json_metrices( +def write_multiqc_json_metrics( metrics_aucap, roc_curves, pr_curves, @@ -313,13 +270,23 @@ def write_multiqc_json_metrices( n_models = len(metrics_aucap) - # --- Block 1: AUC/AP table --- + def _fmt_ci(ci): + if ci is None: + return "" + lo, hi = ci + return f"[{float(lo):.3f}, {float(hi):.3f}]" + + # --- Block 1: AUC/AP table ----------------------------------------------- + # Numeric mean kept as float (enables table colour-scale + downstream + # heatmap clustering); CI rendered in a sibling string column. table_id = f"{prefix}_metrics_table" file_name_suffix = "_metrics_table_mqc.json" new_table_data = { m: { - "ROC AUC": f"{float(metrics_aucap[m]['ROC_AUC']):.3f}", - "PR AP": f"{float(metrics_aucap[m]['PR_AP']):.3f}", + "ROC AUC": float(metrics_aucap[m]["ROC_AUC"]), + "ROC AUC CI": _fmt_ci(metrics_aucap[m].get("ROC_AUC_CI")), + "PR AP": float(metrics_aucap[m]["PR_AP"]), + "PR AP CI": _fmt_ci(metrics_aucap[m].get("PR_AP_CI")), } for m in metrics_aucap } @@ -341,6 +308,30 @@ def write_multiqc_json_metrices( "title": "Model performance (AUC / AP)", "col1_header": "Model", }, + "headers": { + "ROC AUC": { + "title": "ROC AUC", + "min": 0.0, + "max": 1.0, + "scale": "RdYlGn", + "format": "{:,.3f}", + }, + "ROC AUC CI": { + "title": "ROC AUC 95% CI", + "scale": False, + }, + "PR AP": { + "title": "PR AP", + "min": 0.0, + "max": 1.0, + "scale": "RdYlGn", + "format": "{:,.3f}", + }, + "PR AP CI": { + "title": "PR AP 95% CI", + "scale": False, + }, + }, "data": merged_table_data, } with open(os.path.join(outdir, f"{prefix}_metrics_table_mqc.json"), "w") as f: @@ -439,6 +430,47 @@ def write_multiqc_json_metrices( with open(os.path.join(outdir, "model_eval_metrics_mqc.json"), "w") as f: json.dump(metrics_block, f, indent=2) + # --- Block 5: Pairwise significance (Phase 4) --- + # Built only when every model carries bootstrap samples. Uses the unpaired + # stochastic-dominance approximation documented on paired_bootstrap_diff. + sample_models = sorted( + m for m in metrics_aucap if "ROC_AUC_SAMPLES" in metrics_aucap[m] + ) + if len(sample_models) >= 2: + pairwise_id = f"{prefix}_pairwise_significance" + pair_data = {} + for a in sample_models: + row = {} + for b in sample_models: + if a == b: + row[b] = "—" + continue + p = paired_bootstrap_diff( + metrics_aucap[a]["ROC_AUC_SAMPLES"], + metrics_aucap[b]["ROC_AUC_SAMPLES"], + ) + row[b] = f"{p:.3f}" + pair_data[a] = row + pairwise_block = { + "id": pairwise_id, + "section_name": "Pairwise significance (ROC AUC)", + "plot_type": "table", + "pconfig": { + "id": pairwise_id, + "title": "Pairwise significance — ROC AUC (two-sided p-value)", + "col1_header": "Model", + }, + "data": pair_data, + } + with open( + os.path.join(outdir, f"{prefix}_pairwise_significance_mqc.json"), "w" + ) as f: + json.dump(pairwise_block, f, indent=2) + logging.info( + "[OK] Wrote pairwise significance block " + f"({len(sample_models)} models)" + ) + logging.info("[OK] Wrote MultiQC JSON (ROC + PR)") @@ -517,6 +549,7 @@ def pick(regex): roc_blocks = pick(r"_roc$") pr_blocks = pick(r"_pr$") metric_blocks = pick(r"model_eval_metrics$") + pairwise_blocks = pick(r"_pairwise_significance$") db_blocks = pick(r"database_analysis") degree_blocks = pick(r"_degree_distribution") @@ -527,6 +560,7 @@ def pick(regex): white_tbl + metric_blocks + auc_ap_tbl + + pairwise_blocks + roc_blocks + pr_blocks + db_blocks @@ -724,7 +758,7 @@ def main(): print(f"[INFO] Database analysis completed for: {db_name}") # Part4: MultiQC JSON - write_multiqc_json_metrices( + write_multiqc_json_metrics( metrics_auc_pr, roc_curves, pr_curves, diff --git a/bin/eval_multiqc_functions.py b/bin/eval_multiqc_functions.py index 1c62c1c..703ba97 100644 --- a/bin/eval_multiqc_functions.py +++ b/bin/eval_multiqc_functions.py @@ -20,6 +20,75 @@ import networkx as nx import logging + +### Bootstrap helpers ### + + +def bootstrap_metric( + y_true, + y_score, + metric_fn, + n_resamples: int = 1000, + ci: float = 0.95, + seed: int = 42, +): + """Bootstrap a binary-classification ranking metric. + + Returns (point, lo, hi, samples). `samples` is the array of per-resample + metric values, kept so downstream code can run pairwise comparisons without + re-running the bootstrap. Identical `seed` across models means the resample + *strategy* is identical even though the indices differ in length when + models cover different test rows. + """ + y_true = np.asarray(y_true) + y_score = np.asarray(y_score) + n = len(y_true) + rng = np.random.default_rng(seed) + samples = np.empty(n_resamples, dtype=np.float64) + for i in range(n_resamples): + idx = rng.integers(0, n, size=n) + yt = y_true[idx] + ys = y_score[idx] + # Degenerate resamples (single class) make ranking metrics undefined. + if yt.min() == yt.max(): + samples[i] = np.nan + continue + samples[i] = float(metric_fn(yt, ys)) + valid = samples[~np.isnan(samples)] + if valid.size == 0: + point = float(metric_fn(y_true, y_score)) + return point, point, point, samples + alpha = (1.0 - ci) / 2.0 + lo = float(np.quantile(valid, alpha)) + hi = float(np.quantile(valid, 1.0 - alpha)) + point = float(metric_fn(y_true, y_score)) + return point, lo, hi, samples + + +def paired_bootstrap_diff(samples_a, samples_b) -> float: + """Two-sided p-value that metric_a differs from metric_b. + + Operates on the per-resample arrays returned by `bootstrap_metric`. When the + arrays come from independent draws (the scatter-evaluation case here, where + each per-model JSON is built without cross-model index alignment), this is + an unpaired stochastic-dominance approximation rather than the textbook + paired-bootstrap test. It is still informative for ordering models; treat + p-values as indicative rather than calibrated. + """ + a = np.asarray(samples_a, dtype=np.float64) + b = np.asarray(samples_b, dtype=np.float64) + a = a[~np.isnan(a)] + b = b[~np.isnan(b)] + if a.size == 0 or b.size == 0: + return float("nan") + n = min(a.size, b.size) + diff = a[:n] - b[:n] + # Two-sided: smaller tail mass × 2. + p_pos = float((diff <= 0).mean()) + p_neg = float((diff >= 0).mean()) + return float(min(1.0, 2.0 * min(p_pos, p_neg))) + + ### Functions to load data from the database ### @@ -320,7 +389,8 @@ def aggregate_per_model_metrics(per_model_files): rows in the reducer task. Each JSON file: {model_name, samples, metrics_summary, roc:[[fp,tp]...], - pr:[[recall,precision]...], roc_auc, pr_ap} + pr:[[recall,precision]...], roc_auc, pr_ap, + roc_auc_ci, pr_ap_ci, roc_auc_samples, pr_ap_samples} """ metrics_rows = [] metrics_aucap = {} @@ -332,10 +402,21 @@ def aggregate_per_model_metrics(per_model_files): obj = json.load(fh) name = obj["model_name"] metrics_rows.append(obj["metrics_summary"]) - metrics_aucap[name] = { + entry = { "ROC_AUC": obj["roc_auc"], "PR_AP": obj["pr_ap"], } + # Optional Phase 4 fields — missing on legacy JSONs from before the + # bootstrap rollout, so guard with .get(). + if "roc_auc_ci" in obj: + entry["ROC_AUC_CI"] = obj["roc_auc_ci"] + if "pr_ap_ci" in obj: + entry["PR_AP_CI"] = obj["pr_ap_ci"] + if "roc_auc_samples" in obj: + entry["ROC_AUC_SAMPLES"] = obj["roc_auc_samples"] + if "pr_ap_samples" in obj: + entry["PR_AP_SAMPLES"] = obj["pr_ap_samples"] + metrics_aucap[name] = entry roc_pairs = obj["roc"] pr_pairs = obj["pr"] roc_curves[name] = ( diff --git a/bin/eval_one.py b/bin/eval_one.py index 9888c89..8a7242a 100755 --- a/bin/eval_one.py +++ b/bin/eval_one.py @@ -19,10 +19,14 @@ from sklearn.metrics import ( auc, average_precision_score, + matthews_corrcoef, precision_recall_curve, + roc_auc_score, roc_curve, ) +from eval_multiqc_functions import bootstrap_metric + def _read_predictions(path: str) -> pd.DataFrame: if path.endswith(".parquet") or path.endswith(".pq"): @@ -59,6 +63,11 @@ def _safe(num, den): if (Precision + TPR) else float("nan") ) + # MCC: returns 0 when one predicted class is absent (sklearn convention). + try: + MCC = float(matthews_corrcoef(y_true, y_pred)) + except Exception: + MCC = float("nan") return { "Model": model_name, "Samples": total, @@ -72,6 +81,7 @@ def _safe(num, den): "Precision": Precision, "Balanced Accuracy": BA, "F1 Score": F1, + "MCC": MCC, } @@ -90,6 +100,9 @@ def main(): p.add_argument("--model_name", required=True) p.add_argument("--out", required=True) p.add_argument("--max_curve_points", type=int, default=600) + p.add_argument("--bootstrap_n", type=int, default=1000, + help="Bootstrap resamples for ROC_AUC / PR_AP confidence intervals.") + p.add_argument("--bootstrap_seed", type=int, default=42) p.add_argument( "--id", dest="run_id", default=None, help="Optional run ID (logged only)." @@ -112,6 +125,18 @@ def main(): precision, recall, _ = precision_recall_curve(y_true_clean, y_score_clean) pr_ap = float(average_precision_score(y_true_clean, y_score_clean)) + # Phase 4: bootstrap CIs for the two ranking metrics. Cheap (<5s on 30k + # samples × 1000 resamples) and gives `0.847 [0.821, 0.872]` style cells + # in the overview table. + _, roc_lo, roc_hi, roc_samples = bootstrap_metric( + y_true_clean, y_score_clean, roc_auc_score, + n_resamples=args.bootstrap_n, seed=args.bootstrap_seed, + ) + _, pr_lo, pr_hi, pr_samples = bootstrap_metric( + y_true_clean, y_score_clean, average_precision_score, + n_resamples=args.bootstrap_n, seed=args.bootstrap_seed, + ) + fp_d, tp_d = _downsample_curve(fp.astype(np.float32), tp.astype(np.float32), args.max_curve_points) rec_d, prec_d = _downsample_curve(recall.astype(np.float32), @@ -125,14 +150,23 @@ def main(): "pr": [[float(x), float(y)] for x, y in zip(rec_d, prec_d)], "roc_auc": roc_auc, "pr_ap": pr_ap, + "roc_auc_ci": [roc_lo, roc_hi], + "pr_ap_ci": [pr_lo, pr_hi], + # Per-resample arrays are kept so eval_multiqc.py can run pairwise + # comparisons without re-reading predictions. ~8KB per metric per model. + "roc_auc_samples": [float(v) for v in roc_samples], + "pr_ap_samples": [float(v) for v in pr_samples], } out = Path(args.out) out.parent.mkdir(parents=True, exist_ok=True) with out.open("w", encoding="utf-8") as fh: json.dump(payload, fh) - print(f"[eval_one] wrote {out} (model={args.model_name}, " - f"ROC_AUC={roc_auc:.4f}, PR_AP={pr_ap:.4f})") + print( + f"[eval_one] wrote {out} (model={args.model_name}, " + f"ROC_AUC={roc_auc:.4f} [{roc_lo:.4f},{roc_hi:.4f}], " + f"PR_AP={pr_ap:.4f} [{pr_lo:.4f},{pr_hi:.4f}])" + ) if __name__ == "__main__": diff --git a/bin/features/dummy.py b/bin/features/dummy.py index 6dce52b..5ae8917 100644 --- a/bin/features/dummy.py +++ b/bin/features/dummy.py @@ -1,7 +1,47 @@ #!/usr/bin/env python3 +"""Dummy feature encoder. + +Emits a 512-dim random vector per (domain, protein) pair. 512 chosen +as a mid-point of real embedding dimensionalities (ESM/ProtT5 range +~480-2560), so the dummy carries comparable input shape to real +encoders without leaking any signal. + +Rationale: a constant zero vector causes classifiers to collapse to +the majority class. Per-row random values produce unique inputs; +AUROC converges to ~0.5 (the true sanity floor). If a real model +fails to beat this, something is wrong with the data or training loop. +""" import h5py +import numpy as np +import pandas as pd import sqlite3 +DUMMY_DIM = 512 + +_RNG = np.random.default_rng() + def extract_features(conn: sqlite3.Connection, out_file: h5py.File): - pass + domain_protein_df = pd.read_sql( + """ + SELECT domain_id, protein_id + FROM domain_protein_map; + """, + conn, + ) + + domain_protein_df["domain_id"] = domain_protein_df["domain_id"].astype(str) + domain_protein_df["protein_id"] = domain_protein_df["protein_id"].astype(str) + + for domain_id, protein_id in domain_protein_df.itertuples(index=False): + if domain_id not in out_file: + pfam_group = out_file.create_group(domain_id) + else: + pfam_group = out_file[domain_id] + + pfam_group[protein_id] = _RNG.standard_normal(DUMMY_DIM, dtype=np.float32) + + print( + f"dummy: wrote {len(domain_protein_df)} (domain, protein) entries x " + f"{DUMMY_DIM}-dim random vectors" + ) diff --git a/bin/features/new_feature.py b/bin/features/new_feature.py new file mode 100644 index 0000000..23edd8c --- /dev/null +++ b/bin/features/new_feature.py @@ -0,0 +1,66 @@ +#!/usr/bin/env python3 +"""Template for adding a new feature encoding to the benchmark pipeline. + +Steps to add a new feature: +1. Copy this file to bin/features/.py +2. Implement extract_features() below +3. Add '' to params.machine_learning_features in nextflow.config +4. If your feature needs GPU or large memory, also add it to params.large_features + +The pipeline auto-discovers features by name: extract_features.py calls +importlib.import_module(f"features.{feature_name}").extract_features(conn, out_file). + +Database schema (domain_protein_map table): + domain_id TEXT -- Pfam domain ID (e.g. PF00001) + protein_id TEXT -- UniProt protein ID (e.g. P12345) + domain_sequence TEXT -- amino acid sequence of the domain + start_pos INT -- domain start position in protein sequence + end_pos INT -- domain end position in protein sequence + (+ embedding columns like esm3_per_domain, esmc_per_residue, etc.) + +HDF5 output structure (required by downstream ML models): + // = numpy array of shape (feature_dim,) + +See aacomp.py for a minimal example, embeddings.py for pre-computed +embedding extraction with helper utilities. +""" + +import h5py +import numpy as np +import pandas as pd +import sqlite3 + + +def extract_features(conn: sqlite3.Connection, out_file: h5py.File): + """Extract features from the database and write them to the HDF5 file. + + Args: + conn: SQLite connection to one of train.sqlite3 / test.sqlite3 / + optimization.sqlite3. Read-only — do not write. + out_file: Writable HDF5 file. Write one dataset per (domain, protein) + pair, grouped by domain_id. + """ + domain_protein_df = pd.read_sql( + """ + SELECT domain_id, protein_id, UPPER(domain_sequence) AS sequence + FROM domain_protein_map; + """, + conn, + ) + + domain_protein_df["domain_id"] = domain_protein_df["domain_id"].astype(str) + domain_protein_df["protein_id"] = domain_protein_df["protein_id"].astype(str) + + for domain_id, protein_id, sequence in domain_protein_df.itertuples(index=False): + # --- Replace this block with your feature computation --- + feature_vector = np.zeros(128, dtype=np.float32) # placeholder + # -------------------------------------------------------- + + if domain_id not in out_file: + pfam_group = out_file.create_group(domain_id) + else: + pfam_group = out_file[domain_id] + + pfam_group[protein_id] = feature_vector + + print(f"new_feature: wrote {len(domain_protein_df)} entries") diff --git a/bin/kgiddi.py b/bin/kgiddi.py index a8dcdfe..82e7294 100644 --- a/bin/kgiddi.py +++ b/bin/kgiddi.py @@ -11,13 +11,71 @@ import gc import psutil import logging -import math import sys -from concurrent.futures import ProcessPoolExecutor, as_completed +import multiprocessing as mp +from concurrent.futures import ProcessPoolExecutor, wait, FIRST_COMPLETED from tqdm import tqdm +# Force spawn-based workers. The default 'fork' start method has each worker +# inherit the parent's entire memory image via copy-on-write; once workers +# start mutating Python objects the shared pages get duplicated and the per- +# worker RSS balloons toward the parent's footprint. With ~16 GB resident +# after preprocessing and several workers, that previously tripped a cgroup +# OOM kill mid-pool and surfaced as `BrokenProcessPool` in network_expansion. +# Spawn workers start clean and only receive the pickled task args, keeping +# each worker's RSS bounded to the size of the subgraph it processes. +_MP_CTX = mp.get_context("spawn") + + +def _reset_resource_tracker(): + """Force-recreate the multiprocessing resource_tracker daemon. + + Under spawn, multiprocessing maintains a singleton resource_tracker + subprocess to clean up shared semaphores. On long-running parents + (kgiddi runs 4-7h under SLURM), that daemon can be reaped by the + container/cgroup or have its pipe closed silently. The next + ``ProcessPoolExecutor`` then crashes inside ``Queue.__init__`` with + ``BrokenPipeError`` from ``resource_tracker._send`` because the + cached fd points at a dead reader. Clearing ``_fd``/``_pid`` lets + the next ``ensure_running()`` call respawn a fresh daemon. + """ + try: + from multiprocessing import resource_tracker as _rt + rt = _rt._resource_tracker + with rt._lock: + if rt._fd is not None: + try: + os.close(rt._fd) + except OSError: + pass + rt._fd = None + rt._pid = None + except Exception: + pass + + +def _new_pool(threads): + """Construct ``ProcessPoolExecutor`` with resource_tracker recovery. + + Wraps construction so that a stale resource_tracker daemon (see + :func:`_reset_resource_tracker`) is detected by catching + ``BrokenPipeError``/``OSError`` from ``Queue``/``Lock`` init, then + we clear the tracker state and retry once. A second failure is + surfaced as before. + """ + try: + return ProcessPoolExecutor(max_workers=threads, mp_context=_MP_CTX) + except (BrokenPipeError, OSError) as err: + logging.warning( + "ProcessPoolExecutor init failed (%s); resetting " + "resource_tracker daemon and retrying once.", + err, + ) + _reset_resource_tracker() + return ProcessPoolExecutor(max_workers=threads, mp_context=_MP_CTX) + from kgiddi_functions import ( approx_bimax, select_best_ddis_per_group, @@ -66,10 +124,17 @@ def connected(self, x, y): def log_resource_usage(note=""): - process = psutil.Process(os.getpid()) - mem_mb = process.memory_info().rss / 1024 / 1024 - cpu = process.cpu_percent(interval=0.1) - logging.info(f"{note} | Memory usage: {mem_mb:.2f} MB | CPU: {cpu:.1f}%") + # Wrapped: in container teardown (SLURM SIGTERM, cgroup cleanup) /proc + # entries may disappear before Python exits, making psutil raise + # NoSuchProcess on our own pid. Swallow so the real cause (e.g. SLURM + # timeout exit 140) surfaces instead of a misleading psutil traceback. + try: + process = psutil.Process(os.getpid()) + mem_mb = process.memory_info().rss / 1024 / 1024 + cpu = process.cpu_percent(interval=0.1) + logging.info(f"{note} | Memory usage: {mem_mb:.2f} MB | CPU: {cpu:.1f}%") + except (psutil.Error, OSError) as e: + logging.warning(f"{note} | resource sampling failed: {e}") def summary_stats(data): @@ -126,18 +191,34 @@ def functionally_similar(protein_distances, ppi1, ppi2, threshold): ) +_MAX_SUBGRAPH_NODES = 800 +_MAX_SUBGRAPH_DENSE_NODES = 100 +_MAX_SUBGRAPH_DENSITY = 50 # edges per node + + def process_go_term(go_term, subgraph, bicluster_cutoff): # Log the GO term and subgraph size # logging.info(f"Processing GO term: {go_term} | Nodes: {subgraph.number_of_nodes()} | Edges: {subgraph.number_of_edges()}") local_predicted = set() - if len(subgraph.nodes) == 1: + n = subgraph.number_of_nodes() + if n == 1: local_predicted.update(subgraph.edges) else: + # Pathological cases: very large or dense subgraphs make approx_bimax + # blow up combinatorially in bicluster search / conquer() recursion. + # Skip them — benchmark-only output stability means dropping a handful + # of mega-subgraphs is acceptable, and a multi-hour single subgraph is + # what was causing the graph_model tasks to wedge. + e = subgraph.number_of_edges() + if n > _MAX_SUBGRAPH_NODES or ( + n > _MAX_SUBGRAPH_DENSE_NODES and e / max(n, 1) > _MAX_SUBGRAPH_DENSITY + ): + logging.warning( + f"skip pathological subgraph go={go_term} nodes={n} edges={e}" + ) + return go_term, local_predicted nodes = list(subgraph.nodes) A = nx.to_numpy_array(subgraph, nodelist=nodes) - # Pathological cases: very large or dense subgraphs, or subgraphs with highly connected nodes, - # can cause approx_bimax to run extremely slowly or use excessive memory due to - # combinatorial explosion in bicluster search or deep recursion in conquer(). biclusters = approx_bimax(A, b=bicluster_cutoff) for bicluster in biclusters: row_indices, col_indices = bicluster @@ -156,6 +237,7 @@ def process_go_term(go_term, subgraph, bicluster_cutoff): def evaluate_params( args, + threads: int = 1, ): # -> tuple[float | Literal[0], Any, Any, float | Literal[0]]: ( chi_square_cutoff, @@ -179,7 +261,7 @@ def evaluate_params( for entry in group_ddis } predicted_ddis, fold, fp_rate = network_expansion( - shared_go_domains, ddi_network_edges, known_ddis, all_domains, bicluster_cutoff + shared_go_domains, ddi_network_edges, known_ddis, all_domains, bicluster_cutoff, threads ) # Save predicted DDIs for this training step with open( @@ -198,6 +280,7 @@ def network_expansion( known_ddis: set[tuple[str, str]], all_domains: set[str], bicluster_cutoff: float, + threads: int = 1, ): go_ddi_subgraphs = extract_go_guided_ddi_subgraphs( @@ -208,30 +291,42 @@ def network_expansion( predicted_ddis = set() debug_lines = [] - # Chunked processing for memory efficiency + # Single pool with bounded in-flight submission. Previously a fresh + # ProcessPoolExecutor was instantiated per 100-item chunk; with the spawn + # start method, ~18 workers paid ~1-2s spawn cost on every chunk boundary, + # which on the test database (24+ chunks) burned 10-15 min before doing + # any useful work. One pool + an in-flight cap of 2x threads keeps the + # memory profile the chunking was originally there for. go_items = list(go_ddi_subgraphs.items()) - chunk_size = 100 - total = len(go_ddi_subgraphs) - num_chunks = math.ceil(total / chunk_size) - processed = 0 - for chunk_idx in range(num_chunks): - chunk = go_items[processed : processed + chunk_size] - with ProcessPoolExecutor(max_workers=os.cpu_count()) as executor: - futures = { - executor.submit( - process_go_term, go_term, subgraph, bicluster_cutoff - ): go_term - for go_term, subgraph in chunk - } - for future in as_completed(futures): - go_term, local_predicted = future.result() + total = len(go_items) + in_flight_cap = max(2 * threads, 4) + log_every = max(in_flight_cap, 100) + completed = 0 + with _new_pool(threads) as executor: + in_flight = {} + + def _drain_one(): + nonlocal completed + done, _pending = wait(in_flight, return_when=FIRST_COMPLETED) + for fut in done: + go_term, local_predicted = fut.result() predicted_ddis.update(local_predicted) debug_lines.append(f"{go_term}\t{len(local_predicted)}\n") - processed += len(chunk) - logging.info( - f"Processed chunk {chunk_idx + 1}/{num_chunks} ({processed}/{total} GO terms)" - ) - log_resource_usage(f"After chunk {chunk_idx + 1}") + in_flight.pop(fut) + completed += 1 + if completed % log_every == 0: + logging.info(f"Processed {completed}/{total} GO terms") + + for go_term, subgraph in go_items: + while len(in_flight) >= in_flight_cap: + _drain_one() + fut = executor.submit( + process_go_term, go_term, subgraph, bicluster_cutoff + ) + in_flight[fut] = go_term + while in_flight: + _drain_one() + log_resource_usage(f"After processing all {total} GO terms") with open("debug_go_predicted_ddis.txt", "a") as f: f.writelines( @@ -349,29 +444,55 @@ def preprocessing( } protein_go_terms = permuted_protein_go_terms - # Now I have the mapping of proteins to their most specific go terms in protein_go_terms - proteins = np.array(list(protein_go_terms.keys())) - n_proteins = len(proteins) - protein_distances = {} - # Precompute all pairs indices - idx_i, idx_j = np.triu_indices(n_proteins, k=1) - for i, j in zip(idx_i, idx_j): - p1 = proteins[i] - p2 = proteins[j] - terms1 = protein_go_terms[p1] - terms2 = protein_go_terms[p2] - t1_arr = np.array(terms1) - t2_arr = np.array(terms2) - # Build distance matrix for all pairs (t1, t2) - dists = np.full((len(t1_arr), len(t2_arr)), np.inf) - for m, t1 in enumerate(t1_arr): - for n, t2 in enumerate(t2_arr): - d = all_pairs_shortest_path_length.get(t1, {}).get(t2, np.inf) - dists[m, n] = d - min_dist = np.min(dists) if dists.size > 0 else np.inf - protein_distances[tuple(sorted((p1, p2)))] = ( - min_dist if min_dist != np.inf else None + # Vectorized + sparsified protein_distances construction. + # + # Legacy version materialized a dict of N*(N-1)/2 entries (~44 M for + # n_proteins≈9400) using a triple Python loop over GO term lookups. New + # version: + # 1. Builds a single GO-term-id matrix D once from the precomputed + # all_pairs_shortest_path_length (one pass, no per-pair dict lookups). + # 2. Per-pair min distance via numpy slice (D[ti[:,None], tj[None,:]]). + # 3. Emits only finite distances — pairs with no GO-graph path between + # their most-specific terms never reach the consumer, which used to + # get None and treat it as not-similar anyway. + used_go_terms = sorted( + {t for terms in protein_go_terms.values() for t in terms} + ) + go_id = {t: i for i, t in enumerate(used_go_terms)} + n_go = len(used_go_terms) + D = np.full((n_go, n_go), np.inf, dtype=np.float32) + for source, targets in all_pairs_shortest_path_length.items(): + si = go_id.get(source) + if si is None: + continue + for target, dist in targets.items(): + ti_ = go_id.get(target) + if ti_ is not None: + D[si, ti_] = dist + protein_term_ids = { + p: np.fromiter( + (go_id[t] for t in terms if t in go_id), + dtype=np.intp, + count=sum(1 for t in terms if t in go_id), ) + for p, terms in protein_go_terms.items() + } + proteins_list = list(protein_term_ids.keys()) + n_proteins = len(proteins_list) + protein_distances = {} + for i in range(n_proteins): + ti = protein_term_ids[proteins_list[i]] + if ti.size == 0: + continue + p_i = proteins_list[i] + for j in range(i + 1, n_proteins): + tj = protein_term_ids[proteins_list[j]] + if tj.size == 0: + continue + min_dist = float(D[ti[:, None], tj[None, :]].min()) + if np.isfinite(min_dist): + p_j = proteins_list[j] + protein_distances[tuple(sorted((p_i, p_j)))] = min_dist # Creation of ppi_list ppi1s = ppi_df["protein_1"].values @@ -430,13 +551,66 @@ def ppi_node_degree(ppi): uf = UnionFind(ppi_nodes) clusters_dict = defaultdict(set) - # Cluster PPIs with Union-Find - for i, ppi1 in enumerate(ppi_nodes): - for j in range(i + 1, len(ppi_nodes)): - ppi2 = ppi_nodes[j] - if not uf.connected(ppi1, ppi2): - if functionally_similar(protein_distances, ppi1, ppi2, threshold): - uf.union(ppi1, ppi2) + if os.environ.get("KGIDDI_LEGACY_BUILD"): + # Legacy O(N^2) all-pairs scan. Kept as a parity escape hatch — set + # KGIDDI_LEGACY_BUILD=1 to compare against the inverted-index path on + # the same inputs. + logging.warning( + "KGIDDI_LEGACY_BUILD set — using O(N^2) all-pairs Union-Find loop" + ) + for i, ppi1 in enumerate(ppi_nodes): + for j in range(i + 1, len(ppi_nodes)): + ppi2 = ppi_nodes[j] + if not uf.connected(ppi1, ppi2): + if functionally_similar( + protein_distances, ppi1, ppi2, threshold + ): + uf.union(ppi1, ppi2) + else: + # Inverted-index path: enumerate only candidate similar PPIs via a + # protein-keyed index. Drops the 2.25e10 pair scan that previously + # took 6-7 hours per kgiddi task. + # + # Similarity: ppi(a,b) ~ ppi(c,d) iff (c in close[a] AND d in close[b]) + # OR (c in close[b] AND d in close[a]) + # where close[p] = {q : protein_distances[(p,q)] <= threshold}. + close = defaultdict(set) + for (p, q), d in protein_distances.items(): + if d is not None and d <= threshold: + close[p].add(q) + close[q].add(p) + + ppi_by_protein = defaultdict(list) + for idx, (a, b) in enumerate(ppi_nodes): + ppi_by_protein[a].append(idx) + ppi_by_protein[b].append(idx) + + for i, (a, b) in enumerate(ppi_nodes): + close_a = close.get(a, ()) + close_b = close.get(b, ()) + candidates = set() + # Rule 1: c ~ a AND d ~ b + for c in close_a: + for j in ppi_by_protein.get(c, ()): + if j <= i: + continue + x, y = ppi_nodes[j] + other = y if x == c else x + if other in close_b: + candidates.add(j) + # Rule 2: c ~ b AND d ~ a + for c in close_b: + for j in ppi_by_protein.get(c, ()): + if j <= i: + continue + x, y = ppi_nodes[j] + other = y if x == c else x + if other in close_a: + candidates.add(j) + for j in candidates: + if not uf.connected(ppi_nodes[i], ppi_nodes[j]): + uf.union(ppi_nodes[i], ppi_nodes[j]) + for ppi in ppi_nodes: clusters_dict[uf.find(ppi)].add(ppi) clusters = [] @@ -466,7 +640,7 @@ def ppi_node_degree(ppi): return connected_components, group_ddi_chi2 -def run_kgiddi(database_path, params_file, out_dir, out_predictions): +def run_kgiddi(database_path, params_file, out_dir, out_predictions, threads=1): db_train = Path(os.path.join(database_path, "train.sqlite3")) db_test = Path(os.path.join(database_path, "test.sqlite3")) @@ -553,7 +727,7 @@ def run_kgiddi(database_path, params_file, out_dir, out_predictions): for args in tqdm( param_grid, total=len(param_grid), desc="Parameter grid search" ): - results.append(evaluate_params(args)) + results.append(evaluate_params(args, threads)) for fold, chi_square_cutoff, bicluster_cutoff, fp_rate in results: if fold > best_fold: @@ -627,6 +801,7 @@ def run_kgiddi(database_path, params_file, out_dir, out_predictions): known_ddis_test, all_domains_test, best_params["bicluster_cutoff"], + threads, ) logging.info( diff --git a/bin/load_data_gm.py b/bin/load_data_gm.py index cb1ce50..4a71900 100644 --- a/bin/load_data_gm.py +++ b/bin/load_data_gm.py @@ -44,20 +44,6 @@ def load_ddi(path_to_database: Path) -> pd.DataFrame: return ddi_final -def load_ddi_new(path_to_database: Path) -> pd.DataFrame: - with sqlite3.connect(path_to_database) as conn: - ddi_df = pd.read_sql( - """ - SELECT domain_id_a AS domain_1, domain_id_b AS domain_2, - NOT negative AS interaction - FROM domain_domain_interaction - WHERE is_evaluation_relevant; - """, - conn, - ) - - return ddi_df - def load_pd_mapping(path_to_database: Path) -> pd.DataFrame: """ @@ -111,136 +97,3 @@ def load_pgo(path_to_database: Path) -> pd.DataFrame: return protein_go_terms -def create_debug_dataset(ddi_df, pd_mapping, ppi_df) -> tuple: - """ - Create a small debug dataset by sampling a subset of the data. - """ - - # Number of negative samples in DDI - n_negative_samples = ddi_df[ddi_df["interaction"] == 0].shape[0] - debug_ddi = pd.concat( - [ - ddi_df[ddi_df["interaction"] == 1].sample( - n=n_negative_samples, random_state=42 - ), - ddi_df[ddi_df["interaction"] == 0].sample( - n=n_negative_samples // 10, random_state=42 - ), - ] - ).reset_index(drop=True) - involved_domains = set(debug_ddi["domain_a"]).union(set(debug_ddi["domain_b"])) - debug_pd_mapping = pd_mapping[ - pd_mapping["pfam_id"].isin(involved_domains) - ].reset_index(drop=True) - - involved_proteins = set(debug_pd_mapping["uniprot_id"]) - debug_ppi = ppi_df[ - ppi_df["protein_1"].isin(involved_proteins) - & ppi_df["protein_2"].isin(involved_proteins) - ].reset_index(drop=True) - - # Repeat occurring DDIS by repeating rows with the same amount as in the original dataset - original_ddi_counts = ( - ddi_df.groupby(["domain_a", "domain_b"]).size().reset_index(name="counts") - ) - debug_ddi = debug_ddi.merge( - original_ddi_counts, on=["domain_a", "domain_b"], how="left" - ) - debug_ddi = debug_ddi.loc[debug_ddi.index.repeat(debug_ddi["counts"])].reset_index( - drop=True - ) - debug_ddi = debug_ddi.drop(columns=["counts"]) - - return debug_ddi, debug_pd_mapping, debug_ppi - - -def create_debug_dataset_kgiddi( - ddi_df, pd_mapping, ppi_df, pgo_df, go_graph_nx -) -> tuple: - """ - Create a small debug dataset by sampling a subset of the data, ensuring all proteins have GO terms in the GO graph. - """ - - # 1. Filter PGO for GO terms present in the GO graph - valid_go_terms = set(go_graph_nx.nodes()) - pgo_valid = pgo_df[pgo_df["go_accession"].isin(valid_go_terms)].reset_index( - drop=True - ) - - # 2. Find proteins with at least one valid GO term - valid_proteins = set(pgo_valid["uniprot_id"]) - - # 3. Restrict PD mapping to these proteins - pd_mapping_valid = pd_mapping[ - pd_mapping["uniprot_id"].isin(valid_proteins) - ].reset_index(drop=True) - - # 4. Restrict PPI to these proteins - ppi_valid = ppi_df[ - ppi_df["protein_1"].isin(valid_proteins) - & ppi_df["protein_2"].isin(valid_proteins) - ].reset_index(drop=True) - - # 5. Get all domains present in the valid PD mapping - valid_domains = set(pd_mapping_valid["pfam_id"]) - - # 6. Restrict DDI to these domains - ddi_valid = ddi_df[ - ddi_df["domain_a"].isin(valid_domains) & ddi_df["domain_b"].isin(valid_domains) - ].reset_index(drop=True) - - # 7. Sample DDIs for debug set - n_negative_samples = ddi_valid[ddi_valid["interaction"] == 0].shape[0] - # As the number of samples is now likely smaller I multiply by 2 - n_negative_samples *= 2 - - debug_ddi = pd.concat( - [ - ddi_valid[ddi_valid["interaction"] == 1].sample( - n=min( - n_negative_samples, - ddi_valid[ddi_valid["interaction"] == 1].shape[0], - ), - random_state=42, - ), - ddi_valid[ddi_valid["interaction"] == 0].sample( - n=max(1, n_negative_samples // 10), random_state=42 - ), - ] - ).reset_index(drop=True) - - # 8. Get domains in debug_ddi - debug_domains = set(debug_ddi["domain_a"]).union(set(debug_ddi["domain_b"])) - - # 9. Restrict PD mapping to these domains - debug_pd_mapping = pd_mapping_valid[ - pd_mapping_valid["pfam_id"].isin(debug_domains) - ].reset_index(drop=True) - - # 10. Restrict PPI to proteins in debug_pd_mapping - debug_proteins = set(debug_pd_mapping["uniprot_id"]) - debug_ppi = ppi_valid[ - ppi_valid["protein_1"].isin(debug_proteins) - & ppi_valid["protein_2"].isin(debug_proteins) - ].reset_index(drop=True) - - # 11. Restrict PGO to proteins in debug_pd_mapping - debug_pgo_df = pgo_valid[pgo_valid["uniprot_id"].isin(debug_proteins)].reset_index( - drop=True - ) - - return debug_ddi, debug_pd_mapping, debug_ppi, debug_pgo_df - - -if __name__ == "__main__": - # Testing - # Check the two load ddi functions - database_path = Path( - "/nfs/data/CoBiNet_Masterpraktikum/databases/random_ddi/test.sqlite3" - ) - ddi1 = load_ddi(database_path) - ddi2 = load_ddi_new(database_path) - print(f"DDI1 shape: {ddi1.shape}, DDI2 shape: {ddi2.shape}") - # Write both to tmp csv files - ddi1.to_csv("tmp/tmp_ddi1.csv", index=False) - ddi2.to_csv("tmp/tmp_ddi2.csv", index=False) diff --git a/bin/machine_learning.py b/bin/machine_learning.py index b35e158..6b4a423 100755 --- a/bin/machine_learning.py +++ b/bin/machine_learning.py @@ -1,48 +1,28 @@ #!/usr/bin/env python3 -import itertools +"""Shared infrastructure for DDI ML models. -import math +Provides data loading, caching, evaluation utilities, and the DDIModelTrainer +base class that neural_network.py and random_forest.py extend. +""" import argparse -import h5py +import collections +import gc +import itertools import json +import math +import random + +import h5py import numpy as np import pandas as pd -import random +from abc import ABC, abstractmethod from contextlib import ExitStack from pathlib import Path -from sklearn.metrics import precision_recall_curve +from sklearn.metrics import matthews_corrcoef from sklearn.model_selection import RandomizedSearchCV, PredefinedSplit -from skorch import NeuralNetBinaryClassifier -from skorch.callbacks import EarlyStopping from typing import List -import gc - -import torch -torch.backends.cuda.matmul.allow_tf32 = True -torch.backends.cudnn.allow_tf32 = True - - -class MLPModule(torch.nn.Module): - def __init__( - self, input_size: int, hidden_layer_sizes: List[int] = [], dropout_rate=0.5 - ): - layer_sizes = ( - [input_size] + hidden_layer_sizes + [1] - ) # Ensure input and output sizes are correct - super(MLPModule, self).__init__() - layers = [] - for i in range(len(layer_sizes) - 1): - layers.append(torch.nn.Linear(layer_sizes[i], layer_sizes[i + 1])) - if i < len(layer_sizes) - 2: - layers.append(torch.nn.ReLU()) - layers.append(torch.nn.Dropout(dropout_rate)) - self.layers = torch.nn.Sequential(*layers) - - def forward(self, x): - return self.layers(x) - interaction_encodings = ["protdcal"] @@ -53,8 +33,6 @@ def forward(self, x): # Now: keep at most one entry. Outer-loop callers either reload (cheap with # h5py memmap) or get a hit when the SAME key repeats inside one inner loop # (e.g. final refit immediately after the same grid iter). -import collections - _LOAD_CACHE_MAX = 1 _load_cache: "collections.OrderedDict" = collections.OrderedDict() @@ -64,6 +42,53 @@ def clear_load_cache() -> None: _load_cache.clear() +def _aggregate_to_ddi_level(ddi_pairs, y_true, y_score): + """Collapse per-protein-pair predictions to one row per DDI pair. + + Aggregation: mean predicted probability over all protein instantiations of + each (domain_a, domain_b) pair; true label is constant per pair so 'first' + is exact. + """ + df = pd.DataFrame(ddi_pairs, columns=["domain_a", "domain_b"]) + df["true_interaction"] = np.asarray(y_true).astype(np.int8) + df["predicted_probability"] = np.asarray(y_score).astype(np.float32) + return ( + df.groupby(["domain_a", "domain_b"], sort=False) + .agg(true_interaction=("true_interaction", "first"), + predicted_probability=("predicted_probability", "mean")) + .reset_index() + ) + + +def _tune_threshold_mcc(y_true, y_score, n_candidates: int = 200): + """Pick the decision threshold that maximises Matthews correlation. + + MCC is class-balance invariant, so the result generalises across test priors + that differ from the optimisation set (avoiding the F1-induced positive-bias + that previously collapsed predictions to the majority class). + + Returns (threshold, mcc_at_threshold). + """ + y_true = np.asarray(y_true).astype(np.int8) + y_score = np.asarray(y_score).astype(np.float64) + # Candidate set: quantiles of the score distribution (handles flat scores + # and avoids evaluating degenerate thresholds outside [min, max]). + qs = np.linspace(0.0, 1.0, n_candidates + 2)[1:-1] + candidates = np.unique(np.quantile(y_score, qs)) + if candidates.size == 0: + return 0.5, 0.0 + best_thr = float(candidates[0]) + best_mcc = -2.0 + for thr in candidates: + pred = (y_score >= thr).astype(np.int8) + # matthews_corrcoef returns 0 when one class is empty in predictions. + mcc = matthews_corrcoef(y_true, pred) + if mcc > best_mcc: + best_mcc = mcc + best_thr = float(thr) + return best_thr, float(best_mcc) + + def load_embedding_data( features_path: Path, features: List[str], @@ -193,13 +218,14 @@ def get_domain_protein_combinations(f): get_domain_protein_combinations(f) ) - # Sample protein combinations - if samples_per_ddi is not None: - proteins_combinations = random.choices( - sorted(possible_protein_combinations), k=samples_per_ddi - ) + # Sample protein combinations without replacement; cap at min(K, available). + sorted_combos = sorted(possible_protein_combinations) + if samples_per_ddi is not None and len(sorted_combos) > samples_per_ddi: + proteins_combinations = random.sample(sorted_combos, k=samples_per_ddi) else: - proteins_combinations = sorted(possible_protein_combinations) + proteins_combinations = sorted_combos + if not proteins_combinations: + continue proteins_a, proteins_b = zip(*proteins_combinations) interactions = [f"{pa}_{pb}" for pa, pb in proteins_combinations] @@ -279,317 +305,234 @@ def get_domain_protein_combinations(f): return x, y -def main(): - argparser = argparse.ArgumentParser() +class DDIModelTrainer(ABC): + """Base class for DDI prediction models. + + Subclasses implement model-specific training, saving, and loading while + inheriting the shared training loop, prediction, and evaluation logic. + """ + + MODEL_NAME: str + MODEL_FILE: str + + @abstractmethod + def _get_balance_methods(self, hyperparameters: dict) -> list: + """Return list of balance method strings from the config.""" + + @abstractmethod + def _balance_keys(self) -> list: + """Config keys to exclude from the hyperparameter grid.""" + + @abstractmethod + def _load_train_data(self, args, balance_method: str, samples_per_ddi, seed: int): + """Load training data with the given balance strategy. Returns (x, y).""" + + @abstractmethod + def _create_grid_search(self, hyperparameters, n_iter, cv_split, x, y, config, num_features): + """Create and fit a RandomizedSearchCV. Returns the fitted object.""" + + @abstractmethod + def _refit(self, best_params, best_balance, args, config, num_features, samples_per_ddi): + """Refit on full training data with best params. Returns the classifier.""" + + @abstractmethod + def _save_model(self, classifier, model_path: Path): + """Serialize the trained model to disk.""" + + @abstractmethod + def _load_model(self, model_path: Path): + """Deserialize a trained model from disk.""" + + def _pre_train_hook(self): + """Called before training starts. Override for e.g. CUDA probe.""" + pass + + def _predict_proba(self, classifier, x): + """Get predicted probabilities. Override if dtype casting needed.""" + return classifier.predict_proba(x)[:, 1] + + @classmethod + def build_argparser(cls): + argparser = argparse.ArgumentParser() + argparser.add_argument("--features", nargs="+", required=True) + argparser.add_argument("--features_path", type=Path, required=True) + argparser.add_argument("--ddi_path", type=Path, required=True) + argparser.add_argument("--config", type=Path, required=True) + argparser.add_argument("--out_predictions", type=Path, required=True) + argparser.add_argument("--out_model_dir", type=Path, required=False) + argparser.add_argument("--model_dir", type=Path, required=False) + argparser.add_argument("--predict-only", action="store_true") + argparser.add_argument( + "--max_protein_combinations_per_ddi", type=int, default=None, + help="Optional cap on protein-pair instantiations per DDI pair (sampled without replacement). None = use all available combinations.", + ) + argparser.add_argument("--seed", type=int, default=42) + argparser.add_argument( + "--id", dest="run_id", default=None, + help="Optional run ID (logged only).", + ) + return argparser - argparser.add_argument( - "--features", - nargs="+", - required=True, - ) - argparser.add_argument( - "--features_path", - type=Path, - required=True, - ) - argparser.add_argument("--ddi_path", type=Path, required=True) - argparser.add_argument("--config", type=Path, required=True) - argparser.add_argument("--out_predictions", type=Path, required=True) - argparser.add_argument("--out_model_dir", type=Path, required=False) - argparser.add_argument("--model_dir", type=Path, required=False) - argparser.add_argument("--predict-only", action="store_true") - argparser.add_argument("--seed", type=int, default=42) - argparser.add_argument( - "--id", dest="run_id", default=None, - help="Optional run ID (logged only).", - ) + def run(self): + args = self.build_argparser().parse_args() + if args.run_id: + print(f"[{self.MODEL_NAME}] run_id={args.run_id}") - args = argparser.parse_args() - if args.run_id: - print(f"[machine_learning] run_id={args.run_id}") + if not args.predict_only: + classifier = self.train(args) + model_json_path = args.out_model_dir / "model_parameters.json" + else: + model_path = args.model_dir / self.MODEL_FILE + if not model_path.exists(): + raise FileNotFoundError(f"Model file {model_path} does not exist.") + classifier = self._load_model(model_path) + model_json_path = args.model_dir / "model_parameters.json" + + with model_json_path.open("r") as f: + model_parameters = json.load(f) + threshold = model_parameters.get("threshold", 0.5) + self.predict(args, classifier, threshold) + + def predict(self, args, classifier, threshold=0.5): + print("Predicting on test data...") + print(args.features) + print(args) + random.seed(args.seed) + x_test, y_test, ddi_pairs = load_embedding_data( + args.features_path, args.features, args.ddi_path, "test", + samples_per_ddi=args.max_protein_combinations_per_ddi, + balance_classes=False, return_ddi_pairs=True, + ) - if not args.predict_only: - classifier = train_model(args) - model_json_path = args.out_model_dir / "model_parameters.json" - else: - module_path = args.model_dir / "NeuralNetwork.pkl" - model_json_path = args.model_dir / "model_parameters.json" - if not module_path.exists(): - raise FileNotFoundError(f"Model file {module_path} does not exist.") - with module_path.open("rb") as model_file: - module = torch.load(model_file, weights_only=False) - print(module) - classifier = NeuralNetBinaryClassifier(module) - classifier.initialize() - - with model_json_path.open("r") as model_json_file: - model_parameters = json.load(model_json_file) - balance_opt_set = model_parameters[ - "balance_positive_and_negative_interactions_train_set" - ] - samples_per_ddi_opt = model_parameters["protein_sample_per_ddi_train_set"] - threshold = model_parameters.get("threshold", 0.5) - predict(args, classifier, balance_opt_set, samples_per_ddi_opt, threshold=threshold) - - -def train_model(args): - # Load the configuration file - with Path(args.config).open("r") as config_file: - config = json.load(config_file) - - # Load hyperparameters and search parameters from the configuration - hyperparameters = config["model_parameters"] - search_parameters = config["search_parameters"] - best_model_parameters_and_performance = [] - - balance_opt_set = search_parameters[ - "balance_positive_and_negative_interactions_opt_set" - ] - samples_per_ddi_opt = search_parameters["protein_sample_per_ddi_opt_set"] - - outer_loop_runs = len( - hyperparameters["balance_positive_and_negative_interactions_train_set"] - ) * len(hyperparameters["protein_sample_per_ddi_train_set"]) - - inner_search_runs = math.ceil( - search_parameters["models_to_evaluate"] / outer_loop_runs - ) - device = config["device"] - if device in ("auto", "cuda") and not torch.cuda.is_available(): - print("CUDA requested but not available — falling back to CPU.") - device = "cpu" - elif device == "auto": - device = "cuda" - print(f"Training device: {device}") - jobs = config["jobs"] - - # Load optimization data - print("Loading optimization data...") - random.seed(args.seed) - x_opt, y_opt = load_embedding_data( - args.features_path, - args.features, - args.ddi_path, - "optimization", - samples_per_ddi=samples_per_ddi_opt, - balance_classes=balance_opt_set, - ) - num_features = x_opt.shape[1] - print(f"Optimization data shape: {x_opt.shape}, Labels shape: {y_opt.shape}") - print( - f"Number of positive samples: {np.sum(y_opt == 1)}, Number of negative samples: {np.sum(y_opt == 0)}" - ) + print(f"Test data shape: {x_test.shape}, Labels shape: {y_test.shape}") + print( + f"Number of positive samples: {np.sum(y_test == 1)}, Number of negative samples: {np.sum(y_test == 0)}" + ) - print("Starting grid search for hyperparameter tuning...") - for balance_train_set in hyperparameters[ - "balance_positive_and_negative_interactions_train_set" - ]: - for protein_sample_per_ddi_train_set in hyperparameters[ - "protein_sample_per_ddi_train_set" - ]: - hyperparameters_filtered = { - k: v - for (k, v) in hyperparameters.items() - if k - not in [ - "balance_positive_and_negative_interactions_train_set", - "protein_sample_per_ddi_train_set", - ] - } - - # print("Loading training data...") - random.seed(args.seed) - x_train, y_train = load_embedding_data( - args.features_path, - args.features, - args.ddi_path, - "train", - balance_classes=balance_train_set, - samples_per_ddi=protein_sample_per_ddi_train_set, - ) + y_test_pred_proba = self._predict_proba(classifier, x_test) + + predictions_df = _aggregate_to_ddi_level(ddi_pairs, y_test, y_test_pred_proba) + predictions_df["predicted_interaction"] = ( + predictions_df["predicted_probability"].values >= threshold + ).astype(np.int8) + predictions_df["true_interaction"] = predictions_df["true_interaction"].astype(np.int8) + predictions_df = predictions_df[ + ["domain_a", "domain_b", "true_interaction", "predicted_interaction", "predicted_probability"] + ] + out_path = str(args.out_predictions) + if out_path.endswith(".csv"): + predictions_df.to_csv(out_path, index=False) + else: + predictions_df.to_parquet(out_path, index=False, compression="zstd") + print(f"Predictions saved to {out_path}") - # print(f"Training data shape: {x_train.shape}, Labels shape: {y_train.shape}") - # print( - # f"Number of positive samples: {np.sum(y_train == 1)}, Number of negative samples: {np.sum(y_train == 0)}") + def train(self, args): + self._pre_train_hook() + + with Path(args.config).open("r") as config_file: + config = json.load(config_file) + + hyperparameters = config["model_parameters"] + search_parameters = config["search_parameters"] + balance_methods = self._get_balance_methods(hyperparameters) + samples_per_ddi = args.max_protein_combinations_per_ddi + balance_opt_set = search_parameters[ + "balance_positive_and_negative_interactions_opt_set" + ] + + n_iter = math.ceil( + search_parameters["models_to_evaluate"] / len(balance_methods) + ) + + print("Loading optimization data...") + random.seed(args.seed) + x_opt, y_opt = load_embedding_data( + args.features_path, args.features, args.ddi_path, "optimization", + samples_per_ddi=samples_per_ddi, balance_classes=balance_opt_set, + ) + num_features = x_opt.shape[1] + print(f"Optimization data shape: {x_opt.shape}, Labels shape: {y_opt.shape}") + print( + f"Number of positive samples: {np.sum(y_opt == 1)}, Number of negative samples: {np.sum(y_opt == 0)}" + ) + + print("Starting grid search for hyperparameter tuning...") + results = [] + hparams_filtered = { + k: v for k, v in hyperparameters.items() + if k not in self._balance_keys() + } + + for balance_method in balance_methods: + print(f"[grid] balance_method={balance_method}") + x_train, y_train = self._load_train_data( + args, balance_method, samples_per_ddi, args.seed + ) x = np.concatenate([x_train, x_opt], axis=0) y = np.concatenate([y_train, y_opt], axis=0) + split = PredefinedSplit([-1] * len(x_train) + [0] * len(x_opt)) - split = PredefinedSplit( - [-1] * len(x_train) + [0] * len(x_opt) - ) # -1 = always train, 0 = validation fold - - # Train Neural Network Classifier model - classifier = NeuralNetBinaryClassifier( - MLPModule, - max_epochs=search_parameters["grid_search_epochs"], - device=device, - verbose=0, - module__input_size=num_features, - ) - # grid_search = GridSearchCV(classifier, grid_search_params, n_jobs=args.jobs) - grid_search = RandomizedSearchCV( - classifier, - hyperparameters_filtered, - n_iter=inner_search_runs, - n_jobs=jobs, - cv=split, - refit=False, - verbose=2, - scoring="average_precision", - ) - grid_search.fit(x, y) - - best_model_parameters_and_performance.append( - ( - grid_search.best_params_, - grid_search.best_score_, - balance_train_set, - protein_sample_per_ddi_train_set, - ) + gs = self._create_grid_search( + hparams_filtered, n_iter, split, x, y, config, num_features ) + results.append((gs.best_params_, gs.best_score_, balance_method)) - # B3: free per-iter arrays before next outer-loop fit - del x, y, x_train, y_train, classifier, grid_search + del x, y, x_train, y_train, gs gc.collect() - best_model_parameters_and_performance.sort(key=lambda x: x[1], reverse=True) + results.sort(key=lambda r: r[1], reverse=True) + best_params, _, best_balance = results[0] - # Refitting on training data with the best parameters - params, score, balance_train_set, protein_sample_per_ddi_train_set = ( - best_model_parameters_and_performance[0] - ) - - print(f"Best parameters: {params}") - print(f"Balance training set: {balance_train_set}") - print(f"Protein sample per DDI training set: {protein_sample_per_ddi_train_set}") - - # B3: drop x_opt before refit on full train (NN is GPU-bound, no need to - # keep validation fold materialised in host RAM during retrain). It will - # be reloaded for threshold tuning right after. - x_opt_shape = x_opt.shape - del x_opt, y_opt - clear_load_cache() - gc.collect() - - random.seed(args.seed) - x_train, y_train = load_embedding_data( - args.features_path, - args.features, - args.ddi_path, - "train", - balance_classes=balance_train_set, - samples_per_ddi=protein_sample_per_ddi_train_set, - ) - n_pos = int(np.sum(y_train == 1)) - n_neg = int(np.sum(y_train == 0)) - pos_weight = torch.tensor([n_neg / max(n_pos, 1)]) - classifier = NeuralNetBinaryClassifier( - MLPModule, - max_epochs=search_parameters["retrain_epochs"], - device=device, - **params, - verbose=1, - module__input_size=num_features, - criterion__pos_weight=pos_weight, - callbacks=[EarlyStopping(patience=5, monitor="valid_loss")], - ) - print("Refitting best parameter model on training data...") - classifier.fit(x_train, y_train) - - # B3: x_train no longer needed once refit done; reload x_opt for thresholding. - del x_train, y_train - gc.collect() - random.seed(args.seed) - x_opt, y_opt = load_embedding_data( - args.features_path, - args.features, - args.ddi_path, - "optimization", - samples_per_ddi=samples_per_ddi_opt, - balance_classes=balance_opt_set, - ) - assert x_opt.shape == x_opt_shape, "Reloaded x_opt shape changed unexpectedly" + print(f"Best parameters: {best_params}") + print(f"Balance method: {best_balance}") - print("Tuning decision threshold on optimization data...") - y_opt_proba = classifier.predict_proba(x_opt)[:, 1] - prec, rec, thr = precision_recall_curve(y_opt, y_opt_proba) - f1s = 2 * prec[:-1] * rec[:-1] / (prec[:-1] + rec[:-1] + 1e-12) - best_thr = float(thr[np.argmax(f1s)]) - print(f"Tuned threshold: {best_thr:.3f} (F1={f1s.max():.3f})") + x_opt_shape = x_opt.shape + del x_opt, y_opt + clear_load_cache() + gc.collect() - y_pred = (y_opt_proba >= best_thr).astype(int) - - # create confusion matrix - confusion_matrix = pd.crosstab( - y_opt, y_pred, rownames=["Actual"], colnames=["Predicted"], margins=True - ) - print(f"\nConfusion Matrix:\n\n{confusion_matrix}\n") - - # Save the model - print("Saving the model...") - args.out_model_dir.mkdir(parents=True, exist_ok=True) - model_path = args.out_model_dir / "NeuralNetwork.pkl" - with model_path.open("wb") as model_file: - torch.save(classifier.module_, model_file) - params_path = args.out_model_dir / "model_parameters.json" - with params_path.open("w") as params_file: - json.dump( - { - "model_parameters": params, - "balance_positive_and_negative_interactions_train_set": balance_train_set, - "protein_sample_per_ddi_train_set": protein_sample_per_ddi_train_set, - "threshold": best_thr, - }, - params_file, - indent=4, + classifier = self._refit( + best_params, best_balance, args, config, num_features, samples_per_ddi ) - return classifier - - -def predict(args, classifier, balance_opt_set, samples_per_ddi_opt, threshold=0.5): - # Predict on test data - print("Predicting on test data...") - print(args.features) - print(args) - random.seed(args.seed) - x_test, y_test, ddi_pairs, protein_pairs = load_embedding_data( - args.features_path, - args.features, - args.ddi_path, - "test", - samples_per_ddi=samples_per_ddi_opt, - balance_classes=False, - return_ddi_pairs=True, - return_protein_pairs=True, - ) - - print(f"Test data shape: {x_test.shape}, Labels shape: {y_test.shape}") - print( - f"Number of positive samples: {np.sum(y_test == 1)}, Number of negative samples: {np.sum(y_test == 0)}" - ) - - y_test_pred_proba = classifier.predict_proba(x_test)[:, 1] - y_test_pred = (y_test_pred_proba >= threshold).astype(int) - - # Save predictions. Tight dtypes; parquet by default (5–10× smaller + - # faster to read in eval_one); falls back to csv on .csv suffix for - # back-compat. - predictions_df = pd.DataFrame(ddi_pairs, columns=["domain_a", "domain_b"]) - predictions_df["protein_a"], predictions_df["protein_b"] = zip(*protein_pairs) - predictions_df["true_interaction"] = np.asarray(y_test).astype(np.int8) - predictions_df["predicted_interaction"] = np.asarray(y_test_pred).astype(np.int8) - predictions_df["predicted_probability"] = np.asarray(y_test_pred_proba).astype( - np.float32 - ) - out_path = str(args.out_predictions) - if out_path.endswith(".csv"): - predictions_df.to_csv(out_path, index=False) - else: - predictions_df.to_parquet(out_path, index=False, compression="zstd") - print(f"Predictions saved to {out_path}") + random.seed(args.seed) + x_opt, y_opt, opt_ddi_pairs = load_embedding_data( + args.features_path, args.features, args.ddi_path, "optimization", + samples_per_ddi=samples_per_ddi, balance_classes=balance_opt_set, + return_ddi_pairs=True, + ) + assert x_opt.shape == x_opt_shape, "Reloaded x_opt shape changed unexpectedly" + + print("Tuning decision threshold on DDI-aggregated optimization data via MCC...") + y_opt_proba = self._predict_proba(classifier, x_opt) + opt_agg = _aggregate_to_ddi_level(opt_ddi_pairs, y_opt, y_opt_proba) + best_thr, best_mcc = _tune_threshold_mcc( + opt_agg["true_interaction"].values, + opt_agg["predicted_probability"].values, + ) + print(f"Tuned threshold: {best_thr:.3f} (MCC={best_mcc:.3f})") + y_pred = (opt_agg["predicted_probability"].values >= best_thr).astype(int) + confusion_matrix = pd.crosstab( + opt_agg["true_interaction"].values, y_pred, + rownames=["Actual"], colnames=["Predicted"], margins=True, + ) + print(f"\nConfusion Matrix (DDI-level):\n\n{confusion_matrix}\n") + + print("Saving the model...") + args.out_model_dir.mkdir(parents=True, exist_ok=True) + self._save_model(classifier, args.out_model_dir / self.MODEL_FILE) + params_path = args.out_model_dir / "model_parameters.json" + with params_path.open("w") as f: + json.dump( + { + "model_parameters": best_params, + "balance_method": best_balance, + "threshold": best_thr, + }, + f, + indent=4, + ) -if __name__ == "__main__": - main() + return classifier diff --git a/bin/neural_network.py b/bin/neural_network.py new file mode 100644 index 0000000..c7c7407 --- /dev/null +++ b/bin/neural_network.py @@ -0,0 +1,149 @@ +#!/usr/bin/env python3 +"""Neural network DDI prediction model (skorch + PyTorch MLP).""" + +import gc +import random +from typing import List + +import numpy as np +import torch +from sklearn.model_selection import RandomizedSearchCV +from skorch import NeuralNetBinaryClassifier +from skorch.callbacks import EarlyStopping +from skorch.dataset import ValidSplit + +from machine_learning import DDIModelTrainer, load_embedding_data + +torch.backends.cuda.matmul.allow_tf32 = True +torch.backends.cudnn.allow_tf32 = True + + +class MLPModule(torch.nn.Module): + def __init__( + self, input_size: int, hidden_layer_sizes: List[int] = None, dropout_rate=0.5 + ): + if hidden_layer_sizes is None: + hidden_layer_sizes = [] + layer_sizes = [input_size] + hidden_layer_sizes + [1] + super(MLPModule, self).__init__() + layers = [] + for i in range(len(layer_sizes) - 1): + layers.append(torch.nn.Linear(layer_sizes[i], layer_sizes[i + 1])) + if i < len(layer_sizes) - 2: + layers.append(torch.nn.ReLU()) + layers.append(torch.nn.Dropout(dropout_rate)) + self.layers = torch.nn.Sequential(*layers) + + def forward(self, x): + return self.layers(x) + + +class NeuralNetworkTrainer(DDIModelTrainer): + MODEL_NAME = "NeuralNetwork" + MODEL_FILE = "NeuralNetwork.pkl" + + def _get_balance_methods(self, hyperparameters): + return hyperparameters.get("balance_method", ["downsample"]) + + def _balance_keys(self): + return ["balance_method"] + + def _load_train_data(self, args, balance_method, samples_per_ddi, seed): + if balance_method not in ("downsample", "none"): + raise ValueError( + f"NeuralNetworkTrainer supports 'downsample' and 'none', got '{balance_method}'" + ) + downsample = balance_method == "downsample" + random.seed(seed) + return load_embedding_data( + args.features_path, args.features, args.ddi_path, "train", + balance_classes=downsample, samples_per_ddi=samples_per_ddi, + ) + + def _create_grid_search(self, hyperparameters, n_iter, cv_split, x, y, config, num_features): + device = config["device"] + if device in ("auto", "cuda") and not torch.cuda.is_available(): + print("CUDA requested but not available — falling back to CPU.") + device = "cpu" + elif device == "auto": + device = "cuda" + print(f"Training device: {device}") + + # iterator_train__shuffle=True is critical: load_embedding_data with + # balance_classes=True concatenates `pos + neg`, so unshuffled batches + # are class-pure and the model collapses to majority prediction. + # Stratified valid split keeps positives in the holdout used by EarlyStopping. + classifier = NeuralNetBinaryClassifier( + MLPModule, + max_epochs=config["search_parameters"]["grid_search_epochs"], + device=device, + verbose=0, + module__input_size=num_features, + iterator_train__shuffle=True, + train_split=ValidSplit(0.2, stratified=True), + ) + gs = RandomizedSearchCV( + classifier, hyperparameters, n_iter=n_iter, + n_jobs=config["jobs"], cv=cv_split, refit=False, + verbose=2, scoring="average_precision", + ) + gs.fit(x, y) + return gs + + def _refit(self, best_params, best_balance, args, config, num_features, samples_per_ddi): + x_train, y_train = self._load_train_data( + args, best_balance, samples_per_ddi, args.seed + ) + n_pos = int(np.sum(y_train == 1)) + n_neg = int(np.sum(y_train == 0)) + pos_weight = torch.tensor([n_neg / max(n_pos, 1)]) + + device = config["device"] + if device in ("auto", "cuda") and not torch.cuda.is_available(): + device = "cpu" + elif device == "auto": + device = "cuda" + + classifier = NeuralNetBinaryClassifier( + MLPModule, + max_epochs=config["search_parameters"]["retrain_epochs"], + device=device, + **best_params, + verbose=1, + module__input_size=num_features, + criterion__pos_weight=pos_weight, + iterator_train__shuffle=True, + train_split=ValidSplit(0.2, stratified=True), + callbacks=[EarlyStopping(patience=5, monitor="valid_loss")], + ) + print("Refitting best parameter model on training data...") + classifier.fit(x_train, y_train) + del x_train, y_train + gc.collect() + return classifier + + def _save_model(self, classifier, model_path): + with model_path.open("wb") as f: + torch.save(classifier.module_, f) + + def _load_model(self, model_path): + with model_path.open("rb") as f: + module = torch.load(f, weights_only=False) + print(module) + module.eval() + return module + + def _predict_proba(self, classifier, x): + if isinstance(classifier, torch.nn.Module): + with torch.no_grad(): + x_t = torch.tensor(x, dtype=torch.float32) + if hasattr(classifier, "layers"): + device = next(classifier.parameters()).device + x_t = x_t.to(device) + logits = classifier(x_t).squeeze(-1) + return torch.sigmoid(logits).cpu().numpy() + return classifier.predict_proba(x.astype(np.float32))[:, 1] + + +if __name__ == "__main__": + NeuralNetworkTrainer().run() diff --git a/bin/new_model.py b/bin/new_model.py new file mode 100644 index 0000000..98b5dce --- /dev/null +++ b/bin/new_model.py @@ -0,0 +1,80 @@ +#!/usr/bin/env python3 +"""Template for adding a new ML model to the benchmark pipeline. + +Steps to add a new model: +1. Copy this file to bin/.py +2. Create assets/.json with hyperparameter grid +3. Add a Nextflow process in modules/local//main.nf +4. Wire it into subworkflows/local/per_db_benchmark/main.nf +""" + +import random +import pickle + +import numpy as np +from sklearn.model_selection import RandomizedSearchCV + +from machine_learning import DDIModelTrainer, load_embedding_data + + +class NewModelTrainer(DDIModelTrainer): + MODEL_NAME = "NewModel" # Must match assets/NewModel.json filename + MODEL_FILE = "NewModel.pkl" + + def _get_balance_methods(self, hyperparameters): + # Return list of balance strategies to grid-search over. + # Common: ["downsample"], ["none"], ["none", "downsample", "oversample"] + return hyperparameters.get("balance_method", ["none"]) + + def _balance_keys(self): + # Config keys handled by balance loop, excluded from sklearn param grid. + return ["balance_method"] + + def _load_train_data(self, args, balance_method, samples_per_ddi, seed): + # Load training data with the requested balance strategy. + # Must return (x, y) numpy arrays. + downsample = balance_method == "downsample" + random.seed(seed) + return load_embedding_data( + args.features_path, args.features, args.ddi_path, "train", + balance_classes=downsample, samples_per_ddi=samples_per_ddi, + ) + + def _create_grid_search(self, hyperparameters, n_iter, cv_split, x, y, config, num_features): + # Create, fit, and return a RandomizedSearchCV. + # `cv_split` is a PredefinedSplit (train=-1, opt=0). + # `config` is the full JSON config dict. + # Set refit=False — the base class handles refitting via _refit(). + raise NotImplementedError("Replace with your classifier + RandomizedSearchCV") + + def _refit(self, best_params, best_balance, args, config, num_features, samples_per_ddi): + # Retrain on full training data with best hyperparameters. + # Must return the fitted classifier. + x_train, y_train = self._load_train_data( + args, best_balance, samples_per_ddi, args.seed + ) + raise NotImplementedError("Create classifier with best_params, fit, return it") + + def _save_model(self, classifier, model_path): + # Serialize trained model. Common: pickle, torch.save, joblib. + with model_path.open("wb") as f: + pickle.dump(classifier, f) + + def _load_model(self, model_path): + # Deserialize for predict-only mode. Must return object compatible + # with _predict_proba(). + with model_path.open("rb") as f: + return pickle.load(f) + + # Optional overrides: + # + # def _pre_train_hook(self): + # """Called before training. Use for GPU probes, env checks, etc.""" + # + # def _predict_proba(self, classifier, x): + # """Override if your model needs dtype casting or custom inference.""" + # return classifier.predict_proba(x.astype(np.float32))[:, 1] + + +if __name__ == "__main__": + NewModelTrainer().run() diff --git a/bin/random_forest.py b/bin/random_forest.py index 068a2b0..fa957d4 100755 --- a/bin/random_forest.py +++ b/bin/random_forest.py @@ -1,308 +1,160 @@ #!/usr/bin/env python +"""Random Forest DDI prediction model (cuML GPU-accelerated).""" import gc -import math import random +import sys -import argparse -import json import numpy as np -import pandas as pd import pickle -from machine_learning import load_embedding_data, clear_load_cache -from pathlib import Path -from sklearn.metrics import precision_recall_curve -from sklearn.model_selection import RandomizedSearchCV, PredefinedSplit -from sklearn.utils.class_weight import compute_sample_weight - - -def main(): - argparser = argparse.ArgumentParser() - - argparser.add_argument( - "--features", - nargs="+", - required=True, - ) - argparser.add_argument( - "--features_path", - type=Path, - required=True, - ) - argparser.add_argument("--ddi_path", type=Path, required=True) - argparser.add_argument("--config", type=Path, required=True) - argparser.add_argument("--out_predictions", type=Path, required=True) - argparser.add_argument("--out_model_dir", type=Path, required=False) - argparser.add_argument("--model_dir", type=Path, required=False) - argparser.add_argument("--predict-only", action="store_true") - argparser.add_argument("--seed", type=int, default=42) - argparser.add_argument( - "--id", dest="run_id", default=None, - help="Optional run ID (logged only).", - ) - - args = argparser.parse_args() - if args.run_id: - print(f"[random_forest] run_id={args.run_id}") - - if not args.predict_only: - classifier = train_model(args) - model_json_path = args.out_model_dir / "model_parameters.json" - else: - module_path = args.model_dir / "RandomForest.pkl" - model_json_path = args.model_dir / "model_parameters.json" - if not module_path.exists(): - raise FileNotFoundError(f"Model file {module_path} does not exist.") - with module_path.open("rb") as model_file: - classifier = pickle.load(model_file) - - with model_json_path.open("r") as model_json_file: - model_parameters = json.load(model_json_file) - balance_opt_set = model_parameters[ - "balance_positive_and_negative_interactions_train_set" - ] - samples_per_ddi_opt = model_parameters["protein_sample_per_ddi_train_set"] - threshold = model_parameters.get("threshold", 0.5) - predict(args, classifier, balance_opt_set, samples_per_ddi_opt, threshold=threshold) - - -def train_model(args): - # cuML is only needed for training; importing it lazily lets --predict-only - # and --help work on machines without a GPU / without cuML installed. - from cuml.ensemble import RandomForestClassifier - - # Load the configuration file - with Path(args.config).open("r") as config_file: - config = json.load(config_file) - - # Load hyperparameters and search parameters from the configuration - hyperparameters = config["model_parameters"] - search_parameters = config["search_parameters"] - best_model_parameters_and_performance = [] - - balance_opt_set = search_parameters[ - "balance_positive_and_negative_interactions_opt_set" - ] - samples_per_ddi_opt = search_parameters["protein_sample_per_ddi_opt_set"] - - outer_loop_runs = len( - hyperparameters["balance_positive_and_negative_interactions_train_set"] - ) * len(hyperparameters["protein_sample_per_ddi_train_set"]) - - inner_search_runs = math.ceil( - search_parameters["models_to_evaluate"] / outer_loop_runs - ) - - # Load optimization data - print("Loading optimization data...") - random.seed(args.seed) - x_opt, y_opt = load_embedding_data( - args.features_path, - args.features, - args.ddi_path, - "optimization", - samples_per_ddi=samples_per_ddi_opt, - balance_classes=balance_opt_set, - ) - print(f"Optimization data shape: {x_opt.shape}, Labels shape: {y_opt.shape}") - print( - f"Number of positive samples: {np.sum(y_opt == 1)}, Number of negative samples: {np.sum(y_opt == 0)}" - ) - - print("Starting grid search for hyperparameter tuning...") - for balance_train_set in hyperparameters[ - "balance_positive_and_negative_interactions_train_set" - ]: - for protein_sample_per_ddi_train_set in hyperparameters[ - "protein_sample_per_ddi_train_set" - ]: - hyperparameters_filtered = { - k: v - for (k, v) in hyperparameters.items() - if k - not in [ - "balance_positive_and_negative_interactions_train_set", - "protein_sample_per_ddi_train_set", - ] - } - - # print("Loading training data...") - random.seed(args.seed) - x_train, y_train = load_embedding_data( - args.features_path, - args.features, - args.ddi_path, - "train", - balance_classes=balance_train_set, - samples_per_ddi=protein_sample_per_ddi_train_set, - ) - - # print(f"Training data shape: {x_train.shape}, Labels shape: {y_train.shape}") - # print( - # f"Number of positive samples: {np.sum(y_train == 1)}, Number of negative samples: {np.sum(y_train == 0)}") - - x = np.concatenate([x_train, x_opt], axis=0).astype(np.float32) - y = np.concatenate([y_train, y_opt], axis=0).astype(np.int32) - - split = PredefinedSplit( - [-1] * len(x_train) + [0] * len(x_opt) - ) # -1 = always train, 0 = validation fold - - # cuML RF has no class_weight; balanced weighting applied via sample_weight at final refit - classifier = RandomForestClassifier() - # n_jobs=1: GPU handles parallelism; parallel CV jobs risk OOM - grid_search = RandomizedSearchCV( - classifier, - hyperparameters_filtered, - n_iter=inner_search_runs, - n_jobs=1, - cv=split, - refit=False, - verbose=2, - scoring="average_precision", - ) - grid_search.fit(x, y) - - best_model_parameters_and_performance.append( - ( - grid_search.best_params_, - grid_search.best_score_, - balance_train_set, - protein_sample_per_ddi_train_set, - ) - ) - - # B3: drop per-iter buffers before next outer iter - del x, y, x_train, y_train, classifier, grid_search - gc.collect() - - best_model_parameters_and_performance.sort(key=lambda x: x[1], reverse=True) - - # Refitting on training data with the best parameters - params, score, balance_train_set, protein_sample_per_ddi_train_set = ( - best_model_parameters_and_performance[0] - ) - - print(f"Best parameters: {params}") - print(f"Balance training set: {balance_train_set}") - print(f"Protein sample per DDI training set: {protein_sample_per_ddi_train_set}") - - # B3: drop x_opt before refit; reload after for thresholding. - x_opt_shape = x_opt.shape - del x_opt, y_opt - clear_load_cache() - gc.collect() - - random.seed(args.seed) +from sklearn.model_selection import RandomizedSearchCV + +from machine_learning import DDIModelTrainer, load_embedding_data + + +# Exit code 140 sits in conf/base.config's retry list (errorStrategy retries on +# [137, 139, 140, 143, 247]). We use it to flag transient CUDA failures so +# Nextflow reschedules onto a different GPU node instead of giving up. +_CUDA_RETRY_EXIT_CODE = 140 + + +def _probe_cuda_or_retry(): + """Run a 1-element GPU op to surface broken CUDA init before grid search. + + Without this, a node-local CUDA glitch (cudaErrorOperatingSystem from + stale --nv bind / cgroup race) makes all GridSearchCV fits fail identically + and sklearn swallows the exception behind "All N fits failed". We catch it + here and exit 140 so the SLURM retry policy moves us to another node. + """ + try: + import cupy as cp + _ = cp.asarray([1.0], dtype=cp.float32) + 1.0 + cp.cuda.runtime.deviceSynchronize() + except Exception as exc: # noqa: BLE001 — any CUDA-init failure is fatal here + print( + f"[random_forest] CUDA smoke test failed ({type(exc).__name__}: {exc}). " + "Exiting with retry code so Nextflow reschedules on a different GPU node.", + file=sys.stderr, + ) + sys.exit(_CUDA_RETRY_EXIT_CODE) + + +# Three ways to address the heavy positive-class imbalance in the training set. +# "none" — train on raw data, no correction. +# "downsample" — load_embedding_data(balance_classes=True): equal pos/neg by +# resampling; preserves the original pipeline behaviour. +# "oversample" — replicate minority-class rows to match majority size; keeps +# all original data and is equivalent to balanced integer +# sample weights. cuML RandomForestClassifier.fit() does not +# accept a sample_weight kwarg, so we materialise the weights +# as duplicated rows instead. +BALANCE_METHODS = ("none", "downsample", "oversample") + + +def _oversample_minority(x, y, seed): + """Replicate minority-class rows so class counts match. Returns (x, y).""" + rng = np.random.default_rng(seed) + y_arr = np.asarray(y).astype(np.int32) + classes, counts = np.unique(y_arr, return_counts=True) + if len(classes) < 2: + return x, y_arr + majority_count = counts.max() + parts_x = [x] + parts_y = [y_arr] + for cls, cnt in zip(classes, counts): + if cnt >= majority_count: + continue + idx = np.where(y_arr == cls)[0] + need = majority_count - cnt + pick = rng.choice(idx, size=need, replace=True) + parts_x.append(x[pick]) + parts_y.append(y_arr[pick]) + x_out = np.concatenate(parts_x, axis=0) + y_out = np.concatenate(parts_y, axis=0) + perm = rng.permutation(len(y_out)) + return x_out[perm], y_out[perm] + + +def _load_train_with_balance( + args, balance_method: str, samples_per_ddi: int, seed: int +): + """Load training arrays under one of the three balance strategies.""" + if balance_method not in BALANCE_METHODS: + raise ValueError(f"Unknown balance_method: {balance_method}") + downsample = balance_method == "downsample" + random.seed(seed) x_train, y_train = load_embedding_data( args.features_path, args.features, args.ddi_path, "train", - balance_classes=balance_train_set, - samples_per_ddi=protein_sample_per_ddi_train_set, - ) - classifier = RandomForestClassifier(**params) - print("Refitting best parameter model on training data...") - x_train_f32 = x_train.astype(np.float32) - y_train_i32 = y_train.astype(np.int32) - del x_train, y_train - gc.collect() - #sample_weight = compute_sample_weight("balanced", y_train_i32) - classifier.fit(x_train_f32, y_train_i32)#, sample_weight=sample_weight) - - # Free training buffers before allocating x_opt again. - del x_train_f32, y_train_i32 - gc.collect() - - random.seed(args.seed) - x_opt, y_opt = load_embedding_data( - args.features_path, - args.features, - args.ddi_path, - "optimization", - samples_per_ddi=samples_per_ddi_opt, - balance_classes=balance_opt_set, - ) - assert x_opt.shape == x_opt_shape, "Reloaded x_opt shape changed unexpectedly" - - print("Tuning decision threshold on optimization data...") - y_opt_proba = classifier.predict_proba(x_opt.astype(np.float32))[:, 1] - prec, rec, thr = precision_recall_curve(y_opt, y_opt_proba) - f1s = 2 * prec[:-1] * rec[:-1] / (prec[:-1] + rec[:-1] + 1e-12) - best_thr = float(thr[np.argmax(f1s)]) - print(f"Tuned threshold: {best_thr:.3f} (F1={f1s.max():.3f})") - - y_pred = (y_opt_proba >= best_thr).astype(int) - - # create confusion matrix - confusion_matrix = pd.crosstab( - y_opt, y_pred, rownames=["Actual"], colnames=["Predicted"], margins=True + balance_classes=downsample, + samples_per_ddi=samples_per_ddi, ) - print(f"\nConfusion Matrix:\n\n{confusion_matrix}\n") - - # Save the model - print("Saving the model...") - args.out_model_dir.mkdir(parents=True, exist_ok=True) - model_path = args.out_model_dir / "RandomForest.pkl" - with model_path.open("wb") as model_file: - pickle.dump(classifier, model_file) - params_path = args.out_model_dir / "model_parameters.json" - with params_path.open("w") as params_file: - json.dump( - { - "model_parameters": params, - "balance_positive_and_negative_interactions_train_set": balance_train_set, - "protein_sample_per_ddi_train_set": protein_sample_per_ddi_train_set, - "threshold": best_thr, - }, - params_file, - indent=4, + if balance_method == "oversample": + x_train, y_train = _oversample_minority(x_train, y_train, seed) + return x_train, y_train + + +class RandomForestTrainer(DDIModelTrainer): + MODEL_NAME = "RandomForest" + MODEL_FILE = "RandomForest.pkl" + + def _pre_train_hook(self): + _probe_cuda_or_retry() + + def _get_balance_methods(self, hyperparameters): + return hyperparameters.get("balance_method", ["none"]) + + def _balance_keys(self): + return ["balance_method"] + + def _load_train_data(self, args, balance_method, samples_per_ddi, seed): + return _load_train_with_balance(args, balance_method, samples_per_ddi, seed) + + def _predict_proba(self, classifier, x): + return classifier.predict_proba(x.astype(np.float32))[:, 1] + + def _create_grid_search(self, hyperparameters, n_iter, cv_split, x, y, config, num_features): + from cuml.ensemble import RandomForestClassifier + x = x.astype(np.float32) + y = y.astype(np.int32) + classifier = RandomForestClassifier() + # n_jobs=1: GPU handles parallelism; parallel CV jobs risk OOM + gs = RandomizedSearchCV( + classifier, hyperparameters, n_iter=n_iter, n_jobs=1, + cv=cv_split, refit=False, verbose=2, scoring="average_precision", + # Surface the real exception on the first failed fit instead of + # letting sklearn mask all N folds behind "All N fits failed". + error_score="raise", ) + gs.fit(x, y) + return gs - return classifier - - -def predict(args, classifier, balance_opt_set, samples_per_ddi_opt, threshold=0.5): - # Predict on test data - print("Predicting on test data...") - print(args.features) - print(args) - random.seed(args.seed) - x_test, y_test, ddi_pairs, protein_pairs = load_embedding_data( - args.features_path, - args.features, - args.ddi_path, - "test", - samples_per_ddi=samples_per_ddi_opt, - balance_classes=False, - return_ddi_pairs=True, - return_protein_pairs=True, - ) - - print(f"Test data shape: {x_test.shape}, Labels shape: {y_test.shape}") - print( - f"Number of positive samples: {np.sum(y_test == 1)}, Number of negative samples: {np.sum(y_test == 0)}" - ) - - y_test_pred_proba = classifier.predict_proba(x_test.astype(np.float32))[:, 1] - y_test_pred = (y_test_pred_proba >= threshold).astype(int) - - # Save predictions. Parquet by default (back-compat csv on .csv suffix). - predictions_df = pd.DataFrame(ddi_pairs, columns=["domain_a", "domain_b"]) - predictions_df["protein_a"], predictions_df["protein_b"] = zip(*protein_pairs) - predictions_df["true_interaction"] = np.asarray(y_test).astype(np.int8) - predictions_df["predicted_interaction"] = np.asarray(y_test_pred).astype(np.int8) - predictions_df["predicted_probability"] = np.asarray(y_test_pred_proba).astype( - np.float32 - ) - out_path = str(args.out_predictions) - if out_path.endswith(".csv"): - predictions_df.to_csv(out_path, index=False) - else: - predictions_df.to_parquet(out_path, index=False, compression="zstd") - print(f"Predictions saved to {out_path}") + def _refit(self, best_params, best_balance, args, config, num_features, samples_per_ddi): + from cuml.ensemble import RandomForestClassifier + x_train, y_train = self._load_train_data( + args, best_balance, samples_per_ddi, args.seed + ) + classifier = RandomForestClassifier(**best_params) + print("Refitting best parameter model on training data...") + x_f32 = x_train.astype(np.float32) + y_i32 = y_train.astype(np.int32) + del x_train, y_train + gc.collect() + classifier.fit(x_f32, y_i32) + del x_f32, y_i32 + gc.collect() + return classifier + + def _save_model(self, classifier, model_path): + with model_path.open("wb") as f: + pickle.dump(classifier, f) + + def _load_model(self, model_path): + with model_path.open("rb") as f: + return pickle.load(f) if __name__ == "__main__": - main() + RandomForestTrainer().run() diff --git a/bin/run_graph_models.py b/bin/run_graph_models.py index 896ee16..e20ee0f 100755 --- a/bin/run_graph_models.py +++ b/bin/run_graph_models.py @@ -27,29 +27,32 @@ "--id", dest="run_id", default=None, help="Optional run ID (logged only).", ) + parser.add_argument( + "--threads", type=int, default=1, + help="Number of worker threads/processes (from task.cpus).", + ) args = parser.parse_args() if args.run_id: print(f"[run_graph_models] run_id={args.run_id}") if args.model == "kgiddi": json_file = os.path.join(args.params, "kgiddi.json") - # Call kgiddi with appropriate parameters print( f"Run KGIDDI graph model with database {args.database} and parameters from {json_file}, output to {args.out_dir} and predictions to {args.out_predictions}" ) - run_kgiddi(args.database, json_file, args.out_dir, args.out_predictions) + run_kgiddi(args.database, json_file, args.out_dir, args.out_predictions, threads=args.threads) elif args.model == "kgiddi_random": json_file = os.path.join(args.params, "kgiddi_random.json") print( f"Run KGIDDI_RANDOM graph model with database {args.database} and parameters from {json_file}, output to {args.out_dir} and predictions to {args.out_predictions}" ) - run_kgiddi(args.database, json_file, args.out_dir, args.out_predictions) + run_kgiddi(args.database, json_file, args.out_dir, args.out_predictions, threads=args.threads) elif args.model == "ddiparsimony": json_file = os.path.join(args.params, "ddiparsimony.json") print( f"Run DDIParsimony graph model with database {args.database} and parameters from {json_file}, output to {args.out_dir} and predictions to {args.out_predictions}" ) - run_ddiparsimony(args.database, json_file, args.out_dir, args.out_predictions) + run_ddiparsimony(args.database, json_file, args.out_dir, args.out_predictions, threads=args.threads) else: raise ValueError(f"Unknown model: {args.model}") diff --git a/bin/test_generate_features.py b/bin/test_generate_features.py deleted file mode 100644 index 6446121..0000000 --- a/bin/test_generate_features.py +++ /dev/null @@ -1,88 +0,0 @@ -#!/usr/bin/env python3 - -import pandas as pd -import argparse - -# PARSE USER INFORMATION PASSED AT RUNTIME -parser = argparse.ArgumentParser( - description="Feature generation for peptide-based sequences. Output file contains features for the sequence as well as the sequence itself (in the first column).", - formatter_class=argparse.ArgumentDefaultsHelpFormatter, -) -parser.add_argument( - "-i", - "--input", - action="store_true", - help="location of file containing peptide sequences to generate features for", -) -parser.add_argument( - "-o", - "--output", - action="store_true", - help="location and name of output file containing feature data", -) -args = parser.parse_known_args() - -# LINK PARAMETERS TO ARGUMENTS -read_in = args[1][0] -read_out = args[1][1] - - -# GENERATE PROTDCAL FEATURES FOR SEQUENCE -def protdcal_features(sequence, protdcal): - # METHOD FOR GENERATING PROTDCAL MEAN VALUES FOR EACH SEQUENCE - # FIRST: GENERATE PROTDCAL VALUES - slist = list(sequence.upper()) # split sequence up into list - # Go through sequence to get protdcal value - t1 = [] - values = [] - for i in slist: - t1.append(protdcal.loc[i].tolist()) - t2 = list(map(lambda *x: sum(x), *t1)) # add up values - for t in t2: - t = t / len(slist) # get average (mean) of summed values - values.append(t) - headers = protdcal.columns.tolist() # include headers - - return values, headers - - -# TRAINING FEATURE GENERATION: GENERATE OTHER FEATURES + FORMAT NICELY -# Import protdcal file -protdcal = pd.read_csv( - "/home/t/thomasc/MaPra/Pipeline/dummy_pipeline/scripts/protdcal_table.csv", - index_col=0, -) - -# Import user file -file_in = str(read_in) -user_feat = pd.read_csv(read_in) - -# Define seq_col input by user as a variable, ensure it's in string format for use with dataframe -seq_col = "domain_sequence" - - -# Check for invalid information in sequences that we can't handle - remove + output as error file -alphabet = "ARNDCQEGHILKMFPSTWYVX" # 20 essential amino acids + X - -print(user_feat) - -sequences = user_feat[seq_col] - -# NOW GET INTO FEATURE GENERATION! - -# Create df for protdcal results to go into -print(sequences[0]) -v, h = protdcal_features(sequences[0], protdcal) -features = pd.DataFrame(columns=h) -features.loc[len(features)] = v - -i = 1 - -# Go through rest of sequences to generate protdcal features set -while i < len(sequences): - ts = sequences[i] - value, header = protdcal_features(ts, protdcal) - features.loc[len(features)] = value - i += 1 - -print(features) diff --git a/conf/base.config b/conf/base.config index 3d914d0..c15ee53 100644 --- a/conf/base.config +++ b/conf/base.config @@ -10,22 +10,15 @@ process { - // TODO nf-core: Check the defaults for all processes cpus = { 1 * task.attempt } memory = { 6.GB * task.attempt } time = { 4.h * task.attempt } + // Retry on OOM/SIGKILL classes — base memory scales with task.attempt + // across labels below, so retry 2/3 gets ×2 / ×3 memory automatically. errorStrategy = { task.exitStatus in ((130..145) + 104 + (175..177)) ? 'retry' : 'finish' } - maxRetries = 1 + maxRetries = 3 maxErrors = '-1' - - // Process-specific resource requirements - // NOTE - Please try and reuse the labels below as much as possible. - // These labels are used and recognised by default in DSL2 files hosted on nf-core/modules. - // If possible, it would be nice to keep the same label naming convention when - // adding in your local modules too. - // TODO nf-core: Customise requirements for specific processes. - // See https://www.nextflow.io/docs/latest/config.html#config-process-selectors withLabel:process_single { cpus = { 1 } memory = { 6.GB * task.attempt } @@ -46,6 +39,17 @@ process { memory = { 72.GB * task.attempt } time = { 16.h * task.attempt } } + // Graph models (kgiddi, kgiddi_random, ddiparsimony). kgiddi spends ~5h + // in build_ddi_network alone and kgiddi_random (permutation control) + // pushes network_expansion past that — both blew through process_medium's + // 8h slot on the last cluster run. Inline overrides inside the GRAPH_MODEL + // process body were silently shadowed because withLabel directives in this + // config take precedence over per-process directives. Dedicated label. + withLabel:process_graph { + cpus = 18 + memory = { 108.GB * task.attempt } + time = { 24.h * task.attempt } + } withLabel:process_long { time = { 20.h * task.attempt } } diff --git a/conf/modules.config b/conf/modules.config index 6cb70c6..38c556b 100644 --- a/conf/modules.config +++ b/conf/modules.config @@ -40,19 +40,32 @@ process { withName: 'FEATURE_EXTRACTION_ONE' { publishDir = [ enabled: false ] + // Heavy embedding features (1024-2560-dim) need a much bigger + // allocation than the default `feature_extraction` label (40 GB). + // Route them to the same memory/time envelope as `process_gpu_large` + // (160 GB × task.attempt, 36 h). With maxRetries=3 the ceiling is + // 480 GB, leaving headroom past the previously-failing 120 GB cap. + memory = { + def heavy = (params.large_features ?: '').tokenize(',')*.trim() + (heavy.contains(meta.feature) ? 160.GB : 40.GB) * task.attempt + } + time = { + def heavy = (params.large_features ?: '').tokenize(',')*.trim() + (heavy.contains(meta.feature) ? 36.h : 12.h) * task.attempt + } } - withName: 'MACHINE_LEARNING' { + withName: 'NEURAL_NETWORK' { publishDir = [ - path: { "${params.outdir}/${meta.id}/ml_output" }, + path: { "${params.outdir}/${meta.id}/nn_output" }, mode: params.publish_dir_mode, saveAs: { filename -> filename.equals('versions.yml') ? null : filename } ] } - withName: 'MACHINE_LEARNING_EVALUATION' { + withName: 'NEURAL_NETWORK_EVALUATION' { publishDir = [ - path: { "${params.outdir}/${meta.id}/ml_output_complete_db" }, + path: { "${params.outdir}/${meta.id}/nn_output_complete_db" }, mode: params.publish_dir_mode, saveAs: { filename -> filename.equals('versions.yml') ? null : filename } ] diff --git a/conf/slurm.config b/conf/slurm.config index 53eab24..39649b4 100644 --- a/conf/slurm.config +++ b/conf/slurm.config @@ -20,7 +20,7 @@ process { memory = { 80.GB * task.attempt } time = { 24.h * task.attempt } errorStrategy = { task.exitStatus in [137, 139, 140, 143, 247] ? 'retry' : 'finish' } - maxRetries = 2 + maxRetries = 3 // ------- GPU labels ------- // ML/RF pick `process_gpu_small` for aacomp/aaencode and @@ -40,7 +40,7 @@ process { } withLabel: 'process_gpu_large' { queue = 'shared-gpu' - memory = { 120.GB * task.attempt } + memory = { 160.GB * task.attempt } time = { 36.h * task.attempt } clusterOptions = '--gpus-per-task=1 --ntasks=1 --qos=limitgpus --nodelist=jlab-gpu01.exbio.wzw.tum.de,gpu02.exbio.wzw.tum.de,compms-gpu-2.exbio.wzw.tum.de' } diff --git a/conf/test.config b/conf/test.config index 71e2efb..826308b 100644 --- a/conf/test.config +++ b/conf/test.config @@ -22,10 +22,8 @@ params { config_profile_name = 'Test profile' config_profile_description = 'Minimal in-repo fixture used for smoke tests / CI.' - // I/O — tests/data/ is created in Stage 9 of the refactor. - db = "${projectDir}/tests/data/db_minimal" - db_list = null - input = null + // I/O + input = "${projectDir}/assets/samplesheet_test.csv" outdir = './results-test' // Pipeline behaviour: run only the cheapest feature + skip every graph model. diff --git a/conf/test_full.config b/conf/test_full.config index 604f344..9ed03dd 100644 --- a/conf/test_full.config +++ b/conf/test_full.config @@ -19,13 +19,7 @@ params { config_profile_name = 'Full test profile' config_profile_description = 'Full benchmark across all cluster DB splits' - // Real DB splits on cluster scratch. Override with --db_list for other paths. - db_list = [ - '/nfs/data/CoBiNet_Masterpraktikum/databases/minimal_leakage_domain', - '/nfs/data/CoBiNet_Masterpraktikum/databases/random_ddi', - '/nfs/data/CoBiNet_Masterpraktikum/databases/random_ddi', - '/nfs/data/CoBiNet_Masterpraktikum/databases/random_discovery', - ] + input = "${projectDir}/assets/samplesheet_full.csv" outdir = 'results-test-full' skip = '' diff --git a/docs/CONTRIBUTING.md b/docs/CONTRIBUTING.md index 96d3cfe..e52fe6b 100644 --- a/docs/CONTRIBUTING.md +++ b/docs/CONTRIBUTING.md @@ -148,6 +148,15 @@ These labels define resource defaults for single-core processes, modules that re Values assigned within these labels can be dynamically passed to a tool using the the `${task.cpus}` and `${task.memory}` Nextflow variables in the `script:` block of a module (see an example in the [modules repository](https://github.com/nf-core/modules/blob/bd1b6a40f55933d94b8c9ca94ec8c1ea0eaf4b82/modules/nf-core/samtools/bam2fq/main.nf#L30)). +#### Adding new features or models + +The pipeline provides copy-and-customize templates for the two most common extension points: + +- **Feature encodings:** `bin/features/new_feature.py` — implements the `extract_features(conn, out_file)` contract. Copy it, implement your computation, and add the feature name to `params.machine_learning_features` in `nextflow.config`. Add to `params.large_features` if it needs GPU/large memory. +- **ML models:** `bin/new_model.py` — subclasses `DDIModelTrainer` with hooks for grid search, refit, save/load. Copy it, implement the methods, create the matching `assets/.json`, and add a Nextflow process + subworkflow wiring. + +Both templates include step-by-step instructions in their docstrings. + #### Nextflow version bumping If you use a new feature from core Nextflow, bump the minimum required Nextflow version in the pipeline with: diff --git a/docs/metromap_style_pipeline_workflow_components.drawio b/docs/metromap_style_pipeline_workflow_components.drawio new file mode 100644 index 0000000..24231e7 --- /dev/null +++ b/docs/metromap_style_pipeline_workflow_components.drawio @@ -0,0 +1,615 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + diff --git a/main.nf b/main.nf index a3a498b..36b6101 100644 --- a/main.nf +++ b/main.nf @@ -62,10 +62,7 @@ workflow { params.input, params.help, params.help_full, - params.show_hidden, - // pipeline-specific input - params.db_list, - params.db + params.show_hidden ) // diff --git a/modules/local/ddi_extraction/meta.yml b/modules/local/ddi_extraction/meta.yml new file mode 100644 index 0000000..430ef15 --- /dev/null +++ b/modules/local/ddi_extraction/meta.yml @@ -0,0 +1,51 @@ +name: ddi_extraction +description: | + Extract domain-domain interaction (DDI) tables from a database split. + Reads one or more `.sqlite3` files and writes + per-split CSV tables of `(domain_1, domain_2, interaction)` rows for + downstream feature extraction and model training. +keywords: + - ddi + - sqlite + - domain-domain interaction + - benchmark +tools: + - python: + description: SQL → CSV extraction script. + homepage: https://www.python.org + documentation: https://docs.python.org + licence: + - "PSF-2.0" +input: + - - meta: + type: map + description: | + Groovy Map containing database information, + e.g. [ id:'random_denoise' ] + - database_dir: + type: directory + description: | + Directory containing `train.sqlite3`, `test.sqlite3`, + `optimization.sqlite3`. A single `.sqlite3` file is also + accepted; in that case only the `test` split is produced. + pattern: "*" +output: + ddi: + - - meta: + type: map + description: Groovy Map containing database information + - "${meta.id}/DDI": + type: directory + description: Per-split DDI CSV directory with train/test/optimization tables. + pattern: "DDI" + versions: + - - ${task.process}: + type: string + description: The process the versions were collected from + - python: + type: string + description: Python interpreter version +authors: + - "@konstantinpelz" +maintainers: + - "@konstantinpelz" diff --git a/modules/local/evaluation/main.nf b/modules/local/evaluation/main.nf index 757569b..6f38b6a 100644 --- a/modules/local/evaluation/main.nf +++ b/modules/local/evaluation/main.nf @@ -77,19 +77,37 @@ process COMBINE_EVAL { container "docker://konstantinpelz/cobinet-general:1.0.0" input: - tuple val(meta), path(eval_reports, stageAs: 'reports/*') + // Every per-DB EVALUATION emits a dir literally named `evaluation/`, so + // staging them with a shared target collides ("multiple input files for + // each of the following file names: reports/evaluation"). `stageAs: + // 'src/?/*'` gives each input its own numbered parent dir while + // preserving the original `evaluation` name. `ids` runs in parallel with + // `eval_reports` (same order), letting us symlink each staged dir into + // `reports//evaluation` -- required by combine_eval.py, which + // derives db_name from the report dir's parent. + tuple val(meta), path(eval_reports, stageAs: 'src/?/*'), val(ids) output: tuple val(meta), path('evaluation/'), emit: combined_report path "versions.yml", emit: versions script: - def reports_list = eval_reports instanceof java.util.List ? eval_reports.join(' ') : eval_reports + def ids_list = ids instanceof java.util.List ? ids : [ids] + def ids_bash = ids_list.collect { "'${it}'" }.join(' ') """ + mkdir -p reports + ids=(${ids_bash}) + n=\${#ids[@]} + for (( i=1; i<=n; i++ )); do + db_id="\${ids[\$((i-1))]}" + mkdir -p "reports/\${db_id}" + ln -s "../../src/\${i}/evaluation" "reports/\${db_id}/evaluation" + done + mkdir -p evaluation combine_eval.py \\ - --reports ${reports_list} \\ + --reports reports/*/evaluation \\ --out_dir evaluation cat <<-END_VERSIONS > versions.yml diff --git a/modules/local/evaluation/meta.yml b/modules/local/evaluation/meta.yml new file mode 100644 index 0000000..c2a11c9 --- /dev/null +++ b/modules/local/evaluation/meta.yml @@ -0,0 +1,63 @@ +name: evaluation +description: | + Two-stage evaluation. `EVAL_ONE` scatters: one subtask per prediction + parquet emits a small JSON of metrics plus downsampled ROC/PR curves. + `EVALUATION` reduces: collects per-model JSONs for a database and builds + the per-DB MultiQC report. `COMBINE_EVAL` merges the per-DB reports into + the cross-database `ddi_report.html`. +keywords: + - evaluation + - multiqc + - metrics + - roc + - pr curve + - benchmark +tools: + - multiqc: + description: HTML aggregation of per-model metric JSONs. + homepage: https://multiqc.info + documentation: https://multiqc.info/docs/ + licence: + - "GPL-3.0-or-later" + - scikit-learn: + description: Metric computation (AUROC, AUPRC, F1, MCC). + homepage: https://scikit-learn.org + documentation: https://scikit-learn.org/stable/ + licence: + - "BSD-3-Clause" +input: + - - meta: + type: map + description: | + Groovy Map with database and model information. + - predictions: + type: file + description: Predictions parquet file (one model × DB × feature combo). + pattern: "*.parquet" +output: + metrics: + - - meta: + type: map + description: Groovy Map with database and model info. + - "per_model/${meta.model}.json": + type: file + description: Per-model metric JSON (metrics + downsampled curves). + pattern: "*.json" + report: + - - meta: + type: map + description: Groovy Map with database info. + - "evaluation/": + type: directory + description: Per-database MultiQC report directory. + versions: + - - ${task.process}: + type: string + description: The process the versions were collected from + - python: + type: string + description: Python interpreter version +authors: + - "@konstantinpelz" +maintainers: + - "@konstantinpelz" diff --git a/modules/local/feature_extraction/meta.yml b/modules/local/feature_extraction/meta.yml new file mode 100644 index 0000000..65b10c8 --- /dev/null +++ b/modules/local/feature_extraction/meta.yml @@ -0,0 +1,54 @@ +name: feature_extraction +description: | + Compute a single feature encoding for one `(database × split)` cell. + `FEATURE_EXTRACTION_ONE` is fanned out by the inner `FEATURE_EXTRACTION` + workflow across all `(feature × split)` combinations so downstream ML, + RF, and graph modules can consume the canonical + `features//.h5` layout. +keywords: + - feature extraction + - domain embeddings + - aacomp + - aaencode + - protdcal + - esm + - prott5 + - h5 +tools: + - python: + description: Feature extraction script `extract_features.py` dispatches to encoders under `bin/features/`. + homepage: https://www.python.org + documentation: https://docs.python.org + licence: + - "PSF-2.0" +input: + - - meta: + type: map + description: | + Groovy Map containing database, feature, and split information, + e.g. [ id:'random_denoise', feature:'aacomp', dataset:'train' ] + - database_dir: + type: directory + description: | + Database split directory containing `.sqlite3`. + pattern: "*" +output: + h5: + - - meta: + type: map + description: Groovy Map with database, feature, and split info. + - "${meta.dataset}.h5": + type: file + description: HDF5 feature matrix for one split. + pattern: "*.h5" + versions: + - - ${task.process}: + type: string + description: The process the versions were collected from + - python: + type: string + description: Python interpreter version +authors: + - "@konstantinpelz" +maintainers: + - "@konstantinpelz" diff --git a/modules/local/graph_model/main.nf b/modules/local/graph_model/main.nf index 0cc3541..96e324b 100644 --- a/modules/local/graph_model/main.nf +++ b/modules/local/graph_model/main.nf @@ -1,6 +1,6 @@ process GRAPH_MODEL { tag "${meta.id}" - label 'process_medium' + label 'process_high' conda "${projectDir}/environments/general.yml" container "docker://konstantinpelz/cobinet-general:1.0.0" @@ -24,7 +24,8 @@ process GRAPH_MODEL { --model ${meta.model} \\ --params ${model_json} \\ --out_dir ${output_model_dir} \\ - --out_predictions ${output_predictions} + --out_predictions ${output_predictions} \\ + --threads ${task.cpus} cat <<-END_VERSIONS > versions.yml "${task.process}": diff --git a/modules/local/graph_model/meta.yml b/modules/local/graph_model/meta.yml new file mode 100644 index 0000000..92445dc --- /dev/null +++ b/modules/local/graph_model/meta.yml @@ -0,0 +1,59 @@ +name: graph_model +description: | + Run a graph-based DDI predictor (KGIDDI, DDIParsimony, or + KGIDDI_RANDOM) against a database split using the model JSON config. + Operates directly on the sqlite splits — no feature extraction needed. +keywords: + - graph model + - kgiddi + - ddiparsimony + - knowledge graph + - link prediction +tools: + - networkx: + description: Graph algorithms backing KGIDDI / DDIParsimony. + homepage: https://networkx.org + documentation: https://networkx.org/documentation/stable/ + licence: + - "BSD-3-Clause" +input: + - - meta: + type: map + description: | + Groovy Map with database and model information, + e.g. [ id:'random_denoise', model:'kgiddi' ] + - database: + type: directory + description: Database split directory containing the sqlite triples. + pattern: "*" + - model_json: + type: file + description: Model hyperparameter JSON from `assets/.json`. + pattern: "*.json" +output: + predictions: + - - meta: + type: map + description: Groovy Map with database and model info. + - "${meta.model}/predictions.parquet": + type: file + description: Per-prediction parquet table. + pattern: "*.parquet" + model: + - - meta: + type: map + description: Groovy Map with database and model info. + - "${meta.model}/model/": + type: directory + description: Trained graph-model artefacts. + versions: + - - ${task.process}: + type: string + description: The process the versions were collected from + - python: + type: string + description: Python interpreter version +authors: + - "@konstantinpelz" +maintainers: + - "@konstantinpelz" diff --git a/modules/local/machine_learning/main.nf b/modules/local/neural_network/main.nf similarity index 67% rename from modules/local/machine_learning/main.nf rename to modules/local/neural_network/main.nf index a78875f..dd4ea16 100644 --- a/modules/local/machine_learning/main.nf +++ b/modules/local/neural_network/main.nf @@ -1,4 +1,4 @@ -process MACHINE_LEARNING { +process NEURAL_NETWORK { tag "${meta.id}" label 'process_gpu' @@ -9,24 +9,26 @@ process MACHINE_LEARNING { tuple val(meta), path('DDI'), path('features/*'), path('config.json') output: - tuple val(meta), path("machine_learning_${meta.features.join('_')}/predictions.parquet"), emit: predictions - tuple val(meta), path("machine_learning_${meta.features.join('_')}/model/"), emit: model - path "versions.yml", emit: versions + tuple val(meta), path("neural_network_${meta.combo_id}/predictions.parquet"), emit: predictions + tuple val(meta), path("neural_network_${meta.combo_id}/model/"), emit: model + path "versions.yml", emit: versions script: - def output_base = "machine_learning_${meta.features.join('_')}" + def output_base = "neural_network_${meta.combo_id}" def output_predictions = "${output_base}/predictions.parquet" def output_model_dir = "${output_base}/model" + def max_combos_arg = params.max_protein_combinations_per_ddi ? "--max_protein_combinations_per_ddi ${params.max_protein_combinations_per_ddi}" : '' """ mkdir -p ${output_model_dir} - machine_learning.py \\ + neural_network.py \\ --features ${meta.features.join(' ')} \\ --features_path features/ \\ --ddi_path DDI/ \\ --config config.json \\ --out_predictions ${output_predictions} \\ --out_model_dir ${output_model_dir} \\ + ${max_combos_arg} \\ --seed ${params.seed} cat <<-END_VERSIONS > versions.yml @@ -38,7 +40,7 @@ process MACHINE_LEARNING { """ } -process MACHINE_LEARNING_EVALUATION { +process NEURAL_NETWORK_EVALUATION { tag "${meta.id}" label 'process_high_memory' label 'process_long' @@ -52,22 +54,24 @@ process MACHINE_LEARNING_EVALUATION { tuple val(meta), path('DDI'), path('features/*'), path('config.json'), path(prev_results) output: - tuple val(meta), path("machine_learning_${meta.features.join('_')}/predictions.parquet"), emit: predictions - path "versions.yml", emit: versions + tuple val(meta), path("neural_network_${meta.combo_id}/predictions.parquet"), emit: predictions + path "versions.yml", emit: versions script: - def output_base = "machine_learning_${meta.features.join('_')}" - def model_dir = "${prev_results}/ml_output/${output_base}/model" + def output_base = "neural_network_${meta.combo_id}" + def model_dir = "${prev_results}/nn_output/${output_base}/model" + def max_combos_arg = params.max_protein_combinations_per_ddi ? "--max_protein_combinations_per_ddi ${params.max_protein_combinations_per_ddi}" : '' """ mkdir -p ${output_base} - machine_learning.py \\ + neural_network.py \\ --features ${meta.features.join(' ')} \\ --features_path features/ \\ --ddi_path DDI/ \\ --config config.json \\ --out_predictions ${output_base}/predictions.parquet \\ --model_dir ${model_dir} \\ + ${max_combos_arg} \\ --predict-only cat <<-END_VERSIONS > versions.yml diff --git a/modules/local/neural_network/meta.yml b/modules/local/neural_network/meta.yml new file mode 100644 index 0000000..2089134 --- /dev/null +++ b/modules/local/neural_network/meta.yml @@ -0,0 +1,71 @@ +name: neural_network +description: | + Train a neural-network DDI classifier (PyTorch + skorch) on a feature + combination, run hyperparameter search per the model JSON config, and + emit predictions for downstream evaluation. +keywords: + - machine learning + - neural network + - pytorch + - skorch + - ddi + - classifier +tools: + - pytorch: + description: Deep learning framework used to train the neural-network classifier. + homepage: https://pytorch.org + documentation: https://pytorch.org/docs/ + licence: + - "BSD-3-Clause" + - skorch: + description: scikit-learn-compatible wrapper around PyTorch models. + homepage: https://github.com/skorch-dev/skorch + documentation: https://skorch.readthedocs.io + licence: + - "BSD-3-Clause" +input: + - - meta: + type: map + description: | + Groovy Map with database and feature-combo information, + e.g. [ id:'random_denoise', features:['aacomp','aaencode'] ] + - ddi_dir: + type: directory + description: DDI tables from `DDI_EXTRACTION` (staged as `DDI/`). + pattern: "DDI" + - features_dir: + type: directory + description: | + Per-feature `.h5` files staged as `features//.h5`. + pattern: "features/*" + - config_json: + type: file + description: Model hyperparameter JSON from `assets/.json`. + pattern: "*.json" +output: + predictions: + - - meta: + type: map + description: Groovy Map with database and feature-combo info. + - "neural_network_*/predictions.parquet": + type: file + description: Per-prediction parquet table. + pattern: "*.parquet" + model: + - - meta: + type: map + description: Groovy Map with database and feature-combo info. + - "neural_network_*/model/": + type: directory + description: Trained model artefacts. + versions: + - - ${task.process}: + type: string + description: The process the versions were collected from + - python: + type: string + description: Python interpreter version +authors: + - "@konstantinpelz" +maintainers: + - "@konstantinpelz" diff --git a/modules/local/random_forest/main.nf b/modules/local/random_forest/main.nf index eaad0e1..48345e6 100644 --- a/modules/local/random_forest/main.nf +++ b/modules/local/random_forest/main.nf @@ -9,14 +9,15 @@ process RANDOM_FOREST { tuple val(meta), path('DDI'), path('features/*'), path('config.json') output: - tuple val(meta), path("random_forest_${meta.features.join('_')}/predictions.parquet"), emit: predictions - tuple val(meta), path("random_forest_${meta.features.join('_')}/model/"), emit: model - path "versions.yml", emit: versions + tuple val(meta), path("random_forest_${meta.combo_id}/predictions.parquet"), emit: predictions + tuple val(meta), path("random_forest_${meta.combo_id}/model/"), emit: model + path "versions.yml", emit: versions script: - def output_base = "random_forest_${meta.features.join('_')}" + def output_base = "random_forest_${meta.combo_id}" def output_predictions = "${output_base}/predictions.parquet" def output_model_dir = "${output_base}/model" + def max_combos_arg = params.max_protein_combinations_per_ddi ? "--max_protein_combinations_per_ddi ${params.max_protein_combinations_per_ddi}" : '' """ mkdir -p ${output_model_dir} @@ -27,6 +28,7 @@ process RANDOM_FOREST { --config config.json \\ --out_predictions ${output_predictions} \\ --out_model_dir ${output_model_dir} \\ + ${max_combos_arg} \\ --seed ${params.seed} cat <<-END_VERSIONS > versions.yml @@ -49,12 +51,13 @@ process RANDOM_FOREST_EVALUATION { tuple val(meta), path('DDI'), path('features/*'), path('config.json'), path(prev_results) output: - tuple val(meta), path("${meta.features.join('_')}/predictions.parquet"), emit: predictions - path "versions.yml", emit: versions + tuple val(meta), path("random_forest_${meta.combo_id}/predictions.parquet"), emit: predictions + path "versions.yml", emit: versions script: - def output_base = meta.features.join('_') - def model_dir = "${prev_results}/rf_output/random_forest_${output_base}/model" + def output_base = "random_forest_${meta.combo_id}" + def model_dir = "${prev_results}/rf_output/${output_base}/model" + def max_combos_arg = params.max_protein_combinations_per_ddi ? "--max_protein_combinations_per_ddi ${params.max_protein_combinations_per_ddi}" : '' """ mkdir -p ${output_base} @@ -65,6 +68,7 @@ process RANDOM_FOREST_EVALUATION { --config config.json \\ --out_predictions ${output_base}/predictions.parquet \\ --model_dir ${model_dir} \\ + ${max_combos_arg} \\ --predict-only cat <<-END_VERSIONS > versions.yml diff --git a/modules/local/random_forest/meta.yml b/modules/local/random_forest/meta.yml new file mode 100644 index 0000000..f6583a8 --- /dev/null +++ b/modules/local/random_forest/meta.yml @@ -0,0 +1,69 @@ +name: random_forest +description: | + Train a Random Forest DDI classifier (RAPIDS cuML / scikit-learn) on a + feature combination, run `RandomizedSearchCV` per the model JSON config, + and emit predictions for downstream evaluation. +keywords: + - random forest + - cuml + - scikit-learn + - ddi + - classifier +tools: + - scikit-learn: + description: Random-forest classifier and hyperparameter search. + homepage: https://scikit-learn.org + documentation: https://scikit-learn.org/stable/ + licence: + - "BSD-3-Clause" + - cuml: + description: GPU-accelerated random-forest implementation. + homepage: https://rapids.ai/cuml + documentation: https://docs.rapids.ai/api/cuml/stable/ + licence: + - "Apache-2.0" +input: + - - meta: + type: map + description: | + Groovy Map with database and feature-combo information, + e.g. [ id:'random_denoise', features:['aacomp'] ] + - ddi_dir: + type: directory + description: DDI tables from `DDI_EXTRACTION`. + pattern: "DDI" + - features_dir: + type: directory + description: Per-feature `.h5` files. + pattern: "features/*" + - config_json: + type: file + description: Model hyperparameter JSON from `assets/RandomForest.json`. + pattern: "*.json" +output: + predictions: + - - meta: + type: map + description: Groovy Map with database and feature-combo info. + - "random_forest_*/predictions.parquet": + type: file + description: Per-prediction parquet table. + pattern: "*.parquet" + model: + - - meta: + type: map + description: Groovy Map with database and feature-combo info. + - "random_forest_*/model/": + type: directory + description: Trained model artefacts. + versions: + - - ${task.process}: + type: string + description: The process the versions were collected from + - python: + type: string + description: Python interpreter version +authors: + - "@konstantinpelz" +maintainers: + - "@konstantinpelz" diff --git a/nextflow.config b/nextflow.config index a5f9a06..af136ae 100644 --- a/nextflow.config +++ b/nextflow.config @@ -13,8 +13,6 @@ params { // -------- I/O -------- input = null - db = null - db_list = null outdir = './results' publish_dir_mode = 'copy' @@ -24,11 +22,16 @@ params { // Comma-separated. main.nf splits on entry. (nf-core schema lint is // happier with strings than Groovy lists for default values.) graph_models = 'kgiddi,ddiparsimony,kgiddi_random' - machine_learning_features = 'aacomp,aaencode,prott5_protein_domain_embeddings,esm3_protein_domain_embeddings,esmc_protein_domain_embeddings,esm3_domain_embeddings,esmc_domain_embeddings' + machine_learning_models = 'neural_network,random_forest' + machine_learning_features = 'dummy,aacomp,aaencode,prott5_protein_domain_embeddings,esm3_protein_domain_embeddings,esmc_protein_domain_embeddings,esm3_domain_embeddings,esmc_domain_embeddings' // Features whose embedding dim makes ML/RF tasks need the heavy GPU label. // Anything not listed here uses `process_gpu_small` (~50 GB). large_features = 'prott5_protein_domain_embeddings,esm3_protein_domain_embeddings,esmc_protein_domain_embeddings,esm3_domain_embeddings,esmc_domain_embeddings' - max_machine_learning_features = 2 + // Cap on protein-pair instantiations per DDI domain pair during training + // and prediction. `null` (default) = use every available protein + // combination. When set, the cap is applied via `random.sample` (no + // replacement); if fewer combinations exist than the cap, all are used. + max_protein_combinations_per_ddi = 15 old_report = 'evaluation' seed = 42 @@ -232,20 +235,24 @@ process.shell = [ nextflow.enable.configProcessNamesValidation = false timeline { - enabled = true - file = "${params.outdir}/pipeline_info/execution_timeline_${params.trace_report_suffix}.html" + enabled = true + overwrite = true + file = "${params.outdir}/pipeline_info/execution_timeline_${params.trace_report_suffix}.html" } report { - enabled = true - file = "${params.outdir}/pipeline_info/execution_report_${params.trace_report_suffix}.html" + enabled = true + overwrite = true + file = "${params.outdir}/pipeline_info/execution_report_${params.trace_report_suffix}.html" } trace { - enabled = true - file = "${params.outdir}/pipeline_info/execution_trace_${params.trace_report_suffix}.txt" + enabled = true + overwrite = true + file = "${params.outdir}/pipeline_info/execution_trace_${params.trace_report_suffix}.txt" } dag { - enabled = true - file = "${params.outdir}/pipeline_info/pipeline_dag_${params.trace_report_suffix}.html" + enabled = true + overwrite = true + file = "${params.outdir}/pipeline_info/pipeline_dag_${params.trace_report_suffix}.html" } manifest { @@ -261,7 +268,7 @@ manifest { orcid: '0009-0003-3891-2005' ], [ - name: ' Chiara Thomas', + name: 'Chiara Thomas', affiliation: 'Data Science in Systems Biology, TUM School of Life Sciences, Technical University of Munich, Germany.', email: 'ge58rab@mytum.de', github: 'chialtho', @@ -269,7 +276,7 @@ manifest { orcid: '0000-0001-9720-7208' ], [ - name: ' Christian Romberg', + name: 'Christian Romberg', affiliation: 'Data Science in Systems Biology, TUM School of Life Sciences, Technical University of Munich, Germany.', email: 'c.romberg@tum.de', github: 'ChristianRomberg', diff --git a/nextflow_schema.json b/nextflow_schema.json index e5fdff6..76b89ba 100644 --- a/nextflow_schema.json +++ b/nextflow_schema.json @@ -20,21 +20,9 @@ "mimetype": "text/csv", "pattern": "^\\S+\\.csv$", "description": "Path to comma-separated samplesheet listing database splits to run.", - "help_text": "Optional. CSV with one row per database split (columns: id, db_path). Use this OR --db / --db_list, not both.", + "help_text": "CSV with one row per database split (columns: id, db_path). Relative db_path values are resolved against the project directory.", "fa_icon": "fas fa-file-csv" }, - "db": { - "type": "string", - "format": "directory-path", - "description": "Path to a single database split directory containing train.sqlite3, test.sqlite3, optimization.sqlite3.", - "fa_icon": "fas fa-folder" - }, - "db_list": { - "type": "string", - "description": "Comma-separated list of database split directories to run as a scatter.", - "help_text": "Mutually exclusive with --db. Each entry must contain train.sqlite3 / test.sqlite3 / optimization.sqlite3.", - "fa_icon": "fas fa-folder-tree" - }, "outdir": { "type": "string", "format": "directory-path", @@ -66,13 +54,20 @@ "graph_models": { "type": "string", "default": "kgiddi,ddiparsimony,kgiddi_random", - "pattern": "^(\\s*\\w+\\s*)(,\\s*\\w+\\s*)*$", - "description": "Comma-separated list of graph-based DDI models to run.", + "pattern": "^(|none|\\s*\\w+\\s*(,\\s*\\w+\\s*)*)$", + "description": "Comma-separated list of graph-based DDI models to run. Pass 'none' (or an empty string) to disable.", "fa_icon": "fas fa-project-diagram" }, + "machine_learning_models": { + "type": "string", + "default": "neural_network,random_forest", + "pattern": "^(|none|\\s*\\w+\\s*(,\\s*\\w+\\s*)*)$", + "description": "Comma-separated list of machine-learning models to run (subset of: neural_network, random_forest). Pass 'none' to disable all ML models.", + "fa_icon": "fas fa-brain" + }, "machine_learning_features": { "type": "string", - "default": "aacomp,aaencode,prott5_protein_domain_embeddings,esm3_protein_domain_embeddings,esmc_protein_domain_embeddings,esm3_domain_embeddings,esmc_domain_embeddings", + "default": "dummy,aacomp,aaencode,prott5_protein_domain_embeddings,esm3_protein_domain_embeddings,esmc_protein_domain_embeddings,esm3_domain_embeddings,esmc_domain_embeddings", "pattern": "^(\\s*\\w+\\s*)(,\\s*\\w+\\s*)*$", "description": "Comma-separated list of feature encodings to compute and feed into the ML / RF classifiers.", "fa_icon": "fas fa-vector-square" @@ -84,12 +79,12 @@ "description": "Comma-separated list of features whose embedding dim makes ML / RF tasks need the heavy GPU label (process_gpu_large).", "fa_icon": "fas fa-weight-hanging" }, - "max_machine_learning_features": { - "type": "integer", - "default": 2, + "max_protein_combinations_per_ddi": { + "type": ["integer", "null"], + "default": null, "minimum": 1, - "description": "Maximum number of feature encodings combined per ML run (powerset cap).", - "fa_icon": "fas fa-layer-group" + "description": "Cap on protein-pair instantiations per DDI domain pair for ML / RF training and prediction. Sampled without replacement; if fewer combinations exist than the cap, all are used. Leave null / unset to use every available combination. Default: 15.", + "fa_icon": "fas fa-random" }, "modeljson": { "type": "string", diff --git a/ro-crate-metadata.json b/ro-crate-metadata.json index 0e26860..b1bc013 100644 --- a/ro-crate-metadata.json +++ b/ro-crate-metadata.json @@ -23,7 +23,7 @@ "@type": "Dataset", "creativeWorkStatus": "InProgress", "datePublished": "2026-05-04T14:29:51+00:00", - "description": "# daisybio/domainbenchmark\n\n[![Open in GitHub Codespaces](https://img.shields.io/badge/Open_In_GitHub_Codespaces-black?labelColor=grey&logo=github)](https://github.com/codespaces/new/daisybio/domainbenchmark)\n[![GitHub Actions CI Status](https://github.com/daisybio/domainbenchmark/actions/workflows/nf-test.yml/badge.svg)](https://github.com/daisybio/domainbenchmark/actions/workflows/nf-test.yml)\n[![GitHub Actions Linting Status](https://github.com/daisybio/domainbenchmark/actions/workflows/linting.yml/badge.svg)](https://github.com/daisybio/domainbenchmark/actions/workflows/linting.yml)[![Cite with Zenodo](http://img.shields.io/badge/DOI-10.5281/zenodo.XXXXXXX-1073c8?labelColor=000000)](https://doi.org/10.5281/zenodo.XXXXXXX)\n[![nf-test](https://img.shields.io/badge/unit_tests-nf--test-337ab7.svg)](https://www.nf-test.com)\n\n[![Nextflow](https://img.shields.io/badge/version-%E2%89%A525.10.4-green?style=flat&logo=nextflow&logoColor=white&color=%230DC09D&link=https%3A%2F%2Fnextflow.io)](https://www.nextflow.io/)\n[![nf-core template version](https://img.shields.io/badge/nf--core_template-4.0.2-green?style=flat&logo=nfcore&logoColor=white&color=%2324B064&link=https%3A%2F%2Fnf-co.re)](https://github.com/nf-core/tools/releases/tag/4.0.2)\n[![run with conda](http://img.shields.io/badge/run%20with-conda-3EB049?labelColor=000000&logo=anaconda)](https://docs.conda.io/en/latest/)\n[![run with docker](https://img.shields.io/badge/run%20with-docker-0db7ed?labelColor=000000&logo=docker)](https://www.docker.com/)\n[![run with singularity](https://img.shields.io/badge/run%20with-singularity-1d355c.svg?labelColor=000000)](https://sylabs.io/docs/)\n[![Launch on Seqera Platform](https://img.shields.io/badge/Launch%20%F0%9F%9A%80-Seqera%20Platform-%234256e7)](https://cloud.seqera.io/launch?pipeline=https://github.com/daisybio/domainbenchmark)\n\n## Introduction\n\n**daisybio/domainbenchmark** is a bioinformatics benchmarking pipeline for protein **domain-domain\ninteraction (DDI)** prediction. Given one or more pre-split DDI databases\n(`train.sqlite3`, `test.sqlite3`, `optimization.sqlite3` plus matching\nembeddings), the pipeline trains a panel of machine-learning and\ngraph-based predictors, evaluates each one against a held-out test split,\nand produces a unified MultiQC report comparing them.\n\nThe pipeline runs the following stages:\n\n1. **DDI extraction** from the database split (`DDI_EXTRACTION`).\n2. **Feature extraction** for every requested encoding (`aacomp`,\n `aaencode`, ProtT5 / ESM-3 / ESM-C protein and domain embeddings) \u2014\n parallelized per `(feature \u00d7 split)`.\n3. **ML classifiers** trained on every feature combination up to\n `--max_machine_learning_features`:\n - `MACHINE_LEARNING` \u2014 neural network (PyTorch + skorch).\n - `RANDOM_FOREST` \u2014 RAPIDS cuML random forest on GPU.\n4. **Graph models** (`GRAPH_MODEL`): KGIDDI, DDI parsimony, KGIDDI-random.\n5. **Per-prediction evaluation** (`EVAL_ONE`) \u2192 tiny per-model JSONs.\n6. **Per-database aggregation** (`EVALUATION`) \u2192 MultiQC report.\n7. **Cross-database aggregation** (`COMBINE_EVAL`) when `--db_list` is set.\n\n```mermaid\nflowchart LR\n subgraph \"per database\"\n ddi[DDI_EXTRACTION]\n feat[FEATURE_EXTRACTION]\n ml[MACHINE_LEARNING]\n rf[RANDOM_FOREST]\n gm[GRAPH_MODEL]\n eo[EVAL_ONE]\n ev[EVALUATION]\n end\n db[(db_list)] --> ddi\n db --> feat --> ml & rf\n db --> gm\n ddi --> ml & rf\n ml & rf & gm --> eo --> ev\n ev --> agg[COMBINE_EVAL] --> report[ddi_report.html]\n```\n\n\n\n\n## Usage\n\n> [!NOTE]\n> If you are new to Nextflow and nf-core, please refer to [this page](https://nf-co.re/docs/get_started/environment_setup/overview) on how to set-up Nextflow. Make sure to [test your setup](https://nf-co.re/docs/get_started/run-your-first-pipeline) with `-profile test` before running the workflow on actual data.\n\n### Single database\n\n```bash\n<<<<<<< HEAD\nnextflow run main.nf \\\n -profile \\\n --db /path/to/database_split \\\n --outdir results\n=======\nnextflow run daisybio/domainbenchmark \\\n -profile \\\n --input samplesheet.csv \\\n --outdir \n>>>>>>> TEMPLATE\n```\n\n### Multiple databases (scatter + combined evaluation)\n\n```bash\nnextflow run main.nf \\\n -profile \\\n --db_list \"/path/to/db1,/path/to/db2,/path/to/db3\" \\\n --outdir results\n```\n\n### Cluster (Slurm + Singularity)\n\n```bash\nnextflow run main.nf \\\n -profile slurm,singularity \\\n --db_list \"/nfs/data/CoBiNet_Masterpraktikum/databases/random_denoise,...\" \\\n --outdir /nfs/scratch/cobinet/results\n```\n\n### Skipping stages\n\n`--skip` accepts a comma-separated list of feature names or graph-model\nnames. For example, to skip the heavy embedding-based features and the\ntwo parsimony graph models:\n\n```bash\nnextflow run main.nf \\\n -profile slurm,singularity \\\n --skip \"kgiddi,ddiparsimony,prott5_protein_domain_embeddings,esm3_protein_domain_embeddings\"\n```\n\n### Test profile (in-repo fixture)\n\n```bash\nnextflow run main.nf -profile test,singularity --outdir results-test\n```\n\nThe `test` profile points `--db` at a tiny in-repo SQLite triple under\n`tests/data/` and disables every heavy feature except `aacomp`. Used by\nCI and by `nf-test`.\n\n## Pipeline parameters\n\n| Parameter | Default | Description |\n|---|---|---|\n| `--input` | `null` | Optional samplesheet CSV (one row per database split). |\n| `--db` | `null` | Path to a single database split directory. |\n| `--db_list` | `null` | Comma-separated list of database splits. |\n| `--outdir` | `./results` | Output directory. |\n| `--modeljson` | `${projectDir}/assets` | Directory of model hyperparameter JSONs. |\n| `--skip` | `''` | Comma-separated feature/model names to skip. |\n| `--graph_models` | `kgiddi, ddiparsimony, kgiddi_random` | Graph models to run. |\n| `--machine_learning_features` | `aacomp, aaencode, prott5_*, esm3_*, esmc_*` | Feature encodings to compute. |\n| `--max_machine_learning_features` | `2` | Max features combined per ML run. |\n| `--seed` | `42` | Global RNG seed. |\n| `--publish_dir_mode` | `'copy'` | Nextflow `publishDir` mode. |\n\nFull schema with defaults, types, and descriptions: `nextflow_schema.json`.\nRun `nextflow run main.nf --help` for a CLI summary.\n\n## Pipeline output\n\nFor each database split processed, a subdirectory under `--outdir//`:\n\n```\n//\n\u251c\u2500\u2500 data/\n\u2502 \u2514\u2500\u2500 /\n\u2502 \u251c\u2500\u2500 train.h5\n\u2502 \u251c\u2500\u2500 test.h5\n\u2502 \u2514\u2500\u2500 optimization.h5\n\u251c\u2500\u2500 ml_output/\n\u2502 \u2514\u2500\u2500 /\n\u2502 \u251c\u2500\u2500 predictions.parquet\n\u2502 \u2514\u2500\u2500 model/\n\u251c\u2500\u2500 rf_output/\n\u2502 \u2514\u2500\u2500 random_forest_/\n\u2502 \u251c\u2500\u2500 predictions.parquet\n\u2502 \u2514\u2500\u2500 model/\n\u251c\u2500\u2500 graph_models/\n\u2502 \u2514\u2500\u2500 /\n\u2502 \u251c\u2500\u2500 predictions.parquet\n\u2502 \u2514\u2500\u2500 model/\n\u2514\u2500\u2500 evaluation/\n \u2514\u2500\u2500 multiqc_report.html\n```\n\nWhen `--db_list` is set, a top-level cross-database report is also written:\n`/evaluation/ddi_report.html`.\n\nA full description of each output is in [`docs/output.md`](docs/output.md).\n\n## Profiles\n\n| Profile | Effect |\n|---|---|\n| `standard` | Local executor + conda. Default. |\n| `docker` | Local executor + docker. |\n| `singularity` | Local executor + singularity / apptainer. |\n| `conda` | Forces conda; disables docker / singularity. |\n| `slurm` | Slurm executor with GPU labels and retry-on-OOM. Pairs with `singularity` on the cluster. |\n| `test` | Tiny in-repo fixture for smoke tests and `nf-test`. |\n| `test_full` | Full-data CI run (large fixtures). |\n\n## Adding new components\n\n- **New ML model:** add `assets/.json` (with `model_name`, `data`,\n `search_parameters`, `model_parameters`) and the matching script in\n `bin/`. The pipeline picks up the JSON automatically.\n- **New feature encoding:** add `bin/features/.py` and append\n `` to `params.machine_learning_features` in `nextflow.config`.\n\n## Credits\n\ndaisybio/domainbenchmark was originally written by Konstantin Pelz, Chiara Thomas, Christian Romberg.\n\nWe thank the following people for their extensive assistance in the development of this pipeline:\n\n- Amelie Hilbig\n\n## Contributions and Support\n\nIf you would like to contribute to this pipeline, please see the [contributing guidelines](docs/CONTRIBUTING.md).\n\n## Citations\n\n\n\n\nAn extensive list of references for the tools used by the pipeline can be found in the [`CITATIONS.md`](CITATIONS.md) file.\n\nThis pipeline uses code and infrastructure developed and maintained by the [nf-core](https://nf-co.re) community, reused here under the [MIT license](https://github.com/nf-core/tools/blob/main/LICENSE).\n\n> **The nf-core framework for community-curated bioinformatics pipelines.**\n>\n> Philip Ewels, Alexander Peltzer, Sven Fillinger, Harshil Patel, Johannes Alneberg, Andreas Wilm, Maxime Ulysse Garcia, Paolo Di Tommaso & Sven Nahnsen.\n>\n> _Nat Biotechnol._ 2020 Feb 13. doi: [10.1038/s41587-020-0439-x](https://dx.doi.org/10.1038/s41587-020-0439-x).\n", + "description": "# daisybio/domainbenchmark\n\n[![Open in GitHub Codespaces](https://img.shields.io/badge/Open_In_GitHub_Codespaces-black?labelColor=grey&logo=github)](https://github.com/codespaces/new/daisybio/domainbenchmark)\n[![GitHub Actions CI Status](https://github.com/daisybio/domainbenchmark/actions/workflows/nf-test.yml/badge.svg)](https://github.com/daisybio/domainbenchmark/actions/workflows/nf-test.yml)\n[![GitHub Actions Linting Status](https://github.com/daisybio/domainbenchmark/actions/workflows/linting.yml/badge.svg)](https://github.com/daisybio/domainbenchmark/actions/workflows/linting.yml)\n[![nf-test](https://img.shields.io/badge/unit_tests-nf--test-337ab7.svg)](https://www.nf-test.com)\n\n\n\n\n[![Nextflow](https://img.shields.io/badge/version-%E2%89%A525.10.2-green?style=flat&logo=nextflow&logoColor=white&color=%230DC09D&link=https%3A%2F%2Fnextflow.io)](https://www.nextflow.io/)\n[![nf-core template version](https://img.shields.io/badge/nf--core_template-4.0.2-green?style=flat&logo=nfcore&logoColor=white&color=%2324B064&link=https%3A%2F%2Fnf-co.re)](https://github.com/nf-core/tools/releases/tag/4.0.2)\n[![run with conda](http://img.shields.io/badge/run%20with-conda-3EB049?labelColor=000000&logo=anaconda)](https://docs.conda.io/en/latest/)\n[![run with docker](https://img.shields.io/badge/run%20with-docker-0db7ed?labelColor=000000&logo=docker)](https://www.docker.com/)\n[![run with singularity](https://img.shields.io/badge/run%20with-singularity-1d355c.svg?labelColor=000000)](https://sylabs.io/docs/)\n[![Launch on Seqera Platform](https://img.shields.io/badge/Launch%20%F0%9F%9A%80-Seqera%20Platform-%234256e7)](https://cloud.seqera.io/launch?pipeline=https://github.com/daisybio/domainbenchmark)\n\n## Introduction\n\n**daisybio/domainbenchmark** is a bioinformatics benchmarking pipeline for protein **domain-domain\ninteraction (DDI)** prediction. Given one or more pre-split DDI databases\n(`train.sqlite3`, `test.sqlite3`, `optimization.sqlite3` plus matching\nembeddings), the pipeline trains a panel of machine-learning and\ngraph-based predictors, evaluates each one against a held-out test split,\nand produces a unified MultiQC report comparing them.\n\nThe pipeline runs the following stages:\n\n1. **DDI extraction** from the database split (`DDI_EXTRACTION`).\n2. **Feature extraction** for every requested encoding (`aacomp`,\n `aaencode`, ProtT5 / ESM-3 / ESM-C protein and domain embeddings) \u2014\n parallelized per `(feature \u00d7 split)`.\n3. **ML classifiers** trained on each feature individually plus one\n all-feature concatenation run (gated by `--machine_learning_models`):\n - `NEURAL_NETWORK` \u2014 neural network (PyTorch + skorch).\n - `RANDOM_FOREST` \u2014 RAPIDS cuML random forest on GPU.\n4. **Graph models** (`GRAPH_MODEL`): KGIDDI, DDI parsimony, KGIDDI-random.\n5. **Per-prediction evaluation** (`EVAL_ONE`) \u2192 tiny per-model JSONs.\n6. **Per-database aggregation** (`EVALUATION`) \u2192 MultiQC report.\n7. **Cross-database aggregation** (`COMBINE_EVAL`) when `--db_list` is set.\n\n```mermaid\nflowchart LR\n subgraph \"per database\"\n ddi[DDI_EXTRACTION]\n feat[FEATURE_EXTRACTION]\n nn[NEURAL_NETWORK]\n rf[RANDOM_FOREST]\n gm[GRAPH_MODEL]\n eo[EVAL_ONE]\n ev[EVALUATION]\n end\n db[(db_list)] --> ddi\n db --> feat --> nn & rf\n db --> gm\n ddi --> nn & rf\n nn & rf & gm --> eo --> ev\n ev --> agg[COMBINE_EVAL] --> report[ddi_report.html]\n```\n\n\n\n## Usage\n\n> [!NOTE]\n> If you are new to Nextflow and nf-core, please refer to [this page](https://nf-co.re/docs/get_started/environment_setup/overview) on how to set-up Nextflow. Make sure to [test your setup](https://nf-co.re/docs/get_started/run-your-first-pipeline) with `-profile test` before running the workflow on actual data.\n\n### Samplesheet (recommended entry)\n\n```bash\nnextflow run . \\\n -profile ,slurm \\\n --input assets/samplesheet.csv \\\n --outdir results\n```\n\nThe samplesheet is a CSV with one row per database split:\n\n```csv\nid,db_path\nrandom_denoise,/path/to/random_denoise\nrandom_ddi,/path/to/random_ddi\n```\n\nSchema in `assets/schema_input.json`. Each `db_path` must contain\n`train.sqlite3`, `test.sqlite3`, `optimization.sqlite3`.\n\n### Cluster (Slurm + Singularity)\n\n```bash\nnextflow run . \\\n -profile slurm,singularity \\\n --input assets/samplesheet.csv \\\n --outdir /nfs/scratch/cobinet/results \\\n -resume\n```\n\n### Skipping stages\n\n`--skip` accepts a comma-separated list of feature names or graph-model\nnames. For example, to skip the heavy embedding-based features and the\ntwo parsimony graph models:\n\n```bash\nnextflow run . \\\n -profile slurm,singularity \\\n --input assets/samplesheet.csv \\\n --skip \"kgiddi,ddiparsimony,prott5_protein_domain_embeddings,esm3_protein_domain_embeddings\"\n```\n\n### Test profile (in-repo fixture)\n\n```bash\nnextflow run . -profile test,singularity --outdir results-test\n```\n\nThe `test` profile points at a tiny in-repo SQLite triple under\n`tests/data/` and disables every heavy feature except `aacomp`. Used by\nCI and by `nf-test`.\n\n## Pipeline parameters\n\n| Parameter | Default | Description |\n|---|---|---|\n| `--input` | `null` | **Required.** Samplesheet CSV (one row per database split). |\n| `--outdir` | `./results` | Output directory. |\n| `--modeljson` | `${projectDir}/assets` | Directory of model hyperparameter JSONs. |\n| `--skip` | `''` | Comma-separated feature/model names to skip. |\n| `--graph_models` | `kgiddi,ddiparsimony,kgiddi_random` | Graph models to run. |\n| `--machine_learning_features` | `aacomp,aaencode,prott5_*,esm3_*,esmc_*` | Feature encodings to compute. |\n| `--large_features` | `prott5_*,esm3_*,esmc_*` | Features routed to `process_gpu_large`. |\n| `--machine_learning_models` | `neural_network,random_forest` | ML models to run. |\n| `--max_protein_combinations_per_ddi` | `null` | Cap on protein-pair instantiations per DDI pair (sampled without replacement). Null = use all. |\n| `--seed` | `42` | Global RNG seed. |\n| `--publish_dir_mode` | `'copy'` | Nextflow `publishDir` mode. |\n\nFull schema with defaults, types, and descriptions: `nextflow_schema.json`.\nRun `nextflow run . --help` for a CLI summary.\n\n## Pipeline output\n\nFor each database split processed, a subdirectory under `--outdir//`:\n\n```\n//\n\u251c\u2500\u2500 data/\n\u2502 \u2514\u2500\u2500 /\n\u2502 \u251c\u2500\u2500 train.h5\n\u2502 \u251c\u2500\u2500 test.h5\n\u2502 \u2514\u2500\u2500 optimization.h5\n\u251c\u2500\u2500 nn_output/\n\u2502 \u2514\u2500\u2500 neural_network_/\n\u2502 \u251c\u2500\u2500 predictions.parquet\n\u2502 \u2514\u2500\u2500 model/\n\u251c\u2500\u2500 rf_output/\n\u2502 \u2514\u2500\u2500 random_forest_/\n\u2502 \u251c\u2500\u2500 predictions.parquet\n\u2502 \u2514\u2500\u2500 model/\n\u251c\u2500\u2500 graph_models/\n\u2502 \u2514\u2500\u2500 /\n\u2502 \u251c\u2500\u2500 predictions.parquet\n\u2502 \u2514\u2500\u2500 model/\n\u2514\u2500\u2500 evaluation/\n \u2514\u2500\u2500 multiqc_report.html\n```\n\nWhen the samplesheet contains more than one database, a top-level\ncross-database report is also written: `/evaluation/ddi_report.html`.\n\nA full description of each output is in [`docs/output.md`](docs/output.md).\n\n## Profiles\n\n| Profile | Effect |\n|---|---|\n| `standard` | Local executor + conda. Default. |\n| `docker` | Local executor + docker. |\n| `singularity` | Local executor + singularity / apptainer. |\n| `conda` | Forces conda; disables docker / singularity. |\n| `slurm` | Slurm executor with GPU labels and retry-on-OOM. Pairs with `singularity` on the cluster. |\n| `test` | Tiny in-repo fixture for smoke tests and `nf-test`. |\n| `test_full` | Full-data CI run (large fixtures). |\n\n## Adding new components\n\n### New feature encoding\n\nCopy the template and implement your feature computation:\n\n1. Copy `bin/features/new_feature.py` to `bin/features/.py`\n2. Implement `extract_features(conn, out_file)` \u2014 read from SQLite, write feature vectors to HDF5\n3. Append `` to `params.machine_learning_features` in `nextflow.config`\n4. If it needs GPU or large memory, also add it to `params.large_features`\n\nSee `bin/features/new_feature.py` for the full contract and examples.\n\n### New ML model\n\nCopy the template and implement your training/prediction logic:\n\n1. Copy `bin/new_model.py` to `bin/.py`\n2. Subclass `DDIModelTrainer` and implement the required methods\n3. Create `assets/.json` with hyperparameter grid (must include `model_name`, `data`, `search_parameters`, `model_parameters`)\n4. Add a Nextflow process in `modules/local//main.nf` and wire it into `subworkflows/local/per_db_benchmark/main.nf`\n\nSee `bin/new_model.py` for the full API and optional overrides.\n\n## Credits\n\ndaisybio/domainbenchmark was originally written by Konstantin Pelz, Chiara Thomas, Christian Romberg.\n\nWe thank the following people for their extensive assistance in the development of this pipeline:\n\n- Amelie Hilbig\n\n## Contributions and Support\n\nIf you would like to contribute to this pipeline, please see the [contributing guidelines](docs/CONTRIBUTING.md).\n\n## Citations\n\nA Zenodo DOI for daisybio/domainbenchmark will be issued at the v1.0.0\nrelease; this section will be updated then. An extensive list of\nreferences for the tools used by the pipeline can be found in the\n[`CITATIONS.md`](CITATIONS.md) file.\n\nThis pipeline uses code and infrastructure developed and maintained by the [nf-core](https://nf-co.re) community, reused here under the [MIT license](https://github.com/nf-core/tools/blob/main/LICENSE).\n\n> **The nf-core framework for community-curated bioinformatics pipelines.**\n>\n> Philip Ewels, Alexander Peltzer, Sven Fillinger, Harshil Patel, Johannes Alneberg, Andreas Wilm, Maxime Ulysse Garcia, Paolo Di Tommaso & Sven Nahnsen.\n>\n> _Nat Biotechnol._ 2020 Feb 13. doi: [10.1038/s41587-020-0439-x](https://dx.doi.org/10.1038/s41587-020-0439-x).\n", "hasPart": [ { "@id": "main.nf" diff --git a/subworkflows/local/aggregate_eval/main.nf b/subworkflows/local/aggregate_eval/main.nf index ae97ebe..eb38ad3 100644 --- a/subworkflows/local/aggregate_eval/main.nf +++ b/subworkflows/local/aggregate_eval/main.nf @@ -16,14 +16,18 @@ workflow AGGREGATE_EVAL { per_db_reports // channel: tuple(meta, evaluation_dir) — emitted by PER_DB_BENCHMARK main: - def reports_collected = per_db_reports - .map { _meta, dir -> dir } - .collect() - - def combine_input_ch = reports_collected.map { dirs -> - def m = [ id: 'combined' ] - tuple(m, dirs) - } + // Collect (db_id, dir) pairs as a single list so COMBINE_EVAL can stage + // each per-DB report under a parent dir named by db_id. Flat collect + // would lose the pairing; flat:false preserves the per-entry tuples. + def combine_input_ch = per_db_reports + .map { meta, dir -> [ meta.id, dir ] } + .collect(flat: false) + .map { entries -> + def m = [ id: 'combined' ] + def ids = entries.collect { it[0] } + def dirs = entries.collect { it[1] } + tuple(m, dirs, ids) + } COMBINE_EVAL(combine_input_ch) diff --git a/subworkflows/local/per_db_benchmark/main.nf b/subworkflows/local/per_db_benchmark/main.nf index 88070de..5632c2e 100644 --- a/subworkflows/local/per_db_benchmark/main.nf +++ b/subworkflows/local/per_db_benchmark/main.nf @@ -7,7 +7,7 @@ ↓ FEATURE_EXTRACTION (scatter (db × feature × split)) ↓ - MACHINE_LEARNING + RANDOM_FOREST + GRAPH_MODEL (per db × combo / model) + NEURAL_NETWORK + RANDOM_FOREST + GRAPH_MODEL (per db × combo / model) ↓ EVAL_ONE (scatter — one tiny JSON per prediction) ↓ @@ -18,18 +18,18 @@ include { DDI_EXTRACTION } from '../../../modules/local/ddi_extraction/main.nf' include { FEATURE_EXTRACTION } from '../../../modules/local/feature_extraction/main.nf' -include { MACHINE_LEARNING } from '../../../modules/local/machine_learning/main.nf' +include { NEURAL_NETWORK } from '../../../modules/local/neural_network/main.nf' include { RANDOM_FOREST } from '../../../modules/local/random_forest/main.nf' include { GRAPH_MODEL } from '../../../modules/local/graph_model/main.nf' include { EVAL_ONE; EVALUATION } from '../../../modules/local/evaluation/main.nf' def csvToList(s) { - s instanceof List ? s.findAll { it } : (s ? s.tokenize(',')*.trim().findAll { it } : []) -} - -def powerSet(List lst) { - lst.inject([[]] as List) { acc, e -> acc + acc.collect { it + e } } + // Accept either a List or a comma-separated String. Filter out blanks + // and the sentinel 'none' so callers can disable a list cleanly with + // `--graph_models none` (or `--graph_models ''` once nf-schema allows it). + def items = s instanceof List ? s : (s ? s.tokenize(',') : []) + items*.trim().findAll { it && it.toLowerCase() != 'none' } } workflow PER_DB_BENCHMARK { @@ -39,7 +39,7 @@ workflow PER_DB_BENCHMARK { main: // --------------------------------------------------------------- - // Feature filtering (constant per pipeline run) + // Feature / model filtering (constant per pipeline run) // --------------------------------------------------------------- def skip_features = csvToList(params.skip) @@ -49,8 +49,15 @@ workflow PER_DB_BENCHMARK { def all_graph_models = csvToList(params.graph_models) def graph_model_names = all_graph_models.findAll { !skip_features.contains(it) } - def feature_combos = powerSet(ml_features) - .findAll { it && it.size() <= params.max_machine_learning_features } + def all_ml_models = csvToList(params.machine_learning_models) + def ml_model_names = all_ml_models.findAll { !skip_features.contains(it) } + def nn_enabled = ml_model_names.contains('neural_network') + def rf_enabled = ml_model_names.contains('random_forest') + + // One run per feature (singleton) plus one all-feature concatenation + // run when more than one feature is available. + def feature_combos = ml_features.collect { [it] } + if (ml_features.size() >= 2) feature_combos << ml_features def ml_config = file(params.modeljson) / 'NeuralNetwork.json' def rf_config = file(params.modeljson) / 'RandomForest.json' @@ -73,36 +80,41 @@ workflow PER_DB_BENCHMARK { ddi_keyed = ddi_ch.map { meta, ddi_dir -> tuple(meta.id, meta, ddi_dir) } // --------------------------------------------------------------- - // ML / RF (powerset of features capped per params) + // NN / RF (per-feature singletons + one all-concat run, gated by + // params.machine_learning_models and --skip) // --------------------------------------------------------------- - ml_input_ch = ddi_keyed + nn_input_ch = nn_enabled ? ddi_keyed .join(feature_dirs_per_db) .combine(Channel.fromList(feature_combos.collect { [it] })) .combine(Channel.value(ml_config)) .map { _db_id, meta, ddi_dir, feature_dirs, combo, cfg -> + def combo_id = combo.size() == 1 ? combo[0] : 'all' def m = [ - id : "${meta.id}_machine_learning_${combo.join('_')}", + id : "${meta.id}_neural_network_${combo_id}", db : meta.db, - model : 'machine_learning', - features: combo + model : 'neural_network', + features: combo, + combo_id: combo_id ] tuple(m, ddi_dir, feature_dirs, cfg) - } - MACHINE_LEARNING(ml_input_ch) + } : Channel.empty() + NEURAL_NETWORK(nn_input_ch) - rf_input_ch = ddi_keyed + rf_input_ch = rf_enabled ? ddi_keyed .join(feature_dirs_per_db) .combine(Channel.fromList(feature_combos.collect { [it] })) .combine(Channel.value(rf_config)) .map { _db_id, meta, ddi_dir, feature_dirs, combo, cfg -> + def combo_id = combo.size() == 1 ? combo[0] : 'all' def m = [ - id : "${meta.id}_random_forest_${combo.join('_')}", + id : "${meta.id}_random_forest_${combo_id}", db : meta.db, model : 'random_forest', - features: combo + features: combo, + combo_id: combo_id ] tuple(m, ddi_dir, feature_dirs, cfg) - } + } : Channel.empty() RANDOM_FOREST(rf_input_ch) // --------------------------------------------------------------- @@ -124,7 +136,7 @@ workflow PER_DB_BENCHMARK { // --------------------------------------------------------------- // Per-prediction evaluation (scatter) // --------------------------------------------------------------- - all_predictions_ch = MACHINE_LEARNING.out.predictions + all_predictions_ch = NEURAL_NETWORK.out.predictions .mix(RANDOM_FOREST.out.predictions) .mix(GRAPH_MODEL.out.predictions) .map { meta, pred -> @@ -165,7 +177,7 @@ workflow PER_DB_BENCHMARK { // --------------------------------------------------------------- ch_versions = DDI_EXTRACTION.out.versions .mix(FEATURE_EXTRACTION.out.versions) - .mix(MACHINE_LEARNING.out.versions) + .mix(NEURAL_NETWORK.out.versions) .mix(RANDOM_FOREST.out.versions) .mix(GRAPH_MODEL.out.versions) .mix(EVAL_ONE.out.versions) diff --git a/subworkflows/local/utils_nfcore_domainbenchmark_pipeline/main.nf b/subworkflows/local/utils_nfcore_domainbenchmark_pipeline/main.nf index 8eba52c..64f3409 100644 --- a/subworkflows/local/utils_nfcore_domainbenchmark_pipeline/main.nf +++ b/subworkflows/local/utils_nfcore_domainbenchmark_pipeline/main.nf @@ -10,8 +10,6 @@ include { UTILS_NFSCHEMA_PLUGIN } from '../../nf-core/utils_nfschema_plugin' include { paramsSummaryMap } from 'plugin/nf-schema' -include { samplesheetToList } from 'plugin/nf-schema' -include { paramsHelp } from 'plugin/nf-schema' include { completionEmail } from '../../nf-core/utils_nfcore_pipeline' include { completionSummary } from '../../nf-core/utils_nfcore_pipeline' include { UTILS_NFCORE_PIPELINE } from '../../nf-core/utils_nfcore_pipeline' @@ -35,9 +33,6 @@ workflow PIPELINE_INITIALISATION { help // boolean: Display help message and exit help_full // boolean: Show the full help message show_hidden // boolean: Show hidden parameters in the help message - // pipeline-specific input - db_list // string: Comma-separated list of databases to benchmark - db // string: Single database to benchmark (alternative to --db_list or --input samplesheet) main: @@ -92,28 +87,14 @@ workflow PIPELINE_INITIALISATION { .fromPath(params.input, checkIfExists: true) .splitCsv(header: true) .map { row -> - def db_path = file(row.db_path, checkIfExists: true) + def db_path = row.db_path.startsWith('/') + ? file(row.db_path, checkIfExists: true) + : file("${workflow.projectDir}/${row.db_path}", checkIfExists: true) def id = row.id ?: db_path.getName() tuple([ id: id, db: id ], db_path) } - } else if (params.db_list) { - def items = params.db_list instanceof List - ? params.db_list - : params.db_list.tokenize(',')*.trim().findAll { it } - db_ch = Channel - .fromList(items) - .map { p -> - def db_path = file(p, checkIfExists: true) - tuple([ id: db_path.getName(), db: db_path.getName() ], db_path) - } - } else if (params.db) { - def db_path = file(params.db, checkIfExists: true) - db_ch = Channel.value(tuple( - [ id: db_path.getName(), db: db_path.getName() ], - db_path - )) } else { - error "No input provided: set --input , --db_list , or --db " + error "No input provided: set --input " } emit: @@ -190,12 +171,16 @@ def validateInputSamplesheet(input) { // Generate methods description for MultiQC // def toolCitationText() { - // TODO nf-core: Optionally add in-text citation tools to this list. - // Can use ternary operators to dynamically construct based conditions, e.g. params["run_xyz"] ? "Tool (Foo et al. 2023)" : "", - // Uncomment function in methodsDescriptionText to render in MultiQC report def citation_text = [ "Tools used in the workflow included:", - "MultiQC (Ewels et al. 2016)", + "MultiQC (Ewels et al. 2016),", + "scikit-learn (Pedregosa et al. 2011),", + "PyTorch (Paszke et al. 2019),", + "NetworkX (Hagberg et al. 2008),", + "ESM-3 (Hayes et al. 2024),", + "ESM-C (EvolutionaryScale 2024),", + "ProtT5 (Elnaggar et al. 2022),", + "and ProtDCal (Ruiz-Blanco et al. 2015)", "." ].join(' ').trim() @@ -203,11 +188,15 @@ def toolCitationText() { } def toolBibliographyText() { - // TODO nf-core: Optionally add bibliographic entries to this list. - // Can use ternary operators to dynamically construct based conditions, e.g. params["run_xyz"] ? "
  • Author (2023) Pub name, Journal, DOI
  • " : "", - // Uncomment function in methodsDescriptionText to render in MultiQC report def reference_text = [ - "
  • Ewels, P., Magnusson, M., Lundin, S., & Käller, M. (2016). MultiQC: summarize analysis results for multiple tools and samples in a single report. Bioinformatics , 32(19), 3047–3048. doi: /10.1093/bioinformatics/btw354
  • " + "
  • Ewels, P., Magnusson, M., Lundin, S., & Käller, M. (2016). MultiQC: summarize analysis results for multiple tools and samples in a single report. Bioinformatics, 32(19), 3047–3048. doi: 10.1093/bioinformatics/btw354
  • ", + "
  • Pedregosa, F. et al. (2011). Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research, 12, 2825–2830.
  • ", + "
  • Paszke, A. et al. (2019). PyTorch: An Imperative Style, High-Performance Deep Learning Library. NeurIPS 32.
  • ", + "
  • Hagberg, A., Schult, D., & Swart, P. (2008). Exploring network structure, dynamics, and function using NetworkX. SciPy 2008.
  • ", + "
  • Hayes, T. et al. (2024). Simulating 500 million years of evolution with a language model. bioRxiv. doi: 10.1101/2024.07.01.600583
  • ", + "
  • EvolutionaryScale (2024). ESM Cambrian: revealing the mysteries of proteins with unsupervised learning.
  • ", + "
  • Elnaggar, A. et al. (2022). ProtTrans: Toward Understanding the Language of Life Through Self-Supervised Learning. IEEE TPAMI, 44(10), 7112–7127. doi: 10.1109/TPAMI.2021.3095381
  • ", + "
  • Ruiz-Blanco, Y. B., Paz, W., Green, J., & Marrero-Ponce, Y. (2015). ProtDCal: A program to compute general-purpose-numerical descriptors for sequences and 3D-structures of proteins. BMC Bioinformatics, 16, 162. doi: 10.1186/s12859-015-0586-0
  • " ].join(' ').trim() return reference_text @@ -234,12 +223,8 @@ def methodsDescriptionText(mqc_methods_yaml) { meta["nodoi_text"] = meta.manifest_map.doi ? "" : "
  • If available, make sure to update the text to include the Zenodo DOI of version of the pipeline used.
  • " // Tool references - meta["tool_citations"] = "" - meta["tool_bibliography"] = "" - - // TODO nf-core: Only uncomment below if logic in toolCitationText/toolBibliographyText has been filled! - // meta["tool_citations"] = toolCitationText().replaceAll(", \\.", ".").replaceAll("\\. \\.", ".").replaceAll(", \\.", ".") - // meta["tool_bibliography"] = toolBibliographyText() + meta["tool_citations"] = toolCitationText().replaceAll(", \\.", ".").replaceAll("\\. \\.", ".").replaceAll(", \\.", ".") + meta["tool_bibliography"] = toolBibliographyText() def methods_text = mqc_methods_yaml.text diff --git a/subworkflows/local/utils_nfcore_domainbenchmark_pipeline/meta.yml b/subworkflows/local/utils_nfcore_domainbenchmark_pipeline/meta.yml new file mode 100644 index 0000000..1238d07 --- /dev/null +++ b/subworkflows/local/utils_nfcore_domainbenchmark_pipeline/meta.yml @@ -0,0 +1,37 @@ +name: utils_nfcore_domainbenchmark_pipeline +description: | + Utility helpers for the daisybio/domainbenchmark pipeline: + initialisation, completion handlers, parameter validation, citation / + methods description rendering, and other nf-core boilerplate. +keywords: + - utility + - nf-core + - initialisation + - completion +components: + - utils_nextflow_pipeline + - utils_nfcore_pipeline + - utils_nfschema_plugin + - completionemail + - completionsummary +input: + - nextflow_cli_args: + type: list + description: | + Positional CLI arguments passed to Nextflow. Validated by + `validateInputParameters`. + - outdir: + type: string + description: Output directory for the pipeline run. + - input: + type: string + description: Path to the samplesheet CSV. +output: + - samplesheet: + type: channel + description: | + Channel of validated samplesheet rows parsed by `samplesheetToList`. +authors: + - "Konstantin Pelz" +maintainers: + - "Konstantin Pelz"