Add MOOSEz segmentation workflow with Dockerfiles, WDL pipelines, and Terra notebooks #67

Open
Sunderlandkyl wants to merge 27 commits into ImagingDataCommons:main from Sunderlandkyl:moose_test

@Sunderlandkyl

  • Add inference and post-processing Dockerfiles
  • Add Terra WDL workflows (single and split VM variants)
  • Add preprocessing, inference, and post-processing notebooks

… Terra notebooks

- Add inference and post-processing Dockerfiles
- Add Terra WDL workflows (single and split VM variants)
- Add preprocessing, inference, and post-processing notebooks
@review-notebook-app


Check out this pull request on ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.


Powered by ReviewNB

Sunderlandkyl and others added 26 commits June 23, 2026 12:30

dataset.json's "labels" field is {organ_name: label_id} (nnU-Net v2 schema,
confirmed against moosez's own Model.__get_organ_indices), but the bundling
code assumed the opposite ({label_id: organ_name}) and filtered on
k.isdigit() -- which is never true for a name key, so organ_indices came out
empty for every model and every segment fell back to "segment_N".

Capture organ_indices straight from each Model object's own .organ_indices
during inference instead of re-parsing dataset.json, so this can't drift
from how moosez itself reads its files again.
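
To make the schema mix-up concrete, here is a minimal sketch, assuming an nnU-Net v2 style `dataset.json` whose `labels` field maps names to integer IDs. The helper names and the sample organs are illustrative and are not taken from the moosez or notebook code.

```python
import json

# Illustrative dataset.json fragment (nnU-Net v2 style: name -> label id).
DATASET_JSON = '{"labels": {"background": 0, "liver": 1, "spleen": 2}}'


def buggy_organ_indices(labels):
    # Old assumption: keys are label ids. Name keys never pass isdigit(),
    # so the result is always empty.
    return {int(k): v for k, v in labels.items() if k.isdigit()}


def invert_labels(labels):
    # Hypothetical helper: turn {name: id} into {id: name}, skipping background.
    return {int(v): k for k, v in labels.items() if k != "background"}


labels = json.loads(DATASET_JSON)["labels"]
print(buggy_organ_indices(labels))  # {}
print(invert_labels(labels))        # {1: 'liver', 2: 'spleen'}
```

The commit itself avoids re-parsing altogether by reading each Model object's `.organ_indices`; the sketch only shows why the old filter returned nothing.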

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Enable running moosePostProcessNotebook.ipynb standalone in Colab or a
local Jupyter/Python kernel. A presence-gated setup cell installs the
missing Python packages, the lz4 CLI, dcmqi 1.5.4 (itkimage2segimage with
labelmap support) and SNOMED.py, mirroring the post_process_moose
Dockerfile. Inside the prebuilt image every check is a no-op.
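
A presence-gated check of this kind could look roughly like the sketch below. It only reports what is missing and leaves installation to the caller. The function name and the sample package/tool names are assumptions for illustration, not the notebook's actual cell.

```python
import importlib.util
import shutil


def missing_prereqs(python_packages, cli_tools):
    """Return (packages, tools) that are not yet available in this environment."""
    # find_spec returns None for a top-level module that cannot be imported.
    missing_pkgs = [p for p in python_packages
                    if importlib.util.find_spec(p) is None]
    # shutil.which returns None when the executable is not on PATH.
    missing_tools = [t for t in cli_tools if shutil.which(t) is None]
    return missing_pkgs, missing_tools


pkgs, tools = missing_prereqs(["json"], ["lz4", "itkimage2segimage"])
print("packages to install:", pkgs)
print("CLI tools to install:", tools)
```

Inside an image that already ships everything, both lists come back empty and the install step is skipped, which is the no-op behavior described above.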

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…ence

Switch the post_process_moose Dockerfile and post-process notebook from the
dcmqi release tarball to pip (dcmqi==1.5.5). Remove dcmqi from the inference
image, which only produces NIfTI and never runs itkimage2segimage.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…stly

The PyPI `dcmqi` package version is independent of the dcmqi release it bundles;
`dcmqi==1.5.5` does not exist, so the install failed. Pin `dcmqi==0.4.1`, which
ships dcmqi binaries v1.5.5, in the post-process Dockerfile and notebook.

In the notebook, also locate itkimage2segimage via the installed package's
bundled bin/ directory and prepend it to PATH, so the converter is found even
when pip's console-script shim lands in a directory off PATH (e.g. ~/.local/bin).
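
The PATH fix can be sketched as follows, assuming the installed `dcmqi` package is a regular Python package with a `bin/` subdirectory, as the commit message describes. The helper names are hypothetical, and the real notebook cell may differ.

```python
import importlib.util
import os
from pathlib import Path


def bundled_bin_dir(package_name):
    """Return <package dir>/bin if the package is installed and has one, else None."""
    spec = importlib.util.find_spec(package_name)
    if spec is None or not spec.submodule_search_locations:
        return None
    for location in spec.submodule_search_locations:
        candidate = Path(location) / "bin"
        if candidate.is_dir():
            return candidate
    return None


def prepend_to_path(directory, env=os.environ):
    """Put directory at the front of PATH unless it is already listed."""
    parts = [p for p in env.get("PATH", "").split(os.pathsep) if p]
    if str(directory) not in parts:
        env["PATH"] = os.pathsep.join([str(directory)] + parts)


bin_dir = bundled_bin_dir("dcmqi")  # None when the package is not installed
if bin_dir is not None:
    prepend_to_path(bin_dir)
```

Prepending rather than appending means the bundled converter wins over any older copy already on PATH.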

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…i 1.5.5

Add --useLabelIDAsSegmentNumber to itkimage2segimage so the DICOM-SEG segment
numbers match the source moosez label values instead of being renumbered 1..N.

Also change the bare --skip switch to "--skip 1": in dcmqi 1.5.5 --skip takes an
int (default 1 = skip empty slices), so the valueless form from older dcmqi
would now fail argument parsing. "1" preserves the prior skip-empty-slices
behavior.
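
As a rough illustration, the resulting converter call might be assembled like this. The two flags discussed above come from this commit. The input/output flags are the usual dcmqi options as far as I know and should be checked against `itkimage2segimage --help`. The function name and paths are placeholders.

```python
def build_seg_command(label_nifti, dicom_dir, metadata_json, output_seg):
    """Assemble an itkimage2segimage argv list (nothing is executed here)."""
    return [
        "itkimage2segimage",
        "--inputImageList", label_nifti,
        "--inputDICOMDirectory", dicom_dir,
        "--inputMetadata", metadata_json,
        "--outputDICOM", output_seg,
        "--useLabelIDAsSegmentNumber",  # keep moosez label values as segment numbers
        "--skip", "1",                  # dcmqi 1.5.5: --skip takes an int
    ]


cmd = build_seg_command("labels.nii.gz", "dicom/", "meta.json", "seg.dcm")
print(" ".join(cmd))
```

Passing the list to `subprocess.run(cmd, check=True)` would run the conversion without any shell quoting concerns.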

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

The inference step bundled each label's SNOMED {ID, name} from moosez's own
mappings.SNOMED into moose_organ_indices.json. That data still carries the
upstream issues (left/right structures collapsed to one code, gluteus_minimus
mapped to the medius code) and only a single type code with no category,
laterality modifier, anatomic region, or curated color.

Switch the source of truth to workflows/MOOSE/resources/moose_snomed_mapping.csv:

- Inference notebook: replace the moosez.mappings.SNOMED lookup with a stdlib-csv
  loader that reads the curated CSV (wget'd next to the notebook) and bundles a
  rich record per label -- category, type, laterality modifier, anatomic region
  (+ modifier), and rgb -- into moose_organ_indices.json. Canonicalizes the
  vertebra/vertebrae naming difference.
- Post-process notebook: build_dcmqi_config now emits the bundled category, type,
  laterality modifier, and anatomic region, and uses the CSV color (falling back
  to the distinct-color palette). Removes the now-dead SNOMED.py download.
- twoVM.wdl: wget the CSV in the inference task.

Builds on the corrected CSV (gluteus_minimus -> 75297007; laterality encoded as
a Type Modifier so left/right no longer share one code).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

The SNOMED mapping is static, so attaching it to every label in
moose_organ_indices.json at inference time (and rewriting that JSON on every
run) is needless work on the GPU VM. Move the CSV resolution to the
post-process step, which is where the DICOM-SEG metadata is actually built.

- Inference notebook: organ_indices is back to {model: {label_id: name}} --
  just the moosez label names. Drops the CSV loader/lookup and the SNOMED
  coverage print.
- Post-process notebook: loads moose_snomed_mapping.csv once and resolves each
  label name (canonicalizing vertebra/vertebrae) in build_dcmqi_config, emitting
  category / type / laterality modifier / anatomic region / curated color, with
  the distinct-color palette as the rgb fallback. organ_name() tolerates the
  older bundled {"name","snomed"} shape for back-compat.
- twoVM.wdl: wget the CSV in the post-process task instead of the inference task.
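
The back-compat behavior of `organ_name()` might look something like the sketch below, assuming the two bundle shapes named above: a plain label name, or the older `{"name", "snomed"}` record. The sample values are placeholders, and the actual notebook function may handle more cases.

```python
def organ_name(entry):
    """Return the label name from either bundled shape."""
    if isinstance(entry, dict):
        # Older bundles: {"name": ..., "snomed": {...}}
        return entry["name"]
    # Current bundles: the entry is already the plain moosez label name.
    return entry


new_style = {"1": "liver"}
old_style = {"1": {"name": "liver", "snomed": {"ID": "placeholder", "name": "placeholder"}}}

print(organ_name(new_style["1"]))  # liver
print(organ_name(old_style["1"]))  # liver
```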

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Pass --compress deflate to itkimage2segimage so each generated SEG is
written compressed, shrinking the .dcm files and the packaged archive.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Task 2 now receives Task 1's usageMetricsCsv as a new WDL input,
merges it with its own metrics (inference_/postprocess_ prefixes
resolve the series_download_s column collision between phases), and
uploads the combined CSV to a _metrics/ subfolder under
dicomSegBucketUri. Drop the JSON metrics files entirely -- the CSV
already covers everything needed for analysis, and the full
usage_metrics dict is still visible in the output notebook's printed
cell output if needed.
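
One way to picture the merge is the stdlib-only sketch below. It assumes both CSVs share a `SeriesInstanceUID` key column and prefixes every other column. Apart from `series_download_s`, the column names are invented for illustration, and the notebook may prefix only the colliding columns.

```python
import csv
import io

KEY = "SeriesInstanceUID"  # assumed join column


def prefixed_rows(csv_text, prefix):
    """Read a CSV and prefix every column except the join key."""
    rows = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        rows[row[KEY]] = {f"{prefix}{k}": v for k, v in row.items() if k != KEY}
    return rows


def merge_metrics(inference_csv, postprocess_csv):
    """Combine per-series rows from both phases into one row per series."""
    inf = prefixed_rows(inference_csv, "inference_")
    post = prefixed_rows(postprocess_csv, "postprocess_")
    merged = []
    for uid in sorted(set(inf) | set(post)):
        merged.append({KEY: uid, **inf.get(uid, {}), **post.get(uid, {})})
    return merged


inference = "SeriesInstanceUID,series_download_s,moose_runtime_s\n1.2.3,4.0,60.5\n"
postprocess = "SeriesInstanceUID,series_download_s,seg_convert_s\n1.2.3,3.1,12.0\n"
for row in merge_metrics(inference, postprocess):
    print(row)
```

With the prefixes applied, the two `series_download_s` values sit side by side instead of one overwriting the other.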

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

MOOSE: source SNOMED coding from curated CSV (laterality, gluteus_minimus, category/region/colors)

…ization

The post-process notebook no longer substitutes a generic placeholder
(Body structure + palette color) for labels it can't resolve. build_dcmqi_config
now collects every label that lacks a complete SNOMED entry (category + type +
color) and raises KeyError listing them, so mapping gaps fail loudly instead of
producing meaningless DICOM SEG metadata.

Also removes _canon_label, which rewrote "vertebrae_*" -> "vertebra_*". moosez
emits plural vertebra labels (vertebrae_C1, ...) and the curated CSV now keys them
plural too, so the lookup is an exact match and the canonicalization (applied to
both sides) was redundant. The loader keys by the raw label name and raises on a
genuine duplicate-with-conflicting-codes.
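
The fail-loudly behavior described above can be sketched as below. The column names (`label`, `category`, `type`, `rgb`) and the sample row are assumptions, since the real header of moose_snomed_mapping.csv is not shown here. The function names are likewise hypothetical.

```python
import csv
import io

REQUIRED = ("category", "type", "rgb")  # hypothetical column names


def load_mapping(csv_text):
    """Key rows by raw label name; identical repeats are fine, conflicts raise."""
    mapping = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        name = row["label"]  # hypothetical column name
        if name in mapping and mapping[name] != row:
            raise ValueError(f"conflicting rows for label {name!r}")
        mapping[name] = row
    return mapping


def resolve_labels(label_names, mapping):
    """Return per-label entries, or raise KeyError naming every incomplete label."""
    missing = [n for n in label_names
               if n not in mapping or not all(mapping[n].get(c) for c in REQUIRED)]
    if missing:
        raise KeyError(f"no complete SNOMED entry for: {sorted(missing)}")
    return {n: mapping[n] for n in label_names}


# Placeholder values only; real rows carry curated SNOMED codes and colors.
SAMPLE = "label,category,type,rgb\nvertebrae_C1,CAT,TYPE,255 0 0\n"
mapping = load_mapping(SAMPLE)
print(resolve_labels(["vertebrae_C1"], mapping))
```

Collecting every unresolved label before raising means one failed run reports all mapping gaps at once instead of one per attempt.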

Updates the stale section doc (SNOMED is sourced from the curated CSV here, not
moosez.mappings.SNOMED) and the "unknown segments use generic codes" warning.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

… and twoVM.wdl

Lets both notebooks pull DICOM from a gs:// URI via s5cmd (HMAC creds from
Secret Manager) instead of IDC, sorting downloaded files into series
folders by SeriesInstanceUID. Threads the new input_uri/secret_project
parameters through twoVM.wdl so the source can be selected from Terra.

In the inference notebook, checkpoint restore is moved to run after both
download paths so its run_key reflects the series actually processed,
since GCS-sourced series_uids aren't known until after download+sort
completes. In the post-process notebook, GCS DICOM is bulk-downloaded and
sorted once before the per-series loop rather than per series.
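
The sort-by-series step could be sketched like this. The UID reader is passed in so the example runs with the standard library alone. With pydicom available, a reader such as `lambda p: str(pydicom.dcmread(p, stop_before_pixels=True).SeriesInstanceUID)` would be one option. The function names and the fake files are illustrative only.

```python
import shutil
import tempfile
from pathlib import Path


def sort_into_series(files, dest_root, read_series_uid):
    """Move each file into dest_root/<SeriesInstanceUID>/ and return the UIDs seen."""
    dest_root = Path(dest_root)
    series_uids = []
    for path in map(Path, files):
        uid = read_series_uid(path)
        target_dir = dest_root / uid
        target_dir.mkdir(parents=True, exist_ok=True)
        shutil.move(str(path), str(target_dir / path.name))
        if uid not in series_uids:
            series_uids.append(uid)
    return series_uids


def demo():
    """Sort three fake files whose text content stands in for the series UID."""
    with tempfile.TemporaryDirectory() as tmp:
        incoming = Path(tmp) / "incoming"
        incoming.mkdir()
        for name, uid in [("a.dcm", "1.2.3"), ("b.dcm", "1.2.3"), ("c.dcm", "4.5.6")]:
            (incoming / name).write_text(uid)
        series_root = Path(tmp) / "series"
        uids = sort_into_series(sorted(incoming.iterdir()), series_root,
                                read_series_uid=lambda p: p.read_text())
        layout = sorted(p.relative_to(series_root).as_posix()
                        for p in series_root.rglob("*.dcm"))
        return uids, layout


print(demo())
```

Because the series UIDs only become known once this pass finishes, anything keyed on them (such as the checkpoint `run_key` mentioned above) has to run afterwards.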

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Pass Task 1's moose_stats.tar.lz4 into the post-process task and upload its
per-series volume/HU-intensity CSVs to a _stats/ prefix under
dicomSegBucketUri, mirroring the existing _metrics/ usage-metrics upload.

Neither MOOSE Docker image installs the gcloud SDK, so GCS-input-mode
runs failed with FileNotFoundError: 'gcloud'. Switch both notebooks to
the google-cloud-secret-manager Python client (matching the existing
google-cloud-storage/ADC pattern), pin the package in both Dockerfiles,
and add runtime pip-install fallbacks (inference: on ImportError;
post-process: via its existing prereq-installer cell, which now also
installs s5cmd for standalone/Colab GCS-mode runs).
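
A hedged sketch of the client-based approach follows. The helper names and the `latest` version default are assumptions. `read_secret` relies on Application Default Credentials and is not called here, since it needs a real project and secret.

```python
import importlib
import subprocess
import sys


def secret_version_name(project, secret_id, version="latest"):
    """Build the Secret Manager resource name for one secret version."""
    return f"projects/{project}/secrets/{secret_id}/versions/{version}"


def import_or_install(module_name, pip_name):
    """Import a module, pip-installing its distribution on ImportError."""
    try:
        return importlib.import_module(module_name)
    except ImportError:
        subprocess.check_call([sys.executable, "-m", "pip", "install", pip_name])
        importlib.invalidate_caches()  # make the fresh install importable
        return importlib.import_module(module_name)


def read_secret(project, secret_id):
    """Fetch a secret payload as text via the Python client (no gcloud CLI needed)."""
    secretmanager = import_or_install("google.cloud.secretmanager",
                                      "google-cloud-secret-manager")
    client = secretmanager.SecretManagerServiceClient()  # uses ADC
    response = client.access_secret_version(
        request={"name": secret_version_name(project, secret_id)})
    return response.payload.data.decode("utf-8")


print(secret_version_name("my-project", "hmac-key"))
```

Keeping the package pinned in the Dockerfiles means the pip fallback should only ever fire in standalone or Colab runs.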

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2 participants