Fix duplicates #131

Draft
christinehc wants to merge 75 commits into main from fix_duplicates

Conversation

@christinehc
Collaborator

@christinehc christinehc commented Feb 19, 2026

Note: IN PROGRESS

Ideally, I would like to update the build manifest to use figshare files exclusively, but not all of the data on GitHub has been uploaded to figshare yet. If we would rather merge these changes now and open a separate branch to update the manifest, that's fine too.

Summary

Fixes duplicated Sample_ID assignment and updates the pipeline so that it fully produces the zebrafish benchmark dose response curve files from the new data files. Also includes several quality-of-life improvements to the code.

Changelog

  • Fixes an error that assigned distinct Sample_IDs to non-unique (i.e. duplicate) samples, creating too many unique Sample_IDs
  • Loads (most) data from figshare
  • Checks schemas for output data, including zebrafish. Updates the zebrafish schemas accordingly, using the is_a construct as a template for the Dose, Fits, and BMD types.
  • Tests the BMDRC pipeline on actual data. Note that the BMDRC package has been updated accordingly to keep it consistent with the pipeline.
  • Updates ontology mappings
  • Adds link to chemical structure figure for portal
  • Fixes figshare uploading so that files are fully uploaded and downloadable
  • Adds DataLoader class to handle data loading, from figshare or otherwise
  • Adds DataManifest class to handle interaction with build manifest (data/srp_build_files.csv), including handling file versions
  • Changes sigGeneStats to a new, longer format (NOTE: sigGeneStats does not appear to be used in the pipeline currently; this may need a fix)
  • Minor changes: improved formatting, refactored code for readability, added full documentation, added secrets, updated schema
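
As a rough illustration of the version handling mentioned for DataManifest, the sketch below picks the newest entry per file from manifest text. The real columns of data/srp_build_files.csv are not shown in this PR, so the `name`, `version`, and `url` columns, the example URLs, and the `latest_entries` helper are all assumptions, not the actual class API.

```python
import csv
import io
from typing import Dict

# Hypothetical manifest layout: the column names and URLs are assumptions.
EXAMPLE_MANIFEST = """name,version,url
zebrafish_samples,1,https://example.org/samples_v1.csv
zebrafish_samples,2,https://example.org/samples_v2.csv
chemicals,1,https://example.org/chemicals_v1.csv
"""


def latest_entries(manifest_text: str) -> Dict[str, Dict[str, str]]:
    """Return the highest-version row for each file name in the manifest."""
    latest: Dict[str, Dict[str, str]] = {}
    for row in csv.DictReader(io.StringIO(manifest_text)):
        current = latest.get(row["name"])
        # Versions are assumed to be plain integers.
        if current is None or int(row["version"]) > int(current["version"]):
            latest[row["name"]] = row
    return latest


entries = latest_entries(EXAMPLE_MANIFEST)
print(entries["zebrafish_samples"]["url"])
```

Keeping every version in the manifest and resolving the newest one at load time would let older builds stay reproducible, though the PR does not say whether DataManifest works this way.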

Issues

changelog:
- previously, distinct sample IDs were being assigned to samples with the same metadata parameters because duplicate samples were grouped improperly. Sample ID assignment has been corrected so that samples sharing the same metadata parameters receive a single, unique sample ID.
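
A minimal sketch of the intended behavior, assuming sample identity is defined by a fixed set of metadata fields. The field names (`chemical`, `conc`, `plate`) and the `assign_sample_ids` helper are hypothetical and are not taken from the pipeline code.

```python
from typing import Dict, Iterable, List, Sequence, Tuple


def assign_sample_ids(
    rows: Iterable[Dict[str, str]], key_fields: Sequence[str]
) -> List[Dict[str, object]]:
    """Give rows that share the same metadata key the same Sample_ID."""
    ids: Dict[Tuple[str, ...], int] = {}
    labeled: List[Dict[str, object]] = []
    for row in rows:
        key = tuple(row[field] for field in key_fields)
        # The first time a key is seen it gets the next ID; repeats reuse it.
        sample_id = ids.setdefault(key, len(ids) + 1)
        labeled.append({**row, "Sample_ID": sample_id})
    return labeled


# Hypothetical metadata rows: the first two are duplicates of each other.
rows = [
    {"chemical": "A", "conc": "1.0", "plate": "p1"},
    {"chemical": "A", "conc": "1.0", "plate": "p1"},
    {"chemical": "B", "conc": "1.0", "plate": "p1"},
]
labeled = assign_sample_ids(rows, ["chemical", "conc", "plate"])
print([row["Sample_ID"] for row in labeled])
```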
changelog:
- add bmdrc code to the main build pipeline via the `fitCurveFiles` function; validated to work locally
changelog:
- style: reformat and lint code using ruff
- refactor: change output filename format
- refactor: file cleanup, e.g. move large list/dict params to a separate params file to make the code easier to follow
- style: add some documentation
- feat: add new CLI args for specifying filename format
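
For illustration only, a filename-format argument could be wired up with `argparse` as below. The flag name `--filename-format`, its default template, and the placeholder names are hypothetical, since the commit message does not name the actual arguments.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Build output files (sketch).")
    # Hypothetical flag: the PR does not name the actual CLI arguments.
    parser.add_argument(
        "--filename-format",
        default="{prefix}_{datatype}.csv",
        help="str.format template used to name output files",
    )
    return parser


# Parse an explicit argument list so the sketch runs without a real command line.
args = build_parser().parse_args(["--filename-format", "{prefix}-{datatype}.csv"])
filename = args.filename_format.format(prefix="zebrafish", datatype="BMDs")
print(filename)
```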
changelog:
- Manifest handler object for interfacing with and downloading files from the manifest now exists.
- Schema parser to pull columns and slots from schema classes.
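
One way such a schema parser might resolve slots through `is_a` inheritance is sketched below. It assumes the schema has been loaded into a plain dictionary with `classes`, `is_a`, and `slots` keys; the class and slot names are illustrative and do not come from the repository's schema files.

```python
from typing import Dict, List

# Hypothetical schema fragment; class and slot names are illustrative.
SCHEMA: Dict[str, dict] = {
    "classes": {
        "Dose": {"slots": ["Chemical_ID", "Dose", "Response"]},
        "ZebrafishDose": {"is_a": "Dose", "slots": ["End_Point"]},
    }
}


def slots_for(schema: Dict[str, dict], class_name: str) -> List[str]:
    """Collect a class's slots, parents first, by following is_a links."""
    cls = schema["classes"][class_name]
    parent = cls.get("is_a")
    inherited = slots_for(schema, parent) if parent else []
    # Skip slots already contributed by a parent to avoid duplicate columns.
    own = [slot for slot in cls.get("slots", []) if slot not in inherited]
    return inherited + own


columns = slots_for(SCHEMA, "ZebrafishDose")
print(columns)
```

The resolved slot list can then be compared against the columns of an output file to check that it matches the schema.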
@sgosline sgosline marked this pull request as draft March 4, 2026 19:08
@sgosline
Member

sgosline commented Mar 4, 2026

I converted PR to draft status, just to keep track :)

changelog:
- refactor: previously, the code that generates zebrafish chemical and sample files was split across two separate files (map_samples_to_chemicals.py and build_script.py). Sample files were being produced by build_script.py, but not correctly. The code that generated the incorrect zebrafish sample files has been removed, and all zebrafish file combination code now lives in build_script.py. This also involved moving code dependencies to src/samples.py.
- fix: fix the generation of incorrect sample files (see above)
(note: the code now removes trailing spaces from fields in the dataframe when it is loaded, and the file itself has been corrected as well)
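
The trailing-space cleanup on load can be sketched with the standard library as follows. The pipeline itself works on dataframes, so this is only an analogous illustration; the `load_stripped` helper and the sample columns are hypothetical.

```python
import csv
import io
from typing import Dict, List


def load_stripped(text: str) -> List[Dict[str, str]]:
    """Read CSV text and strip trailing whitespace from headers and fields."""
    reader = csv.DictReader(io.StringIO(text))
    return [
        {key.rstrip(): value.rstrip() for key, value in row.items()}
        for row in reader
    ]


# Hypothetical input with trailing spaces in one header and one field.
raw = "Sample_ID,chemical \n1,benzene  \n2,toluene\n"
records = load_stripped(raw)
print(records)
```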