Fix duplicates by christinehc · Pull Request #131 · PNNL-CompBio/srpAnalytics

christinehc · 2026-02-19T23:18:37Z

Note: IN PROGRESS

Ideally, I would like to update the build manifest to use figshare files exclusively, but not all of the data on GitHub has been uploaded to figshare yet. If we would rather merge these changes now and open a separate branch to update the manifest, that's fine too.

Summary

Fixes duplicated Sample_ID values and updates the pipeline so that it fully produces the zebrafish benchmark dose response curve files from the new data files. Also includes several quality-of-life improvements to the code.

Changelog

Fixes an error that created multiple unique Sample_IDs for non-unique (i.e. duplicate) samples
Loads (most) data from figshare
Checks schemas for output data, including zebrafish. Updates the zebrafish schemas accordingly, using the is_a construct as a template for the Dose, Fits, and BMD types.
Tests the BMDRC pipeline on actual data. Note that the BMDRC package has been updated to stay consistent with the pipeline.
Updates ontology mappings
Adds link to chemical structure figure for portal
Fixes figshare uploading so that files are fully uploaded and downloadable
Adds DataLoader class to handle data loading, from figshare or otherwise
Adds DataManifest class to handle interaction with build manifest (data/srp_build_files.csv), including handling file versions
Changes sigGeneStats to a new, longer format (NOTE: sigGeneStats does not appear to be used in the pipeline currently; this may need fixing?)
Minor changes: improved formatting, refactored code for readability, added full documentation, added secrets, updated schema
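
To illustrate the version handling mentioned for DataManifest, here is a minimal sketch of picking the newest entry per file from a manifest CSV. The column names (`name`, `version`, `location`) and the example rows are assumptions made for illustration; the actual layout of data/srp_build_files.csv and the real class interface are not shown in this thread.

```python
import csv
import io

# Hypothetical manifest layout; the real columns of
# data/srp_build_files.csv may differ.
EXAMPLE_MANIFEST = """name,version,location
chemicals,1,https://example.org/chemicals_v1.csv
chemicals,2,https://example.org/chemicals_v2.csv
samples,1,data/samples.csv
"""


def latest_files(manifest_text: str) -> dict:
    """Return the highest-version row for each file name in the manifest."""
    latest = {}
    for row in csv.DictReader(io.StringIO(manifest_text)):
        current = latest.get(row["name"])
        # Compare versions numerically so that "10" sorts after "9".
        if current is None or int(row["version"]) > int(current["version"]):
            latest[row["name"]] = row
    return latest


files = latest_files(EXAMPLE_MANIFEST)
print(files["chemicals"]["location"])
```

A loader could then decide, per row, whether `location` is a remote URL to download or a local path to read.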

Issues

Fixes #107: Validate ontology links
Fixes #102: Handle problematic (null/duplicate) values
Fixes #109: Add chemical structure PNG links
Fixes #111: Evaluate and integrate bmdrc pipeline with real data
Addresses #106: Add endpoint metadata mapping to pipeline
Addresses #110: Update zf data schemas
Addresses #105: Resolve figshare upload error
Fixes #115: Incorporate zebrafish sample data into pipeline
Fixes #103: Update pipeline with new data files
Addresses #121: Fix LinkML schema to handle NULL values
Fixes #119: Add columns to samplesToChemicals
Fixes #118: Pipeline output changes
Fixes #130: Add Response column to Fits data
Fixes #120: Necessary Changes to sigGeneStats

changelog: - Previously, samples with the same metadata parameters were each assigned their own unique sample ID because duplicate samples were grouped improperly. Sample ID assignment has been corrected so that a new ID is created only for a unique sample.
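
As a rough sketch of the idea behind this fix (not the pipeline's actual code), IDs can be keyed on the metadata fields that define a sample, so that duplicates reuse an existing ID. The field names in `KEY_FIELDS` and the example records are hypothetical.

```python
# Hypothetical metadata columns; the pipeline's real grouping keys may differ.
KEY_FIELDS = ("site", "date", "matrix")


def assign_sample_ids(records: list) -> list:
    """Give records with identical key metadata the same Sample_ID."""
    ids = {}
    labelled = []
    for rec in records:
        key = tuple(rec[field] for field in KEY_FIELDS)
        # setdefault stores a new ID only the first time a key is seen;
        # for a repeated key it returns the ID that is already stored.
        sample_id = ids.setdefault(key, len(ids) + 1)
        labelled.append({**rec, "Sample_ID": sample_id})
    return labelled


records = [
    {"site": "A", "date": "2020-01-01", "matrix": "water"},
    {"site": "A", "date": "2020-01-01", "matrix": "water"},  # duplicate
    {"site": "B", "date": "2020-01-01", "matrix": "water"},
]
labelled = assign_sample_ids(records)
print([rec["Sample_ID"] for rec in labelled])
```

With a dataframe, the same effect could be had by grouping on the key columns and numbering the groups rather than numbering the rows.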

changelog: - add bmdrc code to main build pipeline via `fitCurveFiles` function. validated to work locally

changelog:
- style: reformat and lint code using ruff
- refactor: change output filename format
- refactor: file cleanup, e.g. move large list/dict params to a separate params file to make the code easier to follow
- style: add some documentation
- feat: add new CLI args for specifying filename format

changelog:
- Manifest handler object for interfacing with and downloading files from the manifest now exists.
- Schema parser to pull columns and slots from schema classes.
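
For the schema parser, a simplified sketch of pulling slots from a class while following is_a inheritance is shown below. Plain dicts stand in for the LinkML schema here, and the class and slot names are made up for illustration; the real parser presumably reads the project's schema files.

```python
# Hypothetical, simplified schema: plain dicts standing in for LinkML classes.
SCHEMA = {
    "Dose": {"slots": ["Sample_ID", "Dose", "Response"]},
    "zebrafishDose": {"is_a": "Dose", "slots": ["Chemical_ID"]},
}


def class_slots(schema: dict, class_name: str) -> list:
    """Collect a class's slots, parent slots first, following is_a links."""
    cls = schema[class_name]
    parent = cls.get("is_a")
    inherited = class_slots(schema, parent) if parent else []
    own = [slot for slot in cls.get("slots", []) if slot not in inherited]
    return inherited + own


def missing_columns(schema: dict, class_name: str, columns: list) -> list:
    """List schema slots that are absent from an output file's columns."""
    return [s for s in class_slots(schema, class_name) if s not in columns]


print(class_slots(SCHEMA, "zebrafishDose"))
print(missing_columns(SCHEMA, "zebrafishDose", ["Sample_ID", "Dose"]))
```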

changelog:
- feat: add schema checks and update related functions to correctly parse class name from file name types with and without underscores.
- refactor: rename `res` variable -> `result` for clearer code
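
One way to parse a class name from file names with and without underscores is to match the file stem against the known class names and keep the longest match. This is only a sketch under assumptions: the names in `KNOWN_CLASSES` and the example file names are hypothetical, and the PR's functions may work differently.

```python
from pathlib import PurePath

# Hypothetical class names; the real list would come from the schema.
KNOWN_CLASSES = ["samples", "samplesToChemicals", "zebrafish_fits"]


def class_from_filename(path: str, known: list = KNOWN_CLASSES) -> str:
    """Return the schema class whose name prefixes the file stem."""
    stem = PurePath(path).stem
    # A class matches if it is the whole stem or is followed by "_suffix".
    matches = [n for n in known if stem == n or stem.startswith(n + "_")]
    if not matches:
        raise ValueError(f"no schema class matches {path!r}")
    # Prefer the longest match so names containing underscores win.
    return max(matches, key=len)


print(class_from_filename("out/zebrafish_fits_v2.csv"))
print(class_from_filename("samplesToChemicals.csv"))
```

Splitting on the first underscore alone would turn `zebrafish_fits_v2` into `zebrafish`, which is why the sketch matches against known names instead.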

sgosline · 2026-03-04T19:08:20Z

I converted PR to draft status, just to keep track :)

changelog:
- refactor: previously, the zebrafish chemical and sample file generation code was split across two files (map_samples_to_chemicals.py and build_script.py). Sample files were produced by build_script.py, but not correctly. The code that generated the incorrect zebrafish sample files has been removed, and all zebrafish file combination code now lives in build_script.py. This also involved moving code dependencies to src/samples.py.
- fix: fix the generation of incorrect sample files (see above)

(note: code has been updated to remove trailing spaces from fields in the dataframe when loaded, but the file itself has been updated as well)
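
A minimal sketch of stripping stray whitespace at load time follows, using the standard library rather than the pipeline's dataframe code. The column names and rows are invented for illustration, and the sketch assumes every row has the same number of fields as the header.

```python
import csv
import io

# Hypothetical endpoint file content with trailing spaces in some fields.
RAW = "endpoint ,description\nMORT ,mortality \nYSE,yolk sac edema\n"


def load_stripped(text: str) -> list:
    """Read CSV rows, trimming whitespace from headers and values."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        rows.append({key.strip(): value.strip() for key, value in row.items()})
    return rows


rows = load_stripped(RAW)
print(rows[0])
```

Cleaning on load keeps joins on these fields from silently failing if whitespace reappears in a later version of the file.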

christinehc added 30 commits November 26, 2025 11:40

feat: include bmd files in workflow

6da2a34

build: use ubuntu-latest for actions

c3856e2

build: fix container name

a9ed382

build: fix container name in docker pull

9538052

build: include bmd in artifacts upload. force continue upload

36255b1

build: create new figshare articles for each upload

ec654f8

build: use correct category IDs for metadata

52c51a8

build: use secret for project ID. specify dataset.

04a07e4

build: remove tags from metadata

067e460

build: fix variable name

1d5910d

chore: update ontology mappings. fixes #107

e33f5f0

fix: include link to structure image png. fixes #109

7119ffe

fix: generate image links only for valid IDs. addresses #109

b9155d4

feat: adapt bmdrc code for zf data. fixes #111

b8e7e68

chore: rename zfBMDS -> zfBMDs to align with bmdrc output

c929b11

chore: add new image_link col to schema

2c4d044

docs: update CLI flag for output directory

56c4f3a

feat: test github actions for figshare data download

b879bd8

build: trigger workflow on push

94140a3

feat: create manifest handler and schema parser. addresses #106

f3967fb

feat: create figshare data downloader. addresses #106

f6c5a47

fix: update schema for zebrafish data. addresses #110

e97cd28

chore,format,refactor: clean up unused code and update docstrings

4013769

refactor: simplify mappings scripts and remove reliance on params

4a4cf3d

feat,fix,build: update main build (new files/zf pipeline). addresses #105, #115

84ef109

build: edit dose classes to reflect sample vs. chem classes

139805f

build: update manifest with new files (+ figshare)

9ad5097

chore: add clarifying comments

60c4c19

christinehc added 7 commits February 4, 2026 09:03

chore: pipe temporary output to tmp/, not /tmp

8027037

feat: include new data build files. fixes #111, closes #119, closes #115

542882b

fix: add data post-bmdrc fixes. addresses #130

2bcdfb4

fix: check schema flexibly for zf output files

9120d73

chore: add response col to fits. addresses #130

65e499b

feat,refactor: add response column and schema checks. fixes #130

265114b

fix: update sigGeneStats format. fixes #120

68474c4

sgosline marked this pull request as draft March 4, 2026 19:08

christinehc added 21 commits April 9, 2026 09:10

fix: run schema check on correct filenames

87d8767

fix: enable keyword args for pandas loading

88f2798

build: update zebrafish sample/chem schemas

55396f1

build: update build files with new endpoint mapping file

33497e5

chore: remove unnecessary trailing spaces in endpoint file

3fef149

chore: remove unused/commented code

ea21ebb

chore: remove unused action

e725167

fix: trigger database rebuild with manifest changes

5b9ead4

chore: remove unused paths and reorganize

8b152ae

build,fix: use API token variable for figshare

2b83eae

build: improve error logging

ff89bba

chore: include stderr messages for github actions debug

2357065

chore: pipe output to stdout

54a8c9e

build: include dotenv in requirements

7716df0

chore: verify all dependencies installed correctly

eebc59e

format: remove additional line spacing

da01383

fix: include global requirements install

05d9bfd

fix: undo renaming requirements file

dfad26d

chore: include error traceback

951111a

chore: remove pip list validation

06e94ee

christinehc wants to merge 75 commits into main from fix_duplicates