Skip to content

feat: skip completed files in resumable dataset uploads#5929

Open
carloea2 wants to merge 4 commits into
apache:mainfrom
carloea2:batch-resumable-upload
Open

feat: skip completed files in resumable dataset uploads#5929
carloea2 wants to merge 4 commits into
apache:mainfrom
carloea2:batch-resumable-upload

Conversation

@carloea2

@carloea2 carloea2 commented Jun 24, 2026

Copy link
Copy Markdown
Contributor

What changes were proposed in this PR?

image

This PR improves dataset upload retry behavior by distinguishing between two cases that currently look similar to users:

Case Behavior
Active multipart upload session exists Ask whether to resume or restart the upload.
Matching file already exists in the dataset Ask whether to upload again or skip it.

The backend adds a dataset endpoint that checks candidate upload files against committed and staged dataset files by path and size. The frontend uses that endpoint after checking active multipart sessions, so failed uploads can still be resumed while completed matching files can be skipped from the retry batch.

The user-facing copy intentionally says a file with the same path and size exists, instead of implying byte-for-byte certainty.

Any related issues, documentation, discussions?

Related to discussion #5744: Improve resumable upload: track completion at the batch/session level.

How was this PR tested?

Added backend and frontend tests covering:

Test Coverage
DatasetResourceSpec Finds committed and staged files only when path and size match.
DatasetService spec Forces one multipart part upload to fail, then verifies retry uploads only the missing part.
FilesUploaderComponent spec Verifies a mixed retry batch asks Resume/Restart for a failed multipart file and Upload/Skip for a completed matching file.

Commands run:

sbt "FileService/testOnly org.apache.texera.service.resource.DatasetResourceSpec -- -z findExistingUploadFiles"

yarn ng run gui:test --include=src/app/dashboard/service/user/dataset/dataset.service.spec.ts

yarn ng run gui:test --include=src/app/dashboard/component/user/files-uploader/files-uploader.component.spec.ts

yarn eslint src/app/dashboard/component/user/files-uploader/files-uploader.component.ts src/app/dashboard/component/user/files-uploader/files-uploader.component.spec.ts src/app/dashboard/component/user/files-uploader/conflicting-file-modal-content/conflicting-file-modal-content.component.ts src/app/dashboard/service/user/dataset/dataset.service.ts src/app/dashboard/service/user/dataset/dataset.service.spec.ts

### Was this PR authored or co-authored using generative AI tooling?

Generated-by: Me

@github-actions github-actions Bot added frontend Changes related to the frontend GUI platform Non-amber Scala service paths labels Jun 24, 2026
@github-actions

Copy link
Copy Markdown
Contributor

Automated Reviewer Suggestions

Based on the git blame history of the changed files, we recommend the following reviewers:

  • Contributors with relevant context: @Yicong-Huang, @xuang7, @Ma77Ball
    You can notify them by mentioning @Yicong-Huang, @xuang7, @Ma77Ball in a comment.

@codecov-commenter

codecov-commenter commented Jun 24, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 76.59574% with 11 lines in your changes missing coverage. Please review.
✅ Project coverage is 54.89%. Comparing base (2ebfc28) to head (53bfa9c).

Files with missing lines Patch % Lines
...nt/user/files-uploader/files-uploader.component.ts 78.94% 2 Missing and 6 partials ⚠️
...ache/texera/service/resource/DatasetResource.scala 50.00% 2 Missing and 1 partial ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##               main    #5929      +/-   ##
============================================
- Coverage     54.91%   54.89%   -0.02%     
+ Complexity     2956     2924      -32     
============================================
  Files          1117     1117              
  Lines         43133    43259     +126     
  Branches       4648     4680      +32     
============================================
+ Hits          23685    23746      +61     
- Misses        18054    18094      +40     
- Partials       1394     1419      +25     
Flag Coverage Δ *Carryforward flag
access-control-service 70.00% <ø> (ø)
agent-service 34.36% <ø> (ø) Carriedforward from c3161f7
amber 56.52% <ø> (-0.50%) ⬇️ Carriedforward from c3161f7
computing-unit-managing-service 0.00% <ø> (ø)
config-service 51.56% <ø> (ø)
file-service 60.33% <50.00%> (+1.31%) ⬆️
frontend 49.13% <80.48%> (+0.45%) ⬆️
notebook-migration-service 78.57% <ø> (ø)
pyamber 90.20% <ø> (ø) Carriedforward from c3161f7
python 90.76% <ø> (ø) Carriedforward from c3161f7
workflow-compiling-service 55.14% <ø> (ø)

*This pull request uses carry forward flags. Click here to find out more.

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@carloea2 carloea2 force-pushed the batch-resumable-upload branch from 94ce43f to ce97bb4 Compare June 24, 2026 07:48
@carloea2 carloea2 force-pushed the batch-resumable-upload branch from ce97bb4 to a586457 Compare June 24, 2026 18:41
@xuang7

xuang7 commented Jun 24, 2026

Copy link
Copy Markdown
Contributor

Hi @carloea2, thanks for the PR. Could you open a corresponding issue for this PR? The existing discussion may not include the implementation design details for this PR.

Also, I was wondering whether the notification window might overlap with the resumable/restart flow for incomplete uploads.

Thanks!

@carloea2

Copy link
Copy Markdown
Contributor Author

What do you mean with overlap?

Issue: #5938

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR improves resumable dataset uploads by distinguishing “resumable multipart session exists” from “file already exists in the dataset (same path + size)”, enabling users to skip already-completed files during a retry batch.

Changes:

  • Backend: add POST /dataset/{did}/existing-upload-files to detect already-present committed/staged files by normalized path + size.
  • Frontend: extend the upload retry flow to first handle active multipart sessions, then prompt Upload/Skip for completed matching files.
  • Tests: add/extend Scala + Angular unit tests to cover the new endpoint and the mixed retry-batch UX.

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
frontend/src/app/dashboard/service/user/dataset/dataset.service.ts Adds client method to query existing upload files by path + size.
frontend/src/app/dashboard/service/user/dataset/dataset.service.spec.ts Adds coverage for existing-upload-files call and multipart resume behavior.
frontend/src/app/dashboard/component/user/user-dataset/user-dataset-explorer/dataset-detail.component.html Wires dataset id (did) into the uploader component.
frontend/src/app/dashboard/component/user/files-uploader/files-uploader.component.ts Adds Upload/Skip prompt for already-existing files and integrates both checks in retry flow.
frontend/src/app/dashboard/component/user/files-uploader/files-uploader.component.spec.ts Adds new tests for mixed retry batches and Upload/Skip For All behavior.
frontend/src/app/dashboard/component/user/files-uploader/conflicting-file-modal-content/conflicting-file-modal-content.component.ts Extends modal data with optional hint.
frontend/src/app/dashboard/component/user/files-uploader/conflicting-file-modal-content/conflicting-file-modal-content.component.html Renders a custom hint when provided.
file-service/src/test/scala/org/apache/texera/service/resource/DatasetResourceSpec.scala Adds backend tests for matching by path+size and validation/auth checks.
file-service/src/main/scala/org/apache/texera/service/resource/DatasetResource.scala Implements the new existing-upload-files endpoint.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +1052 to +1055
val requested = Option(request)
.map(_.files)
.getOrElse(List.empty)
.map { file =>

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Comment on lines +1063 to +1069
val committed = getLatestDatasetVersion(ctx, did)
.map(v =>
LakeFSStorageClient
.retrieveObjectsOfVersion(dataset.getRepositoryName, v.getVersionHash)
.map(obj => obj.getPath -> obj.getSizeBytes.longValue())
)
.getOrElse(List.empty)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Comment on lines +1071 to +1074
val staged = LakeFSStorageClient
.retrieveUncommittedObjects(dataset.getRepositoryName)
.filterNot(diff => Option(diff.getType).exists(_.getValue.equalsIgnoreCase("removed")))
.flatMap(diff => Option(diff.getSizeBytes).map(size => diff.getPath -> size.longValue()))

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

frontend Changes related to the frontend GUI platform Non-amber Scala service paths

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants