feat: skip completed files in resumable dataset uploads#5929
Conversation
Automated Reviewer SuggestionsBased on the
|
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## main #5929 +/- ##
============================================
- Coverage 54.91% 54.89% -0.02%
+ Complexity 2956 2924 -32
============================================
Files 1117 1117
Lines 43133 43259 +126
Branches 4648 4680 +32
============================================
+ Hits 23685 23746 +61
- Misses 18054 18094 +40
- Partials 1394 1419 +25
*This pull request uses carry forward flags. Click here to find out more. ☔ View full report in Codecov by Harness. 🚀 New features to boost your workflow:
|
94ce43f to
ce97bb4
Compare
ce97bb4 to
a586457
Compare
|
Hi @carloea2, thanks for the PR. Could you open a corresponding issue for this PR? The existing discussion may not include the implementation design details for this PR. Also, I was wondering whether the notification window might overlap with the resumable/restart flow for incomplete uploads. Thanks! |
|
What do you mean with overlap? Issue: #5938 |
There was a problem hiding this comment.
Pull request overview
This PR improves resumable dataset uploads by distinguishing “resumable multipart session exists” from “file already exists in the dataset (same path + size)”, enabling users to skip already-completed files during a retry batch.
Changes:
- Backend: add
POST /dataset/{did}/existing-upload-filesto detect already-present committed/staged files by normalized path + size. - Frontend: extend the upload retry flow to first handle active multipart sessions, then prompt Upload/Skip for completed matching files.
- Tests: add/extend Scala + Angular unit tests to cover the new endpoint and the mixed retry-batch UX.
Reviewed changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| frontend/src/app/dashboard/service/user/dataset/dataset.service.ts | Adds client method to query existing upload files by path + size. |
| frontend/src/app/dashboard/service/user/dataset/dataset.service.spec.ts | Adds coverage for existing-upload-files call and multipart resume behavior. |
| frontend/src/app/dashboard/component/user/user-dataset/user-dataset-explorer/dataset-detail.component.html | Wires dataset id (did) into the uploader component. |
| frontend/src/app/dashboard/component/user/files-uploader/files-uploader.component.ts | Adds Upload/Skip prompt for already-existing files and integrates both checks in retry flow. |
| frontend/src/app/dashboard/component/user/files-uploader/files-uploader.component.spec.ts | Adds new tests for mixed retry batches and Upload/Skip For All behavior. |
| frontend/src/app/dashboard/component/user/files-uploader/conflicting-file-modal-content/conflicting-file-modal-content.component.ts | Extends modal data with optional hint. |
| frontend/src/app/dashboard/component/user/files-uploader/conflicting-file-modal-content/conflicting-file-modal-content.component.html | Renders a custom hint when provided. |
| file-service/src/test/scala/org/apache/texera/service/resource/DatasetResourceSpec.scala | Adds backend tests for matching by path+size and validation/auth checks. |
| file-service/src/main/scala/org/apache/texera/service/resource/DatasetResource.scala | Implements the new existing-upload-files endpoint. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| val requested = Option(request) | ||
| .map(_.files) | ||
| .getOrElse(List.empty) | ||
| .map { file => |
| val committed = getLatestDatasetVersion(ctx, did) | ||
| .map(v => | ||
| LakeFSStorageClient | ||
| .retrieveObjectsOfVersion(dataset.getRepositoryName, v.getVersionHash) | ||
| .map(obj => obj.getPath -> obj.getSizeBytes.longValue()) | ||
| ) | ||
| .getOrElse(List.empty) |
| val staged = LakeFSStorageClient | ||
| .retrieveUncommittedObjects(dataset.getRepositoryName) | ||
| .filterNot(diff => Option(diff.getType).exists(_.getValue.equalsIgnoreCase("removed"))) | ||
| .flatMap(diff => Option(diff.getSizeBytes).map(size => diff.getPath -> size.longValue())) |
What changes were proposed in this PR?
This PR improves dataset upload retry behavior by distinguishing between two cases that currently look similar to users:
The backend adds a dataset endpoint that checks candidate upload files against committed and staged dataset files by path and size. The frontend uses that endpoint after checking active multipart sessions, so failed uploads can still be resumed while completed matching files can be skipped from the retry batch.
The user-facing copy intentionally says a file with the same path and size exists, instead of implying byte-for-byte certainty.
Any related issues, documentation, discussions?
Related to discussion #5744: Improve resumable upload: track completion at the batch/session level.
How was this PR tested?
Added backend and frontend tests covering:
DatasetResourceSpecDatasetServicespecFilesUploaderComponentspecCommands run: