fix: delete stale error.pb before uploading outputs on JobSet restart#1204
fix: delete stale error.pb before uploading outputs on JobSet restart#1204AdilFayyaz wants to merge 2 commits into
Conversation
Signed-off-by: M. Adil Fayyaz <62440954+AdilFayyaz@users.noreply.github.com>
Signed-off-by: M. Adil Fayyaz <62440954+AdilFayyaz@users.noreply.github.com>
| if hasattr(fs, "_rm_file"): | ||
| await fs._rm_file(path) # pylint: disable=W0212 | ||
| return | ||
| if hasattr(fs, "_rm"): |
There was a problem hiding this comment.
do we need rm? from my claude:
_delete_path should follow the same shape. I verified the methods exist on the backends you actually hit:
- AsyncFileSystem._rm_file is defined on the base, and obstore's FsspecStore implements _rm_file directly (S3/GCS/ABFS).
- sync LocalFileSystem has rm_file.
So the whole thing collapses to:
async def _delete_path(path: str) -> None:
fs = storage.get_underlying_filesystem(path=path)
if isinstance(fs, AsyncFileSystem):
await fs._rm_file(path) # pylint: disable=W0212
else:
fs.rm_file(path)
There was a problem hiding this comment.
this is cleaner I agree
| stale_error_uri = error_path(output_path) | ||
| if not await storage.exists(stale_error_uri): |
There was a problem hiding this comment.
Why do we need it? Don't we write output.pb and error.pb to different folder (<raw_output>/0, <raw_output>/1) for different attempts?
There was a problem hiding this comment.
Because a Jobset restart attempt is a k8s Jobsets restart within the same attempt. This is different from flyte attempts because they go to different output folders. For Jobsets, the crashed errors.pb and successful restarts error.pb go in the same folder.
But are you suggesting we write per restart outputs to a jobsets suffixed restart attempt subfolder instead of deleting? Will have to scope that out and would require plugin changes.
There was a problem hiding this comment.
If we use Jobsets restart, do we see the attempt (1/2/3/4) on UI
There was a problem hiding this comment.
if we delete the stale error.pb, how do see the error on UI for the previous attempt
There was a problem hiding this comment.
Nope you wont see attempt 1/2/3/4 because this is a Jobset level restart not a Flyte one. And regarding the deletion, we only clear error.pb when the run is about to write to outputs.pb (on success - i.e. on recovery). If the run fails it goes through a different path are returns before deletion so the real error is seen on the UI.
Motivation
When a clustered (JobSet) task fails, it writes an error.pb to the output path. If a later restart attempt succeeds, that stale error file is left behind and the action can be misreported as failed despite a successful run.
Summary
Test Plan