Skip to content

fix intermittent 404 (NoSuchKey) in Dir.download_sync for remote directories#1129

Open
samhita-alla wants to merge 2 commits into
mainfrom
fix-download-sync
Open

fix intermittent 404 (NoSuchKey) in Dir.download_sync for remote directories#1129
samhita-alla wants to merge 2 commits into
mainfrom
fix-download-sync

Conversation

@samhita-alla

Copy link
Copy Markdown
Contributor

Problem

Dir.download_sync() intermittently fails with a 404 when downloading a remote directory, even though the directory's objects exist:

    FileNotFoundError: Object at location .../fineweb-edu-seq2048 not found
    ... GET https://s3.../fineweb-edu-seq2048 -> 404 NoSuchKey

It "works on retry" though. Observed in sync (def) tasks that download a directory produced upstream.

Root cause

download_sync delegated to fsspec's recursive fs.get(self.path, dest, recursive=True). Before copying, fsspec must decide whether the remote prefix is a directory:

  1. check the dircache,
  2. else HEAD the bare prefix key - a 404, since only the child objects are real keys, not the prefix itself,
  3. fall back to a listing to infer "directory"

This HEAD-then-list-then-cache detection is timing/state dependent. When it resolves to "directory", the children are listed and downloaded. When the listing/cache evidence isn't present at the moment _isdir is consulted, fsspec
treats the prefix as a file and issues a GET on the directory key resulting in 404 NoSuchKey. The outcome is nondeterministic across invocations, hence it fails first and succeeds on retry. The data is always present; only the
classification flakes.

Note the async Dir.download() is unaffected: it routes through storage.get()ObstoreParallelReader, which lists the prefix and downloads each concrete object. The sync path was the only one relying on fsspec's recursive-get directory detection.

Fix

download_sync now mirrors what the async reader does, using sync fsspec: enumerate concrete object keys with fs.find(self.path) and download each withfs.get_file(). Only real keys are ever fetched, so there is no file-vs-directory
inference to flake. Falls back to the previous recursive get when find returns nothing, preserving prior behavior/errors for empty/edge cases.

…ctories

Signed-off-by: Samhita Alla <aallasamhita@gmail.com>
Comment thread src/flyte/io/_dir.py
return local_dest

src_prefix = fs._strip_protocol(self.path).rstrip("/")
for remote_file in files:

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are you not just re-implementing fsspec in the sdk?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

recursive get in fsspec does:

rpaths = await self._expand_path(rpath, recursive=True)   # may include rpath itself
rpaths = [p for p in rpaths if not (trailing_sep(p) or await self._isdir(p))]
await asyncio.gather(*[self._get_file(p, ...) for p in rpaths])

fsspec has to handle both “rpath is a single file” and “rpath is a directory” so it inspects the root via _isdir. for obstore that does a HEAD on the bare prefix (404, because the prefix isn’t an object) and then falls back to a listing and when that detection flakes, the prefix isn’t filtered out and fsspec proceeds to _cat_file leading to 404 NoSuchKey. this PR composes it differently than recursive get does internally minus the file-vs-directory inspection of the root path.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants