Elsewhere, @jameshadfield wrote:
Using --augur ~/github/nextstrain/augur (as per #419 (comment)) works but it takes 20 min to upload (because an in-use augur repo balloons to 675MB). #295 should help here, or excluding certain paths (our docs, .mypy_cache etc).
I replied with some potential solutions:
Perhaps runner.aws_batch.s3.upload_workdir needs to be extended to a) use .gitignore files for the overlay volumes (but not the workdir itself) and/or b) support a Nextstrain-specific ignore file (e.g. .nextstrain-ignore) which could be applied everywhere (as suggested in 6d465f0).
I looked into this a bit more, and we could use git check-ignore (conditional on git being available) to filter paths for upload. It could be invoked one-at-a-time on each path, which would be simplest to integrate but slow, or be fed a stream of paths on stdin while we read its stdout concurrently, which would be more complex to integrate but fast. This seemed promising and would be transparent, Just Working™ without a new ignores file or intervention from anyone.
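For illustration, here is a minimal sketch of the batch variant of that idea. It assumes the candidate paths are already collected as strings relative to the repo root, and it sends them all at once rather than streaming stdin and stdout concurrently. The function name drop_git_ignored is made up for this example and is not part of Nextstrain CLI.

```python
import os
import shutil
import subprocess
from typing import Iterable, List


def drop_git_ignored(repo: str, paths: Iterable[str]) -> List[str]:
    """Return paths with anything git would ignore removed.

    Hypothetical helper, not an existing Nextstrain CLI function. Falls back
    to returning every path when git is missing or repo is not a git repo.
    """
    paths = list(paths)
    if not paths or shutil.which("git") is None:
        return paths

    # With --stdin and -z, git reads and writes NUL-separated paths.
    stdin = b"\0".join(os.fsencode(p) for p in paths)
    result = subprocess.run(
        ["git", "-C", str(repo), "check-ignore", "--stdin", "-z"],
        input=stdin,
        stdout=subprocess.PIPE,
        stderr=subprocess.DEVNULL,
    )

    # Exit status 0: some paths ignored; 1: none ignored; anything else: error.
    if result.returncode not in (0, 1):
        return paths

    ignored = {os.fsdecode(p) for p in result.stdout.split(b"\0") if p}
    return [p for p in paths if p not in ignored]
```

A single batched call like this avoids the per-path process overhead, at the cost of holding the whole path list in memory before any upload can start.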
In some ad-hoc testing, though, I realized a big caveat: for overlay purposes, we actually need some of the files ignored by git, e.g. Python packaging metadata in nextstrain_augur.egg-info/ for Augur (for the installed augur to locate its entrypoint) and dist/ for Auspice (for the transpiled code served by auspice view). The way I realized this was by routinely deleting all ignored files with git clean -fXd and then noticing (to my surprise) that an overlay no longer worked.
This makes git ignores entirely inappropriate for use as excludes in overlays, I think. And for basically the same reason we wouldn't apply git ignores to workdirs themselves: build-time artifacts that aren't suitable for version control are still important at execution time.
That leaves us with the new feature of Nextstrain-specific ignore files, which nicely enough can be applied to both overlay sources and workdirs alike. Implementation will require a fair bit of new complexity, but I don't see any major algorithm questions or uncertainty. Biggest questions are about design/interface, perhaps:
1. What's the filename we use?
   - .nextstrain-ignore? .nextstrain-exclude?
2. Is there anywhere else this ignore file would be used that we should be taking into account?
3. If not (2), perhaps we should make the filename (1) more specific to AWS Batch or Nextstrain CLI?
And of course, maybe there's another option/solution to consider.
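To make the ignore-file idea a bit more concrete, here's a rough sketch of how paths might be filtered before upload. Everything in it is an assumption for discussion: the filename .nextstrain-ignore is just one of the candidates above, the function names are invented, and the matching uses simple fnmatch globs rather than full gitignore semantics.

```python
import fnmatch
import os
from pathlib import Path
from typing import Iterator, List

IGNORE_FILENAME = ".nextstrain-ignore"  # hypothetical; the name is still undecided


def load_patterns(root: Path) -> List[str]:
    """Read glob patterns from the ignore file, skipping blanks and # comments."""
    ignore_file = root / IGNORE_FILENAME
    if not ignore_file.is_file():
        return []
    patterns = []
    for line in ignore_file.read_text().splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            patterns.append(line.rstrip("/"))
    return patterns


def is_excluded(relpath: str, patterns: List[str]) -> bool:
    """True if the path, or any directory containing it, matches a pattern."""
    parts = relpath.split("/")
    for i in range(1, len(parts) + 1):
        prefix = "/".join(parts[:i])
        for pattern in patterns:
            # Match against both the full prefix and its final component.
            if fnmatch.fnmatchcase(prefix, pattern) or fnmatch.fnmatchcase(parts[i - 1], pattern):
                return True
    return False


def files_to_upload(root: Path) -> Iterator[str]:
    """Yield root-relative POSIX paths of files not excluded by the ignore file."""
    patterns = load_patterns(root)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            rel = (Path(dirpath) / name).relative_to(root).as_posix()
            if not is_excluded(rel, patterns):
                yield rel
```

With an ignore file listing docs/ and .mypy_cache/, this would skip those directories while still uploading nextstrain_augur.egg-info/, which is the behavior git ignores can't give us. The same function could in principle be pointed at either an overlay source or a workdir.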