From b3bb56a8443526fbce57e35e1a8431931a87fa7d Mon Sep 17 00:00:00 2001 From: Michael Sarahan Date: Tue, 5 Nov 2024 10:38:16 -0600 Subject: [PATCH] add index priority pep draft --- peps/pep-9999.rst | 616 ++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 616 insertions(+) create mode 100644 peps/pep-9999.rst diff --git a/peps/pep-9999.rst b/peps/pep-9999.rst new file mode 100644 index 00000000000..cfaee829398 --- /dev/null +++ b/peps/pep-9999.rst @@ -0,0 +1,616 @@ +PEP: 9999 +Title: Standardizing selection of Python package files across multiple sources +Author: Michael Sarahan, msarahan@gmail.com +Sponsor: Barry Warsaw, barry@python.org +PEP-Delegate: +Discussions-To: +Status: Draft +Type: Standards Track +Topic: Packaging +Requires: 777 +Created: 05-Nov-2024 +Post-History: + +Abstract +======== + +Python installers are diverging with respect to how they handle multiple +indexes. Pip merges all indexes before matching distributions, while uv +matches distributions on one index before moving on to the next. Each +approach has advantages and disadvantages. This PEP aims to standardize +each of these behaviors, which are referred to as “version priority” and +“index priority” respectively, so that community discussions and +troubleshooting can share a common vocabulary. Codifying these behaviors +will also facilitate building tools that rely on these behaviors. + +Motivation +========== + +Goals +----- + +- Establish standard for evaluation order of various levels of priority + between indexes and packages +- Standardize verbiage of different approaches, so that cross-tool + discussions are talking “apples to apples” +- Suggest standard configuration options, to minimize user cognitive + load between tools +- Provide guidelines for how ecosystem tools should implement index + priority if they would like, and suggest reasons why they might want + to do so + +Non-goals +--------- + +- Do not relax the simplifying assumption that for a given distribution + filename, all indexes must have the same file. This has been expressed + as “all indexes are equal priority”, but can be more accurately + described as “`pip considers all distributions with the same name and + version to be interchangeable regardless of which index they came + from `__.” + This is a key assumption for how pip behaves today, and changing this + assumption is deemed too disruptive for this PEP. +- Do not promote index priority as a solution to dependency confusion. + It is related, but `PEP 708 `__ is + a better, more specific solution that does not require user + configuration +- Do not dictate that every tool must implement every strategy + +Rationale +========= + +This PEP proposes standards of installer behavior when using multiple +sources, so that the user experience and expectations across tools is +more homogeneous. Pip has long defined the de-facto standard installer +behavior in the ecosystem, but new tools have been implementing new +approaches. Uv and PDM have each added support for some notion of index +priority. + +Index priority is `the default behavior in +uv `__, +and supported, but `not default behavior in +PDM `__. +In uv, `users have reported some +issues `__ with the +behavior and asked for keeping pip’s behavior. Uv provides this choice +as a ``--index-strategy`` flag, with the ``unsafe-best-match`` value +matching pip’s current behavior, and the ``first-match`` value matching +this PEP’s proposed behavior. Defining index priority is expected to +help with `situations where a local package has remote +dependencies `__, and the user +wishes to prioritize local packages over remote dependencies, while +still falling back to remote dependencies where needed. + +The topic of prioritizing indexes is being proposed as a PEP rather than +as a change specifically to pip, because predictable index behavior +across the many tools in our ecosystem is important for consistent, +predictable behavior between tools. + +Specification +============= + +“Version priority” +------------------ + +This is pip’s current behavior. All distributions from all configured +repositories are collated, and the “best” among them is selected. The +criteria for what is “best” are: + +- `Match any specified + hashes `__ +- `Absence of + yanking `__ +- `Wheels over + sdists `__ +- `Maximize + version `__ + (including `Local version + tags `__, + `restricted to locally patched private + versions `__) +- `Maximize specificity of platform + tag `__ +- `Build + tags `__ + with the first value being an integer interpreted as ascending values + being “better”, and the second being a string with lexical sorting + +This behavior is characterized by the installer always getting the +latest version of a package, regardless of the repository that it comes +from. It is especially useful when all configured repositories are +equally trusted and well-behaved regarding the distribution +interchangeability assumption. That interchangeability assumption is +what makes comparing distributions of a given package meaningful - +without it, the installer is no longer comparing “apples to apples.” In +practice, it is common for different indexes to have files that have +different contents than other indexes, such as builds for special +hardware, and this version priority behavior can lead to undesirable, +unexpected outcomes. + +Uv refers to `its implementation of the version priority +behavior `__ +as “``unsafe-best-match``.” Naming is really hard here. On one hand, it +isn’t accurate to call pip’s default behavior intrinsically “unsafe.” +The addition of possibly malicious indexes/repositories is what +introduces concern with this behavior. PEP 708 added a way to restrict +installers from drawing packages from unexpected, potentially insecure +indexes. On the other hand, the term “best-match” is technically +correct, but also misleading. The “best match” varies by user and by +application. “Best” is technically correct in the sense that it is a +global optimum according to the match criteria specified above, but that +is not necessarily what is “best” in a user’s eyes. “Version priority” +is a proposed term that avoids the concerns with the uv terminology, +while approximating the behavior in the most user-identifiable way that +packages are compared. + +“Index priority” +---------------- + +Relative to “version priority” behavior, this strategy does not collate +the contents of each index before evaluating the best match. Where the +best-match behavior seeks a global best match across all repositories, +the first-match strategy seeks a local best match within one repo, and +accepts the first match that it finds. The uv and PDM projects have led +the way in experimenting with different ways of dealing with multiple +package sources. Uv’s default behavior is one that implements +first-match behavior (and coins the terms “first-match” and +“best-match”). In general, the strategy evaluates best match on the +top-priority repo, and proceeds to subsequent repos only if the current +package request has no viable candidates. The criteria and process for +evaluating “best match” is the same for both index priority and version +priority. The difference is that version priority combines all indexes +before ordering and prioritizing them, while index priority iterates +over the indexes in user-specified order, stopping at the first matching +distribution. The index (or “source” in PDM terms) priority is +explicitly defined by a configuration hierarchy. Each package’s finder +starts at the beginning of the list of repositories, so each package +starts over with the repository list. In other words, if one package has +no valid candidates on the first index, but finds a hit on the second +index, subsequent packages still start their search on the first index, +rather than starting on the second. + +One desirable behavior that the index priority strategy implies is that +there are no “surprise” updates, where a version bump on a +lower-priority repo wins out over a curated, approved higher-priority +repo. This is related to the security improvement of PEP 708, where +packages can restrict the external repos that distributions can come +from, but it is more configurable by end users. The package installs are +only expected to change when either the higher-priority repo or the +index priority configuration change. This stability and predictability +makes it viable to configure indexes as a more persistent property of an +environment, rather than a one-off argument for one install command. + +One important implementation detail of index priority is that caching +should now include the index from which distributions were downloaded. +Without this aspect, it is possible that after changing the list of +configured indexes, the cache could provide a similarly-named +distribution from a lower-priority index. If every index follows the +recommended behavior of providing identical files across indexes for a +given filename, this is not an issue. However, that recommendation is +not readily enforceable, and augmenting the cache key with origin index +would be a wise defensive change. + +Ways that a request falls through to a lower priority index +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +- Package name is not present at all in higher priority index +- All distributions from higher priority index filtered out due to + version specifier, python version, platform tag, yanking or otherwise +- A denylist specifies that a particular package name should be ignored + on a given index +- A higher priority index is unreachable (e.g. blocked by firewall + rules, temporarily unavailable due to maintenance, other miscellaneous + and temporary networking issues). This is a less clear-cut detail that + should be controllable by users. On one hand, this behavior would lead + to less predictable, likely unreproducible results by unexpectedly + falling through to lower priority indexes. On the other hand, graceful + fallback may be more valuable to some users, especially if they can + safely assume that all of their indexes are equally trusted. Pip’s + behavior today is graceful fallback: you see warnings if an index is + having connection issues, but the installation will proceed with any + other available indexes. Because index priority can convey different + trust levels between indexes, installers should default to raising + errors and aborting on network issues, but there should be a flag to + allow fall-through to lower-priority indexes. + +Treatment within a given index follows existing behavior, but stops at +the bounds of one index and moves on to the next index only after all +priority preferences within the one index are exhausted. This means that +existing priorities among the unified collection of packages apply to +each index individually before falling through to a lower priority +index. + +- wheel vs sdist: Wheels are preferred within one index. The resolution + will use an sdist from a higher-priority index before trying a wheel + from a lower-priority index. +- more platform-specific wheels before less specific ones. The + resolution will use less specific wheels from higher-priority indexes + before using more specific wheels from lower priority indexes. + +Priority configuration +~~~~~~~~~~~~~~~~~~~~~~ + +The order of specification of indexes determines their priority in the +finding process. As a result, the way that installers load the index +configuration must be predictable and reproducible. The configuration +precedence hierarchy proposed here matches `the existing pip +configuration +hierarchy `__. +Other tools should ideally conform to this hierarchy for predictable +behavior, but ultimately are free to define or retain their existing +configuration schemes to maintain idiomatic usage for themselves. +Notably, some tools differ in whether multiple sources of configuration +add to one another (conda), or preclude/clobber one another (pip). + +The order of specified URLs will follow: + +1. | CLI arguments (if present), as in: + | ``--extra-index-url --find-links --extra-index-url `` + +2. Environment variables (if present), with ordered evaluation: + + 1. PIP FIND_LINKS=” ” + 2. PIP_EXTRA_INDEX_URL=” ” + +3. Configuration files, as in \`pip.conf\`: + + | [global] + | extra_index_url = + | + | + | + | find_links = + | + | + | + +Ordering source types +~~~~~~~~~~~~~~~~~~~~~ + +There are two "source types" supported by pip: + +- ``find-links / PIP_FIND_LINKS`` +- ``extra-index-url / PIP_EXTRA_INDEX_URL`` + +The ``find-links`` option specifies one or more paths to search for +packages. It behaves very much like an index URL, except that it +searches filesystems, rather than resolving distribution filenames from +PEP 503 HTML or PEP 691 JSON data. This parameter is often passed with +the \`no-index\` parameter as a way of forcing the local packages to be +used. For the purposes of index priority, the ``find-links`` parameter +takes precedence over the values from ``extra-index-url``. The order of +these flags in a configuration file must not affect this evaluation +order. If both are defined, the value of ``extra-index-url`` is appended +to the value of ``find-links``. This is in keeping with `pip’s current +handling `__ +of these options. + +These source types can be specified in the three ways listed above: CLI, +environment variables, and configuration files. Among these, CLI flags have an +intrinsic evaluation order from left to right, and different source types can be +interleaved. In contrast, configuration files and in environment variables, the +state of several variables all exists simultaneously. There is no way to +interleave source types. For the sake of consistency among the three places +where configuration can be specified, the interleaving capability of the CLI +should not be respected. Instead, entries from each of the two source types should +be collected into their respective groups, and these groups should be evaluated +in the same order as environment variables and configuration files. + +For example, the CLI example given above will evaluate with pseudocode: + +:: + + --extra-index-url --find-links --extra-index-url + # ------------ + find_links = [] + extra_index_urls = [] + for argname, value in cli_args: + match argname: + case "find-links": + find_links.append(value) + case "extra-index-url": + extra_index_urls.append(value) + search_urls = find_links + extra_index_urls + + +The ultimate result of this evalution will be: + +1. next_priority +2. highest +3. lowest + +The value of ``index-url`` (or its default value of PyPI) is always the +lowest-priority entry in the search order. + +Requirements.txt file inclusions +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +As a further complication, `requirements.txt files can include index url +options `__, +including ``--extra-index-url`` and ``--find-links``. Requirements.txt +can be included via an environment variable (PIP_REQUIREMENT) and `by +command line +arguments `__. +For the purposes of evaluating priority, these requirements files should +be evaluated where they are specified (i.e. their position on the CLI +args), and place their index url options at the priority of the +originating file inclusion (either the env var or CLI args). For +example, if a requirements.txt file has: + +:: + + --extra-index-url + --find-links + --extra-index-url + my-package + +The CLI command of: + +:: + + pip install --extra-index-url -r requirements.txt --extra-index-url + +Would expand to an index priority list of: + +1. +2. +3. +4. +5. +6. + +Because the requirements.txt file is parsed from top to bottom, it +evaluates ``--extra-index-url`` and ``--find-links`` arguments in the +order that they appear in the requirements.txt file, as opposed to +reordering them such that ``--find-links`` arguments always come before +any ``--extra-index-url`` arguments. Requirements files may also include +other requirements files. In the general case, depth-first +traversal of requirements files defines the collection and ordering +of index-urls and find-links options. + +Proposed deny list format +~~~~~~~~~~~~~~~~~~~~~~~~~ + +As described above, `it may be desirable to intentionally omit one or +more packages from consideration on a given +index `__. For example, if +a repository includes builds of a package that introduces issues, but +otherwise has more desirable packages, the one problematic package can +be omitted. The configuration for this denylist should include a list of +mappings from repository URL to a list of package names with optional +version constraints: + +:: + + repository_package_exclusions: + - url: https://some-index.com + packages: + - some-problem-package-name<=3.5 + +Backwards Compatibility +======================= + +This PEP does not prescribe any changes as mandatory for any installer, +so it only introduces compatibility concerns if tools choose to adopt an +index behavior other than the behavior(s) they currently implement. + +This PEP’s language does not quite align with existing tools, including +pip and uv. Either this PEP’s language can change during review, or if +this PEP’s language is preferred, other projects could conform to it. +The important thing is that all the projects are using the same language +for the same concepts. + +As some tools rely on one or the other behavior, there are some possible +issues that may emerge, where tailoring available resources/packages for +a particular behavior may detract from the user experience for people +who rely on the other behavior. + +- Different repos may have different metadata. For example, one cannot + assume that the metadata for package “something” on repository “A” has + the same dependencies as “something” on repository “B”. When an + installer falls back to a different repository in the search order, it + implies refreshing the package metadata from the new repository. This + is both an improvement and a complication. It is a complication in the + sense that a cached metadata entry must be keyed by both package name + and index url, instead of just package name. It is a potential + improvement in that different implementation variants of a package can + differ in dependencies as long as their distributions are separated + into different indexes. +- Users may not get updates as they expect, because some higher priority + index has not updated/synchronized with PyPI to get the latest + packages. If the higher priority index has a valid candidate, newer + packages will not be found. This will need to be communicated + verbosely, because it is counter to pip’s well-established behavior. +- Improving the predictability of which repository will be selected may + tempt people into using index priority as a way of having similarly + named files that have different contents. It would be helpful if tools + errored on mismatching hashes, or otherwise gave the user feedback + that they are breaking a key assumption. This PEP does not mandate + alternative fixes, such as augmenting the cache key with the index, + because these changes have unknown unintended consequences. However, + it is advisable to augment the cache key with the index, as doing so + would remedy known “gotchas.” + +Security Implications +===================== + +Index priority creates a mechanism for users to explicitly specify a +trust hierarchy among their indexes. As such, it limits the potential +for dependency confusion attacks. Index priority was `rejected by PEP +708 `__ +as a solution for dependency confusion attacks. This PEP requests that +the rejection be reconsidered, with index priority serving a different +purpose. This PEP is primarily motivated by the desire to support +implementation variants, which is the subject of `another discussion +that hopefully leads to a +PEP `__. +It is not mutually exclusive with PEP 708, nor does it suggest reverting +or withdrawing PEP 708. It is an answer to “\ `how we could allow users +to choose which index to use at a more fine grained level than “per +install”. `__ + +For a more thorough discussion of the PEP 708 rejection of index +priority, please see `the section below <#bookmark=id.wca5suv4ctal>`__. + +How to Teach This +================= + +At the outset, the goal is not to convert pip or any other tool to +change its default priority behavior. The best way to teach is perhaps +to watch message boards, GitHub issue trackers and chat channels, +keeping an eye out for problems that index priority could help solve. +There are `several `__ +`long-standing `__ +`discussions `__ +`that `__ +`would `__ be good places to +start advertising the concepts. The topics of the two officially +supported behaviors need documentation, and we, the authors of this +PEP, would develop these as part of the review period of this PEP. +These docs would likely consist of additions across several +repositories, cross-linking the concepts between installers. At a +minimum, we expect to add to the +`PyPUG `__ and to `pip’s +documentation `__. + +It will be important to advertise the active behavior, especially in +error messaging, and that will provide ways to provide resources to +users about these behaviors. + +Uv users are already experiencing index priority. Uv `documents this +behavior `__ +well, but it is always possible to `improve the +discoverability `__ of that +documentation from the command line, `where users will actually +encounter the unexpected +behavior `__. + +Reference Implementation +======================== + +The uv project demonstrates index priority with its default behavior. Uv +is implemented in Rust, though, so if a python reference implementation +is necessary, we, the authors of this PEP, will provide one. For pip in +particular, we see the implementation plan as something like: + +- For users who don’t use ``--extra-index-url`` or ``--find-links``, + there will be no change, and no migration is necessary. +- Pip users would be able opt in to the index priority behavior with a + new config setting in the CLI and in pip.conf. This proposal does not + recommend any strategy as the default for any installer. It only + recommends documenting the strategies that a tool provides. +- Enable extra info-level output for any pip operation where more than + one index is used. In this output, state the current strategy setting, + and a terse summary of implied behavior, as well as a link to docs + that describe the different options +- Add debugging output that verbosely identifies the index being used at + each step, including where the file is in the configuration hierarchy, + and where it is being included (via config file, env var, or CLI + flag). +- Plumb tracking of which repository gets used for which + package/distribution through the entire pip install process. Store + this information so that it is available to tools like ``pip freeze`` +- Supplement PEP 751 (lockfiles) with capture of repository where a + package/distribution came from + +Rejected Ideas +============== + +- Tell users to set up a proxy/mirror, such as devpi or artifactory that + serves local files if present, and forwards to another server (PyPI) + if no local files match + + This matches the behavior of this proposal very closely, except that + this method requires hosting some server, and may be inaccessible or + not configurable to users in some environments. It is also important + to consider that for an organization that operates its own repository + (for overcoming PyPI size restrictions, for example), this does not + solve the need for ``--extra-index-url`` or proxy/mirror for end + users. That is, organizations get no improvement from this approach + unless they proxy/mirror PyPI as a whole, and get users to configure + their proxy/mirror as their sole repository. + +- Provide tiers of priorities, but keep version priority within groups. + For example, `Poetry specifies “primary” and “supplemental” + tiers `__. + + This keeps control in the hands of the user, rather than an + administrator, but adds complexity. Its behavior within a group is + meant to mimic pip’s behavior without index priority, which is an + anti-goal of this proposal. The granularity of `specifying per-package + sources `__, + solves the ambiguity problem, at the cost of flexibility/portability + of those sources. + +- Are build tags and/or local version specifiers enough? + + Build tags and local version specifiers will take precedence over + packages without those tags and/or local version specifiers. In a pool + of packages, builds that have these additions hosted on a server other + than PyPI will take priority over packages on PyPI, which rarely use + build tags, and forbid local version specifiers. This approach is + viable when package providers want to provide their own local + override, such as `HPC maintainers who provide optimized builds for + their + users `__. + It is less viable in some ways, such as build tags not showing up in + ``pip freeze`` metadata, and `local version specifiers not being + allowed on + PyPI `__. + There is also significant work entailed in building and maintaining + package collections with local build tag variants. + + https://discuss.python.org/t/dependency-notation-including-the-index-url/5659/21 + +- What about `PEP 708 `__? Isn’t that + enough? + + PEP 708 is aimed specifically at addressing dependency confusion + attacks, and doesn’t address the potential for implementation variants + among indexes. It is a way of filtering external URLs and encoding an + allow-list for external indexes in index metadata. It does not change + the lack of priority or preference among channels that currently + exists. + +- `Namespacing `__ + + Namespacing is a means of specifying a package such that the Python + usage of the package does not change, but the package installation + restricts where the package comes from. `PEP + 752 `__ recently proposed a way to + multiplex a package’s owners in a flat package namespace (e.g. the + PyPI repo) by reserving prefixes as grouping elements. `NPM’s concept + of “scopes” `__ has + been raised as another good example of how this might look. This PEP + differs in that it is targeted to multiple repos, not a flat package + namespace. The net effect is roughly the same in terms of predictably + choosing a particular package source, except that the namespacing + approach relies more on naming packages with these namespace prefixes, + whereas this PEP would be less granular, pulling in packages on + whatever higher-priority index the user specifies. The namespacing + approach relies on all configured indexes treating a given namespace + similarly, which leaves the usual concern that not all configured + indexes are trusted equally. The namespace idea is not incompatible + with this PEP, but it also does not improve expression of trust of + indexes in the way that this PEP does. + +Open Issues +=========== + +[Any points that are still being decided/discussed.] + +Acknowledgements +================ + +This work was supported financially by NVIDIA through employment of the author. +NVIDIA teammates dramatically improved this PEP in its early state with their +input. Astral pioneered the behaviors of index priority and layed the +foundation of this document. The Pip authors deserve great praise for their +consistent direction and patient communication of the version priority behavior, +especially in the face of contentious security concerns. + +Copyright +========= + +This document is placed in the public domain or under the +CC0-1.0-Universal license, whichever is more permissive.