297 changes: 297 additions & 0 deletions peps/pep-9999.rst
@@ -0,0 +1,297 @@
PEP: 9999
Title: JSON Package Metadata
Author: Emma Harper Smith <[email protected]>
PEP-Delegate: Paul Moore
Discussions-To: Pending
Status: Draft
Type: Standards Track
Topic: Packaging
Created: 2025-12-09
Post-History: Pending


Abstract
========

Python package metadata ("core metadata") was first defined in :pep:`241` to
use :rfc:`822` email headers to encode information about packages. This was a
reasonable choice: email messages were then the only widely used, standardized
text format that had a parser in the standard library. However,
issues with handling different encodings, differing handling of line breaks,
and other differences between implementations have caused numerous packaging
bugs. To resolve these issues, this PEP proposes introducing a
`JavaScript Object Notation (JSON) <https://www.json.org/json-en.html>`_
encoded file containing core metadata in Python packages.


Motivation
==========

The email message format has a number of complexities and limitations which
reduce its utility as a portable textual interchange format for packaging
metadata. Because the :mod:`email` module requires configuration changes to
properly generate valid core metadata, many projects do not use it and instead
generate core metadata in a custom manner. There are many pitfalls with
generating email headers that these custom generators can hit. First, core
metadata fields may contain newlines in their values. These newlines must be
folded into continuation lines when written so that readers can "unfold" them
per :rfc:`822`. Improperly escaped newlines can lead to generating invalid core
metadata. Second, as discussed in the core metadata specifications:

.. epigraph::

    The standard file format for metadata (including in wheels and installed
    projects) is based on the format of email headers. However, email formats
    have been revised several times, and exactly which email RFC applies to
    packaging metadata is not specified. In the absence of a precise
    definition, the practical standard is set by what the standard library
    :mod:`email.parser` module can parse using the
    :attr:`email.policy.compat32` policy.

Since no specific email RFC is selected, the current core metadata
specification leaves it ambiguous whether a given core metadata document is
valid. :rfc:`822` is the only email standard to be explicitly listed in a PEP.
However, the core metadata specification also requires that core metadata be
encoded using UTF-8 when written to a file. De facto, this makes the core
metadata follow :rfc:`6532`, which specifies internationalization of email
headers. This raises practical interoperability concerns. Until a few years
ago, it was unspecified how to handle non-ASCII content in core metadata,
causing confusion about how to properly encode non-ASCII email addresses in
core metadata.

Third, the current format is difficult to properly validate and parse. Many
tools do not check for issues with the output of the :mod:`!email` parser. A
malformed document may still be parsed without error by the :mod:`!email`
module as a valid email message. Furthermore, due to limitations in the email
format, fields like ``Project-URL`` must use custom encodings of nested
key-value items, further complicating parsing.

Finally, the lack of a schema makes it difficult to validate the contents of
email message encoded metadata. While introducing a specification for the
current format has been
`discussed previously <https://discuss.python.org/t/python-metadata-format-specification-and-implementation/7550>`_,
no progress has been made, and converting to JSON was a suggested resolution
to the issues raised.


Rationale
=========

Introducing a new core metadata file with a well-specified format will greatly
ease generating, parsing, and validating metadata. JSON is a natural choice for
storing package core metadata. It is easily machine readable and writable, is
understandable to humans, and is well supported across many languages.
Furthermore, :pep:`566` already specifies a canonicalization of email formatted
core metadata to JSON. JSON is also a frequently used format for data
interchange on the web. For discussion of other formats considered, please
refer to the rejected ideas section.

To maintain backwards compatibility, the JSON metadata file MUST be generated
alongside the existing email formatted metadata file. This ensures that tools
that do not support the new format can still read package metadata for new
packages.

The JSON formatted metadata file must be semantically equivalent to the email
encoded file. This ensures that the metadata is unambiguous between the two
formats, and tools may read either when both are present. To maintain
performance, this equivalence is not required to be verified by installers,
though other tools may do so. Some tools may choose to make the check dependent
on a configuration flag.

Package indexes SHOULD check that the metadata files are semantically
equivalent when the package is added to the index. This is a low-cost, one-time
check that ensures users of the index are served valid packages.


Specification
=============

JSON Format Core Metadata File
------------------------------

A new optional file ``METADATA.json`` shall be introduced as a metadata file
for Python packages. If generated, the ``METADATA.json`` file MUST be placed in
the same directory as the current email formatted ``METADATA`` or ``PKG-INFO``
file.

For wheels, this means that ``METADATA.json`` MUST be located in the
``.dist-info`` directory. The wheel format minor version will be incremented to
indicate the change in the format.

For source distribution packages, the ``METADATA.json`` file MUST be located
in the root directory of the project sources. Tools that prefer the JSON
formatted metadata file MUST check for the existence of a ``METADATA.json``
in the source distribution before reading the file.

The semantic contents of the ``METADATA`` and ``METADATA.json`` files MUST be
equivalent if ``METADATA.json`` is present. Installers MAY verify this
information. Public package indexes SHOULD verify the files are semantically
equivalent.
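
The lookup rule for wheels can be illustrated with a minimal sketch. It assumes
a reader that prefers ``METADATA.json`` and falls back to the email formatted
``METADATA`` file. The names ``read_core_metadata`` and ``_demo_wheel`` are
hypothetical helpers invented for this example and are not part of any
existing tool.

```python
import io
import json
import zipfile
from email.parser import HeaderParser


def read_core_metadata(wheel: zipfile.ZipFile):
    """Return ("json", dict) if METADATA.json exists, else ("email", Message)."""
    names = wheel.namelist()
    # Find the top-level "*.dist-info" directory from the archive member names.
    dist_info = next(
        (
            name.split("/", 1)[0]
            for name in names
            if name.split("/", 1)[0].endswith(".dist-info")
        ),
        None,
    )
    if dist_info is None:
        raise ValueError("no .dist-info directory found in archive")

    json_name = f"{dist_info}/METADATA.json"
    if json_name in names:
        # Check for existence before reading, then prefer the JSON file.
        return "json", json.loads(wheel.read(json_name).decode("utf-8"))

    # Fall back to the email formatted file for older packages.
    text = wheel.read(f"{dist_info}/METADATA").decode("utf-8")
    return "email", HeaderParser().parsestr(text)


def _demo_wheel(files: dict) -> zipfile.ZipFile:
    """Build a tiny in-memory archive for demonstration purposes only."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        for name, content in files.items():
            archive.writestr(name, content)
    return zipfile.ZipFile(io.BytesIO(buffer.getvalue()))


email_text = "Metadata-Version: 2.4\nName: example\nVersion: 1.0\n"
new_wheel = _demo_wheel({
    "example-1.0.dist-info/METADATA": email_text,
    "example-1.0.dist-info/METADATA.json": json.dumps(
        {"metadata_version": "2.4", "name": "example", "version": "1.0"}
    ),
})
old_wheel = _demo_wheel({"example-1.0.dist-info/METADATA": email_text})

new_kind, new_meta = read_core_metadata(new_wheel)
old_kind, old_meta = read_core_metadata(old_wheel)
```

Because the email formatted file is still required, a reader written this way
keeps working for packages that were built before ``METADATA.json`` existed.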

Conversion to JSON Encoding
---------------------------

Conversion from the current email format for core metadata to JSON should
follow the process described in :pep:`566`, with the following modification:
the ``Project-URL`` entries should be converted into an object with keys
containing the labels and values containing the URLs from the original email
value. The overall process thus becomes:

#. The original key-value format should be read with
   ``email.parser.HeaderParser``;
#. All transformed keys should be reduced to lower case. Hyphens should be
   replaced with underscores, but all other characters should be retained;
#. The transformed value for any field marked with "(Multiple use)" should be a
   single list containing all the original values for the given key;
#. The ``Keywords`` field should be converted to a list by splitting the
   original value on commas;
#. The ``Project-URL`` field should be converted into a JSON object with keys
   containing the labels and values containing the URLs from the original email
   value;
#. The message body, if present, should be set to the value of the
   ``description`` key;
#. The result should be stored as a string-keyed dictionary.

One edge case in the above conversion is that the ``Project-URL`` label is
"free text, with a maximum length of 32 characters." Free text may itself
contain a comma, which makes it ambiguous where the label ends and the URL
begins. Therefore this PEP sets the requirement that the ``Project-URL`` label
be any text *except* the comma (``,``) character. This allows for unambiguous
parsing of the ``Project-URL`` entries by splitting the text on the left-most
comma (``,``) character.
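
The conversion steps above can be sketched as follows. This is an illustrative
outline rather than the reference implementation: the function name
``metadata_to_json`` is hypothetical, the ``MULTIPLE_USE`` set is a partial
list chosen for the example, and stripping whitespace around keywords, labels,
and URLs is an assumption that the steps do not spell out.

```python
from email.parser import HeaderParser

# Illustrative subset of fields marked "(Multiple use)"; the authoritative
# list lives in the core metadata specification, not here.
MULTIPLE_USE = {
    "classifier",
    "dynamic",
    "license_file",
    "platform",
    "provides_extra",
    "requires_dist",
    "requires_external",
    "supported_platform",
}


def metadata_to_json(text: str) -> dict:
    """Convert email formatted core metadata into a JSON-compatible dict."""
    message = HeaderParser().parsestr(text)
    result = {}
    for name in message.keys():
        # Lower-case the key and replace hyphens with underscores.
        key = name.lower().replace("-", "_")
        if key in result:
            continue  # repeated header, already collected via get_all()
        values = message.get_all(name)
        if key == "project_url":
            # Split each entry on the left-most comma: "label, url".
            urls = {}
            for value in values:
                label, _, url = value.partition(",")
                urls[label.strip()] = url.strip()
            result[key] = urls
        elif key == "keywords":
            # Assumption: surrounding whitespace is not significant.
            result[key] = [word.strip() for word in values[0].split(",")]
        elif key in MULTIPLE_USE:
            result[key] = values
        else:
            result[key] = values[0]
    body = message.get_payload()
    if body:
        # The message body, if present, becomes the description.
        result["description"] = body
    return result


sample = (
    "Metadata-Version: 2.4\n"
    "Name: example\n"
    "Version: 1.0\n"
    "Keywords: json,metadata\n"
    "Classifier: Programming Language :: Python :: 3\n"
    "Classifier: Topic :: Software Development\n"
    "Project-URL: Homepage, https://example.org/\n"
    "Project-URL: Source, https://example.org/src\n"
    "\n"
    "A long description.\n"
)
converted = metadata_to_json(sample)
```

Passing ``converted`` to ``json.dumps`` would then yield the contents of a
``METADATA.json`` file under these assumptions.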

JSON Schema for Core Metadata
-----------------------------

To enable verification of JSON encoded core metadata, a
`JSON schema <https://json-schema.org/>`_ for core metadata has been produced.
This schema will be updated with each revision to the core metadata
specification. The schema is available in
:ref:`9999-core-metadata-json-schema`.

TODO: where should the schema be served/what should the $id be?

Serving METADATA.json in the Simple Repository API
--------------------------------------------------

:pep:`658` introduced a means of serving package metadata in the Simple
Repository API. The JSON encoded version of the package metadata may also be
served, via the following modifications to the Simple Repository API:

A new attribute ``data-dist-info-metadata-json`` may be added to anchor tags
in the Simple API. This attribute should have a value containing the hash
information for the ``METADATA.json`` file in the same format as
``data-dist-info-metadata``. If ``data-dist-info-metadata-json`` is present,
the repository MUST serve the JSON encoded metadata file at the
distribution's path with ``.metadata.json`` appended to it. For example, if a
distribution is served at ``/simple/foo-1.0-py3-none-any.whl``, the JSON
encoded core metadata file MUST be served at
``/simple/foo-1.0-py3-none-any.whl.metadata.json``.
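
As a rough sketch of how a client might consume this attribute, the example
below scans a Simple API HTML page and derives the metadata URL for each
distribution that advertises one. The class name ``JsonMetadataLinks`` and the
``index.example`` host are invented for illustration, and the hash value is
kept as an opaque string rather than verified.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class JsonMetadataLinks(HTMLParser):
    """Collect anchors that carry ``data-dist-info-metadata-json``."""

    def __init__(self, base_url: str):
        super().__init__()
        self.base_url = base_url
        # Maps distribution URL -> (METADATA.json URL, raw attribute value).
        self.links = {}

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attributes = dict(attrs)
        href = attributes.get("href")
        if href is None or "data-dist-info-metadata-json" not in attributes:
            return
        # Resolve relative links and drop any "#sha256=..." fragment.
        file_url = urldefrag(urljoin(self.base_url, href)).url
        self.links[file_url] = (
            file_url + ".metadata.json",
            attributes["data-dist-info-metadata-json"],
        )


page = (
    '<a href="/simple/foo-1.0-py3-none-any.whl#sha256=abc" '
    'data-dist-info-metadata-json="sha256=def">foo-1.0-py3-none-any.whl</a>'
    '<a href="/simple/foo-0.9.tar.gz">foo-0.9.tar.gz</a>'
)
parser = JsonMetadataLinks("https://index.example/simple/foo/")
parser.feed(page)
links = parser.links
```

Anchors without the attribute are skipped, so a client following this approach
would fall back to the existing ``data-dist-info-metadata`` mechanism or to
downloading the distribution itself.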

Deprecation of the ``METADATA`` and ``PKG-INFO`` Files
------------------------------------------------------

The ``METADATA`` and ``PKG-INFO`` files are now deprecated. This means that a
future PEP may make the ``METADATA`` and ``PKG-INFO`` files optional and
require ``METADATA.json`` to be present. Please see the next section for
caveats to that change.

Despite the ``METADATA`` and ``PKG-INFO`` files being deprecated, new core
metadata revisions should be implemented for both JSON and email to ensure that
they may remain semantically equivalent.

Backwards Compatibility
=======================

The specification for ``METADATA.json`` is designed such that the new format is
completely backwards compatible. Existing tools may read metadata from the
existing email formatted files, and new tools may take advantage of the new
format.

A future major revision of the wheel specification may make the ``METADATA``
and ``PKG-INFO`` files optional and make the ``METADATA.json`` file required.
Note that tools will need to maintain parsing of email metadata indefinitely to
support parsing metadata for old packages which only have the ``METADATA`` or
``PKG-INFO`` files.


Security Implications
=====================

One attack vector with JSON encoded core metadata is a JSON payload
designed to consume excessive memory or CPU resources in a denial of service
attack. While this attack is unlikely to affect users, who can cancel
resource-intensive operations, it may be an issue for package indexes.

There are several mitigations that can be made to prevent this:

#. The length of the JSON payload can be restricted to a reasonable size.
#. The reader may configure a :class:`~json.JSONDecoder` to skip converting
   :class:`int` and :class:`float` values, avoiding quadratic number parsing
   time complexity attacks.
#. I plan to contribute a change to the :class:`~json.JSONDecoder` in Python
   3.15+ that will allow it to be configured to restrict the nesting of JSON
   payloads to a reasonable depth.

With these mitigations in place, concerns about denial of service attacks with
JSON encoded core metadata are minimal.
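
The first two mitigations can be sketched with the standard library alone. The
function name ``load_metadata_json`` and the one megabyte default limit are
assumptions made for this example, not values proposed by this PEP. Numeric
literals are kept as strings through the ``parse_int`` and ``parse_float``
hooks of :class:`~json.JSONDecoder`, so no number conversion takes place.

```python
import json

# Hypothetical limit for illustration; a real index would pick its own value.
MAX_METADATA_BYTES = 1_000_000


def load_metadata_json(payload: bytes, max_bytes: int = MAX_METADATA_BYTES) -> dict:
    """Decode METADATA.json while applying basic resource limits."""
    # Mitigation 1: reject oversized payloads before doing any parsing.
    if len(payload) > max_bytes:
        raise ValueError("METADATA.json payload is too large")

    # Mitigation 2: keep numeric literals as strings instead of converting
    # them to int or float.
    decoder = json.JSONDecoder(parse_int=str, parse_float=str)
    data = decoder.decode(payload.decode("utf-8"))

    if not isinstance(data, dict):
        raise ValueError("METADATA.json must contain a JSON object")
    return data


loaded = load_metadata_json(b'{"name": "example", "x": 12345, "y": 1.5}')
```

Keeping numbers as strings loses nothing under the conversion described
earlier, which only produces strings, lists, and objects.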


Reference Implementation
========================

A reference implementation of the JSON schema for JSON core metadata is
available in :ref:`9999-core-metadata-json-schema`.

Furthermore, a reference implementation in the ``packaging`` library `is
available
<https://github.com/wheelnext/packaging/tree/PEP-9999-JSON-metadata>`__.


Rejected Ideas
==============

Using Another File Format (TOML, YAML, etc.)
--------------------------------------------

While TOML or another format could be used for the new core metadata file
format, JSON has been chosen for a few reasons:

#. Core metadata is mostly meant as a machine interchange format to be used by
   tools and services which wish to interoperate. Therefore the
   human-readability of TOML is not an important consideration in this
   selection.
#. JSON parsers are implemented in many languages' standard libraries and the
   :mod:`json` module has been part of Python's standard library for a very
   long time.
#. JSON is fast to parse and emit.
#. JSON schemas are JSON native and commonly used.


Open Issues
===========

Where Should the JSON Schema be Served?
---------------------------------------

Where should the standard JSON Schema be served? Some options would be
packaging.python.org, pypi.org, python.org, or pypa.org.

My first choice would be packaging.python.org, but I am open to other options.

Should we also update the ``WHEEL`` metadata file format to be JSON encoded?
----------------------------------------------------------------------------

The ``WHEEL`` metadata file format is also an email formatted file. This means
that it is subject to the same parsing and validation issues as the
``METADATA`` and ``PKG-INFO`` files. However, the ``WHEEL`` file is part of the
initial wheel format version check done by installers. Changing the file format
might harm backwards compatibility by making old installers unable to read new
metadata.

I think it could make sense to introduce a ``WHEEL.json`` file. Then a future
wheel major version could remove the ``WHEEL`` file and require the
``WHEEL.json`` file instead.


Copyright
=========

This document is placed in the public domain or under the
CC0-1.0-Universal license, whichever is more permissive.

11 changes: 11 additions & 0 deletions peps/pep-9999/appendix-core-metadata-json-schema.rst
@@ -0,0 +1,11 @@
:orphan:

.. _9999-core-metadata-json-schema:

Appendix: JSON Schema for Core Metadata
=======================================

.. literalinclude:: core-metadata.schema.json
   :language: json
   :linenos:
   :name: core-metadata-schema