[Security] Disable external entity processing in XML upload to prevent XXE#4119
Open
zhoujinsong wants to merge 1 commit intoapache:masterfrom
Open
[Security] Disable external entity processing in XML upload to prevent XXE#4119zhoujinsong wants to merge 1 commit intoapache:masterfrom
zhoujinsong wants to merge 1 commit intoapache:masterfrom
Conversation
…t XXE When uploading XML configuration files (e.g. core-site.xml, hdfs-site.xml), the uploaded bytes are parsed by Hadoop's Configuration.addResource(). Although the current classpath includes Woodstox (which does not expand external entities by default), this implicit protection is fragile and can silently break if dependencies change. This patch explicitly disables external entity processing using a hardened XMLInputFactory before delegating to Hadoop Configuration, ensuring XXE protection regardless of the underlying XML parser implementation. Changes: - Pre-validate the XML stream with XMLInputFactory configured to: - IS_SUPPORTING_EXTERNAL_ENTITIES = false - SUPPORT_DTD = false - FEATURE_SECURE_PROCESSING = true - Switch to Configuration(false) to avoid loading default Hadoop configs
00275a0 to
2ed10c0
Compare
xxubai
reviewed
Mar 20, 2026
Contributor
xxubai
left a comment
There was a problem hiding this comment.
Can you also add some junit5 tests for PlatformFileInfoController?
| XMLInputFactory xif = XMLInputFactory.newInstance(); | ||
| xif.setProperty(XMLInputFactory.IS_SUPPORTING_EXTERNAL_ENTITIES, false); | ||
| xif.setProperty(XMLInputFactory.SUPPORT_DTD, false); | ||
| xif.setProperty(XMLConstants.FEATURE_SECURE_PROCESSING, true); |
Contributor
There was a problem hiding this comment.
FEATURE_SECURE_PROCESSING is not a standard property of XMLInputFactory and may throw an exception.
| xif.createXMLStreamReader(new ByteArrayInputStream(bytes)).close(); | ||
|
|
||
| Configuration configuration = new Configuration(false); | ||
| configuration.addResource(new ByteArrayInputStream(bytes)); |
Contributor
There was a problem hiding this comment.
The XXE fix is still incomplete. The new XMLInputFactory parse only acts as a pre-check, but the actual validation still happens in Configuration.addResource(new ByteArrayInputStream(bytes)). In Hadoop 3.4.0, this overload does not use the restricted parser path by default, so the XML is reparsed in a way that can still allow DTD / external entity processing. In other words, this adds an extra parse, but it does not fully close the XXE risk.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What changes were proposed in this pull request?
When users upload XML configuration files (e.g.
core-site.xml,hdfs-site.xml) via the AMS dashboard, the uploaded bytes are parsed byHadoop Configuration.addResource(). Although the current classpath includes Woodstox (which does not expand external entities by default), this implicit protection is fragile — it can silently break if the dependency is excluded due to a version conflict in the future.This patch adds explicit XXE protection at the Amoro layer before delegating to Hadoop
Configuration, ensuring the security guarantee holds regardless of the underlying XML parser implementation on the classpath.Why are the changes needed?
Without explicit protection, a malicious user could upload a crafted XML file containing an external entity reference like:
If the XXE implicit protection were ever lost (e.g. Woodstox excluded), this could allow:
How was this patch tested?
Manual testing: uploaded a well-formed XML file (accepted) and an XML file with an external entity reference (rejected with error response).
Does this PR introduce any user-facing change?
No. Legitimate Hadoop XML configuration files (
core-site.xml,hdfs-site.xml,hive-site.xml) do not use external entities. Valid files continue to upload successfully.