integration: handle silent MAC address assignment failures on Fedora #5310
lifubang wants to merge 3 commits into
ip link del dev dummy0
ip link show dummy0 >/dev/null 2>&1
exists=$?
if [ $exists -eq 0 ]; then
    ip link del dev dummy0
fi
There was a problem hiding this comment.
Sorry, I don't think I have enough context. Why would keeping the MAC address and making this change avoid the `Cannot find device "dummy0"` error from the issue?
There was a problem hiding this comment.
Yes, the device has been moved to the container’s network namespace once the container started successfully.
There was a problem hiding this comment.
I don't understand the second commit either. This is teardown, we just need to remove the device. If it does not exist, it's fine -- we can just hide the stderr:
IOW I'd rather see
ip link del dev dummy0 &>/dev/null || true
or something like that.
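A minimal sketch of the quieter teardown suggested above. The helper name `teardown_dummy` is hypothetical and not part of the runc test suite. It treats a missing device as success, since on a passing run dummy0 has already been moved into the container's network namespace:

```shell
# Hypothetical helper: remove a test network device if it still exists
# in the host namespace, and stay silent if it does not.
teardown_dummy() {
	local dev="${1:-dummy0}"
	# Discard stdout and stderr; a missing device is not an error here.
	ip link del dev "$dev" >/dev/null 2>&1 || true
}
```

Called as `teardown_dummy dummy0` from a bats `teardown` function, this never prints `Cannot find device "dummy0"`, whether or not the device is still present. The trade-off is that it also hides any other reason `ip link del` might fail.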
There was a problem hiding this comment.
> I don't understand the second commit either.
Actually, it’s not related to the random MAC address issue -- it’s a refactor aimed at reducing unnecessary log output in CI. I'll add a comment in the commit.
There was a problem hiding this comment.
But why does changing the MAC address cause the device not to be found? I can't make sense of that. What other context am I missing?
There was a problem hiding this comment.
> But why does changing the MAC address cause the device not to be found?
In fact, ip link del dev dummy0 always prints Cannot find device "dummy0" when the test passes — because dummy0 has been moved into the container's network namespace. Bats hides this error when the test succeeds, but shows it in the teardown output when the test fails, which is noisy and misleading.
This commit is independent of the MAC address fix. The "Cannot find device" noise is caused by the container starting successfully (dummy0 gets moved into the container netns), not by anything related to MAC address.
This commit isn't required to fix the Fedora flake -- it's a separate cleanup to reduce noise in CI logs. I kept it in this PR because both issues were found while debugging #5013, but I'm happy to split them if you prefer. For context on why suppressing this noise matters:
#5307 (comment)
I'm not quite sure how I feel about this, in particular what the intended semantics of the auto-setting of […]
This PR only touches test code in tests/integration/ -- there are […]
The situation is: on Fedora, the kernel may silently reject explicit MAC assignments on dummy devices ([…]
The test's purpose is to verify that the container preserves the network device state. The fix reads back whatever MAC the kernel actually assigned and uses that for the assertion, so we compare against what the device actually has, not what we tried and failed to set. The fix accommodates the kernel's MAC policy rather than working against it.
If there's concern about the wording in the comment, I can clarify it to explicitly note this is about kernel behavior, not runc's.
kolyshkin left a comment
There was a problem hiding this comment.
The title of the first commit needs updating.
kolyshkin left a comment
There was a problem hiding this comment.
I looked into kernel self-tests (linux/tools/testing/selftests) and they never set mac address on dummy0.
So maybe we should not do it either, and just check the mac is the same after as it was before?
PS I've also noticed they always bring the interface up:
net/cmsg_ip.sh:ip -netns $NS link add type dummy
net/cmsg_ip.sh-ip -netns $NS link set dev dummy0 up
--
net/cmsg_so_mark.sh:ip -netns $NS link add type dummy
net/cmsg_so_mark.sh-ip -netns $NS link set dev dummy0 up
--
net/cmsg_time.sh:ip -netns $NS link add type dummy
net/cmsg_time.sh-ip -netns $NS link set dev dummy0 up
--
net/fq_band_pktlimit.sh:ip link add type dummy
net/fq_band_pktlimit.sh-ip link set dev dummy0 up
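The "check the MAC is the same after as it was before" idea can be sketched as follows. Both helper names and the `SYSFS_NET` override are hypothetical; the override exists only so the sketch can be exercised without a real device. The substring match mirrors the `*"ether $mac_address "*` assertion used in checkpoint.bats:

```shell
# Hypothetical helpers; SYSFS_NET is overridable only to make the sketch testable.
SYSFS_NET="${SYSFS_NET:-/sys/class/net}"

# Print the MAC address the kernel currently reports for a device.
read_mac() {
	cat "$SYSFS_NET/$1/address"
}

# Succeed if `ip link`-style output contains "ether <mac> ".
output_has_mac() {
	local output="$1" mac="$2"
	case "$output" in
	*"ether $mac "*) return 0 ;;
	*) return 1 ;;
	esac
}
```

A test could record `mac_address=$(read_mac dummy0)` right after creating the device, without ever setting an address, and later call `output_has_mac "$output" "$mac_address"` on the `ip link` output captured from inside the container.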
In my defense, that wasn't the case when I last reviewed it (it used to have the change in […]
ip link set address "$mac_address" dev dummy0
ip address add "$global_ip" dev dummy0
# May fail silently on Fedora, so read back the MAC address to verify.
mac_address=$(ip link show dummy0 | awk '$1=="link/ether"{print $2}')
There was a problem hiding this comment.
Does bats use pipefail? Or is the idea that this will fail later because this variable is empty in the failure case?
There was a problem hiding this comment.
Ah, I misunderstood what you meant by "fail" -- you mean that the address will be random rather than the one we specified?
If so, the comment could be a little clearer.
There was a problem hiding this comment.
> you mean that the address will be random rather than the one we specified?
Yes. If we run `ip link list` after `ip link set address`, we get this:
not ok 34 checkpoint and restore with netdevice
# (in test file tests/integration/checkpoint.bats, line 164)
# `[[ "$output" == *"ether $mac_address "* ]]' failed
# runc spec (status=0)
#
# 1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
# link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
# 2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP mode DEFAULT group default qlen 1000
# link/ether 52:55:55:bd:9b:52 brd ff:ff:ff:ff:ff:ff
# altname enx525555bd9b52
# 27: dummy0: <BROADCAST,NOARP> mtu 1789 qdisc noop state DOWN mode DEFAULT group default qlen 1000
# link/ether 2a:2b:cd:25:53:43 brd ff:ff:ff:ff:ff:ff
# runc run -d --console-socket /tmp/bats-run-kRNrN4/runc.GgeEBy/tty/sock test_busybox_netdevice (status=0)
Please see: https://github.com/opencontainers/runc/actions/runs/26936316745/job/79466846626
I'll update the comment.
I'll add it.
I think there is no harm in setting the MAC address on dummy0, because it works fine on most OSes.
ip link set dev dummy0 up
# Even when a specific MAC address is explicitly set, Fedora may still randomize it.
# Read back the actual address to confirm.
mac_address=$(ip link show dummy0 | awk '$1=="link/ether"{print $2}')
There was a problem hiding this comment.
Maybe just cat /sys/class/net/dummy0/address also works here?
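Whichever read-back is used, it may be worth treating an empty result as a failure: without `pipefail`, the exit status of `ip link show ... | awk ...` is awk's, so a failing `ip` on the left side of the pipe goes unnoticed. A minimal sketch, with the hypothetical helper names `parse_ether` and `current_mac`:

```shell
# Extract the MAC address from `ip link show <dev>` output read on stdin.
parse_ether() {
	awk '$1 == "link/ether" { print $2 }'
}

# Hypothetical wrapper: read the MAC and fail loudly if nothing was parsed.
current_mac() {
	local dev="$1" mac
	mac=$(ip link show "$dev" | parse_ether)
	if [ -z "$mac" ]; then
		echo "could not read MAC address of $dev" >&2
		return 1
	fi
	echo "$mac"
}
```

Reading `/sys/class/net/dummy0/address` with `cat` avoids the pipeline entirely, so the exit status of `cat` already reflects a missing device.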
To reduce unnecessary log output in CI: Cannot find device "dummy0" Signed-off-by: lifubang <lifubang@acmcoder.com>
On Fedora, certain conditions can cause MAC address assignment to fail without clear indication. To ensure reliability across distributions, the code now explicitly reads back the MAC address after attempting to set it and verifies the result. Signed-off-by: lifubang <lifubang@acmcoder.com>
Co-authored-by: Kir Kolyshkin <kolyshkin@gmail.com> Signed-off-by: lifubang <lifubang@acmcoder.com>
Seeing flakes again in #5316. @rata @kolyshkin, could you take a look? Should we go ahead with the merge, or is there a better way forward?
# Even when a specific MAC address is explicitly set, Fedora may still randomize it.
# Read back the actual address to confirm.
mac_address=$(ip link show dummy0 | awk '$1=="link/ether"{print $2}')
There was a problem hiding this comment.
Do we understand why Fedora is randomizing the MAC? Maybe we can create the device differently so it is not randomized? Or, if there is nothing to do, maybe skip only on Fedora?
It feels weird to allow setting the MAC but then have the tests not check it.
There was a problem hiding this comment.
@lifubang This is basically disabling the mac address check, right? IMHO I think it makes sense to understand why this happens, and after that if we want to disable mac tests, only do it in distros that need this.
# set a custom mac address to the interface
ip link set address "$mac_address" dev dummy0
ip link set dev dummy0 up
# Even when a specific MAC address is explicitly set, Fedora may still randomize it.
There was a problem hiding this comment.
s/Fedora/<SPECIFIC COMPONENT (systemd-networkd?)> since <SPECIFIC VERSION>/
There was a problem hiding this comment.
Is this a bug or a feature? Where is this documented?
There was a problem hiding this comment.
> Is this a bug or a feature? Where is this documented?
I believe this is a bug, not a feature.
In a previous test, I added logging to the BATS script, and the output clearly indicated unexpected behavior—consistent with a bug (see: #5310 (comment)).
That said, I haven’t found any official documentation describing this behavior, nor have I seen it reported elsewhere online. If it were intentional, I’d expect it to be documented -- but so far, I’ve found no such reference.
There was a problem hiding this comment.
> s/Fedora/<SPECIFIC COMPONENT (systemd-networkd?)> since <SPECIFIC VERSION>/
In fact, I’m not certain about the exact component (e.g., systemd-networkd?) or the specific version in which this behavior was introduced -- so I used “Fedora” as a placeholder. If anyone knows the precise subsystem and version where this started, I’d appreciate the clarification!
There was a problem hiding this comment.
It might be systemd-udevd or NetworkManager.
In fact, if we bring the device up, wait for udev to settle, and only then assign the MAC, that might be all that's needed to fix this flake.
Let me give it a try.
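A minimal sketch of the "wait for udev, then assign" ordering. The helper name `create_dummy_settled` is hypothetical, and the sketch assumes, without proof, that a udev rule racing with the test is what overwrites the address:

```shell
# Hypothetical helper: create a dummy device, wait for udev to finish
# processing the new-device event, and only then assign the MAC address.
create_dummy_settled() {
	local dev="$1" mac="$2"
	ip link add "$dev" type dummy || return 1
	# Block until the udev event queue is empty, so a late rule cannot
	# overwrite the address set below (an assumption about the root cause).
	udevadm settle || return 1
	ip link set address "$mac" dev "$dev" || return 1
	ip link set dev "$dev" up
}
```

Usage would be `create_dummy_settled dummy0 00:11:22:33:44:55`, run as root. `udevadm settle` only helps if udev is the component rewriting the address; if it turns out to be NetworkManager or something else, this ordering gives no guarantee.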
There was a problem hiding this comment.
So, taking your script from #5310 (comment) and adding this patch:
--- ./testMAC.sh-orig 2026-06-16 17:33:50.309502068 +0000
+++ ./testMAC.sh 2026-06-16 17:34:36.028654316 +0000
@@ -5,6 +5,7 @@
while true; do
((round++))
ip link add dummy0 type dummy
+ udevadm settle
# If we bring the dev up, there's still a one-in-a-thousand chance
# to get a random MAC addr.
# ip link set dummy0 up
it never fails for me.
There was a problem hiding this comment.
So, something like this may fix it entirely: kolyshkin@63e0cbc
(Frankly I can't decide which way is better, as now we have 3 different ones -- the above patch, this PR, and #5324).
There was a problem hiding this comment.
[kir@lima-fedora-rh ~]$ timeout 3m sudo ./testMAC.sh-orig
got a random MAC(ee:f7:41:ac:d4:bd) in round(328),\n
retring for 1 times......
got a random MAC(ee:f7:41:ac:d4:bd) in round(904),\n
retring for 1 times......
got a random MAC(ee:f7:41:ac:d4:bd) in round(1566),\n
retring for 1 times......
got a random MAC(ee:f7:41:ac:d4:bd) in round(3746),\n
retring for 1 times......
got a random MAC(ee:f7:41:ac:d4:bd) in round(3757),\n
retring for 1 times......
got a random MAC(ee:f7:41:ac:d4:bd) in round(4548),\n
retring for 1 times......
got a random MAC(ee:f7:41:ac:d4:bd) in round(4919),\n
retring for 1 times......
got a random MAC(ee:f7:41:ac:d4:bd) in round(5139),\n
retring for 1 times......
got a random MAC(ee:f7:41:ac:d4:bd) in round(5445),\n
retring for 1 times......
got a random MAC(ee:f7:41:ac:d4:bd) in round(5478),\n
retring for 1 times......
got a random MAC(ee:f7:41:ac:d4:bd) in round(6012),\n
retring for 1 times......
got a random MAC(ee:f7:41:ac:d4:bd) in round(6082),\n
retring for 1 times......
got a random MAC(ee:f7:41:ac:d4:bd) in round(7657),\n
retring for 1 times......
got a random MAC(ee:f7:41:ac:d4:bd) in round(7773),\n
retring for 1 times......
got a random MAC(ee:f7:41:ac:d4:bd) in round(7776),\n
retring for 1 times......
got a random MAC(ee:f7:41:ac:d4:bd) in round(7785),\n
retring for 1 times......
got a random MAC(ee:f7:41:ac:d4:bd) in round(7904),\n
retring for 1 times......
got a random MAC(ee:f7:41:ac:d4:bd) in round(7916),\n
retring for 1 times......
got a random MAC(ee:f7:41:ac:d4:bd) in round(8116),\n
retring for 1 times......
got a random MAC(ee:f7:41:ac:d4:bd) in round(8452),\n
retring for 1 times......
got a random MAC(ee:f7:41:ac:d4:bd) in round(8544),\n
retring for 1 times......
got a random MAC(ee:f7:41:ac:d4:bd) in round(9901),\n
retring for 1 times......
got a random MAC(ee:f7:41:ac:d4:bd) in round(10675),\n
retring for 1 times......
got a random MAC(ee:f7:41:ac:d4:bd) in round(10862),\n
retring for 1 times......
got a random MAC(ee:f7:41:ac:d4:bd) in round(11112),\n
retring for 1 times......
got a random MAC(ee:f7:41:ac:d4:bd) in round(11669),\n
retring for 1 times......
got a random MAC(ee:f7:41:ac:d4:bd) in round(11717),\n
retring for 1 times......
got a random MAC(ee:f7:41:ac:d4:bd) in round(11983),\n
retring for 1 times......
got a random MAC(ee:f7:41:ac:d4:bd) in round(12162),\n
retring for 1 times......
got a random MAC(ee:f7:41:ac:d4:bd) in round(13018),\n
retring for 1 times......
[kir@lima-fedora-rh ~]$ timeout 10m sudo ./testMAC.sh
[kir@lima-fedora-rh ~]$ # ^^^ Look ma, no failures!
There was a problem hiding this comment.
@kolyshkin I tried that too in the PR I opened :) Then I changed it to just remove the MAC, as I think it doesn't make much sense to test something we don't change at all.
I think we could do the 3 things:
- udevadm settle, just after creating the interface
- Put the data in the same command we create the interface, as then udev usually doesn't act on values that were set
- Remove the MAC addr changes, as we literally don't do anything about it in runc code
What do you think?
|
In fact, this flaky test is not limited to Fedora -- it also occurs on Ubuntu.
#!/bin/bash
round=0
mac_address="00:11:22:33:44:55"
new_addr=""
while true; do
    ((round++))
    ip link add dummy0 type dummy
    # If we bring the dev up, there's still a one-in-a-thousand chance
    # to get a random MAC addr.
    # ip link set dummy0 up
    ip link set address "$mac_address" dev dummy0
    st=$?
    if [ $st -ne 0 ]; then
        echo "exit with $st"
    fi
    new_addr=$(cat /sys/class/net/dummy0/address)
    try=0
    while [[ "$mac_address" != "$new_addr" ]]; do
        ((try++))
        echo "got a random MAC($new_addr) in round($round),\n"
        echo " retring for $try times......"
        if [ $try -eq 10 ]; then
            echo "round$round: $new_addr"
            exit
        fi
        ip link set address "$mac_address" dev dummy0
        new_addr=$(cat /sys/class/net/dummy0/address)
    done
    ip link del dev dummy0
done
ip link del dev dummy0
When running this script, we observe output like the following:
Interestingly, reapplying the same command almost always succeeds on the first retry.
Moreover, if we specify the MAC address directly in the ip link add command, it always succeeds.
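Setting the address at creation time, as reported above, can be sketched like this. The helper name `create_dummy_with_mac` is hypothetical; the `address` keyword is a standard `ip link add` option:

```shell
# Hypothetical helper: create the dummy device with its MAC address set
# in the same `ip link add` call, leaving no window between creation
# and address assignment.
create_dummy_with_mac() {
	local dev="$1" mac="$2"
	ip link add "$dev" address "$mac" type dummy
}
```

Usage would be `create_dummy_with_mac dummy0 00:11:22:33:44:55`, run as root. The thread reports that this form has not been seen to fail, but the reason it is safe is not established here, so reading the address back afterwards remains a cheap sanity check.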
@lifubang Awesome, thanks! Maybe I'm missing something, but it seems the failure is real. Let's see how to fix the real underlying issue :)
EDIT: Oh, it seems the spec doesn't talk about the MAC address? https://github.com/opencontainers/runtime-spec/blob/main/config-linux.md#network-devices. I wonder if this only happens with fake devices and probably not with real network interfaces?
Oh, I missed this comment! I was also testing that :-D. I've also opened another PR, and it also seems to fix it. But now that I look at the code, I think we should just remove the check of the MAC address. Changed my PR to remove the MAC checks. Let me know what you think. I'm also okay with doing it at netdev creation time.
catching up
I'm going to remove this from the 1.5 milestone so I can release 1.5.0 today -- this is an existing issue that only affects CI and we can always backport a fix later once we figure out the best approach. |
Fix #5013
On Fedora, certain conditions can cause MAC address assignment to fail
without clear indication. To ensure reliability across distributions,
the code now explicitly reads back the MAC address after attempting to
set it and verifies the result.
This PR also refactors the teardown script to reduce unnecessary log output when testing network devices:
Cannot find device "dummy0".