Implement Nvidia GPU passthrough #680
base: main
Conversation
Walkthrough
Adds GPU passthrough support.
Changes
Sequence Diagram(s)
sequenceDiagram
participant Client
participant Server
participant ResourceClaimer
participant MachineStore
participant MachineReconciler
participant Libvirt
Client->>Server: CreateMachine(request for GPU-capable class)
Server->>ResourceClaimer: Claim("nvidia.com/gpu" resources)
ResourceClaimer-->>Server: Claims (include PCI addresses)
Server->>MachineStore: Create Machine with Spec.Gpu (PCI addresses)
MachineStore-->>Server: Machine created
Server-->>Client: CreateMachineResponse
MachineReconciler->>MachineStore: Watch/Fetch Machine
MachineReconciler->>MachineReconciler: claimedGPUsToHostDevs(Machine.Spec.Gpu)
MachineReconciler->>Libvirt: Define domain with GPU hostdevs
Libvirt-->>MachineReconciler: Domain configured
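For orientation, here is a minimal sketch of what the claimedGPUsToHostDevs step from the diagram above could look like; this is an illustration, not the PR's actual implementation, and it assumes the pci.Address fields (Domain, Bus, Slot, Function) are integer-valued. The libvirtxml structures are from libvirt.org/go/libvirtxml.

// claimedGPUsToHostDevsSketch converts the machine's claimed PCI addresses
// into libvirt hostdev entries for VFIO passthrough. Hypothetical sketch.
func claimedGPUsToHostDevsSketch(machine *api.Machine) []libvirtxml.DomainHostdev {
	hostDevs := make([]libvirtxml.DomainHostdev, 0, len(machine.Spec.Gpu))
	for _, addr := range machine.Spec.Gpu {
		domain, bus := uint(addr.Domain), uint(addr.Bus)
		slot, function := uint(addr.Slot), uint(addr.Function)
		hostDevs = append(hostDevs, libvirtxml.DomainHostdev{
			Managed: "yes", // let libvirt detach/reattach the host driver
			SubsysPCI: &libvirtxml.DomainHostdevSubsysPCI{
				Source: &libvirtxml.DomainHostdevSubsysPCISource{
					Address: &libvirtxml.DomainAddressPCI{
						Domain: &domain, Bus: &bus, Slot: &slot, Function: &function,
					},
				},
			},
		})
	}
	return hostDevs
}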
sequenceDiagram
participant MachineReconciler
participant MachineStore
participant ResourceClaimer
MachineReconciler->>MachineStore: Reconcile delete event -> Get Machine
MachineStore-->>MachineReconciler: Machine with Spec.Gpu (PCI addresses)
MachineReconciler->>ResourceClaimer: Release GPU claims (PCI addresses)
ResourceClaimer-->>MachineReconciler: Release ack / error
MachineReconciler->>MachineStore: Delete Machine record
MachineStore-->>MachineReconciler: Delete confirmed
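Similarly, a sketch of the delete-path release step in the second diagram, reusing the claim helpers that appear later in this review (claim.Claims, gpu.NewGPUClaim, api.NvidiaGPUPlugin); the exact reconciler wiring is an assumption.

// releaseGPUClaimsSketch hands the machine's claimed PCI addresses back to
// the resource claimer so they can be claimed by future machines. Sketch only.
func (r *MachineReconciler) releaseGPUClaimsSketch(ctx context.Context, machine *api.Machine) error {
	if len(machine.Spec.Gpu) == 0 {
		return nil // machine holds no GPU claims; nothing to release
	}
	return r.resourceClaimer.Release(ctx, claim.Claims{
		api.NvidiaGPUPlugin: gpu.NewGPUClaim(machine.Spec.Gpu),
	})
}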
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~65 minutes
Possibly related issues
🚥 Pre-merge checks: ✅ 1 passed | ❌ 2 failed
❌ Failed checks (1 warning, 1 inconclusive)
✅ Passed checks (1 passed)
hostutils "github.com/ironcore-dev/provider-utils/storeutils/host"
)

func GetClaimedPCIAddressesFromMachineStore(ctx context.Context, machineStore *hostutils.Store[*api.Machine]) ([]pci.Address, error) {
Wdyt about dropping the utils and having a claim package?
I would want to move it to provider-utils, all providers need to do this.
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
internal/server/machine_create.go (1)
90-110: Guard against nil resourceClaimer and skip GPU claiming for CPU-only classes.
The current code calls s.resourceClaimer.Claim() unconditionally, which would panic if resourceClaimer is nil. Additionally, for CPU-only machine classes, filterNvidiaGPUResources() returns an empty resource list, and calling Claim() with empty resources may cause unnecessary errors. Add a guard to check if GPUs are requested before attempting to claim them and verify the claimer is configured.
🛡️ Suggested guard
- gpus, err := s.resourceClaimer.Claim(ctx, filterNvidiaGPUResources(class.Capabilities.Resources))
- if err != nil {
- 	log.Error(err, "Failed to claim GPUs")
- 	return nil, fmt.Errorf("failed to claim GPUs: %w", err)
- }
-
- pciAddrs := getPCIAddresses(gpus)
+ gpuRes := filterNvidiaGPUResources(class.Capabilities.Resources)
+ var gpus claim.Claims
+ if len(gpuRes) > 0 {
+ 	if s.resourceClaimer == nil {
+ 		return nil, fmt.Errorf("gpu resources requested but resource claimer is not configured")
+ 	}
+ 	gpus, err = s.resourceClaimer.Claim(ctx, gpuRes)
+ 	if err != nil {
+ 		log.Error(err, "Failed to claim GPUs")
+ 		return nil, fmt.Errorf("failed to claim GPUs: %w", err)
+ 	}
+ }
+
+ pciAddrs := getPCIAddresses(gpus)
🤖 Fix all issues with AI agents
In `@internal/controllers/machine_controller_gpus_test.go`:
- Around line 14-99: The three It blocks testing claimedGPUsToHostDevs all share
the same description and one By message has a typo ("nil Gu field"), which makes
failures ambiguous; update the three It descriptions to unique, specific strings
(e.g., "should convert two claimed GPUs to libvirt host devices with correct PCI
addresses", "should return empty host devices for empty GPU slice", "should
return empty host devices for nil Gpu field") and fix the typo in the By call
inside the third test from "nil Gu field" to "nil Gpu field"; locate these
changes around the It blocks that call claimedGPUsToHostDevs and adjust only the
description and By message text.
In `@internal/server/machine_create.go`:
- Line 44: When claiming GPUs in machine_create.go (the s.resourceClaimer.Claim
call that obtains gpus), add a deferred cleanup that releases those GPU claims
if any subsequent error occurs (e.g., failures in SetObjectMetadata or
machineStore.Create). Implement a defer func() that checks the surrounding error
return (err) and, when non-nil, constructs the same claim.Claims used elsewhere
(claim.Claims{v1alpha1.ResourceName(api.NvidiaGPUPlugin):
gpu.NewGPUClaim(getPCIAddresses(gpus))}) and calls
s.resourceClaimer.Release(ctx, claims), logging any release error; keep the
Claim/Release logic keyed to the existing symbols s.resourceClaimer.Claim,
s.resourceClaimer.Release, getPCIAddresses, gpu.NewGPUClaim, claim.Claims and
api.NvidiaGPUPlugin so the deferred release runs on all error paths.
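A sketch of the deferred release described in the prompt above, assuming err is a named return value of the surrounding function and reusing the symbols the prompt lists (s.resourceClaimer, getPCIAddresses, gpu.NewGPUClaim, claim.Claims, api.NvidiaGPUPlugin); exact signatures may differ in the PR.

claimedGPUs, err := s.resourceClaimer.Claim(ctx, filterNvidiaGPUResources(class.Capabilities.Resources))
if err != nil {
	return nil, fmt.Errorf("failed to claim GPUs: %w", err)
}
defer func() {
	if err == nil {
		return // creation succeeded; the machine keeps its claims
	}
	// Creation failed after claiming: hand the GPUs back.
	claims := claim.Claims{
		v1alpha1.ResourceName(api.NvidiaGPUPlugin): gpu.NewGPUClaim(getPCIAddresses(claimedGPUs)),
	}
	if relErr := s.resourceClaimer.Release(ctx, claims); relErr != nil {
		log.Error(relErr, "Failed to release GPU claims after create error")
	}
}()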
🧹 Nitpick comments (7)
internal/utils/claims_test.go (1)
19-47: Test covers the happy path well.
The test correctly validates retrieval of a single PCI address from a machine store.
Consider adding tests for edge cases in a follow-up:
- Empty store (no machines)
- Machine with nil/empty Gpu field
- Multiple machines with multiple GPUs to verify aggregation

A sketch of the nil-Gpu case follows.
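This is a hedged sketch in the suite's Ginkgo/Gomega style; the store setup is assumed to mirror the existing happy-path test, and the machineStore helper name is an assumption.

It("should return no PCI addresses for a machine with nil Gpu field", func(ctx SpecContext) {
	By("creating a machine whose Spec.Gpu is nil")
	// machine creation via the store mirrors the happy-path test setup

	By("ensuring no claimed PCI addresses are reported")
	addrs, err := utils.GetClaimedPCIAddressesFromMachineStore(ctx, machineStore)
	Expect(err).NotTo(HaveOccurred())
	Expect(addrs).To(BeEmpty())
})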
internal/utils/utils_suite_test.go (1)
13-16: Consider renaming the test function for clarity.
The function TestSync doesn't match the suite name "Utils Suite". Consider renaming to TestUtils for consistency and clarity.
Suggested rename
-func TestSync(t *testing.T) {
+func TestUtils(t *testing.T) {
 	RegisterFailHandler(Fail)
 	RunSpecs(t, "Utils Suite")
 }
internal/server/machine_create_test.go (1)
170-187: Consider adding an assertion on the error message.
The test validates that machine creation fails when insufficient GPUs are available, which is the correct behavior. Consider adding an assertion on the error message or code to ensure the failure is specifically due to GPU unavailability rather than another issue.
Suggested enhancement
Suggested enhancement
Expect(err).To(HaveOccurred()) + Expect(err.Error()).To(ContainSubstring("GPU")) // or a more specific error substring Expect(createResp).To(BeNil())internal/controllers/machine_controller_gpus.go (1)
52-64: Consider adding a nil/empty check to avoid unnecessary operations.
The function will attempt to release claims even when pciAddrs is empty, which could result in unnecessary work or API calls. Consider adding an early return for empty input.
Suggested enhancement
 func (r *MachineReconciler) releaseResourceClaims(ctx context.Context, log logr.Logger, pciAddrs []pci.Address) error {
+	if len(pciAddrs) == 0 {
+		return nil
+	}
 	log.V(2).Info("Releasing GPU claims", "pciAddresses", pciAddrs)
cmd/libvirt-provider/app/app.go (1)
320-324: GPU support hardcoded to Nvidia 3D controllers.
The PCI reader is initialized with pci.VendorNvidia and pci.Class3DController, restricting GPU detection to Nvidia devices using PCI class 0x0302. This excludes AMD Instinct accelerators (which enumerate as class 0x0380) and Intel Data Center GPUs (also class 0x0380). If multi-vendor GPU support is planned, consider making the vendor and device class configurable.
internal/server/machine_create.go (1)
34-40: Guard the GPU claim type assertion.
A direct type assertion can panic if the claims map contains an unexpected type; prefer a checked assertion and fall back to empty results.
🔧 Suggested hardening
- if gpuClaim, ok := claims[api.NvidiaGPUPlugin]; ok {
- 	gpu := gpuClaim.(gpu.Claim)
- 	return gpu.PCIAddresses()
- }
+ if raw, ok := claims[api.NvidiaGPUPlugin]; ok {
+ 	if gpuClaim, ok := raw.(gpu.Claim); ok {
+ 		return gpuClaim.PCIAddresses()
+ 	}
+ }
internal/server/server_suite_test.go (1)
202-215: Wait for the ResourceClaimer to start before proceeding.
WaitUntilStarted currently runs inside the goroutine, so BeforeSuite can continue before the claimer is ready. Consider waiting in the main goroutine and use a cancelable context for cleanup to avoid flakes.
✅ Suggested adjustment
 resClaimer, err := claim.NewResourceClaimer(
 	log,
 	gpu.NewGPUClaimPlugin(log, api.NvidiaGPUPlugin, NewTestingPCIReader([]pci.Address{
 		{Domain: 0, Bus: 3, Slot: 0, Function: 0},
 		{Domain: 0, Bus: 3, Slot: 0, Function: 1},
 	}), []pci.Address{}),
 )
 Expect(err).ToNot(HaveOccurred())
+claimerCtx, cancelClaimer := context.WithCancel(context.Background())
+DeferCleanup(cancelClaimer)
 go func() {
 	defer GinkgoRecover()
-	err := resClaimer.Start(context.Background())
+	err := resClaimer.Start(claimerCtx)
 	Expect(err).ToNot(HaveOccurred())
-	Expect(resClaimer.WaitUntilStarted(ctx)).To(BeNil())
 }()
+Expect(resClaimer.WaitUntilStarted(ctx)).To(BeNil())
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
internal/server/server_suite_test.go (1)
1-1: Fix copyright year typo.
The copyright year "20253" appears to be a typo. It should likely be "2023" or "2025".
-// SPDX-FileCopyrightText: 20253 SAP SE or an SAP affiliate company and IronCore contributors
+// SPDX-FileCopyrightText: 2023 SAP SE or an SAP affiliate company and IronCore contributors
🤖 Fix all issues with AI agents
In `@internal/server/server_suite_test.go`:
- Around line 136-145: The machine class named machineClassx3xlargegpu requests
4 GPUs but the test PCI fixture from NewTestingPCIReader only provides 2
addresses, causing allocation failures; fix by making the GPU counts
consistent—either change the Resources GPU entry in machineClassx3xlargegpu from
4 to 2, or add two additional PCI addresses in NewTestingPCIReader (and the
similar test PCI setup referenced around the same area) so it supplies four
addresses to match the class.
🧹 Nitpick comments (7)
internal/utils/utils_suite_test.go (1)
13-16: Consider renaming the test function for consistency.
The function TestSync doesn't align with the suite name "Utils Suite". Convention is to name it TestUtils to match the package and suite being tested.
Suggested fix
-func TestSync(t *testing.T) {
+func TestUtils(t *testing.T) {
 	RegisterFailHandler(Fail)
 	RunSpecs(t, "Utils Suite")
 }
internal/server/server.go (1)
90-104: Consider validating that ResourceClaimer is not nil when GPU-enabled machine classes exist.
If opts.ResourceClaimer is nil but a machine class with GPU resources is requested, createMachineFromIRIMachine will panic or fail unexpectedly when attempting to claim GPUs. You may want to either:
- Validate at startup that ResourceClaimer is set if GPU machine classes are configured, or
- Add a nil check in the GPU claiming code path with a descriptive error.
internal/server/machine_create_test.go (1)
170-187: Consider asserting on the specific error message or type.
The test verifies that an error occurs when insufficient GPUs are available, but doesn't validate the error content. Asserting on the error message or gRPC status code would make the test more robust against regressions where a different error path might trigger.
Expect(err).To(HaveOccurred())
// Consider adding:
// Expect(err.Error()).To(ContainSubstring("failed to claim GPU"))
internal/controllers/machine_controller_gpus.go (1)
52-64: Consider adding a short-circuit for empty PCI addresses.
If pciAddrs is empty, the method still constructs a claim and calls Release. While this likely works, adding an early return would clarify intent and avoid unnecessary work:
Suggested improvement
 func (r *MachineReconciler) releaseResourceClaims(ctx context.Context, log logr.Logger, pciAddrs []pci.Address) error {
+	if len(pciAddrs) == 0 {
+		return nil
+	}
 	log.V(2).Info("Releasing GPU claims", "pciAddresses", pciAddrs)
cmd/libvirt-provider/app/app.go (1)
19-20: Consider grouping the claim import with other provider-utils imports.
The claim import on line 19 is separated from other provider-utils imports (lines 43-49) with a blank line. For consistency, consider moving it to be adjacent to the other provider-utils imports.
internal/server/machine_create.go (1)
34-41: Consider adding a type assertion check for defensive coding.
The type assertion claims[api.NvidiaGPUPlugin].(gpu.Claim) on line 36 will panic if the claim is not of type gpu.Claim. While this is unlikely given the controlled input, adding a comma-ok check would be more defensive:

if gpuClaim, ok := claims[api.NvidiaGPUPlugin]; ok {
	if gc, ok := gpuClaim.(gpu.Claim); ok {
		return gc.PCIAddresses()
	}
}
internal/server/server_suite_test.go (1)
210-215: Inconsistent context usage in ResourceClaimer goroutine.
resClaimer.Start uses context.Background() while WaitUntilStarted uses ctx (SpecContext). Consider using a consistent context, likely derived from the spec context, so the claimer properly respects test lifecycle cancellation.

 go func() {
 	defer GinkgoRecover()
-	err := resClaimer.Start(context.Background())
+	err := resClaimer.Start(ctx)
 	Expect(err).ToNot(HaveOccurred())
 	Expect(resClaimer.WaitUntilStarted(ctx)).To(BeNil())
 }()
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
internal/server/server_suite_test.go (1)
1-1: Fix copyright year typo.
The copyright year 20253 should be 2025 (or 2023 to match other files in this PR).
🔧 Proposed fix
-// SPDX-FileCopyrightText: 20253 SAP SE or an SAP affiliate company and IronCore contributors
+// SPDX-FileCopyrightText: 2025 SAP SE or an SAP affiliate company and IronCore contributors
internal/controllers/machine_controller.go (1)
66-79: Add nil validation for ResourceClaimer in the constructor or handle nil in releaseResourceClaims.
The code calls r.resourceClaimer.Release() in releaseResourceClaims (line 57 of machine_controller_gpus.go) without any nil check. Unlike host, machines, and machineEvents, which validate non-nil in the constructor, ResourceClaimer is assigned without validation. Since the test suite omits ResourceClaimer from MachineReconcilerOptions, this will cause a nil pointer panic if machine.Spec.Gpu is populated. Either validate ResourceClaimer as mandatory in NewMachineReconciler, or add a nil check and early return in releaseResourceClaims.
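A minimal sketch of the constructor-validation option, following the non-nil checks the comment says exist for host, machines, and machineEvents; the NewMachineReconciler signature shown is an assumption.

func NewMachineReconciler(log logr.Logger, opts MachineReconcilerOptions) (*MachineReconciler, error) {
	// Treat the claimer as a mandatory dependency, like the other options.
	if opts.ResourceClaimer == nil {
		return nil, fmt.Errorf("must specify ResourceClaimer")
	}
	// ... remaining option checks and field assignments as before ...
	return &MachineReconciler{resourceClaimer: opts.ResourceClaimer /* ... */}, nil
}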
🤖 Fix all issues with AI agents
In `@cmd/libvirt-provider/app/app.go`:
- Around line 406-413: The current g.Go closure calls resClaimer.Start(ctx) and
then immediately returns resClaimer.WaitUntilStarted(ctx), but Start(ctx) can
block until context cancellation, making WaitUntilStarted unreachable; change
the flow so WaitUntilStarted is awaited before starting dependent components:
spawn resClaimer.Start(ctx) in its own goroutine (or call Start non-blocking),
then call resClaimer.WaitUntilStarted(ctx) from the main flow (or the same g.Go
before starting machineReconciler), and only after WaitUntilStarted completes
proceed to start or g.Go the machineReconciler; reference resClaimer.Start and
resClaimer.WaitUntilStarted to locate where to split the blocking start and the
readiness wait.
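A sketch of the split described in the prompt above, assuming the errgroup g and ctx already in use in app.go; the exact wiring around machineReconciler is an assumption.

// Run the claimer's potentially blocking Start in its own errgroup goroutine.
g.Go(func() error {
	return resClaimer.Start(ctx)
})

// Wait for readiness in the main flow before starting dependent components.
if err := resClaimer.WaitUntilStarted(ctx); err != nil {
	return fmt.Errorf("resource claimer failed to start: %w", err)
}

g.Go(func() error {
	return machineReconciler.Start(ctx)
})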
In `@internal/controllers/machine_controller_gpus_test.go`:
- Around line 88-89: Fix the typos in the test case description and its By
message: update the It(...) string in machine_controller_gpus_test.go from
"should return empy host devices for nil GPU field" to "should return empty host
devices for nil GPU field", and update the By(...) call from "creating a machine
with nil Gu field" to "creating a machine with nil Gpu field" so both strings
read correctly; locate the literals inside the test case (the It and By calls)
and change only the misspelled words.
In `@internal/server/machine_create.go`:
- Around line 34-41: The getPCIAddresses function uses an unsafe type assertion
gpuClaim.(gpu.Claim); replace it with the comma-ok form so you only call
PCIAddresses() when the assertion succeeds. In other words, inside
getPCIAddresses check "gpu, ok := gpuClaim.(gpu.Claim)" and if ok return
gpu.PCIAddresses(), otherwise return an empty []pci.Address{}; this prevents a
panic when the claim exists but is not the expected type.
🧹 Nitpick comments (4)
internal/server/machine_create_test.go (1)
203-206: Consider using a more robust error assertion.
The exact error message check on line 205 is fragile and may break if the error message format changes. Consider checking for the error type or a key substring.
♻️ Suggested improvement
By("ensuring the correct error is returned") Expect(err).To(HaveOccurred()) - Expect(err.Error()).To(ContainSubstring("failed to claim GPUs: insufficient resources\ninsufficient resource for nvidia.com/gpu")) + Expect(err.Error()).To(ContainSubstring("insufficient resource")) Expect(createResp2).To(BeNil())internal/controllers/machine_controller_gpus.go (1)
19-50: Add nil check to prevent potential panic.
The function doesn't check if machine or machine.Spec is nil before accessing machine.Spec.Gpu. While callers may guarantee non-nil values, defensive coding would prevent potential panics.
♻️ Proposed defensive check
 func claimedGPUsToHostDevs(machine *api.Machine) []libvirtxml.DomainHostdev {
+	if machine == nil || len(machine.Spec.Gpu) == 0 {
+		return nil
+	}
 	hostDevs := make([]libvirtxml.DomainHostdev, len(machine.Spec.Gpu))
internal/server/server.go (1)
71-71: Consider validating ResourceClaimer is not nil.
Unlike IDGen, which has a default value set in setOptionsDefaults, ResourceClaimer has no default and no validation. If a caller omits this option, s.resourceClaimer.Claim() in machine_create.go will panic on nil pointer dereference.
♻️ Option 1: Add validation in New()
 func New(opts Options) (*Server, error) {
 	setOptionsDefaults(&opts)
+	if opts.ResourceClaimer == nil {
+		return nil, fmt.Errorf("ResourceClaimer is required")
+	}
+
 	baseURL, err := url.ParseRequestURI(opts.BaseURL)
If GPU support should be optional, consider providing a no-op claimer implementation as a default.
Also applies to: 76-80, 96-96
internal/server/server_suite_test.go (1)
294-306: Consider using pointer receiver for consistency.
NewTestingPCIReader returns *TestingPCIReader, but Read() uses a value receiver. While this works, using a pointer receiver would be more consistent.
♻️ Suggested change
-func (t TestingPCIReader) Read() ([]pci.Address, error) {
+func (t *TestingPCIReader) Read() ([]pci.Address, error) {
 	return t.pciAddrs, nil
 }
@coderabbitai help
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
internal/controllers/controllers_suite_test.go (1)
188-212: Add ResourceClaimer field to MachineReconcilerOptions in the machine controller initialization.
The resource claimer is initialized at lines 189-192 but not passed to the controller. The MachineReconcilerOptions struct (line 73 of machine_controller.go) includes a ResourceClaimer field, but it's missing from the MachineReconcilerOptions struct initialization at lines 195-211. Add ResourceClaimer: resClaimer, to match the pattern used in server_suite_test.go (line 209). A sketch of the wiring follows.
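This sketch mirrors the pattern the comment cites from server_suite_test.go; the surrounding option fields are placeholders for whatever the suite already passes.

machineReconciler, err := NewMachineReconciler(log, MachineReconcilerOptions{
	// ... existing options from lines 195-211 ...
	ResourceClaimer: resClaimer, // pass the claimer initialized at lines 189-192
})
Expect(err).ToNot(HaveOccurred())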
🧹 Nitpick comments (1)
internal/controllers/controllers_suite_test.go (1)
303-316: Duplicate TestingPCIReader implementation.
This type, its Read method, and NewTestingPCIReader constructor are identical to the implementation in internal/server_suite_test.go (lines 293-305). Consider extracting to a shared test utilities package to avoid duplication.
♻️ Suggested approach
Create a shared test helper, e.g., internal/testutil/pci_reader.go:

package testutil

import "github.com/ironcore-dev/provider-utils/claimutils/pci"

type TestingPCIReader struct {
	pciAddrs []pci.Address
}

func (t TestingPCIReader) Read() ([]pci.Address, error) {
	return t.pciAddrs, nil
}

func NewTestingPCIReader(addrs []pci.Address) *TestingPCIReader {
	return &TestingPCIReader{
		pciAddrs: addrs,
	}
}

Then import and use testutil.NewTestingPCIReader in both test suites.
Just a small nitpick: can we stick to PR titles that do not include the categorization (e.g. "feature")? I would suggest keeping the release log consistent with how it is generated at the moment: https://github.com/ironcore-dev/libvirt-provider/releases
…gh' into feat/678-implement-gpu-passthrough
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
internal/server/server_suite_test.go (1)
1-1: ⚠️ Potential issue | 🟡 Minor
Typo in copyright year.
The year 20253 appears to be a typo. Should this be 2023 or 2025?
Proposed fix
-// SPDX-FileCopyrightText: 20253 SAP SE or an SAP affiliate company and IronCore contributors
+// SPDX-FileCopyrightText: 2025 SAP SE or an SAP affiliate company and IronCore contributors
🧹 Nitpick comments (2)
internal/server/server_suite_test.go (1)
254-291: LGTM! Good cleanup logic that properly releases GPU claims before machine deletion.
Minor nit: line 273 claimer := resClaimer assigns the package variable to a local variable unnecessarily. You could use resClaimer directly.
Optional simplification
 // Release GPU claims
 if len(m.Spec.Gpu) > 0 {
 	GinkgoWriter.Printf("Releasing claims ID=%s: claims=%s\n", machineID, m.Spec.Gpu)
-	claimer := resClaimer
-	err = claimer.Release(ctx, claim.Claims{
+	err = resClaimer.Release(ctx, claim.Claims{
 		api.NvidiaGPUPlugin: gpu.NewGPUClaim(m.Spec.Gpu),
 	})
internal/server/machine_create.go (1)
103-114: Optional: Skip resource claim when class has no GPUs.
When filterNvidiaGPUResources returns an empty ResourceList (no GPU plugin in class capabilities), calling s.resourceClaimer.Claim is redundant. While safe, this can be optimized by checking the filtered resources first and skipping both the claim and address extraction when no GPUs are required.
♻️ Optional optimization
-	claimedGPUs, err = s.resourceClaimer.Claim(ctx, filterNvidiaGPUResources(class.Capabilities.Resources))
-	if err != nil {
-		log.Error(err, "Failed to claim GPUs")
-		return nil, fmt.Errorf("failed to claim GPUs: %w", err)
-	}
-
-	pciAddrs, err := getPCIAddresses(claimedGPUs)
-	if err != nil {
-		log.Error(err, "Failed to get PCI addresses from GPU claims")
-		return nil, fmt.Errorf("failed to get PCI addresses: %w", err)
+	gpuResources := filterNvidiaGPUResources(class.Capabilities.Resources)
+	var pciAddrs []pci.Address
+	if len(gpuResources) > 0 {
+		claimedGPUs, err = s.resourceClaimer.Claim(ctx, gpuResources)
+		if err != nil {
+			log.Error(err, "Failed to claim GPUs")
+			return nil, fmt.Errorf("failed to claim GPUs: %w", err)
+		}
+
+		pciAddrs, err = getPCIAddresses(claimedGPUs)
+		if err != nil {
+			log.Error(err, "Failed to get PCI addresses from GPU claims")
+			return nil, fmt.Errorf("failed to get PCI addresses: %w", err)
+		}
+		log.V(2).Info("Claimed GPU PCI addresses", "pciAddresses", pciAddrs)
 	}
-	log.V(2).Info("Claimed GPU PCI addresses", "pciAddresses", fmt.Sprintf("%v", pciAddrs))
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@internal/server/server_suite_test.go`:
- Around line 270-280: The cleanupMachine block can re-release GPU claims on
Eventually retries; after a successful call to resClaimer.Release in
cleanupMachine (where it uses m.Spec.Gpu), clear or nil out m.Spec.Gpu
immediately (e.g., set m.Spec.Gpu = nil or empty slice) before proceeding to
machineStore.Delete so that if the surrounding Eventually retries the function
it won't call resClaimer.Release again with the same specs; update the code
around the resClaimer.Release call and subsequent error handling to clear
m.Spec.Gpu only after Release returns nil.
🧹 Nitpick comments (1)
internal/server/server_suite_test.go (1)
292-304: Duplicate TestingPCIReader implementation.
This type is identically defined in internal/controllers/controllers_suite_test.go (lines 304-316). Consider extracting it to a shared test utilities package (e.g., internal/testutil/pci.go) to reduce duplication and ensure consistent behavior across test suites.
	// Release GPU claims
	if len(m.Spec.Gpu) > 0 {
		GinkgoWriter.Printf("Releasing claims ID=%s: claims=%s\n", machineID, m.Spec.Gpu)
		err = resClaimer.Release(ctx, claim.Claims{
			api.NvidiaGPUPlugin: gpu.NewGPUClaim(m.Spec.Gpu),
		})
		if err != nil {
			GinkgoWriter.Printf("Releasing claims failed ID=%s: claims=%v, err=%v\n", machineID, m.Spec.Gpu, err)
			return err
		}
	}
🧩 Analysis chain
🏁 Script executed:
# First, let's find the Release method implementation in the claim/claimer
fd -t f -e go | xargs grep -l "func.*Release" | head -20
Repository: ironcore-dev/libvirt-provider
Length of output: 104
🏁 Script executed:
# Search for the claim package and Release method
rg -n "func.*Release\(" --type go -A 5 | head -100Repository: ironcore-dev/libvirt-provider
Length of output: 487
🏁 Script executed:
# Find the claimer interface definition and Release signature
ast-grep --pattern $'interface Claimer {
$$$
}'
Repository: ironcore-dev/libvirt-provider
Length of output: 55
🏁 Script executed:
# Look for GPU claim release implementation
rg -n "Release" --type go | grep -i "claim\|gpu" | head -30Repository: ironcore-dev/libvirt-provider
Length of output: 720
🏁 Script executed:
# Find where ResourceClaimer is instantiated in server_suite_test.go
sed -n '190,200p' internal/server/server_suite_test.go
Repository: ironcore-dev/libvirt-provider
Length of output: 459
🏁 Script executed:
# Get full cleanupMachine function context
sed -n '250,310p' internal/server/server_suite_test.go
Repository: ironcore-dev/libvirt-provider
Length of output: 1480
🏁 Script executed:
# Check how Release is handled in other parts of the code (machine_create.go)
sed -n '45,60p' internal/server/machine_create.go
Repository: ironcore-dev/libvirt-provider
Length of output: 605
🏁 Script executed:
# Look for any patterns of Release error handling
rg -n -B3 -A3 "\.Release\(" --type go
Repository: ironcore-dev/libvirt-provider
Length of output: 1810
🏁 Script executed:
# Check if there are tests that verify Release behavior on repeated calls
rg -n "Release" --type go internal/server/server_suite_test.go -B2 -A2Repository: ironcore-dev/libvirt-provider
Length of output: 359
🏁 Script executed:
# Look for any tests or patterns in controllers that might show Release idempotency
rg -n "Release" --type go internal/controllers/ -B2 -A2 | head -60Repository: ironcore-dev/libvirt-provider
Length of output: 962
🏁 Script executed:
# Search for any comments or docs about Release idempotency in provider-utils imports
rg -n "claim\." --type go | grep -i "idempotent\|safe\|multiple" | head -20Repository: ironcore-dev/libvirt-provider
Length of output: 55
🏁 Script executed:
# Check if there are any error types returned by Release that suggest idempotency
rg -n "ErrAlreadyReleased\|already.*release" --type goRepository: ironcore-dev/libvirt-provider
Length of output: 55
Fix potential double-release on Eventually retry.
The cleanupMachine function releases GPU claims within an Eventually retry loop. If machineStore.Delete fails after claims are successfully released (line 279), the entire block retries and attempts to release the same claims again. Since m.Spec.Gpu is not modified after the first release, subsequent retries will call resClaimer.Release with identical GPU specs, which could cause issues if Release is not idempotent.
Clear the GPU spec after successful release to prevent retry from re-releasing the same claims:
♻️ Suggested improvement
// Release GPU claims
if len(m.Spec.Gpu) > 0 {
GinkgoWriter.Printf("Releasing claims ID=%s: claims=%s\n", machineID, m.Spec.Gpu)
err = resClaimer.Release(ctx, claim.Claims{
api.NvidiaGPUPlugin: gpu.NewGPUClaim(m.Spec.Gpu),
})
if err != nil {
GinkgoWriter.Printf("Releasing claims failed ID=%s: claims=%v, err=%v\n", machineID, m.Spec.Gpu, err)
return err
}
+ // Clear GPU spec to prevent double-release on retry
+ m.Spec.Gpu = nil
+ if _, err = machineStore.Update(ctx, m); err != nil && !errors.Is(err, store.ErrNotFound) {
+ return err
+ }
}

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
// Release GPU claims
if len(m.Spec.Gpu) > 0 {
	GinkgoWriter.Printf("Releasing claims ID=%s: claims=%s\n", machineID, m.Spec.Gpu)
	err = resClaimer.Release(ctx, claim.Claims{
		api.NvidiaGPUPlugin: gpu.NewGPUClaim(m.Spec.Gpu),
	})
	if err != nil {
		GinkgoWriter.Printf("Releasing claims failed ID=%s: claims=%v, err=%v\n", machineID, m.Spec.Gpu, err)
		return err
	}
	// Clear GPU spec to prevent double-release on retry
	m.Spec.Gpu = nil
	if _, err = machineStore.Update(ctx, m); err != nil && !errors.Is(err, store.ErrNotFound) {
		return err
	}
}
🤖 Prompt for AI Agents
In `@internal/server/server_suite_test.go` around lines 270 - 280, The
cleanupMachine block can re-release GPU claims on Eventually retries; after a
successful call to resClaimer.Release in cleanupMachine (where it uses
m.Spec.Gpu), clear or nil out m.Spec.Gpu immediately (e.g., set m.Spec.Gpu = nil
or empty slice) before proceeding to machineStore.Delete so that if the
surrounding Eventually retries the function it won't call resClaimer.Release
again with the same specs; update the code around the resClaimer.Release call
and subsequent error handling to clear m.Spec.Gpu only after Release returns
nil.
Out of Scope:
Contributes to #678
Summary by CodeRabbit
New Features
Configuration
Behavior
Tests
Chores