Implement Nvidia GPU passthrough #680
base: main
Conversation
Walkthrough
Adds GPU passthrough support.
Changes
Sequence Diagram(s)
sequenceDiagram
participant Client
participant Server
participant ResourceClaimer
participant MachineStore
participant MachineReconciler
participant Libvirt
Client->>Server: CreateMachine(request for GPU-capable class)
Server->>ResourceClaimer: Claim("nvidia.com/gpu" resources)
ResourceClaimer-->>Server: Claims (include PCI addresses)
Server->>MachineStore: Create Machine with Spec.Gpu (PCI addresses)
MachineStore-->>Server: Machine created
Server-->>Client: CreateMachineResponse
MachineReconciler->>MachineStore: Watch/Fetch Machine
MachineReconciler->>MachineReconciler: claimedGPUsToHostDevs(Machine.Spec.Gpu)
MachineReconciler->>Libvirt: Define domain with GPU hostdevs
Libvirt-->>MachineReconciler: Domain configured
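For orientation, here is a minimal sketch of what the claimedGPUsToHostDevs step from the diagram above could look like; this is an illustration, not the PR's actual implementation, and it assumes the pci.Address fields (Domain, Bus, Slot, Function) are integer-valued. The libvirtxml structures are from libvirt.org/go/libvirtxml.

// claimedGPUsToHostDevsSketch converts the machine's claimed PCI addresses
// into libvirt hostdev entries for VFIO passthrough. Hypothetical sketch.
func claimedGPUsToHostDevsSketch(machine *api.Machine) []libvirtxml.DomainHostdev {
	hostDevs := make([]libvirtxml.DomainHostdev, 0, len(machine.Spec.Gpu))
	for _, addr := range machine.Spec.Gpu {
		domain, bus := uint(addr.Domain), uint(addr.Bus)
		slot, function := uint(addr.Slot), uint(addr.Function)
		hostDevs = append(hostDevs, libvirtxml.DomainHostdev{
			Managed: "yes", // let libvirt detach/reattach the host driver
			SubsysPCI: &libvirtxml.DomainHostdevSubsysPCI{
				Source: &libvirtxml.DomainHostdevSubsysPCISource{
					Address: &libvirtxml.DomainAddressPCI{
						Domain: &domain, Bus: &bus, Slot: &slot, Function: &function,
					},
				},
			},
		})
	}
	return hostDevs
}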
sequenceDiagram
participant MachineReconciler
participant MachineStore
participant ResourceClaimer
MachineReconciler->>MachineStore: Reconcile delete event -> Get Machine
MachineStore-->>MachineReconciler: Machine with Spec.Gpu (PCI addresses)
MachineReconciler->>ResourceClaimer: Release GPU claims (PCI addresses)
ResourceClaimer-->>MachineReconciler: Release ack / error
MachineReconciler->>MachineStore: Delete Machine record
MachineStore-->>MachineReconciler: Delete confirmed
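Similarly, a sketch of the delete-path release step in the second diagram, reusing the claim helpers that appear later in this review (claim.Claims, gpu.NewGPUClaim, api.NvidiaGPUPlugin); the exact reconciler wiring is an assumption.

// releaseGPUClaimsSketch hands the machine's claimed PCI addresses back to
// the resource claimer so they can be claimed by future machines. Sketch only.
func (r *MachineReconciler) releaseGPUClaimsSketch(ctx context.Context, machine *api.Machine) error {
	if len(machine.Spec.Gpu) == 0 {
		return nil // machine holds no GPU claims; nothing to release
	}
	return r.resourceClaimer.Release(ctx, claim.Claims{
		api.NvidiaGPUPlugin: gpu.NewGPUClaim(machine.Spec.Gpu),
	})
}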
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~65 minutes
Possibly related issues
🚥 Pre-merge checks: ✅ 1 passed | ❌ 2 failed
❌ Failed checks (1 warning, 1 inconclusive)
✅ Passed checks (1 passed)
hostutils "github.com/ironcore-dev/provider-utils/storeutils/host"
)

func GetClaimedPCIAddressesFromMachineStore(ctx context.Context, machineStore *hostutils.Store[*api.Machine]) ([]pci.Address, error) {
Wdyt about dropping the utils and having a claim package?
I would want to move it to provider-utils, all providers need to do this.
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
internal/server/machine_create.go (1)
90-110: Guard against nil resourceClaimer and skip GPU claiming for CPU-only classes.
The current code calls s.resourceClaimer.Claim() unconditionally, which would panic if resourceClaimer is nil. Additionally, for CPU-only machine classes, filterNvidiaGPUResources() returns an empty resource list, and calling Claim() with empty resources may cause unnecessary errors. Add a guard to check if GPUs are requested before attempting to claim them and verify the claimer is configured.
🛡️ Suggested guard
- gpus, err := s.resourceClaimer.Claim(ctx, filterNvidiaGPUResources(class.Capabilities.Resources))
- if err != nil {
- 	log.Error(err, "Failed to claim GPUs")
- 	return nil, fmt.Errorf("failed to claim GPUs: %w", err)
- }
-
- pciAddrs := getPCIAddresses(gpus)
+ gpuRes := filterNvidiaGPUResources(class.Capabilities.Resources)
+ var gpus claim.Claims
+ if len(gpuRes) > 0 {
+ 	if s.resourceClaimer == nil {
+ 		return nil, fmt.Errorf("gpu resources requested but resource claimer is not configured")
+ 	}
+ 	gpus, err = s.resourceClaimer.Claim(ctx, gpuRes)
+ 	if err != nil {
+ 		log.Error(err, "Failed to claim GPUs")
+ 		return nil, fmt.Errorf("failed to claim GPUs: %w", err)
+ 	}
+ }
+
+ pciAddrs := getPCIAddresses(gpus)
🤖 Fix all issues with AI agents
In `@internal/controllers/machine_controller_gpus_test.go`:
- Around line 14-99: The three It blocks testing claimedGPUsToHostDevs all share
the same description and one By message has a typo ("nil Gu field"), which makes
failures ambiguous; update the three It descriptions to unique, specific strings
(e.g., "should convert two claimed GPUs to libvirt host devices with correct PCI
addresses", "should return empty host devices for empty GPU slice", "should
return empty host devices for nil Gpu field") and fix the typo in the By call
inside the third test from "nil Gu field" to "nil Gpu field"; locate these
changes around the It blocks that call claimedGPUsToHostDevs and adjust only the
description and By message text.
In `@internal/server/machine_create.go`:
- Line 44: When claiming GPUs in machine_create.go (the s.resourceClaimer.Claim
call that obtains gpus), add a deferred cleanup that releases those GPU claims
if any subsequent error occurs (e.g., failures in SetObjectMetadata or
machineStore.Create). Implement a defer func() that checks the surrounding error
return (err) and, when non-nil, constructs the same claim.Claims used elsewhere
(claim.Claims{v1alpha1.ResourceName(api.NvidiaGPUPlugin):
gpu.NewGPUClaim(getPCIAddresses(gpus))}) and calls
s.resourceClaimer.Release(ctx, claims), logging any release error; keep the
Claim/Release logic keyed to the existing symbols s.resourceClaimer.Claim,
s.resourceClaimer.Release, getPCIAddresses, gpu.NewGPUClaim, claim.Claims and
api.NvidiaGPUPlugin so the deferred release runs on all error paths.
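A sketch of the deferred release described in the prompt above, assuming err is a named return value of the surrounding function and reusing the symbols the prompt lists (s.resourceClaimer, getPCIAddresses, gpu.NewGPUClaim, claim.Claims, api.NvidiaGPUPlugin); exact signatures may differ in the PR.

claimedGPUs, err := s.resourceClaimer.Claim(ctx, filterNvidiaGPUResources(class.Capabilities.Resources))
if err != nil {
	return nil, fmt.Errorf("failed to claim GPUs: %w", err)
}
defer func() {
	if err == nil {
		return // creation succeeded; the machine keeps its claims
	}
	// Creation failed after claiming: hand the GPUs back.
	claims := claim.Claims{
		v1alpha1.ResourceName(api.NvidiaGPUPlugin): gpu.NewGPUClaim(getPCIAddresses(claimedGPUs)),
	}
	if relErr := s.resourceClaimer.Release(ctx, claims); relErr != nil {
		log.Error(relErr, "Failed to release GPU claims after create error")
	}
}()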
🧹 Nitpick comments (7)
internal/utils/claims_test.go (1)
19-47: Test covers the happy path well.
The test correctly validates retrieval of a single PCI address from a machine store.
Consider adding tests for edge cases in a follow-up:
- Empty store (no machines)
- Machine with nil/empty Gpu field
- Multiple machines with multiple GPUs to verify aggregation

A sketch of the nil-Gpu case follows.
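This is a hedged sketch in the suite's Ginkgo/Gomega style; the store setup is assumed to mirror the existing happy-path test, and the machineStore helper name is an assumption.

It("should return no PCI addresses for a machine with nil Gpu field", func(ctx SpecContext) {
	By("creating a machine whose Spec.Gpu is nil")
	// machine creation via the store mirrors the happy-path test setup

	By("ensuring no claimed PCI addresses are reported")
	addrs, err := utils.GetClaimedPCIAddressesFromMachineStore(ctx, machineStore)
	Expect(err).NotTo(HaveOccurred())
	Expect(addrs).To(BeEmpty())
})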
internal/utils/utils_suite_test.go (1)
13-16: Consider renaming the test function for clarity.
The function TestSync doesn't match the suite name "Utils Suite". Consider renaming to TestUtils for consistency and clarity.
Suggested rename
-func TestSync(t *testing.T) {
+func TestUtils(t *testing.T) {
 	RegisterFailHandler(Fail)
 	RunSpecs(t, "Utils Suite")
 }
internal/server/machine_create_test.go (1)
170-187: Consider adding an assertion on the error message.
The test validates that machine creation fails when insufficient GPUs are available, which is the correct behavior. Consider adding an assertion on the error message or code to ensure the failure is specifically due to GPU unavailability rather than another issue.
Suggested enhancement
Suggested enhancement
Expect(err).To(HaveOccurred()) + Expect(err.Error()).To(ContainSubstring("GPU")) // or a more specific error substring Expect(createResp).To(BeNil())internal/controllers/machine_controller_gpus.go (1)
52-64: Consider adding a nil/empty check to avoid unnecessary operations.
The function will attempt to release claims even when pciAddrs is empty, which could result in unnecessary work or API calls. Consider adding an early return for empty input.
Suggested enhancement
 func (r *MachineReconciler) releaseResourceClaims(ctx context.Context, log logr.Logger, pciAddrs []pci.Address) error {
+	if len(pciAddrs) == 0 {
+		return nil
+	}
 	log.V(2).Info("Releasing GPU claims", "pciAddresses", pciAddrs)
cmd/libvirt-provider/app/app.go (1)
320-324: GPU support hardcoded to Nvidia 3D controllers.
The PCI reader is initialized with pci.VendorNvidia and pci.Class3DController, restricting GPU detection to Nvidia devices using PCI class 0x0302. This excludes AMD Instinct accelerators (which enumerate as class 0x0380) and Intel Data Center GPUs (also class 0x0380). If multi-vendor GPU support is planned, consider making the vendor and device class configurable.
internal/server/machine_create.go (1)
34-40: Guard the GPU claim type assertion.
A direct type assertion can panic if the claims map contains an unexpected type; prefer a checked assertion and fall back to empty results.
🔧 Suggested hardening
- if gpuClaim, ok := claims[api.NvidiaGPUPlugin]; ok {
- 	gpu := gpuClaim.(gpu.Claim)
- 	return gpu.PCIAddresses()
- }
+ if raw, ok := claims[api.NvidiaGPUPlugin]; ok {
+ 	if gpuClaim, ok := raw.(gpu.Claim); ok {
+ 		return gpuClaim.PCIAddresses()
+ 	}
+ }
internal/server/server_suite_test.go (1)
202-215: Wait for the ResourceClaimer to start before proceeding.
WaitUntilStarted currently runs inside the goroutine, so BeforeSuite can continue before the claimer is ready. Consider waiting in the main goroutine and use a cancelable context for cleanup to avoid flakes.
✅ Suggested adjustment
 resClaimer, err := claim.NewResourceClaimer(
 	log,
 	gpu.NewGPUClaimPlugin(log, api.NvidiaGPUPlugin, NewTestingPCIReader([]pci.Address{
 		{Domain: 0, Bus: 3, Slot: 0, Function: 0},
 		{Domain: 0, Bus: 3, Slot: 0, Function: 1},
 	}), []pci.Address{}),
 )
 Expect(err).ToNot(HaveOccurred())
+claimerCtx, cancelClaimer := context.WithCancel(context.Background())
+DeferCleanup(cancelClaimer)
 go func() {
 	defer GinkgoRecover()
-	err := resClaimer.Start(context.Background())
+	err := resClaimer.Start(claimerCtx)
 	Expect(err).ToNot(HaveOccurred())
-	Expect(resClaimer.WaitUntilStarted(ctx)).To(BeNil())
 }()
+Expect(resClaimer.WaitUntilStarted(ctx)).To(BeNil())
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
internal/server/server_suite_test.go (1)
1-1: Fix copyright year typo.
The copyright year "20253" appears to be a typo. It should likely be "2023" or "2025".
-// SPDX-FileCopyrightText: 20253 SAP SE or an SAP affiliate company and IronCore contributors
+// SPDX-FileCopyrightText: 2023 SAP SE or an SAP affiliate company and IronCore contributors
🤖 Fix all issues with AI agents
In `@internal/server/server_suite_test.go`:
- Around line 136-145: The machine class named machineClassx3xlargegpu requests
4 GPUs but the test PCI fixture from NewTestingPCIReader only provides 2
addresses, causing allocation failures; fix by making the GPU counts
consistent—either change the Resources GPU entry in machineClassx3xlargegpu from
4 to 2, or add two additional PCI addresses in NewTestingPCIReader (and the
similar test PCI setup referenced around the same area) so it supplies four
addresses to match the class.
🧹 Nitpick comments (7)
internal/utils/utils_suite_test.go (1)
13-16: Consider renaming the test function for consistency.
The function TestSync doesn't align with the suite name "Utils Suite". Convention is to name it TestUtils to match the package and suite being tested.
Suggested fix
-func TestSync(t *testing.T) {
+func TestUtils(t *testing.T) {
 	RegisterFailHandler(Fail)
 	RunSpecs(t, "Utils Suite")
 }
internal/server/server.go (1)
90-104: Consider validating that ResourceClaimer is not nil when GPU-enabled machine classes exist.
If opts.ResourceClaimer is nil but a machine class with GPU resources is requested, createMachineFromIRIMachine will panic or fail unexpectedly when attempting to claim GPUs. You may want to either:
- Validate at startup that ResourceClaimer is set if GPU machine classes are configured, or
- Add a nil check in the GPU claiming code path with a descriptive error.
internal/server/machine_create_test.go (1)
170-187: Consider asserting on the specific error message or type.
The test verifies that an error occurs when insufficient GPUs are available, but doesn't validate the error content. Asserting on the error message or gRPC status code would make the test more robust against regressions where a different error path might trigger.
Expect(err).To(HaveOccurred())
// Consider adding:
// Expect(err.Error()).To(ContainSubstring("failed to claim GPU"))
internal/controllers/machine_controller_gpus.go (1)
52-64: Consider adding a short-circuit for empty PCI addresses.
If pciAddrs is empty, the method still constructs a claim and calls Release. While this likely works, adding an early return would clarify intent and avoid unnecessary work:
Suggested improvement
 func (r *MachineReconciler) releaseResourceClaims(ctx context.Context, log logr.Logger, pciAddrs []pci.Address) error {
+	if len(pciAddrs) == 0 {
+		return nil
+	}
 	log.V(2).Info("Releasing GPU claims", "pciAddresses", pciAddrs)
cmd/libvirt-provider/app/app.go (1)
19-20: Consider grouping the claim import with other provider-utils imports.
The claim import on line 19 is separated from other provider-utils imports (lines 43-49) with a blank line. For consistency, consider moving it to be adjacent to the other provider-utils imports.
internal/server/machine_create.go (1)
34-41: Consider adding a type assertion check for defensive coding.
The type assertion claims[api.NvidiaGPUPlugin].(gpu.Claim) on line 36 will panic if the claim is not of type gpu.Claim. While this is unlikely given the controlled input, adding a comma-ok check would be more defensive:

if gpuClaim, ok := claims[api.NvidiaGPUPlugin]; ok {
	if gc, ok := gpuClaim.(gpu.Claim); ok {
		return gc.PCIAddresses()
	}
}
internal/server/server_suite_test.go (1)
210-215: Inconsistent context usage in ResourceClaimer goroutine.
resClaimer.Start uses context.Background() while WaitUntilStarted uses ctx (SpecContext). Consider using a consistent context, likely derived from the spec context, so the claimer properly respects test lifecycle cancellation.

 go func() {
 	defer GinkgoRecover()
-	err := resClaimer.Start(context.Background())
+	err := resClaimer.Start(ctx)
 	Expect(err).ToNot(HaveOccurred())
 	Expect(resClaimer.WaitUntilStarted(ctx)).To(BeNil())
 }()
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
internal/server/server_suite_test.go (1)
1-1: Fix copyright year typo.
The copyright year 20253 should be 2025 (or 2023 to match other files in this PR).
🔧 Proposed fix
-// SPDX-FileCopyrightText: 20253 SAP SE or an SAP affiliate company and IronCore contributors
+// SPDX-FileCopyrightText: 2025 SAP SE or an SAP affiliate company and IronCore contributors
internal/controllers/machine_controller.go (1)
66-79: Add nil validation for ResourceClaimer in the constructor or handle nil in releaseResourceClaims.
The code calls r.resourceClaimer.Release() in releaseResourceClaims (line 57 of machine_controller_gpus.go) without any nil check. Unlike host, machines, and machineEvents, which validate non-nil in the constructor, ResourceClaimer is assigned without validation. Since the test suite omits ResourceClaimer from MachineReconcilerOptions, this will cause a nil pointer panic if machine.Spec.Gpu is populated. Either validate ResourceClaimer as mandatory in NewMachineReconciler, or add a nil check and early return in releaseResourceClaims.
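A minimal sketch of the constructor-validation option, following the non-nil checks the comment says exist for host, machines, and machineEvents; the NewMachineReconciler signature shown is an assumption.

func NewMachineReconciler(log logr.Logger, opts MachineReconcilerOptions) (*MachineReconciler, error) {
	// Treat the claimer as a mandatory dependency, like the other options.
	if opts.ResourceClaimer == nil {
		return nil, fmt.Errorf("must specify ResourceClaimer")
	}
	// ... remaining option checks and field assignments as before ...
	return &MachineReconciler{resourceClaimer: opts.ResourceClaimer /* ... */}, nil
}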
🤖 Fix all issues with AI agents
In `@cmd/libvirt-provider/app/app.go`:
- Around line 406-413: The current g.Go closure calls resClaimer.Start(ctx) and
then immediately returns resClaimer.WaitUntilStarted(ctx), but Start(ctx) can
block until context cancellation, making WaitUntilStarted unreachable; change
the flow so WaitUntilStarted is awaited before starting dependent components:
spawn resClaimer.Start(ctx) in its own goroutine (or call Start non-blocking),
then call resClaimer.WaitUntilStarted(ctx) from the main flow (or the same g.Go
before starting machineReconciler), and only after WaitUntilStarted completes
proceed to start or g.Go the machineReconciler; reference resClaimer.Start and
resClaimer.WaitUntilStarted to locate where to split the blocking start and the
readiness wait.
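A sketch of the split described in the prompt above, assuming the errgroup g and ctx already in use in app.go; the exact wiring around machineReconciler is an assumption.

// Run the claimer's potentially blocking Start in its own errgroup goroutine.
g.Go(func() error {
	return resClaimer.Start(ctx)
})

// Wait for readiness in the main flow before starting dependent components.
if err := resClaimer.WaitUntilStarted(ctx); err != nil {
	return fmt.Errorf("resource claimer failed to start: %w", err)
}

g.Go(func() error {
	return machineReconciler.Start(ctx)
})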
In `@internal/controllers/machine_controller_gpus_test.go`:
- Around line 88-89: Fix the typos in the test case description and its By
message: update the It(...) string in machine_controller_gpus_test.go from
"should return empy host devices for nil GPU field" to "should return empty host
devices for nil GPU field", and update the By(...) call from "creating a machine
with nil Gu field" to "creating a machine with nil Gpu field" so both strings
read correctly; locate the literals inside the test case (the It and By calls)
and change only the misspelled words.
In `@internal/server/machine_create.go`:
- Around line 34-41: The getPCIAddresses function uses an unsafe type assertion
gpuClaim.(gpu.Claim); replace it with the comma-ok form so you only call
PCIAddresses() when the assertion succeeds. In other words, inside
getPCIAddresses check "gpu, ok := gpuClaim.(gpu.Claim)" and if ok return
gpu.PCIAddresses(), otherwise return an empty []pci.Address{}; this prevents a
panic when the claim exists but is not the expected type.
🧹 Nitpick comments (4)
internal/server/machine_create_test.go (1)
203-206: Consider using a more robust error assertion.
The exact error message check on line 205 is fragile and may break if the error message format changes. Consider checking for the error type or a key substring.
♻️ Suggested improvement
By("ensuring the correct error is returned") Expect(err).To(HaveOccurred()) - Expect(err.Error()).To(ContainSubstring("failed to claim GPUs: insufficient resources\ninsufficient resource for nvidia.com/gpu")) + Expect(err.Error()).To(ContainSubstring("insufficient resource")) Expect(createResp2).To(BeNil())internal/controllers/machine_controller_gpus.go (1)
19-50: Add nil check to prevent potential panic.
The function doesn't check if machine or machine.Spec is nil before accessing machine.Spec.Gpu. While callers may guarantee non-nil values, defensive coding would prevent potential panics.
♻️ Proposed defensive check
 func claimedGPUsToHostDevs(machine *api.Machine) []libvirtxml.DomainHostdev {
+	if machine == nil || len(machine.Spec.Gpu) == 0 {
+		return nil
+	}
 	hostDevs := make([]libvirtxml.DomainHostdev, len(machine.Spec.Gpu))
internal/server/server.go (1)
71-71: Consider validating ResourceClaimer is not nil.
Unlike IDGen, which has a default value set in setOptionsDefaults, ResourceClaimer has no default and no validation. If a caller omits this option, s.resourceClaimer.Claim() in machine_create.go will panic on nil pointer dereference.
♻️ Option 1: Add validation in New()
 func New(opts Options) (*Server, error) {
 	setOptionsDefaults(&opts)
+	if opts.ResourceClaimer == nil {
+		return nil, fmt.Errorf("ResourceClaimer is required")
+	}
+
 	baseURL, err := url.ParseRequestURI(opts.BaseURL)
If GPU support should be optional, consider providing a no-op claimer implementation as a default.
Also applies to: 76-80, 96-96
internal/server/server_suite_test.go (1)
294-306: Consider using pointer receiver for consistency.
NewTestingPCIReader returns *TestingPCIReader, but Read() uses a value receiver. While this works, using a pointer receiver would be more consistent.
♻️ Suggested change
-func (t TestingPCIReader) Read() ([]pci.Address, error) {
+func (t *TestingPCIReader) Read() ([]pci.Address, error) {
 	return t.pciAddrs, nil
 }
@coderabbitai help
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
internal/controllers/controllers_suite_test.go (1)
188-212: Add ResourceClaimer field to MachineReconcilerOptions in the machine controller initialization.
The resource claimer is initialized at lines 189-192 but not passed to the controller. The MachineReconcilerOptions struct (line 73 of machine_controller.go) includes a ResourceClaimer field, but it's missing from the MachineReconcilerOptions struct initialization at lines 195-211. Add ResourceClaimer: resClaimer, to match the pattern used in server_suite_test.go (line 209). A sketch of the wiring follows.
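This sketch mirrors the pattern the comment cites from server_suite_test.go; the surrounding option fields are placeholders for whatever the suite already passes.

machineReconciler, err := NewMachineReconciler(log, MachineReconcilerOptions{
	// ... existing options from lines 195-211 ...
	ResourceClaimer: resClaimer, // pass the claimer initialized at lines 189-192
})
Expect(err).ToNot(HaveOccurred())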
🧹 Nitpick comments (1)
internal/controllers/controllers_suite_test.go (1)
303-316: Duplicate TestingPCIReader implementation.
This type, its Read method, and NewTestingPCIReader constructor are identical to the implementation in internal/server_suite_test.go (lines 293-305). Consider extracting to a shared test utilities package to avoid duplication.
♻️ Suggested approach
Create a shared test helper, e.g., internal/testutil/pci_reader.go:

package testutil

import "github.com/ironcore-dev/provider-utils/claimutils/pci"

type TestingPCIReader struct {
	pciAddrs []pci.Address
}

func (t TestingPCIReader) Read() ([]pci.Address, error) {
	return t.pciAddrs, nil
}

func NewTestingPCIReader(addrs []pci.Address) *TestingPCIReader {
	return &TestingPCIReader{
		pciAddrs: addrs,
	}
}

Then import and use testutil.NewTestingPCIReader in both test suites.
Just a small nitpick: can we stick to PR titles that do not include the categorization (e.g. "feature")? I would suggest keeping the release log consistent with how it is generated at the moment: https://github.com/ironcore-dev/libvirt-provider/releases
…gh' into feat/678-implement-gpu-passthrough
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
internal/server/server_suite_test.go (1)
1-1: ⚠️ Potential issue | 🟡 Minor
Typo in copyright year.
The year 20253 appears to be a typo. Should this be 2023 or 2025?
Proposed fix
-// SPDX-FileCopyrightText: 20253 SAP SE or an SAP affiliate company and IronCore contributors
+// SPDX-FileCopyrightText: 2025 SAP SE or an SAP affiliate company and IronCore contributors
🧹 Nitpick comments (2)
internal/server/server_suite_test.go (1)
254-291: LGTM! Good cleanup logic that properly releases GPU claims before machine deletion.
Minor nit: line 273 claimer := resClaimer assigns the package variable to a local variable unnecessarily. You could use resClaimer directly.
Optional simplification
 // Release GPU claims
 if len(m.Spec.Gpu) > 0 {
 	GinkgoWriter.Printf("Releasing claims ID=%s: claims=%s\n", machineID, m.Spec.Gpu)
-	claimer := resClaimer
-	err = claimer.Release(ctx, claim.Claims{
+	err = resClaimer.Release(ctx, claim.Claims{
 		api.NvidiaGPUPlugin: gpu.NewGPUClaim(m.Spec.Gpu),
 	})
internal/server/machine_create.go (1)
103-114: Optional: Skip resource claim when class has no GPUs.
When filterNvidiaGPUResources returns an empty ResourceList (no GPU plugin in class capabilities), calling s.resourceClaimer.Claim is redundant. While safe, this can be optimized by checking the filtered resources first and skipping both the claim and address extraction when no GPUs are required.
♻️ Optional optimization
-	claimedGPUs, err = s.resourceClaimer.Claim(ctx, filterNvidiaGPUResources(class.Capabilities.Resources))
-	if err != nil {
-		log.Error(err, "Failed to claim GPUs")
-		return nil, fmt.Errorf("failed to claim GPUs: %w", err)
-	}
-
-	pciAddrs, err := getPCIAddresses(claimedGPUs)
-	if err != nil {
-		log.Error(err, "Failed to get PCI addresses from GPU claims")
-		return nil, fmt.Errorf("failed to get PCI addresses: %w", err)
+	gpuResources := filterNvidiaGPUResources(class.Capabilities.Resources)
+	var pciAddrs []pci.Address
+	if len(gpuResources) > 0 {
+		claimedGPUs, err = s.resourceClaimer.Claim(ctx, gpuResources)
+		if err != nil {
+			log.Error(err, "Failed to claim GPUs")
+			return nil, fmt.Errorf("failed to claim GPUs: %w", err)
+		}
+
+		pciAddrs, err = getPCIAddresses(claimedGPUs)
+		if err != nil {
+			log.Error(err, "Failed to get PCI addresses from GPU claims")
+			return nil, fmt.Errorf("failed to get PCI addresses: %w", err)
+		}
+		log.V(2).Info("Claimed GPU PCI addresses", "pciAddresses", pciAddrs)
 	}
-	log.V(2).Info("Claimed GPU PCI addresses", "pciAddresses", fmt.Sprintf("%v", pciAddrs))
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@internal/server/server_suite_test.go`:
- Around line 270-280: The cleanupMachine block can re-release GPU claims on
Eventually retries; after a successful call to resClaimer.Release in
cleanupMachine (where it uses m.Spec.Gpu), clear or nil out m.Spec.Gpu
immediately (e.g., set m.Spec.Gpu = nil or empty slice) before proceeding to
machineStore.Delete so that if the surrounding Eventually retries the function
it won't call resClaimer.Release again with the same specs; update the code
around the resClaimer.Release call and subsequent error handling to clear
m.Spec.Gpu only after Release returns nil.
🧹 Nitpick comments (1)
internal/server/server_suite_test.go (1)
292-304: Duplicate TestingPCIReader implementation.
This type is identically defined in internal/controllers/controllers_suite_test.go (lines 304-316). Consider extracting it to a shared test utilities package (e.g., internal/testutil/pci.go) to reduce duplication and ensure consistent behavior across test suites.
	// Release GPU claims
	if len(m.Spec.Gpu) > 0 {
		GinkgoWriter.Printf("Releasing claims ID=%s: claims=%s\n", machineID, m.Spec.Gpu)
		err = resClaimer.Release(ctx, claim.Claims{
			api.NvidiaGPUPlugin: gpu.NewGPUClaim(m.Spec.Gpu),
		})
		if err != nil {
			GinkgoWriter.Printf("Releasing claims failed ID=%s: claims=%v, err=%v\n", machineID, m.Spec.Gpu, err)
			return err
		}
	}
🧩 Analysis chain
🏁 Script executed:
# First, let's find the Release method implementation in the claim/claimer
fd -t f -e go | xargs grep -l "func.*Release" | head -20
Repository: ironcore-dev/libvirt-provider
Length of output: 104
🏁 Script executed:
# Search for the claim package and Release method
rg -n "func.*Release\(" --type go -A 5 | head -100Repository: ironcore-dev/libvirt-provider
Length of output: 487
🏁 Script executed:
# Find the claimer interface definition and Release signature
ast-grep --pattern $'interface Claimer {
$$$
}'
Repository: ironcore-dev/libvirt-provider
Length of output: 55
🏁 Script executed:
# Look for GPU claim release implementation
rg -n "Release" --type go | grep -i "claim\|gpu" | head -30Repository: ironcore-dev/libvirt-provider
Length of output: 720
🏁 Script executed:
# Find where ResourceClaimer is instantiated in server_suite_test.go
sed -n '190,200p' internal/server/server_suite_test.go
Repository: ironcore-dev/libvirt-provider
Length of output: 459
🏁 Script executed:
# Get full cleanupMachine function context
sed -n '250,310p' internal/server/server_suite_test.go
Repository: ironcore-dev/libvirt-provider
Length of output: 1480
🏁 Script executed:
# Check how Release is handled in other parts of the code (machine_create.go)
sed -n '45,60p' internal/server/machine_create.go
Repository: ironcore-dev/libvirt-provider
Length of output: 605
🏁 Script executed:
# Look for any patterns of Release error handling
rg -n -B3 -A3 "\.Release\(" --type go
Repository: ironcore-dev/libvirt-provider
Length of output: 1810
🏁 Script executed:
# Check if there are tests that verify Release behavior on repeated calls
rg -n "Release" --type go internal/server/server_suite_test.go -B2 -A2Repository: ironcore-dev/libvirt-provider
Length of output: 359
🏁 Script executed:
# Look for any tests or patterns in controllers that might show Release idempotency
rg -n "Release" --type go internal/controllers/ -B2 -A2 | head -60Repository: ironcore-dev/libvirt-provider
Length of output: 962
🏁 Script executed:
# Search for any comments or docs about Release idempotency in provider-utils imports
rg -n "claim\." --type go | grep -i "idempotent\|safe\|multiple" | head -20Repository: ironcore-dev/libvirt-provider
Length of output: 55
🏁 Script executed:
# Check if there are any error types returned by Release that suggest idempotency
rg -n "ErrAlreadyReleased\|already.*release" --type goRepository: ironcore-dev/libvirt-provider
Length of output: 55
Fix potential double-release on Eventually retry.
The cleanupMachine function releases GPU claims within an Eventually retry loop. If machineStore.Delete fails after claims are successfully released (line 279), the entire block retries and attempts to release the same claims again. Since m.Spec.Gpu is not modified after the first release, subsequent retries will call resClaimer.Release with identical GPU specs, which could cause issues if Release is not idempotent.
Clear the GPU spec after successful release to prevent retry from re-releasing the same claims:
♻️ Suggested improvement
// Release GPU claims
if len(m.Spec.Gpu) > 0 {
GinkgoWriter.Printf("Releasing claims ID=%s: claims=%s\n", machineID, m.Spec.Gpu)
err = resClaimer.Release(ctx, claim.Claims{
api.NvidiaGPUPlugin: gpu.NewGPUClaim(m.Spec.Gpu),
})
if err != nil {
GinkgoWriter.Printf("Releasing claims failed ID=%s: claims=%v, err=%v\n", machineID, m.Spec.Gpu, err)
return err
}
+ // Clear GPU spec to prevent double-release on retry
+ m.Spec.Gpu = nil
+ if _, err = machineStore.Update(ctx, m); err != nil && !errors.Is(err, store.ErrNotFound) {
+ return err
+ }
}

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
// Release GPU claims
if len(m.Spec.Gpu) > 0 {
	GinkgoWriter.Printf("Releasing claims ID=%s: claims=%s\n", machineID, m.Spec.Gpu)
	err = resClaimer.Release(ctx, claim.Claims{
		api.NvidiaGPUPlugin: gpu.NewGPUClaim(m.Spec.Gpu),
	})
	if err != nil {
		GinkgoWriter.Printf("Releasing claims failed ID=%s: claims=%v, err=%v\n", machineID, m.Spec.Gpu, err)
		return err
	}
	// Clear GPU spec to prevent double-release on retry
	m.Spec.Gpu = nil
	if _, err = machineStore.Update(ctx, m); err != nil && !errors.Is(err, store.ErrNotFound) {
		return err
	}
}
🤖 Prompt for AI Agents
In `@internal/server/server_suite_test.go` around lines 270 - 280, The
cleanupMachine block can re-release GPU claims on Eventually retries; after a
successful call to resClaimer.Release in cleanupMachine (where it uses
m.Spec.Gpu), clear or nil out m.Spec.Gpu immediately (e.g., set m.Spec.Gpu = nil
or empty slice) before proceeding to machineStore.Delete so that if the
surrounding Eventually retries the function it won't call resClaimer.Release
again with the same specs; update the code around the resClaimer.Release call
and subsequent error handling to clear m.Spec.Gpu only after Release returns
nil.
Out of Scope:
Contributes to #678
Summary by CodeRabbit
New Features
Configuration
Behavior
Tests
Chores