feat(otel): OpenTelemetry traces, logs, drop counters, and OTEL metrics (Phase 1 + 2)#818
feat(otel): OpenTelemetry traces, logs, drop counters, and OTEL metrics (Phase 1 + 2)#818matthyx wants to merge 32 commits into
Conversation
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughReplaces legacy OTEL wiring with provider-based InitProviders, adds an OTEL-backed metrics manager, implements ProfileLifecycleTracker, instruments rules/profiles/ebpf/malware/SBOM/gRPC, renames a metrics config flag, updates docs/go.mod, and adds tests and benchmarks. ChangesOpenTelemetry Setup and Integration
Sequence Diagram(s): omitted (changes are multiple feature areas; no single three-component sequential flow benefits more than the above diagrams). Estimated code review effort 🎯 4 (Complex) | ⏱️ ~60 minutes Possibly related PRs
Suggested labels
Suggested reviewers
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
bd33437 to
dcb83de
Compare
Performance Benchmark ResultsNode-Agent Resource Usage
Dedup Effectiveness (AFTER only)
Event Counters
|
There was a problem hiding this comment.
Actionable comments posted: 6
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
pkg/containerwatcher/v2/container_watcher.go (1)
490-501:⚠️ Potential issue | 🟠 Major | ⚡ Quick winRelease dropped events on the non-blocking drop path.
On Line 490, dropped events skip the worker callback (where
enrichedEvent.Event.Release()normally happens), so they are never released.💡 Proposed fix
} else { logger.L().Warning("ContainerWatcher - Worker channel full, dropping event", helpers.String("eventType", string(entry.EventType)), helpers.String("containerID", entry.ContainerID)) cw.ebpfDropCounter.Add(context.Background(), 1, metric.WithAttributes( attribute.String("event_type", string(entry.EventType)), attribute.String("reason", "worker_channel_full"), ), ) + enrichedEvent.Event.Release() } } }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pkg/containerwatcher/v2/container_watcher.go` around lines 490 - 501, The non-blocking "worker channel full" branch drops events without releasing their resources; update the else branch so that before logging/incrementing cw.ebpfDropCounter you call the event release used in the worker path (e.g. call enrichedEvent.Event.Release() or the appropriate Release() on the event stored in entry), guarding with a nil-check if necessary, so dropped events are properly released just like in the worker callback.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@cmd/main.go`:
- Around line 116-122: The deferred OTEL shutdown in main is skipped because
main calls os.Exit; move to the safer pattern: create a run() function that
contains the existing setup and the defer block that calls otelShutdown using
context.WithTimeout (and defers cancel), keep the defer exactly as in the diff,
replace all internal os.Exit(...) calls inside run() with returns of the
corresponding exit code, and change main to call os.Exit(run()); ensure you
reference the existing otelShutdown variable and the defer with
context.WithTimeout so the shutdown runs on SIGINT/SIGTERM paths.
- Around line 103-112: The OTEL provider call currently passes raw os.Getenv
values for NodeName, PodName, Namespace and ClusterName which bypass the
resolved config and clusterData; update the otelsetup.InitProviders
ProviderConfig construction (the call site where ProviderConfig is built) to use
the resolved config and cluster values (e.g., cfg.NodeName / cfg.PodName /
cfg.Namespace / cfg.ClusterName and existing clusterData fields) with optional
fallbacks to the env vars if needed so telemetry uses the final config values
instead of the raw environment.
In `@docs/CONFIGURATION.md`:
- Around line 112-115: Add a new environment variable row documenting
KS_LOGGER_NAME to the environment variables table, noting its default value and
valid options (e.g., "slog" vs "prettylogger"/"zaplogger") and that setting
KS_LOGGER_NAME=slog activates the ring buffer used by the retroactive log
export; also reference that ENABLE_DEBUG_LISTENER must be true to keep the last
7,500 log records in memory and that they can be re-emitted via a POST to
/debug/flush-ring-buffer.
In `@pkg/otelsetup/otelsetup_test.go`:
- Around line 156-161: Test TestSlowEvalThreshold_Default modifies the
package-global atomic slowEvalThresholdNs and does not restore it; update the
test to capture the current value of slowEvalThresholdNs before calling
slowEvalThresholdNs.Store(...), and defer restoring the saved value (using
slowEvalThresholdNs.Store(old)) so the global state is returned to its prior
value after the test; reference the slowEvalThresholdNs symbol and the
TestSlowEvalThreshold_Default test and keep the assertion on SlowEvalThreshold()
unchanged.
In `@pkg/otelsetup/setup.go`:
- Around line 185-187: The exporter option construction uses
otlptracegrpc.WithEndpoint (and likewise otlpmetricgrpc.WithEndpoint and
otlplogsgrpc.WithEndpoint) which requires host:port; update the logic that
builds traceOpts/metricOpts/logOpts to detect URL-style endpoints (use the
existing isARMOEndpoint() utility or check for "://") and call
WithEndpointURL(endpoint) when a scheme is present, otherwise keep
WithEndpoint(endpoint); apply the same change for the metric and log exporter
option blocks referenced around the other diffs so all three exporters handle
both host:port and full URL endpoints correctly.
- Around line 287-289: The shutdown closure currently derives shutdownCtx from
the incoming c which can be already cancelled; change it to create a fresh
timeout context using context.WithTimeout(context.Background(), 5*time.Second)
(instead of context.WithTimeout(c, ...)) so provider shutdown/flush calls are
not short-circuited by the caller's cancelled context; update the shutdown
function (the shutdown variable in setup.go) to use that fresh context and still
call defer cancel() as before.
---
Outside diff comments:
In `@pkg/containerwatcher/v2/container_watcher.go`:
- Around line 490-501: The non-blocking "worker channel full" branch drops
events without releasing their resources; update the else branch so that before
logging/incrementing cw.ebpfDropCounter you call the event release used in the
worker path (e.g. call enrichedEvent.Event.Release() or the appropriate
Release() on the event stored in entry), guarding with a nil-check if necessary,
so dropped events are properly released just like in the worker callback.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 1c5cf97b-b20d-4d20-93f8-9fcf98af12b5
⛔ Files ignored due to path filters (1)
go.sumis excluded by!**/*.sum
📒 Files selected for processing (16)
cmd/main.godocs/CONFIGURATION.mdgo.modpkg/config/config.gopkg/config/config_test.gopkg/containerprofilemanager/v1/containerprofile_manager.gopkg/containerprofilemanager/v1/monitoring.gopkg/containerwatcher/v2/container_watcher.gopkg/containerwatcher/v2/event_handler_factory.gopkg/containerwatcher/v2/tracers/top.gopkg/malwaremanager/v1/malware_manager.gopkg/otelsetup/lifecycle.gopkg/otelsetup/otelsetup_test.gopkg/otelsetup/setup.gopkg/rulemanager/rule_manager.gopkg/sbommanager/v1/sbom_manager.go
Performance Benchmark ResultsNode-Agent Resource Usage
Dedup Effectiveness (AFTER only)
Event Counters
|
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
pkg/otelsetup/setup.go (1)
171-177:⚠️ Potential issue | 🟠 Major | ⚡ Quick win
ProviderConfigstill drops cluster/version metadata.
cfg.ClusterNameandcfg.ServiceVersionare passed intoInitProviders, but they never make it into the shared OTEL resource. That means all three signals still lose cluster identity and agent version even aftercmd/main.gowas updated to resolve those values first.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pkg/otelsetup/setup.go` around lines 171 - 177, The OTEL resource built in resource.Merge (inside InitProviders / setup.go) omits cluster and service version; update the resource.NewWithAttributes call to include cfg.ClusterName and cfg.ServiceVersion by adding the appropriate semantic attributes (e.g. semconv.K8SClusterName(cfg.ClusterName) and semconv.ServiceVersion(cfg.ServiceVersion)) alongside semconv.ServiceName, semconv.K8SNodeName, semconv.K8SPodName, and semconv.K8SNamespaceName so the shared resource carries cluster identity and agent version.
♻️ Duplicate comments (1)
cmd/main.go (1)
116-122:⚠️ Potential issue | 🟠 Major | 🏗️ Heavy liftDeferred OTEL shutdown is still skipped on hard-exit paths.
The defer at Lines 116-122 never runs after
os.Exit(...), and the laterFatal(...)paths terminate immediately as well. Normal SIGTERM/SIGINT shutdown can still bypass provider flush and drop batched telemetry.Also applies to: 131-139, 186-187, 495-528
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@cmd/main.go` around lines 116 - 122, The deferred otelShutdown in the anonymous defer block will be skipped on hard exits (os.Exit and log.Fatal paths) so ensure otelShutdown(shutdownCtx) is invoked before any immediate exit: replace direct os.Exit / log.Fatal usage or any hard-exit paths that occur near the locations mentioned with a small exit helper (e.g., callExitWithShutdown) that calls otelShutdown with a context timeout (same 5s pattern used in the current defer), waits for completion (or logs error), then performs the actual os.Exit; update all occurrences that terminate immediately (references: otelShutdown, the anonymous defer block, and places that call os.Exit / Fatal) to use that helper so provider flush runs on hard-exit paths.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Outside diff comments:
In `@pkg/otelsetup/setup.go`:
- Around line 171-177: The OTEL resource built in resource.Merge (inside
InitProviders / setup.go) omits cluster and service version; update the
resource.NewWithAttributes call to include cfg.ClusterName and
cfg.ServiceVersion by adding the appropriate semantic attributes (e.g.
semconv.K8SClusterName(cfg.ClusterName) and
semconv.ServiceVersion(cfg.ServiceVersion)) alongside semconv.ServiceName,
semconv.K8SNodeName, semconv.K8SPodName, and semconv.K8SNamespaceName so the
shared resource carries cluster identity and agent version.
---
Duplicate comments:
In `@cmd/main.go`:
- Around line 116-122: The deferred otelShutdown in the anonymous defer block
will be skipped on hard exits (os.Exit and log.Fatal paths) so ensure
otelShutdown(shutdownCtx) is invoked before any immediate exit: replace direct
os.Exit / log.Fatal usage or any hard-exit paths that occur near the locations
mentioned with a small exit helper (e.g., callExitWithShutdown) that calls
otelShutdown with a context timeout (same 5s pattern used in the current defer),
waits for completion (or logs error), then performs the actual os.Exit; update
all occurrences that terminate immediately (references: otelShutdown, the
anonymous defer block, and places that call os.Exit / Fatal) to use that helper
so provider flush runs on hard-exit paths.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: aa4254e2-e99b-47b8-af17-b2fe32adc73c
📒 Files selected for processing (4)
cmd/main.gopkg/containerwatcher/v2/container_watcher.gopkg/otelsetup/otelsetup_test.gopkg/otelsetup/setup.go
|
@coderabbitai review again |
|
✅ Actions performedFull review triggered. |
Performance Benchmark ResultsNode-Agent Resource Usage
Dedup Effectiveness (AFTER only)
Event Counters
|
da9e60f to
2e19259
Compare
Performance Benchmark ResultsNode-Agent Resource Usage
Dedup Effectiveness (AFTER only)
Event Counters
|
Performance Benchmark ResultsNode-Agent Resource Usage
Dedup Effectiveness (AFTER only)
Event Counters
|
There was a problem hiding this comment.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
pkg/otelsetup/lifecycle.go (1)
39-45:⚠️ Potential issue | 🟠 Major | ⚡ Quick winAvoid evicting another container when replacing an existing tracked one.
At capacity, eviction runs before the replace check. If
containerIDalready exists, this can evict an unrelated active lifecycle span and reduce tracked coverage unnecessarily.Suggested patch
func (t *ProfileLifecycleTracker) OnLearningStarted(containerID, namespace, pod, image string) { t.mu.Lock() defer t.mu.Unlock() - if len(t.spans) >= maxTrackedProfiles { - t.evictOldest() - } if existing, ok := t.spans[containerID]; ok { existing.AddEvent("learning.replaced") existing.End() + } else if len(t.spans) >= maxTrackedProfiles { + t.evictOldest() } spanCtx, span := Tracer().Start(context.Background(), "container.profile.learning",🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pkg/otelsetup/lifecycle.go` around lines 39 - 45, The current logic evicts oldest span before checking for an existing span, which can remove an unrelated active lifecycle span when containerID is being replaced; change the flow in the lifecycle tracking function so you first check if containerID already exists in t.spans (the existing := t.spans[containerID] path and calls to existing.AddEvent("learning.replaced") / existing.End()), perform the replacement without calling t.evictOldest, and only if the containerID is not already present enforce capacity by calling t.evictOldest when len(t.spans) >= maxTrackedProfiles; this preserves existing spans and only evicts when adding a new tracked container.pkg/containerwatcher/v2/container_watcher.go (1)
179-183:⚠️ Potential issue | 🟠 Major | ⚡ Quick winHandle the
Int64Countercreation error instead of discarding it.In
pkg/containerwatcher/v2/container_watcher.go(lines 179-183), the code ignores theerrfromotelsetup.Meter().Int64Counter(...). OpenTelemetry-Go’s contract intends the returned instrument to remain usable/non-nil even whenerr != nil, but the error still signals an instrument registration/name issue (e.g.,ErrInstrumentName) that can lead to incorrect/conflicting metric streams—so handle/log/propagateerrinstead of dropping it.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pkg/containerwatcher/v2/container_watcher.go` around lines 179 - 183, The Int64Counter creation call using otelsetup.Meter().Int64Counter assigned to ebpfDropCounter currently discards its returned error; capture the error (e.g., ebpfDropCounter, err := otelsetup.Meter().Int64Counter(...)) and handle it—log it via the existing logger (processLogger or similar) or return it from the containing function so instrument registration problems (ErrInstrumentName, etc.) are visible; ensure the handling occurs where ebpfDropCounter is declared so downstream code either gets a valid instrument or the startup fails/alerts appropriately.
🧹 Nitpick comments (1)
pkg/otelsetup/otelsetup_test.go (1)
116-121: ⚡ Quick winMake eviction ordering deterministic in the test (avoid
time.Sleep).Using
time.Sleepfor ordering can cause intermittent flakes. SettingstartTimesexplicitly under lock keeps this test deterministic.Suggested patch
tracker := NewProfileLifecycleTracker() tracker.OnLearningStarted("old", "ns", "pod", "") - time.Sleep(time.Millisecond) tracker.OnLearningStarted("new", "ns", "pod", "") tracker.mu.Lock() + tracker.startTimes["old"] = time.Unix(0, 1) + tracker.startTimes["new"] = time.Unix(0, 2) tracker.evictOldest() tracker.mu.Unlock()🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pkg/otelsetup/otelsetup_test.go` around lines 116 - 121, The test relies on time.Sleep to create ordering for eviction; instead make ordering deterministic by acquiring tracker.mu, setting tracker.startTimes for the relevant keys to explicit timestamps (older for the entry you expect evicted, newer for the one to keep), release the lock, then call tracker.OnLearningStarted("new","ns","pod","") and invoke tracker.mu.Lock(); tracker.evictOldest(); tracker.mu.Unlock(); this replaces the non-deterministic time.Sleep and ensures evictOldest() sees the intended startTimes ordering.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@pkg/otelsetup/lifecycle.go`:
- Around line 97-103: OnEntrySaved currently increments t.counts[containerID]
before verifying the container is tracked, which can grow counts for unknown
IDs; fix by checking the map lookup result (ctx, ok := t.ctxs[containerID])
while holding t.mu and only incrementing t.counts[containerID] (and reading
count) if ok is true, moving the increment into the guarded branch and returning
early without touching t.counts when ok is false; ensure the lock/unlock
semantics around t.mu remain correct.
---
Outside diff comments:
In `@pkg/containerwatcher/v2/container_watcher.go`:
- Around line 179-183: The Int64Counter creation call using
otelsetup.Meter().Int64Counter assigned to ebpfDropCounter currently discards
its returned error; capture the error (e.g., ebpfDropCounter, err :=
otelsetup.Meter().Int64Counter(...)) and handle it—log it via the existing
logger (processLogger or similar) or return it from the containing function so
instrument registration problems (ErrInstrumentName, etc.) are visible; ensure
the handling occurs where ebpfDropCounter is declared so downstream code either
gets a valid instrument or the startup fails/alerts appropriately.
In `@pkg/otelsetup/lifecycle.go`:
- Around line 39-45: The current logic evicts oldest span before checking for an
existing span, which can remove an unrelated active lifecycle span when
containerID is being replaced; change the flow in the lifecycle tracking
function so you first check if containerID already exists in t.spans (the
existing := t.spans[containerID] path and calls to
existing.AddEvent("learning.replaced") / existing.End()), perform the
replacement without calling t.evictOldest, and only if the containerID is not
already present enforce capacity by calling t.evictOldest when len(t.spans) >=
maxTrackedProfiles; this preserves existing spans and only evicts when adding a
new tracked container.
---
Nitpick comments:
In `@pkg/otelsetup/otelsetup_test.go`:
- Around line 116-121: The test relies on time.Sleep to create ordering for
eviction; instead make ordering deterministic by acquiring tracker.mu, setting
tracker.startTimes for the relevant keys to explicit timestamps (older for the
entry you expect evicted, newer for the one to keep), release the lock, then
call tracker.OnLearningStarted("new","ns","pod","") and invoke
tracker.mu.Lock(); tracker.evictOldest(); tracker.mu.Unlock(); this replaces the
non-deterministic time.Sleep and ensures evictOldest() sees the intended
startTimes ordering.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 226ea8a5-cbde-422e-82d6-c29aa8dcd6c0
⛔ Files ignored due to path filters (1)
go.sumis excluded by!**/*.sum
📒 Files selected for processing (7)
cmd/main.gogo.modpkg/containerprofilemanager/v1/monitoring.gopkg/containerwatcher/v2/container_watcher.gopkg/otelsetup/lifecycle.gopkg/otelsetup/otelsetup_test.gopkg/otelsetup/setup.go
Performance Benchmark ResultsNode-Agent Resource Usage
Dedup EffectivenessNo data available. |
Performance Benchmark ResultsNode-Agent Resource Usage
Dedup Effectiveness (AFTER only)
Event Counters
|
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
pkg/rulemanager/rule_manager.go (1)
372-385:⚠️ Potential issue | 🟠 Major | ⚡ Quick winSlow-eval span gate is always true with the documented default.
Line 374 uses
evaluationTime >= otelsetup.SlowEvalThreshold(). IfOTEL_SLOW_EVAL_THRESHOLD_MSdefaults to0, this records spans for every rule evaluation, not only slow ones.Suggested fix
- if evaluationTime >= otelsetup.SlowEvalThreshold() { + if threshold := otelsetup.SlowEvalThreshold(); threshold > 0 && evaluationTime >= threshold { _, span := otelsetup.Tracer().Start(rm.ctx, "rule.evaluate", trace.WithAttributes(🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pkg/rulemanager/rule_manager.go` around lines 372 - 385, The slow-eval span gate currently uses >= which with the default OTEL_SLOW_EVAL_THRESHOLD_MS == 0 will record every evaluation; change the condition in rule_manager.go from evaluationTime >= otelsetup.SlowEvalThreshold() to evaluationTime > otelsetup.SlowEvalThreshold() (ensuring both sides are the same unit, e.g., time.Duration) so only truly slower-than-threshold evaluations create the "rule.evaluate" span; if otelsetup.SlowEvalThreshold() can be zero by design, also consider ensuring its default is a positive duration instead of 0.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Outside diff comments:
In `@pkg/rulemanager/rule_manager.go`:
- Around line 372-385: The slow-eval span gate currently uses >= which with the
default OTEL_SLOW_EVAL_THRESHOLD_MS == 0 will record every evaluation; change
the condition in rule_manager.go from evaluationTime >=
otelsetup.SlowEvalThreshold() to evaluationTime > otelsetup.SlowEvalThreshold()
(ensuring both sides are the same unit, e.g., time.Duration) so only truly
slower-than-threshold evaluations create the "rule.evaluate" span; if
otelsetup.SlowEvalThreshold() can be zero by design, also consider ensuring its
default is a positive duration instead of 0.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 0cfa1ecd-a736-45e0-baea-57fe5d4815f4
📒 Files selected for processing (10)
pkg/containerprofilemanager/v1/monitoring.gopkg/containerwatcher/v2/container_watcher.gopkg/exporters/alert_bulk_manager.gopkg/exporters/alert_manager.gopkg/exporters/http_exporter.gopkg/malwaremanager/v1/clamav/clamav.gopkg/malwaremanager/v1/clamav/exec.gopkg/malwaremanager/v1/clamav/open.gopkg/malwaremanager/v1/malware_manager.gopkg/rulemanager/rule_manager.go
✅ Files skipped from review due to trivial changes (2)
- pkg/exporters/http_exporter.go
- pkg/exporters/alert_bulk_manager.go
Performance Benchmark ResultsNode-Agent Resource Usage
Dedup Effectiveness (AFTER only)
Event Counters
|
Introduces OTEL instrumentation across node-agent without touching existing Prometheus metrics: - pkg/otelsetup: new package with InitProviders (TracerProvider + LoggerProvider + MeterProvider via OTLP gRPC), ARMO auth header injection, ring-buffer log processor (7500-entry), ProfileLifecycleTracker (one span per container learning period, cap 10k), SlowEvalThreshold, EmitAlertLogRecord with 60s/1000-entry dedup LRU, and debug HTTP listener - pkg/rulemanager: emit alert OTEL log records per fired rule; add slow-path span for evaluations exceeding SlowEvalThreshold - pkg/malwaremanager: emit alert OTEL log records for malware detections - pkg/containerwatcher: count dropped eBPF events via Int64Counter (node_agent.ebpf.events_dropped.total) with reason label - pkg/containerprofilemanager: wire ProfileLifecycleTracker lifecycle hooks (OnLearningStarted/OnEntrySaved/OnLearningEnded) - pkg/sbommanager: attach otelgrpc stats handler to gRPC dial - docs/CONFIGURATION.md: document new OTEL env vars; mark OTEL_COLLECTOR_SVC as deprecated Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Matthias Bertschy <matthias.bertschy@gmail.com>
InitProviders now guards each exporter behind its own non-empty endpoint check, so a traces-only config no longer instantiates log/metric exporters against empty targets (avoiding retry loops). The ARMO no-credentials guard is extended to cover metrics-only ARMO configs. Each provider pointer is nil-checked in the combined shutdown. Debug listener is gated on logProvider being non-nil. Alert-log dedup in rule_manager replaces the racy Contains+Add pair with a mutex-protected check-and-set, closing the TOCTOU window under the ants worker pool. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- cmd/main.go: use resolved cfg/clusterData values for OTEL resource attributes instead of raw os.Getenv (NodeName, PodName, NamespaceName, ClusterName) - setup.go: detect URL-scheme endpoints and route to WithEndpointURL; host:port paths continue using WithEndpoint for all three exporters - setup.go: shutdown closure derives fresh context.Background() timeout so provider flush is not short-circuited by an already-cancelled caller context - otelsetup_test.go: restore slowEvalThresholdNs global in t.Cleanup to prevent order-dependent test failures - container_watcher.go: call Release() on dropped events in the worker_channel_full path to prevent resource leak Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Matthias Bertschy <matthias.bertschy@gmail.com>
The shared resource was missing K8SClusterName and ServiceVersion attributes, so all three signals were losing cluster identity and agent version even after cmd/main.go was updated to resolve these values from clusterData and ProviderConfig. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
resource.Default() in otel v1.43.0 sets schema URL 1.40.0 internally, while resource.NewWithAttributes(semconv.SchemaURL, ...) sets 1.26.0 from our semconv/v1.26.0 import. resource.Merge rejects two non-empty conflicting schema URLs, causing InitProviders to fail at startup. Switch to resource.NewSchemaless so our custom attributes carry no schema URL; Merge then adopts the default resource's schema without conflict. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The OTLP gRPC exporters default to TLS, which fails against plaintext collectors (e.g. local SignOz on port 4317). Add WithInsecure() for all three exporters when the endpoint does not start with https://. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ing buffer severity gate - ProfileLifecycleTracker: one container.profile.learning span per container with LearningSpanID() and LearningTraceparent() for W3C propagation - OnEntrySaved: emit container.profile.cp.saved child spans with M2 throttle (count==1, %10, or hasDropped) - ContainerProfile annotations: OtelSpanIDMetadataKey + OtelTraceparentMetadataKey (k8s-interface v0.0.213) - otelsetup: thin wrapper delegating to go-logger/otelsetup (v0.0.29) which includes ring buffer with severity≥Info gate - Bump k8s-interface v0.0.212→v0.0.213 for OtelTraceparentMetadataKey Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…amAV, drop events Wires OTEL log context on the call sites the design doc marks as Tier 1 (direct customer impact) and Tier 2 (operational health): - Alert delivery failures: alert_manager.go SendRuleAlert/SendMalwareAlert, alert_bulk_manager.go bulk-send max retries / queue-full / drain timeout, http_exporter.go SendRuleAlert / SendFimAlerts / alert-limit - Rule eval failures: rule_manager.go ReportEnrichedEvent / EvaluatePolicyRules / getUniqueIdAndMessage — use rm.ctx for trace correlation - ClamAV health: clamav.go ping failure, exec.go/open.go scan failures - Drop events: container_watcher.go worker-channel-full lines (Tier 2) - Profile save failures: containerprofilemanager monitoring.go — use cpm.ctx Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… to exec malware path - rule_manager.go: set span.SetStatus(codes.Error) when CEL evaluation returns an error so failed evaluations are distinguishable from successful non-alerts in traces - malware_manager.go: mirror reportFileOpen OTEL emission in reportFileExec — adds metrics.ReportRuleAlert + EmitAlertLogRecord to the exec malware detection path so exec-path detections appear in telemetry alongside open-path ones Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
4525e89 to
b41f213
Compare
Performance Benchmark ResultsNode-Agent Resource Usage
Dedup Effectiveness (AFTER only)
Event Counters
|
- Add pkg/metricsmanager/otel/ with full MetricsManager implementation backed by OTEL SDK; attribute-set caching on all hot paths eliminates per-call allocations (2× faster, 10× less memory vs Prometheus on the histogram path per A/B benchmark) - Wire OTEL metrics in cmd/main.go; drop prometheus package import - Add Prometheus scrape mode: OTEL_METRICS_EXPORTER=prometheus installs an OTEL→Prometheus bridge and starts :8080/metrics listener - Standardise rule.ID at all metric call sites (was rule.Name); malware alerts use constant "malware" to bound cardinality - Add docs/metrics-migration.md mapping every old Prometheus name to its new OTEL name (breaking rename — dashboards must be updated) - Add A/B benchmarks in otel/ and prometheus/ packages; hard gate passes: OTEL allocs/op ≤ Prometheus allocs/op, ns/op ≤ 1.1× Prometheus Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Starts go.opentelemetry.io/contrib/instrumentation/runtime in the sidecar so heap_alloc, GC pause, and goroutine counts flow to SigNoz as OTEL metrics sampled every 30s. Records heap.alloc.before_mb / after_mb / delta_mb as attributes on every sbom.scan span so OOM-prone images are directly identifiable in trace views. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…vability
Adds node_agent.alert.suppressed.total{rule_id, reason} counter so
operators can debug "alert didn't fire" cases from SigNoz alone.
Suppression reasons instrumented:
- no_rules_for_pod: no rule bindings match the pod
- profile_incomplete: rule requires profile but none exists yet
- policy: rule policy check suppressed the alert
- eval_error: CEL evaluation returned an error
- cooldown: alert deduplicated within the cooldown window
- rate_limit: HTTP exporter rate limit reached (rule + malware alerts)
MetricsManager is now threaded into HTTPExporter (variadic, defaults to
noop for backward-compatible test callers) and InitExporters gains a
required MetricsManager parameter.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Performance Benchmark ResultsNode-Agent Resource Usage
Dedup EffectivenessNo data available. Event Counters
|
Health checks fire every 5s (scannerReadinessCheckInterval), producing 720 spans/hr — ~125 KB/hr compressed, which blows the entire per-agent span budget (target: ~1 KB/hr for control-plane spans with mitigations). Add otelgrpc.WithFilter on both client and server handlers to skip the Health RPC. CreateSBOM spans (low frequency, ~5–20/hr) are retained. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Avoids ~2-3 KB/hr of metric volume for deployments without a telemetry endpoint configured. Consistent with how the main agent guards OTEL init. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
@coderabbitai review |
✅ Actions performedReview triggered.
|
ReportAlertSuppressed is called on the hot path (no_rules_for_pod fires per event, profile_incomplete fires per-rule per event during learning). metric.WithAttributes allocates a []attribute.KeyValue slice on every call; replace with the same sync.Map option cache used by all other hot-path methods to eliminate per-call heap allocation. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Performance Benchmark ResultsNode-Agent Resource Usage
Dedup EffectivenessNo data available. |
There was a problem hiding this comment.
Actionable comments posted: 4
🧹 Nitpick comments (1)
pkg/malwaremanager/v1/malware_manager.go (1)
173-191: ⚡ Quick winExtract shared malware alert telemetry emission into one helper.
The span +
EmitAlertLogRecordblock is duplicated in two paths; consolidating it will prevent drift in attributes/fields over time.♻️ Suggested refactor
func (mm *MalwareManager) reportFileExec(event utils.ExecEvent) { @@ - alertCtx, alertSpan := otelsetup.Tracer().Start(context.Background(), "malware.alert", - trace.WithAttributes( - attribute.String("container.id", containerID), - attribute.String("k8s.namespace.name", result.GetRuntimeAlertK8sDetails().Namespace), - attribute.String("k8s.pod.name", result.GetRuntimeAlertK8sDetails().PodName), - attribute.String("malware.signature", result.GetBasicRuntimeAlert().AlertName), - )) - otelsetup.EmitAlertLogRecord(alertCtx, otelsetup.AlertLogAttrs{ - RuleID: "malware", - AlertType: result.GetBasicRuntimeAlert().AlertName, - ContainerID: containerID, - ContainerName: result.GetRuntimeAlertK8sDetails().ContainerName, - Namespace: result.GetRuntimeAlertK8sDetails().Namespace, - PodName: result.GetRuntimeAlertK8sDetails().PodName, - Image: result.GetRuntimeAlertK8sDetails().Image, - EventType: "malware", - MalwareSignature: result.GetBasicRuntimeAlert().AlertName, - }) - alertSpan.End() + mm.emitMalwareTelemetry(containerID, result) @@ func (mm *MalwareManager) reportFileOpen(event utils.OpenEvent) { @@ - alertCtx, alertSpan := otelsetup.Tracer().Start(context.Background(), "malware.alert", - trace.WithAttributes( - attribute.String("container.id", containerID), - attribute.String("k8s.namespace.name", result.GetRuntimeAlertK8sDetails().Namespace), - attribute.String("k8s.pod.name", result.GetRuntimeAlertK8sDetails().PodName), - attribute.String("malware.signature", result.GetBasicRuntimeAlert().AlertName), - )) - otelsetup.EmitAlertLogRecord(alertCtx, otelsetup.AlertLogAttrs{ - RuleID: "malware", - AlertType: result.GetBasicRuntimeAlert().AlertName, - ContainerID: containerID, - ContainerName: result.GetRuntimeAlertK8sDetails().ContainerName, - Namespace: result.GetRuntimeAlertK8sDetails().Namespace, - PodName: result.GetRuntimeAlertK8sDetails().PodName, - Image: result.GetRuntimeAlertK8sDetails().Image, - EventType: "malware", - MalwareSignature: result.GetBasicRuntimeAlert().AlertName, - }) - alertSpan.End() + mm.emitMalwareTelemetry(containerID, result) } } } + +func (mm *MalwareManager) emitMalwareTelemetry(containerID string, result malwaremanager.MalwareResult) { + alertCtx, alertSpan := otelsetup.Tracer().Start(context.Background(), "malware.alert", + trace.WithAttributes( + attribute.String("container.id", containerID), + attribute.String("k8s.namespace.name", result.GetRuntimeAlertK8sDetails().Namespace), + attribute.String("k8s.pod.name", result.GetRuntimeAlertK8sDetails().PodName), + attribute.String("malware.signature", result.GetBasicRuntimeAlert().AlertName), + )) + defer alertSpan.End() + + otelsetup.EmitAlertLogRecord(alertCtx, otelsetup.AlertLogAttrs{ + RuleID: "malware", + AlertType: result.GetBasicRuntimeAlert().AlertName, + ContainerID: containerID, + ContainerName: result.GetRuntimeAlertK8sDetails().ContainerName, + Namespace: result.GetRuntimeAlertK8sDetails().Namespace, + PodName: result.GetRuntimeAlertK8sDetails().PodName, + Image: result.GetRuntimeAlertK8sDetails().Image, + EventType: "malware", + MalwareSignature: result.GetBasicRuntimeAlert().AlertName, + }) +}Also applies to: 237-255
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pkg/malwaremanager/v1/malware_manager.go` around lines 173 - 191, Duplicate telemetry emission (span + otelsetup.EmitAlertLogRecord) for malware alerts should be extracted into a single helper to avoid drift; create a function (e.g., emitMalwareAlertTelemetry or EmitMalwareAlert) that accepts context, containerID and the alert result (or the derived values from result.GetBasicRuntimeAlert() and result.GetRuntimeAlertK8sDetails()), starts the otel tracer span named "malware.alert" with the same attributes, calls otelsetup.EmitAlertLogRecord with the same AlertLogAttrs (RuleID "malware", AlertType/MalwareSignature from result.GetBasicRuntimeAlert().AlertName, ContainerName/Namespace/PodName/Image from result.GetRuntimeAlertK8sDetails(), EventType "malware"), ends the span, and replace the duplicated blocks (the block using alertCtx/alertSpan and otelsetup.EmitAlertLogRecord at lines ~173-191 and ~237-255) with calls to that helper.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@cmd/sbom-scanner/main.go`:
- Around line 49-55: The current gating uses only OTEL_EXPORTER_OTLP_ENDPOINT
before calling goruntime.Start, which misses other valid metrics configurations;
update the condition around goruntime.Start (the block that calls
goruntime.Start with goruntime.WithMinimumReadMemStatsInterval) to also permit
startup when OTEL_EXPORTER_OTLP_METRICS_ENDPOINT is set or when
OTEL_METRICS_EXPORTER equals "prometheus" (i.e., check
os.Getenv("OTEL_EXPORTER_OTLP_ENDPOINT") != "" ||
os.Getenv("OTEL_EXPORTER_OTLP_METRICS_ENDPOINT") != "" ||
strings.EqualFold(os.Getenv("OTEL_METRICS_EXPORTER"), "prometheus")), keeping
the existing error handling that logs via logger.L().Warning if goruntime.Start
returns an error.
In `@pkg/metricsmanager/prometheus/prometheus.go`:
- Around line 361-383: Destroy() currently unregisters legacy metrics but misses
the metrics created with promauto: p.sbomScanCounter, p.alertSuppressedCounter,
p.sbomScanDuration, p.sbomRestarts, and p.sbomReady; update the
PrometheusMetric.Destroy() implementation to call prometheus.Unregister(...) for
each of those identifiers (ensuring you pass the same metric variables:
sbomScanCounter, alertSuppressedCounter, sbomScanDuration, sbomRestarts,
sbomReady) so they are removed from the default registry during teardown and
avoid stale/default-registry leftovers.
In `@pkg/otelsetup/setup.go`:
- Around line 144-147: The returned shutdown closure currently ignores the error
from srv.Shutdown(ctx); update the closure returned by the function so it
captures the error from srv.Shutdown(ctx) and does not discard it: call
srv.Shutdown(ctx), store its error (e.g., srvErr), then call mp.Shutdown(ctx)
and return an appropriate combined result (for example return mpErr if mpErr !=
nil else srvErr, or wrap both errors) instead of always returning only
mp.Shutdown's error; reference the closure that currently calls
srv.Shutdown(ctx) and mp.Shutdown(ctx) and modify its return behavior
accordingly.
In `@pkg/sbommanager/v1/sbom_manager.go`:
- Around line 100-101: Create a nil-check for the incoming metrics parameter in
CreateSbomManager and any other SbomManager constructors/initializers that store
it (e.g., where metrics is assigned inside the function), and if metrics == nil
replace it with a no-op implementation that satisfies
metricsmanager.MetricsManager (or return an explicit error if you prefer);
ensure the code assigns this safe non-nil instance before storing it on the
SbomManager struct so downstream dereferences cannot panic during scan
processing.
---
Nitpick comments:
In `@pkg/malwaremanager/v1/malware_manager.go`:
- Around line 173-191: Duplicate telemetry emission (span +
otelsetup.EmitAlertLogRecord) for malware alerts should be extracted into a
single helper to avoid drift; create a function (e.g., emitMalwareAlertTelemetry
or EmitMalwareAlert) that accepts context, containerID and the alert result (or
the derived values from result.GetBasicRuntimeAlert() and
result.GetRuntimeAlertK8sDetails()), starts the otel tracer span named
"malware.alert" with the same attributes, calls otelsetup.EmitAlertLogRecord
with the same AlertLogAttrs (RuleID "malware", AlertType/MalwareSignature from
result.GetBasicRuntimeAlert().AlertName, ContainerName/Namespace/PodName/Image
from result.GetRuntimeAlertK8sDetails(), EventType "malware"), ends the span,
and replace the duplicated blocks (the block using alertCtx/alertSpan and
otelsetup.EmitAlertLogRecord at lines ~173-191 and ~237-255) with calls to that
helper.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 8738e41f-2174-4fbb-ad9e-b3fc16ead7c0
⛔ Files ignored due to path filters (1)
go.sumis excluded by!**/*.sum
📒 Files selected for processing (35)
cmd/main.gocmd/sbom-scanner/main.godocs/CONFIGURATION.mddocs/metrics-migration.mdgo.modpkg/config/config.gopkg/config/config_test.gopkg/containerprofilemanager/v1/containerprofile_manager.gopkg/containerprofilemanager/v1/monitoring.gopkg/containerwatcher/v2/container_watcher.gopkg/containerwatcher/v2/event_handler_factory.gopkg/containerwatcher/v2/tracers/top.gopkg/exporters/alert_bulk_manager.gopkg/exporters/alert_manager.gopkg/exporters/exporters_bus.gopkg/exporters/http_exporter.gopkg/malwaremanager/v1/clamav/clamav.gopkg/malwaremanager/v1/clamav/exec.gopkg/malwaremanager/v1/clamav/open.gopkg/malwaremanager/v1/malware_manager.gopkg/metricsmanager/metrics_manager_interface.gopkg/metricsmanager/metrics_manager_mock.gopkg/metricsmanager/metrics_manager_noop.gopkg/metricsmanager/otel/bench_test.gopkg/metricsmanager/otel/otel_metrics_manager.gopkg/metricsmanager/prometheus/bench_test.gopkg/metricsmanager/prometheus/prometheus.gopkg/otelsetup/lifecycle.gopkg/otelsetup/otelsetup_test.gopkg/otelsetup/setup.gopkg/rulemanager/rule_manager.gopkg/sbommanager/v1/metrics.gopkg/sbommanager/v1/sbom_manager.gopkg/sbomscanner/v1/client.gopkg/sbomscanner/v1/server.go
💤 Files with no reviewable changes (1)
- pkg/sbommanager/v1/metrics.go
✅ Files skipped from review due to trivial changes (5)
- pkg/malwaremanager/v1/clamav/open.go
- pkg/config/config_test.go
- docs/CONFIGURATION.md
- docs/metrics-migration.md
- pkg/exporters/alert_bulk_manager.go
…etheus port conflict Issue 1: sidecar OTEL was a no-op in ARMO deployments because credentials come from /etc/credentials (not env vars). Load from file first, fall back to ACCOUNT_ID/ACCESS_KEY env vars for non-ARMO deployments. Issue 2: InitProviders binds :8080 in Prometheus mode; main agent already owns that port, causing initPrometheusMeterProvider to return an error and tear down all base providers (traces + logs lost too). Changed to a soft failure: log a warning and continue with traces/logs when the Prometheus listener can't bind. Sidecar retains full OTLP telemetry in all modes. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Performance Benchmark ResultsNode-Agent Resource Usage
Dedup EffectivenessNo data available. |
Picks up buildAuthHeaders refactor and AC8/AC9 unit tests for ARMO auth header injection (X-API-Key, X-Customer-GUID) across trace/log/metric exporter option builders. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Performance Benchmark ResultsNode-Agent Resource Usage
Dedup EffectivenessNo data available. |
Performance Benchmark ResultsNode-Agent Resource Usage
Dedup EffectivenessNo data available. Event Counters
|
otel.SetMeterProvider(mp) was called before net.Listen(":8080"), so a
port-conflict error (sidecar when main agent owns :8080) left a leaked
Prometheus-backed provider as the global, silently blackholing sidecar
metrics. Reorder to attempt net.Listen first; only install the provider
after the listener succeeds.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…reateSbomManager Prevents a runtime panic if CreateSbomManager is ever called with a nil MetricsManager (e.g. in tests). Substitutes the no-op implementation rather than returning an error, consistent with how other managers in this codebase treat optional metric dependencies. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…e values HeapAlloc is a live snapshot that can decrease when GC runs mid-scan, making heap.alloc.delta_mb negative or understated. Switch to TotalAlloc (monotonically increasing cumulative bytes) so the delta always reflects actual allocations made during the scan. Rename span attributes from heap.alloc.* to alloc.total.* to avoid implying a live-heap measurement. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Performance Benchmark ResultsNode-Agent Resource Usage
Dedup EffectivenessNo data available. |
|
@coderabbitai review |
✅ Actions performedReview triggered.
|
Performance Benchmark ResultsNode-Agent Resource Usage
Dedup EffectivenessNo data available. Event Counters
|
Picks up credential-presence auth header gate — drops isARMOEndpoint() hostname matching in favour of accessKey != "" check, consistent with the SBOM scan-failure reporter and HTTP exporter. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…tial-presence model Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…LE_DEBUG_LISTENER Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Performance Benchmark ResultsNode-Agent Resource Usage
Dedup EffectivenessNo data available. |
Picks up debug listener activation from KS_LOGGER_LEVEL=debug directly in the library — no ENABLE_DEBUG_LISTENER env var or DebugListener config field needed in callers. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Performance Benchmark ResultsNode-Agent Resource Usage
Dedup EffectivenessNo data available. Event Counters
|
- Add KS_LOGGER_LEVEL and KS_LOGGER_NAME to CONFIGURATION.md env vars table - Gate runtime metrics on all metric-exporter env vars (OTEL_METRICS_EXPORTER, OTEL_EXPORTER_OTLP_METRICS_ENDPOINT) not just the base endpoint - Unregister sbom/alert counters in PrometheusMetric.Destroy() - Propagate prometheus HTTP server shutdown error via errors.Join Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Performance Benchmark ResultsNode-Agent Resource Usage
Dedup EffectivenessNo data available. Event Counters
|
Summary
Phase 1 — Traces, logs, drop counters
Phase 2 — Replace Prometheus metrics with OTEL SDK
New env vars
`OTEL_COLLECTOR_SVC` is now deprecated (superseded by `OTEL_EXPORTER_OTLP_ENDPOINT`).
Breaking change
Metric names changed. See `docs/metrics-migration.md` for the full mapping and dashboard update checklist.
Test plan
🤖 Generated with Claude Code
Summary by CodeRabbit
New Features
Documentation
Tests