Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
41 commits
Select commit Hold shift + click to select a range
dba2830
feat(otel): add OpenTelemetry traces, logs, and drop counters (Phase 1)
matthyx May 19, 2026
572df5f
fix(otel): per-signal exporter gating and atomic dedup
matthyx May 20, 2026
3bbf498
fix(otel): address CodeRabbit review comments
matthyx May 20, 2026
9d7c130
fix(otel): add ClusterName and ServiceVersion to OTEL resource
matthyx May 20, 2026
6cd5ae9
fix(otel): resolve schema URL conflict in resource.Merge
matthyx May 21, 2026
9081ebb
fix(otel): use insecure gRPC transport for non-HTTPS endpoints
matthyx May 21, 2026
6aed5fb
feat(otel): Phase 1 — lifecycle spans, CP traceparent, M2 throttle, r…
matthyx May 21, 2026
8e3cf41
feat(otel): Step 5 — Tier 1/2 .Ctx() on alert delivery, rule eval, Cl…
matthyx May 21, 2026
b41f213
fix(otel): mark slow-eval spans as error on CEL failure; add OTEL log…
matthyx May 21, 2026
5b936af
feat(metrics): Phase 2 — replace Prometheus impl with OTEL SDK
matthyx May 21, 2026
c2be3d3
fix(otel): flush on shutdown, mark profiles dropped, gate OTEL log on…
matthyx May 21, 2026
8d9ea6f
fix(otel): address CodeRabbit review comments
matthyx May 21, 2026
e1fc1e6
fix(metrics): always enable OTEL metrics manager per design spec
matthyx May 22, 2026
5d18c56
feat(otel): correlate alert logs to traces via rule.alert / malware.a…
matthyx May 22, 2026
95df910
feat(metrics): thread span ctx through ReportRuleEvaluationTime for O…
matthyx May 22, 2026
dedf290
feat(sbom): migrate SBOM metrics to MetricsManager and add gRPC tracing
matthyx May 22, 2026
579ef55
feat(sbom-scanner): instrument sidecar with OTEL traces
matthyx May 22, 2026
6bd5b6a
feat(sbom-scanner): add Go runtime metrics and per-scan heap delta
matthyx May 22, 2026
14f55cb
feat: add alert.suppressed.total counter for suppression funnel obser…
matthyx May 22, 2026
7ec156a
fix: exclude Health RPC from otelgrpc tracing to stay within span budget
matthyx May 22, 2026
c477e36
fix: gate sidecar go runtime metrics on OTEL_EXPORTER_OTLP_ENDPOINT
matthyx May 22, 2026
2fb4a4a
perf: cache attribute sets for alert.suppressed counter
matthyx May 22, 2026
861db0c
fix(sbom-scanner): credentials from /etc/credentials + soft-fail Prom…
matthyx May 22, 2026
cd00ceb
chore: bump go-logger to v0.0.30
matthyx May 22, 2026
776394e
fix(otelsetup): bind Prometheus port before setting global MeterProvider
matthyx May 22, 2026
7beaf37
fix(sbommanager): substitute MetricsNoop when nil metrics passed to C…
matthyx May 22, 2026
7ef0455
fix(sbomscanner): use TotalAlloc for scan heap delta to avoid negativ…
matthyx May 22, 2026
52e2500
chore: bump go-logger to v0.0.31
matthyx May 22, 2026
47dcb06
docs: remove ARMO_OTEL_AUTH, update auth header description to creden…
matthyx May 22, 2026
ec3002a
docs: debug listener activates via KS_LOGGER_LEVEL=debug, remove ENAB…
matthyx May 22, 2026
dda23a9
chore: bump go-logger to v0.0.32
matthyx May 22, 2026
fea6131
fix: address remaining PR review comments
matthyx May 22, 2026
e6167d6
feat(metrics): add resource gauges and wire goruntime.Start in main a…
matthyx May 27, 2026
7cacef7
feat(otel): close span↔log correlation gaps via shared context plumbing
matthyx May 27, 2026
be6435a
test(otel): prove span↔log correlation wiring via logtest.Recorder
matthyx May 27, 2026
30d86eb
fix(nodeprofile): guard HTTP client timeout default against unset config
matthyx May 28, 2026
05aec48
fix(metrics): resolve own cgroup for memory.current, add limit gauge
matthyx May 28, 2026
854069e
fix(metrics): resolve own cgroup via authoritative container ID
matthyx May 28, 2026
2e2ff3b
feat(metrics): emit process memory gauges from the sbom-scanner too
matthyx May 28, 2026
12aff41
fix(otel): only overwrite env credentials with non-empty values from …
matthyx May 28, 2026
9e56c73
refactor(sbom): extract sbom-scanner main logic into reusable RunServ…
matthyx May 28, 2026
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
113 changes: 80 additions & 33 deletions cmd/main.go
Original file line number Diff line number Diff line change
Expand Up @@ -4,11 +4,11 @@ import (
"context"
"net/http"
_ "net/http/pprof"
"net/url"
"os"
"os/signal"
"strings"
"syscall"
"time"

"github.com/armosec/armoapi-go/armotypes"
utilsmetadata "github.com/armosec/utils-k8s-go/armometadata"
Expand Down Expand Up @@ -38,8 +38,7 @@ import (
"github.com/kubescape/node-agent/pkg/hostsensormanager"
"github.com/kubescape/node-agent/pkg/malwaremanager"
malwaremanagerv1 "github.com/kubescape/node-agent/pkg/malwaremanager/v1"
"github.com/kubescape/node-agent/pkg/metricsmanager"
metricprometheus "github.com/kubescape/node-agent/pkg/metricsmanager/prometheus"
otelmetrics "github.com/kubescape/node-agent/pkg/metricsmanager/otel"
"github.com/kubescape/node-agent/pkg/networkstream"
networkstreamv1 "github.com/kubescape/node-agent/pkg/networkstream/v1"
"github.com/kubescape/node-agent/pkg/nodeprofilemanager"
Expand All @@ -49,6 +48,7 @@ import (
"github.com/kubescape/node-agent/pkg/objectcache/dnscache"
"github.com/kubescape/node-agent/pkg/objectcache/k8scache"
objectcachev1 "github.com/kubescape/node-agent/pkg/objectcache/v1"
"github.com/kubescape/node-agent/pkg/otelsetup"
"github.com/kubescape/node-agent/pkg/processtree"
containerprocesstree "github.com/kubescape/node-agent/pkg/processtree/container"
processtreecreator "github.com/kubescape/node-agent/pkg/processtree/creator"
Expand All @@ -70,6 +70,8 @@ import (
"github.com/kubescape/node-agent/pkg/validator"
"github.com/kubescape/node-agent/pkg/watcher/dynamicwatcher"
"github.com/kubescape/node-agent/pkg/watcher/seccompprofilewatcher"
goruntime "go.opentelemetry.io/contrib/instrumentation/runtime"
metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func main() {
Expand Down Expand Up @@ -99,14 +101,35 @@ func main() {
logger.L().Info("credentials loaded", helpers.Int("accountLength", len(credentials.Account)))
}

// to enable otel, set OTEL_COLLECTOR_SVC=otel-collector:4317
if otelHost, present := os.LookupEnv("OTEL_COLLECTOR_SVC"); present {
ctx = logger.InitOtel("node-agent",
os.Getenv("RELEASE"),
clusterData.AccountID,
clusterData.ClusterName,
url.URL{Host: otelHost})
defer logger.ShutdownOtel(ctx)
otelShutdown, err := otelsetup.InitProviders(ctx, otelsetup.ProviderConfig{
ServiceName: "node-agent",
ServiceVersion: os.Getenv("RELEASE"),
NodeName: cfg.NodeName,
PodName: cfg.PodName,
Namespace: cfg.NamespaceName,
ClusterName: clusterData.ClusterName,
AccountID: clusterData.AccountID,
AccessKey: accessKey,
})
Comment thread
coderabbitai[bot] marked this conversation as resolved.
if err != nil {
logger.L().Warning("OTEL init failed, running without telemetry", helpers.Error(err))
}
defer func() {
shutdownCtx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
defer cancel()
if otelShutdown != nil {
_ = otelShutdown(shutdownCtx)
}
}()
Comment thread
coderabbitai[bot] marked this conversation as resolved.

// Emit Go runtime metrics only when metrics collection is configured;
// avoids ~2–3 KB/hr of metric volume for deployments without telemetry.
if os.Getenv("OTEL_EXPORTER_OTLP_ENDPOINT") != "" ||
os.Getenv("OTEL_METRICS_EXPORTER") != "" ||
os.Getenv("OTEL_EXPORTER_OTLP_METRICS_ENDPOINT") != "" {
if err := goruntime.Start(goruntime.WithMinimumReadMemStatsInterval(30 * time.Second)); err != nil {
logger.L().Warning("node-agent: Go runtime metrics unavailable", helpers.Error(err))
}
}

// Check if we need to validate the kernel version.
Expand Down Expand Up @@ -174,13 +197,11 @@ func main() {
logger.L().Ctx(ctx).Fatal("error creating the storage client", helpers.Error(err))
}

// Create Prometheus metrics exporter
var prometheusExporter metricsmanager.MetricsManager
if cfg.EnablePrometheusExporter {
prometheusExporter = metricprometheus.NewPrometheusMetric()
} else {
prometheusExporter = metricsmanager.NewMetricsNoop()
}
// Create metrics provider (OTEL SDK; Prometheus scrape endpoint started by otelsetup
// when OTEL_METRICS_EXPORTER=prometheus; OTLP push when OTEL_EXPORTER_OTLP_ENDPOINT set).
// MUST be constructed after otelsetup.InitProviders() so the global MeterProvider is set.
// Always use the OTEL impl — the SDK's own no-op providers handle the "no endpoint" case.
metricsProvider := otelmetrics.NewOTELMetricsManager(resolveOwnContainerID(ctx, k8sClient))

// Create watchers
dWatcher := dynamicwatcher.NewWatchHandler(k8sClient, storageClient.GetStorageClient(), cfg.SkipNamespace)
Expand Down Expand Up @@ -290,13 +311,13 @@ func main() {

if cfg.EnableRuntimeDetection {
// create exporter
exporter := exporters.InitExporters(cfg.Exporters, clusterData.ClusterName, cfg.NodeName, cloudMetadata, clusterUID, armotypes.AlertSourcePlatformK8sAgent)
exporter := exporters.InitExporters(cfg.Exporters, clusterData.ClusterName, cfg.NodeName, cloudMetadata, clusterUID, armotypes.AlertSourcePlatformK8sAgent, metricsProvider)
dWatcher.AddAdaptor(ruleBindingCache)

ruleBindingNotify = make(chan rulebinding.RuleBindingNotify, 100)
ruleBindingCache.AddNotifier(&ruleBindingNotify)

cpc := containerprofilecache.NewContainerProfileCache(cfg, storageClient, k8sObjectCache, prometheusExporter)
cpc := containerprofilecache.NewContainerProfileCache(cfg, storageClient, k8sObjectCache, metricsProvider)
cpc.Start(ctx)
if cpm, ok := containerProfileManager.(*containerprofilemanagerv1.ContainerProfileManager); ok {
cpm.SetCompletionNotifier(cpc)
Expand All @@ -312,7 +333,7 @@ func main() {

adapterFactory := ruleadapters.NewEventRuleAdapterFactory()

celEvaluator, err := cel.NewCEL(objCache, cfg, prometheusExporter)
celEvaluator, err := cel.NewCEL(objCache, cfg, metricsProvider)
if err != nil {
logger.L().Ctx(ctx).Fatal("error creating CEL evaluator", helpers.Error(err))
}
Expand All @@ -321,7 +342,7 @@ func main() {

// create runtimeDetection managers
agentVersion := os.Getenv("AGENT_VERSION")
ruleManager, err = rulemanager.CreateRuleManager(ctx, cfg, k8sClient, ruleBindingCache, objCache, exporter, prometheusExporter, processTreeManager, dnsResolver, nil, ruleCooldown, adapterFactory, celEvaluator, mntnsRegistry, agentVersion)
ruleManager, err = rulemanager.CreateRuleManager(ctx, cfg, k8sClient, ruleBindingCache, objCache, exporter, metricsProvider, processTreeManager, dnsResolver, nil, ruleCooldown, adapterFactory, celEvaluator, mntnsRegistry, agentVersion)
if err != nil {
logger.L().Ctx(ctx).Fatal("error creating RuleManager", helpers.Error(err))
}
Expand Down Expand Up @@ -357,8 +378,8 @@ func main() {
var malwareManager malwaremanager.MalwareManagerClient
if cfg.EnableMalwareDetection {
// create exporter
exporter := exporters.InitExporters(cfg.Exporters, clusterData.ClusterName, cfg.NodeName, cloudMetadata, clusterUID, armotypes.AlertSourcePlatformK8sAgent)
malwareManager, err = malwaremanagerv1.CreateMalwareManager(cfg, k8sClient, cfg.NodeName, clusterData.ClusterName, exporter, prometheusExporter, k8sObjectCache)
exporter := exporters.InitExporters(cfg.Exporters, clusterData.ClusterName, cfg.NodeName, cloudMetadata, clusterUID, armotypes.AlertSourcePlatformK8sAgent, metricsProvider)
malwareManager, err = malwaremanagerv1.CreateMalwareManager(cfg, k8sClient, cfg.NodeName, clusterData.ClusterName, exporter, metricsProvider, k8sObjectCache)
if err != nil {
logger.L().Ctx(ctx).Fatal("error creating MalwareManager", helpers.Error(err))
}
Expand Down Expand Up @@ -406,7 +427,7 @@ func main() {
// Create the SBOM manager
var sbomManager sbommanager.SbomManagerClient
if cfg.EnableSbomGeneration {
sbomManager, err = sbommanagerv1.CreateSbomManager(ctx, cfg, igK8sClient.RuntimeConfig.SocketPath, storageClient, k8sObjectCache, scannerClient, failureReporter)
sbomManager, err = sbommanagerv1.CreateSbomManager(ctx, cfg, igK8sClient.RuntimeConfig.SocketPath, storageClient, k8sObjectCache, scannerClient, failureReporter, metricsProvider)
if err != nil {
logger.L().Ctx(ctx).Fatal("error creating SbomManager", helpers.Error(err))
}
Expand All @@ -419,7 +440,7 @@ func main() {
if cfg.EnableFIM {
// Initialize FIM-specific exporters
fimExportersConfig := cfg.FIM.GetFIMExportersConfig()
fimExporter := exporters.InitExporters(fimExportersConfig, clusterData.ClusterName, cfg.NodeName, cloudMetadata, clusterUID, armotypes.AlertSourcePlatformK8sAgent)
fimExporter := exporters.InitExporters(fimExportersConfig, clusterData.ClusterName, cfg.NodeName, cloudMetadata, clusterUID, armotypes.AlertSourcePlatformK8sAgent, metricsProvider)

fimManager, err = fimmanager.NewFIMManager(cfg, clusterData.ClusterName, fimExporter, cloudMetadata)
if err != nil {
Expand All @@ -434,7 +455,7 @@ func main() {

// Create the container handler
mainHandler, err := containerwatcherv2.CreateIGContainerWatcher(cfg, containerProfileManager, k8sClient,
igK8sClient, dnsManagerClient, prometheusExporter, ruleManager,
igK8sClient, dnsManagerClient, metricsProvider, ruleManager,
malwareManager, sbomManager, &ruleBindingNotify, igK8sClient.RuntimeConfig, nil,
processTreeManager, clusterData.ClusterName, objCache, networkStreamClient, containerProcessTree, thirdPartyTracers)
if err != nil {
Expand All @@ -448,8 +469,8 @@ func main() {
// Start the networkStream
networkStreamClient.Start()

// Start the prometheusExporter
prometheusExporter.Start()
// Start the metrics provider
metricsProvider.Start()

// Start the host sensor manager
if err = hostSensorManager.Start(ctx); err != nil {
Expand Down Expand Up @@ -503,16 +524,42 @@ func main() {
signal.Notify(shutdown, os.Interrupt, syscall.SIGTERM)
sig := <-shutdown

// Exit with success
switch sig {
case os.Interrupt:
logger.L().Info("Received interrupt signal")
os.Exit(utils.ExitCodeSuccess)
case syscall.SIGTERM:
logger.L().Info("Received SIGTERM signal")
os.Exit(utils.ExitCodeSuccess)
default:
logger.L().Info("Received unknown signal")
os.Exit(utils.ExitCodeError)
}
// Return normally so deferred OTEL shutdown flushes traces/metrics/logs.
}

// resolveOwnContainerID returns node-agent's own container ID via the k8s API
// (using the Downward-API POD_NAME / NAMESPACE_NAME env), or "" if it cannot be
// determined. The cgroup memory gauges use it to locate the correct container
// scope in the host-mounted cgroup tree; "" makes them fall back to /proc-based
// resolution. Best-effort: a failure here only degrades those gauges.
func resolveOwnContainerID(ctx context.Context, k8sClient *k8sinterface.KubernetesApi) string {
const containerName = "node-agent"
podName, namespace := os.Getenv("POD_NAME"), os.Getenv("NAMESPACE_NAME")
if podName == "" || namespace == "" {
return ""
}
pod, err := k8sClient.GetKubernetesClient().CoreV1().Pods(namespace).Get(ctx, podName, metav1.GetOptions{})
if err != nil {
logger.L().Warning("resolveOwnContainerID - failed to get own pod, cgroup memory gauges will fall back",
helpers.Error(err), helpers.String("pod", podName), helpers.String("namespace", namespace))
return ""
}
for _, cs := range pod.Status.ContainerStatuses {
if cs.Name == containerName {
// ContainerID is "<runtime>://<id>", e.g. "containerd://9103ee5f..."
if i := strings.LastIndex(cs.ContainerID, "/"); i >= 0 {
return cs.ContainerID[i+1:]
}
return cs.ContainerID
}
}
return ""
}
52 changes: 17 additions & 35 deletions cmd/sbom-scanner/main.go
Original file line number Diff line number Diff line change
@@ -1,48 +1,30 @@
package main

import (
"net"
"context"
"os"
"os/signal"
"syscall"

"github.com/kubescape/go-logger"
"github.com/kubescape/go-logger/helpers"
beUtils "github.com/kubescape/backend/pkg/utils"
sbomscanner "github.com/kubescape/node-agent/pkg/sbomscanner/v1"
pb "github.com/kubescape/node-agent/pkg/sbomscanner/v1/proto"
"google.golang.org/grpc"
_ "modernc.org/sqlite"
)

func main() {
socketPath := os.Getenv("SOCKET_PATH")
if socketPath == "" {
socketPath = "/sbom-comm/scanner.sock"
ctx := context.Background()

// Load ARMO credentials from /etc/credentials (same source as the main agent).
// Fall back to env vars so the binary stays functional in non-ARMO deployments.
accountID := os.Getenv("ACCOUNT_ID")
accessKey := os.Getenv("ACCESS_KEY")
if creds, err := beUtils.LoadCredentialsFromFile("/etc/credentials"); err == nil {
if creds.Account != "" {
accountID = creds.Account
}
if creds.AccessKey != "" {
accessKey = creds.AccessKey
}
}

// Remove stale socket file from a previous run
os.Remove(socketPath)

lis, err := net.Listen("unix", socketPath)
if err != nil {
logger.L().Fatal("failed to listen on socket", helpers.Error(err), helpers.String("path", socketPath))
}

srv := grpc.NewServer()
pb.RegisterSBOMScannerServer(srv, sbomscanner.NewScannerServer())

sigCh := make(chan os.Signal, 1)
signal.Notify(sigCh, syscall.SIGTERM, syscall.SIGINT)

go func() {
sig := <-sigCh
logger.L().Info("received signal, shutting down", helpers.String("signal", sig.String()))
srv.GracefulStop()
os.Remove(socketPath)
}()

logger.L().Info("SBOM scanner sidecar started", helpers.String("socket", socketPath))
if err := srv.Serve(lis); err != nil {
logger.L().Fatal("gRPC server failed", helpers.Error(err))
}
// Run the reusable SBOM scanner server
sbomscanner.RunServer(ctx, accountID, accessKey)
}
36 changes: 35 additions & 1 deletion docs/CONFIGURATION.md
Original file line number Diff line number Diff line change
Expand Up @@ -86,15 +86,49 @@ These environment variables are read directly (not through config file):
| `CONFIG_DIR` | Configuration directory path | No (default: `/etc/config`) |
| `SKIP_KERNEL_VERSION_CHECK` | Skip kernel validation | No |
| `ENABLE_PROFILER` | Enable pprof on port 6060 | No |
| `OTEL_COLLECTOR_SVC` | OpenTelemetry collector (e.g., `otel-collector:4317`) | No |
| `OTEL_EXPORTER_OTLP_ENDPOINT` | OTLP collector endpoint (e.g., `otel.armosec.io:4317`). When unset, all telemetry is silently discarded (no-op). | No |
| `OTEL_METRICS_EXPORTER` | Metrics exporter: `otlp` (push to collector) or `prometheus` (expose `:8080/metrics`). Defaults to `otlp` when endpoint is set, `none` otherwise. | No |
| `OTEL_TRACES_EXPORTER` | Traces exporter: `otlp` or `none`. Defaults to `otlp` when endpoint is set, `none` otherwise. | No |
| `OTEL_SLOW_EVAL_THRESHOLD_MS` | Rule evaluations exceeding this threshold (ms) emit a trace span. Default: `5`. | No |
| `OTEL_DEBUG_PORT` | Port for the debug listener when `KS_LOGGER_LEVEL=debug`. Default: `6062`. | No |
| `OTEL_COLLECTOR_SVC` | **Deprecated** — alias for `OTEL_EXPORTER_OTLP_ENDPOINT`. Will be removed in a future release. | No |
| `PYROSCOPE_SERVER_SVC` | Pyroscope server address | No |
| `APPLICATION_NAME` | Application name for Pyroscope | No (default: `node-agent`) |
| `RELEASE` | Release version for telemetry | No |
| `KS_LOGGER_LEVEL` | Log level: `debug`, `info`, `warning`, `error`. Default: `info`. When `debug`, enables the ring-buffer flush endpoint. | No |
| `KS_LOGGER_NAME` | Logger output format: `zap` (structured JSON) or `pretty` (human-readable). Default: `zap`. | No |
| `MULTIPLY` | Enable pod multiplication (testing) | No |
| `QUEUE_DIR` | Directory for persistent queue | No |
| `MAX_QUEUE_SIZE` | Maximum queue size | No |
| `TEST_NAMESPACE` | Override namespace in tests | No |

### OTEL Notes

**Authentication headers:** When credentials are present in `/etc/credentials` (or via `ACCOUNT_ID` /
`ACCESS_KEY` env vars), `X-API-Key` and `X-Customer-GUID` gRPC metadata headers are injected for every
outbound OTLP RPC, regardless of endpoint hostname. This applies to any collector — ARMO back office,
self-hosted, or otherwise. Credentials are read once at startup; agent restart is required if credentials
are rotated at runtime (known v1 limitation).

**Ring buffer (retroactive log export):** When `KS_LOGGER_LEVEL=debug`, the agent keeps the last
7,500 log records in memory and activates a flush endpoint at `localhost:6062/debug/flush-ring-buffer`
(port configurable via `OTEL_DEBUG_PORT`). A `POST` to that endpoint re-emits all buffered records
through the OTLP log pipeline — useful for recovering startup logs that were emitted before the OTLP
exporter finished connecting. The ring buffer is cleared after flushing; a second call emits nothing.

**Kubernetes event correlation — cross-repo dependency (operator PR, not yet implemented)**

K8s events (OOMKilled, Evicted, CrashLoopBackOff, NodeMemoryPressure, pod rescheduling) are collected
by the **operator**, which runs once per cluster and pushes OTLP logs to `otel.armosec.io:4317` using
the same API credentials. Correlation is automatic via shared `k8s.node.name`, `k8s.pod.name`,
`k8s.namespace.name` resource attributes — no node-agent configuration change required.

Until the operator PR ships, node-agent ACs 1–9 are independent and all pass without K8s event
correlation. The ARMO back office dashboard renders correctly using node-agent signals only; K8s event
correlation is an upgrade, not a dependency.

---

## Configuration File Options

### Core Settings
Expand Down
Loading
Loading