Replace Ruby bosh-monitor with Go implementation#2747
Pull request overview
This PR replaces the legacy Ruby-based bosh-monitor with a Go-based implementation and updates packaging, CI, and integration test scaffolding to build and run the new binary + out-of-process plugins.
Changes:
- Introduces a new Go `bosh-monitor` binary with supporting packages (server, event processing, NATS monitoring, plugin host/protocol, etc.) and Ginkgo/Gomega tests.
- Updates BOSH release packaging/job templates to run the Go binary instead of the Ruby runtime/gem.
- Updates integration support to build the Go binary/plugins and adjusts integration specs/configs for the new log/config formats.
Reviewed changes
Copilot reviewed 156 out of 160 changed files in this pull request and generated 10 comments.
| File | Description |
|---|---|
| src/spec/integration/health_monitor/hm_stateless_spec.rb | Updates integration log parsing to match Go slog output format. |
| src/spec/integration_support/sandbox.rb | Builds the Go monitor for integration tests and runs it with updated PATH/env. |
| src/spec/integration_support/bosh_monitor_manager.rb | Adds integration helper to build Go bosh-monitor + plugin binaries. |
| src/spec/assets/sandbox/health_monitor_without_resurrector.yml.erb | Adjusts sandbox HM config to match new Go monitor expectations. |
| src/Gemfile.lock | Removes Ruby bosh-monitor gem from bundle. |
| src/Gemfile | Removes Ruby bosh-monitor gem entry. |
| src/bosh-monitor/test/integration/integration_suite_test.go | Adds Go integration test suite scaffold (Ginkgo). |
| src/bosh-monitor/spec/unit/bosh/monitor/protocols/tcp_connection_spec.rb | Removes Ruby monitor unit tests (legacy implementation removed). |
| src/bosh-monitor/spec/unit/bosh/monitor/plugins/tsdb_spec.rb | Removes Ruby monitor unit tests (legacy implementation removed). |
| src/bosh-monitor/spec/unit/bosh/monitor/plugins/riemann_spec.rb | Removes Ruby monitor unit tests (legacy implementation removed). |
| src/bosh-monitor/spec/unit/bosh/monitor/plugins/resurrector_helper_spec.rb | Removes Ruby monitor unit tests (legacy implementation removed). |
| src/bosh-monitor/spec/unit/bosh/monitor/plugins/paging_datadog_client_spec.rb | Removes Ruby monitor unit tests (legacy implementation removed). |
| src/bosh-monitor/spec/unit/bosh/monitor/plugins/pagerduty_spec.rb | Removes Ruby monitor unit tests (legacy implementation removed). |
| src/bosh-monitor/spec/unit/bosh/monitor/plugins/logger_spec.rb | Removes Ruby monitor unit tests (legacy implementation removed). |
| src/bosh-monitor/spec/unit/bosh/monitor/plugins/json_spec.rb | Removes Ruby monitor unit tests (legacy implementation removed). |
| src/bosh-monitor/spec/unit/bosh/monitor/plugins/graphite_spec.rb | Removes Ruby monitor unit tests (legacy implementation removed). |
| src/bosh-monitor/spec/unit/bosh/monitor/plugins/event_logger_spec.rb | Removes Ruby monitor unit tests (legacy implementation removed). |
| src/bosh-monitor/spec/unit/bosh/monitor/plugins/email_spec.rb | Removes Ruby monitor unit tests (legacy implementation removed). |
| src/bosh-monitor/spec/unit/bosh/monitor/plugins/dummy_spec.rb | Removes Ruby monitor unit tests (legacy implementation removed). |
| src/bosh-monitor/spec/unit/bosh/monitor/plugins/base_spec.rb | Removes Ruby monitor unit tests (legacy implementation removed). |
| src/bosh-monitor/spec/unit/bosh/monitor/metric_spec.rb | Removes Ruby monitor unit tests (legacy implementation removed). |
| src/bosh-monitor/spec/unit/bosh/monitor/instance_spec.rb | Removes Ruby monitor unit tests (legacy implementation removed). |
| src/bosh-monitor/spec/unit/bosh/monitor/events/base_spec.rb | Removes Ruby monitor unit tests (legacy implementation removed). |
| src/bosh-monitor/spec/unit/bosh/monitor/events/alert_spec.rb | Removes Ruby monitor unit tests (legacy implementation removed). |
| src/bosh-monitor/spec/unit/bosh/monitor/event_processor_spec.rb | Removes Ruby monitor unit tests (legacy implementation removed). |
| src/bosh-monitor/spec/unit/bosh/monitor/director_monitor_spec.rb | Removes Ruby monitor unit tests (legacy implementation removed). |
| src/bosh-monitor/spec/unit/bosh/monitor/config_spec.rb | Removes Ruby monitor unit tests (legacy implementation removed). |
| src/bosh-monitor/spec/unit/bosh/monitor/agent_spec.rb | Removes Ruby monitor unit tests (legacy implementation removed). |
| src/bosh-monitor/spec/support/uaa_helpers.rb | Removes Ruby monitor test support (legacy implementation removed). |
| src/bosh-monitor/spec/support/host_authorizatin.rb | Removes Ruby monitor test support (legacy implementation removed). |
| src/bosh-monitor/spec/support/buffered_logger.rb | Removes Ruby monitor test support (legacy implementation removed). |
| src/bosh-monitor/spec/spec_helper.rb | Removes Ruby monitor spec helper (legacy implementation removed). |
| src/bosh-monitor/spec/gemspec_spec.rb | Removes Ruby gemspec tests (legacy implementation removed). |
| src/bosh-monitor/spec/functional/notifying_plugins_spec.rb | Removes Ruby functional tests (legacy implementation removed). |
| src/bosh-monitor/spec/assets/sample_config.yml | Removes Ruby sample config (legacy implementation removed). |
| src/bosh-monitor/spec/assets/dummy_plugin_config.yml | Removes Ruby dummy plugin config (legacy implementation removed). |
| src/bosh-monitor/pkg/server/server.go | Adds Go HTTP API server implementation (healthz + agent endpoints). |
| src/bosh-monitor/pkg/server/server_test.go | Adds Go tests for server endpoints. |
| src/bosh-monitor/pkg/server/server_suite_test.go | Adds Ginkgo suite for server package. |
| src/bosh-monitor/pkg/resurrection/resurrection_suite_test.go | Adds Ginkgo suite for resurrection package. |
| src/bosh-monitor/pkg/resurrection/manager_test.go | Adds resurrection manager rule parsing/behavior tests. |
| src/bosh-monitor/pkg/processor/processor_suite_test.go | Adds Ginkgo suite for processor package. |
| src/bosh-monitor/pkg/processor/event_processor.go | Adds Go event processor (validation, dedupe, pruning, dispatch). |
| src/bosh-monitor/pkg/processor/event_processor_test.go | Adds tests for Go event processor. |
| src/bosh-monitor/pkg/pluginproto/protocol_suite_test.go | Adds Ginkgo suite for plugin protocol package. |
| src/bosh-monitor/pkg/pluginhost/pluginhost_suite_test.go | Adds Ginkgo suite for plugin host package. |
| src/bosh-monitor/pkg/pluginhost/host_test.go | Adds tests for plugin host command handling and startup behavior. |
| src/bosh-monitor/pkg/nats/nats_suite_test.go | Adds Ginkgo suite for NATS package. |
| src/bosh-monitor/pkg/nats/director_monitor.go | Adds Go director-alert subscription monitor. |
| src/bosh-monitor/pkg/nats/director_monitor_test.go | Adds initial unit tests for director monitor (needs strengthening). |
| src/bosh-monitor/pkg/nats/client.go | Adds Go NATS client with TLS and startup retry logic. |
| src/bosh-monitor/pkg/instance/instance.go | Adds Go instance model + formatting helpers. |
| src/bosh-monitor/pkg/instance/instance_suite_test.go | Adds Ginkgo suite for instance package. |
| src/bosh-monitor/pkg/instance/deployment.go | Adds Go deployment model and agent/instance bookkeeping. |
| src/bosh-monitor/pkg/instance/agent.go | Adds Go agent model and timeout/rogue logic. |
| src/bosh-monitor/pkg/events/metric.go | Adds Go metric model. |
| src/bosh-monitor/pkg/events/events_suite_test.go | Adds Ginkgo suite for events package. |
| src/bosh-monitor/pkg/events/base.go | Adds Go event factory/validation helpers. |
| src/bosh-monitor/pkg/director/director_suite_test.go | Adds Ginkgo suite for director package. |
| src/bosh-monitor/pkg/director/auth.go | Adds Go auth provider logic (basic + UAA token flow, CA selection). |
| src/bosh-monitor/pkg/config/config.go | Adds Go config loader with defaults/validation. |
| src/bosh-monitor/pkg/config/config_suite_test.go | Adds Ginkgo suite for config package. |
| src/bosh-monitor/main.go | Adds Go entrypoint (-c config) with slog logging and signal handling. |
| src/bosh-monitor/lib/bosh/monitor/yaml_helper.rb | Removes Ruby monitor implementation (deleted). |
| src/bosh-monitor/lib/bosh/monitor/version.rb | Removes Ruby monitor implementation (deleted). |
| src/bosh-monitor/lib/bosh/monitor/resurrection_manager.rb | Removes Ruby monitor implementation (deleted). |
| src/bosh-monitor/lib/bosh/monitor/protocols/tsdb_connection.rb | Removes Ruby monitor implementation (deleted). |
| src/bosh-monitor/lib/bosh/monitor/protocols/tcp_connection.rb | Removes Ruby monitor implementation (deleted). |
| src/bosh-monitor/lib/bosh/monitor/protocols/graphite_connection.rb | Removes Ruby monitor implementation (deleted). |
| src/bosh-monitor/lib/bosh/monitor/plugins/tsdb.rb | Removes Ruby monitor implementation (deleted). |
| src/bosh-monitor/lib/bosh/monitor/plugins/riemann.rb | Removes Ruby monitor implementation (deleted). |
| src/bosh-monitor/lib/bosh/monitor/plugins/resurrector_helper.rb | Removes Ruby monitor implementation (deleted). |
| src/bosh-monitor/lib/bosh/monitor/plugins/README.md | Removes Ruby plugin README (deleted). |
| src/bosh-monitor/lib/bosh/monitor/plugins/paging_datadog_client.rb | Removes Ruby monitor implementation (deleted). |
| src/bosh-monitor/lib/bosh/monitor/plugins/pagerduty.rb | Removes Ruby monitor implementation (deleted). |
| src/bosh-monitor/lib/bosh/monitor/plugins/logger.rb | Removes Ruby monitor implementation (deleted). |
| src/bosh-monitor/lib/bosh/monitor/plugins/json.rb | Removes Ruby monitor implementation (deleted). |
| src/bosh-monitor/lib/bosh/monitor/plugins/http_request_helper.rb | Removes Ruby monitor implementation (deleted). |
| src/bosh-monitor/lib/bosh/monitor/plugins/graphite.rb | Removes Ruby monitor implementation (deleted). |
| src/bosh-monitor/lib/bosh/monitor/plugins/event_logger.rb | Removes Ruby monitor implementation (deleted). |
| src/bosh-monitor/lib/bosh/monitor/plugins/email.rb | Removes Ruby monitor implementation (deleted). |
| src/bosh-monitor/lib/bosh/monitor/plugins/dummy.rb | Removes Ruby monitor implementation (deleted). |
| src/bosh-monitor/lib/bosh/monitor/plugins/datadog.rb | Removes Ruby monitor implementation (deleted). |
| src/bosh-monitor/lib/bosh/monitor/plugins/base.rb | Removes Ruby monitor implementation (deleted). |
| src/bosh-monitor/lib/bosh/monitor/metric.rb | Removes Ruby monitor implementation (deleted). |
| src/bosh-monitor/lib/bosh/monitor/instance.rb | Removes Ruby monitor implementation (deleted). |
| src/bosh-monitor/lib/bosh/monitor/events/heartbeat.rb | Removes Ruby monitor implementation (deleted). |
| src/bosh-monitor/lib/bosh/monitor/events/base.rb | Removes Ruby monitor implementation (deleted). |
| src/bosh-monitor/lib/bosh/monitor/events/alert.rb | Removes Ruby monitor implementation (deleted). |
| src/bosh-monitor/lib/bosh/monitor/event_processor.rb | Removes Ruby monitor implementation (deleted). |
| src/bosh-monitor/lib/bosh/monitor/errors.rb | Removes Ruby monitor implementation (deleted). |
| src/bosh-monitor/lib/bosh/monitor/director.rb | Removes Ruby monitor implementation (deleted). |
| src/bosh-monitor/lib/bosh/monitor/director_monitor.rb | Removes Ruby monitor implementation (deleted). |
| src/bosh-monitor/lib/bosh/monitor/deployment.rb | Removes Ruby monitor implementation (deleted). |
| src/bosh-monitor/lib/bosh/monitor/core_ext.rb | Removes Ruby monitor implementation (deleted). |
| src/bosh-monitor/lib/bosh/monitor/config.rb | Removes Ruby monitor implementation (deleted). |
| src/bosh-monitor/lib/bosh/monitor/auth_provider.rb | Removes Ruby monitor implementation (deleted). |
| src/bosh-monitor/lib/bosh/monitor/api_controller.rb | Removes Ruby monitor implementation (deleted). |
| src/bosh-monitor/lib/bosh/monitor/agent.rb | Removes Ruby monitor implementation (deleted). |
| src/bosh-monitor/lib/bosh/monitor.rb | Removes Ruby monitor implementation (deleted). |
| src/bosh-monitor/go.mod | Adds Go module definition for new monitor. |
| src/bosh-monitor/cmd/plugins/pluginlib/pluginlib.go | Adds shared plugin runtime library for out-of-process plugins. |
| src/bosh-monitor/cmd/plugins/pluginlib/pluginlib_test.go | Adds tests for plugin runtime library. |
| src/bosh-monitor/cmd/plugins/pluginlib/pluginlib_suite_test.go | Adds Ginkgo suite for pluginlib package. |
| src/bosh-monitor/cmd/plugins/hm-tsdb/main.go | Adds TSDB plugin (Go) implementation. |
| src/bosh-monitor/cmd/plugins/hm-riemann/main.go | Adds Riemann plugin (Go) implementation. |
| src/bosh-monitor/cmd/plugins/hm-pagerduty/main.go | Adds PagerDuty plugin (Go) implementation. |
| src/bosh-monitor/cmd/plugins/hm-logger/main.go | Adds logger plugin (Go) implementation. |
| src/bosh-monitor/cmd/plugins/hm-json/main.go | Adds JSON fanout plugin (Go) implementation. |
| src/bosh-monitor/cmd/plugins/hm-graphite/main.go | Adds Graphite plugin (Go) implementation. |
| src/bosh-monitor/cmd/plugins/hm-event-logger/main.go | Adds event-logger plugin (Go) implementation. |
| src/bosh-monitor/cmd/plugins/hm-dummy/main.go | Adds dummy plugin (Go) implementation. |
| src/bosh-monitor/cmd/plugins/hm-datadog/main.go | Adds Datadog plugin (Go) implementation. |
| src/bosh-monitor/.golangci.yml | Adds golangci-lint config for the new Go module. |
| packages/health_monitor/spec | Updates package spec to remove Ruby dependencies and ship new monitor sources. |
| packages/health_monitor/packaging | Updates packaging to build Go binary + plugins. |
| jobs/health_monitor/templates/health_monitor | Updates job launcher to run the Go binary (removes Ruby env). |
| jobs/health_monitor/templates/bpm.yml | Updates BPM config to run Go binary with args, removes Ruby env vars/volumes. |
| jobs/health_monitor/spec | Removes Ruby package dependency from health_monitor job. |
| .github/workflows/ruby.yml | Removes the Ruby monitor test matrix entry. |
| .github/workflows/go.yml | Adds lint/test jobs for the new Go bosh-monitor module. |
Comments suppressed due to low confidence (1)
packages/health_monitor/spec:6
`go build` is invoked in this package, but the package spec declares no dependency that would provide a Go toolchain during BOSH compilation. Unless the compilation environment already has `go` available, this will fail to compile the release. Consider either (a) adding a `golang-*` package dependency, or (b) shipping prebuilt binaries (like other packages in this release) and removing the compile-time `go build` requirement.
---
name: health_monitor
files:
- bosh-monitor/**/*
Follow-up to the concurrency/liveness fixes, informed by comparing against the
Ruby bosh-monitor still on main:
- email (F5): Ruby connects in plaintext and upgrades via STARTTLS
(Net::SMTP.new(host, port, starttls:)); the Go plugin used implicit TLS
(tls.Dial), which fails against the STARTTLS submission port (587). Switch to
smtp.Dial + StartTLS and honour the Ruby auth option (PLAIN / CRAM-MD5).
- events (F4): reject a non-numeric severity instead of silently treating it as
0, matching Ruby's "non-negative integer expected".
- pluginproto (L5): drop omitempty from the http_response Status so a real
status of 0 (request error) is sent explicitly rather than dropped.
- director (L4/L7): NewClient/NewAuthProvider now take a typed director.Config
instead of map[string]interface{}, removing the fmt.Sprintf("%v", ...)
coercions and the field-by-field map building in the runner; call sites and
tests updated.
Confirmed faithful to Ruby (no change): event_mbus is dead in Ruby too, and the
status server binds 127.0.0.1 in both.
go build / go vet / gofmt / go test -race all clean; offline vendored build OK.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
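The pluginproto (L5) change is easiest to see by comparing the two encodings. The following is a minimal sketch, not the actual pluginproto type: the struct names and the `id` field are hypothetical, and only the presence or absence of `omitempty` on the status field reflects the commit.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// withOmit mirrors the old shape (hypothetical names): a zero Status
// disappears from the encoded JSON.
type withOmit struct {
	ID     string `json:"id"`
	Status int    `json:"status,omitempty"`
}

// withoutOmit mirrors the new shape: Status is always serialised,
// so a status of 0 (request error) reaches the plugin explicitly.
type withoutOmit struct {
	ID     string `json:"id"`
	Status int    `json:"status"`
}

// encode marshals v and returns the JSON text.
func encode(v any) string {
	b, err := json.Marshal(v)
	if err != nil {
		return "error: " + err.Error()
	}
	return string(b)
}

func main() {
	fmt.Println(encode(withOmit{ID: "req-1"}))    // {"id":"req-1"}
	fmt.Println(encode(withoutOmit{ID: "req-1"})) // {"id":"req-1","status":0}
}
```

With `omitempty`, a receiver cannot tell "status was 0" from "status was never set"; without it the zero is on the wire.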
Replace map[string]interface{} with typed structs at the director->instance
boundary:
- director: add Deployment, Instance, ResurrectionConfig response types;
Deployments/GetDeploymentInstances/ResurrectionConfig now unmarshal into them
and return typed slices. A small FlexStr type tolerates `index` arriving as a
JSON number or string, preserving the previous fmt.Sprintf("%v") behaviour.
- instance: the Director interface, Manager sync methods (SyncDeployments,
SyncDeploymentState, syncInstances, SyncInstancesPublic) and the
NewDeployment/NewInstance/CreateDeployment/CreateInstance constructors now take
typed director values, deleting the per-field map digging and "%v" coercions.
- resurrection: UpdateRules takes []director.ResurrectionConfig.
- tests updated to construct typed values.
go build / vet / gofmt / go test -race all clean.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
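A type such as FlexStr could look roughly like the sketch below. This is an illustration under assumptions, not the code from the commit: the `Instance` struct, its JSON field names, and the `parseInstance` helper are hypothetical; only the idea that `index` may arrive as a JSON number or string comes from the message above.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// FlexStr accepts either a JSON string or a JSON number and stores its text form.
type FlexStr string

// UnmarshalJSON tries a string first, then falls back to a number literal.
func (f *FlexStr) UnmarshalJSON(b []byte) error {
	var s string
	if err := json.Unmarshal(b, &s); err == nil {
		*f = FlexStr(s)
		return nil
	}
	var n json.Number
	if err := json.Unmarshal(b, &n); err != nil {
		return fmt.Errorf("expected a string or number: %w", err)
	}
	*f = FlexStr(n.String())
	return nil
}

// Instance is a hypothetical director response type used only for this example.
type Instance struct {
	Job   string  `json:"job"`
	Index FlexStr `json:"index"`
}

// parseInstance decodes one instance document.
func parseInstance(raw string) (Instance, error) {
	var inst Instance
	err := json.Unmarshal([]byte(raw), &inst)
	return inst, err
}

func main() {
	for _, raw := range []string{`{"job":"web","index":3}`, `{"job":"web","index":"3"}`} {
		inst, err := parseInstance(raw)
		fmt.Println(inst.Index, err)
	}
}
```

Both payloads decode to the same `Index` value, which is the behaviour the earlier `fmt.Sprintf("%v")` coercion provided.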
Replace the central `Process(kind string, data map[string]interface{})` seam
with a typed `Process(event events.Event)`, and type the deployment-health
fields the resurrector reads:
- processor: Process now takes a constructed events.Event; it assigns a
generated ID if missing, validates, de-duplicates and dispatches. The three
consumer interfaces (instance.EventProcessor, nats.DirectorAlertProcessor,
pluginhost.AlertEmitter) and their fakes are updated to match.
- events: add NewAlertFromData(AlertData) so the monitor builds its own alerts
(vm_health/deployment_health) from typed fields instead of map literals;
add typed JobsToInstanceIDs/TotalAgentCount; move ID assignment into a shared
EnsureID; track severity presence with a field so validation no longer
depends on the attributes map (which monitor-generated alerts don't carry).
- pluginproto.EventData: add typed JobsToInstanceIDs/TotalAgentCount and drop
the generic Attributes map; eventToProto populates the typed fields.
- resurrector/consul: read event.Category/Deployment/JobsToInstanceIDs/
TotalAgentCount/JobState typed fields instead of digging in Attributes.
- manager: onAlert/onHeartbeat and the analyze paths build typed events.
JSON ingress (agent heartbeats/alerts, director alerts, plugin emit_alert)
still decodes into the typed events via NewAlert/NewHeartbeat — the maps are
confined to that boundary. Added a processor test covering the
monitor-generated (attributes-less) alert path.
go build / vet / gofmt / go test -race all clean (3x).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Final typed-boundary cleanup now that nothing reads the generic attributes map:
- events: drop the unused Event.Attributes() method and the Alert.Attrs /
Heartbeat.Attrs fields. Heartbeat gains a typed NumberOfProcesses *int parsed
from the payload; ToHash and eventToProto use it.
- pluginproto: EventData.NumberOfProcesses is now *int instead of interface{}.
golangci-lint (standard set) reports 0 issues; go build / vet / gofmt /
go test -race all clean.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…tifier
- pluginproto: add a godoc comment explaining why NumberOfProcesses is *int
(nil = absent vs 0 = zero processes), unlike all other scalar EventData fields.
- pluginlib: name the 100-capacity channel buffers as eventChanSize/cmdChanSize
with rationale comments instead of bare magic numbers.
- main: add `defer signal.Stop(sigCh)` so the signal notifier is properly
deregistered on shutdown, guarding against goroutine leaks if main is ever
refactored into a long-lived service.
… field
Add HTTPConfig.Host (defaults to "127.0.0.1") so the listen address can be overridden without a code change. Production behaviour is unchanged: the default matches the Ruby implementation and the existing hard-coded value. Update server.New signature, runner, and test call sites.
usableCACertFile read the CA cert file to check non-empty, then tlsConfigForCAFile read it again to parse PEM. This was a TOCTOU race (the file can change between reads) and wasted a syscall. Inline the empty-file guard into tlsConfigForCAFile so the file is read exactly once. usableCACertFile is still used by the auth provider to pick the UAA cert path (where it only needs to check usability, not parse PEM).
…leaks
fetchUAAToken previously created a new *http.Client and *http.Transport on every invocation. Since the transport is never explicitly closed, its internal connection pool, TLS sessions, and idle goroutines leaked after each token refresh. Build and store the UAA HTTP client once in NewAuthProvider (at the same time the TLS config is already determined from the cert path) and reuse it across all fetchUAAToken calls.
…nic on shutdown
The goroutine spawned inside the ticker loop sent directly via `cmds <- pluginlib.LogCommand(...)`. After the plugin context is cancelled and pluginlib stops the writer, a late-arriving send would block indefinitely (if the channel was full and no longer being drained) or panic (if it was closed). Replace bare channel sends with pluginlib.SendCommand(ctx, cmds, ...) which selects on ctx.Done() so the goroutine exits cleanly on shutdown.
restartAttempts was incremented on every crash and never reset, so a plugin that crashed once, ran for a week, and crashed again would use an increasingly long backoff. After 6 crashes separated by healthy runs the delay would reach the 60 s cap permanently. Record p.startedAt on each Start(). In waitForExit, if the process ran for at least restartBackoffMax before crashing, reset restartAttempts to 0 so the next restart begins with the base 1 s delay.
…Deployments flag invariant
SyncInstancesPublic and SyncAgentsPublic were exported solely "for testing" but were never called by any test. Exporting internal helpers pollutes the API and leaks implementation details. Remove them.
Add FetchDeployments tests that document and pin the invariant:
- directorInitialSyncDone is set only after a full, successful sync cycle
- it remains false when Deployments() fails
- it remains false when GetDeploymentInstances() fails for any deployment
…down
When all maxConcurrentPluginRequests goroutine slots are occupied, the
plugin reader goroutine blocks on h.sem <- struct{}{}. If Shutdown is
called at this moment the reader goroutine is not tracked by the
WaitGroup, so h.wg.Wait() returns while the reader is still dangling.
Add a stopped channel (closed once via sync.Once in Shutdown) and
replace the bare channel send with a select so the acquire exits
immediately when the host is shutting down, allowing the reader goroutine
to observe the closed stdout pipe and exit cleanly.
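The acquire path could be sketched as below. The sem and stopped channels and the sync.Once come from the message above; the `host` struct layout and the `acquire`/`newHost` helpers are hypothetical simplifications of the plugin host.

```go
package main

import (
	"fmt"
	"sync"
)

// host keeps only the fields needed to show the shutdown-aware acquire.
type host struct {
	sem      chan struct{} // one token per in-flight plugin request
	stopped  chan struct{} // closed exactly once by Shutdown
	stopOnce sync.Once
}

func newHost(maxConcurrent int) *host {
	return &host{
		sem:     make(chan struct{}, maxConcurrent),
		stopped: make(chan struct{}),
	}
}

// acquire claims a slot, or returns false if the host shuts down while waiting.
func (h *host) acquire() bool {
	select {
	case h.sem <- struct{}{}:
		return true
	case <-h.stopped:
		return false
	}
}

// Shutdown is safe to call more than once.
func (h *host) Shutdown() {
	h.stopOnce.Do(func() { close(h.stopped) })
}

func main() {
	h := newHost(1)
	fmt.Println(h.acquire()) // true: the only slot is now taken

	result := make(chan bool)
	go func() { result <- h.acquire() }() // blocks: no free slot

	h.Shutdown()
	fmt.Println(<-result) // false: the waiter was released by Shutdown
}
```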
validPayload duplicated a subset of the validation already present in Alert.Validate(). It required id, severity, title, summary, and created_at; Alert.Validate() covers id, severity (with numeric-type check), title, and created_at (as "timestamp"), and treats summary as optional (falling back to Title, matching the Ruby behaviour).
Replace the bespoke check with events.NewAlert + Validate() so validation logic lives in one place. Use Create+Validate rather than CreateAndValidate to avoid EnsureID auto-generating a UUID for director alerts: an absent id must remain absent so Validate() correctly rejects it.
Update tests to reflect the new error-log phrasing and remove the now-incorrect "summary is required" expectation.
uaaTokenHeader held mu for the full duration of fetchUAAToken (up to the 30 s HTTP client timeout). Every concurrent director request calls AuthHeader which calls uaaTokenHeader, so a token refresh serialised ALL inflight director calls behind the mutex. Switch mu to sync.RWMutex and add a separate fetchMu (sync.Mutex) that guards only the token-fetch code path. The fast path (cached token still valid) now uses RLock and returns without blocking other readers. The slow path acquires fetchMu so exactly one goroutine makes the HTTP call; the rest wait on fetchMu and then read the freshly-stored token, avoiding a pile-up of concurrent UAA requests.
Two issues in Client.Subscribe:
1. Iterating a map[string]string has non-deterministic order. Replace with an
ordered slice of structs so the subscription sequence is stable and readable.
2. Subscription objects were discarded (_). If a later Subscribe call fails, the
earlier subscriptions that succeeded are orphaned on the NATS connection with
no way to clean them up. Track each *nats.Subscription and unsubscribe all
successful ones before returning an error, so a retry starts clean.
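The rollback behaviour can be sketched without a NATS server. Everything here is a stand-in: the `subscription` and `conn` interfaces, the fake connection, and the placeholder subject names are hypothetical, chosen so the example runs with the standard library only.

```go
package main

import (
	"errors"
	"fmt"
)

// subscription and conn are hypothetical stand-ins for *nats.Subscription
// and the NATS connection.
type subscription interface{ Unsubscribe() error }

type conn interface {
	Subscribe(subject string) (subscription, error)
}

// subscribeAll subscribes in slice order and undoes earlier successes on failure.
func subscribeAll(c conn, subjects []string) ([]subscription, error) {
	subs := make([]subscription, 0, len(subjects))
	for _, subject := range subjects {
		s, err := c.Subscribe(subject)
		if err != nil {
			for _, done := range subs {
				_ = done.Unsubscribe() // best effort; the original error is what matters
			}
			return nil, fmt.Errorf("subscribe %q: %w", subject, err)
		}
		subs = append(subs, s)
	}
	return subs, nil
}

// fakeConn fails on one subject and records which subscriptions are live.
type fakeConn struct {
	failOn string
	active map[string]bool
}

type fakeSub struct {
	c       *fakeConn
	subject string
}

func (s fakeSub) Unsubscribe() error { delete(s.c.active, s.subject); return nil }

func (c *fakeConn) Subscribe(subject string) (subscription, error) {
	if subject == c.failOn {
		return nil, errors.New("connection closed")
	}
	c.active[subject] = true
	return fakeSub{c: c, subject: subject}, nil
}

func main() {
	c := &fakeConn{failOn: "subject.c", active: map[string]bool{}}
	_, err := subscribeAll(c, []string{"subject.a", "subject.b", "subject.c"})
	fmt.Println(err)
	fmt.Println("live subscriptions after failure:", len(c.active))
}
```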
validate() previously only checked director.endpoint. Any other missing required field (mbus endpoint, TLS paths, http port) produced no error at startup; the process would panic or log confusing errors later.
Collect all validation errors into a slice and return them in one message so operators see every problem at once. New checks:
- http.port in [1, 65535]
- mbus.endpoint, server_ca_path, client_certificate_path, client_private_key_path
- director.endpoint (was: single-error; now part of multi-error)
- all intervals positive (defaults ensure > 0 under normal operation)
- each plugin must have a non-empty name
Update test fixtures to include the now-required mbus and http.port fields, and add new tests for each newly-validated field.
The Event interface previously declared nine methods. Of these, only
ID(), Kind(), and Validate() are called via the interface anywhere outside
pkg/events:
- event_processor.go calls Validate(), Kind(), and ID()
- pluginhost/host.go calls Kind() and ID() then type-switches to access
concrete fields
The remaining methods (Valid, ToHash, ToJSON, ToPlainText, ShortDescription,
Metrics) are either called within pkg/events itself or through concrete types.
Keeping them in the interface forces every mock / test double to implement
six methods it never needs, and advertises a larger contract than any caller
actually relies on.
Remove the six unused methods from the interface; they remain as concrete
methods on *Alert and *Heartbeat. Update the one test that called Valid()
through the interface to use Validate() instead.
Command.Alert was map[string]interface{}, which gave no compile-time
guarantee that required fields were present. A plugin emitting a malformed
alert discovered the error only at runtime when the host tried to validate it.
Introduce pluginproto.AlertPayload with typed fields (Severity int, Title
string, Summary/Source/Deployment string, CreatedAt int64). Update:
- pluginproto.NewEmitAlertCommand to accept *AlertPayload
- pluginlib.EmitAlertCommand / AlertPayload type alias (plugin-side)
- pluginhost.HandleCommand to convert AlertPayload → events.AlertData
- hm-resurrector to construct typed payloads
- All tests that previously used map[string]interface{} for alert payloads
…yment
unhealthyCountForDeployment() now filters by the jobInstanceKey.Deployment field so that unhealthy agents in an unrelated deployment cannot inflate another deployment's meltdown percentage. In a multi-deployment environment the old unhealthyCount() added every entry in the tracker regardless of which deployment they belonged to, which could suppress resurrection for healthy deployments that happened to share a tracker with a melting-down one.
ai-assisted=yes
[TNZ-113909] Convert BOSH health monitor job to Golang
record() and unhealthyCountForDeployment() are called only from the outer select loop in runResurrector, which runs in a single goroutine. The inner goroutines spawned for managed-state HTTP requests never touch the tracker, so the sync.Mutex was dead weight with no shared state to protect.
ai-assisted=yes
[TNZ-113909] Convert BOSH health monitor job to Golang
Ruby's resurrector plugin split jobs_to_instances into resurrection-enabled and resurrection-disabled halves, emitting an explicit alert (severity=1, title='Resurrection is disabled by resurrection config') for the disabled half so operators know why those VMs were not resurrected. The Go rewrite silently dropped resurrection-disabled instances with no operator-visible alert.
Fix by collecting disabled instances separately in analyzeDeploymentAgents and AnalyzeInstances and emitting the alert from the manager before processing the enabled deployment_health alert.
ai-assisted=yes
[TNZ-113909] Convert BOSH health monitor job to Golang
Ruby's agent stores number_of_processes as nil when the heartbeat does not include the field. The check agent.job_state == 'running' && agent.number_of_processes == 0 evaluates nil == 0 as false in Ruby, so agents that never reported a process count are NOT counted as unhealthy. The Go rewrite used int (zero-valued by default), causing agents with absent number_of_processes to be falsely counted as unhealthy when job_state is 'running'.
Fix by changing the field to *int (nil when absent) and updating UnhealthyAgents to check agent.NumberOfProcesses != nil && *agent.NumberOfProcesses == 0.
ai-assisted=yes
[TNZ-113909] Convert BOSH health monitor job to Golang
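The nil-versus-zero distinction can be demonstrated with a small decode. The check itself is the one quoted above; the `heartbeat` struct, its JSON tags, and the `unhealthyFromJSON` helper are hypothetical and exist only to make the example runnable.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// heartbeat is a hypothetical subset of the agent heartbeat payload.
type heartbeat struct {
	JobState          string `json:"job_state"`
	NumberOfProcesses *int   `json:"number_of_processes"` // nil when the field is absent
}

// isUnhealthy mirrors the Ruby semantics: an absent count is never "zero processes".
func isUnhealthy(hb heartbeat) bool {
	return hb.JobState == "running" &&
		hb.NumberOfProcesses != nil &&
		*hb.NumberOfProcesses == 0
}

// unhealthyFromJSON decodes a payload and applies the check.
func unhealthyFromJSON(raw string) (bool, error) {
	var hb heartbeat
	if err := json.Unmarshal([]byte(raw), &hb); err != nil {
		return false, err
	}
	return isUnhealthy(hb), nil
}

func main() {
	absent, _ := unhealthyFromJSON(`{"job_state":"running"}`)
	zero, _ := unhealthyFromJSON(`{"job_state":"running","number_of_processes":0}`)
	fmt.Println(absent, zero) // false true
}
```

With a plain `int` field both payloads would decode to 0 and both agents would be counted as unhealthy.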
sync.Map is optimised for a stable-key, many-concurrent-reader access
pattern. pendingChans has low contention and only two accessor sites; a
plain map[string]chan *pluginlib.EventEnvelope guarded by sync.Mutex is
clearer, avoids the interface{} type assertion on Load, and is easier to
reason about under race-detector inspection.
ai-assisted=yes
[TNZ-113909] Convert BOSH health monitor job to Golang
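A mutex-guarded map for pending channels might look like this sketch. The map type follows the message above in spirit; the local `EventEnvelope` type, the `pending` struct, and the `register`/`deliver` helpers are assumptions, and the one-slot buffer is chosen here so a delivery never blocks.

```go
package main

import (
	"fmt"
	"sync"
)

// EventEnvelope is a local stand-in for pluginlib.EventEnvelope.
type EventEnvelope struct{ ID string }

// pending is a hypothetical wrapper around the pendingChans map.
type pending struct {
	mu    sync.Mutex
	chans map[string]chan *EventEnvelope
}

// register creates the reply channel for a request ID.
func (p *pending) register(id string) chan *EventEnvelope {
	ch := make(chan *EventEnvelope, 1)
	p.mu.Lock()
	p.chans[id] = ch
	p.mu.Unlock()
	return ch
}

// deliver hands env to the waiter for id, if one is still registered.
// The lookup is typed, so no interface{} assertion is needed.
func (p *pending) deliver(id string, env *EventEnvelope) bool {
	p.mu.Lock()
	ch, ok := p.chans[id]
	delete(p.chans, id)
	p.mu.Unlock()
	if !ok {
		return false
	}
	ch <- env // cannot block: buffer of one, single delivery per ID
	return true
}

func main() {
	p := &pending{chans: map[string]chan *EventEnvelope{}}
	ch := p.register("req-1")
	fmt.Println(p.deliver("req-1", &EventEnvelope{ID: "req-1"}))
	fmt.Println((<-ch).ID)
	fmt.Println(p.deliver("req-1", &EventEnvelope{ID: "req-1"}))
}
```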
NATS messages arrive as []byte from the NATS library. The previous
implementation converted msg.Data to string, boxed it into interface{},
and then switched on the runtime type inside ProcessEvent to convert it
back to []byte for json.Unmarshal — a round-trip that obscured where
decoding happened.
Change MessageHandler and ProcessEvent to accept []byte. The NATS
subscription now passes msg.Data directly, decoding is done once at the
top of ProcessEvent, and the interface{} type switch is removed.
ai-assisted=yes
[TNZ-113909] Convert BOSH health monitor job to Golang
The Ruby PagerDuty plugin uses the system CA bundle for its HTTPS request. The Go rewrite constructed the HTTP client with an empty tls.Config (which also uses the system bundle), but offered no mechanism for operators in air-gapped environments to supply a custom root CA.
Add CACert string to pagerdutyOptions (json: "ca_cert"). When set, the field is treated as a file path to a PEM-encoded CA certificate, mirroring the ca_cert option convention used by the resurrector and director client.
ai-assisted=yes
[TNZ-113909] Convert BOSH health monitor job to Golang
Replace []string + manual string concatenation with []error + errors.Join (Go 1.20+). The returned error now implements Unwrap() []error so callers can use errors.Is/errors.As for targeted assertions instead of substring matching. The rendered message is the same set of lines, just without the redundant ' - ' prefix.
ai-assisted=yes
[TNZ-113909] Convert BOSH health monitor job to Golang
crashed processing a delete-deployment --force. The health monitor detects the stopped VM as "missing" and sends a recreate/resurrect request to the Director, causing a race condition with delete-deployment.
…ependency
Copies the vendored golang-1.26-linux package (from cloudfoundry/bosh-package-golang-release) which was already present on experiment-golang-bosh-nats-sync. Without this package, packages/health_monitor/spec had no golang dependency, so BOSH would not guarantee the Go toolchain was compiled before health_monitor during a release build. The packages/health_monitor/packaging script already loops for golang-* packages; the spec declaration ensures the compilation ordering is explicit.
ai-assisted=yes
[TNZ-113909] Convert BOSH health monitor job to Golang
Summary
This PR replaces the Ruby `bosh-monitor` implementation with a new Go-based binary that provides the same functionality, including:
Changes
Test plan