Replace Ruby bosh-monitor with Go implementation #2747

Draft
aramprice wants to merge 45 commits into main from experiment-golang-bosh-monitor

aramprice commented Jun 19, 2026
Member

Summary

This PR replaces the Ruby `bosh-monitor` implementation with a new Go-based binary that provides the same functionality, including:

  • NATS subscription for agent heartbeats, alerts, and shutdown events
  • Director polling for deployment/instance synchronization
  • Plugin host architecture with out-of-process plugins (hm-logger, hm-resurrector, hm-event-logger, hm-datadog, hm-pagerduty, hm-riemann, hm-graphite, hm-email, hm-consul, hm-json, hm-tsdb)
  • HTTP API (healthz, unresponsive_agents, unhealthy_agents, etc.)
  • TLS peer verification for Director and UAA connections
  • NATS connection retry logic during startup
  • `director_ca_cert` / `uaa_ca_cert` support for CA-bundle TLS
  • Resurrection config filtering: only instances with resurrection enabled generate `deployment_health` alerts

Changes

  • `src/bosh-monitor/`: Delete Ruby source (lib/, spec/, bin/, gemspec). Add Go implementation.
  • `src/Gemfile` / `src/Gemfile.lock`: Remove `bosh-monitor` gem.
  • `.github/workflows/go.yml`: Add `bosh-monitor-lint` and `bosh-monitor-test` jobs.
  • `.github/workflows/ruby.yml`: Remove `monitor:parallel` matrix entry (Ruby code deleted).
  • `jobs/health_monitor/`: Switch from Ruby runtime to Go binary; remove `director-ruby-3.3` package dep.
  • `packages/health_monitor/`: Replace gem build script with `go build`.
  • `src/spec/integration_support/`: Add `BoshMonitorManager` to build the Go binary for integration tests; update sandbox to use it with correct PATH.
  • `src/spec/assets/sandbox/`: Update sandbox HM configs to be compatible with Go config format.
  • `src/spec/integration/health_monitor/`: Update JSON heartbeat log parsing to match Go slog format.

Test plan

  • Go unit tests: `cd src/bosh-monitor && go test ./...` — all pass
  • `golangci-lint` — clean (0 issues)
  • GitHub Actions `go` workflow: `bosh-monitor-test` and `bosh-monitor-lint` jobs — passing
  • GitHub Actions `ruby` workflow: `nats_sync:parallel`, `common:parallel`, `release` — passing
  • `fly:integration` — last successful run: build #364415336 (Jun 23, 2026), 1420 examples, 2 failures (1 HM: `hm_stateless_spec.rb:96` now fixed; 1 unrelated: `director_scheduler_spec.rb:59`)
  • `fly:integration` — pending clean re-run (CI infra broken since Jun 25 for all branches; main pipeline build #1034 also failed due to `CI: replace docker-image type` change)

coderabbitai Bot commented Jun 19, 2026

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: c479c963-e126-4359-89d9-07bcd571ba74

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.


Comment thread .github/workflows/go.yml Fixed
Comment thread .github/workflows/go.yml Fixed
Comment thread src/bosh-monitor/main.go Fixed
aramprice force-pushed the experiment-golang-bosh-monitor branch from e611365 to 56da280 on June 19, 2026 23:43
coderabbitai[bot]
coderabbitai Bot previously approved these changes Jun 19, 2026
coderabbitai[bot]
coderabbitai Bot previously approved these changes Jun 19, 2026

Copilot AI left a comment

Contributor

Pull request overview

This PR replaces the legacy Ruby-based bosh-monitor with a Go-based implementation and updates packaging, CI, and integration test scaffolding to build and run the new binary + out-of-process plugins.

Changes:

  • Introduces a new Go bosh-monitor binary with supporting packages (server, event processing, NATS monitoring, plugin host/protocol, etc.) and Ginkgo/Gomega tests.
  • Updates BOSH release packaging/job templates to run the Go binary instead of the Ruby runtime/gem.
  • Updates integration support to build the Go binary/plugins and adjusts integration specs/configs for the new log/config formats.

Reviewed changes

Copilot reviewed 156 out of 160 changed files in this pull request and generated 10 comments.

Show a summary per file
| File | Description |
| --- | --- |
| src/spec/integration/health_monitor/hm_stateless_spec.rb | Updates integration log parsing to match Go slog output format. |
| src/spec/integration_support/sandbox.rb | Builds the Go monitor for integration tests and runs it with updated PATH/env. |
| src/spec/integration_support/bosh_monitor_manager.rb | Adds integration helper to build Go bosh-monitor + plugin binaries. |
| src/spec/assets/sandbox/health_monitor_without_resurrector.yml.erb | Adjusts sandbox HM config to match new Go monitor expectations. |
| src/Gemfile.lock | Removes Ruby bosh-monitor gem from bundle. |
| src/Gemfile | Removes Ruby bosh-monitor gem entry. |
| src/bosh-monitor/test/integration/integration_suite_test.go | Adds Go integration test suite scaffold (Ginkgo). |
| src/bosh-monitor/spec/unit/bosh/monitor/protocols/tcp_connection_spec.rb | Removes Ruby monitor unit tests (legacy implementation removed). |
| src/bosh-monitor/spec/unit/bosh/monitor/plugins/tsdb_spec.rb | Removes Ruby monitor unit tests (legacy implementation removed). |
| src/bosh-monitor/spec/unit/bosh/monitor/plugins/riemann_spec.rb | Removes Ruby monitor unit tests (legacy implementation removed). |
| src/bosh-monitor/spec/unit/bosh/monitor/plugins/resurrector_helper_spec.rb | Removes Ruby monitor unit tests (legacy implementation removed). |
| src/bosh-monitor/spec/unit/bosh/monitor/plugins/paging_datadog_client_spec.rb | Removes Ruby monitor unit tests (legacy implementation removed). |
| src/bosh-monitor/spec/unit/bosh/monitor/plugins/pagerduty_spec.rb | Removes Ruby monitor unit tests (legacy implementation removed). |
| src/bosh-monitor/spec/unit/bosh/monitor/plugins/logger_spec.rb | Removes Ruby monitor unit tests (legacy implementation removed). |
| src/bosh-monitor/spec/unit/bosh/monitor/plugins/json_spec.rb | Removes Ruby monitor unit tests (legacy implementation removed). |
| src/bosh-monitor/spec/unit/bosh/monitor/plugins/graphite_spec.rb | Removes Ruby monitor unit tests (legacy implementation removed). |
| src/bosh-monitor/spec/unit/bosh/monitor/plugins/event_logger_spec.rb | Removes Ruby monitor unit tests (legacy implementation removed). |
| src/bosh-monitor/spec/unit/bosh/monitor/plugins/email_spec.rb | Removes Ruby monitor unit tests (legacy implementation removed). |
| src/bosh-monitor/spec/unit/bosh/monitor/plugins/dummy_spec.rb | Removes Ruby monitor unit tests (legacy implementation removed). |
| src/bosh-monitor/spec/unit/bosh/monitor/plugins/base_spec.rb | Removes Ruby monitor unit tests (legacy implementation removed). |
| src/bosh-monitor/spec/unit/bosh/monitor/metric_spec.rb | Removes Ruby monitor unit tests (legacy implementation removed). |
| src/bosh-monitor/spec/unit/bosh/monitor/instance_spec.rb | Removes Ruby monitor unit tests (legacy implementation removed). |
| src/bosh-monitor/spec/unit/bosh/monitor/events/base_spec.rb | Removes Ruby monitor unit tests (legacy implementation removed). |
| src/bosh-monitor/spec/unit/bosh/monitor/events/alert_spec.rb | Removes Ruby monitor unit tests (legacy implementation removed). |
| src/bosh-monitor/spec/unit/bosh/monitor/event_processor_spec.rb | Removes Ruby monitor unit tests (legacy implementation removed). |
| src/bosh-monitor/spec/unit/bosh/monitor/director_monitor_spec.rb | Removes Ruby monitor unit tests (legacy implementation removed). |
| src/bosh-monitor/spec/unit/bosh/monitor/config_spec.rb | Removes Ruby monitor unit tests (legacy implementation removed). |
| src/bosh-monitor/spec/unit/bosh/monitor/agent_spec.rb | Removes Ruby monitor unit tests (legacy implementation removed). |
| src/bosh-monitor/spec/support/uaa_helpers.rb | Removes Ruby monitor test support (legacy implementation removed). |
| src/bosh-monitor/spec/support/host_authorizatin.rb | Removes Ruby monitor test support (legacy implementation removed). |
| src/bosh-monitor/spec/support/buffered_logger.rb | Removes Ruby monitor test support (legacy implementation removed). |
| src/bosh-monitor/spec/spec_helper.rb | Removes Ruby monitor spec helper (legacy implementation removed). |
| src/bosh-monitor/spec/gemspec_spec.rb | Removes Ruby gemspec tests (legacy implementation removed). |
| src/bosh-monitor/spec/functional/notifying_plugins_spec.rb | Removes Ruby functional tests (legacy implementation removed). |
| src/bosh-monitor/spec/assets/sample_config.yml | Removes Ruby sample config (legacy implementation removed). |
| src/bosh-monitor/spec/assets/dummy_plugin_config.yml | Removes Ruby dummy plugin config (legacy implementation removed). |
| src/bosh-monitor/pkg/server/server.go | Adds Go HTTP API server implementation (healthz + agent endpoints). |
| src/bosh-monitor/pkg/server/server_test.go | Adds Go tests for server endpoints. |
| src/bosh-monitor/pkg/server/server_suite_test.go | Adds Ginkgo suite for server package. |
| src/bosh-monitor/pkg/resurrection/resurrection_suite_test.go | Adds Ginkgo suite for resurrection package. |
| src/bosh-monitor/pkg/resurrection/manager_test.go | Adds resurrection manager rule parsing/behavior tests. |
| src/bosh-monitor/pkg/processor/processor_suite_test.go | Adds Ginkgo suite for processor package. |
| src/bosh-monitor/pkg/processor/event_processor.go | Adds Go event processor (validation, dedupe, pruning, dispatch). |
| src/bosh-monitor/pkg/processor/event_processor_test.go | Adds tests for Go event processor. |
| src/bosh-monitor/pkg/pluginproto/protocol_suite_test.go | Adds Ginkgo suite for plugin protocol package. |
| src/bosh-monitor/pkg/pluginhost/pluginhost_suite_test.go | Adds Ginkgo suite for plugin host package. |
| src/bosh-monitor/pkg/pluginhost/host_test.go | Adds tests for plugin host command handling and startup behavior. |
| src/bosh-monitor/pkg/nats/nats_suite_test.go | Adds Ginkgo suite for NATS package. |
| src/bosh-monitor/pkg/nats/director_monitor.go | Adds Go director-alert subscription monitor. |
| src/bosh-monitor/pkg/nats/director_monitor_test.go | Adds initial unit tests for director monitor (needs strengthening). |
| src/bosh-monitor/pkg/nats/client.go | Adds Go NATS client with TLS and startup retry logic. |
| src/bosh-monitor/pkg/instance/instance.go | Adds Go instance model + formatting helpers. |
| src/bosh-monitor/pkg/instance/instance_suite_test.go | Adds Ginkgo suite for instance package. |
| src/bosh-monitor/pkg/instance/deployment.go | Adds Go deployment model and agent/instance bookkeeping. |
| src/bosh-monitor/pkg/instance/agent.go | Adds Go agent model and timeout/rogue logic. |
| src/bosh-monitor/pkg/events/metric.go | Adds Go metric model. |
| src/bosh-monitor/pkg/events/events_suite_test.go | Adds Ginkgo suite for events package. |
| src/bosh-monitor/pkg/events/base.go | Adds Go event factory/validation helpers. |
| src/bosh-monitor/pkg/director/director_suite_test.go | Adds Ginkgo suite for director package. |
| src/bosh-monitor/pkg/director/auth.go | Adds Go auth provider logic (basic + UAA token flow, CA selection). |
| src/bosh-monitor/pkg/config/config.go | Adds Go config loader with defaults/validation. |
| src/bosh-monitor/pkg/config/config_suite_test.go | Adds Ginkgo suite for config package. |
| src/bosh-monitor/main.go | Adds Go entrypoint (-c config) with slog logging and signal handling. |
| src/bosh-monitor/lib/bosh/monitor/yaml_helper.rb | Removes Ruby monitor implementation (deleted). |
| src/bosh-monitor/lib/bosh/monitor/version.rb | Removes Ruby monitor implementation (deleted). |
| src/bosh-monitor/lib/bosh/monitor/resurrection_manager.rb | Removes Ruby monitor implementation (deleted). |
| src/bosh-monitor/lib/bosh/monitor/protocols/tsdb_connection.rb | Removes Ruby monitor implementation (deleted). |
| src/bosh-monitor/lib/bosh/monitor/protocols/tcp_connection.rb | Removes Ruby monitor implementation (deleted). |
| src/bosh-monitor/lib/bosh/monitor/protocols/graphite_connection.rb | Removes Ruby monitor implementation (deleted). |
| src/bosh-monitor/lib/bosh/monitor/plugins/tsdb.rb | Removes Ruby monitor implementation (deleted). |
| src/bosh-monitor/lib/bosh/monitor/plugins/riemann.rb | Removes Ruby monitor implementation (deleted). |
| src/bosh-monitor/lib/bosh/monitor/plugins/resurrector_helper.rb | Removes Ruby monitor implementation (deleted). |
| src/bosh-monitor/lib/bosh/monitor/plugins/README.md | Removes Ruby plugin README (deleted). |
| src/bosh-monitor/lib/bosh/monitor/plugins/paging_datadog_client.rb | Removes Ruby monitor implementation (deleted). |
| src/bosh-monitor/lib/bosh/monitor/plugins/pagerduty.rb | Removes Ruby monitor implementation (deleted). |
| src/bosh-monitor/lib/bosh/monitor/plugins/logger.rb | Removes Ruby monitor implementation (deleted). |
| src/bosh-monitor/lib/bosh/monitor/plugins/json.rb | Removes Ruby monitor implementation (deleted). |
| src/bosh-monitor/lib/bosh/monitor/plugins/http_request_helper.rb | Removes Ruby monitor implementation (deleted). |
| src/bosh-monitor/lib/bosh/monitor/plugins/graphite.rb | Removes Ruby monitor implementation (deleted). |
| src/bosh-monitor/lib/bosh/monitor/plugins/event_logger.rb | Removes Ruby monitor implementation (deleted). |
| src/bosh-monitor/lib/bosh/monitor/plugins/email.rb | Removes Ruby monitor implementation (deleted). |
| src/bosh-monitor/lib/bosh/monitor/plugins/dummy.rb | Removes Ruby monitor implementation (deleted). |
| src/bosh-monitor/lib/bosh/monitor/plugins/datadog.rb | Removes Ruby monitor implementation (deleted). |
| src/bosh-monitor/lib/bosh/monitor/plugins/base.rb | Removes Ruby monitor implementation (deleted). |
| src/bosh-monitor/lib/bosh/monitor/metric.rb | Removes Ruby monitor implementation (deleted). |
| src/bosh-monitor/lib/bosh/monitor/instance.rb | Removes Ruby monitor implementation (deleted). |
| src/bosh-monitor/lib/bosh/monitor/events/heartbeat.rb | Removes Ruby monitor implementation (deleted). |
| src/bosh-monitor/lib/bosh/monitor/events/base.rb | Removes Ruby monitor implementation (deleted). |
| src/bosh-monitor/lib/bosh/monitor/events/alert.rb | Removes Ruby monitor implementation (deleted). |
| src/bosh-monitor/lib/bosh/monitor/event_processor.rb | Removes Ruby monitor implementation (deleted). |
| src/bosh-monitor/lib/bosh/monitor/errors.rb | Removes Ruby monitor implementation (deleted). |
| src/bosh-monitor/lib/bosh/monitor/director.rb | Removes Ruby monitor implementation (deleted). |
| src/bosh-monitor/lib/bosh/monitor/director_monitor.rb | Removes Ruby monitor implementation (deleted). |
| src/bosh-monitor/lib/bosh/monitor/deployment.rb | Removes Ruby monitor implementation (deleted). |
| src/bosh-monitor/lib/bosh/monitor/core_ext.rb | Removes Ruby monitor implementation (deleted). |
| src/bosh-monitor/lib/bosh/monitor/config.rb | Removes Ruby monitor implementation (deleted). |
| src/bosh-monitor/lib/bosh/monitor/auth_provider.rb | Removes Ruby monitor implementation (deleted). |
| src/bosh-monitor/lib/bosh/monitor/api_controller.rb | Removes Ruby monitor implementation (deleted). |
| src/bosh-monitor/lib/bosh/monitor/agent.rb | Removes Ruby monitor implementation (deleted). |
| src/bosh-monitor/lib/bosh/monitor.rb | Removes Ruby monitor implementation (deleted). |
| src/bosh-monitor/go.mod | Adds Go module definition for new monitor. |
| src/bosh-monitor/cmd/plugins/pluginlib/pluginlib.go | Adds shared plugin runtime library for out-of-process plugins. |
| src/bosh-monitor/cmd/plugins/pluginlib/pluginlib_test.go | Adds tests for plugin runtime library. |
| src/bosh-monitor/cmd/plugins/pluginlib/pluginlib_suite_test.go | Adds Ginkgo suite for pluginlib package. |
| src/bosh-monitor/cmd/plugins/hm-tsdb/main.go | Adds TSDB plugin (Go) implementation. |
| src/bosh-monitor/cmd/plugins/hm-riemann/main.go | Adds Riemann plugin (Go) implementation. |
| src/bosh-monitor/cmd/plugins/hm-pagerduty/main.go | Adds PagerDuty plugin (Go) implementation. |
| src/bosh-monitor/cmd/plugins/hm-logger/main.go | Adds logger plugin (Go) implementation. |
| src/bosh-monitor/cmd/plugins/hm-json/main.go | Adds JSON fanout plugin (Go) implementation. |
| src/bosh-monitor/cmd/plugins/hm-graphite/main.go | Adds Graphite plugin (Go) implementation. |
| src/bosh-monitor/cmd/plugins/hm-event-logger/main.go | Adds event-logger plugin (Go) implementation. |
| src/bosh-monitor/cmd/plugins/hm-dummy/main.go | Adds dummy plugin (Go) implementation. |
| src/bosh-monitor/cmd/plugins/hm-datadog/main.go | Adds Datadog plugin (Go) implementation. |
| src/bosh-monitor/.golangci.yml | Adds golangci-lint config for the new Go module. |
| packages/health_monitor/spec | Updates package spec to remove Ruby dependencies and ship new monitor sources. |
| packages/health_monitor/packaging | Updates packaging to build Go binary + plugins. |
| jobs/health_monitor/templates/health_monitor | Updates job launcher to run the Go binary (removes Ruby env). |
| jobs/health_monitor/templates/bpm.yml | Updates BPM config to run Go binary with args, removes Ruby env vars/volumes. |
| jobs/health_monitor/spec | Removes Ruby package dependency from health_monitor job. |
| .github/workflows/ruby.yml | Removes the Ruby monitor test matrix entry. |
| .github/workflows/go.yml | Adds lint/test jobs for the new Go bosh-monitor module. |

Comments suppressed due to low confidence (1)

packages/health_monitor/spec:6

  • `go build` is invoked in this package, but the package spec declares no dependency that would provide a Go toolchain during BOSH compilation. Unless the compilation environment already has `go` available, the release will fail to compile. Consider either (a) adding a `golang-*` package dependency, or (b) shipping prebuilt binaries (like other packages in this release) and removing the compile-time `go build` requirement.
---
name: health_monitor

files:
- bosh-monitor/**/*


Comment thread src/spec/integration_support/bosh_monitor_manager.rb
Comment thread packages/health_monitor/packaging
Comment thread src/bosh-monitor/go.mod
Comment thread src/bosh-monitor/pkg/nats/director_monitor_test.go Outdated
Comment thread src/bosh-monitor/pkg/server/server_test.go
Comment thread src/bosh-monitor/pkg/server/server_test.go
Comment thread src/bosh-monitor/pkg/processor/event_processor.go Outdated
Comment thread src/bosh-monitor/cmd/plugins/hm-pagerduty/main.go
Comment thread src/bosh-monitor/cmd/plugins/hm-graphite/main.go Outdated
Comment thread src/bosh-monitor/cmd/plugins/hm-tsdb/main.go Outdated
coderabbitai[bot]
coderabbitai Bot previously approved these changes Jun 20, 2026
aramprice and others added 29 commits July 1, 2026 09:44
Follow-up to the concurrency/liveness fixes, informed by comparing against the
Ruby bosh-monitor still on main:

- email (F5): Ruby connects in plaintext and upgrades via STARTTLS
  (Net::SMTP.new(host, port, starttls:)); the Go plugin used implicit TLS
  (tls.Dial), which fails against the STARTTLS submission port (587). Switch to
  smtp.Dial + StartTLS and honour the Ruby auth option (PLAIN / CRAM-MD5).
- events (F4): reject a non-numeric severity instead of silently treating it as
  0, matching Ruby's "non-negative integer expected".
- pluginproto (L5): drop omitempty from the http_response Status so a real
  status of 0 (request error) is sent explicitly rather than dropped.
- director (L4/L7): NewClient/NewAuthProvider now take a typed director.Config
  instead of map[string]interface{}, removing the fmt.Sprintf("%v", ...)
  coercions and the field-by-field map building in the runner; call sites and
  tests updated.

Confirmed faithful to Ruby (no change): event_mbus is dead in Ruby too, and the
status server binds 127.0.0.1 in both.

go build / go vet / gofmt / go test -race all clean; offline vendored build OK.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
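
The severity check described in the events (F4) item above could look roughly like the sketch below. It is illustrative only: `parseSeverity` and `errSeverity` are hypothetical names, and it assumes the alert payload was decoded into a generic map, where `encoding/json` represents every JSON number as `float64`.

```go
package main

import (
	"errors"
	"fmt"
	"math"
)

// errSeverity echoes the "non-negative integer expected" phrase quoted above.
// The exact message text is an assumption.
var errSeverity = errors.New("severity: non-negative integer expected")

// parseSeverity converts a decoded JSON value into a severity.
// Anything that is not a whole, non-negative number is rejected
// instead of being silently treated as 0.
func parseSeverity(v interface{}) (int, error) {
	n, ok := v.(float64)
	if !ok || n < 0 || n != math.Trunc(n) {
		return 0, errSeverity
	}
	return int(n), nil
}

func main() {
	for _, v := range []interface{}{float64(2), "high", nil, float64(-1)} {
		sev, err := parseSeverity(v)
		fmt.Println(sev, err)
	}
}
```

Only the first value in `main` parses; the string, the missing value, and the negative number all return the error.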
Replace map[string]interface{} with typed structs at the director->instance
boundary:

- director: add Deployment, Instance, ResurrectionConfig response types;
  Deployments/GetDeploymentInstances/ResurrectionConfig now unmarshal into them
  and return typed slices. A small FlexStr type tolerates `index` arriving as a
  JSON number or string, preserving the previous fmt.Sprintf("%v") behaviour.
- instance: the Director interface, Manager sync methods (SyncDeployments,
  SyncDeploymentState, syncInstances, SyncInstancesPublic) and the
  NewDeployment/NewInstance/CreateDeployment/CreateInstance constructors now take
  typed director values, deleting the per-field map digging and "%v" coercions.
- resurrection: UpdateRules takes []director.ResurrectionConfig.
- tests updated to construct typed values.

go build / vet / gofmt / go test -race all clean.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
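
A type like the `FlexStr` mentioned above can be built on `json.Unmarshaler`. The following is a minimal sketch, not the code from this PR: the `Instance` struct, its fields, and the `decodeInstance` helper are assumptions made for illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// FlexStr holds a value that may arrive as a JSON string or a JSON number.
type FlexStr string

// UnmarshalJSON accepts "3" as well as 3 and stores both as the string "3".
func (f *FlexStr) UnmarshalJSON(b []byte) error {
	var s string
	if err := json.Unmarshal(b, &s); err == nil {
		*f = FlexStr(s)
		return nil
	}
	var n json.Number
	if err := json.Unmarshal(b, &n); err != nil {
		return fmt.Errorf("expected a string or a number: %w", err)
	}
	*f = FlexStr(n.String())
	return nil
}

// Instance is a hypothetical slice of the director response.
type Instance struct {
	Job   string  `json:"job"`
	Index FlexStr `json:"index"`
}

// decodeInstance is a small helper for the demonstration below.
func decodeInstance(raw string) (Instance, error) {
	var inst Instance
	err := json.Unmarshal([]byte(raw), &inst)
	return inst, err
}

func main() {
	for _, raw := range []string{`{"job":"web","index":3}`, `{"job":"web","index":"3"}`} {
		inst, err := decodeInstance(raw)
		fmt.Printf("%q %v\n", inst.Index, err)
	}
}
```

Both inputs decode to the same `Index` value, so callers no longer need a `fmt.Sprintf("%v", ...)` coercion.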
Replace the central `Process(kind string, data map[string]interface{})` seam
with a typed `Process(event events.Event)`, and type the deployment-health
fields the resurrector reads:

- processor: Process now takes a constructed events.Event; it assigns a
  generated ID if missing, validates, de-duplicates and dispatches. The three
  consumer interfaces (instance.EventProcessor, nats.DirectorAlertProcessor,
  pluginhost.AlertEmitter) and their fakes are updated to match.
- events: add NewAlertFromData(AlertData) so the monitor builds its own alerts
  (vm_health/deployment_health) from typed fields instead of map literals;
  add typed JobsToInstanceIDs/TotalAgentCount; move ID assignment into a shared
  EnsureID; track severity presence with a field so validation no longer
  depends on the attributes map (which monitor-generated alerts don't carry).
- pluginproto.EventData: add typed JobsToInstanceIDs/TotalAgentCount and drop
  the generic Attributes map; eventToProto populates the typed fields.
- resurrector/consul: read event.Category/Deployment/JobsToInstanceIDs/
  TotalAgentCount/JobState typed fields instead of digging in Attributes.
- manager: onAlert/onHeartbeat and the analyze paths build typed events.

JSON ingress (agent heartbeats/alerts, director alerts, plugin emit_alert)
still decodes into the typed events via NewAlert/NewHeartbeat — the maps are
confined to that boundary. Added a processor test covering the
monitor-generated (attributes-less) alert path.

go build / vet / gofmt / go test -race all clean (3x).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Final typed-boundary cleanup now that nothing reads the generic attributes map:

- events: drop the unused Event.Attributes() method and the Alert.Attrs /
  Heartbeat.Attrs fields. Heartbeat gains a typed NumberOfProcesses *int parsed
  from the payload; ToHash and eventToProto use it.
- pluginproto: EventData.NumberOfProcesses is now *int instead of interface{}.

golangci-lint (standard set) reports 0 issues; go build / vet / gofmt /
go test -race all clean.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…tifier

- pluginproto: add a godoc comment explaining why NumberOfProcesses is *int
  (nil = absent vs 0 = zero processes) — unlike all other scalar EventData fields.
- pluginlib: name the 100-capacity channel buffers as eventChanSize/cmdChanSize
  with rationale comments instead of bare magic numbers.
- main: add `defer signal.Stop(sigCh)` so the signal notifier is properly
  deregistered on shutdown, guarding against goroutine leaks if main is ever
  refactored into a long-lived service.
… field

Add HTTPConfig.Host (defaults to "127.0.0.1") so the listen address can
be overridden without a code change. Production behaviour is unchanged —
the default matches the Ruby implementation and the existing hard-coded
value. Update server.New signature, runner, and test call sites.
usableCACertFile read the CA cert file to check non-empty, then
tlsConfigForCAFile read it again to parse PEM. This was a TOCTOU race
(the file can change between reads) and wasted a syscall.

Inline the empty-file guard into tlsConfigForCAFile so the file is read
exactly once. usableCACertFile is still used by the auth provider to pick
the UAA cert path (where it only needs to check usability, not parse PEM).
…leaks

fetchUAAToken previously created a new *http.Client and *http.Transport on
every invocation. Since the transport is never explicitly closed, its
internal connection pool, TLS sessions, and idle goroutines leaked after
each token refresh.

Build and store the UAA HTTP client once in NewAuthProvider (at the same
time the TLS config is already determined from the cert path) and reuse it
across all fetchUAAToken calls.
…nic on shutdown

The goroutine spawned inside the ticker loop sent directly via
`cmds <- pluginlib.LogCommand(...)`. After the plugin context is cancelled
and pluginlib stops the writer, a late-arriving send would block indefinitely
(if the buffer was full and nothing was receiving) or panic (if the channel was closed).

Replace bare channel sends with pluginlib.SendCommand(ctx, cmds, ...)
which selects on ctx.Done() so the goroutine exits cleanly on shutdown.
restartAttempts was incremented on every crash and never reset, so a
plugin that crashed once, ran for a week, and crashed again would use an
increasingly long backoff. After 6 crashes separated by healthy runs the
delay would reach the 60 s cap permanently.

Record p.startedAt on each Start(). In waitForExit, if the process ran
for at least restartBackoffMax before crashing, reset restartAttempts to 0
so the next restart begins with the base 1 s delay.
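
The reset rule can be illustrated with a small pure function. This is a sketch under assumptions: the doubling schedule is inferred from the 1 s base and 60 s cap mentioned above, and `nextRestart` is a hypothetical helper rather than the code in `waitForExit`.

```go
package main

import (
	"fmt"
	"time"
)

const (
	restartBackoffBase = 1 * time.Second
	restartBackoffMax  = 60 * time.Second
)

// nextRestart returns the updated attempt counter and the delay to wait
// before restarting. uptime is how long the process ran before it crashed.
func nextRestart(attempts int, uptime time.Duration) (int, time.Duration) {
	if uptime >= restartBackoffMax {
		attempts = 0 // a long healthy run forgets earlier crashes
	}
	delay := restartBackoffBase
	for i := 0; i < attempts && delay < restartBackoffMax; i++ {
		delay *= 2
	}
	if delay > restartBackoffMax {
		delay = restartBackoffMax
	}
	return attempts + 1, delay
}

func main() {
	attempts := 0
	var delay time.Duration
	// Two quick crashes, then a crash after a week of healthy running.
	for _, uptime := range []time.Duration{2 * time.Second, 2 * time.Second, 7 * 24 * time.Hour} {
		attempts, delay = nextRestart(attempts, uptime)
		fmt.Println("uptime", uptime, "-> delay", delay)
	}
}
```

In this sketch the third crash uses the base delay again because the preceding run lasted longer than `restartBackoffMax`.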
…Deployments flag invariant

SyncInstancesPublic and SyncAgentsPublic were exported solely "for testing"
but were never called by any test. Exporting internal helpers pollutes the
API and leaks implementation details. Remove them.

Add FetchDeployments tests that document and pin the invariant:
  - directorInitialSyncDone is set only after a full, successful sync cycle
  - it remains false when Deployments() fails
  - it remains false when GetDeploymentInstances() fails for any deployment
…down

When all maxConcurrentPluginRequests goroutine slots are occupied, the
plugin reader goroutine blocks on h.sem <- struct{}{}.  If Shutdown is
called at this moment the reader goroutine is not tracked by the
WaitGroup, so h.wg.Wait() returns while the reader is still dangling.

Add a stopped channel (closed once via sync.Once in Shutdown) and
replace the bare channel send with a select so the acquire exits
immediately when the host is shutting down, allowing the reader goroutine
to observe the closed stdout pipe and exit cleanly.
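
The shutdown-aware acquire described above might be sketched like this. The `Host` type here is a reduced, hypothetical stand-in that keeps only the semaphore and the stopped channel.

```go
package main

import (
	"fmt"
	"sync"
)

// Host is a cut-down, hypothetical plugin host.
type Host struct {
	sem      chan struct{} // counting semaphore for in-flight plugin requests
	stopped  chan struct{} // closed exactly once by Shutdown
	stopOnce sync.Once
}

func NewHost(maxConcurrent int) *Host {
	return &Host{
		sem:     make(chan struct{}, maxConcurrent),
		stopped: make(chan struct{}),
	}
}

// acquire waits for a free slot and returns false if the host stops first.
func (h *Host) acquire() bool {
	select {
	case h.sem <- struct{}{}:
		return true
	case <-h.stopped:
		return false
	}
}

// release frees a slot taken by acquire.
func (h *Host) release() { <-h.sem }

// Shutdown may be called more than once; the channel is closed only once.
func (h *Host) Shutdown() { h.stopOnce.Do(func() { close(h.stopped) }) }

func main() {
	h := NewHost(1)
	fmt.Println(h.acquire()) // true: the only slot was free
	h.Shutdown()
	h.Shutdown()             // second call is a no-op
	fmt.Println(h.acquire()) // false: slot still held and the host is stopped
}
```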
validPayload duplicated a subset of the validation already present in
Alert.Validate(). It required id, severity, title, summary, and created_at;
Alert.Validate() covers id, severity (with numeric-type check), title, and
created_at (as "timestamp"), and treats summary as optional (falling back to
Title — matching the Ruby behaviour).

Replace the bespoke check with events.NewAlert + Validate() so validation
logic lives in one place. Use Create+Validate rather than CreateAndValidate
to avoid EnsureID auto-generating a UUID for director alerts: an absent id
must remain absent so Validate() correctly rejects it.

Update tests to reflect the new error-log phrasing and remove the now-
incorrect "summary is required" expectation.
uaaTokenHeader held mu for the full duration of fetchUAAToken (up to the
30 s HTTP client timeout). Every concurrent director request calls
AuthHeader which calls uaaTokenHeader, so a token refresh serialised ALL
inflight director calls behind the mutex.

Switch mu to sync.RWMutex and add a separate fetchMu (sync.Mutex) that
guards only the token-fetch code path. The fast path (cached token still
valid) now uses RLock, so concurrent readers no longer block one another. The slow path
acquires fetchMu so exactly one goroutine makes the HTTP call; the rest
wait on fetchMu and then read the freshly-stored token, avoiding a pile-up
of concurrent UAA requests.
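
The two-lock arrangement can be sketched as below. This is an illustration under assumptions, not the auth provider from this PR: `tokenCache`, its fields, and the `fetch` callback (returning a token plus a lifetime in seconds) are hypothetical.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

type tokenCache struct {
	mu      sync.RWMutex // guards token and expiry
	fetchMu sync.Mutex   // serialises refreshes
	token   string
	expiry  time.Time
	fetch   func() (token string, expiresInSec int, err error)
}

// cached is the fast path: a read lock only.
func (c *tokenCache) cached() (string, bool) {
	c.mu.RLock()
	defer c.mu.RUnlock()
	return c.token, c.token != "" && time.Now().Before(c.expiry)
}

// Token returns a valid token, refreshing it at most once at a time.
func (c *tokenCache) Token() (string, error) {
	if tok, ok := c.cached(); ok {
		return tok, nil
	}
	c.fetchMu.Lock()
	defer c.fetchMu.Unlock()
	// Another goroutine may have refreshed while we waited on fetchMu.
	if tok, ok := c.cached(); ok {
		return tok, nil
	}
	tok, secs, err := c.fetch() // slow call runs without holding mu
	if err != nil {
		return "", err
	}
	c.mu.Lock()
	c.token, c.expiry = tok, time.Now().Add(time.Duration(secs)*time.Second)
	c.mu.Unlock()
	return tok, nil
}

func main() {
	calls := 0 // only touched while fetchMu is held
	c := &tokenCache{fetch: func() (string, int, error) {
		calls++
		time.Sleep(10 * time.Millisecond) // simulate the HTTP round trip
		return "bearer-token", 3600, nil
	}}
	var wg sync.WaitGroup
	for i := 0; i < 8; i++ {
		wg.Add(1)
		go func() { defer wg.Done(); _, _ = c.Token() }()
	}
	wg.Wait()
	fmt.Println("fetch calls:", calls) // 1
}
```

The second `cached()` check after taking `fetchMu` is what keeps waiting goroutines from each issuing their own request.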
Two issues in Client.Subscribe:

1. Iterating a map[string]string has non-deterministic order. Replace with an
   ordered slice of structs so the subscription sequence is stable and readable.

2. Subscription objects were discarded (_). If a later Subscribe call fails,
   the earlier subscriptions that succeeded are orphaned on the NATS connection
   with no way to clean them up. Track each *nats.Subscription and unsubscribe
   all successful ones before returning an error, so a retry starts clean.
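
Both fixes can be sketched against a small interface rather than the real NATS client. Everything here is hypothetical scaffolding: `Conn`, `Subscription`, `subscription`, and the fake connection exist only for the example, and the subject strings are placeholders.

```go
package main

import (
	"errors"
	"fmt"
)

// Subscription and Conn are hypothetical stand-ins for the NATS client types.
type Subscription interface{ Unsubscribe() error }

type Conn interface {
	Subscribe(subject string, handler func([]byte)) (Subscription, error)
}

// subscription pairs a subject with its handler; a slice keeps the order stable.
type subscription struct {
	Subject string
	Handler func([]byte)
}

// subscribeAll subscribes in slice order and rolls back on the first failure.
func subscribeAll(c Conn, wanted []subscription) ([]Subscription, error) {
	subs := make([]Subscription, 0, len(wanted))
	for _, w := range wanted {
		s, err := c.Subscribe(w.Subject, w.Handler)
		if err != nil {
			for _, prev := range subs {
				_ = prev.Unsubscribe() // best effort: a retry should start clean
			}
			return nil, fmt.Errorf("subscribe %q: %w", w.Subject, err)
		}
		subs = append(subs, s)
	}
	return subs, nil
}

// fakeConn counts active subscriptions and fails on one chosen subject.
type fakeConn struct {
	active int
	failOn string
}

type fakeSub struct{ conn *fakeConn }

func (s fakeSub) Unsubscribe() error { s.conn.active--; return nil }

func (c *fakeConn) Subscribe(subject string, _ func([]byte)) (Subscription, error) {
	if subject == c.failOn {
		return nil, errors.New("subscription refused")
	}
	c.active++
	return fakeSub{c}, nil
}

func main() {
	wanted := []subscription{{"heartbeat", nil}, {"alert", nil}, {"shutdown", nil}}
	conn := &fakeConn{failOn: "shutdown"}
	_, err := subscribeAll(conn, wanted)
	fmt.Println(err, "| still active:", conn.active)
}
```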
validate() previously only checked director.endpoint. Any other missing
required field (mbus endpoint, TLS paths, http port) produced no error at
startup; the process would panic or log confusing errors later.

Collect all validation errors into a slice and return them in one message
so operators see every problem at once. New checks:
  - http.port in [1, 65535]
  - mbus.endpoint, server_ca_path, client_certificate_path, client_private_key_path
  - director.endpoint (was: single-error; now part of multi-error)
  - all intervals positive (defaults ensure > 0 under normal operation)
  - each plugin must have a non-empty name

Update test fixtures to include the now-required mbus and http.port fields,
and add new tests for each newly-validated field.
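
Collecting every problem before returning might look like the sketch below. The `Config` struct is a flattened, hypothetical subset of the real configuration, and `errors.Join` (Go 1.20+) is one way to combine the slice; the PR may format the combined message differently.

```go
package main

import (
	"errors"
	"fmt"
)

// Config is a hypothetical, flattened subset of the monitor configuration.
type Config struct {
	HTTPPort         int
	MbusEndpoint     string
	DirectorEndpoint string
	PluginNames      []string
}

// Validate reports every problem at once instead of stopping at the first.
func (c Config) Validate() error {
	var errs []error
	if c.HTTPPort < 1 || c.HTTPPort > 65535 {
		errs = append(errs, fmt.Errorf("http.port must be in [1, 65535], got %d", c.HTTPPort))
	}
	if c.MbusEndpoint == "" {
		errs = append(errs, errors.New("mbus.endpoint is required"))
	}
	if c.DirectorEndpoint == "" {
		errs = append(errs, errors.New("director.endpoint is required"))
	}
	for i, name := range c.PluginNames {
		if name == "" {
			errs = append(errs, fmt.Errorf("plugins[%d].name is required", i))
		}
	}
	return errors.Join(errs...) // nil when errs is empty
}

func main() {
	fmt.Println(Config{}.Validate())
}
```

Running `main` prints one line per missing field, which is the operator-facing behaviour the commit aims for.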
The Event interface previously declared nine methods. Of these, only
ID(), Kind(), and Validate() are called via the interface anywhere outside
pkg/events:
  - event_processor.go calls Validate(), Kind(), and ID()
  - pluginhost/host.go calls Kind() and ID() then type-switches to access
    concrete fields

The remaining methods (Valid, ToHash, ToJSON, ToPlainText, ShortDescription,
Metrics) are either called within pkg/events itself or through concrete types.

Keeping them in the interface forces every mock / test double to implement
six methods it never needs, and advertises a larger contract than any caller
actually relies on.

Remove the six unused methods from the interface; they remain as concrete
methods on *Alert and *Heartbeat. Update the one test that called Valid()
through the interface to use Validate() instead.
Command.Alert was map[string]interface{}, which gave no compile-time
guarantee that required fields were present. A plugin emitting a malformed
alert discovered the error only at runtime when the host tried to validate it.

Introduce pluginproto.AlertPayload with typed fields (Severity int, Title
string, Summary/Source/Deployment string, CreatedAt int64). Update:
  - pluginproto.NewEmitAlertCommand to accept *AlertPayload
  - pluginlib.EmitAlertCommand / AlertPayload type alias (plugin-side)
  - pluginhost.HandleCommand to convert AlertPayload → events.AlertData
  - hm-resurrector to construct typed payloads
  - All tests that previously used map[string]interface{} for alert payloads
…yment

unhealthyCountForDeployment() now filters by the jobInstanceKey.Deployment
field so that unhealthy agents in an unrelated deployment cannot inflate
another deployment's meltdown percentage.

In a multi-deployment environment the old unhealthyCount() added every entry
in the tracker regardless of which deployment they belonged to, which could
suppress resurrection for healthy deployments that happened to share a tracker
with a melting-down one.

ai-assisted=yes
[TNZ-113909] Convert BOSH health monitor job to Golang
record() and unhealthyCountForDeployment() are called only from the outer
select loop in runResurrector, which runs in a single goroutine. The inner
goroutines spawned for managed-state HTTP requests never touch the tracker,
so the sync.Mutex was dead weight with no shared state to protect.

ai-assisted=yes
[TNZ-113909] Convert BOSH health monitor job to Golang
Ruby's resurrector plugin split jobs_to_instances into resurrection-enabled
and resurrection-disabled halves, emitting an explicit alert (severity=1,
title='Resurrection is disabled by resurrection config') for the disabled
half so operators know why those VMs were not resurrected.

The Go rewrite silently dropped resurrection-disabled instances with no
operator-visible alert. Fix by collecting disabled instances separately in
analyzeDeploymentAgents and AnalyzeInstances and emitting the alert from
the manager before processing the enabled deployment_health alert.

ai-assisted=yes
[TNZ-113909] Convert BOSH health monitor job to Golang
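
One way to picture the enabled/disabled split described above. The severity and title are quoted from the commit message; the instance and alert types, the helper names, and the summary wording are hypothetical and do not reflect the real analyzeDeploymentAgents signature.

```go
package main

import "fmt"

// instance is a hypothetical stand-in for the monitor's instance record.
type instance struct {
	ID                  string
	ResurrectionEnabled bool
}

// alert carries only the fields this sketch needs.
type alert struct {
	Severity int
	Title    string
	Summary  string
}

// splitByResurrection partitions instances instead of silently dropping
// the ones whose resurrection is disabled.
func splitByResurrection(instances []instance) (enabled, disabled []instance) {
	for _, inst := range instances {
		if inst.ResurrectionEnabled {
			enabled = append(enabled, inst)
		} else {
			disabled = append(disabled, inst)
		}
	}
	return enabled, disabled
}

// disabledAlert builds the operator-visible alert for the disabled half.
// It returns nil when there is nothing to report.
func disabledAlert(deployment string, disabled []instance) *alert {
	if len(disabled) == 0 {
		return nil
	}
	return &alert{
		Severity: 1,
		Title:    "Resurrection is disabled by resurrection config",
		Summary: fmt.Sprintf("deployment %q has %d instance(s) with resurrection disabled",
			deployment, len(disabled)),
	}
}

func main() {
	enabled, disabled := splitByResurrection([]instance{
		{ID: "a", ResurrectionEnabled: true},
		{ID: "b", ResurrectionEnabled: false},
	})
	fmt.Println(len(enabled), len(disabled))
	if a := disabledAlert("cf", disabled); a != nil {
		fmt.Println(a.Title)
	}
}
```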
Ruby's agent stores number_of_processes as nil when the heartbeat does not
include the field. The check agent.job_state == 'running' && agent.number_of_processes == 0
evaluates nil == 0 as false in Ruby, so agents that never reported a process
count are NOT counted as unhealthy.

The Go rewrite used int (zero-valued by default), causing agents with absent
number_of_processes to be falsely counted as unhealthy when job_state is
'running'. Fix by changing the field to *int (nil when absent) and updating
UnhealthyAgents to check agent.NumberOfProcesses != nil && *agent.NumberOfProcesses == 0.

ai-assisted=yes
[TNZ-113909] Convert BOSH health monitor job to Golang
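
A small sketch of the nil-versus-zero distinction described above. The heartbeat struct and the JSON key names are assumptions; only the *int field and the nil check mirror the commit message.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// heartbeat is a hypothetical subset of the agent heartbeat. The JSON keys
// are assumed to match the Ruby attribute names.
type heartbeat struct {
	JobState          string `json:"job_state"`
	NumberOfProcesses *int   `json:"number_of_processes"`
}

// decode parses a raw heartbeat. An absent number_of_processes leaves the
// pointer nil rather than defaulting to 0.
func decode(raw []byte) (heartbeat, error) {
	var hb heartbeat
	err := json.Unmarshal(raw, &hb)
	return hb, err
}

// isUnhealthy treats a missing process count as "unknown", not as zero.
func isUnhealthy(hb heartbeat) bool {
	return hb.JobState == "running" &&
		hb.NumberOfProcesses != nil &&
		*hb.NumberOfProcesses == 0
}

func main() {
	absent, _ := decode([]byte(`{"job_state":"running"}`))
	zero, _ := decode([]byte(`{"job_state":"running","number_of_processes":0}`))
	fmt.Println(isUnhealthy(absent), isUnhealthy(zero)) // false true
}
```

With a plain int field both inputs would decode to 0 and be indistinguishable, which is the false positive the commit fixes.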
sync.Map is optimised for a stable-key, many-concurrent-reader access
pattern. pendingChans has low contention and only two accessor sites; a
plain map[string]chan *pluginlib.EventEnvelope guarded by sync.Mutex is
clearer, avoids the interface{} type assertion on Load, and is easier to
reason about under race-detector inspection.

ai-assisted=yes
[TNZ-113909] Convert BOSH health monitor job to Golang
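
The mutex-guarded map described above might look something like this. EventEnvelope is a local stand-in for pluginlib.EventEnvelope, and the pending type with its register/resolve methods is hypothetical.

```go
package main

import (
	"fmt"
	"sync"
)

// EventEnvelope is a local stand-in for pluginlib.EventEnvelope.
type EventEnvelope struct{ ID string }

// pending guards a plain typed map with a mutex, so lookups need no
// interface{} type assertion as they would with sync.Map.
type pending struct {
	mu    sync.Mutex
	chans map[string]chan *EventEnvelope
}

func newPending() *pending {
	return &pending{chans: make(map[string]chan *EventEnvelope)}
}

// register creates a buffered reply channel for the given request ID.
func (p *pending) register(id string) chan *EventEnvelope {
	ch := make(chan *EventEnvelope, 1)
	p.mu.Lock()
	p.chans[id] = ch
	p.mu.Unlock()
	return ch
}

// resolve delivers env to the waiter for id and removes the entry.
// It reports false when no waiter is registered.
func (p *pending) resolve(id string, env *EventEnvelope) bool {
	p.mu.Lock()
	ch, ok := p.chans[id]
	delete(p.chans, id)
	p.mu.Unlock()
	if !ok {
		return false
	}
	ch <- env // buffer of 1 and single delivery, so this never blocks
	return true
}

func main() {
	p := newPending()
	ch := p.register("req-1")
	fmt.Println(p.resolve("req-1", &EventEnvelope{ID: "req-1"}))
	fmt.Println((<-ch).ID)
}
```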
NATS messages arrive as []byte from the NATS library. The previous
implementation converted msg.Data to string, boxed it into interface{},
and then switched on the runtime type inside ProcessEvent to convert it
back to []byte for json.Unmarshal — a round-trip that obscured where
decoding happened.

Change MessageHandler and ProcessEvent to accept []byte. The NATS
subscription now passes msg.Data directly, decoding is done once at the
top of ProcessEvent, and the interface{} type switch is removed.

ai-assisted=yes
[TNZ-113909] Convert BOSH health monitor job to Golang
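
A rough sketch of a handler that takes []byte and decodes once. The commit message names MessageHandler and ProcessEvent but does not show their signatures, so the lower-case names, the parameter lists, and the heartbeat struct here are assumptions.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// heartbeat is a hypothetical event body used only for this sketch.
type heartbeat struct {
	JobState string `json:"job_state"`
}

// messageHandler receives the raw payload exactly as the NATS client
// delivers it: no string conversion and no interface{} boxing.
type messageHandler func(subject string, data []byte) error

// processEvent decodes once, at the top, directly from the byte slice.
func processEvent(data []byte) (heartbeat, error) {
	var hb heartbeat
	if err := json.Unmarshal(data, &hb); err != nil {
		return heartbeat{}, fmt.Errorf("decoding event: %w", err)
	}
	return hb, nil
}

func main() {
	var handle messageHandler = func(subject string, data []byte) error {
		hb, err := processEvent(data)
		if err != nil {
			return err
		}
		fmt.Println(subject, hb.JobState)
		return nil
	}
	// In the real subscription the second argument would be msg.Data.
	_ = handle("hm.agent.heartbeat.example", []byte(`{"job_state":"running"}`))
}
```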
The Ruby PagerDuty plugin uses the system CA bundle for its HTTPS request.
The Go rewrite constructed the HTTP client with an empty tls.Config (which
also uses the system bundle), but offered no mechanism for operators in
air-gapped environments to supply a custom root CA.

Add CACert string to pagerdutyOptions (json: "ca_cert"). When set, the
field is treated as a file path to a PEM-encoded CA certificate, mirroring
the ca_cert option convention used by the resurrector and director client.

ai-assisted=yes
[TNZ-113909] Convert BOSH health monitor job to Golang
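
A minimal sketch of loading a custom root CA from a file path, as described above. The newHTTPClient helper is hypothetical, and this version trusts only the supplied CA when ca_cert is set; whether the actual plugin appends to the system pool instead is not stated in the commit message.

```go
package main

import (
	"crypto/tls"
	"crypto/x509"
	"fmt"
	"net/http"
	"os"
)

// newHTTPClient builds an HTTPS client. An empty caCertPath leaves RootCAs
// nil, which makes crypto/tls fall back to the system bundle.
func newHTTPClient(caCertPath string) (*http.Client, error) {
	tlsConfig := &tls.Config{}
	if caCertPath != "" {
		pem, err := os.ReadFile(caCertPath)
		if err != nil {
			return nil, fmt.Errorf("reading ca_cert %q: %w", caCertPath, err)
		}
		pool := x509.NewCertPool()
		if !pool.AppendCertsFromPEM(pem) {
			return nil, fmt.Errorf("no PEM certificates found in %q", caCertPath)
		}
		tlsConfig.RootCAs = pool
	}
	return &http.Client{
		Transport: &http.Transport{TLSClientConfig: tlsConfig},
	}, nil
}

func main() {
	client, err := newHTTPClient("")
	fmt.Println(client != nil, err)
}
```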
Replace []string + manual string concatenation with []error + errors.Join
(Go 1.20+). The returned error now implements Unwrap() []error so callers
can use errors.Is/errors.As for targeted assertions instead of substring
matching. The rendered message is the same set of lines, just without the
redundant '  - ' prefix.

ai-assisted=yes
[TNZ-113909] Convert BOSH health monitor job to Golang
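
The []error plus errors.Join pattern can be sketched like this. The config struct, its fields, and the sentinel errors are hypothetical; only errors.Join and errors.Is are real standard-library calls (Go 1.20+).

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical sentinel errors for two required settings.
var (
	errMissingDirector = errors.New("director endpoint is required")
	errMissingNATS     = errors.New("nats endpoint is required")
)

// config is a hypothetical subset of the monitor configuration.
type config struct {
	Director string
	NATS     string
}

// validate collects every problem instead of stopping at the first one.
// errors.Join returns nil when no errors were collected.
func validate(c config) error {
	var errs []error
	if c.Director == "" {
		errs = append(errs, errMissingDirector)
	}
	if c.NATS == "" {
		errs = append(errs, errMissingNATS)
	}
	return errors.Join(errs...)
}

func main() {
	err := validate(config{})
	fmt.Println(errors.Is(err, errMissingNATS)) // true
	fmt.Println(err)                            // one message per line
}
```

Tests can then assert on a specific sentinel with errors.Is rather than matching substrings of the rendered message.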
crashed processing a delete-deployment --force. The health monitor
detects the stopped VM as "missing" and sends a recreate/resurrect request to the Director,
causing a race condition with delete-deployment.
…ependency

Copies the vendored golang-1.26-linux package (from cloudfoundry/bosh-package-golang-release)
which was already present on experiment-golang-bosh-nats-sync. Without this package
packages/health_monitor/spec had no golang dependency, so BOSH would not guarantee
the Go toolchain was compiled before health_monitor during a release build.

The packages/health_monitor/packaging script already loops over golang-* packages;
the spec declaration ensures the compilation ordering is explicit.

ai-assisted=yes
[TNZ-113909] Convert BOSH health monitor job to Golang
@aramprice aramprice force-pushed the experiment-golang-bosh-monitor branch from 78f1f97 to 9e76d88 Compare July 1, 2026 16:52
Projects

Status: Pending Merge | Prioritized

4 participants