Add disk usage monitoring and auto-alerting for all GCE instances#140

Draft
Copilot wants to merge 2 commits into main from copilot/add-disk-usage-monitoring

Copilot AI commented Mar 17, 2026

Validators silently shut down automatically at ~97% disk usage (as seen on 2026-03-15 on mainnet-validator-1), and no automated alerting existed. numbers-testnet-validator-3 is currently at 96%.

New files

  • monitoring/disk-check.sh — Cron-deployable per-instance disk monitor. Alerts via email and/or Slack at configurable WARNING (default 80%) and CRITICAL (default 90%) thresholds. All config via env vars; skips pseudo-filesystems automatically.

    # Deploy + schedule. Cron does not inherit variables exported in the
    # current shell, so the settings are written into the crontab itself.
    (crontab -l 2>/dev/null
     echo "ALERT_EMAIL=ops@example.com"
     echo "SLACK_WEBHOOK_URL=https://hooks.slack.com/services/XXX/YYY/ZZZ"
     echo "*/15 * * * * /opt/numbers-network/monitoring/disk-check.sh >> /var/log/disk-check.log 2>&1") | crontab -
  • monitoring/setup-gcp-disk-alerts.sh — Provisions GCP Cloud Monitoring alerting policies (WARNING >80% sustained 5 min, CRITICAL >90% sustained 1 min) and notification channels (email/Slack) via gcloud. Requires the Ops Agent on each instance for the agent.googleapis.com/disk/percent_used metric.

    export GCP_PROJECT=your-gcp-project-id
    export ALERT_EMAIL=ops@example.com
    bash monitoring/setup-gcp-disk-alerts.sh
  • docs/runbooks/disk-management.md — Operational runbook covering:

    • Online GCE persistent disk expansion (growpart + resize2fs, no downtime)
    • Avalanchego chain data pruning (state-sync re-sync, log rotation, snapshot cleanup)
    • Blockscout/PostgreSQL cleanup (VACUUM FULL, Docker prune)
    • Instance inventory with growth rate estimates
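The threshold logic described for monitoring/disk-check.sh can be illustrated with a minimal sketch. This is not the script from the PR: the variable names `WARN_THRESHOLD` and `CRIT_THRESHOLD`, the function names, the "at or above" comparison, and the rule of keeping only `/dev/*` sources as a stand-in for skipping pseudo-filesystems are all assumptions made for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical sketch; NOT the contents of monitoring/disk-check.sh.
WARN_THRESHOLD="${WARN_THRESHOLD:-80}"   # assumed variable name
CRIT_THRESHOLD="${CRIT_THRESHOLD:-90}"   # assumed variable name

# Map an integer usage percentage to a severity level.
classify_usage() {
  local pct="$1"
  if [ "$pct" -ge "$CRIT_THRESHOLD" ]; then
    echo "CRITICAL"
  elif [ "$pct" -ge "$WARN_THRESHOLD" ]; then
    echo "WARNING"
  else
    echo "OK"
  fi
}

# Read `df -P` output on stdin and print "<LEVEL> <pct>% <mount>" per device.
check_disks() {
  local fs blocks used avail capacity mount pct
  while read -r fs blocks used avail capacity mount; do
    case "$fs" in
      /dev/*) ;;        # keep real block devices
      *) continue ;;    # skip the header and pseudo-filesystems (tmpfs, udev, ...)
    esac
    pct="${capacity%\%}"                          # strip the trailing percent sign
    case "$pct" in ''|*[!0-9]*) continue ;; esac  # ignore non-numeric values
    echo "$(classify_usage "$pct") ${pct}% ${mount}"
  done
}

# Example: report on the local machine.
df -P | check_disks
```

A real monitor would send the WARNING and CRITICAL lines to email or Slack instead of printing them; that part is left out here because the PR description does not show how the script delivers alerts.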

Updated

  • README.md — Added "Disk Monitoring" section with ToC entry, quick-start usage for both scripts, and link to the runbook.
Original prompt

This section details the original issue you should resolve.

<issue_title>[Feature][High] Add disk usage monitoring and auto-alerting for all GCE instances</issue_title>
<issue_description>## Summary

Multiple GCE instances have reached or are approaching critical disk thresholds with no automated alerting in place. The 2026-03-15 mainnet validator-1 incident (auto-shutdown at 97% disk) and the current testnet-validator-3 situation (96% disk as of 2026-03-17) demonstrate an urgent need for proactive disk monitoring.

Current State (2026-03-17)

| Instance | Disk size | Used | Use% | Status |
|---|---|---|---|---|
| numbers-mainnet-validator-1 | 3.4T | 2.8T | 84% | Warning |
| numbers-mainnet-validator-a1 | 1.9T | 1.1T | 57% | OK |
| numbers-mainnet-validator-a2 | 2.0T | 970G | 49% | OK |
| numbers-testnet-validator-3 | 497G | 476G | 96% | CRITICAL |
| testnet-explorer | 29G | 25G | 84% | Warning |
| mainnet-explorer | 47G | 33G | 72% | OK |

Proposed Implementation

  1. GCP Cloud Monitoring alerting policies: Create uptime/disk metric alerts that fire at 80% (warning) and 90% (critical) thresholds
  2. Notification channels: Configure email and/or Slack notifications for disk alerts
  3. Runbook documentation: Add a disk cleanup/expansion runbook to the repository covering:
    • How to expand GCE persistent disks (online resize)
    • Avalanchego chain data pruning options
    • Blockscout/explorer database cleanup procedures
  4. Monitoring script: Add a cron-based disk check script that can be deployed to each instance as a fallback
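As a rough illustration of item 1, the sketch below renders a Cloud Monitoring alert policy as JSON for a given threshold. It is an assumption-laden example, not the provisioning script from this pull request: the function name `render_policy`, the display names, the 60-second alignment period, and the `state = "used"` label filter are illustrative choices. The durations follow the PR description (5 minutes for WARNING, 1 minute for CRITICAL), and the metric is the Ops Agent metric named there.

```shell
#!/usr/bin/env bash
# Hypothetical sketch; NOT the contents of monitoring/setup-gcp-disk-alerts.sh.

# Print a Cloud Monitoring alert policy as JSON.
# Arguments: display name, threshold in percent, duration (for example 300s).
render_policy() {
  local name="$1" threshold="$2" duration="$3"
  cat <<EOF
{
  "displayName": "${name}",
  "combiner": "OR",
  "conditions": [
    {
      "displayName": "Disk used above ${threshold}%",
      "conditionThreshold": {
        "filter": "resource.type = \"gce_instance\" AND metric.type = \"agent.googleapis.com/disk/percent_used\" AND metric.labels.state = \"used\"",
        "comparison": "COMPARISON_GT",
        "thresholdValue": ${threshold},
        "duration": "${duration}",
        "aggregations": [
          { "alignmentPeriod": "60s", "perSeriesAligner": "ALIGN_MEAN" }
        ]
      }
    }
  ]
}
EOF
}

# Print both policies; redirect each to a file before applying it.
render_policy "Disk usage WARNING" 80 300s
render_policy "Disk usage CRITICAL" 90 60s

# To apply a saved policy (requires gcloud and an authenticated project):
#   gcloud alpha monitoring policies create --policy-from-file=warning-policy.json
```

Notification channels are omitted from the JSON; they would be attached through the policy's `notificationChannels` field once the channel resources exist.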

Immediate Actions Needed

  • numbers-testnet-validator-3 at 96% needs immediate disk expansion or cleanup
  • testnet-explorer and numbers-mainnet-validator-1 at 84% should be monitored closely
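For the disk expansion option, an online resize usually comes down to three commands, which the dry-run sketch below prints without executing. Everything passed to `plan_resize` is a placeholder: the disk name, zone, target size, device, and partition number are assumptions, and the sketch presumes an ext4 filesystem on a device named like `/dev/sda` (NVMe devices use a different partition naming scheme).

```shell
#!/usr/bin/env bash
# Hypothetical sketch; all arguments below are placeholders, not real values.

# Print the commands for an online resize of an ext4 partition on a GCE disk.
# Arguments: disk name, zone, new size in GB, block device, partition number.
plan_resize() {
  local disk="$1" zone="$2" new_size_gb="$3" device="$4" part="$5"
  # 1. Grow the persistent disk itself (run from a machine with gcloud).
  echo "gcloud compute disks resize ${disk} --zone=${zone} --size=${new_size_gb}GB"
  # 2. Grow the partition to fill the larger disk (run on the instance).
  echo "sudo growpart ${device} ${part}"
  # 3. Grow the ext4 filesystem to fill the partition (run on the instance).
  echo "sudo resize2fs ${device}${part}"
}

plan_resize DISK_NAME ZONE 1000 /dev/sda 1
```

Printing the plan first makes it easy to review device and partition names against `lsblk` output before running anything, since persistent disks can be grown but not shrunk.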

Impact

High: without disk monitoring, validators silently shut down once free disk space drops below 3%, causing chain downtime and a transaction mempool backlog (as seen in the 2026-03-15 incident).

Generated by Health Monitor with Omni</issue_description>

Comments on the Issue (you are @copilot in this section)


Co-authored-by: numbers-official <181934381+numbers-official@users.noreply.github.com>
Copilot AI changed the title from "[WIP] [Feature] Add disk usage monitoring and auto-alerting for GCE instances" to "Add disk usage monitoring and auto-alerting for all GCE instances" on Mar 17, 2026
Copilot AI requested a review from numbers-official March 17, 2026 14:42

Development

Successfully merging this pull request may close these issues.

[Feature][High] Add disk usage monitoring and auto-alerting for all GCE instances
