Add disk usage monitoring and auto-alerting for all GCE instances by Copilot · Pull Request #140 · numbersprotocol/numbers-network

Copilot · 2026-03-17T14:31:44Z

Validators silently auto-shutdown at ~97% disk (as seen 2026-03-15 on mainnet-validator-1). No automated alerting existed. numbers-testnet-validator-3 is currently at 96%.

New files

monitoring/disk-check.sh — Cron-deployable per-instance disk monitor. Alerts via email and/or Slack at configurable WARNING (default 80%) and CRITICAL (default 90%) thresholds. All config via env vars; skips pseudo-filesystems automatically.

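```
# Deploy + schedule
# Note: cron does not inherit exports from this shell, so the job also needs
# these variables set in the crontab (or wherever the script reads its config).
export ALERT_EMAIL=ops@example.com
export SLACK_WEBHOOK_URL=https://hooks.slack.com/services/XXX/YYY/ZZZ
(crontab -l 2>/dev/null; echo "*/15 * * * * /opt/numbers-network/monitoring/disk-check.sh >> /var/log/disk-check.log 2>&1") | crontab -
```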
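To make the threshold behaviour concrete, here is a minimal sketch of how a per-instance check of this kind might classify usage. It is not the contents of monitoring/disk-check.sh: the variable names `WARNING_THRESHOLD` and `CRITICAL_THRESHOLD`, the function names, the inclusive comparison, and the list of excluded filesystem types are all assumptions made for illustration, and `df --output` requires GNU coreutils.

```shell
#!/usr/bin/env bash
# Hypothetical sketch only; names and defaults are assumed, not taken from the PR.
WARNING_THRESHOLD="${WARNING_THRESHOLD:-80}"    # assumed env var name
CRITICAL_THRESHOLD="${CRITICAL_THRESHOLD:-90}"  # assumed env var name

# classify_usage PCT: print OK, WARNING, or CRITICAL for an integer percentage.
# Thresholds are treated as inclusive here (an assumption).
classify_usage() {
  if [ "$1" -ge "$CRITICAL_THRESHOLD" ]; then
    echo "CRITICAL"
  elif [ "$1" -ge "$WARNING_THRESHOLD" ]; then
    echo "WARNING"
  else
    echo "OK"
  fi
}

# list_usage: print "<mountpoint> <use%>" per filesystem, excluding some common
# pseudo-filesystem types. Mount points containing spaces are not handled.
list_usage() {
  df --output=target,pcent -x tmpfs -x devtmpfs -x squashfs -x overlay 2>/dev/null |
    awk 'NR > 1 { gsub(/%/, "", $2); if ($2 ~ /^[0-9]+$/) print $1, $2 }'
}

# check_disks: print one line for every filesystem at or above the warning level.
check_disks() {
  list_usage | while read -r mount pct; do
    level="$(classify_usage "$pct")"
    if [ "$level" != "OK" ]; then
      echo "$level: $mount at ${pct}% on $(hostname)"
    fi
  done
}
```

Calling `check_disks` from cron and forwarding its output to email or a Slack webhook would complete the loop; the delivery step is left out because the PR text does not describe how the script sends alerts.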

monitoring/setup-gcp-disk-alerts.sh — Provisions GCP Cloud Monitoring alerting policies (WARNING >80% sustained 5 min, CRITICAL >90% sustained 1 min) and notification channels (email/Slack) via gcloud. Requires the Ops Agent on each instance for the agent.googleapis.com/disk/percent_used metric.
```
export GCP_PROJECT=your-gcp-project-id
export ALERT_EMAIL=ops@example.com
bash monitoring/setup-gcp-disk-alerts.sh
```
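For orientation, the following sketch shows roughly what one such alerting policy could look like when created from a JSON file. It is an illustration under stated assumptions, not the logic of setup-gcp-disk-alerts.sh: the function names, display names, the `state = "used"` label filter, and the 60-second alignment period are hypothetical choices, and the gcloud release track (`alpha` here) may differ between gcloud versions.

```shell
#!/usr/bin/env bash
# Hypothetical sketch; not the contents of monitoring/setup-gcp-disk-alerts.sh.

# render_disk_policy LEVEL THRESHOLD DURATION
# Prints a Cloud Monitoring alert policy (JSON) that fires when the Ops Agent
# disk metric stays above THRESHOLD percent for DURATION (for example "300s").
render_disk_policy() {
  cat <<EOF
{
  "displayName": "Disk usage $1 (>$2%)",
  "combiner": "OR",
  "conditions": [
    {
      "displayName": "Disk percent used above $2%",
      "conditionThreshold": {
        "filter": "resource.type = \"gce_instance\" AND metric.type = \"agent.googleapis.com/disk/percent_used\" AND metric.labels.state = \"used\"",
        "comparison": "COMPARISON_GT",
        "thresholdValue": $2,
        "duration": "$3",
        "aggregations": [
          { "alignmentPeriod": "60s", "perSeriesAligner": "ALIGN_MEAN" }
        ]
      }
    }
  ]
}
EOF
}

# create_disk_policy LEVEL THRESHOLD DURATION
# Writes the policy to a temp file and submits it; requires gcloud and GCP_PROJECT.
create_disk_policy() {
  policy_file="$(mktemp)"
  render_disk_policy "$1" "$2" "$3" > "$policy_file"
  gcloud alpha monitoring policies create \
    --project="$GCP_PROJECT" \
    --policy-from-file="$policy_file"
  rm -f "$policy_file"
}
```

With these helpers, `create_disk_policy WARNING 80 300s` and `create_disk_policy CRITICAL 90 60s` would mirror the two thresholds described above. Attaching notification channels is omitted from the sketch.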
docs/runbooks/disk-management.md — Operational runbook covering:
- Online GCE persistent disk expansion (growpart + resize2fs, no downtime)
- Avalanchego chain data pruning (state-sync re-sync, log rotation, snapshot cleanup)
- Blockscout/PostgreSQL cleanup (VACUUM FULL, Docker prune)
- Instance inventory with growth rate estimates
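As a rough illustration of the online expansion item, the sketch below prints the usual three-step sequence (resize the persistent disk, grow the partition, grow the filesystem) without executing anything. It is not taken from the runbook: the function name, disk name, zone, and device are placeholders, partition naming assumes `/dev/sdX` style devices, and `resize2fs` applies to ext filesystems only (XFS would use `xfs_growfs` instead).

```shell
#!/usr/bin/env bash
# Hypothetical helper; prints commands for review instead of running them.

# print_resize_plan DISK ZONE NEW_SIZE_GB DEVICE PARTITION_NUMBER
print_resize_plan() {
  # Step 1: grow the persistent disk (run from a machine with gcloud access).
  echo "gcloud compute disks resize $1 --zone=$2 --size=${3}GB"
  # Step 2: on the instance, extend the partition to fill the larger disk.
  echo "sudo growpart $4 $5"
  # Step 3: on the instance, grow the ext4 filesystem to fill the partition.
  echo "sudo resize2fs ${4}${5}"
}
```

For example, `print_resize_plan example-disk us-central1-a 750 /dev/sda 1` prints the three commands for a placeholder disk so they can be checked before being run by hand.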

Updated

README.md — Added "Disk Monitoring" section with ToC entry, quick-start usage for both scripts, and link to the runbook.

Original prompt

This section details the original issue you should resolve

<issue_title>[Feature][High] Add disk usage monitoring and auto-alerting for all GCE instances</issue_title>
<issue_description>## Summary

Multiple GCE instances have reached or are approaching critical disk thresholds with no automated alerting in place. The 2026-03-15 mainnet validator-1 incident (auto-shutdown at 97% disk) and the current testnet-validator-3 situation (96% disk as of 2026-03-17) demonstrate an urgent need for proactive disk monitoring.

## Current State (2026-03-17)

| Instance | Disk size | Used | Use% | Status |
|---|---|---|---|---|
| numbers-mainnet-validator-1 | 3.4T | 2.8T | 84% | Warning |
| numbers-mainnet-validator-a1 | 1.9T | 1.1T | 57% | OK |
| numbers-mainnet-validator-a2 | 2.0T | 970G | 49% | OK |
| numbers-testnet-validator-3 | 497G | 476G | 96% | CRITICAL |
| testnet-explorer | 29G | 25G | 84% | Warning |
| mainnet-explorer | 47G | 33G | 72% | OK |

## Proposed Implementation

1. GCP Cloud Monitoring alerting policies: Create uptime/disk metric alerts that fire at 80% (warning) and 90% (critical) thresholds
2. Notification channels: Configure email and/or Slack notifications for disk alerts
3. Runbook documentation: Add a disk cleanup/expansion runbook to the repository covering:
   - How to expand GCE persistent disks (online resize)
   - Avalanchego chain data pruning options
   - Blockscout/explorer database cleanup procedures
4. Monitoring script: Add a cron-based disk check script that can be deployed to each instance as a fallback

## Immediate Actions Needed

- numbers-testnet-validator-3 at 96% needs immediate disk expansion or cleanup
- testnet-explorer and numbers-mainnet-validator-1 at 84% should be monitored closely

## Impact

High — without disk monitoring, validators will silently auto-shutdown when disk < 3% free, causing chain downtime and transaction mempool backlog (as seen in the 2026-03-15 incident).

Generated by Health Monitor with Omni</issue_description>

Comments on the Issue (you are @copilot in this section)

Fixes [Feature][High] Add disk usage monitoring and auto-alerting for all GCE instances #138

Co-authored-by: numbers-official <181934381+numbers-official@users.noreply.github.com>

Initial plan

e3af572

Copilot AI assigned Copilot and numbers-official Mar 17, 2026

Copilot started work on behalf of numbers-official March 17, 2026 14:31 View session

feat: add disk usage monitoring and auto-alerting for all GCE instances

39273e1

Co-authored-by: numbers-official <181934381+numbers-official@users.noreply.github.com>

Copilot AI changed the title ~~[WIP] [Feature] Add disk usage monitoring and auto-alerting for GCE instances~~ Add disk usage monitoring and auto-alerting for all GCE instances Mar 17, 2026

Copilot AI requested a review from numbers-official March 17, 2026 14:42

Copilot finished work on behalf of numbers-official March 17, 2026 14:42

Copilot wants to merge 2 commits into main from copilot/add-disk-usage-monitoring
