Summary
Routine GCP review on 2026-05-15 found that the pi.ruv.io brain is up, but most of its Cloud Scheduler optimization jobs have been failing for days. The root causes are mixed: one dead route, three jobs racing the same overloaded endpoint, one unrelated Firestore backup config, and one unrelated Cloud Run service that never starts.
Findings
1. brain-reclassify-daily → dead route (NOT_FOUND, every 4h)
Scheduler job brain-reclassify-daily POSTs to https://pi.ruv.io/v1/reclassify, but that route does not exist in crates/mcp-brain-server/src/routes.rs (verified against the router build at lines 325–365 of routes.rs). The job has returned 404 on every fire, at least back through 2026-05-14. Either restore the handler or delete/pause the scheduler job.
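A cheap guard against this class of drift is to diff the scheduler targets against the router's path table. The sketch below assumes the path list has been copied out of routes.rs by hand; the route list in it is a hypothetical excerpt containing only the two paths confirmed in this issue, not the real table.

```rust
/// Strips scheme and host from a URL, e.g. "https://pi.ruv.io/v1/status" -> "/v1/status".
/// Query strings are not stripped; this is only meant for plain POST targets.
fn url_path(url: &str) -> &str {
    let after_scheme = url.split_once("://").map_or(url, |(_, rest)| rest);
    after_scheme.find('/').map_or("/", |i| &after_scheme[i..])
}

/// Returns the scheduler target URLs whose path is not in `routes`.
/// Exact string match only: parameterised routes (e.g. "/v1/x/:id") would need
/// the router's own matcher.
fn dead_targets<'a>(targets: &[&'a str], routes: &[&str]) -> Vec<&'a str> {
    targets
        .iter()
        .copied()
        .filter(|url| !routes.contains(&url_path(url)))
        .collect()
}

fn main() {
    // Hypothetical excerpt: only the two paths confirmed in this issue.
    let routes = ["/v1/status", "/v1/pipeline/optimize"];
    // Target URLs of the brain scheduler jobs, as listed in this issue.
    let targets = [
        "https://pi.ruv.io/v1/reclassify",
        "https://pi.ruv.io/v1/pipeline/optimize",
    ];
    for url in dead_targets(&targets, &routes) {
        println!("no route for {url}");
    }
}
```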
2. Three schedulers hammer /v1/pipeline/optimize and time out
- brain-train: every 30m, mix of DEADLINE_EXCEEDED / RESOURCE_EXHAUSTED / UNKNOWN
- brain-full-optimize: daily 03:00, DEADLINE_EXCEEDED
- brain-graph: every 6h, DEADLINE_EXCEEDED / UNKNOWN
All three POST to the same endpoint. The endpoint exists (routes.rs:365 → pipeline_optimize), but on revision ruvbrain-00203-brv (deployed 2026-04-14) running with 2 vCPU / 4Gi, the graph rebuild work (22k nodes, 1.2M edges) does not finish within Cloud Scheduler's attempt deadline.
Options:
- Raise the Scheduler attemptDeadline on the three jobs (the default for HTTP targets is 3 minutes; the maximum is 30 minutes)
- Split /v1/pipeline/optimize into work-class-specific paths so train/optimize/graph don't compete
- Bump Cloud Run memory/CPU and/or set --no-cpu-throttling
- Move to an async job pattern (the scheduler kicks, the endpoint returns 202, a worker drains a queue)
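The async option can be sketched with std only. This is a minimal illustration, assuming hypothetical WorkClass names that mirror the three jobs; the real handler would live in whatever framework routes.rs uses and respond 202. Repeat kicks for a class that is already queued are coalesced instead of piling up, and a single worker runs jobs one at a time.

```rust
use std::collections::HashSet;
use std::sync::mpsc::{channel, Receiver, Sender};
use std::sync::{Arc, Mutex};
use std::thread;

/// Hypothetical work classes mirroring brain-train, brain-full-optimize and brain-graph.
#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
enum WorkClass {
    Train,
    FullOptimize,
    Graph,
}

struct Kicker {
    tx: Sender<WorkClass>,
    /// Classes queued but not yet started; a repeat kick for one of these is coalesced.
    pending: Arc<Mutex<HashSet<WorkClass>>>,
}

impl Kicker {
    fn new() -> (Kicker, Receiver<WorkClass>) {
        let (tx, rx) = channel();
        let pending = Arc::new(Mutex::new(HashSet::new()));
        (Kicker { tx, pending }, rx)
    }

    /// What the HTTP handler would call. Returns true if a run was queued, false if the
    /// class was already pending. The handler answers 202 either way, so Scheduler logs success.
    fn kick(&self, class: WorkClass) -> bool {
        let newly_queued = self.pending.lock().unwrap().insert(class);
        if newly_queued {
            self.tx.send(class).expect("worker receiver dropped");
        }
        newly_queued
    }
}

/// Single worker: drains the queue one job at a time, so the classes never compete for CPU.
fn spawn_worker(
    rx: Receiver<WorkClass>,
    pending: Arc<Mutex<HashSet<WorkClass>>>,
) -> thread::JoinHandle<Vec<WorkClass>> {
    thread::spawn(move || {
        let mut completed = Vec::new();
        for class in rx {
            // Clear the flag before running, so a kick that lands mid-run queues one follow-up.
            pending.lock().unwrap().remove(&class);
            // The real rebuild work would run here.
            completed.push(class);
        }
        completed // the loop ends once every Sender has been dropped
    })
}

fn main() {
    let (kicker, rx) = Kicker::new();
    kicker.kick(WorkClass::Train);
    kicker.kick(WorkClass::Train); // coalesced: Train is already pending
    kicker.kick(WorkClass::FullOptimize);
    kicker.kick(WorkClass::Graph);
    let worker = spawn_worker(rx, Arc::clone(&kicker.pending));
    drop(kicker); // closes the channel so the worker exits after draining
    println!("{:?}", worker.join().unwrap());
}
```

On Cloud Run this pattern depends on the option above it: with the default request-based CPU allocation, work that continues after the 202 is sent is CPU-throttled, so --no-cpu-throttling (or a separate job runner) would be needed.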
3. firestore-daily-backup / firestore-weekly-backup → INVALID_ARGUMENT
Daily and weekly Firestore exports have failed on every run for at least 5 days (most recent 2026-05-15 02:00Z). The cause is likely stale arguments (bucket, collectionIds, or output URI). This is not a brain-server problem, but it breaks DR; the fix should be a one-line job redefinition.
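Before redefining the jobs, the request body can be sanity-checked offline. This sketch assumes the backup jobs are HTTP-target jobs calling the Firestore Admin API exportDocuments method with outputUriPrefix and collectionIds fields (this issue doesn't say how they are defined). It checks only a subset of the Cloud Storage bucket-naming rules, and the bucket name used is hypothetical.

```rust
/// Pre-flight check for the exportDocuments request body used by the backup jobs.
/// Covers only a subset of Cloud Storage bucket-naming rules (3-63 chars, lowercase
/// letters, digits, '-', '_', '.', alphanumeric at both ends); longer dotted names
/// are not handled. It does not contact GCP.
fn validate_export(output_uri_prefix: &str, collection_ids: &[&str]) -> Result<(), String> {
    let rest = output_uri_prefix
        .strip_prefix("gs://")
        .ok_or_else(|| format!("outputUriPrefix must start with gs://, got {output_uri_prefix:?}"))?;
    let bucket = rest.split('/').next().unwrap_or("");
    let valid_chars = bucket
        .chars()
        .all(|c| c.is_ascii_lowercase() || c.is_ascii_digit() || matches!(c, '-' | '_' | '.'));
    let alnum_ends = bucket.chars().next().map_or(false, |c| c.is_ascii_alphanumeric())
        && bucket.chars().last().map_or(false, |c| c.is_ascii_alphanumeric());
    if !(3..=63).contains(&bucket.len()) || !valid_chars || !alnum_ends {
        return Err(format!("bucket name {bucket:?} breaks Cloud Storage naming rules"));
    }
    // An empty collectionIds list means "export all collections", which is valid.
    for id in collection_ids {
        if id.is_empty() || id.contains('/') {
            return Err(format!("invalid collection ID {id:?}"));
        }
    }
    Ok(())
}

fn main() {
    // Hypothetical values: the real bucket and collection IDs are not recorded in this issue.
    match validate_export("gs://example-backups/daily", &[]) {
        Ok(()) => println!("request body looks well-formed"),
        Err(e) => println!("stale argument: {e}"),
    }
}
```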
4. rvagent-learning Cloud Run service won't start
Revision rvagent-learning-00001-kld never bound PORT=8080 within the health-check window, yet scheduler job rvagent-learning-daily keeps invoking it. The source isn't in this repo, so it needs to be traced to its owner (or the scheduler job paused).
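Since the rvagent-learning source is unavailable, the actual cause is unknown. Two common causes of this symptom are listening on 127.0.0.1 or a hard-coded port instead of the PORT value Cloud Run injects, and doing slow initialisation before binding. A minimal std-only sketch of the startup order that avoids both, assuming nothing about the real service:

```rust
use std::net::TcpListener;

/// Cloud Run injects the listening port through the PORT environment variable (8080 by
/// default) and requires the container to listen on all interfaces, not 127.0.0.1.
/// Falls back to 8080 if PORT is unset or unparsable.
fn listen_addr(port_env: Option<String>) -> String {
    let port = port_env
        .and_then(|p| p.parse::<u16>().ok())
        .unwrap_or(8080);
    format!("0.0.0.0:{port}")
}

fn main() -> std::io::Result<()> {
    // Bind first, before any slow initialisation, so Cloud Run's startup check sees an open port.
    let listener = TcpListener::bind(listen_addr(std::env::var("PORT").ok()))?;
    println!("listening on {}", listener.local_addr()?);
    // Slow initialisation and the accept loop would follow here.
    Ok(())
}
```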
5. Stale brain deployment + paused optimization jobs
- The latest deployed revision, ruvbrain-00203-brv, is 31 days old (2026-04-14). Worth a fresh deploy.
- CLAUDE.md says "7 optimization jobs running", but brain-drift, brain-transfer, brain-attractor and brain-partition-cache are actually PAUSED. Either this is intentional and CLAUDE.md is stale, or they should be re-enabled. Worth a one-line decision.
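To keep CLAUDE.md and reality from drifting again, the paused set can be pulled from gcloud scheduler jobs list --location=us-central1 --format="value(name.basename(),state)". The parser below is a sketch that assumes one job per line with the name followed by the state; the sample listing is hypothetical but uses the job names and states reported here.

```rust
/// Parses a job listing (one job per line, NAME then STATE, whitespace-separated)
/// and returns the names whose state is PAUSED.
fn paused_jobs(listing: &str) -> Vec<&str> {
    listing
        .lines()
        .filter_map(|line| {
            let mut fields = line.split_whitespace();
            match (fields.next(), fields.next()) {
                (Some(name), Some("PAUSED")) => Some(name),
                _ => None,
            }
        })
        .collect()
}

fn main() {
    // Hypothetical sample shaped like the gcloud listing.
    let listing = "brain-train\tENABLED\nbrain-drift\tPAUSED\nbrain-transfer\tPAUSED\n\
                   brain-attractor\tPAUSED\nbrain-partition-cache\tPAUSED\n";
    for name in paused_jobs(listing) {
        println!("{name} is PAUSED");
    }
}
```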
6. Embedding engine
/v1/status reports embedding_engine=ruvllm::HashEmbedder, embedding_dim=128, embedding_corpus=0. These are hash embeddings, not real semantic vectors. Replacing them is separate (existing) work, but the current state should be documented so consumers know brain_search recall is bounded.
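Why recall is bounded can be shown with a generic feature-hashing embedder. This is an illustrative sketch of the hashing trick, not ruvllm::HashEmbedder's actual algorithm: each token lands in one of 128 buckets, so texts score as similar only when they share tokens (or tokens happen to collide), never because they mean similar things.

```rust
/// FNV-1a 64-bit hash: deterministic across runs and Rust versions.
fn fnv1a(s: &str) -> u64 {
    let mut h: u64 = 0xcbf29ce484222325;
    for b in s.bytes() {
        h ^= b as u64;
        h = h.wrapping_mul(0x100000001b3);
    }
    h
}

/// Illustrative feature-hashing embedder (NOT ruvllm::HashEmbedder's code): each lowercase
/// token adds +1 or -1 to one of `dim` buckets, then the vector is L2-normalised.
fn hash_embed(text: &str, dim: usize) -> Vec<f32> {
    let mut v = vec![0.0f32; dim];
    for token in text.split_whitespace() {
        let h = fnv1a(&token.to_lowercase());
        let idx = (h % dim as u64) as usize;
        let sign = if (h >> 63) == 0 { 1.0 } else { -1.0 };
        v[idx] += sign;
    }
    let norm = v.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm > 0.0 {
        for x in &mut v {
            *x /= norm;
        }
    }
    v
}

/// Inputs are already unit-length (or all-zero), so the dot product is the cosine.
fn cosine(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

fn main() {
    let a = hash_embed("graph rebuild timeout", 128);
    let b = hash_embed("timeout rebuild graph", 128);
    let c = hash_embed("slow optimisation job", 128);
    println!("same tokens, reordered: {:.3}", cosine(&a, &b));
    println!("related meaning, no shared tokens: {:.3}", cosine(&a, &c));
}
```

The reordered text scores as identical, while the related text with no shared tokens scores 0 unless two of its tokens happen to fall in the same bucket.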
Reproduction
curl -sS https://pi.ruv.io/v1/status | jq '.embedding_engine,.graph_nodes,.graph_edges'
gcloud scheduler jobs list --location=us-central1 \
--format="table(name.basename(),state,schedule,status.code)"
gcloud logging read 'resource.type="cloud_scheduler_job" AND severity>=ERROR' \
--limit=20 --freshness=24h --format=json | jq -r '.[].jsonPayload | "\(.targetType) \(.status) \(.url)"'
Proposed fix order (low-risk → high-risk)
- Pause brain-reclassify-daily until the route is restored or the job is deleted (stops the 404 spam)
- Pause rvagent-learning-daily until its owner re-attaches it (stops invoking the broken service)
- Bump attemptDeadline on brain-train / brain-full-optimize / brain-graph to 20–30 min
- Re-deploy ruvbrain from current main so the 31-day staleness goes away
- Fix the firestore-daily-backup / firestore-weekly-backup arguments
- Decide on the paused jobs (re-enable them, or remove them from the CLAUDE.md doc)
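Raising attemptDeadline alone may not be enough. Cloud Run's default request timeout is 300 s, so if ruvbrain still uses the default (this issue doesn't say), a 20–30 min Scheduler deadline would just turn DEADLINE_EXCEEDED into a Cloud Run timeout. The check below is a sketch of that reasoning. The two settings correspond to gcloud scheduler jobs update http JOB --attempt-deadline=... and gcloud run services update ruvbrain --timeout=....

```rust
/// Cloud Scheduler attemptDeadline bounds for HTTP targets: 15 s to 30 min.
const SCHEDULER_MIN_S: u32 = 15;
const SCHEDULER_MAX_S: u32 = 30 * 60;
/// Cloud Run's request timeout when none is configured.
const CLOUD_RUN_DEFAULT_TIMEOUT_S: u32 = 300;

/// Returns human-readable problems with a proposed deadline pairing (empty if none).
/// `cloud_run_timeout_s = None` means the service uses Cloud Run's default.
fn check_deadline(attempt_deadline_s: u32, cloud_run_timeout_s: Option<u32>) -> Vec<String> {
    let mut problems = Vec::new();
    if !(SCHEDULER_MIN_S..=SCHEDULER_MAX_S).contains(&attempt_deadline_s) {
        problems.push(format!(
            "attemptDeadline {attempt_deadline_s}s is outside Scheduler's 15s..1800s range"
        ));
    }
    let run_timeout = cloud_run_timeout_s.unwrap_or(CLOUD_RUN_DEFAULT_TIMEOUT_S);
    if run_timeout < attempt_deadline_s {
        problems.push(format!(
            "Cloud Run timeout {run_timeout}s ends the request before the {attempt_deadline_s}s attemptDeadline"
        ));
    }
    problems
}

fn main() {
    // 30 min deadline against a service assumed to be on the default request timeout.
    for problem in check_deadline(30 * 60, None) {
        println!("{problem}");
    }
}
```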
Environment
- GCP project: ruv-dev, region: us-central1
- Cloud Run: ruvbrain rev 00203-brv from gcr.io/ruv-dev/ruvbrain:latest@sha256:9f71c28f...
- Domain mappings healthy: pi.ruv.io → ruvbrain, mcp.pi.ruv.io → ruvbrain-sse