Persist data to Cloud Storage before processing to prevent OOM data loss#146
Merged
Conversation
Persist data to Cloud Storage before processing to prevent OOM data loss

When the Cloud Function OOM-crashes during metadata processing or the OSF upload, the researcher's data payload is lost because no catch block executes. This change writes the data to Cloud Storage immediately after validation, before any heavy processing begins. If the function crashes, the data survives in the pending-data/ prefix and can be recovered. On successful OSF upload (or successful queue), the pending copy is cleaned up.

Also adds the storage emulator config to firebase.json so tests can exercise the persist/cleanup cycle.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
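The persist-before-processing flow described above can be sketched as follows. This is a minimal illustration using an in-memory map in place of Cloud Storage; the names persistPending and cleanupPending follow the PR summary, while the Envelope shape, the id parameter, and the handleRequest wrapper are assumptions made for the sketch, not the PR's actual implementation.

```typescript
// Illustrative persist-first flow. The Map stands in for the bucket;
// the envelope shape and handler signature are hypothetical.
type Envelope = { data: string; metadataOptions?: Record<string, unknown> };

const store = new Map<string, Envelope>(); // stand-in for Cloud Storage

function persistPending(id: string, envelope: Envelope): void {
  // Step 1: write the full request envelope before any heavy work,
  // so a later OOM crash cannot lose the researcher's data.
  store.set(`pending-data/${id}`, envelope);
}

function cleanupPending(id: string): void {
  // Step 3: called only after a successful OSF upload (or successful queue).
  store.delete(`pending-data/${id}`);
}

function handleRequest(id: string, envelope: Envelope, process: () => void): void {
  persistPending(id, envelope); // data is now crash-safe
  try {
    process();                  // step 2: metadata processing + OSF upload
    cleanupPending(id);         // success: remove the pending copy
  } catch {
    // A thrown error keeps the pending copy around for recovery.
    // A real OOM kill skips even this block -- which is exactly why
    // the persisted copy has to exist before processing starts.
  }
}
```

If `process` throws (or the process is killed outright), the pending copy remains under pending-data/ for the recovery path to pick up.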
Adds scheduledPendingRecovery that runs every 15 minutes to scan the pending-data/ prefix for stale files (older than 15 min). For each orphaned file, it replays the full processing pipeline: token resolution, metadata processing (if active), and OSF upload. This handles the case where api-data OOM-crashed after persisting but before completing. Also updates persist-pending to store the full request envelope (including metadataOptions) so the recovery function can replay metadata processing. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
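The staleness check at the heart of that scan can be sketched like this. The 15-minute threshold comes from the commit message above; the PendingFile shape is an assumption for illustration, not the actual Cloud Storage file-metadata type.

```typescript
// Sketch of the orphan-selection rule scheduledPendingRecovery applies
// when scanning pending-data/. Field names here are illustrative.
interface PendingFile {
  name: string;
  timeCreated: number; // epoch milliseconds
}

const STALE_MS = 15 * 60 * 1000; // matches the 15-minute schedule

function selectStale(files: PendingFile[], now: number): PendingFile[] {
  // Files younger than the interval may still belong to a live api-data
  // invocation, so only strictly older ones are treated as orphaned.
  return files.filter((f) => now - f.timeCreated > STALE_MS);
}
```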
- Add PUT /endpoint to mock server so OSF upload succeeds in tests
- Update early-persist test to use mock server
- Fix skip-metadata test assertion to check property existence instead of non-empty value (metadata errors return empty string without mock)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The metadata-emulator test already uses port 3000 for its mock server. Use port 3001 with an inline mock server for the early-persist test to avoid port conflicts when Jest runs tests in parallel. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The data-emulator test was flaky on CI due to resource contention when running all test files in parallel. Increase the polling timeout from 10s to 30s. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Instead of reimplementing OSF upload logic, the recovery function now promotes orphaned pending-data/ files into the existing uploadQueue system. This means recovered data immediately appears in the researcher's dashboard QueuePanel and follows the same retry/download lifecycle as normal upload failures. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Expand "Why am I seeing this?" to cover OOM/crash recoveries alongside OSF errors and config issues
- Map raw failure reasons to plain-language descriptions so researchers understand what happened without technical jargon

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
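The reason-mapping idea can be sketched as a small lookup with a safe fallback. The failure codes and wording below are hypothetical placeholders, not the PR's actual table.

```typescript
// Illustrative mapping from raw failure codes to researcher-facing text.
// Codes and copy are made up for this sketch.
const REASON_TEXT: Record<string, string> = {
  OOM_RECOVERY:
    "The server ran out of memory; your data was saved and re-queued automatically.",
  OSF_UPLOAD_FAILED:
    "The upload to OSF did not succeed; it will be retried automatically.",
  CONFIG_ERROR:
    "The experiment's storage settings look incorrect and need attention.",
};

function describeFailure(code: string): string {
  // Fall back to the raw code so an unmapped reason is never hidden.
  return REASON_TEXT[code] ?? code;
}
```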
- Pending uploads are no longer shown in the alert panel. A light text indicator near the header badges shows retry count and next retry time instead; no alarm for things the system handles.
- The full alert panel (FailedUploadsPanel) only appears when uploads have exhausted all retries and the researcher needs to download.
- Failure reasons get their own REASON column instead of tiny text under the filename.
- Replace ATTEMPTS column with AUTO-CLEANUP (time until data expires).
- Add UploadsResolvedNotice: a brief success confirmation when all queued uploads complete, so the panel doesn't just vanish.
- Remove error log mixing from the queue table (ErrorPanel handles those separately).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Researchers should be able to see and download queued data files as soon as they appear, not after 30 hours of retries. The panel now shows all entries (pending + failed) in a single table with:

- STATUS column with badge and next retry time for pending items
- REASON column with a human-readable failure explanation
- STORED FOR column showing time until auto-cleanup
- Download button available immediately for every entry

The panel uses a warning tone for pending items (retries still running) and an error tone when all retries are exhausted.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
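The tone rule and the STORED FOR countdown described above can be sketched with two small helpers. The entry shape, field names, and hour-granularity formatting are assumptions for this sketch, not the component's real props.

```typescript
// Illustrative helpers for the panel behavior: warning tone while any
// retries remain, error tone once all are exhausted. Shapes are assumed.
interface QueueEntry {
  attemptsLeft: number; // retries remaining for this upload
}

function panelTone(entries: QueueEntry[]): "warning" | "error" {
  // Any entry with retries remaining keeps the softer warning tone.
  return entries.some((e) => e.attemptsLeft > 0) ? "warning" : "error";
}

function storedFor(expiresAt: number, now: number): string {
  // Time remaining until auto-cleanup, rounded up to whole hours
  // for the STORED FOR column (clamped at zero once expired).
  const hours = Math.max(0, Math.ceil((expiresAt - now) / 3_600_000));
  return `${hours}h`;
}
```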
Summary
- Persist incoming data to Cloud Storage (under the pending-data/ prefix) immediately after validation but before heavy processing (metadata, OSF upload)
- Return a DATA_PERSIST_ERROR message for the (rare) case where the initial persist fails
- Add the storage emulator config to firebase.json for testing

Context
Issue #102 — OOM crashes during metadata or OSF upload processing kill the function instantly, bypassing all catch blocks. The researcher's data in the request body is permanently lost. This change ensures data is safely persisted before any memory-intensive work begins.
How it works
- persistPending() writes the raw data to Cloud Storage
- On success, cleanupPending() removes the temporary file
- On failure, queueUpload writes its own copy
- A scheduled recovery function scans pending-data/ for stale files

Test plan
🤖 Generated with Claude Code