Merge test into main #147

Merged
jodeleeuw merged 22 commits into main from test
Apr 1, 2026

@jodeleeuw
Member

Summary

  • Add upload queue feature: cache failed OSF uploads in Firestore/Storage and retry automatically via scheduled function
  • Add QueuePanel dashboard component with download, retry, and ZIP export for queued files
  • Improve error logging across all OSF upload and metadata failure paths
  • Extract shared token resolution helper; fix OAuth token refresh fallback to PAT
  • Skip metadata processing when metadata is inactive (performance optimization)
  • Increase memory limit for data upload functions to 512MiB
  • Add Firestore and Storage security rules for uploadQueue
  • Add emulator tests for upload queue and skip-metadata behavior
  • Replace fixed sleep with polling in emulator tests for CI reliability
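The retry behaviour in the first bullet can be pictured with a small in-memory sketch. This is not the code from this PR: `RetryEntry`, `runRetryPass`, the `tryUpload` callback, and the `maxAttempts` value are all hypothetical names and parameters, and the real queue lives in Firestore/Storage rather than in an array.

```typescript
// Hypothetical shape of a queued upload; field names are illustrative only.
type RetryStatus = "pending" | "failed";

interface RetryEntry {
  id: string;
  attempts: number;
  status: RetryStatus;
}

// Runs one retry pass and returns the entries that stay in the queue.
// `tryUpload` stands in for the real OSF upload and returns true on success.
function runRetryPass(
  queue: RetryEntry[],
  tryUpload: (entry: RetryEntry) => boolean,
  maxAttempts: number
): RetryEntry[] {
  const remaining: RetryEntry[] = [];
  for (const entry of queue) {
    if (entry.status === "failed") {
      // Retries were exhausted earlier: keep the entry for manual download.
      remaining.push(entry);
      continue;
    }
    if (tryUpload(entry)) {
      continue; // Uploaded successfully: drop the entry from the queue.
    }
    const attempts = entry.attempts + 1;
    remaining.push({
      ...entry,
      attempts,
      status: attempts >= maxAttempts ? "failed" : "pending",
    });
  }
  return remaining;
}
```

A scheduled function would call something like `runRetryPass` on each tick; entries that reach the attempt limit stay visible so the researcher can still download them.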

Test plan

  • Verify upload queue retry works end-to-end with OSF
  • Confirm QueuePanel displays queued files with download/retry/ZIP functionality
  • Check that metadata skip optimization doesn't affect active metadata experiments
  • Run emulator test suite (npm run test-ci in functions/)

🤖 Generated with Claude Code

jodeleeuw and others added 18 commits March 30, 2026 16:21
The apiData and apiBase64 functions were running with the default 256MiB
memory limit, which is insufficient for the Node.js runtime + Firebase SDK
baseline (~150MiB) plus multiple copies of the data payload held in memory
during upload. This caused OOM kills that returned 503 responses without
CORS headers, leading users to report CORS errors (see #102).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: increase memory limit for data upload functions
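The memory argument above can be turned into rough arithmetic. The sketch below is an estimate only: the ~150MiB baseline comes from the commit message, while `copiesInMemory` is an assumed parameter (the message says "multiple copies" without giving a number) and the function names are hypothetical.

```typescript
// Approximate Node.js runtime + Firebase SDK baseline quoted in the commit message.
const BASELINE_MIB = 150;

// Rough peak memory: baseline plus every in-memory copy of the payload.
// `copiesInMemory` is an assumption, not a measured value.
function estimatePeakMiB(payloadMiB: number, copiesInMemory: number): number {
  return BASELINE_MIB + payloadMiB * copiesInMemory;
}

// True when the estimated peak stays within the configured function limit.
function fitsInLimit(
  payloadMiB: number,
  copiesInMemory: number,
  limitMiB: number
): boolean {
  return estimatePeakMiB(payloadMiB, copiesInMemory) <= limitMiB;
}
```

For a payload of 32 MiB and an assumed four copies, the estimate is 150 + 4 × 32 = 278 MiB, which exceeds a 256MiB limit and fits within 512MiB.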
When an experiment has metadataActive=false, the blockMetadata function
was still called, performing unnecessary token decryption, potential
OAuth refresh, and Firestore document reference creation. This change
skips the entire metadata block when metadata is disabled, reducing
function execution time and avoiding unnecessary OSF API calls.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
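A minimal sketch of the guard described above, assuming a simplified handler: `ExperimentConfig`, `MetadataResult`, and `processMetadataIfActive` are hypothetical names, and `runMetadata` stands in for the real `blockMetadata` call with its token decryption and OSF requests.

```typescript
interface ExperimentConfig {
  metadataActive: boolean;
}

interface MetadataResult {
  metadataMessage: string;
}

// Only invokes the (expensive) metadata step when the experiment opts in.
// `runMetadata` is a stand-in for the real metadata pipeline.
function processMetadataIfActive(
  experiment: ExperimentConfig,
  runMetadata: () => string
): MetadataResult {
  let metadataMessage = "";
  if (experiment.metadataActive) {
    metadataMessage = runMetadata();
  }
  return { metadataMessage };
}
```

When `metadataActive` is false the callback is never invoked, so none of its side effects (token work, document creation) can occur.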
Verifies that when metadataActive is false:
- metadataMessage is empty in the response
- no metadata document is created in Firestore
- metadata processing is still attempted when metadataActive is true

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…t data loss

When the Cloud Function OOM-crashes during metadata processing or OSF
upload, the researcher's data payload is lost because no catch block
executes. This change writes the data to Cloud Storage immediately after
validation, before any heavy processing begins. If the function crashes,
the data survives in the pending-data/ prefix and can be recovered.

On successful OSF upload (or successful queue), the pending copy is
cleaned up. Also adds the storage emulator config to firebase.json so
tests can exercise the persist/cleanup cycle.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
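The persist-then-process ordering can be sketched with an in-memory map standing in for Cloud Storage. Everything here is illustrative: `PendingStore`, `handleUpload`, and the key layout after the `pending-data/` prefix are assumptions, and a thrown error only approximates an OOM kill (a real crash runs no further code at all).

```typescript
// In-memory stand-in for the Cloud Storage bucket (assumption for this sketch).
type PendingStore = Map<string, string>;

// Persists the payload first, then runs the heavy step, then cleans up.
// If `uploadToOsf` throws, the pending copy is left in place for recovery.
function handleUpload(
  store: PendingStore,
  id: string,
  payload: string,
  uploadToOsf: (payload: string) => void
): void {
  const key = `pending-data/${id}`;
  store.set(key, payload); // 1. persist before any heavy processing
  uploadToOsf(payload); // 2. may fail or crash
  store.delete(key); // 3. reached only on success
}
```

The point of the ordering is that step 3 is the only place the pending copy is removed, so any failure before it leaves recoverable data behind.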
Adds scheduledPendingRecovery that runs every 15 minutes to scan the
pending-data/ prefix for stale files (older than 15 min). For each
orphaned file, it replays the full processing pipeline: token resolution,
metadata processing (if active), and OSF upload. This handles the case
where api-data OOM-crashed after persisting but before completing.

Also updates persist-pending to store the full request envelope (including
metadataOptions) so the recovery function can replay metadata processing.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
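The stale-file scan might look roughly like the following. The 15-minute threshold and the `pending-data/` prefix come from the commit message; `PendingFile`, `findStaleFiles`, and the `createdAtMs` field are hypothetical, and the real function reads object metadata from Cloud Storage rather than an array.

```typescript
interface PendingFile {
  path: string;
  createdAtMs: number; // creation time in milliseconds since the epoch
}

const STALE_AFTER_MS = 15 * 60 * 1000; // 15 minutes, per the commit message

// Returns pending files older than the threshold, i.e. likely orphaned
// by a crash between persisting and finishing the upload.
function findStaleFiles(
  files: PendingFile[],
  nowMs: number,
  staleAfterMs: number = STALE_AFTER_MS
): PendingFile[] {
  return files.filter(
    (file) =>
      file.path.startsWith("pending-data/") &&
      nowMs - file.createdAtMs > staleAfterMs
  );
}
```

Files younger than the threshold are skipped on purpose, since they may belong to a request that is still in flight.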
- Add PUT /endpoint to mock server so OSF upload succeeds in tests
- Update early-persist test to use mock server
- Fix skip-metadata test assertion to check property existence instead
  of non-empty value (metadata errors return empty string without mock)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The metadata-emulator test already uses port 3000 for its mock server.
Use port 3001 with an inline mock server for the early-persist test
to avoid port conflicts when Jest runs tests in parallel.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The data-emulator test was flaky on CI due to resource contention
when running all test files in parallel. Increase the polling
timeout from 10s to 30s.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
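For reference, a polling helper of the kind these tests rely on can be sketched as below. `pollUntil` and its parameters are hypothetical and not the helper from this repository; the timeout and interval values are chosen by the caller.

```typescript
// Repeatedly calls `check` until it yields a value or the timeout elapses.
// Resolves with the first non-undefined result; rejects on timeout.
async function pollUntil<T>(
  check: () => Promise<T | undefined>,
  timeoutMs: number,
  intervalMs: number
): Promise<T> {
  const deadline = Date.now() + timeoutMs;
  while (true) {
    const result = await check();
    if (result !== undefined) {
      return result;
    }
    if (Date.now() >= deadline) {
      throw new Error(`Condition not met within ${timeoutMs} ms`);
    }
    // Wait before the next attempt instead of sleeping for one fixed period.
    await new Promise<void>((resolve) => setTimeout(resolve, intervalMs));
  }
}
```

Compared with a fixed sleep, the test finishes as soon as the condition holds, and raising the timeout only affects the failing case.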
Instead of reimplementing OSF upload logic, the recovery function now
promotes orphaned pending-data/ files into the existing uploadQueue
system. This means recovered data immediately appears in the
researcher's dashboard QueuePanel and follows the same retry/download
lifecycle as normal upload failures.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
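Promotion into the queue can be pictured as follows. This is a sketch under assumptions: `OrphanFile`, `RecoveredEntry`, `promoteToQueue`, and the `reason` text are hypothetical, and a `Map` keyed by storage path stands in for the `uploadQueue` collection.

```typescript
interface OrphanFile {
  path: string;
  experimentId: string;
}

interface RecoveredEntry {
  storagePath: string;
  experimentId: string;
  status: "pending";
  reason: string;
}

// Adds a queue entry for an orphaned file unless one already exists.
// Returns true when a new entry was created.
function promoteToQueue(
  queue: Map<string, RecoveredEntry>,
  file: OrphanFile
): boolean {
  if (queue.has(file.path)) {
    return false; // already promoted by an earlier recovery run
  }
  queue.set(file.path, {
    storagePath: file.path,
    experimentId: file.experimentId,
    status: "pending",
    reason: "recovered after an interrupted upload", // illustrative text
  });
  return true;
}
```

In a real database the check and the write would need to happen atomically; the in-memory `has`/`set` pair here only illustrates the intent.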
- Expand "Why am I seeing this?" to cover OOM/crash recoveries
  alongside OSF errors and config issues
- Map raw failure reasons to plain-language descriptions so
  researchers understand what happened without technical jargon

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
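One simple way to do this kind of mapping is a lookup table with a generic fallback. The reason codes and wording below are invented for illustration and are not the strings used in this PR.

```typescript
// Hypothetical raw reason codes mapped to researcher-facing text.
const REASON_TEXT = new Map<string, string>([
  ["OSF_ERROR", "OSF did not accept the upload. It will be retried automatically."],
  ["INVALID_CONFIG", "The experiment's OSF settings need attention."],
  ["CRASH_RECOVERY", "The upload was interrupted and the data was recovered."],
]);

const FALLBACK_REASON = "The upload could not be completed.";

// Returns a plain-language description, falling back to a generic message
// for codes the table does not know about.
function describeReason(rawReason: string): string {
  return REASON_TEXT.get(rawReason) ?? FALLBACK_REASON;
}
```

Keeping the fallback generic means a new backend failure reason never surfaces as raw technical text in the dashboard.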
- Pending uploads are no longer shown in the alert panel. A light
  text indicator near the header badges shows retry count and next
  retry time instead — no alarm for things the system handles.
- The full alert panel (FailedUploadsPanel) only appears when uploads
  have exhausted all retries and the researcher needs to download.
- Failure reasons get their own REASON column instead of tiny text
  under the filename.
- Replace ATTEMPTS column with AUTO-CLEANUP (time until data expires).
- Add UploadsResolvedNotice: brief success confirmation when all
  queued uploads complete, so the panel doesn't just vanish.
- Remove error log mixing from the queue table (ErrorPanel handles
  those separately).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Researchers should be able to see and download queued data files
as soon as they appear, not after 30 hours of retries. The panel
now shows all entries (pending + failed) in a single table with:
- STATUS column with badge and next retry time for pending items
- REASON column with human-readable failure explanation
- STORED FOR column showing time until auto-cleanup
- Download button available immediately for every entry

The panel uses warning tone for pending items (retries still
running) and error tone when all retries are exhausted.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
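The tone rule and the STORED FOR value could be derived along these lines. This is one reading of the description, not the component code: `PanelEntry`, `panelTone`, `formatStoredFor`, and the `expiresAtMs` field are hypothetical, and the sketch assumes the panel is a warning while any entry is still retrying.

```typescript
type PanelStatus = "pending" | "failed";

interface PanelEntry {
  status: PanelStatus;
  expiresAtMs: number; // when auto-cleanup removes the stored file
}

// Warning while any entry is still being retried; error once none are.
// Returns null when there is nothing to show.
function panelTone(entries: PanelEntry[]): "warning" | "error" | null {
  if (entries.length === 0) {
    return null;
  }
  return entries.some((entry) => entry.status === "pending") ? "warning" : "error";
}

// Formats the time left until auto-cleanup, e.g. "2h 5m" or "45m".
function formatStoredFor(expiresAtMs: number, nowMs: number): string {
  const remainingMin = Math.max(0, Math.floor((expiresAtMs - nowMs) / 60000));
  const hours = Math.floor(remainingMin / 60);
  const minutes = remainingMin % 60;
  return hours > 0 ? `${hours}h ${minutes}m` : `${minutes}m`;
}
```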
…ention

Persist data to Cloud Storage before processing to prevent OOM data loss
- Clean up pending file on metadata failure path (api-data.ts)
- Add early-persist to apiBase64 for OOM crash protection (api-base64.ts)
- Use Firestore transaction for atomic deduplication in pending recovery
- Use random port (port 0) in early-persist test to avoid EADDRINUSE
- Improve DATA_PERSIST_ERROR message for live experiment context

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
jodeleeuw and others added 4 commits April 1, 2026 09:47
Logs process.memoryUsage() at four points during request processing:
- request-received: after body parsing, before any processing
- after-persist: after writing to Cloud Storage
- after-metadata: after metadata processing
- after-osf-upload: after successful OSF upload

Each log line includes data payload size, RSS, heap used/total, and
external memory. This will help determine what payload sizes approach
the 512MiB function memory limit.

This instrumentation is temporary — remove after testing.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
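A log line of the kind described could be built as in the sketch below. The four stage names and the logged fields come from the commit message; `MemorySnapshot`, `formatMemoryLog`, and the output format are assumptions. The snapshot is passed in as a parameter so the formatter stays independent of the runtime.

```typescript
// Subset of the fields returned by Node's process.memoryUsage(), in bytes.
interface MemorySnapshot {
  rss: number;
  heapUsed: number;
  heapTotal: number;
  external: number;
}

type MemoryStage =
  | "request-received"
  | "after-persist"
  | "after-metadata"
  | "after-osf-upload";

// Converts bytes to MiB with one decimal place.
const toMiB = (bytes: number): string => (bytes / (1024 * 1024)).toFixed(1);

// Builds one log line with payload size and memory figures in MiB.
function formatMemoryLog(
  stage: MemoryStage,
  payloadBytes: number,
  mem: MemorySnapshot
): string {
  return (
    `[memory] ${stage} payload=${toMiB(payloadBytes)}MiB ` +
    `rss=${toMiB(mem.rss)}MiB heapUsed=${toMiB(mem.heapUsed)}MiB ` +
    `heapTotal=${toMiB(mem.heapTotal)}MiB external=${toMiB(mem.external)}MiB`
  );
}
```

In a Node.js function this would be called as `formatMemoryLog("after-persist", size, process.memoryUsage())` at each of the four points.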
Ensures each Cloud Function instance handles only one request at a
time. This eliminates the risk of concurrent large payloads sharing
memory and pushing past the 512MiB limit. The tradeoff (more cold
starts under burst traffic) is negligible for DataPipe's usage pattern.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The logMemory instrumentation was added to measure OOM thresholds
during testing. Results confirmed 512MiB with concurrency:1 is safe
for all payloads up to the 32MB Cloud Run limit. Removing before
merge to main.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The log increment tests used waitForLog() which polled Firestore in a
loop for up to 30s. Under CI load with parallel test files, the
combined time for two requests + two polling cycles often exceeded the
30s jest timeout, causing flaky failures.

Since writeLog() is awaited inside apiData before the response is sent,
the log document is guaranteed to exist by the time saveData() returns.
Replace the polling with a simple direct read after a small delay,
and remove the now-unused waitForLog helper.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@jodeleeuw jodeleeuw merged commit 775b299 into main Apr 1, 2026
3 checks passed

1 participant