Skip to content

fix(sandbox): rewrite credential placeholders in websocket text frames#1286

Draft
ericksoa wants to merge 14 commits intoNVIDIA:mainfrom
ericksoa:fix/872-websocket-credential-rewrite
Draft

fix(sandbox): rewrite credential placeholders in websocket text frames#1286
ericksoa wants to merge 14 commits intoNVIDIA:mainfrom
ericksoa:fix/872-websocket-credential-rewrite

Conversation

@ericksoa
Copy link
Copy Markdown

@ericksoa ericksoa commented May 9, 2026

Summary

Fixes #872 by adding an opt-in, policy-controlled WebSocket credential placeholder rewrite path after an allowed RFC 6455 upgrade. Provider secrets stay out of sandbox env, argv, config files, logs, and sandbox-generated payloads until OpenShell reaches the relay boundary; opted-in endpoints then resolve openshell:resolve:env:* placeholders in client-to-server UTF-8 WebSocket text frames before forwarding upstream.

The implementation follows common WebSocket proxy hardening patterns while keeping the scope specific to OpenShell: validate the upgrade, parse frames strictly, support only a narrow safe compression shape, fail closed on ambiguous protocol states, preserve raw opt-out behavior, and use structured GraphQL operation policy for GraphQL-over-WebSocket instead of generic payload matching.

Related Issue

Fixes #872

Reviewer Map

  • Policy/schema/CLI surface: proto/sandbox.proto, crates/openshell-policy, crates/openshell-cli, crates/openshell-providers, crates/openshell-server, and policy docs.
  • WebSocket protocol enforcement and credential rewrite: crates/openshell-sandbox/src/l7/websocket.rs, relay.rs, rest.rs, proxy.rs, and secrets.rs.
  • GraphQL-over-WebSocket classification: crates/openshell-sandbox/src/l7/graphql.rs, websocket.rs, mod.rs, opa.rs, and sandbox-policy.rego.
  • Regression and conformance coverage: crates/openshell-sandbox tests, e2e/rust/tests/websocket_conformance.rs, tasks/test.toml, and .github/workflows/websocket-conformance.yml.

Changes

  • Add websocket_credential_rewrite as an explicit endpoint policy field, defaulting to false.
  • Support the rewrite option on protocol: websocket endpoints and REST compatibility endpoints that perform a WebSocket upgrade.
  • Wire the field through proto conversion, YAML serde, provider profiles, incremental merge behavior, OPA JSON conversion, L7 config parsing, CLI summaries, and docs.
  • Add incremental policy update support for WebSocket endpoints, WebSocket GET / WEBSOCKET_TEXT rules, and the websocket-credential-rewrite endpoint option.
  • Preserve the existing GraphQL authoring model: GraphQL semantic rules are expressed in full policy YAML, matching the current GraphQL-over-HTTP pattern rather than adding one-off CLI sugar in this PR.
  • Validate upstream handshake responses before forwarding 101 Switching Protocols, including status, Upgrade, Connection, Sec-WebSocket-Accept, and extension negotiation against the sanitized client offer.
  • Implement RFC 6455 client-frame enforcement for masking, opcodes, fragmentation, close frames, UTF-8 text, bounded buffering, and fail-closed malformed states.
  • Implement the safe RFC 7692 permessage-deflate subset with no-context-takeover constraints; unsupported parameters or mismatched upstream responses fail closed.
  • Rewrite only client-to-server UTF-8 text frames containing openshell:resolve:env:* placeholders. Binary frames and server-to-client frames are relayed without rewriting.
  • Add GraphQL-over-WebSocket operation policy for modern graphql-transport-ws subscribe messages and legacy graphql-ws / subscriptions-transport-ws start messages.
  • Apply credential rewrite before GraphQL-over-WebSocket classification so connection payloads and operation messages see the same relay-boundary secret resolution behavior.
  • Fail closed on malformed GraphQL WebSocket JSON, unsupported operation message types, parse errors, unregistered hash-only persisted queries, and denied operations.
  • Keep WebSocket protocol lifecycle messages such as connection_init, connection_terminate, ping, pong, complete, and stop out of payload logs.
  • Preserve raw opt-out WebSocket behavior, including forward-proxy upgrade requests and endpoints that do not opt into credential rewrite or GraphQL operation policy.
  • Add focused tests across CONNECT L7, route-selected L7, forward-proxy behavior, raw opt-out behavior, credential replacement, compression negotiation, close/fragmentation/UTF-8/frame validation, GraphQL-over-WebSocket policy, and no-secret log assertions.
  • Add a Docker-backed WebSocket conformance e2e lane that proves provider-managed placeholders are resolved on the real sandbox path without echoing payloads, placeholders, or secrets into test output.

Security and Spec Notes

  • The feature is opt-in per endpoint and keeps the existing no-secret-in-sandbox model intact.
  • Ambiguous protocol states fail closed before OpenShell forwards the upstream upgrade response to the sandbox client.
  • OCSF rewrite events report safe metadata and replacement counts only; they do not log payloads, placeholders, secret values, or secret lengths.
  • subscription is supported as an explicit GraphQL operation type. It is not silently added to read-only or read-write; that matches the existing GraphQL preset model where read-only = query, read-write = query + mutation, and full = *.
  • Selective credential passthrough is intentionally not implemented here.
  • Generic WebSocket payload-content policy matching is intentionally not implemented here; GraphQL-over-WebSocket is handled through structured GraphQL operation semantics.
  • Binary frame rewriting is intentionally not implemented here.
  • Unsafe compression variants are intentionally not supported.

Testing

  • mise run pre-commit passes
  • Unit tests added/updated
  • E2E tests added/updated

Focused validation run on the current pushed head:

  • cargo fmt
  • cargo test -p openshell-sandbox websocket
  • cargo test -p openshell-sandbox graphql
  • cargo test -p openshell-sandbox secrets
  • cargo test -p openshell-policy
  • cargo clippy -p openshell-sandbox -p openshell-policy --all-targets -- -D warnings
  • mise run pre-commit

Autobahn was not run because the repo does not currently include an Autobahn harness. This PR adds the closest practical focused regression coverage in-tree plus a dedicated WebSocket conformance e2e lane; the workflow is manual today with a note to add schedule: after it burns in.

Remaining Maintainer Decisions

  • Whether to add first-class CLI sugar for GraphQL policy authoring later. If added, it should cover GraphQL-over-HTTP and GraphQL-over-WebSocket together.
  • Whether to turn .github/workflows/websocket-conformance.yml into scheduled nightly coverage after manual burn-in.
  • Whether OpenShell should introduce a dedicated realtime/subscription access preset later. This PR keeps subscriptions explicit unless access: full is used.

Checklist

  • Follows Conventional Commits
  • Commits are signed off (DCO)
  • Architecture docs updated (if applicable)

Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
@copy-pr-bot
Copy link
Copy Markdown

copy-pr-bot Bot commented May 9, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@github-actions
Copy link
Copy Markdown

github-actions Bot commented May 9, 2026

All contributors have signed the DCO ✍️ ✅
Posted by the DCO Assistant Lite bot.

ericksoa added 4 commits May 8, 2026 19:58
Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
@ericksoa
Copy link
Copy Markdown
Author

ericksoa commented May 9, 2026

WebSocket implementation alignment note

This PR follows common hardening patterns used by established WebSocket clients, servers, and proxy implementations, while keeping the implementation tailored to OpenShell's credential-boundary model.

The important alignment points are behavioral:

  • validate the HTTP upgrade response before treating the connection as a WebSocket;
  • track the original client handshake offer and reject upstream responses that select options the relay did not safely offer;
  • parse frames strictly, including masking, reserved bits, opcodes, fragmentation, UTF-8 text, close frames, and bounded message sizes;
  • keep compression support intentionally narrow, deterministic, and fail-closed;
  • avoid payload logging and expose only safe metadata for observability;
  • preserve raw pass-through behavior when an endpoint has not opted into credential rewrite or semantic message policy;
  • use structured GraphQL operation policy for GraphQL-over-WebSocket instead of generic payload-content matching.

This matters because OpenShell is not trying to become a general WebSocket application framework. The purpose here is narrower: preserve the no-secret-in-sandbox invariant, then resolve placeholders only at the OpenShell relay boundary for opted-in client-to-server text frames. The safest shape is therefore a small, explicit protocol surface: if the relay can validate and process the negotiated WebSocket state, it proceeds; if not, it closes before forwarding an unsafe upgraded flow.

The remaining confidence-building work is conformance and durability, not feature expansion. The follow-up path is to keep the in-tree RFC regression matrix, maintain the Docker-backed WebSocket conformance e2e lane, and consider an Autobahn-style relay harness or parser fuzzing after this PR lands.

Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
@ericksoa
Copy link
Copy Markdown
Author

ericksoa commented May 9, 2026

Follow-up hardening pass

This pass applies common WebSocket proxy hardening patterns to the OpenShell relay: parse negotiation as structured protocol data, validate server-selected options against the original client offer, canonicalize only the supported safe extension forms, and fail closed when the rewrite path sees parameters it does not explicitly support.

Concretely, fix(sandbox): harden websocket negotiation parsing adds RFC 6455 subprotocol tracking and response validation, replaces ad hoc Sec-WebSocket-Extensions splitting with token-aware parsing, canonicalizes only the safe permessage-deflate no-context-takeover subset, rejects duplicate/value-bearing/quoted unsafe extension parameters in the rewrite path, and keeps WebSocketExtensionMode::Preserve raw so opt-out behavior does not change.

This matters for credential rewriting because the proxy becomes a protocol participant after it mutates an upgraded flow. Loose negotiation can let the upstream select compression state or subprotocol behavior the relay did not offer or understand, which creates room for state confusion, fail-open compression handling, or secret-bearing payloads moving through a path that no longer has the invariants the rewriter depends on. The new behavior keeps the supported surface narrow and deterministic: either the relay can validate and safely process the negotiated shape, or the handshake fails before forwarding the 101 to the client.

Remaining conformance work is still useful, especially an Autobahn-style relay harness and broader fuzz/property coverage for handshake parsing and frame sequencing. Those should come after this focused spec pass so future parser or engine swaps have a stable regression matrix to preserve.

@ericksoa
Copy link
Copy Markdown
Author

ericksoa commented May 9, 2026

I have read the DCO document and I hereby sign the DCO.

@ericksoa
Copy link
Copy Markdown
Author

ericksoa commented May 9, 2026

I have read the Contributor Agreement including DCO and I hereby sign the Contributor Agreement and DCO

Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
@ericksoa
Copy link
Copy Markdown
Author

ericksoa commented May 9, 2026

Focused conformance matrix added

I added test(sandbox): add websocket conformance relay matrix as the practical next step before a full Autobahn lane. The new in-crate harness drives the real HTTP 101 relay path and then sends post-upgrade frames through three representative modes:

  • raw preserve mode, proving opt-out behavior still relays frames without mask validation or rewriting;
  • rewrite mode without negotiated compression, proving a masked text credential is rewritten only after a validated upgrade;
  • rewrite mode with the safe permessage-deflate subset, proving canonical extension negotiation carries through to compressed text rewriting with RSV1 preserved.

This is not a replacement for Autobahn. It is the closest low-cost regression layer for this PR because it covers the specific OpenShell behavior boundary: once the proxy becomes an active WebSocket participant for credential rewriting, the handshake assumptions, compression negotiation, and post-upgrade frame relay need to be tested together rather than only as isolated parser/frame unit tests. Full Autobahn remains useful as a follow-up CI or manual conformance job, but it would add Docker/runtime/tooling weight beyond this focused PR pass.

ericksoa added 2 commits May 8, 2026 22:04
Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
@ericksoa
Copy link
Copy Markdown
Author

ericksoa commented May 9, 2026

Progress update: policy configuration now treats WebSocket as a first-class inspected protocol in the incremental update flow.

What changed:

  • openshell policy update --add-endpoint host:port:read-write:websocket:enforce is now accepted, so users do not have to hand-edit YAML just to create a WebSocket L7 endpoint.
  • --add-allow and --add-deny can target existing protocol: websocket endpoints using the existing method/path schema. For WebSocket, GET means the RFC 6455 upgrade and WEBSOCKET_TEXT means client text messages on the upgraded request path; this deliberately does not add payload-content matching.
  • Policy merge access expansion is now protocol-aware: WebSocket read-only expands to GET, WebSocket read-write expands to GET plus WEBSOCKET_TEXT, and WebSocket presets no longer expand into REST write methods.
  • CLI help and policy/security docs now document the WebSocket workflow, the WEBSOCKET_TEXT method, and the current rule-level constraints.

Why it matters:
This makes the WebSocket credential-rewrite capability operationally usable through the same policy-update workflow as HTTP. A maintainer can start with a minimal endpoint, audit denials, then add specific WebSocket handshake/text-frame rules without replacing the full policy. It also keeps the schema natural: transport selection stays on protocol, permissions stay in method/path rules, and WebSocket text-frame policy remains explicitly separate from message payload inspection.

Validation:

  • cargo fmt
  • cargo test -p openshell-cli policy_update
  • cargo test -p openshell-policy merge
  • cargo test -p openshell-sandbox websocket
  • cargo test -p openshell-sandbox secrets
  • cargo test -p openshell-policy
  • cargo clippy -p openshell-cli -p openshell-sandbox -p openshell-policy --all-targets -- -D warnings
  • mise run pre-commit

ericksoa added 2 commits May 8, 2026 22:23
…dential-rewrite

Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
@ericksoa
Copy link
Copy Markdown
Author

ericksoa commented May 9, 2026

Progress update: WebSocket credential replacement is now configurable from the incremental policy workflow and covered through the route-selected relay path.

What changed:

  • Added openshell policy update --websocket-credential-rewrite for --add-endpoint operations. This lets users create or merge a protocol: websocket endpoint and enable text-frame credential placeholder replacement without hand-editing YAML.
  • The flag is intentionally constrained: it is accepted only for added protocol: websocket endpoints or protocol: rest compatibility endpoints that may upgrade to WebSocket. It is rejected for plain L4 endpoints and protocol: sql.
  • Merge behavior now preserves/merges websocket_credential_rewrite: true when an incoming endpoint is merged into an existing endpoint.
  • Gateway/server-side merge summaries now surface websocket_credential_rewrite=true so audit/progress output records when this sensitive behavior is enabled, without logging secret material.
  • Added a route-selected WebSocket regression proving the main value path: sandbox sends an openshell:resolve:env:* placeholder in a masked client text frame, OpenShell resolves it after a valid 101 upgrade, upstream receives the real credential, and the forwarded client-to-server frame remains masked.
  • Updated the policy docs and CLI examples to show the WebSocket credential rewrite workflow.

Why it matters:
The point of this capability is not only to parse WebSockets; it is to let agents call real WebSocket services while keeping service credentials out of sandbox-authored code and logs. This makes that operational path first-class: policy can allow the upgrade and client text messages, then explicitly opt in to resolving OpenShell placeholders at the relay boundary. The agent still sends placeholders, OpenShell injects the credential only at egress time, binary frames remain untouched, and no payload-content policy matching was added.

Validation:

  • cargo fmt
  • cargo test -p openshell-cli policy_update
  • cargo test -p openshell-policy merge
  • cargo test -p openshell-server summarize_cli_policy_merge_op_formats_websocket_credential_rewrite
  • cargo test -p openshell-sandbox route_selected_websocket_rewrites_text_credentials_after_upgrade
  • cargo test -p openshell-sandbox websocket
  • cargo test -p openshell-sandbox secrets
  • cargo test -p openshell-policy
  • cargo clippy -p openshell-cli -p openshell-sandbox -p openshell-policy -p openshell-server --all-targets -- -D warnings
  • mise run pre-commit

Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
@ericksoa
Copy link
Copy Markdown
Author

ericksoa commented May 9, 2026

Follow-up update after the policy CLI shape review:

  • Kept the WebSocket conformance workflow manual-only. The workflow remains workflow_dispatch without a schedule.
  • Changed incremental policy update syntax so WebSocket credential rewrite follows the existing endpoint-local pattern. The new CLI form is:
openshell policy update demo \
  --add-endpoint realtime.example.com:443:read-write:websocket:enforce:websocket-credential-rewrite \
  --binary /usr/bin/node

Rationale: --binary is command-wide because binaries live on the network rule and already apply to each endpoint added in the same command. protocol, access, enforcement, and websocket_credential_rewrite are endpoint fields, so rewrite should be expressed on the --add-endpoint spec instead of through a broad flag that accidentally applies to every endpoint in the batch. This keeps the WebSocket extension aligned with the existing REST/WebSocket endpoint schema and makes mixed batches safer to review.

Message-content policy would buy a different class of control than this PR currently implements: it would let OpenShell authorize WebSocket messages by application intent inside the text payload, such as event type, channel, action, model/tool name, or service-specific JSON shape, instead of only by host/port, upgrade path, and WEBSOCKET_TEXT. That would be valuable for production SaaS/realtime APIs where the same WebSocket path carries both safe reads and state-changing operations. I am leaving that out of this PR because it is a larger product/schema decision: WebSocket payloads are not standardized, parsers must be protocol-specific and opt-in, and any denial/logging path must preserve the current no-payload/no-secret/no-length-leak guarantees.

Validation after this update:

  • cargo fmt
  • cargo test -p openshell-cli policy_update
  • cargo test -p openshell-sandbox websocket
  • cargo test -p openshell-sandbox secrets
  • cargo test -p openshell-policy
  • cargo clippy -p openshell-cli -p openshell-sandbox -p openshell-policy --all-targets -- -D warnings
  • mise run pre-commit

All passed locally.

Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
@ericksoa
Copy link
Copy Markdown
Author

ericksoa commented May 9, 2026

Progress update: GraphQL-over-WebSocket semantic policy pass

Implemented in 55f52e73b0194aa68330c224429efb1f919bc142:

  • Added GraphQL-over-WebSocket operation inspection for protocol: websocket endpoints that contain GraphQL operation policy.
  • Reused the existing GraphQL classifier and OPA operation rules for client operation messages instead of adding generic JSON payload matching.
  • Supports modern graphql-transport-ws client subscribe messages and legacy graphql-ws / subscriptions-transport-ws start messages.
  • Treats connection_init, connection_terminate, ping, pong, complete, and stop as GraphQL WebSocket control-plane messages, without payload logging.
  • Keeps raw WebSocket behavior unchanged for endpoints without GraphQL operation policy.
  • Keeps credential rewrite before message classification, including connection_init auth payloads, so placeholder credentials can be resolved before upstream forwarding.
  • Added fail-closed handling for malformed JSON, missing operation ids/payloads, unsupported client message types, GraphQL parse errors, unregistered hash-only persisted queries, and unallowed operations.
  • Made GraphQL policy authoritative for WebSocket operation messages so a generic WEBSOCKET_TEXT rule cannot bypass operation_type / operation_name / fields constraints.
  • Added validation errors for mixed WebSocket transport + GraphQL operation rule fields and for raw WEBSOCKET_TEXT message rules on WebSocket endpoints that use GraphQL operation policy.
  • Added OCSF/log summary assertions so GraphQL-over-WebSocket logging includes only operation metadata and does not include payloads, variables, placeholders, secrets, or secret lengths.
  • Updated policy docs and security guidance for GraphQL-over-WebSocket policy shape.
  • Commit includes DCO signoff.

Validation run:

  • cargo fmt
  • cargo test -p openshell-sandbox websocket
  • cargo test -p openshell-sandbox graphql
  • cargo test -p openshell-sandbox secrets
  • cargo test -p openshell-policy
  • cargo clippy -p openshell-sandbox -p openshell-policy --all-targets -- -D warnings
  • mise run pre-commit

Maintainer decisions still explicit:

  • I kept this as semantic GraphQL-over-WebSocket policy, not generic message-content policy.
  • I did not add selective credential passthrough.
  • I did not rewrite binary frames.
  • I did not add unsafe compression variants.
  • I did not mark the draft PR ready for review.

…dential-rewrite

Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

bug: Secrets rewriting is not applied to websocket traffic.

1 participant