feat(rest): add scan plan endpoint support to REST catalog client#783

Draft
gsandeep1241 wants to merge 1 commit into
apache:mainfrom
gsandeep1241:sandeepg-scan-plan-endpoint-for-rest-catalog-client-2-impl

@gsandeep1241

Contributor

When a table is loaded from a REST catalog that advertises the PlanTableScan endpoint, NewScan() now returns a RestTableScanBuilder whose Build() produces a RestTableScan. PlanFiles() on that scan delegates manifest resolution to the server via POST /plan, GET /plan/{id} (with exponential backoff), POST /tasks/{id}, and DELETE /plan/{id} (best-effort cancel), instead of reading manifests locally.

  • Add RestTable, RestTableScanBuilder, RestTableScan and RestScanContext
  • Promote DataTableScan::PlanFiles and TableScanBuilder::Build to virtual
  • Convert RestCatalog::client_ and paths_ to shared_ptr so RestScanContext can share ownership with live scans
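
The polling step (GET /plan/{id} with exponential backoff) could be sketched roughly as below. This is an illustration only, not the code in this PR: `BackoffPolicy`, `DelayFor`, `PollUntilDone`, and the `PollStatus` enum are hypothetical names, and the default delays are assumed values. The sleep function is injected so the loop can be exercised without real waiting.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <functional>
#include <optional>

// Hypothetical plan states for this sketch; the PR's PlanStatus may differ.
enum class PollStatus { kSubmitted, kCompleted, kFailed, kCancelled };

// Assumed defaults, chosen only for illustration.
struct BackoffPolicy {
  std::chrono::milliseconds initial{100};
  std::chrono::milliseconds max{5000};
  double multiplier = 2.0;
  int max_attempts = 10;
};

// Delay before retrying after the given zero-based attempt:
// initial * multiplier^attempt, capped at policy.max.
inline std::chrono::milliseconds DelayFor(const BackoffPolicy& policy, int attempt) {
  double ms = static_cast<double>(policy.initial.count());
  const double cap = static_cast<double>(policy.max.count());
  for (int i = 0; i < attempt && ms < cap; ++i) ms *= policy.multiplier;
  return std::chrono::milliseconds(static_cast<long long>(std::min(ms, cap)));
}

// Calls `fetch` until the plan leaves kSubmitted or attempts run out.
// Returns std::nullopt if the plan is still pending after max_attempts,
// in which case the caller would be expected to cancel the plan.
inline std::optional<PollStatus> PollUntilDone(
    const BackoffPolicy& policy, const std::function<PollStatus()>& fetch,
    const std::function<void(std::chrono::milliseconds)>& sleep) {
  assert(policy.max_attempts > 0);
  for (int attempt = 0; attempt < policy.max_attempts; ++attempt) {
    const PollStatus status = fetch();
    if (status != PollStatus::kSubmitted) return status;
    sleep(DelayFor(policy, attempt));
  }
  return std::nullopt;
}
```

In a real client, `fetch` would wrap the GET /plan/{id} call and `sleep` would wrap `std::this_thread::sleep_for`.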

auto table_catalog = std::make_shared<TableScopedCatalog>(
    shared_from_this(), context, identifier, table_config, table_session);

if (supported_endpoints_.contains(Endpoint::PlanTableScan())) {

Member

This should also gate on the effective scan-planning-mode, not only on endpoint support. Java defaults to client-side planning and lets the table config override the client config. Without that check, a table can be forced into REST planning even when the server says client, or can silently fall back to client-side planning when the server says server but the endpoint is missing.
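
A minimal sketch of the gating described above, assuming the mode arrives as a `scan-planning-mode` property with values `client` and `server`. The `Properties` alias, `ParseMode`, `Decide`, and `PlanningDecision` are hypothetical names for illustration, and treating "server requested but endpoint missing" as an explicit outcome is one possible choice rather than settled behavior.

```cpp
#include <map>
#include <optional>
#include <string>

enum class PlanningMode { kClient, kServer };
enum class PlanningDecision { kLocal, kRest, kUnsupported };

// Assumed property key, taken from the wording of the review comment.
inline constexpr const char* kScanPlanningMode = "scan-planning-mode";

using Properties = std::map<std::string, std::string>;

// Returns the mode set in `props`, or nullopt if absent or unrecognized.
inline std::optional<PlanningMode> ParseMode(const Properties& props) {
  auto it = props.find(kScanPlanningMode);
  if (it == props.end()) return std::nullopt;
  if (it->second == "server") return PlanningMode::kServer;
  if (it->second == "client") return PlanningMode::kClient;
  return std::nullopt;  // unknown values are ignored in this sketch
}

// Table config overrides client config; the default is client-side planning.
inline PlanningDecision Decide(const Properties& client_config,
                               const Properties& table_config,
                               bool endpoint_supported) {
  const PlanningMode fallback =
      ParseMode(client_config).value_or(PlanningMode::kClient);
  const PlanningMode mode = ParseMode(table_config).value_or(fallback);
  if (mode == PlanningMode::kClient) return PlanningDecision::kLocal;
  // Server planning was requested: surface a missing endpoint instead of
  // silently planning locally.
  return endpoint_supported ? PlanningDecision::kRest
                            : PlanningDecision::kUnsupported;
}
```

With this shape, the existing `supported_endpoints_.contains(...)` check becomes one input to the decision rather than the whole decision.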

request.case_sensitive = context_.case_sensitive;
request.min_rows_requested = context_.min_rows_requested;

if (context_.from_snapshot_id.has_value() && context_.to_snapshot_id.has_value()) {

Member

We need to set use-snapshot-schema for snapshot/time-travel and incremental scans. Java sends it for useSnapshot and start/end snapshot scans, and the REST spec says time travel should use the snapshot schema. Without it, schema-evolved tables can be planned against the current schema.
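
One way the request population might look with that flag set, as a sketch under assumptions: `ScanContextSketch` and `PlanRequestSketch` are hypothetical stand-ins for the PR's context and request types, and the field names only mirror the REST request fields (`snapshot-id`, `start-snapshot-id`, `end-snapshot-id`, `use-snapshot-schema`).

```cpp
#include <cstdint>
#include <optional>

// Hypothetical stand-in for the scan context in this PR.
struct ScanContextSketch {
  std::optional<int64_t> snapshot_id;
  std::optional<int64_t> from_snapshot_id;
  std::optional<int64_t> to_snapshot_id;
};

// Hypothetical stand-in for the plan request body.
struct PlanRequestSketch {
  std::optional<int64_t> snapshot_id;
  std::optional<int64_t> start_snapshot_id;
  std::optional<int64_t> end_snapshot_id;
  bool use_snapshot_schema = false;
};

inline PlanRequestSketch BuildRequest(const ScanContextSketch& ctx) {
  PlanRequestSketch request;
  if (ctx.from_snapshot_id.has_value() && ctx.to_snapshot_id.has_value()) {
    // Incremental scan between two snapshots.
    request.start_snapshot_id = ctx.from_snapshot_id;
    request.end_snapshot_id = ctx.to_snapshot_id;
    request.use_snapshot_schema = true;
  } else if (ctx.snapshot_id.has_value()) {
    // Time travel to a specific snapshot.
    request.snapshot_id = ctx.snapshot_id;
    request.use_snapshot_schema = true;
  }
  // Otherwise: scan of the current snapshot, planned with the current schema.
  return request;
}
```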

rest_context_.client->Post(path, json_request, /*headers=*/{},
                           *PlanErrorHandler::Instance(), *rest_context_.session));
ICEBERG_ASSIGN_OR_RAISE(auto json, FromJsonString(response.body()));
ICEBERG_ASSIGN_OR_RAISE(auto result,

Member

Planning responses can include storage-credentials. Java switches to a scan-scoped FileIO built from those credentials, and the spec expects clients to use them for the returned tasks. Ignoring them means servers that vend temporary storage credentials can plan successfully but reads may fail.
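
As a rough illustration of consuming those credentials, the sketch below picks the credential whose prefix is the longest match for a file location. It assumes each credential carries a `prefix` and a `config` map, and that the longest-prefix rule applies to plan responses too. `StorageCredential` and `SelectCredential` are hypothetical names, and building a scan-scoped FileIO from the selected config is left out.

```cpp
#include <map>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical shape of one entry in a storage-credentials list.
struct StorageCredential {
  std::string prefix;
  std::map<std::string, std::string> config;
};

// Returns the credential with the longest prefix matching `location`,
// or nullptr when none applies (the caller would keep the table's FileIO).
inline const StorageCredential* SelectCredential(
    const std::vector<StorageCredential>& credentials,
    std::string_view location) {
  const StorageCredential* best = nullptr;
  for (const auto& credential : credentials) {
    const std::string_view prefix(credential.prefix);
    const bool matches = location.substr(0, prefix.size()) == prefix;
    if (matches &&
        (best == nullptr || prefix.size() > best->prefix.size())) {
      best = &credential;
    }
  }
  return best;
}
```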


switch (result.plan_status) {
  case PlanStatus::kCompleted:
    return ResolveScanTasks(result.plan_tasks, result.file_scan_tasks, specs);

Member

Once a plan-id is returned, the server may hold resources until all plan tasks are fetched or the plan is cancelled. If resolving paginated tasks fails partway through, this returns without cancelling the remaining plan; Java cancels from the scan-task iterable cleanup path.
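
One possible shape for that cleanup is a small RAII guard that issues the best-effort cancel on any early return and is dismissed once every plan task has been fetched. This is a sketch only: `PlanCancelGuard` and `FetchAllPages` are hypothetical names, and the callbacks stand in for the POST /tasks and DELETE /plan/{id} calls.

```cpp
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Runs `cancel` on destruction unless Dismiss() was called first.
class PlanCancelGuard {
 public:
  explicit PlanCancelGuard(std::function<void()> cancel)
      : cancel_(std::move(cancel)) {}
  PlanCancelGuard(const PlanCancelGuard&) = delete;
  PlanCancelGuard& operator=(const PlanCancelGuard&) = delete;
  ~PlanCancelGuard() {
    if (cancel_) {
      try {
        cancel_();  // best-effort: a failed cancel must not mask the error
      } catch (...) {
      }
    }
  }
  void Dismiss() { cancel_ = nullptr; }

 private:
  std::function<void()> cancel_;
};

// Fetches every plan task; returns false on the first failure.
inline bool FetchAllPages(
    const std::vector<std::string>& plan_tasks,
    const std::function<bool(const std::string&)>& fetch_page,
    std::function<void()> cancel_plan) {
  PlanCancelGuard guard(std::move(cancel_plan));
  for (const auto& task : plan_tasks) {
    if (!fetch_page(task)) return false;  // guard cancels the plan here
  }
  guard.Dismiss();  // all tasks fetched, nothing left to cancel
  return true;
}
```

The same guard would also cover error paths that return through `ICEBERG_ASSIGN_OR_RAISE`, since those are ordinary early returns.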
