Docs: document filename convention and pipeline constraints for custom ingestion #2098

Open
RolandKrummenacher wants to merge 6 commits into dev from krummrol/docs-ingest-from-other-sources

@RolandKrummenacher
Collaborator

Summary

Closes #2096. Expands the "Ingest from other data sources" section of the Hub deploy tutorial with the details the Data Explorer ingestion pipeline needs to work correctly. The existing text could lead integrators into silent data loss when uploading custom FOCUS datasets (e.g., from GCP or AWS).

Changes to docs-mslearn/toolkit/hubs/deploy.md:

  • Document the <ingestionId>__<originalFileName>.parquet filename convention and explain how the pipeline derives ingestionId by splitting on __ (see Analytics/app.bicep:1806-1810).
  • Add an IMPORTANT callout that each upload must rewrite the full month — incremental deltas silently wipe existing extents because of the pre-ingest cleanup at Analytics/app.bicep:1352.
  • Add a TIP to filter empty parquet shards before upload (Data Explorer rejects them with BadRequest_NoRecordsOrWrongFormat and the pipeline retries 3× at 120s intervals, per retry: 3, retryIntervalInSeconds: 120 on the Ingest Data activity).
  • Correct the manifest instruction: the file must contain at minimum {} because the storage event trigger sets ignoreEmptyBlobs: true (fx/hub-eventTrigger.bicep:79) — a zero-byte file never triggers ingestion.
  • Update table/function references in step 4 from v1_0 to the current v1_2 schema (confirmed against HubSetup_Latest.kql).
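
For reviewers who want to see the filename convention from the first bullet in concrete terms, here is a minimal Python sketch. It is not taken from the pipeline: the helper names `build_blob_name` and `derive_ingestion_id` are hypothetical, and the PR text only says the pipeline splits on `__`, so the sketch assumes the ingestion ID is the text before the first `__` and rejects IDs that contain the separator.

```python
SEPARATOR = "__"


def build_blob_name(ingestion_id: str, original_file_name: str) -> str:
    """Return '<ingestionId>__<originalFileName>.parquet' (hypothetical helper)."""
    if not ingestion_id:
        raise ValueError("ingestion_id must not be empty")
    if SEPARATOR in ingestion_id:
        # Assumption: keeping the ID free of "__" makes any split on the
        # separator recover the same ID, whichever occurrence is used.
        raise ValueError("ingestion_id must not contain '__'")
    # Avoid doubling the extension if the caller already passed one.
    stem = original_file_name.removesuffix(".parquet")
    return f"{ingestion_id}{SEPARATOR}{stem}.parquet"


def derive_ingestion_id(blob_path: str) -> str:
    """Recover the ingestion ID from a blob path (assumed: text before the first '__')."""
    file_name = blob_path.rsplit("/", 1)[-1]
    ingestion_id, separator, _rest = file_name.partition(SEPARATOR)
    if not separator or not ingestion_id:
        raise ValueError(f"{file_name!r} does not follow <ingestionId>__<name>.parquet")
    return ingestion_id


name = build_blob_name("run-2026-04", "part-0001.parquet")
print(name)
print(derive_ingestion_id(name))
```

A file uploaded without the `__` separator would fail the check in `derive_ingestion_id`; how the real pipeline treats such a name is not covered by this sketch.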

No code changes.

Test plan

  • Preview the rendered page locally (docs-mslearn) and confirm the IMPORTANT/TIP callouts render correctly under the numbered list item.
  • Verify no broken links in the updated section via the existing doc-link test suite.

🤖 Generated with Claude Code

… ingestion (#2096)

Closes #2096. Expands the "Ingest from other data sources" section with the
details integrators need to avoid silent data loss: the ``__`` filename
convention used by the ingestion pipeline, full-month replacement requirement,
retry cost of empty parquet shards, corrected table versions (v1_2), and the
non-empty ``manifest.json`` requirement driven by ``ignoreEmptyBlobs`` on the
event trigger.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Roland Krummenacher and others added 2 commits April 16, 2026 11:25
…er month

Reword the IMPORTANT callout so it no longer contradicts the preceding
guidance about using ``dd``/``dd/hh`` subfolders for nonoverlapping deltas.
The pre-ingest cleanup operates on the folder path, so each delta folder is
independently replaced — the callout now reflects that.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The event trigger fires on manifest.json creation, after which the pipeline
waits 60 seconds and enumerates the folder. Parquet files that arrive after
that enumeration are skipped by the current run, so the manifest must be
uploaded last.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
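
The upload-side consequences of the commits above (skip empty shards, write a non-empty manifest, upload the manifest last) can be sketched in Python as follows. This is an illustration under stated assumptions, not the documented procedure: `plan_upload` is a hypothetical helper, row counts are passed in by the caller (they could be read with something like `pyarrow.parquet.ParquetFile(path).metadata.num_rows`), and returning an empty plan when every shard is empty is a design choice of the sketch.

```python
from typing import Mapping

MANIFEST_NAME = "manifest.json"
# Minimum non-empty manifest content; a zero-byte file would be ignored
# by the trigger because of ignoreEmptyBlobs: true.
MANIFEST_BODY = b"{}"


def plan_upload(row_counts: Mapping[str, int]) -> list[str]:
    """Return blob names in upload order: non-empty parquet shards, then the manifest.

    row_counts maps each parquet file name to its number of rows.
    """
    shards = sorted(
        name
        for name, rows in row_counts.items()
        if rows > 0 and name != MANIFEST_NAME
    )
    if not shards:
        # Assumption of this sketch: with nothing to ingest, do not upload a
        # manifest at all, so the pipeline is not triggered for an empty folder.
        return []
    # The manifest goes last so every shard is in place before the trigger fires.
    return shards + [MANIFEST_NAME]


plan = plan_upload({
    "run1__part-0.parquet": 120,
    "run1__part-1.parquet": 0,   # empty shard, dropped
    "run1__part-2.parquet": 45,
})
print(plan)
```

Dropping the zero-row shard up front avoids the retry cycle described in the PR summary (3 retries at 120-second intervals, so on the order of 360 seconds of waiting per empty file if every retry is used).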
@RolandKrummenacher
Collaborator Author

Cross-references for reviewer context:

Closing #2096 (this PR's parent issue); leaving #2057 and #2046 for maintainers to decide whether an additional code-side fix is wanted.

Roland Krummenacher and others added 2 commits April 16, 2026 16:59
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Labels

Needs: Review 👀 PR that is ready to be reviewed

Development

Successfully merging this pull request may close these issues.

Docs: 'Ingest from other data sources' missing critical filename/dedup details

3 participants