Description
Is there an existing issue for this?
- I have searched the existing issues
Problem statement
When recursiveFileLookup is enabled, the CDC Snapshot file source expects the {version} placeholder to appear only in the directory portion of the path. If {version} appears in the filename itself, the framework raises a hard error. This blocks ingestion of sources whose versioned files are nested within date-partitioned directories (e.g. Data/2025/10/TableName.{version}.parquet): each file is a distinct snapshot, and the version identifier is embedded in the filename rather than in the folder hierarchy.
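To make the distinction concrete, here is a minimal sketch of how a path pattern could be classified by where {version} sits. The helper name version_placement is hypothetical and is not taken from the framework; it only illustrates the two cases described above.

```python
import posixpath

PLACEHOLDER = "{version}"


def version_placement(path_pattern: str) -> str:
    """Report where the {version} placeholder sits in a path pattern.

    Hypothetical helper for illustration only; the framework's own
    check may be implemented differently.
    """
    directory, filename = posixpath.split(path_pattern)
    if PLACEHOLDER in filename:
        return "filename"
    if PLACEHOLDER in directory:
        return "directory"
    return "absent"


# The pattern from the problem statement has the placeholder in the filename,
# which is the case currently rejected when recursiveFileLookup is enabled.
print(version_placement("Data/2025/10/TableName.{version}.parquet"))
# A pattern with the placeholder in a folder name is the supported case.
print(version_placement("Data/{version}/TableName.parquet"))
```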
Proposed Solution
Add an allowVersionInFilename: bool = False configuration option to CDCSnapshotFileSource. When it is set to True alongside recursiveFileLookup, the framework permits {version} to appear in the filename portion of the path pattern and treats each matched file as a separate snapshot. This also requires adding a file_path attribute to VersionInfo: when a version is extracted from a filename during recursive lookup, the full file path is preserved, so _read_snapshot_dataframe can read that specific file directly instead of trying to reconstruct the path from the pattern. The existing hard error remains the default behaviour (allowVersionInFilename: False) to maintain backward compatibility, and the JSON schema is updated to expose the new option.
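A rough sketch of how the proposal might fit together follows. It is not the framework's actual code: the dataclass layouts, the path field name, and the helpers validate_source and versions_from_filenames are assumptions made for illustration, and the file listing is passed in as plain strings rather than coming from a real recursive lookup. Only allowVersionInFilename, recursiveFileLookup, VersionInfo and file_path come from the proposal itself.

```python
import posixpath
import re
from dataclasses import dataclass
from typing import List, Optional

PLACEHOLDER = "{version}"


@dataclass
class VersionInfo:
    version: str
    # Proposed attribute: full path of the file the version was extracted from.
    file_path: Optional[str] = None


@dataclass
class CDCSnapshotFileSource:
    path: str  # assumed field name for the path pattern
    recursiveFileLookup: bool = False
    allowVersionInFilename: bool = False  # proposed option, off by default


def validate_source(source: CDCSnapshotFileSource) -> None:
    """Keep the existing hard error unless the new option is enabled."""
    in_filename = PLACEHOLDER in posixpath.basename(source.path)
    if source.recursiveFileLookup and in_filename and not source.allowVersionInFilename:
        raise ValueError(
            "{version} in the filename requires allowVersionInFilename=True "
            "when recursiveFileLookup is enabled"
        )


def versions_from_filenames(
    source: CDCSnapshotFileSource, files: List[str]
) -> List[VersionInfo]:
    """Treat each file whose name matches the pattern as one snapshot."""
    validate_source(source)
    filename_pattern = posixpath.basename(source.path)
    if PLACEHOLDER not in filename_pattern:
        raise ValueError("this sketch only handles {version} in the filename")
    prefix, suffix = filename_pattern.split(PLACEHOLDER, 1)
    regex = re.compile(
        "^" + re.escape(prefix) + r"(?P<version>.+)" + re.escape(suffix) + "$"
    )
    found = []
    for file_path in files:
        match = regex.match(posixpath.basename(file_path))
        if match:
            # Preserve the full path so the reader can load this exact file.
            found.append(VersionInfo(match.group("version"), file_path))
    # Plain string ordering; a real implementation may parse versions first.
    return sorted(found, key=lambda info: info.version)
```

With this shape, a reader such as _read_snapshot_dataframe could load VersionInfo.file_path directly whenever it is set, and fall back to substituting the version into the pattern otherwise.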
Additional Context
No response