[FEATURE]: Allow Version in Filename for Recursive File Lookup #7

@daves-mantel

Is there an existing issue for this?

  • I have searched the existing issues

Problem statement

The CDC Snapshot file source expects the {version} placeholder to appear only in the directory path when recursiveFileLookup is enabled. If {version} appears in the filename itself, the framework raises a hard error. This blocks ingestion of source data where versioned files are nested within date-partitioned directories (e.g. Data/2025/10/TableName.{version}.parquet): each file represents a distinct snapshot, and the version identifier is embedded in the filename rather than in the folder hierarchy.
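
For illustration, the restriction described above might look roughly like the following. This is a minimal sketch, not the framework's actual code: the function names, the exception type (ValueError) and the error message are all assumptions made for the example.

```python
import posixpath

VERSION_PLACEHOLDER = "{version}"


def placeholder_in_filename(path_pattern: str) -> bool:
    """Return True if {version} sits in the last path segment (the filename)."""
    return VERSION_PLACEHOLDER in posixpath.basename(path_pattern)


def validate_pattern(path_pattern: str, recursive_file_lookup: bool) -> None:
    """Hypothetical stand-in for the check that currently rejects the pattern.

    Assumption: the restriction only applies when recursive lookup is enabled.
    """
    if recursive_file_lookup and placeholder_in_filename(path_pattern):
        raise ValueError(
            "{version} must appear in the directory path, not the filename, "
            "when recursiveFileLookup is enabled"
        )


# Version in a directory segment: accepted.
validate_pattern("Data/{version}/TableName.parquet", recursive_file_lookup=True)

# Version in the filename with recursive lookup: rejected today.
try:
    validate_pattern(
        "Data/2025/10/TableName.{version}.parquet", recursive_file_lookup=True
    )
except ValueError as exc:
    print(f"rejected: {exc}")
```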

Proposed Solution

Add an allowVersionInFilename: bool = False configuration option to CDCSnapshotFileSource. When set to True alongside recursiveFileLookup, the framework permits {version} to appear in the filename portion of the path pattern and treats each matched file as a separate snapshot. This also requires adding a file_path attribute to VersionInfo, so that when a version is extracted from a filename during recursive lookup the full file path is preserved. _read_snapshot_dataframe can then read that specific file directly rather than attempting to reconstruct the path from the pattern. The existing hard error remains the default behaviour (allowVersionInFilename = False) to maintain backward compatibility, and the JSON schema is updated to expose the new option.
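
A rough sketch of how the proposal could fit together is shown below. Only the names CDCSnapshotFileSource, VersionInfo, allowVersionInFilename, recursiveFileLookup and file_path come from this issue. The path field, the validate and discover_versions helpers, the regex-based extraction and the string-ordered sort are assumptions for illustration, and the real classes almost certainly carry more fields.

```python
import posixpath
import re
from dataclasses import dataclass
from typing import Iterable, List, Optional

VERSION_PLACEHOLDER = "{version}"


@dataclass
class VersionInfo:
    version: str
    # Proposed attribute: full path of the file the version was extracted from.
    file_path: Optional[str] = None


@dataclass
class CDCSnapshotFileSource:
    path: str  # hypothetical field holding the path pattern
    recursiveFileLookup: bool = False
    allowVersionInFilename: bool = False  # proposed option, off by default

    def validate(self) -> None:
        in_filename = VERSION_PLACEHOLDER in posixpath.basename(self.path)
        if self.recursiveFileLookup and in_filename and not self.allowVersionInFilename:
            # Default behaviour is unchanged: keep the hard error.
            raise ValueError(
                "{version} in the filename requires allowVersionInFilename=True "
                "when recursiveFileLookup is enabled"
            )


def filename_regex(path_pattern: str) -> re.Pattern:
    """Turn the filename part of the pattern into a regex capturing the version."""
    prefix, suffix = posixpath.basename(path_pattern).split(VERSION_PLACEHOLDER, 1)
    return re.compile(re.escape(prefix) + r"(?P<version>.+)" + re.escape(suffix))


def discover_versions(
    source: CDCSnapshotFileSource, file_paths: Iterable[str]
) -> List[VersionInfo]:
    """Treat each matching file as its own snapshot and remember where it lives."""
    source.validate()
    rx = filename_regex(source.path)
    found = []
    for file_path in file_paths:
        match = rx.fullmatch(posixpath.basename(file_path))
        if match:
            found.append(VersionInfo(match.group("version"), file_path))
    # Simplification: versions are ordered as plain strings.
    return sorted(found, key=lambda info: info.version)


source = CDCSnapshotFileSource(
    path="Data/TableName.{version}.parquet",
    recursiveFileLookup=True,
    allowVersionInFilename=True,
)
listing = [
    "Data/2025/10/TableName.20251002.parquet",
    "Data/2025/09/TableName.20250930.parquet",
    "Data/2025/10/Other.1.parquet",
]
for info in discover_versions(source, listing):
    print(info.version, info.file_path)
```

Because each VersionInfo carries its own file_path, a reader such as _read_snapshot_dataframe would not need to rebuild the path from the pattern, which is not possible when the date-partitioned directories are not part of the pattern.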

Additional Context

No response

Metadata

Labels

enhancement (New feature or request)
