Rapid local performance test environment supporting the Polars-based transformation rewrite in digital-land-python #489 (fixes #475)
Conversation
… test environment supporting the Polars‑based transformation rewrite in digital-land-python Fixes #475
…ument parsing
…gress tracking
…d Parquet
…support
…apid local performance test environment
…t
…apid local performance test environment
…dd run_all script for batch processing
…sting environment
…d pipeline_runner.py; add new implementations for main pipeline orchestration and reporting
Pull request overview
Adds a self-contained local_testing/ environment to run the 26-phase Digital Land pipeline end-to-end for the Land Registry title-boundary dataset, with timing/performance reporting and Makefile helpers intended to support an “original vs Polars” transformation comparison.
Changes:
- Introduces a modular local runner (`main.py`, `PipelineRunner`, `PipelineReport`) plus a batch runner (`run_all.py`) and Make targets for setup/runs.
- Adds GML download/extract/convert utilities (regex / Polars / DuckDB paths) and generates minimal pipeline config CSVs under `local_testing/pipeline/`.
- Updates `.gitignore` to exclude generated `local_testing` artifacts (raw/extracted/converted/output/reports/cache/spec/venv).
Reviewed changes
Copilot reviewed 22 out of 23 changed files in this pull request and generated 11 comments.
| File | Description |
|---|---|
| `local_testing/run_all.py` | Batch driver to run the pipeline across all endpoints and write a summary JSON report. |
| `local_testing/main.py` | Orchestrates download → extract → convert → transform → report; wires CLI flags incl. `--compare` and `--phases`. |
| `local_testing/cli.py` | CLI argument parsing and LA selection helpers. |
| `local_testing/file_downloader.py` | Endpoint CSV fetch + ZIP downloader (requests/urllib) with progress reporting. |
| `local_testing/gml_extractor.py` | Extracts GML from downloaded ZIP archives. |
| `local_testing/gml_converter.py` | Converts GML to CSV/Parquet via regex, Polars, or DuckDB spatial. |
| `local_testing/pipeline_config.py` | Ensures pipeline CSV configs exist and downloads the `organisation.csv` cache. |
| `local_testing/pipeline_runner.py` | Runs the 26-phase digital-land pipeline with per-phase timing. |
| `local_testing/pipeline_report.py` | Collects metrics and emits JSON + human-readable text reports. |
| `local_testing/pipeline/*.csv` | Minimal pipeline configuration fixtures (column/default/lookup/etc.). |
| `local_testing/README.md` | Setup and usage documentation for the local testing environment. |
| `local_testing/Makefile` | Convenience targets for init, running, batch runs, and "fast" mode. |
| `.gitignore` | Ignores `venv/` plus `local_testing` generated folders. |
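The per-phase timing described for `pipeline_runner.py` can be sketched roughly as follows. This is a hypothetical harness, not code from the PR: the real runner drives digital-land's own phase classes, and the `run_phases` name and `process()` protocol here are illustrative.

```python
import time

def run_phases(phases, rows):
    """Run pipeline phases in sequence, timing each one.

    A "phase" here is anything with a process() method that maps an
    iterable of rows to an iterable of rows.
    """
    timings = {}
    data = rows
    for phase in phases:
        start = time.perf_counter()
        # Materialise the output so each phase's cost is attributed to it,
        # rather than being deferred into a later phase's iteration.
        data = list(phase.process(data))
        timings[type(phase).__name__] = time.perf_counter() - start
    return data, timings
```

Materialising between phases trades memory for attribution accuracy; with lazy generators, a slow phase's work would otherwise be billed to whichever later phase consumes it.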
`local_testing/file_downloader.py` (outdated)

```python
'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8',
'Accept-Language': 'en-GB,en;q=0.9',
'Accept-Encoding': 'gzip, deflate, br',
```
`_download_with_urllib` sets `Accept-Encoding: gzip, deflate, br`, but urllib will not transparently decode brotli/gzip content-encodings. If the server responds compressed, the downloaded bytes will be corrupted. For binary ZIP downloads, omit `Accept-Encoding` (or explicitly handle `Content-Encoding`).
…es
…r, main, pipeline report, pipeline runner, and run_all scripts
…s for consistency
Codecov Report: ✅ All modified and coverable lines are covered by tests.

```
@@            Coverage Diff             @@
##             main     #489      +/-   ##
==========================================
+ Coverage   85.49%   85.97%   +0.48%
==========================================
  Files          87       87
  Lines        5872     5966      +94
==========================================
+ Hits         5020     5129     +109
+ Misses        852      837      -15
```

☔ View the full report in Codecov by Sentry.
**eveleighoj** left a comment
There is a lot of good stuff in here, but I'm not sure it justifies creating a whole mini version of the project. The majority of the code should either be in the digital-land (the src code) directory, if it's functionality for the library, or in tests/performance, if it's code specifically for performance testing.
There is a huge risk with creating entirely separate structures: the code or functionality gets lost in the future.
I suggest migrating the majority of the functionality into the tests/performance directory and running it as pytest scripts.
You can add additional make targets to the Makefile to run your specific performance tests.
This is quite a simple repo to get set up: you just need to create a .venv, see the guidance on that here
https://digital-land.github.io/technical-documentation/development/how-to-guides/make-python-venv/
and then run make init. It should then install requirements, including non-Python ones, on Linux. If you're on a Mac you may just need to install them via brew.
I've added some comments throughout about possible duplication between functionality already in the repo and what you've made.
pytest can be the entry point rather than requiring another CLI; I'm pretty sure you can expand the arguments passed into pytest using fixtures.
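The fixture approach the reviewer mentions can be sketched in a `conftest.py`. The option and fixture names below are illustrative assumptions, not code from this PR:

```python
# conftest.py sketch: expose the runner's CLI knobs as pytest options.
import pytest

def pytest_addoption(parser):
    parser.addoption("--endpoint", default="all",
                     help="endpoint (local authority) to run the pipeline for")
    parser.addoption("--phases", default=None,
                     help="comma-separated subset of phases to time")

@pytest.fixture
def endpoint(request):
    return request.config.getoption("--endpoint")

@pytest.fixture
def phases(request):
    raw = request.config.getoption("--phases")
    return raw.split(",") if raw else None
```

A performance test would then accept `endpoint` and `phases` as arguments and could be invoked with something like `python -m pytest tests/performance --endpoint=<name>`.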
There is a data directory in tests (/data); test data can generally go in there, so I'm not sure why we need to create additional test data elsewhere in the repo.
Equally, based on the size of these files, they could be created in a pytest fixture each time a test runs.
This is a rapid prototype, and the aim is to validate it using real‑world data and, crucially, real‑world data volumes. The supplied test data does not provide the scale or characteristics needed for meaningful evaluation.
There is already a collector in collect.py that is designed for downloading files. Rebuilding it in a slightly different way duplicates it.
Agreed; however, the current version of the collector is too slow for prototype purposes, hence this change.
There are conversion functions already built into the repo. I'm not against replacing them, but a replacement has to support all files, not just GML, and this code also seems extremely specific to this data. DuckDB could be a good idea for the future, but it has to extract all columns from any data.
As the ingestion is out of scope, I needed a fast method to do these conversions for rapid testing purposes. This could be used as a future method for extraction.
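For illustration, the regex fast path for GML could look roughly like this. This is a hypothetical sketch: the tag names are illustrative, and the PR's actual `gml_converter.py` may differ.

```python
import re

# Illustrative GML tag names; real files may use different feature tags.
MEMBER_RE = re.compile(r"<gml:featureMember>(.*?)</gml:featureMember>", re.DOTALL)
POSLIST_RE = re.compile(r"<gml:posList[^>]*>(.*?)</gml:posList>", re.DOTALL)

def extract_geometries(gml_text: str) -> list:
    """Pull each feature's coordinate posList out of a GML document
    with regexes instead of a full XML parse.  Fast, but only safe for
    well-known, machine-generated GML such as title-boundary exports."""
    geometries = []
    for member in MEMBER_RE.findall(gml_text):
        match = POSLIST_RE.search(member)
        if match:
            geometries.append(match.group(1).strip())
    return geometries
```

This trades XML correctness (namespaces, nesting, CDATA) for speed, which is the same trade-off the review discussion is about.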
Could this not just be a pytest test file in tests/performance,
or an integration or acceptance test in the relevant files?
It can then very easily be run using `python -m pytest tests/performance/<test_file_name>.py`.
This fits in with how the repo is already structured, rather than needing a separate directory.
No, this is a separate module from the functional-correctness testing provided by the unit, integration and acceptance tests. I think we should keep it where it is currently located.
```python
filepath.write_text(content)

@staticmethod
def download_organisation_csv(cache_dir: Path) -> Path:
```
Organisation data is generally separate from pipeline configuration. This also downloads the dataset from the wrong place; see https://github.com/digital-land/digital-land-python/blob/main/digital_land/organisation.py
Yes, although as this is a rapid testing prototype rather than production-grade code, the priority is to demonstrate performance improvements. What is currently implemented is sufficient for testing the changes to the phase code.
```python
print("  Loading digital-land pipeline modules...")
p = self.get_pipeline_imports()

# Convert Parquet to CSV if needed (original pipeline only supports CSV)
```
The original pipeline converts directly from GML.
It does, but it is not fast enough for rapid testing purposes.
Thank you for your review comments. For this phase we are operating under tight delivery timelines, and the objective is to prove performance and feasibility, not to finalise long-term structures. Creating an isolated prototype path avoids any risk to the existing codebase and allows us to progress at the required pace. Structural consolidation can be completed during the production build, where maintainability and future-proofing will be addressed properly. Please also consider that the rapid local performance test environment is a prototype intended for rapid testing and evaluation rather than production use.
- Implemented MapPhase for renaming columns based on a mapping specification.
- Created MigratePhase to rename fields according to the latest specification.
- Added NormalisePhase to clean whitespace and handle null patterns in CSV data.
- Developed OrganisationPhase for looking up organisation values.
- Introduced PatchPhase to apply regex patches to field values.
- Implemented PivotPhase to unpivot entity rows into a series of facts.
- Created EntityPrefixPhase to ensure every entry has a prefix field.
- Added PriorityPhase to deduce the priority of each entry.
- Developed FieldPrunePhase and EntityPrunePhase to reduce columns and remove entries with missing entities.
- Implemented EntityReferencePhase and FactReferencePhase to ensure prefix and reference fields are set correctly.
- Created SavePhase to save the DataFrame to a CSV file.
- Added comprehensive tests for each phase to ensure functionality and correctness.

#475
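As an illustration of the shape of these phases, a NormalisePhase-style step might look like the following. This is a hypothetical sketch rather than the PR's actual implementation, and the null patterns shown are illustrative:

```python
import re

class NormalisePhase:
    """Strip whitespace from string values and map common null
    placeholders (illustrative patterns) to empty strings."""

    NULL_PATTERNS = re.compile(r"^(null|none|n/a|-+)$", re.IGNORECASE)

    def process(self, rows):
        for row in rows:
            cleaned = {}
            for field, value in row.items():
                if isinstance(value, str):
                    value = value.strip()
                    if self.NULL_PATTERNS.match(value):
                        value = ""
                cleaned[field] = value
            yield cleaned
```

Each phase exposing the same `process(rows)` generator interface is what lets a runner chain all 26 of them and time each one independently.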