475 Rapid local performance test environment supporting the Polars-based transformation rewrite in digital-land-python #489

Open
mattsan-dev wants to merge 15 commits into main from
475-rapid-local-performance-test-environment-supporting-the-polarsbased-transformation-rewrite-in-digital-land-python
Conversation

@mattsan-dev (Contributor) commented Feb 9, 2026

What type of PR is this? (check all applicable)

  • Refactor
  • Feature
  • Bug Fix
  • Optimization
  • Documentation Update

Description

Adds a self-contained local_testing/ environment for running the 26-phase Digital Land pipeline end-to-end against the Land Registry title-boundary dataset, with per-phase timing and performance reporting to support an "original vs Polars" transformation comparison.

Related Tickets & Documents

  • Ticket Link
  • Related Issue #
  • Closes #

QA Instructions, Screenshots, Recordings

Please replace this line with instructions on how to test your changes, a note
on the devices and browsers this has been tested on, as well as any relevant
images for UI changes.

Added/updated tests?

We encourage you to keep the code coverage percentage at 80% and above. Please refer to the Digital Land Testing Guidance for more information.

  • Yes
  • No, and this is why: please replace this line with details on why tests
    have not been included
  • I need help with writing tests

[optional] Are there any post deployment tasks we need to perform?

[optional] Are there any dependencies on other PRs or Work?

Commits (messages truncated as rendered; each ends with the PR title and "Fixes #475"):

  • … test environment supporting the Polars‑based transformation rewrite in digital-land-python
  • …ument parsing
  • …gress tracking
  • …d Parquet
  • …support
  • …
  • …t
  • …
  • …dd run_all script for batch processing
  • …sting environment
  • …d pipeline_runner.py; add new implementations for main pipeline orchestration and reporting
Copilot AI left a comment
Pull request overview

Adds a self-contained local_testing/ environment to run the 26-phase Digital Land pipeline end-to-end for the Land Registry title-boundary dataset, with timing/performance reporting and Makefile helpers intended to support an “original vs Polars” transformation comparison.

Changes:

  • Introduces a modular local runner (main.py, PipelineRunner, PipelineReport) plus batch runner (run_all.py) and Make targets for setup/runs.
  • Adds GML download/extract/convert utilities (regex / Polars / DuckDB paths) and generates minimal pipeline config CSVs under local_testing/pipeline/.
  • Updates .gitignore to exclude generated local_testing artifacts (raw/extracted/converted/output/reports/cache/spec/venv).

Reviewed changes

Copilot reviewed 22 out of 23 changed files in this pull request and generated 11 comments.

Summary per file:

  • local_testing/run_all.py: Batch driver to run the pipeline across all endpoints and write a summary JSON report.
  • local_testing/main.py: Orchestrates download → extract → convert → transform → report; wires CLI flags incl. --compare and --phases.
  • local_testing/cli.py: CLI argument parsing and LA selection helpers.
  • local_testing/file_downloader.py: Endpoint CSV fetch + ZIP downloader (requests/urllib) with progress reporting.
  • local_testing/gml_extractor.py: Extracts GML from downloaded ZIP archives.
  • local_testing/gml_converter.py: Converts GML to CSV/Parquet via regex, Polars, or DuckDB spatial.
  • local_testing/pipeline_config.py: Ensures pipeline CSV configs exist and downloads the organisation.csv cache.
  • local_testing/pipeline_runner.py: Runs the 26-phase digital-land pipeline with per-phase timing.
  • local_testing/pipeline_report.py: Collects metrics and emits JSON + human-readable text reports.
  • local_testing/pipeline/*.csv: Minimal pipeline configuration fixtures (column/default/lookup/etc.).
  • local_testing/README.md: Setup and usage documentation for the local testing environment.
  • local_testing/Makefile: Convenience targets for init, running, batch runs, and "fast" mode.
  • .gitignore: Ignores venv/ plus local_testing generated folders.
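For context, the per-phase timing that local_testing/pipeline_runner.py is described as collecting could be sketched roughly like this minimal example (the class, method, and phase names here are hypothetical, not the PR's actual code):

```python
import time
from dataclasses import dataclass, field


@dataclass
class PhaseTimings:
    """Collects wall-clock durations for each pipeline phase by name."""

    durations: dict = field(default_factory=dict)

    def run_phase(self, name, fn, *args, **kwargs):
        # Time a single phase callable and record its duration in seconds.
        start = time.perf_counter()
        result = fn(*args, **kwargs)
        self.durations[name] = time.perf_counter() - start
        return result


timings = PhaseTimings()
# A stand-in "normalise" phase: strip whitespace from each row.
cleaned = timings.run_phase("normalise", lambda rows: [r.strip() for r in rows], ["  a ", "b  "])
```

A runner built this way can emit its `durations` dict directly into the JSON report that pipeline_report.py writes.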


'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8',
'Accept-Language': 'en-GB,en;q=0.9',
'Accept-Encoding': 'gzip, deflate, br',
Copilot AI commented Feb 9, 2026
_download_with_urllib sets Accept-Encoding: gzip, deflate, br, but urllib will not transparently decode brotli/gzip content-encodings. If the server responds compressed, the downloaded bytes will be corrupted. For binary ZIP downloads, omit Accept-Encoding (or explicitly handle Content-Encoding).

Suggested change
'Accept-Encoding': 'gzip, deflate, br',

Further commits (same suffix; each "Fixes #475"):

  • …es
  • …r, main, pipeline report, pipeline runner, and run_all scripts
  • …s for consistency
@codecov-commenter

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 85.97%. Comparing base (5c94713) to head (bf2fe7b).
⚠️ Report is 30 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #489      +/-   ##
==========================================
+ Coverage   85.49%   85.97%   +0.48%     
==========================================
  Files          87       87              
  Lines        5872     5966      +94     
==========================================
+ Hits         5020     5129     +109     
+ Misses        852      837      -15     

☔ View full report in Codecov by Sentry.

@digital-land digital-land deleted a comment from Copilot AI Feb 9, 2026
@eveleighoj (Contributor) left a comment


There is a lot of good stuff in here, but I'm not sure it justifies creating a whole mini version of the project. The majority of the code should either be in the digital-land (src) directory, if it's functionality for the library, or in tests/performance, if it's code specifically for performance testing.

There is a huge risk with creating entirely different structures: the code or functionality gets lost in the future.

I suggest migrating the majority of the functionality into the tests/performance directory and running it as pytest scripts.

You can add additional make targets to the Makefile to run your specific performance tests.

This is quite a simple repo to set up: you just need to create a .venv; see the guidance here:
https://digital-land.github.io/technical-documentation/development/how-to-guides/make-python-venv/

Then run make init; it should install the requirements, including non-Python ones, on Linux. If you're on Mac you may just need to install them via brew.

I've added some comments throughout about possible duplication between functionality already in the repo and what you've made.

pytest can be the entry point rather than requiring another CLI; I'm pretty sure you can expand the arguments passed into pytest using fixtures.
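The reviewer's suggestion can indeed be done with pytest's `pytest_addoption` hook plus a fixture; a minimal sketch, where the `--endpoint` option name is purely illustrative:

```python
# conftest.py sketch: pass CLI arguments into performance tests via pytest,
# e.g. `pytest tests/performance --endpoint some-local-authority`.
import pytest


def pytest_addoption(parser):
    # Register a custom command-line option on the pytest parser.
    parser.addoption(
        "--endpoint",
        default="all",
        help="endpoint (local authority) to run the pipeline against",
    )


@pytest.fixture
def endpoint(request):
    # Expose the option's value to any test that requests this fixture.
    return request.config.getoption("--endpoint")
```

A performance test then takes `endpoint` as a normal fixture argument, so no separate CLI layer is needed.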

Contributor:

There is a data directory in tests (/data); generally, test data can go in there. I'm not sure why we need to create additional test data elsewhere in the repo.

Equally, based on the size of these files, they could be created in a pytest fixture each time a test runs.

Contributor Author:

This is a rapid prototype, and the aim is to validate it using real‑world data and, crucially, real‑world data volumes. The supplied test data does not provide the scale or characteristics needed for meaningful evaluation.

Contributor:

There is already a collector in collect.py that is designed for downloading files. Rebuilding it in a slightly different way duplicates it.

Contributor Author:

Agreed; however, the current version of the collector is too slow for prototype purposes, hence this change.

Contributor:

There are conversion functions already built into the repo. I'm not against replacing them, but a replacement has to support all files, not just GML. This code also seems extremely specific to this data. DuckDB could be a good idea for the future, but it has to extract all columns from any data.

Contributor Author:

As ingestion is out of scope, I needed a fast method for these conversions for rapid testing purposes; this could be used as a future method for extraction.
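As a toy illustration of the kind of regex fast path that gml_converter.py's regex mode implies (the tag pattern and function name here are assumptions, not the PR's actual code), coordinate lists can be pulled straight out of GML text without a full GDAL/ogr2ogr conversion:

```python
import re

# Match the body of each <gml:posList> element (no nested tags inside it).
POS_LIST_RE = re.compile(r"<gml:posList>([^<]+)</gml:posList>")


def extract_pos_lists(gml_text: str) -> list:
    # Return each coordinate list, whitespace-trimmed, in document order.
    return [m.group(1).strip() for m in POS_LIST_RE.finditer(gml_text)]


sample = "<gml:posList>0 0 1 0 1 1 0 0</gml:posList><gml:posList>2 2 3 2 3 3</gml:posList>"
```

This trades generality for speed: it only works for data whose structure is known in advance, which matches the reviewer's point that a production converter must handle arbitrary columns and formats.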

Contributor:

Could this not just be a pytest test file in tests/performance, or an integration or acceptance test in the relevant files?

It can then very easily be run using python -m pytest tests/performance/<test_file_name>.py.

This fits in with how the repo is already structured, rather than needing to create a separate directory.

Contributor Author:

No, this is a separate module from the functional-correctness testing provided by the unit, integration and acceptance tests. I think we should keep it where it is currently located.

filepath.write_text(content)

@staticmethod
def download_organisation_csv(cache_dir: Path) -> Path:
Contributor:

Organisation data is generally separate from pipeline configuration. This is also downloading the dataset from the wrong place; see here: https://github.com/digital-land/digital-land-python/blob/main/digital_land/organisation.py

Contributor Author:

Yes, although as this is a rapid testing prototype rather than production‑grade code, our priority is to demonstrate performance improvements. What is currently implemented is sufficient for the purposes of testing the changes to the phase code.

print(" Loading digital-land pipeline modules...")
p = self.get_pipeline_imports()

# Convert Parquet to CSV if needed (original pipeline only supports CSV)
Contributor:

The original pipeline converts directly from GML.

Contributor Author:

It does, but it is not fast enough for rapid testing purposes.

@mattsan-dev (Contributor Author) commented Feb 10, 2026

@eveleighoj

Thank you for your review comments. For this phase we are operating under tight delivery timelines and the objective is to prove performance and feasibility, not to finalise long‑term structures. Creating an isolated prototype path avoids any risk to the existing codebase and allows us to progress at the required pace. Structural consolidation can be completed during the production build, where maintainability and future‑proofing will be addressed properly.

Please also consider that the rapid local performance test environment is a prototype, intended for rapid testing and evaluation rather than production use.

The prototype is:

  • not production‑grade code,
  • not fully hardened,
  • not optimised for maintainability,
  • meant to validate concept, architecture, or feasibility,
  • time‑boxed,
  • built with “just enough engineering” for demonstration.

- Implemented MapPhase for renaming columns based on a mapping specification.
- Created MigratePhase to rename fields according to the latest specification.
- Added NormalisePhase to clean whitespace and handle null patterns in CSV data.
- Developed OrganisationPhase for looking up organisation values.
- Introduced PatchPhase to apply regex patches to field values.
- Implemented PivotPhase to unpivot entity rows into a series of facts.
- Created EntityPrefixPhase to ensure every entry has a prefix field.
- Added PriorityPhase to deduce the priority of each entry.
- Developed FieldPrunePhase and EntityPrunePhase to reduce columns and remove entries with missing entities.
- Implemented EntityReferencePhase and FactReferencePhase to ensure prefix and reference fields are set correctly.
- Created SavePhase to save the DataFrame to a CSV file.
- Added comprehensive tests for each phase to ensure functionality and correctness. #475
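In spirit, the MapPhase bullet above describes a key-rename step over rows; a minimal pure-Python sketch (the function name and mapping are hypothetical; the PR's version is Polars-based):

```python
def map_fields(rows, mapping):
    """Rename each row's keys according to a mapping spec; keep unmapped keys as-is."""
    return [{mapping.get(key, key): value for key, value in row.items()} for row in rows]


# Source columns renamed to specification field names.
rows = [{"NAME": "Abbey Field", "REF": "TB-1"}]
mapped = map_fields(rows, {"NAME": "name", "REF": "reference"})
```

The same shape of transformation expressed over a Polars DataFrame would typically be a single `rename` call, which is where the columnar rewrite gains its speed.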


Development

Successfully merging this pull request may close these issues.

Rapid local performance test environment supporting the Polars‑based transformation rewrite in digital-land-python

4 participants