Rapid local performance test environment supporting the Polars-based transformation rewrite in digital-land-python #489 (fixes #475)
Conversation
… test environment supporting the Polars‑based transformation rewrite in digital-land-python Fixes #475
…ument parsing
…gress tracking
…d Parquet
…support
…apid local performance test environment
…t
…apid local performance test environment
…dd run_all script for batch processing
…sting environment
…d pipeline_runner.py; add new implementations for main pipeline orchestration and reporting
Pull request overview
Adds a self-contained local_testing/ environment to run the 26-phase Digital Land pipeline end-to-end for the Land Registry title-boundary dataset, with timing/performance reporting and Makefile helpers intended to support an “original vs Polars” transformation comparison.
Changes:
- Introduces a modular local runner (`main.py`, `PipelineRunner`, `PipelineReport`) plus a batch runner (`run_all.py`) and Make targets for setup/runs.
- Adds GML download/extract/convert utilities (regex / Polars / DuckDB paths) and generates minimal pipeline config CSVs under `local_testing/pipeline/`.
- Updates `.gitignore` to exclude generated `local_testing` artifacts (raw/extracted/converted/output/reports/cache/spec/venv).
Reviewed changes
Copilot reviewed 22 out of 23 changed files in this pull request and generated 11 comments.
| File | Description |
|---|---|
| `local_testing/run_all.py` | Batch driver to run the pipeline across all endpoints and write a summary JSON report. |
| `local_testing/main.py` | Orchestrates download → extract → convert → transform → report; wires CLI flags incl. `--compare` and `--phases`. |
| `local_testing/cli.py` | CLI argument parsing and LA selection helpers. |
| `local_testing/file_downloader.py` | Endpoint CSV fetch + ZIP downloader (requests/urllib) with progress reporting. |
| `local_testing/gml_extractor.py` | Extracts GML from downloaded ZIP archives. |
| `local_testing/gml_converter.py` | Converts GML to CSV/Parquet via regex, Polars, or DuckDB spatial. |
| `local_testing/pipeline_config.py` | Ensures pipeline CSV configs exist and downloads the `organisation.csv` cache. |
| `local_testing/pipeline_runner.py` | Runs the 26-phase digital-land pipeline with per-phase timing. |
| `local_testing/pipeline_report.py` | Collects metrics and emits JSON + human-readable text reports. |
| `local_testing/pipeline/*.csv` | Minimal pipeline configuration fixtures (column/default/lookup/etc.). |
| `local_testing/README.md` | Setup and usage documentation for the local testing environment. |
| `local_testing/Makefile` | Convenience targets for init, running, batch runs, and "fast" mode. |
| `.gitignore` | Ignores `venv/` plus `local_testing` generated folders. |
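The per-phase timing described for `pipeline_runner.py` can be sketched roughly as follows. This is a hypothetical harness, not code from the PR: the real runner drives digital-land's own phase classes, and the `run_phases` name and `process()` protocol here are illustrative.

```python
import time

def run_phases(phases, rows):
    """Run pipeline phases in sequence, timing each one.

    A "phase" here is anything with a process() method that maps an
    iterable of rows to an iterable of rows.
    """
    timings = {}
    data = rows
    for phase in phases:
        start = time.perf_counter()
        # Materialise the output so each phase's cost is attributed to it,
        # rather than being deferred into a later phase's iteration.
        data = list(phase.process(data))
        timings[type(phase).__name__] = time.perf_counter() - start
    return data, timings
```

Materialising between phases trades memory for attribution accuracy; with lazy generators, a slow phase's work would otherwise be billed to whichever later phase consumes it.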
`local_testing/file_downloader.py` (outdated)

```python
'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8',
'Accept-Language': 'en-GB,en;q=0.9',
'Accept-Encoding': 'gzip, deflate, br',
```
`_download_with_urllib` sets `Accept-Encoding: gzip, deflate, br`, but urllib will not transparently decode brotli/gzip content-encodings. If the server responds compressed, the downloaded bytes will be corrupted. For binary ZIP downloads, omit `Accept-Encoding` (or explicitly handle `Content-Encoding`).
…es
…r, main, pipeline report, pipeline runner, and run_all scripts
…s for consistency
Codecov Report: ✅ All modified and coverable lines are covered by tests.

```
@@            Coverage Diff             @@
##             main     #489      +/-   ##
==========================================
+ Coverage   85.49%   85.97%   +0.48%
==========================================
  Files          87       87
  Lines        5872     5966      +94
==========================================
+ Hits         5020     5129     +109
+ Misses        852      837      -15
```

☔ View the full report in Codecov by Sentry.
**eveleighoj** left a comment
There is a lot of good stuff in here, but I'm not sure it justifies creating a whole mini version of the project. The majority of the code should either be in the digital-land (the src code) directory, if it's functionality for the library, or in tests/performance, if it's code specifically for performance testing.
There is a huge risk with creating entirely separate structures: the code or functionality gets lost in the future.
I suggest migrating the majority of the functionality into the tests/performance directory and running it as pytest scripts.
You can add additional make targets to the Makefile to run your specific performance tests.
This is quite a simple repo to get set up: you just need to create a .venv, see the guidance on that here
https://digital-land.github.io/technical-documentation/development/how-to-guides/make-python-venv/
and then run make init. It should then install requirements, including non-Python ones, on Linux. If you're on a Mac you may just need to install them via brew.
I've added some comments throughout about possible duplication between functionality already in the repo and what you've made.
pytest can be the entry point rather than requiring another CLI; I'm pretty sure you can expand the arguments passed into pytest using fixtures.
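The fixture approach the reviewer mentions can be sketched in a `conftest.py`. The option and fixture names below are illustrative assumptions, not code from this PR:

```python
# conftest.py sketch: expose the runner's CLI knobs as pytest options.
import pytest

def pytest_addoption(parser):
    parser.addoption("--endpoint", default="all",
                     help="endpoint (local authority) to run the pipeline for")
    parser.addoption("--phases", default=None,
                     help="comma-separated subset of phases to time")

@pytest.fixture
def endpoint(request):
    return request.config.getoption("--endpoint")

@pytest.fixture
def phases(request):
    raw = request.config.getoption("--phases")
    return raw.split(",") if raw else None
```

A performance test would then accept `endpoint` and `phases` as arguments and could be invoked with something like `python -m pytest tests/performance --endpoint=<name>`.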
There is a data directory in tests (/data); test data can generally go in there, so I'm not sure why we need to create additional test data elsewhere in the repo.
Equally, based on the size of these files, they could be created in a pytest fixture each time a test runs.
This is a rapid prototype, and the aim is to validate it using real‑world data and, crucially, real‑world data volumes. The supplied test data does not provide the scale or characteristics needed for meaningful evaluation.
There is already a collector in collect.py that is designed for downloading files. Rebuilding it in a slightly different way duplicates it.
Agreed; however, the current version of the collector is too slow for prototype purposes, hence this change.
There are conversion functions already built into the repo. I'm not against replacing them, but a replacement has to support all files, not just GML, and this code also seems extremely specific to this data. DuckDB could be a good idea for the future, but it has to extract all columns from any data.
As the ingestion is out of scope, I needed a fast method to do these conversions for rapid testing purposes. This could be used as a future method for extraction.
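For illustration, the regex fast path for GML could look roughly like this. This is a hypothetical sketch: the tag names are illustrative, and the PR's actual `gml_converter.py` may differ.

```python
import re

# Illustrative GML tag names; real files may use different feature tags.
MEMBER_RE = re.compile(r"<gml:featureMember>(.*?)</gml:featureMember>", re.DOTALL)
POSLIST_RE = re.compile(r"<gml:posList[^>]*>(.*?)</gml:posList>", re.DOTALL)

def extract_geometries(gml_text: str) -> list:
    """Pull each feature's coordinate posList out of a GML document
    with regexes instead of a full XML parse.  Fast, but only safe for
    well-known, machine-generated GML such as title-boundary exports."""
    geometries = []
    for member in MEMBER_RE.findall(gml_text):
        match = POSLIST_RE.search(member)
        if match:
            geometries.append(match.group(1).strip())
    return geometries
```

This trades XML correctness (namespaces, nesting, CDATA) for speed, which is the same trade-off the review discussion is about.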
Could this not just be a pytest test file in tests/performance,
or an integration or acceptance test in the relevant files?
It can then very easily be run using `python -m pytest tests/performance/<test_file_name>.py`.
This fits in with how the repo is already structured, rather than needing a separate directory.
No, this is a separate module from the functional-correctness testing provided by the unit, integration and acceptance tests. I think we should keep it where it is currently located.
```python
filepath.write_text(content)

@staticmethod
def download_organisation_csv(cache_dir: Path) -> Path:
```
Organisation data is generally separate from pipeline configuration. This also downloads the dataset from the wrong place; see https://github.com/digital-land/digital-land-python/blob/main/digital_land/organisation.py
Yes, although as this is a rapid testing prototype rather than production-grade code, the priority is to demonstrate performance improvements. What is currently implemented is sufficient for testing the changes to the phase code.
```python
print("  Loading digital-land pipeline modules...")
p = self.get_pipeline_imports()

# Convert Parquet to CSV if needed (original pipeline only supports CSV)
```
The original pipeline converts directly from GML.
It does, but it is not fast enough for rapid testing purposes.
Thank you for your review comments. For this phase we are operating under tight delivery timelines, and the objective is to prove performance and feasibility, not to finalise long-term structures. Creating an isolated prototype path avoids any risk to the existing codebase and allows us to progress at the required pace. Structural consolidation can be completed during the production build, where maintainability and future-proofing will be addressed properly. Please also consider that the rapid local performance test environment is a prototype intended for rapid testing and evaluation rather than production use.
- Implemented MapPhase for renaming columns based on a mapping specification.
- Created MigratePhase to rename fields according to the latest specification.
- Added NormalisePhase to clean whitespace and handle null patterns in CSV data.
- Developed OrganisationPhase for looking up organisation values.
- Introduced PatchPhase to apply regex patches to field values.
- Implemented PivotPhase to unpivot entity rows into a series of facts.
- Created EntityPrefixPhase to ensure every entry has a prefix field.
- Added PriorityPhase to deduce the priority of each entry.
- Developed FieldPrunePhase and EntityPrunePhase to reduce columns and remove entries with missing entities.
- Implemented EntityReferencePhase and FactReferencePhase to ensure prefix and reference fields are set correctly.
- Created SavePhase to save the DataFrame to a CSV file.
- Added comprehensive tests for each phase to ensure functionality and correctness.

#475
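As an illustration of the shape of these phases, a NormalisePhase-style step might look like the following. This is a hypothetical sketch rather than the PR's actual implementation, and the null patterns shown are illustrative:

```python
import re

class NormalisePhase:
    """Strip whitespace from string values and map common null
    placeholders (illustrative patterns) to empty strings."""

    NULL_PATTERNS = re.compile(r"^(null|none|n/a|-+)$", re.IGNORECASE)

    def process(self, rows):
        for row in rows:
            cleaned = {}
            for field, value in row.items():
                if isinstance(value, str):
                    value = value.strip()
                    if self.NULL_PATTERNS.match(value):
                        value = ""
                cleaned[field] = value
            yield cleaned
```

Each phase exposing the same `process(rows)` generator interface is what lets a runner chain all 26 of them and time each one independently.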