Epic: BNPL Data Engineering Pipeline
Business Objective
Build production-grade data infrastructure for BNPL transaction analysis and ML model development. Ingest a comprehensive historical dataset and establish a scalable processing pipeline.
Architecture Overview
- Data Flow: simtom API → BigQuery (raw) → dbt (transform) → Analytics/ML tables
- Orchestration: Airflow for daily processing patterns
- Quality: Great Expectations for data validation
- Pattern: ELT leveraging BigQuery compute with dbt transformations
Repository Structure
flit-data-platform/
├── scripts/bnpl/ # Data ingestion and utilities
├── airflow/ # Orchestration infrastructure
├── great_expectations/ # Data quality framework
├── models/staging/bnpl/ # dbt staging models
├── models/intermediate/bnpl/ # dbt intermediate models
└── models/marts/bnpl/ # dbt analytics models
Implementation Roadmap
PR 1: Infrastructure Foundation
Branch: feat/bnpl-infrastructure
Description: Core infrastructure setup and project foundation
Scope:
- Create BigQuery datasets (flit_bnpl_raw, flit_bnpl_intermediate, flit_bnpl_marts)
Files:
scripts/bnpl/__init__.py
scripts/bnpl/api_client.py (interface definition)
airflow/docker-compose.yml
docs/bnpl_pipeline_architecture.md
requirements.txt (updated with new dependencies)
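The `api_client.py` interface definition from PR 1 might be sketched as follows. This is a minimal sketch; the class and method names, signatures, and the date/count parameters are illustrative assumptions, not the final contract:

```python
from abc import ABC, abstractmethod
from typing import Any


class BNPLAPIClient(ABC):
    """Interface for the simtom BNPL API client (hypothetical names)."""

    @abstractmethod
    def fetch_batch(self, for_date: str, total_records: int) -> list[dict[str, Any]]:
        """Request a batch of simulated transactions for a given date."""

    @abstractmethod
    def health_check(self) -> bool:
        """Return True if the API endpoint is reachable."""
```

PR 2 would then provide the concrete implementation against the simtom endpoint.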
PR 2: Historical Data Ingestion Engine
Branch: feat/bnpl-ingestion
Description: Comprehensive historical data acquisition with realistic transaction patterns
Scope:
Key Decisions:
- Dataset Size: 1.8M transactions (realistic volume for ML training)
- Time Distribution: Leverage simtom's realistic holiday/weekend patterns
- Payload Strategy: Dynamic total_records based on day type (weekdays: 5-7k, weekends: 3-4k, holidays: 8-10k)
- Scope: Historical ingestion only - production streaming pipeline addressed separately
Files:
scripts/bnpl/ingest_historical_bnpl.py
scripts/bnpl/api_client.py (complete implementation)
scripts/bnpl/data_patterns.py (holiday/weekend logic)
scripts/bnpl/schema_utils.py (JSON normalization)
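The transaction/customer entity separation in `schema_utils.py` could look roughly like this. The field names (`customer`, `customer_id`) are assumptions about the simtom payload shape, not its confirmed schema:

```python
from typing import Any


def split_entities(record: dict[str, Any]) -> tuple[dict[str, Any], dict[str, Any]]:
    """Separate one raw API record into a transaction row and a customer row.

    Assumes the nested customer profile lives under a 'customer' key; the
    actual simtom schema may differ.
    """
    customer = record.get("customer", {})
    transaction = {k: v for k, v in record.items() if k != "customer"}
    # Keep a foreign key on the transaction side for later joins.
    transaction["customer_id"] = customer.get("customer_id")
    return transaction, customer
```

For example, `split_entities({"transaction_id": "t1", "amount": 42.0, "customer": {"customer_id": "c9"}})` yields a transaction row carrying `customer_id` and a separate customer row.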
API Integration:
- URL: https://simtom-production.up.railway.app/stream/bnpl
- Dynamic payload generation based on date characteristics
- Automatic schema extraction for transaction/customer entity separation
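One way to implement the dynamic `total_records` decision above is to derive the day type and sample within the stated ranges. A sketch; the holiday set here is a placeholder assumption, not the real calendar logic in `data_patterns.py`:

```python
import random
from datetime import date
from typing import Optional

# Placeholder holiday set; a real implementation would use a calendar library.
HOLIDAYS = {date(2024, 12, 25), date(2024, 11, 29)}


def total_records_for(day: date, rng: Optional[random.Random] = None) -> int:
    """Pick a record count matching the epic's stated per-day-type ranges."""
    rng = rng or random.Random()
    if day in HOLIDAYS:
        return rng.randint(8_000, 10_000)   # holidays: 8-10k
    if day.weekday() >= 5:
        return rng.randint(3_000, 4_000)    # weekends: 3-4k
    return rng.randint(5_000, 7_000)        # weekdays: 5-7k
```

The ingestion script would call this per date when building each request payload.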
PR 3: dbt Transformation Pipeline
Branch: feat/bnpl-dbt-models
Description: Scalable data transformation pipeline with proper entity modeling
Scope:
Entity Separation Strategy:
- Extract customer profiles from transaction JSON in staging layer
- Deduplicate and enrich customer data in intermediate layer
- Create separate transaction/customer marts with proper relationships
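In Python terms, the intermediate-layer deduplication amounts to keeping one profile per customer_id. The dbt model would express this in SQL; the tie-break rule below (latest seen record wins) is an assumption for illustration:

```python
from typing import Any


def dedupe_customers(rows: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Collapse repeated customer extracts into one profile per customer_id.

    Later rows overwrite earlier ones, i.e. the most recently seen record
    wins -- an assumed tie-break, not a confirmed rule.
    """
    by_id: dict[str, dict[str, Any]] = {}
    for row in rows:
        by_id[row["customer_id"]] = row
    return list(by_id.values())
```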
dbt Models:
models/staging/bnpl/stg_bnpl_raw_transactions.sql
models/staging/bnpl/stg_bnpl_extracted_customers.sql
models/intermediate/bnpl/int_bnpl_transactions_enriched.sql
models/intermediate/bnpl/int_bnpl_customer_profiles.sql
models/marts/bnpl/mart_bnpl_transactions.sql
models/marts/bnpl/mart_bnpl_customers.sql
models/marts/bnpl/mart_bnpl_ml_features.sql
PR 4: Data Quality Framework
Branch: feat/bnpl-data-quality
Description: Comprehensive data validation and monitoring infrastructure
Scope:
Files:
great_expectations/great_expectations.yml
great_expectations/expectations/bnpl_business_rules.json
great_expectations/expectations/bnpl_statistical_patterns.json
great_expectations/checkpoints/daily_quality_gate.yml
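The business-rule suite in `bnpl_business_rules.json` might contain expectations like the following, shown here as a Python dict mirroring Great Expectations' JSON suite format. The column names and bounds are illustrative assumptions:

```python
# Illustrative expectation suite; column names and bounds are assumptions.
BNPL_BUSINESS_RULES = {
    "expectation_suite_name": "bnpl_business_rules",
    "expectations": [
        {
            "expectation_type": "expect_column_values_to_not_be_null",
            "kwargs": {"column": "transaction_id"},
        },
        {
            "expectation_type": "expect_column_values_to_be_unique",
            "kwargs": {"column": "transaction_id"},
        },
        {
            "expectation_type": "expect_column_values_to_be_between",
            "kwargs": {"column": "amount", "min_value": 0, "max_value": 10_000},
        },
    ],
}
```

The statistical-pattern suite would follow the same shape, with distribution-oriented expectations instead of hard business rules.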
PR 5: Airflow Orchestration Pipeline
Branch: feat/bnpl-airflow-pipeline
Description: Production-grade workflow orchestration
Scope:
Files:
airflow/dags/daily_bnpl_processing.py
airflow/plugins/bnpl_operators.py
airflow/utils/bnpl_utils.py
DAG Architecture:
extract_daily_batch >> validate_raw_schema >>
normalize_entities >> run_dbt_pipeline >>
validate_business_rules >> publish_quality_metrics
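The quality-gate semantics of that chain (downstream tasks run only if upstream validation passes) can be sketched without Airflow. This is a plain-Python stand-in whose task ids mirror the DAG above, not actual Airflow operator code:

```python
from typing import Callable, List, Tuple


def run_chain(tasks: List[Tuple[str, Callable[[], bool]]]) -> List[str]:
    """Run tasks in order, stopping at the first failure (quality gate).

    Returns the ids of tasks that completed successfully.
    """
    completed = []
    for task_id, fn in tasks:
        if not fn():
            break          # gate: downstream tasks are skipped
        completed.append(task_id)
    return completed
```

So if `validate_raw_schema` fails, `normalize_entities` and everything downstream never runs, which is the behavior the real DAG's dependency chain enforces.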
PR 6: ML Feature Engineering
Branch: feat/bnpl-ml-features
Description: Advanced analytics and ML preparation infrastructure
Scope:
Files:
models/marts/bnpl/mart_bnpl_ml_training_set.sql
models/marts/bnpl/mart_bnpl_risk_features.sql
models/marts/bnpl/mart_bnpl_customer_analytics.sql
scripts/bnpl/feature_pipeline.py
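A `feature_pipeline.py` sketch for simple per-customer aggregates. The feature set shown (transaction count, total and average amount) is an assumption about what the ML marts will expose; real risk features would extend well beyond this:

```python
from collections import defaultdict
from typing import Any, Dict, List


def customer_features(txns: List[Dict[str, Any]]) -> Dict[str, Dict[str, float]]:
    """Aggregate per-customer features from transaction rows.

    Assumes each row carries 'customer_id' and 'amount'; real risk
    features (delinquency, tenure, etc.) would build on this pattern.
    """
    sums: Dict[str, float] = defaultdict(float)
    counts: Dict[str, int] = defaultdict(int)
    for t in txns:
        sums[t["customer_id"]] += t["amount"]
        counts[t["customer_id"]] += 1
    return {
        cid: {
            "txn_count": counts[cid],
            "total_amount": sums[cid],
            "avg_amount": sums[cid] / counts[cid],
        }
        for cid in sums
    }
```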
Technical Architecture
Infrastructure Stack
- Orchestration: Apache Airflow (containerized)
- Data Warehouse: Google BigQuery (partitioned, clustered)
- Transformations: dbt (with proper testing and documentation)
- Data Quality: Great Expectations (integrated with pipeline)
- API Integration: Python with production-grade error handling
Data Volumes and Performance
- Historical Dataset: 1.8M transactions across 365 days
- Daily Processing: 3k-10k transactions (realistic business patterns)
- Storage Strategy: Date-partitioned tables with customer_id clustering
- Processing Pattern: ELT with BigQuery compute optimization
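A quick sanity check that the stated per-day ranges are consistent with the ~1.8M annual total. The day counts here are approximate assumptions (~250 weekdays, ~105 weekend days, ~10 holidays in a year):

```python
# Approximate day counts for one year (assumptions).
WEEKDAYS, WEEKENDS, HOLIDAYS = 250, 105, 10

low = WEEKDAYS * 5_000 + WEEKENDS * 3_000 + HOLIDAYS * 8_000
high = WEEKDAYS * 7_000 + WEEKENDS * 4_000 + HOLIDAYS * 10_000

print(low, high)  # 1645000 2270000 -- 1.8M sits inside this range
```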
Success Metrics
Dependencies and Prerequisites
- BigQuery project permissions and dataset creation rights
- Docker environment for local Airflow development
- simtom API access and rate limit understanding
- Python environment with data engineering dependencies
Risk Mitigation
- API Rate Limits: Implement exponential backoff and respectful request patterns
- Data Volume: Monitor BigQuery costs and implement proper partitioning
- Schema Evolution: Design flexible JSON parsing with schema validation
- Pipeline Failures: Comprehensive error handling and recovery mechanisms
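The rate-limit mitigation above can be sketched as exponential backoff with a retry cap. The attempt count and delay defaults are illustrative, and the request function is passed in rather than hard-coding the simtom client:

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def with_backoff(call: Callable[[], T], attempts: int = 5,
                 base_delay: float = 1.0) -> T:
    """Retry `call`, sleeping base_delay * 2**i between failures.

    The defaults are illustrative, not tuned values; the final failure
    re-raises so callers see the underlying error.
    """
    for i in range(attempts):
        try:
            return call()
        except Exception:
            if i == attempts - 1:
                raise
            time.sleep(base_delay * (2 ** i))
    raise RuntimeError("unreachable")
```

Adding jitter to the sleep is a common refinement when many workers hit the same endpoint.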
Future Considerations
- Real-time streaming pipeline (separate from this historical ingestion)
- Production ML model serving infrastructure
- Advanced analytics and reporting layer
- Data lineage and governance framework