Epic: BNPL Data Engineering Pipeline
Business Objective
Build production-grade data infrastructure for BNPL transaction analysis and ML model development. Ingest a comprehensive historical dataset and establish a scalable processing pipeline.
Architecture Overview
- Data Flow: simtom API → BigQuery (raw) → dbt (transform) → Analytics/ML tables
- Orchestration: Airflow for daily processing patterns
- Quality: Great Expectations for data validation
- Pattern: ELT leveraging BigQuery compute with dbt transformations
Repository Structure
flit-data-platform/
├── scripts/bnpl/ # Data ingestion and utilities
├── airflow/ # Orchestration infrastructure
├── great_expectations/ # Data quality framework
├── models/staging/bnpl/ # dbt staging models
├── models/intermediate/bnpl/ # dbt intermediate models
└── models/marts/bnpl/ # dbt analytics models
Implementation Roadmap
PR 1: Infrastructure Foundation
Branch: feat/bnpl-infrastructure
Description: Core infrastructure setup and project foundation
Scope:
- Create BigQuery datasets (flit_bnpl_raw, flit_bnpl_intermediate, flit_bnpl_marts)
Files:
scripts/bnpl/__init__.py
scripts/bnpl/api_client.py (interface definition)
airflow/docker-compose.yml
docs/bnpl_pipeline_architecture.md
requirements.txt (updated with new dependencies)
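The `api_client.py` interface definition from PR 1 might be sketched as follows. This is a minimal sketch; the class and method names, signatures, and the date/count parameters are illustrative assumptions, not the final contract:

```python
from abc import ABC, abstractmethod
from typing import Any


class BNPLAPIClient(ABC):
    """Interface for the simtom BNPL API client (hypothetical names)."""

    @abstractmethod
    def fetch_batch(self, for_date: str, total_records: int) -> list[dict[str, Any]]:
        """Request a batch of simulated transactions for a given date."""

    @abstractmethod
    def health_check(self) -> bool:
        """Return True if the API endpoint is reachable."""
```

PR 2 would then provide the concrete implementation against the simtom endpoint.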
PR 2: Historical Data Ingestion Engine
Branch: feat/bnpl-ingestion
Description: Comprehensive historical data acquisition with realistic transaction patterns
Scope:
Key Decisions:
- Dataset Size: 1.8M transactions (realistic volume for ML training)
- Time Distribution: Leverage simtom's realistic holiday/weekend patterns
- Payload Strategy: Dynamic total_records based on day type (weekdays: 5-7k, weekends: 3-4k, holidays: 8-10k)
- Scope: Historical ingestion only - production streaming pipeline addressed separately
Files:
scripts/bnpl/ingest_historical_bnpl.py
scripts/bnpl/api_client.py (complete implementation)
scripts/bnpl/data_patterns.py (holiday/weekend logic)
scripts/bnpl/schema_utils.py (JSON normalization)
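The transaction/customer entity separation in `schema_utils.py` could look roughly like this. The field names (`customer`, `customer_id`) are assumptions about the simtom payload shape, not its confirmed schema:

```python
from typing import Any


def split_entities(record: dict[str, Any]) -> tuple[dict[str, Any], dict[str, Any]]:
    """Separate one raw API record into a transaction row and a customer row.

    Assumes the nested customer profile lives under a 'customer' key; the
    actual simtom schema may differ.
    """
    customer = record.get("customer", {})
    transaction = {k: v for k, v in record.items() if k != "customer"}
    # Keep a foreign key on the transaction side for later joins.
    transaction["customer_id"] = customer.get("customer_id")
    return transaction, customer
```

For example, `split_entities({"transaction_id": "t1", "amount": 42.0, "customer": {"customer_id": "c9"}})` yields a transaction row carrying `customer_id` and a separate customer row.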
API Integration:
- URL: https://simtom-production.up.railway.app/stream/bnpl
- Dynamic payload generation based on date characteristics
- Automatic schema extraction for transaction/customer entity separation
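One way to implement the dynamic `total_records` decision above is to derive the day type and sample within the stated ranges. A sketch; the holiday set here is a placeholder assumption, not the real calendar logic in `data_patterns.py`:

```python
import random
from datetime import date
from typing import Optional

# Placeholder holiday set; a real implementation would use a calendar library.
HOLIDAYS = {date(2024, 12, 25), date(2024, 11, 29)}


def total_records_for(day: date, rng: Optional[random.Random] = None) -> int:
    """Pick a record count matching the epic's stated per-day-type ranges."""
    rng = rng or random.Random()
    if day in HOLIDAYS:
        return rng.randint(8_000, 10_000)   # holidays: 8-10k
    if day.weekday() >= 5:
        return rng.randint(3_000, 4_000)    # weekends: 3-4k
    return rng.randint(5_000, 7_000)        # weekdays: 5-7k
```

The ingestion script would call this per date when building each request payload.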
PR 3: dbt Transformation Pipeline
Branch: feat/bnpl-dbt-models
Description: Scalable data transformation pipeline with proper entity modeling
Scope:
Entity Separation Strategy:
- Extract customer profiles from transaction JSON in staging layer
- Deduplicate and enrich customer data in intermediate layer
- Create separate transaction/customer marts with proper relationships
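In Python terms, the intermediate-layer deduplication amounts to keeping one profile per customer_id. The dbt model would express this in SQL; the tie-break rule below (latest seen record wins) is an assumption for illustration:

```python
from typing import Any


def dedupe_customers(rows: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Collapse repeated customer extracts into one profile per customer_id.

    Later rows overwrite earlier ones, i.e. the most recently seen record
    wins -- an assumed tie-break, not a confirmed rule.
    """
    by_id: dict[str, dict[str, Any]] = {}
    for row in rows:
        by_id[row["customer_id"]] = row
    return list(by_id.values())
```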
dbt Models:
models/staging/bnpl/stg_bnpl_raw_transactions.sql
models/staging/bnpl/stg_bnpl_extracted_customers.sql
models/intermediate/bnpl/int_bnpl_transactions_enriched.sql
models/intermediate/bnpl/int_bnpl_customer_profiles.sql
models/marts/bnpl/mart_bnpl_transactions.sql
models/marts/bnpl/mart_bnpl_customers.sql
models/marts/bnpl/mart_bnpl_ml_features.sql
PR 4: Data Quality Framework
Branch: feat/bnpl-data-quality
Description: Comprehensive data validation and monitoring infrastructure
Scope:
Files:
great_expectations/great_expectations.yml
great_expectations/expectations/bnpl_business_rules.json
great_expectations/expectations/bnpl_statistical_patterns.json
great_expectations/checkpoints/daily_quality_gate.yml
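The business-rule suite in `bnpl_business_rules.json` might contain expectations like the following, shown here as a Python dict mirroring Great Expectations' JSON suite format. The column names and bounds are illustrative assumptions:

```python
# Illustrative expectation suite; column names and bounds are assumptions.
BNPL_BUSINESS_RULES = {
    "expectation_suite_name": "bnpl_business_rules",
    "expectations": [
        {
            "expectation_type": "expect_column_values_to_not_be_null",
            "kwargs": {"column": "transaction_id"},
        },
        {
            "expectation_type": "expect_column_values_to_be_unique",
            "kwargs": {"column": "transaction_id"},
        },
        {
            "expectation_type": "expect_column_values_to_be_between",
            "kwargs": {"column": "amount", "min_value": 0, "max_value": 10_000},
        },
    ],
}
```

The statistical-pattern suite would follow the same shape, with distribution-oriented expectations instead of hard business rules.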
PR 5: Airflow Orchestration Pipeline
Branch: feat/bnpl-airflow-pipeline
Description: Production-grade workflow orchestration
Scope:
Files:
airflow/dags/daily_bnpl_processing.py
airflow/plugins/bnpl_operators.py
airflow/utils/bnpl_utils.py
DAG Architecture:
extract_daily_batch >> validate_raw_schema >>
normalize_entities >> run_dbt_pipeline >>
validate_business_rules >> publish_quality_metrics
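The quality-gate semantics of that chain (downstream tasks run only if upstream validation passes) can be sketched without Airflow. This is a plain-Python stand-in whose task ids mirror the DAG above, not actual Airflow operator code:

```python
from typing import Callable, List, Tuple


def run_chain(tasks: List[Tuple[str, Callable[[], bool]]]) -> List[str]:
    """Run tasks in order, stopping at the first failure (quality gate).

    Returns the ids of tasks that completed successfully.
    """
    completed = []
    for task_id, fn in tasks:
        if not fn():
            break          # gate: downstream tasks are skipped
        completed.append(task_id)
    return completed
```

So if `validate_raw_schema` fails, `normalize_entities` and everything downstream never runs, which is the behavior the real DAG's dependency chain enforces.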
PR 6: ML Feature Engineering
Branch: feat/bnpl-ml-features
Description: Advanced analytics and ML preparation infrastructure
Scope:
Files:
models/marts/bnpl/mart_bnpl_ml_training_set.sql
models/marts/bnpl/mart_bnpl_risk_features.sql
models/marts/bnpl/mart_bnpl_customer_analytics.sql
scripts/bnpl/feature_pipeline.py
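A `feature_pipeline.py` sketch for simple per-customer aggregates. The feature set shown (transaction count, total and average amount) is an assumption about what the ML marts will expose; real risk features would extend well beyond this:

```python
from collections import defaultdict
from typing import Any, Dict, List


def customer_features(txns: List[Dict[str, Any]]) -> Dict[str, Dict[str, float]]:
    """Aggregate per-customer features from transaction rows.

    Assumes each row carries 'customer_id' and 'amount'; real risk
    features (delinquency, tenure, etc.) would build on this pattern.
    """
    sums: Dict[str, float] = defaultdict(float)
    counts: Dict[str, int] = defaultdict(int)
    for t in txns:
        sums[t["customer_id"]] += t["amount"]
        counts[t["customer_id"]] += 1
    return {
        cid: {
            "txn_count": counts[cid],
            "total_amount": sums[cid],
            "avg_amount": sums[cid] / counts[cid],
        }
        for cid in sums
    }
```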
Technical Architecture
Infrastructure Stack
- Orchestration: Apache Airflow (containerized)
- Data Warehouse: Google BigQuery (partitioned, clustered)
- Transformations: dbt (with proper testing and documentation)
- Data Quality: Great Expectations (integrated with pipeline)
- API Integration: Python with production-grade error handling
Data Volumes and Performance
- Historical Dataset: 1.8M transactions across 365 days
- Daily Processing: 3k-10k transactions (realistic business patterns)
- Storage Strategy: Date-partitioned tables with customer_id clustering
- Processing Pattern: ELT with BigQuery compute optimization
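A quick sanity check that the stated per-day ranges are consistent with the ~1.8M annual total. The day counts here are approximate assumptions (~250 weekdays, ~105 weekend days, ~10 holidays in a year):

```python
# Approximate day counts for one year (assumptions).
WEEKDAYS, WEEKENDS, HOLIDAYS = 250, 105, 10

low = WEEKDAYS * 5_000 + WEEKENDS * 3_000 + HOLIDAYS * 8_000
high = WEEKDAYS * 7_000 + WEEKENDS * 4_000 + HOLIDAYS * 10_000

print(low, high)  # 1645000 2270000 -- 1.8M sits inside this range
```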
Success Metrics
Dependencies and Prerequisites
- BigQuery project permissions and dataset creation rights
- Docker environment for local Airflow development
- simtom API access and rate limit understanding
- Python environment with data engineering dependencies
Risk Mitigation
- API Rate Limits: Implement exponential backoff and respectful request patterns
- Data Volume: Monitor BigQuery costs and implement proper partitioning
- Schema Evolution: Design flexible JSON parsing with schema validation
- Pipeline Failures: Comprehensive error handling and recovery mechanisms
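The rate-limit mitigation above can be sketched as exponential backoff with a retry cap. The attempt count and delay defaults are illustrative, and the request function is passed in rather than hard-coding the simtom client:

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def with_backoff(call: Callable[[], T], attempts: int = 5,
                 base_delay: float = 1.0) -> T:
    """Retry `call`, sleeping base_delay * 2**i between failures.

    The defaults are illustrative, not tuned values; the final failure
    re-raises so callers see the underlying error.
    """
    for i in range(attempts):
        try:
            return call()
        except Exception:
            if i == attempts - 1:
                raise
            time.sleep(base_delay * (2 ** i))
    raise RuntimeError("unreachable")
```

Adding jitter to the sleep is a common refinement when many workers hit the same endpoint.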
Future Considerations
- Real-time streaming pipeline (separate from this historical ingestion)
- Production ML model serving infrastructure
- Advanced analytics and reporting layer
- Data lineage and governance framework