ShovalBenjer/admaven-python-data-engineering
AdMaven Analytics: Detection & Intelligence Suite


Two tools for ad-tech: a forensic SQL investigation into network fraud using Z-score anomaly detection, and a concurrent Python scraper for competitive domain intelligence. One catches bot farms. The other maps competitor publisher networks at scale.

The suite also exposes an MCP server for agentic integration.


What It Found

  • Advertiser 601040: 400% traffic spike, near-zero conversions -- classic click flood
  • Campaign 653344: 75% CR across 10k+ impressions -- statistically implausible, likely pixel stuffing
  • Philippines traffic surge + 60% CR drop: geographic arbitrage on cheap inventory
  • Z-score threshold: flags any tag CR deviating >1.96 std deviations from advertiser baseline

How It Works

SQL (Snowflake): Dynamic Z-score analysis instead of static thresholds. Adapts to each advertiser's baseline, so naturally low-converting verticals don't generate false positives.

CASE
    WHEN (tag_cr - avg_cr) / NULLIF(std_cr, 0) < -1.96 THEN 'FRAUD_CONFIRMED'
    WHEN ABS((tag_cr - avg_cr) / NULLIF(std_cr, 0)) > 1.96 THEN 'REVIEW_REQUIRED'
    ELSE 'NORMAL'
END AS fraud_flag
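The same per-advertiser baseline logic can be sketched in stdlib Python (illustrative data and helper names, not the repo's actual code):

```python
from statistics import mean, stdev

def flag_tags(rows, z_cut=1.96):
    """rows: (advertiser_id, tag_id, conversion_rate) tuples."""
    by_adv = {}
    for adv, tag, cr in rows:
        by_adv.setdefault(adv, []).append((tag, cr))
    flags = {}
    for adv, tags in by_adv.items():
        crs = [cr for _, cr in tags]
        avg, std = mean(crs), stdev(crs)
        for tag, cr in tags:
            z = (cr - avg) / std if std else 0.0   # NULLIF(std_cr, 0) analogue
            if z < -z_cut:
                flags[tag] = "FRAUD_CONFIRMED"     # CR collapse vs. baseline
            elif z > z_cut:
                flags[tag] = "REVIEW_REQUIRED"     # implausibly high CR
            else:
                flags[tag] = "NORMAL"
    return flags

# Five healthy tags plus one whose CR collapsed:
rows = [("A601040", f"t{i}", cr) for i, cr in
        enumerate([0.050, 0.048, 0.052, 0.049, 0.051, 0.001], start=1)]
flags = flag_tags(rows)
```

Because the baseline is computed per advertiser, a vertical that converts at 0.1% everywhere is not flagged; only deviation from that advertiser's own distribution matters.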

Python pipeline: Map-Reduce architecture -- ProcessPoolExecutor for CPU-bound HTML parsing, asyncio for I/O-bound requests -- so CPU cores and network bandwidth are saturated simultaneously. Polars handles sub-millisecond set operations against the existing client list; DuckDB runs pre-export QA.
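A minimal sketch of that map-reduce shape, with placeholder fetch/parse functions standing in for the repo's real aiohttp and parsing code:

```python
import asyncio
from concurrent.futures import ProcessPoolExecutor

async def fetch(domain: str) -> str:
    # Placeholder for an async HTTP request (e.g. aiohttp in the real pipeline).
    await asyncio.sleep(0)
    return f"<html>{domain}</html>"

def parse(html: str) -> int:
    # CPU-bound stand-in for HTML parsing; returns a trivial metric.
    return len(html)

def process_competitor(domain: str) -> tuple[str, int]:
    # Each worker process runs its own asyncio loop for I/O,
    # then does CPU-bound parsing on that process's core.
    html = asyncio.run(fetch(domain))
    return domain, parse(html)

def run_pipeline(domains: list[str]) -> dict[str, int]:
    with ProcessPoolExecutor() as pool:           # map phase: one task per domain
        results = pool.map(process_competitor, domains)
    return dict(results)                          # reduce phase: merge results
```

The key design point is the split: the process pool parallelizes the parsing that the GIL would otherwise serialize, while each worker's event loop keeps its network requests overlapped.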

Ad detection: Lightweight heuristic engine -- no ML model needed. Scans HTML for weighted tokens (googlesyndication: 1.5, prebid: 1.2, iframe: 0.2). Score > 2.0 = running ads.
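A sketch of that scoring rule, assuming (this detail is not stated above) that each token is weighted by how often it occurs in the page:

```python
# Token weights and threshold taken from the description above.
AD_TOKENS = {"googlesyndication": 1.5, "prebid": 1.2, "iframe": 0.2}
THRESHOLD = 2.0

def ad_score(html: str) -> float:
    h = html.lower()
    # Weight each token by its number of occurrences in the page.
    return sum(h.count(token) * weight for token, weight in AD_TOKENS.items())

def runs_ads(html: str) -> bool:
    return ad_score(html) > THRESHOLD
```

With these weights, a single strong signal (an ad-server script tag) is not enough on its own; crossing 2.0 requires either multiple hits or a combination such as googlesyndication plus prebid.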


Architecture


Pipeline flow: load competitor domains via Polars -> spawn process per competitor -> each process fetches API data, deduplicates against client list, scrapes HTML async -> reduce to single Polars DataFrame -> DuckDB QA -> CSV export.
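The dedup step is an anti-join in Polars terms (`join(..., how="anti")` on the domain column); the set operation it performs reduces to this (hypothetical helper name, stdlib sketch):

```python
def dedupe_against_clients(scraped: list[str], clients: list[str]) -> list[str]:
    # Keep only scraped domains that are not already clients,
    # normalising case and whitespace before comparing.
    client_set = {d.strip().lower() for d in clients}
    return [d for d in scraped if d.strip().lower() not in client_set]

new_domains = dedupe_against_clients(
    ["Example.com", "fresh-site.net"], ["example.com"])
```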


MCP Server Integration

The pipeline is exposed as an MCP (Model Context Protocol) server, enabling LLM agents to invoke fraud detection and scraping tools directly:

import asyncio
from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

async def main():
    server = StdioServerParameters(command="python", args=["src/mcp_server.py"])
    async with stdio_client(server) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()

            # Detect fraud from SQL query
            result = await session.call_tool("detect_fraud", arguments={
                "sql_query": "SELECT tag_id, advertiser_id, converted_pixel FROM impressions..."
            })

            # Scrape competitor websites
            sites = await session.call_tool("scrape_competitor", arguments={
                "domain": "competitor.com"
            })

            # Check HTML for ads
            ad_result = await session.call_tool("check_ads", arguments={
                "html": "<html>...</html>",
                "use_llm": True
            })

asyncio.run(main())

Tools:

  • detect_fraud(sql_query) — Z-score statistical analysis
  • scrape_competitor(domain) — Competitor domain intelligence
  • check_ads(html) — Heuristic + LLM ad detection

See src/mcp_server.py for implementation.


Repository Structure

.
├── src/
│   ├── main.py                    # Legacy standalone script
│   ├── mcp_server.py              # MCP server (new)
│   ├── lib/
│   │   ├── fraud_detection.py     # Z-score analysis
│   │   ├── ad_detection.py        # Heuristic + LLM detection
│   │   └── scraper.py             # Competitor fetching
│   ├── queries_shoval_benjer.sql  # Original SQL fraud logic
│   ├── python_Shoval_Benjer.py    # Original scraper (refactored)
│   └── requirements.txt
├── tests/                         # Unit tests
│   ├── test_heuristic_ad_detect.py
│   ├── test_fraud_detection.py
│   ├── test_json_parsing.py
│   ├── test_scraper.py
│   └── test_mcp_server.py
├── pyproject.toml                 # Package metadata & dev deps
├── .env.example                   # Environment template
├── final_output.csv               # Generated dataset
├── AdMaven_SQL_Investigation_Report.pdf
└── README.md

Setup

pip install -r requirements.txt

Optional .env for GenAI features:

API_KEY=your_email@example.com
HF_TOKEN=your_huggingface_token

Run the scraper:

python python_Shoval_Benjer.py
# Input:  comp_run_time_domains.csv, our_clients.csv
# Output: final_output.csv

About

Production-grade Ad-Tech Intelligence Suite featuring a high-performance Python Map-Reduce scraping pipeline (Polars/DuckDB) and a forensic SQL anomaly detection system (Snowflake) for real-time fraud mitigation and competitive reconnaissance.
