56 changes: 37 additions & 19 deletions benchmarks/README.md
@@ -1,24 +1,48 @@
# FeatCopilot Benchmarks

Comprehensive benchmarks demonstrating FeatCopilot's feature engineering capabilities across 63 datasets.
Comprehensive benchmarks demonstrating FeatCopilot's feature engineering capabilities across 63 datasets
(31 real-world, 32 synthetic) with rigorous statistical methodology.

## Statistical Methodology

- **5-fold stratified cross-validation** with mean ± std reporting
- **Wilcoxon signed-rank test** for statistical significance (p < 0.05)
- **Separate real-world vs synthetic** reporting (primary results on real-world only)
- **Win / Tie / Loss** counts with significance markers
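
One consequence of pairing a Wilcoxon signed-rank test with 5-fold scores is worth making explicit. The sketch below is a pure-Python exact test, not the benchmark's actual implementation, and the fold scores are made up for illustration. It assumes the test is run on the five paired per-fold scores of a single seed. Under that assumption the smallest achievable two-sided p-value is 2/2^5 = 0.0625, so the p < 0.05 threshold cannot be met even when every fold improves.

```python
from itertools import product


def wilcoxon_exact(baseline, candidate):
    """Exact two-sided Wilcoxon signed-rank p-value for paired scores."""
    # Zero differences carry no sign information and are dropped.
    diffs = [c - b for b, c in zip(baseline, candidate) if c != b]
    n = len(diffs)
    if n == 0:
        return 1.0

    # Rank the absolute differences, averaging ranks across ties.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1

    total = sum(ranks)
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    observed = min(w_plus, total - w_plus)

    # Under the null hypothesis every sign assignment is equally likely,
    # so enumerate all 2^n of them and count the ones at least as extreme.
    hits = 0
    for signs in product((0, 1), repeat=n):
        w = sum(r for r, s in zip(ranks, signs) if s)
        if min(w, total - w) <= observed:
            hits += 1
    return hits / 2 ** n


# Hypothetical per-fold accuracies: the candidate wins on all five folds.
baseline_folds = [0.640, 0.650, 0.630, 0.660, 0.645]
candidate_folds = [0.665, 0.670, 0.660, 0.672, 0.668]

p_value = wilcoxon_exact(baseline_folds, candidate_folds)
print(p_value)  # 0.0625, the floor for five paired samples
```

If the benchmark does compute the test this way, the floor would account for the many 0.062 p-values and the zero significant wins in the report. Pooling folds across several seeds, or across datasets, would give the test enough paired samples to reach significance.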

## Latest Results Summary

### Simple Models Benchmark (RandomForest, LogisticRegression/Ridge)

| Metric | Multi-Engine |
|--------|--------------|
| **Datasets** | 63 |
| **Improved** | 31 (49%) |
| **Avg Improvement** | **+7.52%** |
| **Best Improvement** | +144% (triple_interaction_regression) |
#### Real-World Datasets (Primary — 31 INRIA/HuggingFace datasets)

| Metric | Value |
|--------|-------|
| **Datasets** | 31 |
| **Win / Tie / Loss** | 6 / 22 / 3 |
| **Mean Improvement** | +0.15% |
| **Max Regression** | -1.14% (not statistically significant) |

**Key Properties:**
- **Do-no-harm guarantee**: No statistically significant regression on any real-world dataset
- **Selective improvement**: +3.63% on eye_movements, +0.45% on higgs, +0.29% on california
- **Safe fallback**: Automatically falls back to original features when derived features don't help

**Key Highlights:**
- **triple_interaction_regression**: +144% R² improvement
- **xor_regression**: +104% R² improvement
- **pairwise_product_regression**: +70% R² improvement
- **complex_classification**: +16.49% accuracy boost
- **xor_classification**: +16.67% accuracy boost
#### Synthetic Datasets (Supplementary — 32 controlled experiments)

| Metric | Value |
|--------|-------|
| **Datasets** | 32 |
| **Win / Tie / Loss** | 18 / 12 / 2 |
| **Mean Improvement** | +14.49% |
| **Best Improvement** | +120% (xor_regression) |

**Key Highlights (synthetic datasets demonstrate FeatCopilot's capabilities):**
- **xor_regression**: +120% R² improvement (interaction features)
- **triple_interaction_regression**: +114% R² improvement
- **pairwise_product_regression**: +61% R² improvement
- **xor_classification**: +15.3% accuracy boost
- **polynomial_classification**: +12.8% accuracy boost

### AutoML Benchmark (FLAML + AutoGluon, 120s budget)

@@ -27,12 +51,6 @@ Comprehensive benchmarks demonstrating FeatCopilot's feature engineering capabil
| **FLAML** | 10 | 9 (90%) | **+1.85%** |
| **AutoGluon** | 10 | 9 (90%) | **+1.55%** |

**Notable Results:**
- **complex_classification**: +6.67% (FLAML), +7.62% (AutoGluon)
- **xor_classification**: +5.62% (FLAML), +2.42% (AutoGluon)
- **polynomial_regression**: +2.99% (FLAML)
- **titanic**: +1.37% (both frameworks)

### FE Tools Comparison (FeatCopilot vs autofeat vs featuretools)

| Metric | FeatCopilot | autofeat | featuretools |
10 changes: 10 additions & 0 deletions benchmarks/__init__.py
@@ -4,12 +4,17 @@
    CATEGORY_FORECASTING,
    CATEGORY_REGRESSION,
    CATEGORY_TEXT,
    SOURCE_REAL_WORLD,
    SOURCE_SYNTHETIC,
    get_all_datasets,
    get_category_summary,
    get_dataset_info,
    get_text_datasets,
    get_timeseries_datasets,
    is_real_world,
    list_datasets,
    list_real_world_datasets,
    list_synthetic_datasets,
    load_all_datasets,
    load_dataset,
    load_datasets,
@@ -18,6 +23,9 @@
__all__ = [
    # Dataset API
    "list_datasets",
    "list_real_world_datasets",
    "list_synthetic_datasets",
    "is_real_world",
    "load_dataset",
    "load_datasets",
    "load_all_datasets",
@@ -27,6 +35,8 @@
    "CATEGORY_REGRESSION",
    "CATEGORY_FORECASTING",
    "CATEGORY_TEXT",
    "SOURCE_REAL_WORLD",
    "SOURCE_SYNTHETIC",
    # Legacy
    "get_all_datasets",
    "get_timeseries_datasets",
66 changes: 66 additions & 0 deletions benchmarks/datasets.py
@@ -2492,6 +2492,14 @@ def get_text_datasets():
CATEGORY_FORECASTING = "forecasting"
CATEGORY_TEXT = "text"

# Dataset source types
SOURCE_REAL_WORLD = "real_world"
SOURCE_SYNTHETIC = "synthetic"

# Source registry: {name: source_type}
# Tracks whether each dataset is real-world or synthetic
DATASET_SOURCE: dict[str, str] = {}

# Master registry: {name: (loader_func, category, description)}
# All datasets are registered here with their category
DATASET_REGISTRY: dict[str, tuple] = {
@@ -2611,6 +2619,64 @@ def get_text_datasets():
for _name, (_config, _task, _desc) in INRIA_DATASETS.items():
    _category = CATEGORY_CLASSIFICATION if _task == "classification" else CATEGORY_REGRESSION
    DATASET_REGISTRY[_name] = (lambda n=_name: load_inria_dataset(n), _category, f"{_desc} (INRIA)")
    DATASET_SOURCE[_name] = SOURCE_REAL_WORLD

# Tag synthetic datasets
for _name in [
    "titanic",
    "credit_card_fraud",
    "employee_attrition",
    "credit_risk",
    "medical_diagnosis",
    "complex_classification",
    "interaction_classification",
    "customer_churn",
    "xor_classification",
    "polynomial_classification",
    "house_prices",
    "bike_sharing",
    "complex_regression",
    "polynomial_regression",
    "ratio_regression",
    "nonlinear_regression",
    "insurance_claims",
    "xor_regression",
    "quadratic_heavy_regression",
    "pairwise_product_regression",
    "sqrt_log_regression",
    "triple_interaction_regression",
    "sensor_anomaly",
    "retail_demand",
    "server_latency",
    "product_reviews",
    "job_postings",
    "news_classification",
    "customer_support",
    "medical_notes",
    "ecommerce_product",
    "spotify_tracks",
]:
    DATASET_SOURCE[_name] = SOURCE_SYNTHETIC

# Tag HuggingFace datasets as real-world
DATASET_SOURCE["fake_news"] = SOURCE_REAL_WORLD


def is_real_world(dataset_name: str) -> bool:
    """Check whether a dataset is real-world (not synthetic)."""
    return DATASET_SOURCE.get(dataset_name, SOURCE_SYNTHETIC) == SOURCE_REAL_WORLD


def list_real_world_datasets(category: str | None = None) -> list[str]:
    """List only real-world datasets, optionally filtered by category."""
    all_names = list_datasets(category)
    return [n for n in all_names if is_real_world(n)]


def list_synthetic_datasets(category: str | None = None) -> list[str]:
    """List only synthetic datasets, optionally filtered by category."""
    all_names = list_datasets(category)
    return [n for n in all_names if not is_real_world(n)]
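
The default in `is_real_world` has a side effect that is easy to miss: any dataset without an entry in `DATASET_SOURCE` is reported as synthetic. The sketch below reproduces that lookup against a toy registry so it runs on its own. The three-entry name list and `brand_new_dataset` are hypothetical stand-ins, not part of the real registry.

```python
SOURCE_REAL_WORLD = "real_world"
SOURCE_SYNTHETIC = "synthetic"

# Toy stand-in for the module-level DATASET_SOURCE registry.
DATASET_SOURCE: dict[str, str] = {
    "higgs": SOURCE_REAL_WORLD,
    "titanic": SOURCE_SYNTHETIC,
}


def is_real_world(dataset_name: str) -> bool:
    """Same lookup as in the diff: untagged names fall back to synthetic."""
    return DATASET_SOURCE.get(dataset_name, SOURCE_SYNTHETIC) == SOURCE_REAL_WORLD


# "brand_new_dataset" is hypothetical and deliberately left untagged.
names = ["higgs", "titanic", "brand_new_dataset"]
real_world = [n for n in names if is_real_world(n)]
synthetic = [n for n in names if not is_real_world(n)]

print(real_world)  # ['higgs']
print(synthetic)   # ['titanic', 'brand_new_dataset']
```

Falling back to synthetic keeps an untagged dataset out of the primary real-world summary, which is the conservative choice, but it also means a newly added real-world dataset is silently misfiled until it is tagged.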


def list_datasets(category: str | None = None) -> list[str]:
142 changes: 56 additions & 86 deletions benchmarks/simple_models/SIMPLE_MODELS_BENCHMARK.md
@@ -1,102 +1,72 @@
# Simple Models Benchmark Report

**Generated:** 2026-02-26 14:10:12
**Generated:** 2026-04-16 19:14:39
**Models:** RandomForest, LogisticRegression/Ridge
**Cross-Validation:** 5-fold CV × 1 seed(s)
**LLM Enabled:** False
**Datasets:** 63
**Datasets:** 26 (15 real-world, 11 synthetic)

## Summary
## Summary — Real-World Datasets (Primary)

| Metric | Value |
|--------|-------|
| Total Datasets | 63 |
| Classification | 26 |
| Regression | 30 |
| Forecasting | 3 |
| Text Classification | 4 |
| Text Regression | 0 |
| Improved (Tabular) | 31 |
| Avg Improvement | 7.52% |
| Total Datasets | 15 |
| Win / Tie / Loss | 1 / 13 / 1 |
| Significant Wins (p<0.05) | 0 |
| Mean Improvement | +0.18% |
| Median Improvement | +0.02% |
| Max Regression | -1.14% |

## Classification Results
## Summary — Synthetic Datasets (Supplementary)

| Dataset | Baseline | Tabular | Improvement | Features |
|---------|----------|---------|-------------|----------|
| titanic | 0.8268 | 0.8101 | -2.03% | 7→8 |
| credit_card_fraud | 0.9840 | 0.9840 | +0.00% | 30→40 |
| employee_attrition | 0.9252 | 0.9252 | +0.00% | 11→16 |
| credit_risk | 0.8525 | 0.8675 | +1.76% | 10→17 |
| medical_diagnosis | 0.8500 | 0.8367 | -1.57% | 12→21 |
| complex_classification | 0.7125 | 0.8300 | +16.49% | 15→23 |
| interaction_classification | 0.7650 | 0.8075 | +5.56% | 12→17 |
| customer_churn | 0.7750 | 0.7600 | -1.94% | 10→15 |
| xor_classification | 0.6960 | 0.8120 | +16.67% | 20→24 |
| polynomial_classification | 0.7875 | 0.8675 | +10.16% | 15→21 |
| customer_support | 0.8900 | 0.8825 | -0.84% | 10→13 |
| higgs | 0.7129 | 0.7003 | -1.77% | 24→27 |
| covertype | 0.8675 | 0.8414 | -3.01% | 10→13 |
| jannis | 0.7863 | 0.7853 | -0.13% | 54→57 |
| miniboone | 0.9307 | 0.9305 | -0.02% | 50→52 |
| california | 0.8861 | 0.8653 | -2.35% | 8→8 |
| credit | 0.7774 | 0.7559 | -2.77% | 10→11 |
| bank_marketing | 0.8043 | 0.7873 | -2.12% | 7→11 |
| diabetes | 0.6074 | 0.5807 | -4.40% | 7→10 |
| bioresponse | 0.7700 | 0.7802 | +1.32% | 419→419 |
| magic_telescope | 0.8509 | 0.8572 | +0.75% | 10→12 |
| electricity | 0.8984 | 0.8738 | -2.73% | 8→10 |
| covertype_cat | 0.8747 | 0.8819 | +0.82% | 54→55 |
| eye_movements | 0.6373 | 0.6393 | +0.31% | 23→42 |
| road_safety | 0.7815 | 0.7723 | -1.18% | 32→27 |
| albert | 0.6558 | 0.6522 | -0.55% | 31→38 |
| Metric | Value |
|--------|-------|
| Total Datasets | 11 |
| Win / Tie / Loss | 5 / 5 / 1 |
| Mean Improvement | +4.30% |

## Regression Results
## Summary — All Datasets

| Dataset | Baseline R² | Tabular R² | Improvement | Features |
|---------|-------------|------------|-------------|----------|
| house_prices | 0.9798 | 0.9953 | +1.58% | 14→16 |
| bike_sharing | 0.9534 | 0.9697 | +1.71% | 10→12 |
| complex_regression | 0.6339 | 0.8725 | +37.63% | 15→20 |
| polynomial_regression | 0.7321 | 0.8692 | +18.72% | 12→19 |
| ratio_regression | 0.9689 | 0.9784 | +0.98% | 12→19 |
| nonlinear_regression | 0.6086 | 0.8756 | +43.87% | 12→18 |
| insurance_claims | 0.9621 | 0.9644 | +0.24% | 10→10 |
| xor_regression | 0.3330 | 0.6801 | +104.23% | 20→24 |
| quadratic_heavy_regression | 0.7134 | 0.9341 | +30.94% | 18→25 |
| pairwise_product_regression | 0.5132 | 0.8698 | +69.48% | 16→23 |
| sqrt_log_regression | 0.8725 | 0.8997 | +3.12% | 15→25 |
| triple_interaction_regression | 0.3542 | 0.8649 | +144.18% | 18→23 |
| job_postings | 0.9685 | 0.9735 | +0.52% | 10→14 |
| ecommerce_product | 0.9462 | 0.9564 | +1.08% | 10→11 |
| spotify_tracks | 0.9529 | 0.9648 | +1.25% | 13→17 |
| diamonds | 0.9456 | 0.9404 | -0.56% | 6→4 |
| house_sales | 0.8785 | 0.8752 | -0.37% | 15→11 |
| houses | 0.8364 | 0.8381 | +0.20% | 8→9 |
| wine_quality | 0.4972 | 0.4914 | -1.15% | 11→13 |
| abalone | 0.5287 | 0.5319 | +0.61% | 7→8 |
| superconduct | 0.9300 | 0.9302 | +0.02% | 79→79 |
| cpu_act | 0.9798 | 0.9783 | -0.15% | 21→13 |
| elevators | 0.8318 | 0.8288 | -0.36% | 16→20 |
| miami_housing | 0.9146 | 0.9193 | +0.52% | 13→15 |
| bike_sharing_inria | 0.6788 | 0.6530 | -3.80% | 6→7 |
| delays_zurich | 0.0051 | 0.0051 | -0.00% | 11→11 |
| allstate_claims | 0.5013 | 0.5013 | -0.01% | 124→124 |
| mercedes_benz | 0.5572 | 0.5572 | -0.00% | 359→359 |
| nyc_taxi | 0.6391 | 0.6381 | -0.17% | 16→13 |
| brazilian_houses | 0.9960 | 0.9964 | +0.04% | 11→13 |
| Metric | Value |
|--------|-------|
| Total Datasets | 26 |
| Win / Tie / Loss | 6 / 18 / 2 |
| Significant Wins (p<0.05) | 0 |
| Mean Improvement | +1.93% |
| Median Improvement | +0.12% |

## Forecasting Results
## Real-World Classification

| Dataset | Baseline R² | Tabular R² | Improvement | Features |
|---------|-------------|------------|-------------|----------|
| sensor_anomaly | 0.8709 | 0.8720 | +0.12% | 8→8 |
| retail_demand | 0.8738 | 0.8615 | -1.41% | 10→13 |
| server_latency | 0.9926 | 0.9925 | -0.02% | 8→8 |
| Dataset | Baseline Score | FeatCopilot Score | Δ% | p-value | Sig | Features |
|---------|----------------|----------------|-----|---------|-----|----------|
| eye_movements | 0.6442±0.0136 | 0.6676±0.0168 | +3.63% | 0.062 | | 23→30 |
| higgs | 0.7164±0.0042 | 0.7196±0.0040 | +0.45% | 0.062 | | 24→25 |
| california | 0.8965±0.0042 | 0.8991±0.0019 | +0.29% | 0.125 | | 8→8 |
| jannis | 0.7843±0.0022 | 0.7859±0.0029 | +0.21% | 0.188 | | 54→61 |
| road_safety | 0.7759±0.0043 | 0.7773±0.0031 | +0.18% | 0.500 | | 32→36 |
| covertype | 0.8596±0.0044 | 0.8605±0.0046 | +0.11% | 0.438 | | 10→10 |
| bioresponse | 0.7883±0.0105 | 0.7889±0.0108 | +0.07% | 0.875 | | 419→419 |
| bank_marketing | 0.8012±0.0090 | 0.8014±0.0086 | +0.02% | 1.000 | | 7→7 |
| diabetes | 0.6016±0.0027 | 0.6016±0.0028 | -0.01% | 1.000 | | 7→7 |
| miniboone | 0.9309±0.0017 | 0.9301±0.0010 | -0.08% | 0.312 | | 50→50 |
| magic_telescope | 0.8597±0.0054 | 0.8585±0.0038 | -0.15% | 0.500 | | 10→10 |
| albert | 0.6541±0.0045 | 0.6527±0.0023 | -0.22% | 0.438 | | 31→31 |
| credit | 0.7730±0.0055 | 0.7706±0.0073 | -0.31% | 0.188 | | 10→10 |
| electricity | 0.8977±0.0018 | 0.8948±0.0022 | -0.32% | 0.062 | | 8→10 |
| covertype_cat | 0.8734±0.0030 | 0.8634±0.0032 | -1.14% 🔴 | 0.062 | | 54→58 |
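
The Δ% column appears to be the relative change of the mean FeatCopilot score over the mean baseline score. The sketch below recomputes it from the rounded means shown in the tables. The helper name is hypothetical, and the report itself may work from unrounded values, so small last-digit differences are possible on other rows.

```python
def relative_improvement_pct(baseline: float, candidate: float) -> float:
    """Relative change of candidate over baseline, in percent."""
    return (candidate - baseline) / baseline * 100


# Mean scores copied from the eye_movements and higgs rows.
eye_movements = relative_improvement_pct(0.6442, 0.6676)
higgs = relative_improvement_pct(0.7164, 0.7196)

print(f"{eye_movements:+.2f}%")  # +3.63%
print(f"{higgs:+.2f}%")          # +0.45%
```

Because the change is relative to the baseline, the same absolute gain looks larger on a dataset with a low baseline score, which is worth keeping in mind when comparing rows.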

## Text Classification Results
## Synthetic Classification (Supplementary)

| Dataset | Baseline | Tabular | Improvement | Features |
|---------|----------|---------|-------------|----------|
| product_reviews | 0.9350 | 0.9075 | -2.94% | 6→7 |
| news_classification | 0.8720 | 0.8480 | -2.75% | 7→13 |
| medical_notes | 0.7400 | 0.7367 | -0.45% | 5→5 |
| fake_news | 0.9597 | 0.9635 | +0.39% | 2→3 |
| Dataset | Baseline Score | FeatCopilot Score | Δ% | p-value | Sig | Features |
|---------|----------------|----------------|-----|---------|-----|----------|
| xor_classification | 0.6960±0.0180 | 0.8024±0.0054 | +15.29% | 0.062 | | 20→24 |
| polynomial_classification | 0.7790±0.0142 | 0.8790±0.0120 | +12.84% | 0.062 | | 15→21 |
| complex_classification | 0.7200±0.0123 | 0.7910±0.0174 | +9.86% | 0.062 | | 15→19 |
| interaction_classification | 0.7570±0.0110 | 0.8240±0.0232 | +8.85% | 0.062 | | 12→16 |
| credit_risk | 0.8530±0.0179 | 0.8575±0.0203 | +0.53% | 0.500 | | 10→13 |
| customer_churn | 0.7510±0.0060 | 0.7530±0.0137 | +0.27% | 0.812 | | 10→11 |
| customer_support | 0.8935±0.0162 | 0.8955±0.0086 | +0.22% | 1.000 | | 10→13 |
| titanic | 0.8193±0.0116 | 0.8204±0.0119 | +0.14% | 1.000 | | 7→7 |
| credit_card_fraud | 0.9842±0.0004 | 0.9842±0.0004 | +0.00% | 1.000 | | 30→30 |
| employee_attrition | 0.9252±0.0030 | 0.9252±0.0030 | +0.00% | 1.000 | | 11→11 |
| medical_diagnosis | 0.8200±0.0107 | 0.8147±0.0129 | -0.65% 🔴 | 0.375 | | 12→15 |