
feat: Add GroupStandardScaler for scaling variables relative to a giv…#915

Open
ankitlade12 wants to merge 5 commits into feature-engine:main from ankitlade12:feat/group-standard-scaler

Conversation

@ankitlade12
Contributor

@ankitlade12 ankitlade12 commented Mar 10, 2026

Description

This PR introduces the GroupStandardScaler to the feature_engine.scaling module.

Currently, the existing scalers like StandardScaler scale a numerical feature globally, across the entire dataset. However, it is an extremely common pattern in data science to scale a feature relative to its group (e.g., standardizing house_price relative to its neighborhood, or scaling a student's exam_score relative to their class_id).

The GroupStandardScaler resolves this by taking both variables and reference variables (the grouping keys). During fit, it learns the mean and standard deviation for each numerical variable per group. During transform, it scales the variables using their respective group parameters. It gracefully handles unseen groups during transform by falling back to the global mean and standard deviation.
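For illustration, the fit/transform behaviour described above, including the fallback to the global statistics for unseen groups, can be sketched in plain pandas (a minimal sketch of the idea, not the actual implementation; the function names are hypothetical):

```python
import pandas as pd

def fit_group_stats(df, variable, group):
    # Learn per-group mean/std plus the global statistics for fallback.
    stats = df.groupby(group)[variable].agg(["mean", "std"])
    global_stats = (df[variable].mean(), df[variable].std())
    return stats, global_stats

def transform_with_fallback(df, variable, group, stats, global_stats):
    # Map each row's group to its learned mean/std; groups unseen during
    # fit map to NaN and fall back to the global mean/std.
    out = df.copy()
    means = df[group].map(stats["mean"]).fillna(global_stats[0])
    stds = df[group].map(stats["std"]).fillna(global_stats[1])
    out[variable] = (df[variable] - means) / stds
    return out
```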

Changes:

  • Added GroupStandardScaler class in feature_engine/scaling/group_standard.py.
  • Exported GroupStandardScaler in feature_engine/scaling/__init__.py.
  • Included rigorous tests for single-reference scaling, missing values handling, unseen groups fallback, and parameter validation.
  • Added full API documentation in docs/api_doc/scaling/GroupStandardScaler.rst.
  • Added User Guide explanations and examples in docs/user_guide/scaling/GroupStandardScaler.rst.

Examples:

import pandas as pd
from feature_engine.scaling import GroupStandardScaler

df = pd.DataFrame({
    "House_Price": [100000, 150000, 120000, 500000, 550000, 480000],
    "Neighborhood": ["A", "A", "A", "B", "B", "B"]
})

scaler = GroupStandardScaler(
    variables=["House_Price"],   # numerical variables to scale
    reference=["Neighborhood"],  # grouping key(s) to scale within
)

scaler.fit(df)
df_scaled = scaler.transform(df)
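For reference, when fit and transform use the same data, the result of group-wise standardization can be reproduced with plain pandas; this sketch assumes the scaler uses pandas' default sample standard deviation (ddof=1):

```python
import pandas as pd

df = pd.DataFrame({
    "House_Price": [100000, 150000, 120000, 500000, 550000, 480000],
    "Neighborhood": ["A", "A", "A", "B", "B", "B"],
})

# Group-wise z-scoring: each price is standardized against its own
# neighborhood's mean and (sample) standard deviation.
scaled = df.groupby("Neighborhood")["House_Price"].transform(
    lambda x: (x - x.mean()) / x.std()
)
```

After this, each neighborhood's scaled prices have mean 0 and standard deviation 1, so values are comparable across neighborhoods.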

Checklist:

  • I have read the contribution guidelines.
  • I have tested my code locally.
  • I have added documentation for my new feature.
  • I have added unit tests for my changes.

@codecov

codecov bot commented Mar 11, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 98.29%. Comparing base (f72a2b7) to head (0b266f8).

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #915      +/-   ##
==========================================
+ Coverage   98.27%   98.29%   +0.02%     
==========================================
  Files         116      117       +1     
  Lines        4978     5048      +70     
  Branches      795      806      +11     
==========================================
+ Hits         4892     4962      +70     
  Misses         55       55              
  Partials       31       31              

☔ View full report in Codecov by Sentry.
@solegalli
Collaborator

Hi @ankitlade12

Thanks a lot for this contribution. Would you by any chance have a reference where this type of transformation is applied?

I am not sure it belongs in the scaling module. Scaling is usually done to change the scale of a variable without really affecting the overall shape of its distribution, but this transformation will indeed change its shape, so it does not belong in the scaling module.

It's also not a variance-stabilizing transformation, so it is not suitable for the transformation module either. I don't really know where it should be placed.

If you could send a few references, or share more about how, when, and how frequently this is used, maybe we can take it from there?

@ankitlade12
Copy link
Copy Markdown
Contributor Author

Hi @solegalli, thanks for the thoughtful feedback!

References / use cases:

This technique is commonly known as within-group standardization (or group-wise z-scoring) and appears frequently across several domains:

  • Econometrics & panel data: Within-group centering/scaling is a standard preprocessing step in fixed-effects and multilevel/hierarchical models to separate within-group variation from between-group variation. See Gelman & Hill, Data Analysis Using Regression and Multilevel/Hierarchical Models (Ch. 12–13) for a discussion of centering predictors within groups.
  • Education & psychometrics: Standardizing student scores relative to their school, cohort, or test form is routine practice to make scores comparable across groups (e.g., equating exam difficulty across sessions).
  • Sports analytics: Player performance metrics are regularly standardized relative to position or league to enable fair cross-group comparisons.
  • Healthcare / clinical trials: Lab values are often standardized relative to site or demographic group to account for systematic between-group differences before modeling.
  • General ML pipelines: Any time you have grouped/hierarchical data and want to remove between-group scale differences before feeding features into a model, this is the standard approach. It's the preprocessing counterpart to scikit-learn's GroupKFold.

On module placement:

You raise a valid point — this transformation does change the marginal distribution shape by removing between-group variation, which makes it different from a global scaler like StandardScaler.

That said, the core mechanic is still mean-centering and dividing by standard deviation — it's standardization conditioned on a grouping variable. I'd argue it's closest in spirit to scaling, but I'm open to alternatives. A few options:

  1. Keep it in scaling — the operation is standardization, just group-conditional. Users looking for "scaling by group" would naturally look here.
  2. A new submodule like feature_engine.group_transforms — if you anticipate other group-conditional operations (group-wise min-max, group-wise robust scaling, etc.), this could be a clean home.
  3. Place it in transformation — though I agree it's not a stabilizing transformation, so this feels like a weaker fit.
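To illustrate option 2: other group-conditional scalers mentioned there would follow the same pattern as GroupStandardScaler. Here is a hypothetical sketch (not part of this PR) of group-wise min-max scaling in plain pandas:

```python
import pandas as pd

def group_minmax(df, variable, group):
    # Hypothetical group-wise min-max scaling: each value is rescaled
    # to [0, 1] relative to the min/max of its own group.
    g = df.groupby(group)[variable]
    return (df[variable] - g.transform("min")) / (
        g.transform("max") - g.transform("min")
    )
```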

I'm happy to move the class to whichever module you think is best. What's your preference?

