Skip to content

feat: add errors parameter to CategoricalImputer for multimodal variables#908

Open
direkkakkar319-ops wants to merge 34 commits intofeature-engine:mainfrom
direkkakkar319-ops:issue-904-categorical-imputer-multimodal
Open

feat: add errors parameter to CategoricalImputer for multimodal variables#908
direkkakkar319-ops wants to merge 34 commits intofeature-engine:mainfrom
direkkakkar319-ops:issue-904-categorical-imputer-multimodal

Conversation

@direkkakkar319-ops
Copy link
Copy Markdown

Description

Fixes #904

Adds an errors parameter to CategoricalImputer to handle multimodal categorical variables gracefully, instead of always raising a ValueError.

Changes

  • Added errors parameter to CategoricalImputer.__init__() with options 'raise' (default), 'warn', and 'ignore'
  • Updated both single-variable and multi-variable branches in .fit() to respect the new parameter
  • When errors='warn', emits a UserWarning and imputes using the first most frequent category found
  • When errors='ignore', silently imputes using the first most frequent category found
  • Default errors='raise' preserves existing behaviour — no breaking changes
  • Updated docstring with full parameter documentation
  • Added tests covering all three errors values and invalid input
  • Added CHANGELOG.rst entry

Type of Change

  • Bug fix
  • New feature (non-breaking)
  • Breaking change
  • Documentation update

Tests

All new and existing tests pass:

pytest tests/test_imputation/test_categorical_imputer.py

Notes

The existing error message ("contains multiple frequent categories") is preserved so no existing test matchers break.

Screenshot 2026-03-08 195813

@direkkakkar319-ops direkkakkar319-ops marked this pull request as draft March 8, 2026 14:36
@direkkakkar319-ops direkkakkar319-ops marked this pull request as ready for review March 8, 2026 15:11
@direkkakkar319-ops direkkakkar319-ops marked this pull request as draft March 8, 2026 15:14
@codecov
Copy link
Copy Markdown

codecov bot commented Mar 8, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 98.27%. Comparing base (f72a2b7) to head (6f5b4da).
⚠️ Report is 1 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main     #908   +/-   ##
=======================================
  Coverage   98.27%   98.27%           
=======================================
  Files         116      116           
  Lines        4978     4988   +10     
  Branches      795      800    +5     
=======================================
+ Hits         4892     4902   +10     
  Misses         55       55           
  Partials       31       31           

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@direkkakkar319-ops
Copy link
Copy Markdown
Author

@solegalli facing issue with the 2 checks will work on this and move the PR for review in some time

@direkkakkar319-ops direkkakkar319-ops marked this pull request as ready for review March 8, 2026 16:35
@direkkakkar319-ops direkkakkar319-ops marked this pull request as draft March 8, 2026 17:26
@direkkakkar319-ops
Copy link
Copy Markdown
Author

the test pass percentage has increased still below the set benchmark
trying to resolve the problems

@direkkakkar319-ops direkkakkar319-ops marked this pull request as ready for review March 9, 2026 19:06
Copy link
Copy Markdown
Collaborator

@solegalli solegalli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @direkkakkar319-ops

Thank you very much for taking care of this issue and a lot of apologies for taking so long to reply.

As mentioned previously I am travelling, currently in the last month, and not always have time or internet to take care of feature-engine as I would like to.

Anyhow, I went over this PR now, the logic is there, thanks a lot for the implementation.

The tests need a bit of tidying but you covered every angle as well, thanks a lot for that.

Please see my comments below.

Would you be able to take a look at them?

PS: I'm back in Europe from April 13th so I will have normal capacity from then on :)

_fit_transform_docstring,
_transform_imputers_docstring,
)
_feature_names_in_docstring, _imputer_dict_docstring,
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please restore to previous format.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

restored the previous format

find_all_variables,
find_categorical_variables,
)
from feature_engine.variable_handling import (check_all_variables,
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please restore to previous format

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

restored the previous format

type object or categorical. If True, the imputer will select all variables or
accept all variables entered by the user, including those cast as numeric.

errors : str, default='raise'
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of "errors", let's call this parameter "multimodal" so it is immediately obvious what it is about.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

made errors to multimodal

accept all variables entered by the user, including those cast as numeric.

errors : str, default='raise'
Indicates what to do when the selected imputation_method='frequent'
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Indicates what to do when the selected imputation_method='frequent'
Indicates what to do when `imputation_method='frequent'`

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

applied the sugesstion

imputer = CategoricalImputer(imputation_method="missing", errors="warn")
# Should fit without warnings since there's no mode computation
with warnings.catch_warnings():
warnings.simplefilter("error")
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i am not sure we are testing that there were no warning. Could you pls check?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This new approach specifically checks that no warnings containing the message "multiple frequent categories" are raised when imputation_method='missing' , even if errors='warn' is set.

tests cases are passing
Image


def test_errors_ignore_single_variable():
"""errors='ignore' on single multimodal variable — silent, uses first mode."""
X = pd.DataFrame(
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need this test? the logic of the transformer is tested in previous tests

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed test_errors_ignore_single_variable and test_errors_ignore_multiple_variables .

test cases are passing
Image



def test_errors_ignore_multiple_variables():
"""errors='ignore' on multiple multimodal variables — silent, uses first mode."""
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we need this test? the logic of the imputation is tested in previous tests

assert imputer.imputer_dict_["country"] == X["country"].mode()[0]


# =============================================================================
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for highlighting this but pls remove these commented block.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed the commented block

"""
Covers the warnings.warn() inside the SINGLE-VARIABLE block of fit().

The existing test_errors_warn_emits_userwarning uses multimodal_df (2 columns),
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for picking this up. Instead of lengthy comments, we should try and capture the essence of the test in the test name.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes will take care of it

@direkkakkar319-ops
Copy link
Copy Markdown
Author

direkkakkar319-ops commented Mar 26, 2026

made formatting changes for passing the ci:test_style from this

            "city": ["London", "London", "Paris", "Paris", "Berlin", "Berlin", "Madrid"],
            "country": ["UK", "UK", "FR", "FR", "DE", "DE", "ES"],
            "one_mode": ["London", "London", "London", "Paris", "Paris", "Berlin", "Berlin"],

to

            "city": [
                "London", "London", "Paris", "Paris", "Berlin", "Berlin", "Madrid"
            ],
            "country": ["UK", "UK", "FR", "FR", "DE", "DE", "ES"],
            "one_mode": [
                "London", "London", "London", "Paris", "Paris", "Berlin", "Berlin"
            ],

@direkkakkar319-ops
Copy link
Copy Markdown
Author

Hey @solegalli, the only failing checks are the Codecov coverage ones.
Could you guide me on which test file I should add the coverage for the new errors='warn' and errors='raise' branches?

_transform_imputers_docstring,
_variables_attribute_docstring
)
from feature_engine._docstrings.methods import (_fit_transform_docstring,
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please make format match other imports

variables: Union[None, int, str, List[Union[str, int]]] = None,
return_object: bool = False,
ignore_format: bool = False,
errors: str = "raise",
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
errors: str = "raise",
multimodal: str = "raise",

if not isinstance(ignore_format, bool):
raise ValueError("ignore_format takes only booleans True and False")

if errors not in ["raise", "warn", "ignore"]:
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we need to update all of these errors to multimodal :)

)
imputer = CategoricalImputer(imputation_method="frequent", variables="Name")
with pytest.raises(ValueError) as record:
with pytest.raises(ValueError, match=re.escape(msg)):
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

instead of making the test now dependent on re, we can test that it matches just part of the error message, 1 line, that's all we need :)

imputer = CategoricalImputer(imputation_method="frequent")
msg = (
"The variable(s) city, country contain(s) multiple frequent categories. "
"Set errors='warn' or errors='ignore' to allow imputation "
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we can remove the 2nd and 3rd line and then we don't need re


@pytest.mark.parametrize("errors", ["warn", "ignore"])
def test_multimodal_imputation_result(multimodal_df, errors):
"""Check that result is the same when errors='warn' or 'ignore'."""
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pls remove comment


@pytest.mark.parametrize("errors", ["bad_value", 1, True])
def test_errors_invalid_value_raises(errors):
"""Passing an unsupported value for errors should raise ValueError at init."""
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pls remove comments from all tests :)

Copy link
Copy Markdown
Collaborator

@solegalli solegalli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @direkkakkar319-ops

Thank you so much for the quick turnaround. Really appreciate it.

I saw you changed errors to multimodal in the docstring. We also need to change the parameter name and hence the logic throughout. Could you do that?

I think there is one part of the warn logic that is not being tested, I am not sure if you need to test multimodal=warn when passing 1 variable name to the variables parameters?

The logic in the transformer is broken down in 1: when variables get 1 variable, or otherwise. In one of the 2, we are not testing multimodal=warn, i think.

@direkkakkar319-ops
Copy link
Copy Markdown
Author

direkkakkar319-ops commented Mar 27, 2026

Changes made

  1. Rename of errors to multimodal in categorical.py , test_categorical_imputer.py
  2. Added missing Unit test case test_warning_when_single_variable_in_list_is_multimodal in test_categorical_imputer.py

tests cases are passing locally
image

@direkkakkar319-ops
Copy link
Copy Markdown
Author

Hi, @solegalli the changes required in the comments you mentioned i will do then shortly

@solegalli
Copy link
Copy Markdown
Collaborator

Hi, @solegalli the changes required in the comments you mentioned i will do then shortly

Alright! No problem at all. Give me a shout when you are ready! Thanks a lot for your support.

@direkkakkar319-ops
Copy link
Copy Markdown
Author

Hi, @solegalli , Changes have been added , now the pr is ready for your review

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

make CategoricalImputer not raise an error with multimodal variables

2 participants