feat: add errors parameter to CategoricalImputer for multimodal variables#908
Conversation
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #908 +/- ##
=======================================
Coverage 98.27% 98.27%
=======================================
Files 116 116
Lines 4978 4988 +10
Branches 795 800 +5
=======================================
+ Hits 4892 4902 +10
Misses 55 55
Partials 31 31 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
|
@solegalli facing issue with the 2 checks will work on this and move the PR for review in some time |
|
the test pass percentage has increased still below the set benchmark |
solegalli
left a comment
There was a problem hiding this comment.
Thank you very much for taking care of this issue and a lot of apologies for taking so long to reply.
As mentioned previously I am travelling, currently in the last month, and not always have time or internet to take care of feature-engine as I would like to.
Anyhow, I went over this PR now, the logic is there, thanks a lot for the implementation.
The tests need a bit of tidying but you covered every angle as well, thanks a lot for that.
Please see my comments below.
Would you be able to take a look at them?
PS: I'm back in Europe from April 13th so I will have normal capacity from then on :)
| _fit_transform_docstring, | ||
| _transform_imputers_docstring, | ||
| ) | ||
| _feature_names_in_docstring, _imputer_dict_docstring, |
There was a problem hiding this comment.
Please restore to previous format.
There was a problem hiding this comment.
restored the previous format
| find_all_variables, | ||
| find_categorical_variables, | ||
| ) | ||
| from feature_engine.variable_handling import (check_all_variables, |
There was a problem hiding this comment.
please restore to previous format
There was a problem hiding this comment.
restored the previous format
| type object or categorical. If True, the imputer will select all variables or | ||
| accept all variables entered by the user, including those cast as numeric. | ||
|
|
||
| errors : str, default='raise' |
There was a problem hiding this comment.
Instead of "errors", let's call this parameter "multimodal" so it is immediately obvious what it is about.
There was a problem hiding this comment.
made errors to multimodal
| accept all variables entered by the user, including those cast as numeric. | ||
|
|
||
| errors : str, default='raise' | ||
| Indicates what to do when the selected imputation_method='frequent' |
There was a problem hiding this comment.
| Indicates what to do when the selected imputation_method='frequent' | |
| Indicates what to do when `imputation_method='frequent'` |
There was a problem hiding this comment.
applied the sugesstion
| imputer = CategoricalImputer(imputation_method="missing", errors="warn") | ||
| # Should fit without warnings since there's no mode computation | ||
| with warnings.catch_warnings(): | ||
| warnings.simplefilter("error") |
There was a problem hiding this comment.
i am not sure we are testing that there were no warning. Could you pls check?
|
|
||
| def test_errors_ignore_single_variable(): | ||
| """errors='ignore' on single multimodal variable — silent, uses first mode.""" | ||
| X = pd.DataFrame( |
There was a problem hiding this comment.
Do we need this test? the logic of the transformer is tested in previous tests
|
|
||
|
|
||
| def test_errors_ignore_multiple_variables(): | ||
| """errors='ignore' on multiple multimodal variables — silent, uses first mode.""" |
There was a problem hiding this comment.
do we need this test? the logic of the imputation is tested in previous tests
| assert imputer.imputer_dict_["country"] == X["country"].mode()[0] | ||
|
|
||
|
|
||
| # ============================================================================= |
There was a problem hiding this comment.
thanks for highlighting this but pls remove these commented block.
There was a problem hiding this comment.
removed the commented block
| """ | ||
| Covers the warnings.warn() inside the SINGLE-VARIABLE block of fit(). | ||
|
|
||
| The existing test_errors_warn_emits_userwarning uses multimodal_df (2 columns), |
There was a problem hiding this comment.
thanks for picking this up. Instead of lengthy comments, we should try and capture the essence of the test in the test name.
There was a problem hiding this comment.
yes will take care of it
|
made formatting changes for passing the to |
|
Hey @solegalli, the only failing checks are the |
| _transform_imputers_docstring, | ||
| _variables_attribute_docstring | ||
| ) | ||
| from feature_engine._docstrings.methods import (_fit_transform_docstring, |
There was a problem hiding this comment.
please make format match other imports
| variables: Union[None, int, str, List[Union[str, int]]] = None, | ||
| return_object: bool = False, | ||
| ignore_format: bool = False, | ||
| errors: str = "raise", |
There was a problem hiding this comment.
| errors: str = "raise", | |
| multimodal: str = "raise", |
| if not isinstance(ignore_format, bool): | ||
| raise ValueError("ignore_format takes only booleans True and False") | ||
|
|
||
| if errors not in ["raise", "warn", "ignore"]: |
There was a problem hiding this comment.
we need to update all of these errors to multimodal :)
| ) | ||
| imputer = CategoricalImputer(imputation_method="frequent", variables="Name") | ||
| with pytest.raises(ValueError) as record: | ||
| with pytest.raises(ValueError, match=re.escape(msg)): |
There was a problem hiding this comment.
instead of making the test now dependent on re, we can test that it matches just part of the error message, 1 line, that's all we need :)
| imputer = CategoricalImputer(imputation_method="frequent") | ||
| msg = ( | ||
| "The variable(s) city, country contain(s) multiple frequent categories. " | ||
| "Set errors='warn' or errors='ignore' to allow imputation " |
There was a problem hiding this comment.
we can remove the 2nd and 3rd line and then we don't need re
|
|
||
| @pytest.mark.parametrize("errors", ["warn", "ignore"]) | ||
| def test_multimodal_imputation_result(multimodal_df, errors): | ||
| """Check that result is the same when errors='warn' or 'ignore'.""" |
|
|
||
| @pytest.mark.parametrize("errors", ["bad_value", 1, True]) | ||
| def test_errors_invalid_value_raises(errors): | ||
| """Passing an unsupported value for errors should raise ValueError at init.""" |
There was a problem hiding this comment.
pls remove comments from all tests :)
solegalli
left a comment
There was a problem hiding this comment.
Thank you so much for the quick turnaround. Really appreciate it.
I saw you changed errors to multimodal in the docstring. We also need to change the parameter name and hence the logic throughout. Could you do that?
I think there is one part of the warn logic that is not being tested, I am not sure if you need to test multimodal=warn when passing 1 variable name to the variables parameters?
The logic in the transformer is broken down in 1: when variables get 1 variable, or otherwise. In one of the 2, we are not testing multimodal=warn, i think.
|
Hi, @solegalli the changes required in the comments you mentioned i will do then shortly |
Co-authored-by: Soledad Galli <solegalli@protonmail.com>
Co-authored-by: Soledad Galli <solegalli@protonmail.com>
…thub.com/direkkakkar319-ops/feature_engine into issue-904-categorical-imputer-multimodal
…thub.com/direkkakkar319-ops/feature_engine into issue-904-categorical-imputer-multimodal
Alright! No problem at all. Give me a shout when you are ready! Thanks a lot for your support. |
|
Hi, @solegalli , Changes have been added , now the pr is ready for your review |



Description
Fixes #904
Adds an
errorsparameter toCategoricalImputerto handle multimodal categorical variables gracefully, instead of always raising aValueError.Changes
errorsparameter toCategoricalImputer.__init__()with options'raise'(default),'warn', and'ignore'.fit()to respect the new parametererrors='warn', emits aUserWarningand imputes using the first most frequent category founderrors='ignore', silently imputes using the first most frequent category founderrors='raise'preserves existing behaviour — no breaking changeserrorsvalues and invalid inputCHANGELOG.rstentryType of Change
Tests
All new and existing tests pass:
Notes
The existing error message (
"contains multiple frequent categories") is preserved so no existing test matchers break.