feat: Add StringListBinarizer to encode multi-label strings and lists #916
ankitlade12 wants to merge 7 commits into feature-engine:main
Codecov Report
✅ All modified and coverable lines are covered by tests.

@@            Coverage Diff             @@
##             main     #916      +/-   ##
==========================================
+ Coverage   98.27%   98.30%   +0.02%
==========================================
  Files         116      117       +1
  Lines        4978     5063      +85
  Branches      795      814      +19
==========================================
+ Hits         4892     4977      +85
  Misses         55       55
  Partials       31       31
|
…ormat paths, non-str/list rows, get_feature_names_out, _more_tags)
|
Hi @ankitlade12, thanks a lot for suggesting this new transformer. As it stands, it first splits a string on a separator and then applies one-hot encoding (OHE). I am usually not a great fan of doing two different things in one transformer, and here we are mixing string manipulation with encoding. I'd prefer to have one transformer for string manipulation in a relevant module, and defer encoding to the encoders in the encoding module. I do know that strings can be messy, and I'd like to add more string-cleaning functionality to feature-engine. Maybe we could focus this transformer on just splitting strings on a separator, and defer the encoding to the encoders? Users could then apply one after the other to obtain the OHE version of the original string.
|
Hi @solegalli, thanks for the detailed feedback. That's a really good point about separation of concerns! I completely agree that mixing string manipulation with encoding in a single transformer isn't ideal. Breaking it into two composable steps is cleaner and more aligned with feature-engine's design philosophy. Here's what I'm thinking for the revised approach:
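from sklearn.pipeline import Pipeline
from feature_engine.encoding import OneHotEncoder

# StringSplitter is the transformer proposed in this thread; it does not exist yet.
pipe = Pipeline([
    ("split", StringSplitter(variables=["genres"], separator=", ")),
    ("encode", OneHotEncoder(variables=["genres"])),  # or relevant encoder
])

A couple of questions to align on before I refactor: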
Happy to rework the PR once we're aligned on the direction! |
|
Hi @ankitlade12, we wouldn't be modifying the encoders at this stage, so the string splitter needs to output something that those encoders can take as input. Your questions got me thinking, though: I am not sure what the output of this transformer should be. In short, it should be something that suits most use cases. The movie genres example is fairly straightforward; it's so neat that it needs minimal work. I wonder what more complex scenarios would look like, and what output they would need. Have you seen this logic used in other projects? I don't really know what to suggest.
|
Hi @solegalli, great question. Let me share some real-world scenarios and think through the output format.
Real-world use cases beyond movie genres:
In all of these, the raw data arrives as a single string column that needs to be split before any encoding can happen.
On the output format: I think the most natural and pipeline-friendly output would be either to expand each variable into multiple rows, or to keep lists and let downstream transformers handle them. But since feature-engine encoders currently expect one value per cell, I see a few practical options:
Honestly, option 3 (outputting lists) is the cleanest from a separation-of-concerns perspective, but it requires encoder support. Without that, the transformer on its own has limited utility in a pipeline.
A possible middle ground: What if the […]
Alternatively, if you'd prefer to keep it purely as a string operation, I could build the […]
What direction feels right to you? Happy to go either way.
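To make the "expand into multiple rows" idea from the discussion above concrete, here is a minimal sketch in plain pandas. It is not the PR's implementation and it does not use any feature-engine class; the `genres` column and its values are made-up illustration data. It only shows that splitting (a string operation) and encoding can be kept as separate steps, with an extra aggregation needed to get back to one row per observation.

```python
import pandas as pd

# Illustrative data (assumption): one multi-label string per row.
df = pd.DataFrame({"genres": ["action, comedy", "comedy", "thriller, action"]})

# Step 1, string manipulation only: split each cell into a list of tags.
split = df["genres"].str.split(", ")

# Step 2: one tag per row. explode() repeats the original index label
# for every tag that came from the same row.
long = split.explode()

# Step 3, encoding: one-hot encode the single-valued column, then
# collapse back to one row per original index with a row-wise "any".
dummies = pd.get_dummies(long, prefix="genres")
encoded = dummies.groupby(level=0).max()

print(encoded)
```

The long format after step 2 has one value per cell, which is the shape a standard one-hot encoder expects, but the final `groupby(level=0).max()` is the part that a plain encoder would not do on its own.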
Description
This PR introduces the StringListBinarizer to the feature_engine.encoding module.
When dealing with modern datasets (e.g., e-commerce, web logs, or NLP metadata), it's extremely common to have columns containing multiple categories per row. This data usually arrives in one of two ways:
- "action, comedy, thriller"
- ["action", "comedy", "thriller"]
Currently in scikit-learn, users are forced to write messy custom pandas .apply functions or wrestle with MultiLabelBinarizer (which returns raw numpy arrays, strips feature names, and requires an iterable of iterables).
The StringListBinarizer acts as a native Feature-engine transformer that splits delimited strings by a given separator and applies one-hot encoding across all the tags identified in the dataset. It operates directly on pandas DataFrames and returns named Boolean columns (e.g., genres_action, genres_comedy).
Changes:
Examples:
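As a rough illustration of the behaviour described above, the following sketch reproduces the expected output shape with plain pandas. It is not the StringListBinarizer code from this PR: the helper name `binarize_multilabel`, its parameters, and the sample data are all hypothetical, chosen only to show how delimited strings and Python lists could both end up as `genres_<tag>` Boolean columns.

```python
import pandas as pd


def binarize_multilabel(df, variable, separator=","):
    """Hypothetical helper: expand a multi-label column into Boolean columns."""

    def to_tags(value):
        # Delimited strings are split and stripped; lists are used as they are.
        if isinstance(value, str):
            return [tag.strip() for tag in value.split(separator) if tag.strip()]
        return list(value)

    tags = df[variable].map(to_tags)

    # Collect every distinct tag seen in the column, in sorted order.
    categories = sorted({tag for row in tags for tag in row})

    out = df.drop(columns=[variable]).copy()
    for category in categories:
        # True where the row's tag list contains this category.
        out[f"{variable}_{category}"] = tags.map(lambda row, c=category: c in row)
    return out


# Illustrative data (assumption): one string row and one list row.
movies = pd.DataFrame(
    {
        "title": ["A", "B"],
        "genres": ["action, comedy", ["comedy", "thriller"]],
    }
)
result = binarize_multilabel(movies, "genres", separator=",")
print(result)
```

A real transformer would learn the categories in `fit` and reuse them in `transform`, so that unseen tags at prediction time do not change the output columns; this sketch skips that for brevity.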