feat: Add StringListBinarizer to encode multi-label strings and lists #916

Open
ankitlade12 wants to merge 7 commits into feature-engine:main from ankitlade12:feat/string-list-binarizer

@ankitlade12
Contributor

Description

This PR introduces the StringListBinarizer to the feature_engine.encoding module.

When dealing with modern datasets (e.g., e-commerce, web logs, or NLP metadata), it's extremely common to have columns containing multiple categories per row. This data usually arrives in one of two ways:

  1. Comma-delimited strings: "action, comedy, thriller"
  2. Python lists evaluated from JSON: ["action", "comedy", "thriller"]

Currently, users have to write custom pandas .apply functions or work around scikit-learn's MultiLabelBinarizer, which returns raw NumPy arrays rather than named DataFrame columns and requires an iterable of iterables as input.

The StringListBinarizer is a native Feature-engine transformer that splits delimited strings by a given separator and one-hot encodes every tag found in the dataset. It operates directly on pandas DataFrames and returns binary columns named after the variable and the tag (e.g., genres_action, genres_comedy).
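
For comparison, here is a rough plain-pandas sketch of the same split-then-binarize behaviour using `Series.str.get_dummies`. It is not the code in this PR, and the `genres_` prefix is added by hand to mimic the naming described above.

```python
import pandas as pd

df = pd.DataFrame({
    "user_id": [1, 2, 3],
    "genres": ["action, comedy", "comedy", "action, thriller"],
})

# Split each string on the separator and build one 0/1 column per token.
dummies = df["genres"].str.get_dummies(sep=", ").add_prefix("genres_")

# Replace the original column with the indicator columns.
encoded = pd.concat([df.drop(columns="genres"), dummies], axis=1)
print(encoded)
```

Unlike a fitted transformer, this recomputes the columns from whatever frame it is given, so train and test sets can end up with different columns.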

Changes:

  • Added StringListBinarizer class in feature_engine/encoding/string_list_binarizer.py.
  • Exported StringListBinarizer in feature_engine/encoding/__init__.py.
  • Included rigorous tests for delimited string formats, python list formats, unseen categories fallback, and parameter validation.
  • Added full API documentation in docs/api_doc/encoding/StringListBinarizer.rst.
  • Added User Guide explanations and examples in docs/user_guide/encoding/StringListBinarizer.rst.

Examples:

import pandas as pd
from feature_engine.encoding import StringListBinarizer

df = pd.DataFrame({
    "user_id": [1, 2, 3],
    "genres": ["action, comedy", "comedy", "action, thriller"]
})

encoder = StringListBinarizer(
    variables=["genres"],
    separator=", ",
)

encoder.fit(df)
df_encoded = encoder.transform(df)

# Output:
#    user_id  genres_action  genres_comedy  genres_thriller
# 0        1              1              1                0
# 1        2              0              1                0
# 2        3              1              0                1
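
The change list mentions a fallback for unseen categories, but this thread does not say what that fallback is. Below is a minimal sketch, assuming tokens not seen during fit are simply ignored at transform time. `TinyTokenBinarizer` is a hypothetical stand-in written for illustration, not the class added in this PR.

```python
import pandas as pd


class TinyTokenBinarizer:
    """Hypothetical stand-in for illustration; not the class added in this PR."""

    def __init__(self, variable, separator=", "):
        self.variable = variable
        self.separator = separator

    def fit(self, X):
        # Learn the vocabulary from the training data only.
        seen = set()
        for value in X[self.variable]:
            seen.update(value.split(self.separator))
        self.tokens_ = sorted(seen)
        return self

    def transform(self, X):
        row_tokens = [value.split(self.separator) for value in X[self.variable]]
        out = X.drop(columns=self.variable)
        for tok in self.tokens_:
            # Assumed fallback: tokens not seen in fit get no column and are dropped.
            out[f"{self.variable}_{tok}"] = [int(tok in toks) for toks in row_tokens]
        return out


train = pd.DataFrame({"user_id": [1, 2], "genres": ["action, comedy", "comedy"]})
test = pd.DataFrame({"user_id": [3], "genres": ["comedy, horror"]})

encoder = TinyTokenBinarizer("genres").fit(train)
encoded_test = encoder.transform(test)
print(encoded_test)
```

Because the vocabulary is fixed at fit time, the output columns are the same for any frame passed to `transform`, and the unseen token `horror` produces no column.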

@codecov

codecov bot commented Mar 11, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 98.30%. Comparing base (f72a2b7) to head (eae5f5e).

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #916      +/-   ##
==========================================
+ Coverage   98.27%   98.30%   +0.02%     
==========================================
  Files         116      117       +1     
  Lines        4978     5063      +85     
  Branches      795      814      +19     
==========================================
+ Hits         4892     4977      +85     
  Misses         55       55              
  Partials       31       31              

…ormat paths, non-str/list rows, get_feature_names_out, _more_tags)
@solegalli
Collaborator

Hi @ankitlade12

Thanks a lot for the suggestion on the new transformer.

As it stands, this transformer first separates a string based on a separator and then applies OHE.

I am usually not a great fan of doing 2 different things together in one transformer. Here we are mixing string manipulation with encoding.

I'd prefer to have 1 transformer for string manipulation in a relevant module, and defer encoding to the encoders in the encoding module.

I do know that strings can be messy, and I'd like to add more functionality for cleaning strings to feature-engine. Maybe we could focus this transformer on just splitting strings based on a separator, and defer the encoding to the encoders?

Users can use one after the other to obtain the OHE version of the original string.

@ankitlade12
Contributor Author

Hi @solegalli, thanks for the detailed feedback — that's a really good point about separation of concerns!

I completely agree that mixing string manipulation with encoding in a single transformer isn't ideal. Breaking it into two composable steps is cleaner and more aligned with feature-engine's design philosophy.

Here's what I'm thinking for the revised approach:

  1. A StringSplitter transformer (or similar name) that focuses purely on string manipulation — splitting delimited strings like "action, comedy, thriller" into Python lists ["action", "comedy", "thriller"]. This could live in a text or string_manipulation module, or wherever you think string-cleaning utilities best belong.

  2. Defer the one-hot encoding to existing encoders in the encoding module. Users would simply chain them in a pipeline:

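from sklearn.pipeline import Pipeline
from feature_engine.encoding import OneHotEncoder

# StringSplitter is the proposed transformer; it does not exist yet.
pipe = Pipeline([
    ("split", StringSplitter(variables=["genres"], separator=", ")),
    ("encode", OneHotEncoder(variables=["genres"])),  # or relevant encoder
])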

A couple of questions to align on before I refactor:

  • Module placement: Would you prefer the string splitter in a new module (e.g., feature_engine.text or feature_engine.preprocessing), or within an existing module?
  • Scope of the splitter: Should it only handle splitting by separator, or would you also like it to handle rows that already contain Python lists (pass-through)? This was a feature of the original StringListBinarizer that seemed useful for messy real-world data.
  • Encoding compatibility: The split output would be list-valued columns. Do existing feature-engine encoders already support list-type inputs, or would a small adapter/update be needed on the encoding side?
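
To make the scope question concrete, here is a minimal sketch of a split step that passes existing lists through unchanged. `split_tokens` is a hypothetical helper, and turning missing or unexpected values into an empty list is an assumption, not something decided in this thread.

```python
import pandas as pd


def split_tokens(value, separator=", "):
    """Return a list of tokens for one cell (hypothetical helper)."""
    if isinstance(value, list):
        # Already a list: pass it through unchanged.
        return value
    if isinstance(value, str):
        return value.split(separator)
    # Missing or unexpected values become an empty list (an assumption).
    return []


df = pd.DataFrame({"genres": ["action, comedy", ["comedy"], None]})
df["genres"] = df["genres"].map(split_tokens)
print(df["genres"].tolist())
```

The result is a list-valued column, which is exactly the kind of output the encoding compatibility question is about.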

Happy to rework the PR once we're aligned on the direction!

@solegalli
Collaborator

Hi @ankitlade12

We wouldn't be modifying the encoders at this stage, so we need to make string splitter output something that those encoders can take in as inputs.

The questions got me thinking, though: I am not sure what the output of this transformer should be.

In short, the output should be something that is suitable for most use cases. The example with movie genres is fairly straightforward; it is so neat that it needs minimal processing. I wonder what more complex scenarios would look like and what output they would need. Have you seen this logic used in different projects?

I don't really know what to suggest.

@ankitlade12
Contributor Author

Hi @solegalli, great question — let me share some real-world scenarios and think through the output format.

Real-world use cases beyond movie genres:

  1. E-commerce product tags: "wireless, bluetooth, noise-cancelling" — products tagged with multiple attributes, often inconsistently (extra spaces, mixed case).
  2. Survey/form data: "Python; R; SQL" — multi-select responses exported as delimited strings from tools like Google Forms or Qualtrics.
  3. Medical records: "diabetes, hypertension, asthma" — patient condition fields where multiple diagnoses are stored in a single column.
  4. Job postings: "react, node.js, postgresql" — required skills scraped or exported from job boards.

In all of these, the raw data arrives as a single string column that needs to be split before any encoding can happen.
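
As a rough illustration of the cleanup such inputs need before encoding, here is a minimal sketch in plain Python. `normalize_tokens` is a hypothetical helper, and the choices to lowercase, strip whitespace, and de-duplicate are assumptions about what "inconsistent" tags would require.

```python
def normalize_tokens(raw, separator=","):
    """Split a delimited string and tidy each token (hypothetical helper)."""
    tokens = []
    for part in raw.split(separator):
        token = part.strip().lower()
        # Skip empty fragments and keep only the first occurrence of a token.
        if token and token not in tokens:
            tokens.append(token)
    return tokens


print(normalize_tokens("Wireless,  bluetooth , noise-cancelling,"))
# -> ['wireless', 'bluetooth', 'noise-cancelling']
print(normalize_tokens("Python; R; SQL", separator=";"))
# -> ['python', 'r', 'sql']
```

Splitting on the bare delimiter and stripping afterwards tolerates inconsistent spacing better than splitting on ", " directly.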

On the output format:

I think the most natural and pipeline-friendly output would be either to expand each variable into multiple rows, or to keep lists and let downstream transformers handle them. However, since feature-engine encoders currently expect one value per cell, I see a few practical options:

  1. Explode into long format (one value per row) — this changes the DataFrame shape, which could break pipelines and is hard to reverse. Probably not ideal.
  2. Expand into multiple binary columns directly — but that's encoding, which we're trying to separate out.
  3. Output Python lists (e.g., ["action", "comedy"]) — clean separation, but existing encoders don't support list inputs yet.

Honestly, option 3 (outputting lists) is the cleanest from a separation-of-concerns perspective, but it requires encoder support. Without that, the transformer on its own has limited utility in a pipeline.
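
The three output shapes can be compared side by side with a small plain-pandas sketch. This only illustrates the options listed above on toy data; it is not a proposed implementation.

```python
import pandas as pd

df = pd.DataFrame({"user_id": [1, 2], "genres": ["action, comedy", "comedy"]})

# Option 3: keep one row per user, with a Python list in each cell.
as_lists = df.assign(genres=df["genres"].map(lambda s: s.split(", ")))

# Option 1: explode to long format, one token per row (row count changes).
as_long = as_lists.explode("genres")

# Option 2: one binary column per token (this is already an encoding step).
as_binary = df.join(df["genres"].str.get_dummies(sep=", ").add_prefix("genres_"))

print(as_lists.shape, as_long.shape, as_binary.shape)
```

Only the long format changes the number of rows, which is why it is the hardest to fit into a pipeline that expects X and y to stay aligned.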

A possible middle ground:

What if the StringSplitter outputs one new column per unique token with boolean values? This is technically still "binarizing," but the logic is really just: "does this token appear in this row's string?" — which feels more like string manipulation than statistical encoding (no fitting of category frequencies, no handling of unseen categories, etc.).

Alternatively, if you'd prefer to keep it purely as a string operation, I could build the StringSplitter to output lists, and then we could open a follow-up issue/PR to add list-input support to the existing encoders. That way the architecture stays clean and we build toward composability.

What direction feels right to you? Happy to go either way.
