step_woe_bin() for binning numeric and factor predictors #239

@AndrewKostandy

Feature

Thanks for your work on this package.

It would be great to have a recipe step that bins numeric and factor predictors using weight of evidence (WoE) against a binary outcome. Existing functions that do this include woebin() from {scorecard} and woe.binning() from {woeBinning}. The recipe step would do two things:

  1. Bin the numeric or factor predictors (possibly lumping some factor levels together)
  2. Replace the bins / factor levels with their WoE values (as step_woe() currently does)
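
To make the two steps above concrete, here is a minimal base R sketch. It is not the proposed step_woe_bin() and not how {scorecard} or {woeBinning} choose their bins: the helper names fit_woe_bins() and apply_woe_bins() are hypothetical, the break points are supplied by hand rather than optimized, and the toy data is made up for illustration. WoE is taken as log(share of positives / share of negatives) per bin.

```r
# Hypothetical helper: learn one WoE value per bin from training data.
# `breaks` is supplied by the caller; a real binning step would search for it.
fit_woe_bins <- function(x, y, breaks) {
  bin <- cut(x, breaks = breaks, right = FALSE)   # left-closed bins, e.g. [8,14)
  pos <- as.vector(tapply(y == 1, bin, sum))      # positives per bin
  neg <- as.vector(tapply(y == 0, bin, sum))      # negatives per bin
  woe <- log((pos / sum(pos)) / (neg / sum(neg)))
  names(woe) <- levels(bin)
  list(breaks = breaks, woe = woe)
}

# Hypothetical helper: replace new values with the WoE of the bin they fall in.
apply_woe_bins <- function(fit, x) {
  bin <- cut(x, breaks = fit$breaks, right = FALSE)
  unname(fit$woe[as.character(bin)])
}

# Toy data (made up): every bin contains both classes, so no WoE is infinite.
x_train <- c(2, 5, 7, 9, 10, 12, 13, 20, 30, 40)
y_train <- c(0, 0, 1, 0, 1, 0, 1, 1, 1, 0)

fit <- fit_woe_bins(x_train, y_train, breaks = c(-Inf, 8, 14, Inf))
new_woe <- apply_woe_bins(fit, c(3, 11, 50))
```

The fit/apply split mirrors how a recipe step would learn bins and WoE values in prep() on the training set and reuse them in bake() on new data. A bin with no positives or no negatives yields an infinite WoE in this sketch, which a real implementation would need to handle.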

Example with woebin() from {scorecard}:

library(scorecard)
library(rsample)

data("germancredit")
data_split <- initial_split(germancredit, strata = creditability)

germancredit_train <- training(data_split)
germancredit_test <- testing(data_split)

bins <- woebin(germancredit_train, "creditability")
#> ℹ Creating woe binning ...
#> ✔ Binning on 750 rows and 21 columns in 00:00:02

bins$duration.in.month
#>             variable       bin count count_distr   neg   pos   posprob        woe      bin_iv  total_iv breaks is_special_values
#>               <char>    <char> <int>       <num> <int> <int>     <num>      <num>       <num>     <num> <char>            <lgcl>
#> 1: duration.in.month  [-Inf,8)    68  0.09066667    60     8 0.1176471 -1.1676052 0.091925740 0.2587426      8             FALSE
#> 2: duration.in.month    [8,14)   205  0.27333333   151    54 0.2634146 -0.1809979 0.008618949 0.2587426     14             FALSE
#> 3: duration.in.month   [14,16)    53  0.07066667    45     8 0.1509434 -0.8799231 0.044135825 0.2587426     16             FALSE
#> 4: duration.in.month   [16,34)   291  0.38800000   197    94 0.3230241  0.1073889 0.004568290 0.2587426     34             FALSE
#> 5: duration.in.month   [34,44)    76  0.10133333    46    30 0.3947368  0.4198538 0.019193319 0.2587426     44             FALSE
#> 6: duration.in.month [44, Inf)    57  0.07600000    26    31 0.5438596  1.0231885 0.090300448 0.2587426    Inf             FALSE

bins$purpose
#>    variable                                                              bin count count_distr   neg   pos   posprob        woe     bin_iv  total_iv                                                           breaks is_special_values
#>      <char>                                                           <char> <int>       <num> <int> <int>     <num>      <num>      <num>     <num>                                                           <char>            <lgcl>
#> 1:  purpose                                          retraining%,%car (used)    83  0.11066667    70    13 0.1566265 -0.8362480 0.06318318 0.1960758                                          retraining%,%car (used)             FALSE
#> 2:  purpose                                       radio/television%,%repairs   220  0.29333333   172    48 0.2181818 -0.4289956 0.04902807 0.1960758                                       radio/television%,%repairs             FALSE
#> 3:  purpose furniture/equipment%,%business%,%domestic appliances%,%car (new)   395  0.52666667   257   138 0.3493671  0.2254755 0.02791601 0.1960758 furniture/equipment%,%business%,%domestic appliances%,%car (new)             FALSE
#> 4:  purpose                                               education%,%others    52  0.06933333    26    26 0.5000000  0.8472979 0.05594856 0.1960758                                               education%,%others             FALSE

germancredit_test_woe <- woebin_ply(germancredit_test, bins = bins)
#> ℹ Converting into woe values ...
#> ✔ Woe transformating on 250 rows and 20 columns in 00:00:00

head(germancredit_test_woe)
#>    creditability status.of.existing.checking.account_woe duration.in.month_woe credit.history_woe purpose_woe credit.amount_woe savings.account.and.bonds_woe present.employment.since_woe
#>           <fctr>                                   <num>                 <num>              <num>       <num>             <num>                         <num>                        <num>
#> 1:          good                               0.7901394           -0.83910109        -0.73005174  -0.5518446        0.01369884                    -0.7833423                  -0.34989526
#> 2:           bad                               0.7901394            0.06578153        -0.05715841   0.3677248        0.31508105                     0.2344150                   0.06559728
#> 3:          good                               0.2814901            0.80349524         0.10090617  -0.5518446        0.82320031                     0.2344150                   0.06559728
#> 4:          good                              -1.2599785           -0.30766736         0.10090617  -0.5518446       -0.33683660                    -0.7833423                  -0.34989526
#> 5:           bad                               0.2814901            0.06578153        -0.73005174   0.3677248        0.31508105                     0.2344150                   0.21868920
#> 6:           bad                               0.7901394            0.06578153        -0.73005174   0.3677248        0.01369884                     0.2344150                  -0.34989526
#>    installment.rate.in.percentage.of.disposable.income_woe personal.status.and.sex_woe other.debtors.or.guarantors_woe present.residence.since_woe property_woe age.in.years_woe other.installment.plans_woe
#>                                                      <num>                       <num>                           <num>                       <num>        <num>            <num>                       <num>
#> 1:                                             0.095061763                 -0.09790421                       0.0287165                 -0.01712104  -0.56976816       -0.1941560                  -0.1688382
#> 2:                                            -0.004073325                 -0.09790421                       0.0287165                 -0.01712104   0.49062292       -0.1941560                  -0.1688382
#> 3:                                            -0.077291674                 -0.09790421                       0.0287165                  0.14090545   0.09425254       -0.9650809                  -0.1688382
#> 4:                                            -0.077291674                 -0.09790421                       0.0287165                 -0.01712104  -0.56976816       -0.1941560                  -0.1688382
#> 5:                                             0.095061763                 -0.09790421                       0.0287165                  0.14090545   0.09425254       -0.1044233                  -0.1688382
#> 6:                                             0.095061763                 -0.09790421                       0.0287165                 -0.01712104   0.09425254       -0.1941560                  -0.1688382
#>    housing_woe number.of.existing.credits.at.this.bank_woe     job_woe number.of.people.being.liable.to.provide.maintenance.for_woe telephone_woe foreign.worker_woe
#>          <num>                                       <num>       <num>                                                        <num>         <num>              <num>
#> 1:  -0.2121896                                  -0.1009105 -0.02034658                                                   0.01369884   -0.14732471                  0
#> 2:   0.4616354                                  -0.1009105 -0.02034658                                                  -0.06899287    0.09352606                  0
#> 3:   0.4944765                                   0.0534367  0.09858083                                                   0.01369884   -0.14732471                  0
#> 4:  -0.2121896                                   0.0534367 -0.00836825                                                   0.01369884    0.09352606                  0
#> 5:  -0.2121896                                  -0.1009105  0.09858083                                                   0.01369884    0.09352606                  0
#> 6:  -0.2121896                                  -0.1009105 -0.00836825                                                   0.01369884    0.09352606                  0
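
As a cross-check on what the woe, bin_iv and total_iv columns mean, the short sketch below recomputes them for duration.in.month from the neg and pos counts printed in the table above. The formulas (log of the positive share over the negative share, and the share difference times WoE) are inferred from the printed numbers rather than taken from the {scorecard} documentation, so treat the sign convention as an assumption.

```r
# neg / pos counts copied from the bins$duration.in.month output above
neg <- c(60, 151, 45, 197, 46, 26)
pos <- c(8, 54, 8, 94, 30, 31)

pos_distr <- pos / sum(pos)   # share of all positives falling in each bin
neg_distr <- neg / sum(neg)   # share of all negatives falling in each bin

woe      <- log(pos_distr / neg_distr)        # weight of evidence per bin
bin_iv   <- (pos_distr - neg_distr) * woe     # information value per bin
total_iv <- sum(bin_iv)                       # information value of the predictor

print(data.frame(neg, pos, woe, bin_iv))
print(total_iv)
```

The recomputed values should agree with the woe, bin_iv and total_iv columns in the printed table up to display precision.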
