The value for unseen levels is calculated as the mean of the coefficients rather than the mean of the global data. This should be fixed to better reflect the literature.
Make sure that the documentation is changed accordingly.
This change should be largely backward compatible, since it only affects the value assigned to new (unseen) levels.
data <- data.frame(
outcome = rnorm(1000) + c(rep(10, 900), rep(0, 100)),
predictor = c(rep("Big", 900), rep(letters[1:10], each = 10))
)
library(tidyverse)
data |>
count(predictor)
#> predictor n
#> 1 Big 900
#> 2 a 10
#> 3 b 10
#> 4 c 10
#> 5 d 10
#> 6 e 10
#> 7 f 10
#> 8 g 10
#> 9 h 10
#> 10 i 10
#> 11 j 10
data |>
summarize(
mean = mean(outcome),
.by = predictor
)
#> predictor mean
#> 1 Big 9.92621834
#> 2 a -0.12884918
#> 3 b 0.24802560
#> 4 c 0.12339453
#> 5 d 0.33307724
#> 6 e 0.08705590
#> 7 f 0.86433875
#> 8 g 0.42452332
#> 9 h 0.42548890
#> 10 i -0.07257279
#> 11 j -0.67403943
embed:::glm_coefs(y = select(data, outcome), x = pull(data, predictor))
#> # A tibble: 12 × 2
#> ..level ..value
#> <chr> <dbl>
#> 1 a -0.129
#> 2 b 0.248
#> 3 Big 9.93
#> 4 c 0.123
#> 5 d 0.333
#> 6 e 0.0871
#> 7 f 0.864
#> 8 g 0.425
#> 9 h 0.425
#> 10 i -0.0726
#> 11 j -0.674
#> 12 ..new 0.256
mean(data$outcome, trim = 0.1)
#> [1] 9.717217
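The gap between the two numbers above comes from weighting. Averaging the coefficients gives every level the same weight, while averaging the data gives every observation the same weight. The `..new` value of 0.256 in the reprex is consistent with a 10% trimmed mean of the eleven per-level values. The sketch below illustrates the difference on a small, hypothetical data set (`toy` and all variable names are made up for illustration and are not part of embed). It uses plain means and leaves out the trimming for clarity.

```r
# Hypothetical toy data, not the reprex above: one large level, two small ones.
toy <- data.frame(
  outcome   = c(rep(10, 8), 0, 2),
  predictor = c(rep("Big", 8), "a", "b")
)

# Per-level means, which is what the coefficients amount to in the reprex.
level_means <- tapply(toy$outcome, toy$predictor, mean)

# Behavior described in this issue: every level counts equally.
unweighted_new <- mean(level_means)

# Proposed behavior: every observation counts equally.
global_new <- mean(toy$outcome)

# The global mean is the count-weighted mean of the per-level means.
counts <- table(toy$predictor)[names(level_means)]
weighted_new <- weighted.mean(level_means, w = as.numeric(counts))

unweighted_new
global_new
weighted_new
```

With one dominant level, the unweighted value is pulled toward the rare levels, while the global mean stays close to what a typical observation looks like.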