Feature Request
Support for `sample_weights: Vec<f64>` in `RandomForestRegressor::fit()` (and ideally `RandomForestClassifier` and the underlying `DecisionTree` models as well).
Use Case
I'm training a random forest on time-series data where recent observations should be weighted more heavily than older ones (exponential decay: `weight = 0.9^months_ago`). This is a common pattern in scikit-learn:

```python
model.fit(X, y, sample_weight=weights)
```

Without sample weights, there's no way to express "this training example matters more than that one", which is important for recency weighting, class imbalance correction, and importance sampling.
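For concreteness, here is a minimal sketch of how the recency weights in this scheme would be computed (the `months_ago` data is hypothetical, just to make the example self-contained):

```rust
fn main() {
    // Hypothetical age, in months, of each training example.
    let months_ago = [0, 1, 2, 6, 12];

    // Exponential decay: weight = 0.9^months_ago, so newer rows count more.
    let weights: Vec<f64> = months_ago.iter().map(|&m| 0.9_f64.powi(m)).collect();

    println!("{:?}", weights); // newest row gets 1.0, a year-old row ~0.28
}
```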
Current State
Looking at the source code, the internal plumbing is close to supporting this:

- `BaseForestRegressor::sample_with_replacement()` does uniform bootstrap sampling; this could be extended to weighted sampling
- `BaseTreeRegressor::fit_weak_learner()` already accepts `samples: Vec<usize>` (bootstrap counts) and uses them as integer multipliers in split statistics:

  ```rust
  sum += *sample_i as f64 * y_m.get(i).to_f64().unwrap();
  ```

- Generalizing `samples` from `Vec<usize>` (integer counts) to `Vec<f64>` (continuous weights) in the tree splitter would enable this
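To illustrate that generalization, here is a sketch of a weighted mean over a node where `samples` holds `f64` weights instead of integer counts (function and variable names are illustrative, not smartcore's actual internals):

```rust
// Weighted mean of targets `y` over a node. With integer counts cast to
// f64 this reproduces the current behavior; fractional weights now work too.
fn weighted_mean(samples: &[f64], y: &[f64]) -> f64 {
    let (mut sum, mut total_w) = (0.0, 0.0);
    for (w, yi) in samples.iter().zip(y) {
        sum += w * yi; // was: sum += *sample_i as f64 * y_m.get(i)...
        total_w += w;
    }
    sum / total_w
}

fn main() {
    let y = [1.0, 2.0, 3.0];
    // Integer counts [1, 2, 1] expressed as floats: (1 + 4 + 3) / 4 = 2.0
    assert_eq!(weighted_mean(&[1.0, 2.0, 1.0], &y), 2.0);
    // Continuous weights, which Vec<usize> cannot express today:
    println!("{}", weighted_mean(&[0.5, 0.25, 0.25], &y));
}
```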
Proposed API
Option A: add an optional field to the parameters struct:

```rust
RandomForestRegressorParameters {
    // ... existing fields ...
    sample_weights: Option<Vec<f64>>,
}
```

Option B: extend the `fit` signature (a breaking change):

```rust
pub fn fit(x: &X, y: &Y, parameters: P, sample_weights: Option<&[f64]>) -> Result<Self, Failed>
```

Option A is backwards compatible and probably preferable.
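A self-contained sketch of why Option A stays backwards compatible (the struct and builder method here are illustrative stand-ins, not smartcore's actual API):

```rust
// Sketch of Option A: the new field defaults to None, so existing callers
// that never mention sample_weights are unaffected.
#[derive(Default)]
struct RandomForestRegressorParameters {
    // ... existing fields would live here ...
    sample_weights: Option<Vec<f64>>, // None => uniform weighting, as today
}

impl RandomForestRegressorParameters {
    fn with_sample_weights(mut self, w: Vec<f64>) -> Self {
        self.sample_weights = Some(w);
        self
    }
}

fn main() {
    // Existing code keeps compiling: Default gives sample_weights = None.
    let defaults = RandomForestRegressorParameters::default();
    assert!(defaults.sample_weights.is_none());

    // Opt-in path for weighted training.
    let params = RandomForestRegressorParameters::default()
        .with_sample_weights(vec![1.0, 0.9, 0.81]);
    assert_eq!(params.sample_weights.as_ref().unwrap().len(), 3);
    println!("ok");
}
```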
Scope
Two pieces:

- Weighted bootstrap sampling in `BaseForestRegressor`: sample with probability proportional to weights instead of uniformly
- Weighted split statistics in `BaseTreeRegressor`: use float weights instead of integer counts when computing mean/variance for split criteria
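For the first piece, here is a minimal sketch of weighted sampling with replacement via an inverse-CDF lookup over the cumulative weights (the tiny LCG is only a stand-in so the example runs without external crates; a real implementation would use the forest's RNG):

```rust
// Draw `n_draws` indices with replacement, each index chosen with
// probability proportional to its weight.
fn weighted_sample(weights: &[f64], n_draws: usize, mut seed: u64) -> Vec<usize> {
    let total: f64 = weights.iter().sum();
    // Cumulative weights, e.g. [0.05, 0.9, 0.05] -> [0.05, 0.95, 1.0]
    let cumulative: Vec<f64> = weights
        .iter()
        .scan(0.0, |acc, w| {
            *acc += w;
            Some(*acc)
        })
        .collect();
    (0..n_draws)
        .map(|_| {
            // Toy LCG producing a uniform draw in [0, total).
            seed = seed
                .wrapping_mul(6364136223846793005)
                .wrapping_add(1442695040888963407);
            let u = (seed >> 11) as f64 / (1u64 << 53) as f64 * total;
            // First cumulative bucket exceeding u is the sampled index.
            cumulative
                .iter()
                .position(|&c| u < c)
                .unwrap_or(weights.len() - 1)
        })
        .collect()
}

fn main() {
    // Index 1 carries 90% of the mass, so it should dominate the draws.
    let draws = weighted_sample(&[0.05, 0.9, 0.05], 10_000, 42);
    let ones = draws.iter().filter(|&&i| i == 1).count();
    println!("{} of 10000 draws hit index 1", ones);
}
```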
scikit-learn Reference
For reference, scikit-learn's implementation:

- Passes `sample_weight` through to each tree's `fit()`
- Uses weights in bootstrap sampling (weighted random draw with replacement)
- Uses weights in impurity calculations (weighted mean, weighted variance)
- Docs: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html#sklearn.ensemble.RandomForestRegressor.fit
This is one of the most commonly used features in scikit-learn's RandomForest and would make smartcore a much more viable alternative for real-world ML pipelines.
Thank you for maintaining this crate — the WASM-first posture is exactly what drew me to it!