Feature request: sample weights support in RandomForest (and other tree-based models) #356

@nicksrandall

Feature Request

Support for sample_weights: Vec<f64> in RandomForestRegressor::fit() (and ideally RandomForestClassifier and the underlying DecisionTree models as well).

Use Case

I'm training a RandomForest on time-series data where recent observations should be weighted more heavily than older ones (exponential decay: weight = 0.9^months_ago). This is a common pattern in scikit-learn:

model.fit(X, y, sample_weight=weights)

Without sample weights, there's no way to express "this training example matters more than that one" — which is important for recency weighting, class imbalance correction, and importance sampling.
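For concreteness, here is a minimal sketch of how the recency weights described above could be computed on the Rust side before being handed to a model. The function name `recency_weights` and its parameters are illustrative only and are not part of smartcore.

```rust
/// Exponential-decay recency weights: weight = decay^months_ago.
/// Hypothetical helper for illustration; not a smartcore function.
fn recency_weights(months_ago: &[u32], decay: f64) -> Vec<f64> {
    months_ago
        .iter()
        // powi takes an i32 exponent, so the month count is cast here.
        .map(|&m| decay.powi(m as i32))
        .collect()
}
```

Calling `recency_weights(&[0, 1, 2], 0.9)` gives the newest observation a weight of 1 and shrinks each older month by a further factor of 0.9.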

Current State

Looking at the source code, the internal plumbing is close to supporting this:

  • BaseForestRegressor::sample_with_replacement() does uniform bootstrap sampling — this could be extended to weighted sampling
  • BaseTreeRegressor::fit_weak_learner() already accepts samples: Vec<usize> (bootstrap counts) and uses them as integer multipliers in split statistics:
    sum += *sample_i as f64 * y_m.get(i).to_f64().unwrap();
  • Generalizing samples from Vec<usize> (integer counts) to Vec<f64> (continuous weights) in the tree splitter would enable this
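To illustrate the last point, here is a rough sketch of a weighted mean where integer bootstrap counts are just a special case of float weights. This is standalone code written for this issue, not the actual smartcore splitter; `weighted_mean` and `counts_to_weights` are hypothetical names.

```rust
/// Weighted mean of `y`; returns None when the total weight is zero.
/// Hypothetical helper, not taken from smartcore.
fn weighted_mean(weights: &[f64], y: &[f64]) -> Option<f64> {
    let mut sum = 0.0;
    let mut total = 0.0;
    for (&w, &yi) in weights.iter().zip(y.iter()) {
        // Same shape as the existing `count as f64 * y` accumulation,
        // but the multiplier may now be any non-negative float.
        sum += w * yi;
        total += w;
    }
    if total > 0.0 {
        Some(sum / total)
    } else {
        None
    }
}

/// Integer bootstrap counts expressed as float weights.
fn counts_to_weights(samples: &[usize]) -> Vec<f64> {
    samples.iter().map(|&c| c as f64).collect()
}
```

With counts `[2, 0, 1]` and targets `[1.0, 5.0, 4.0]` the mean is 2.0, and scaling the weights to `[0.5, 0.0, 0.25]` gives the same result, which is the property a weighted splitter would rely on.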

Proposed API

Option A — Add to parameters struct:

RandomForestRegressorParameters {
    // ... existing fields ...
    sample_weights: Option<Vec<f64>>,
}

Option B — Extend the fit signature (breaking change):

pub fn fit(x: &X, y: &Y, parameters: P, sample_weights: Option<&[f64]>) -> Result<Self, Failed>

Option A is backwards-compatible and probably preferable.
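As a sketch of what Option A might involve, the snippet below uses a stand-in struct (`ForestParams`, a placeholder and not the real `RandomForestRegressorParameters`) with a builder-style setter and a validation step that `fit` could run before training. All names and the `String` error type are assumptions for illustration.

```rust
/// Stand-in for a parameters struct; hypothetical, not smartcore's type.
#[derive(Debug, Clone, Default)]
struct ForestParams {
    n_trees: usize,
    sample_weights: Option<Vec<f64>>,
}

impl ForestParams {
    /// Builder-style setter so existing call sites stay unchanged.
    fn with_sample_weights(mut self, weights: Vec<f64>) -> Self {
        self.sample_weights = Some(weights);
        self
    }
}

/// Checks that weights, if present, line up with the training rows.
fn validate_weights(params: &ForestParams, n_rows: usize) -> Result<(), String> {
    match &params.sample_weights {
        // No weights supplied: behave exactly as today.
        None => Ok(()),
        Some(w) if w.len() != n_rows => Err(format!(
            "expected {} sample weights, got {}",
            n_rows,
            w.len()
        )),
        Some(w) if w.iter().any(|v| !v.is_finite() || *v < 0.0) => {
            Err("sample weights must be finite and non-negative".to_string())
        }
        Some(w) if w.iter().sum::<f64>() <= 0.0 => {
            Err("sample weights must not all be zero".to_string())
        }
        Some(_) => Ok(()),
    }
}
```

Because the field defaults to `None`, code that builds parameters without weights would keep compiling and behaving as before.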

Scope

Two pieces:

  1. Weighted bootstrap sampling in BaseForestRegressor — sample with probability proportional to weights instead of uniformly
  2. Weighted split statistics in BaseTreeRegressor — use float weights instead of integer counts when computing mean/variance for split criteria
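The first piece could look roughly like the sketch below: draw row indices with probability proportional to their weight and return per-row counts, which is the same shape the tree learner already consumes. To keep it dependency-free it uses a tiny inline xorshift generator; a real implementation would presumably reuse whatever RNG the crate already uses. `XorShift64` and `weighted_bootstrap_counts` are hypothetical names.

```rust
/// Tiny xorshift64 generator so the sketch needs no external crates.
/// The seed must be non-zero.
struct XorShift64(u64);

impl XorShift64 {
    /// Returns a float in [0, 1).
    fn next_f64(&mut self) -> f64 {
        self.0 ^= self.0 << 13;
        self.0 ^= self.0 >> 7;
        self.0 ^= self.0 << 17;
        // Use the top 53 bits as the mantissa of a double.
        (self.0 >> 11) as f64 / (1u64 << 53) as f64
    }
}

/// Draws `n_draws` rows with replacement, proportional to `weights`,
/// and returns how many times each row was drawn.
fn weighted_bootstrap_counts(
    weights: &[f64],
    n_draws: usize,
    rng: &mut XorShift64,
) -> Vec<usize> {
    // Running totals of the weights, used as an inverse CDF.
    let mut cumulative = Vec::with_capacity(weights.len());
    let mut total = 0.0;
    for &w in weights {
        total += w;
        cumulative.push(total);
    }

    let mut counts = vec![0usize; weights.len()];
    if total <= 0.0 {
        return counts; // nothing to sample from
    }
    for _ in 0..n_draws {
        let target = rng.next_f64() * total;
        // First row whose cumulative weight exceeds the target.
        let idx = cumulative
            .partition_point(|&c| c <= target)
            .min(weights.len() - 1);
        counts[idx] += 1;
    }
    counts
}
```

With uniform weights this reduces to the ordinary bootstrap, so the existing behaviour would be the default case rather than a separate code path.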

scikit-learn Reference

For reference, scikit-learn's implementation:

This is one of the most commonly used features in scikit-learn's RandomForest and would make smartcore a much more viable alternative for real-world ML pipelines.

Thank you for maintaining this crate — the WASM-first posture is exactly what drew me to it!
