Enhance performance evaluations #1034
Conversation
Nice feature!

In the vector case, perhaps we should just parallelize with multiple threads by default. It's just serial for now. @OkonSamuel, what do you think?
src/resampling.jl (Outdated)

```julia
# multiple model evaluations:
evaluate(
    models_or_pairs::AbstractVector{<:Union{Machine,Pair{String,<:Model}}}, args...;
```
Is there any reason we are allowing machines to be passed here? I ask because of the type `<:Union{Machine,Pair{String,<:Model}}`; I thought this would have been `<:Union{Model,Pair{String,<:Model}}`.
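For reference, the two element types being compared (an illustrative snippet; it assumes `Machine` and `Model` as exported by MLJBase):

```julia
using MLJBase: Machine, Model

# Element type as written in the diff above — admits raw machines:
AbstractVector{<:Union{Machine,Pair{String,<:Model}}}

# Element type suggested in the question — models or tagged models only:
AbstractVector{<:Union{Model,Pair{String,<:Model}}}
```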
Good catch. This is a mistake. I'll fix and add a test.
If we did this (parallelize by default) we would encounter some issues. For example, we might have a data race: I don't think anything prevents someone from doing this:

```julia
mach1 = machine(ConstantClassifier(), X, y)
evaluate!(["const1" => mach1, "const2" => mach1])
```

If these ran on different threads, we would be modifying the same machine object from different threads, which could lead to a race condition. Although this shouldn't be an issue if we just use the regular …
What if we just throw an error if there are duplicate machines? Would it suffice to test with `===`?

Binding a machine to data does not duplicate the data, as far as I can tell. However, there is at least one MLJ operation that I can think of that mutates data, such as …
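A minimal sketch of the kind of guard being floated here, assuming duplicates are detected by object identity; the helper name and error message are hypothetical, not part of the PR:

```julia
# Hypothetical guard: refuse to proceed if the same machine object appears
# more than once, detecting duplicates with === as suggested above.
function check_no_duplicate_machines(machines::AbstractVector)
    n = length(machines)
    for i in 1:n, j in (i + 1):n
        machines[i] === machines[j] && throw(ArgumentError(
            "The same machine object appears more than once; " *
            "pass distinct machines to avoid concurrent mutation."))
    end
    return nothing
end
```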
@OkonSamuel I think we should just leave out the parallelism for now. Are you happy for me to merge?
Yes @ablaom, we should leave out parallelism for now. Aside from that, I'm happy for this to be merged.
@OkonSamuel Thank you for your review and helpful feedback.
This PR provides a few enhancements for the results of `evaluate` or `evaluate!`, which estimate various kinds of out-of-sample performance of MLJ models (resp., machines). These should make `evaluate` more convenient when applying it to batches of models, to be compared:

- The estimate of standard errors, which is currently only calculated for display of a returned object, is now calculated when the object is constructed, and is a new, user-accessible property, `uncertainty_radius_95`.

- Users can now "tag" their estimates, by doing `evaluate("some tag" => model, ...)` instead of `evaluate(model, ...)`, and the returned object has a new user-accessible property `tag` for storing this. Tags are auto-generated using the model name when not supplied, but for deeply wrapped models this is often inadequate, hence the addition of user tags. The tag is shown when the object is displayed.

- Users can now `evaluate` a vector of models, or tagged models, as in the illustrative sketch following this list, where the user-supplied tags appear in the output.
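An illustrative sketch of the batch form described above. The model choices, the data-generator call, and the exact structure of the returned collection are assumptions, not taken from this PR:

```julia
using MLJ

X, y = make_blobs()  # small synthetic classification dataset

# Evaluate a vector of tagged models in a single call. The tags supplied here
# ("const A", "const B") are illustrative; they appear in the display of the
# result and in the `tag` property of each evaluation.
evaluations = evaluate(
    ["const A" => ConstantClassifier(), "const B" => ConstantClassifier()],
    X, y;
    resampling=CV(nfolds=6),
    measure=log_loss,
)

# Assuming the result behaves like a collection of ordinary evaluation objects,
# each one carries the new user-accessible properties described above:
e = first(evaluations)
e.tag                     # "const A"
e.uncertainty_radius_95   # standard-error-based radius of a 95% interval
```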
Similar changes apply to the `evaluate!(::Machine, ...)` form. In the future we might add a `summarize(evaluations)` to convert the kind of information displayed here into a table.

I found a few corner-case bugs in the display of performance evaluation objects, which I have fixed. I have also added a lot more testing of the display, and added examples to the docstrings for `evaluate` and `evaluate!`.

This PR closes #1031.