test(workflow-operator): add unit test coverage for Sklearn Naive Bayes descriptors #5925
aglinxinyuan wants to merge 1 commit into …
…es descriptors (Bernoulli, Complement, Gaussian, Multinomial)
Codecov Report
✅ All modified and coverable lines are covered by tests.
Additional details and impacted files
@@ Coverage Diff @@
## main #5925 +/- ##
=========================================
Coverage 54.60% 54.60%
+ Complexity 2927 2925 -2
=========================================
Files 1109 1109
Lines 42828 42828
Branches 4608 4608
=========================================
Hits 23385 23385
+ Misses 18081 18080 -1
- Partials 1362 1363 +1
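
The totals in the diff above can be sanity-checked by hand. The sketch below assumes the usual Codecov convention that every tracked line is a hit, a miss, or a partial, and that the coverage percentage is hits divided by total lines. The `Totals` type and the object name are illustrative only and are not part of the PR.

```scala
import java.util.Locale

object CoverageCheck {
  // Figures copied from the Coverage Diff table above.
  final case class Totals(lines: Int, hits: Int, misses: Int, partials: Int) {
    // Assumed formula: hits over all tracked lines.
    def coveragePct: Double = hits.toDouble / lines * 100.0
    // Every tracked line should be exactly one of hit, miss, or partial.
    def consistent: Boolean = hits + misses + partials == lines
    def formatted: String = "%.2f%%".formatLocal(Locale.ROOT, coveragePct)
  }

  val base: Totals = Totals(lines = 42828, hits = 23385, misses = 18081, partials = 1362)
  val pr: Totals = Totals(lines = 42828, hits = 23385, misses = 18080, partials = 1363)

  def main(args: Array[String]): Unit = {
    assert(base.consistent && pr.consistent)
    println(s"main: ${base.formatted}, PR: ${pr.formatted}") // main: 54.60%, PR: 54.60%
  }
}
```

One miss turned into a partial while hits stayed the same, which is why the percentage is unchanged at two decimal places.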
Pull request overview
Adds Scala unit tests in common/workflow-operator to pin the current behavior/contract of the four Sklearn Naive Bayes operator descriptors (Bernoulli, Complement, Gaussian, Multinomial) without changing production code.
Changes:
- Introduces new `AnyFlatSpec` test suites validating `operatorInfo` (name/description/group + port shape) and default config values.
- Verifies `getOutputSchemas` emits the expected `model_name` (STRING) and `model` (BINARY) schema on the declared output port.
- Pins `generatePythonCode` output (sklearn estimator import + pipeline usage) and JSON polymorphic round-trip via `LogicalOp` using the `operatorType` discriminator.
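
The kind of pinning listed above can be sketched without the real classes. In the sketch below, `StandInNaiveBayesDesc` is a hypothetical stand-in, not the actual descriptor from `org.apache.texera.amber.operator.sklearn`, and the Python it emits is an assumed shape rather than the project's real generated code. It uses plain `assert` instead of ScalaTest so it stays self-contained.

```scala
// Hypothetical stand-in for a descriptor; the real Texera classes are not reproduced here.
final case class StandInNaiveBayesDesc(
    estimatorModule: String, // e.g. "sklearn.naive_bayes"
    estimatorClass: String, // e.g. "BernoulliNB"
    countVectorizer: Boolean = false,
    tfidfTransformer: Boolean = false,
    target: String = null,
    text: String = null
) {
  // Assumed shape of the generated Python, for illustration only.
  def generatePythonCode(): String =
    s"""from sklearn.pipeline import make_pipeline
       |from $estimatorModule import $estimatorClass
       |model = make_pipeline($estimatorClass())
       |""".stripMargin
}

object PinningSketch {
  def main(args: Array[String]): Unit = {
    val desc = StandInNaiveBayesDesc("sklearn.naive_bayes", "BernoulliNB")

    // Pin the default config values.
    assert(!desc.countVectorizer && !desc.tfidfTransformer)
    assert(desc.target == null && desc.text == null)

    // Pin the estimator import and the pipeline usage in the generated code.
    val code = desc.generatePythonCode()
    assert(code.contains("from sklearn.naive_bayes import BernoulliNB"))
    assert(code.contains("make_pipeline(BernoulliNB())"))
    println("stand-in contract holds")
  }
}
```

Substring checks like these only fail when the import or the pipeline call changes, so unrelated edits to the generated code do not break them.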
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| common/workflow-operator/src/test/scala/org/apache/texera/amber/operator/sklearn/SklearnBernoulliNaiveBayesOpDescSpec.scala | Adds contract/unit tests for BernoulliNB descriptor behavior and JSON round-trip. |
| common/workflow-operator/src/test/scala/org/apache/texera/amber/operator/sklearn/SklearnComplementNaiveBayesOpDescSpec.scala | Adds contract/unit tests for ComplementNB descriptor behavior and JSON round-trip. |
| common/workflow-operator/src/test/scala/org/apache/texera/amber/operator/sklearn/SklearnGaussianNaiveBayesOpDescSpec.scala | Adds contract/unit tests for GaussianNB descriptor behavior and JSON round-trip. |
| common/workflow-operator/src/test/scala/org/apache/texera/amber/operator/sklearn/SklearnMultinomialNaiveBayesOpDescSpec.scala | Adds contract/unit tests for MultinomialNB descriptor behavior and JSON round-trip. |
| | config | throughput (tuples/sec) | MB/s | latency p50/p95/p99 | max Δ latest / 7d |
|---|---|---|---|---|---|
| 🔴 | bs=10 sw=10 sl=64 | 389 | 0.238 | 25,780/33,058/33,058 us | 🔴 +10.0% / 🔴 +8.4% |
| 🟢 | bs=100 sw=10 sl=64 | 833 | 0.508 | 116,571/139,934/139,934 us | 🟢 -11.4% / 🔴 -6.7% |
| ⚪ | bs=1000 sw=10 sl=64 | 935 | 0.57 | 1,069,345/1,118,882/1,118,882 us | ⚪ within ±5% / 🔴 -10.3% |
Baseline details
Latest main 1c580e5 from same runner
| config | metric | PR | latest main | 7d avg | Δ latest | Δ 7d |
|---|---|---|---|---|---|---|
| bs=10 sw=10 sl=64 | throughput | 389 tuples/sec | 395 tuples/sec | 410.82 tuples/sec | -1.5% | -5.3% |
| bs=10 sw=10 sl=64 | MB/s | 0.238 MB/s | 0.241 MB/s | 0.251 MB/s | -1.2% | -5.1% |
| bs=10 sw=10 sl=64 | p50 | 25,780 us | 23,427 us | 23,785 us | +10.0% | +8.4% |
| bs=10 sw=10 sl=64 | p95 | 33,058 us | 36,524 us | 34,980 us | -9.5% | -5.5% |
| bs=10 sw=10 sl=64 | p99 | 33,058 us | 36,524 us | 34,980 us | -9.5% | -5.5% |
| bs=100 sw=10 sl=64 | throughput | 833 tuples/sec | 819 tuples/sec | 891.94 tuples/sec | +1.7% | -6.6% |
| bs=100 sw=10 sl=64 | MB/s | 0.508 MB/s | 0.5 MB/s | 0.544 MB/s | +1.6% | -6.7% |
| bs=100 sw=10 sl=64 | p50 | 116,571 us | 119,613 us | 112,277 us | -2.5% | +3.8% |
| bs=100 sw=10 sl=64 | p95 | 139,934 us | 158,005 us | 139,802 us | -11.4% | +0.1% |
| bs=100 sw=10 sl=64 | p99 | 139,934 us | 158,005 us | 139,802 us | -11.4% | +0.1% |
| bs=1000 sw=10 sl=64 | throughput | 935 tuples/sec | 920 tuples/sec | 1,041 tuples/sec | +1.6% | -10.2% |
| bs=1000 sw=10 sl=64 | MB/s | 0.57 MB/s | 0.561 MB/s | 0.635 MB/s | +1.6% | -10.3% |
| bs=1000 sw=10 sl=64 | p50 | 1,069,345 us | 1,084,616 us | 972,714 us | -1.4% | +9.9% |
| bs=1000 sw=10 sl=64 | p95 | 1,118,882 us | 1,133,084 us | 1,023,057 us | -1.3% | +9.4% |
| bs=1000 sw=10 sl=64 | p99 | 1,118,882 us | 1,133,084 us | 1,023,057 us | -1.3% | +9.4% |
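
The Δ columns in the table above appear to be plain relative differences against the baseline, that is `(PR - baseline) / baseline`. The sketch below recomputes a few cells from the `bs=10` rows under that assumption. The `DeltaCheck` object and its helper names are illustrative and do not come from the benchmark tooling.

```scala
import java.util.Locale

object DeltaCheck {
  // Relative change of the PR value against a baseline, in percent.
  def pctDelta(pr: Double, baseline: Double): Double =
    (pr - baseline) / baseline * 100.0

  // One decimal place with an explicit sign, matching the table's style.
  def fmt(delta: Double): String = "%+.1f%%".formatLocal(Locale.ROOT, delta)

  def main(args: Array[String]): Unit = {
    // bs=10 throughput: PR 389, latest main 395, 7d avg 410.82 (tuples/sec)
    println(fmt(pctDelta(389, 395))) // -1.5%
    println(fmt(pctDelta(389, 410.82))) // -5.3%
    // bs=10 p50 latency: PR 25,780, latest main 23,427, 7d avg 23,785 (us)
    println(fmt(pctDelta(25780, 23427))) // +10.0%
    println(fmt(pctDelta(25780, 23785))) // +8.4%
  }
}
```

A positive delta is an improvement for throughput and a regression for latency, so the sign has to be read together with the metric.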
Raw CSV
config_idx,batch_size,schema_width,string_len,num_batches,total_ms,total_tuples,total_bytes,tuples_per_sec,mb_per_sec,lat_p50_us,lat_p95_us,lat_p99_us
0,10,10,64,20,513.76,200,128000,389,0.238,25779.92,33057.68,33057.68
1,100,10,64,20,2400.65,2000,1280000,833,0.508,116571.14,139933.82,139933.82
2,1000,10,64,20,21400.53,20000,12800000,935,0.570,1069345.18,1118882.16,1118882.16

…assifier descriptors (apache#5941)

### What changes were proposed in this PR?

Pin behavior of four previously-untested Sklearn linear classifier descriptors in `common/workflow-operator`. No production-code changes.

| Spec | Source class | Tests |
| --- | --- | --- |
| `SklearnLogisticRegressionOpDescSpec` | `SklearnLogisticRegressionOpDesc` | 5 |
| `SklearnLogisticRegressionCVOpDescSpec` | `SklearnLogisticRegressionCVOpDesc` | 5 |
| `SklearnPerceptronOpDescSpec` | `SklearnPerceptronOpDesc` | 5 |
| `SklearnPassiveAggressiveOpDescSpec` | `SklearnPassiveAggressiveOpDesc` | 5 |

**Behavior pinned**

| Surface | Contract |
| --- | --- |
| `operatorInfo` | exact model name + `Sklearn <name> Operator` description; Sklearn group; training/testing input ports + one blocking output |
| field defaults | `countVectorizer`/`tfidfTransformer` `false`; `target`/`text` `null` |
| `getOutputSchemas` | `model_name` (STRING) + `model` (BINARY) keyed by the declared output port |
| `generatePythonCode` | imports the matching sklearn estimator and builds the `make_pipeline` model |
| Round-trip | config fields preserved through the polymorphic `LogicalOp` base, with the correct `operatorType` discriminator |

### Any related issues, documentation, discussions?

Part of the ongoing `workflow-operator` unit-test coverage effort (follow-up to the Sklearn Naive Bayes coverage in apache#5925).

### How was this PR tested?

- `sbt "WorkflowOperator/testOnly *SklearnLogisticRegressionOpDescSpec *SklearnLogisticRegressionCVOpDescSpec *SklearnPerceptronOpDescSpec *SklearnPassiveAggressiveOpDescSpec"`: 20 tests, all green
- `sbt "WorkflowOperator/Test/scalafmtCheck"` and `sbt "WorkflowOperator/scalafixAll --check"`: clean
- CI to confirm

### Was this PR authored or co-authored using generative AI tooling?

Generated-by: Claude Code (Opus 4.8 [1M context])

…d classifier descriptors (apache#5939)

### What changes were proposed in this PR?

Pin behavior of four previously-untested Sklearn tree-based classifier descriptors in `common/workflow-operator`. No production-code changes.

| Spec | Source class | Tests |
| --- | --- | --- |
| `SklearnDecisionTreeOpDescSpec` | `SklearnDecisionTreeOpDesc` | 5 |
| `SklearnExtraTreeOpDescSpec` | `SklearnExtraTreeOpDesc` | 5 |
| `SklearnExtraTreesOpDescSpec` | `SklearnExtraTreesOpDesc` | 5 |
| `SklearnRandomForestOpDescSpec` | `SklearnRandomForestOpDesc` | 5 |

**Behavior pinned**

| Surface | Contract |
| --- | --- |
| `operatorInfo` | exact model name + `Sklearn <name> Operator` description; Sklearn group; training/testing input ports + one blocking output |
| field defaults | `countVectorizer`/`tfidfTransformer` `false`; `target`/`text` `null` |
| `getOutputSchemas` | `model_name` (STRING) + `model` (BINARY) keyed by the declared output port |
| `generatePythonCode` | imports the matching sklearn estimator and builds the `make_pipeline` model |
| Round-trip | config fields preserved through the polymorphic `LogicalOp` base, with the correct `operatorType` discriminator |

### Any related issues, documentation, discussions?

Part of the ongoing `workflow-operator` unit-test coverage effort (follow-up to the Sklearn Naive Bayes coverage in apache#5925).

### How was this PR tested?

- `sbt "WorkflowOperator/testOnly *SklearnDecisionTreeOpDescSpec *SklearnExtraTreeOpDescSpec *SklearnExtraTreesOpDescSpec *SklearnRandomForestOpDescSpec"`: 20 tests, all green
- `sbt "WorkflowOperator/Test/scalafmtCheck"` and `sbt "WorkflowOperator/scalafixAll --check"`: clean
- CI to confirm

### Was this PR authored or co-authored using generative AI tooling?

Generated-by: Claude Code (Opus 4.8 [1M context])

…eighbor classifier descriptors (apache#5945)

### What changes were proposed in this PR?

Pin behavior of four previously-untested Sklearn support-vector and neighbor classifier descriptors in `common/workflow-operator`. No production-code changes.

| Spec | Source class | Tests |
| --- | --- | --- |
| `SklearnSVMOpDescSpec` | `SklearnSVMOpDesc` | 5 |
| `SklearnLinearSVMOpDescSpec` | `SklearnLinearSVMOpDesc` | 5 |
| `SklearnKNNOpDescSpec` | `SklearnKNNOpDesc` | 5 |
| `SklearnNearestCentroidOpDescSpec` | `SklearnNearestCentroidOpDesc` | 5 |

**Behavior pinned**

| Surface | Contract |
| --- | --- |
| `operatorInfo` | exact model name + `Sklearn <name> Operator` description; Sklearn group; training/testing input ports + one blocking output |
| field defaults | `countVectorizer`/`tfidfTransformer` `false`; `target`/`text` `null` |
| `getOutputSchemas` | `model_name` (STRING) + `model` (BINARY) keyed by the declared output port |
| `generatePythonCode` | imports the matching sklearn estimator and builds the `make_pipeline` model |
| Round-trip | config fields preserved through the polymorphic `LogicalOp` base, with the correct `operatorType` discriminator |

### Any related issues, documentation, discussions?

Part of the ongoing `workflow-operator` unit-test coverage effort (follow-up to the Sklearn classifier coverage in apache#5925, apache#5939, apache#5940, apache#5941).

### How was this PR tested?

- `sbt "WorkflowOperator/testOnly *SklearnSVMOpDescSpec *SklearnLinearSVMOpDescSpec *SklearnKNNOpDescSpec *SklearnNearestCentroidOpDescSpec"`: 20 tests, all green
- `sbt "WorkflowOperator/Test/scalafmtCheck"` and `sbt "WorkflowOperator/scalafixAll --check"`: clean
- CI to confirm

### Was this PR authored or co-authored using generative AI tooling?

Generated-by: Claude Code (Opus 4.8 [1M context])

…classifier descriptors (apache#5940)

### What changes were proposed in this PR?

Pin behavior of three previously-untested Sklearn ensemble classifier descriptors in `common/workflow-operator`. No production-code changes.

| Spec | Source class | Tests |
| --- | --- | --- |
| `SklearnAdaptiveBoostingOpDescSpec` | `SklearnAdaptiveBoostingOpDesc` | 5 |
| `SklearnBaggingOpDescSpec` | `SklearnBaggingOpDesc` | 5 |
| `SklearnGradientBoostingOpDescSpec` | `SklearnGradientBoostingOpDesc` | 5 |

**Behavior pinned**

| Surface | Contract |
| --- | --- |
| `operatorInfo` | exact model name + `Sklearn <name> Operator` description; Sklearn group; training/testing input ports + one blocking output |
| field defaults | `countVectorizer`/`tfidfTransformer` `false`; `target`/`text` `null` |
| `getOutputSchemas` | `model_name` (STRING) + `model` (BINARY) keyed by the declared output port |
| `generatePythonCode` | imports the matching sklearn estimator and builds the `make_pipeline` model |
| Round-trip | config fields preserved through the polymorphic `LogicalOp` base, with the correct `operatorType` discriminator |

### Any related issues, documentation, discussions?

Part of the ongoing `workflow-operator` unit-test coverage effort (follow-up to the Sklearn Naive Bayes coverage in apache#5925).

### How was this PR tested?

- `sbt "WorkflowOperator/testOnly *SklearnAdaptiveBoostingOpDescSpec *SklearnBaggingOpDescSpec *SklearnGradientBoostingOpDescSpec"`: 15 tests, all green
- `sbt "WorkflowOperator/Test/scalafmtCheck"` and `sbt "WorkflowOperator/scalafixAll --check"`: clean
- CI to confirm

### Was this PR authored or co-authored using generative AI tooling?

Generated-by: Claude Code (Opus 4.8 [1M context])

…/dummy classifier descriptors (apache#5946)

### What changes were proposed in this PR?

Pin behavior of four previously-untested Sklearn classifier descriptors (ridge/SGD/dummy) in `common/workflow-operator`. No production-code changes.

| Spec | Source class | Tests |
| --- | --- | --- |
| `SklearnRidgeOpDescSpec` | `SklearnRidgeOpDesc` | 5 |
| `SklearnRidgeCVOpDescSpec` | `SklearnRidgeCVOpDesc` | 5 |
| `SklearnSDGOpDescSpec` | `SklearnSDGOpDesc` | 5 |
| `SklearnDummyClassifierOpDescSpec` | `SklearnDummyClassifierOpDesc` | 5 |

**Behavior pinned**

| Surface | Contract |
| --- | --- |
| `operatorInfo` | exact model name + `Sklearn <name> Operator` description; Sklearn group; training/testing input ports + one blocking output |
| field defaults | `countVectorizer`/`tfidfTransformer` `false`; `target`/`text` `null` |
| `getOutputSchemas` | `model_name` (STRING) + `model` (BINARY) keyed by the declared output port |
| `generatePythonCode` | imports the matching sklearn estimator (`RidgeClassifier`/`RidgeClassifierCV`/`SGDClassifier`/`DummyClassifier`) and builds the `make_pipeline` model |
| Round-trip | config fields preserved through the polymorphic `LogicalOp` base, with the correct `operatorType` discriminator |

### Any related issues, documentation, discussions?

Part of the ongoing `workflow-operator` unit-test coverage effort (follow-up to the Sklearn classifier coverage in apache#5925, apache#5939, apache#5940, apache#5941).

### How was this PR tested?

- `sbt "WorkflowOperator/testOnly *SklearnRidgeOpDescSpec *SklearnRidgeCVOpDescSpec *SklearnSDGOpDescSpec *SklearnDummyClassifierOpDescSpec"`: 20 tests, all green
- `sbt "WorkflowOperator/Test/scalafmtCheck"` and `sbt "WorkflowOperator/scalafixAll --check"`: clean
- CI to confirm

### Was this PR authored or co-authored using generative AI tooling?

Generated-by: Claude Code (Opus 4.8 [1M context])

### What changes were proposed in this PR?

Pin behavior of the four previously-untested Sklearn Naive Bayes classifier descriptors in `common/workflow-operator`. No production-code changes.

| Spec | Source class |
| --- | --- |
| `SklearnBernoulliNaiveBayesOpDescSpec` | `SklearnBernoulliNaiveBayesOpDesc` |
| `SklearnComplementNaiveBayesOpDescSpec` | `SklearnComplementNaiveBayesOpDesc` |
| `SklearnGaussianNaiveBayesOpDescSpec` | `SklearnGaussianNaiveBayesOpDesc` |
| `SklearnMultinomialNaiveBayesOpDescSpec` | `SklearnMultinomialNaiveBayesOpDesc` |

**Behavior pinned**

| Surface | Contract |
| --- | --- |
| `operatorInfo` | exact model name + `Sklearn <name> Operator` description; Sklearn group; training/testing input ports + one blocking output |
| field defaults | `countVectorizer`/`tfidfTransformer` `false`; `target`/`text` `null` |
| `getOutputSchemas` | `model_name` (STRING) + `model` (BINARY) keyed by the declared output port |
| `generatePythonCode` | … `BernoulliNB`) via `make_pipeline` |
| Round-trip | config fields preserved through the polymorphic `LogicalOp` base, with the correct `operatorType` discriminator |

### Any related issues, documentation, discussions?

Part of the ongoing `workflow-operator` unit-test coverage effort.

### How was this PR tested?

- `sbt "WorkflowOperator/testOnly *SklearnBernoulliNaiveBayesOpDescSpec *SklearnComplementNaiveBayesOpDescSpec *SklearnGaussianNaiveBayesOpDescSpec *SklearnMultinomialNaiveBayesOpDescSpec"`: 20 tests, all green
- `sbt "WorkflowOperator/Test/scalafmtCheck"` and `sbt "WorkflowOperator/scalafixAll --check"`: clean

### Was this PR authored or co-authored using generative AI tooling?

Generated-by: Claude Code (Opus 4.8 [1M context])
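
The round-trip contract described above, where a value is serialized through a polymorphic base and comes back as the right subtype because of a discriminator field, can be sketched in plain Scala. Everything below is hypothetical: the `StandIn*` types, the discriminator values, and the map-based codec stand in for `LogicalOp` and its JSON handling, which are not shown in this PR.

```scala
// Hypothetical polymorphic base; the real one is LogicalOp.
sealed trait StandInOp {
  def operatorType: String // discriminator that selects the concrete subtype
  def target: String
}
final case class StandInBernoulli(target: String) extends StandInOp {
  val operatorType: String = "StandInBernoulli"
}
final case class StandInGaussian(target: String) extends StandInOp {
  val operatorType: String = "StandInGaussian"
}

object StandInCodec {
  // Serialize to a flat field map, always including the discriminator.
  def encode(op: StandInOp): Map[String, String] =
    Map("operatorType" -> op.operatorType, "target" -> op.target)

  // Deserialize through the base type, dispatching on the discriminator.
  def decode(fields: Map[String, String]): StandInOp =
    fields("operatorType") match {
      case "StandInBernoulli" => StandInBernoulli(fields("target"))
      case "StandInGaussian"  => StandInGaussian(fields("target"))
      case other =>
        throw new IllegalArgumentException(s"unknown operatorType: $other")
    }
}

object RoundTripSketch {
  def main(args: Array[String]): Unit = {
    val original: StandInOp = StandInBernoulli("label")
    val restored = StandInCodec.decode(StandInCodec.encode(original))
    assert(restored.isInstanceOf[StandInBernoulli]) // subtype survives the base-typed trip
    assert(restored == original) // config fields are preserved
    println(restored.operatorType)
  }
}
```

A round-trip test of this shape catches two separate regressions: a renamed or missing discriminator value, and a config field that is silently dropped during serialization.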