test(workflow-operator): add unit test coverage for Sklearn Naive Bayes descriptors by aglinxinyuan · Pull Request #5925 · apache/texera

aglinxinyuan · 2026-06-24T04:07:45Z

What changes were proposed in this PR?

Pin behavior of the four previously-untested Sklearn Naive Bayes classifier descriptors in common/workflow-operator. No production-code changes.

| Spec | Source class | Tests |
| --- | --- | --- |
| `SklearnBernoulliNaiveBayesOpDescSpec` | `SklearnBernoulliNaiveBayesOpDesc` | 5 |
| `SklearnComplementNaiveBayesOpDescSpec` | `SklearnComplementNaiveBayesOpDesc` | 5 |
| `SklearnGaussianNaiveBayesOpDescSpec` | `SklearnGaussianNaiveBayesOpDesc` | 5 |
| `SklearnMultinomialNaiveBayesOpDescSpec` | `SklearnMultinomialNaiveBayesOpDesc` | 5 |

Behavior pinned

| Surface | Contract |
| --- | --- |
| `operatorInfo` | exact model name + `Sklearn <name> Operator` description; Sklearn group; training/testing input ports + one blocking output |
| field defaults | `countVectorizer`/`tfidfTransformer` `false`; `target`/`text` `null` |
| `getOutputSchemas` | `model_name` (STRING) + `model` (BINARY) keyed by the declared output port |
| `generatePythonCode` | imports and instantiates the matching sklearn estimator (e.g. `BernoulliNB`) via `make_pipeline` |
| Round-trip | config fields preserved through the polymorphic `LogicalOp` base, with the correct `operatorType` discriminator |
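The round-trip contract in the last row can be illustrated with a small analogy. The sketch below is stdlib-only Python, not the Scala code under test: the registry, the `to_json`/`from_json` helpers, the class names, and the discriminator string `"HypotheticalBernoulliNB"` are all assumptions made for illustration. Only the field names (`operatorType`, `countVectorizer`, `tfidfTransformer`, `target`, `text`) and their defaults come from the table above.

```python
import json
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass
class LogicalOp:
    """Stand-in for the polymorphic base; the real descriptors are Scala classes."""
    countVectorizer: bool = False
    tfidfTransformer: bool = False
    target: Optional[str] = None
    text: Optional[str] = None


@dataclass
class BernoulliNaiveBayesOp(LogicalOp):
    """Hypothetical subtype used only to exercise the discriminator."""


# Hypothetical discriminator registry: type tag -> concrete class.
REGISTRY = {"HypotheticalBernoulliNB": BernoulliNaiveBayesOp}
TYPE_OF = {cls: tag for tag, cls in REGISTRY.items()}


def to_json(op: LogicalOp) -> str:
    # Write the discriminator next to the config fields.
    return json.dumps({"operatorType": TYPE_OF[type(op)], **asdict(op)})


def from_json(raw: str) -> LogicalOp:
    # Read through the base type: the discriminator selects the subclass.
    payload = json.loads(raw)
    cls = REGISTRY[payload.pop("operatorType")]
    return cls(**payload)


original = BernoulliNaiveBayesOp(countVectorizer=True, target="label", text="body")
restored = from_json(to_json(original))
print(type(restored).__name__, restored == original)
```

A round-trip test in this style checks two things at once: the restored object has the concrete subtype, and every config field survived serialization.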

Any related issues, documentation, discussions?

Part of the ongoing workflow-operator unit-test coverage effort.

How was this PR tested?

- `sbt "WorkflowOperator/testOnly *SklearnBernoulliNaiveBayesOpDescSpec *SklearnComplementNaiveBayesOpDescSpec *SklearnGaussianNaiveBayesOpDescSpec *SklearnMultinomialNaiveBayesOpDescSpec"`: 20 tests, all green
- `sbt "WorkflowOperator/Test/scalafmtCheck"` and `sbt "WorkflowOperator/scalafixAll --check"`: clean
- CI to confirm

Was this PR authored or co-authored using generative AI tooling?

Generated-by: Claude Code (Opus 4.8 [1M context])

github-actions · 2026-06-24T04:07:59Z

Automated Reviewer Suggestions

Based on the git blame history of the changed files, we recommend the following reviewers:

No candidates found from git blame history.

codecov-commenter · 2026-06-24T04:10:33Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 54.60%. Comparing base (1c580e5) to head (7a72636).

Additional details and impacted files

@@            Coverage Diff            @@
##               main    #5925   +/-   ##
=========================================
  Coverage     54.60%   54.60%           
+ Complexity     2927     2925    -2     
=========================================
  Files          1109     1109           
  Lines         42828    42828           
  Branches       4608     4608           
=========================================
  Hits          23385    23385           
+ Misses        18081    18080    -1     
- Partials       1362     1363    +1

| Flag | Coverage Δ | *Carryforward flag |
| --- | --- | --- |
| access-control-service | `70.44% <ø> (ø)` | |
| agent-service | `34.36% <ø> (ø)` | Carriedforward from 1c580e5 |
| amber | `56.73% <ø> (ø)` | |
| computing-unit-managing-service | `1.65% <ø> (ø)` | |
| config-service | `57.35% <ø> (ø)` | |
| file-service | `58.59% <ø> (ø)` | |
| frontend | `48.31% <ø> (ø)` | Carriedforward from 1c580e5 |
| pyamber | `90.20% <ø> (ø)` | Carriedforward from 1c580e5 |
| python | `90.76% <ø> (ø)` | Carriedforward from 1c580e5 |
| workflow-compiling-service | `58.69% <ø> (ø)` | |

*This pull request uses carry forward flags.

Copilot

Pull request overview

Adds Scala unit tests in common/workflow-operator to pin the current behavior/contract of the four Sklearn Naive Bayes operator descriptors (Bernoulli, Complement, Gaussian, Multinomial) without changing production code.

Changes:

- Introduces new `AnyFlatSpec` test suites validating `operatorInfo` (name/description/group + port shape) and default config values.
- Verifies `getOutputSchemas` emits the expected `model_name` (STRING) and `model` (BINARY) schema on the declared output port.
- Pins `generatePythonCode` output (sklearn estimator import + pipeline usage) and JSON polymorphic round-trip via `LogicalOp` using the `operatorType` discriminator.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| common/workflow-operator/src/test/scala/org/apache/texera/amber/operator/sklearn/SklearnBernoulliNaiveBayesOpDescSpec.scala | Adds contract/unit tests for BernoulliNB descriptor behavior and JSON round-trip. |
| common/workflow-operator/src/test/scala/org/apache/texera/amber/operator/sklearn/SklearnComplementNaiveBayesOpDescSpec.scala | Adds contract/unit tests for ComplementNB descriptor behavior and JSON round-trip. |
| common/workflow-operator/src/test/scala/org/apache/texera/amber/operator/sklearn/SklearnGaussianNaiveBayesOpDescSpec.scala | Adds contract/unit tests for GaussianNB descriptor behavior and JSON round-trip. |
| common/workflow-operator/src/test/scala/org/apache/texera/amber/operator/sklearn/SklearnMultinomialNaiveBayesOpDescSpec.scala | Adds contract/unit tests for MultinomialNB descriptor behavior and JSON round-trip. |

github-actions · 2026-06-24T04:13:47Z

⚠️ Benchmark changes need a look

🟢 4 better · 🔴 1 worse · ⚪ 10 noise (<±5%) · 0 without baseline

Compared against main 1c580e5 benchmarked on this same runner, so the delta is largely free of cross-runner hardware noise. The "7d avg" column still reflects the gh-pages dashboard. Treat <±5% as noise unless repeated.

| | config | throughput (tuples/sec) | MB/s | latency p50/p95/p99 | max Δ latest / 7d |
| --- | --- | --- | --- | --- | --- |
| 🔴 | bs=10 sw=10 sl=64 | 389 | 0.238 | 25,780/33,058/33,058 us | 🔴 +10.0% / 🔴 +8.4% |
| 🟢 | bs=100 sw=10 sl=64 | 833 | 0.508 | 116,571/139,934/139,934 us | 🟢 -11.4% / 🔴 -6.7% |
| ⚪ | bs=1000 sw=10 sl=64 | 935 | 0.57 | 1,069,345/1,118,882/1,118,882 us | ⚪ within ±5% / 🔴 -10.3% |

Baseline details

Latest main 1c580e5 from same runner

| config | metric | PR | latest main | 7d avg | Δ latest | Δ 7d |
| --- | --- | --- | --- | --- | --- | --- |
| bs=10 sw=10 sl=64 | throughput | 389 tuples/sec | 395 tuples/sec | 410.82 tuples/sec | -1.5% | -5.3% |
| bs=10 sw=10 sl=64 | MB/s | 0.238 MB/s | 0.241 MB/s | 0.251 MB/s | -1.2% | -5.1% |
| bs=10 sw=10 sl=64 | p50 | 25,780 us | 23,427 us | 23,785 us | +10.0% | +8.4% |
| bs=10 sw=10 sl=64 | p95 | 33,058 us | 36,524 us | 34,980 us | -9.5% | -5.5% |
| bs=10 sw=10 sl=64 | p99 | 33,058 us | 36,524 us | 34,980 us | -9.5% | -5.5% |
| bs=100 sw=10 sl=64 | throughput | 833 tuples/sec | 819 tuples/sec | 891.94 tuples/sec | +1.7% | -6.6% |
| bs=100 sw=10 sl=64 | MB/s | 0.508 MB/s | 0.5 MB/s | 0.544 MB/s | +1.6% | -6.7% |
| bs=100 sw=10 sl=64 | p50 | 116,571 us | 119,613 us | 112,277 us | -2.5% | +3.8% |
| bs=100 sw=10 sl=64 | p95 | 139,934 us | 158,005 us | 139,802 us | -11.4% | +0.1% |
| bs=100 sw=10 sl=64 | p99 | 139,934 us | 158,005 us | 139,802 us | -11.4% | +0.1% |
| bs=1000 sw=10 sl=64 | throughput | 935 tuples/sec | 920 tuples/sec | 1,041 tuples/sec | +1.6% | -10.2% |
| bs=1000 sw=10 sl=64 | MB/s | 0.57 MB/s | 0.561 MB/s | 0.635 MB/s | +1.6% | -10.3% |
| bs=1000 sw=10 sl=64 | p50 | 1,069,345 us | 1,084,616 us | 972,714 us | -1.4% | +9.9% |
| bs=1000 sw=10 sl=64 | p95 | 1,118,882 us | 1,133,084 us | 1,023,057 us | -1.3% | +9.4% |
| bs=1000 sw=10 sl=64 | p99 | 1,118,882 us | 1,133,084 us | 1,023,057 us | -1.3% | +9.4% |
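The "Δ latest" column and the better/worse/noise tally can be recomputed from the PR and latest-main values in the table above. The sketch below is a reconstruction under stated assumptions, not the benchmark workflow's actual script: it assumes a relative delta of `(PR - main) / main`, the 5% noise threshold quoted in the report, and that throughput and MB/s are higher-is-better while latency percentiles are lower-is-better. The names `ROWS`, `pct_delta`, and `classify` are hypothetical.

```python
NOISE_PCT = 5.0  # from the report text: treat <±5% as noise

# (config, metric, PR value, latest-main value), copied from the table above.
ROWS = [
    ("bs=10", "throughput", 389, 395),
    ("bs=10", "MB/s", 0.238, 0.241),
    ("bs=10", "p50", 25780, 23427),
    ("bs=10", "p95", 33058, 36524),
    ("bs=10", "p99", 33058, 36524),
    ("bs=100", "throughput", 833, 819),
    ("bs=100", "MB/s", 0.508, 0.5),
    ("bs=100", "p50", 116571, 119613),
    ("bs=100", "p95", 139934, 158005),
    ("bs=100", "p99", 139934, 158005),
    ("bs=1000", "throughput", 935, 920),
    ("bs=1000", "MB/s", 0.57, 0.561),
    ("bs=1000", "p50", 1069345, 1084616),
    ("bs=1000", "p95", 1118882, 1133084),
    ("bs=1000", "p99", 1118882, 1133084),
]

# Assumption: rate metrics improve when they rise, latencies when they fall.
HIGHER_IS_BETTER = {"throughput", "MB/s"}


def pct_delta(pr, base):
    """Relative change of the PR value against the baseline, in percent."""
    return (pr - base) / base * 100.0


def classify(metric, delta):
    """Label a delta as noise, better, or worse for the given metric."""
    if abs(delta) < NOISE_PCT:
        return "noise"
    improved = delta > 0 if metric in HIGHER_IS_BETTER else delta < 0
    return "better" if improved else "worse"


counts = {"better": 0, "worse": 0, "noise": 0}
for _config, metric, pr, base in ROWS:
    counts[classify(metric, pct_delta(pr, base))] += 1

print(counts)  # {'better': 4, 'worse': 1, 'noise': 10}
```

Under these assumptions the tally matches the summary line of the report (4 better, 1 worse, 10 noise). The single "worse" entry is the p50 latency at bs=10.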

Raw CSV

config_idx,batch_size,schema_width,string_len,num_batches,total_ms,total_tuples,total_bytes,tuples_per_sec,mb_per_sec,lat_p50_us,lat_p95_us,lat_p99_us
0,10,10,64,20,513.76,200,128000,389,0.238,25779.92,33057.68,33057.68
1,100,10,64,20,2400.65,2000,1280000,833,0.508,116571.14,139933.82,139933.82
2,1000,10,64,20,21400.53,20000,12800000,935,0.570,1069345.18,1118882.16,1118882.16
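The derived columns in the raw CSV can be cross-checked from the totals in the same row. The following sketch is an independent recomputation, not the benchmark's own code: the helper name `derived_rates` is hypothetical, and the divisor of 1024 × 1024 for the `mb_per_sec` column is inferred from the arithmetic, since it reproduces the reported values where a divisor of 10^6 would give about 0.249 for the first row.

```python
import csv
import io

# Raw CSV rows copied verbatim from the benchmark comment.
RAW = """config_idx,batch_size,schema_width,string_len,num_batches,total_ms,total_tuples,total_bytes,tuples_per_sec,mb_per_sec,lat_p50_us,lat_p95_us,lat_p99_us
0,10,10,64,20,513.76,200,128000,389,0.238,25779.92,33057.68,33057.68
1,100,10,64,20,2400.65,2000,1280000,833,0.508,116571.14,139933.82,139933.82
2,1000,10,64,20,21400.53,20000,12800000,935,0.570,1069345.18,1118882.16,1118882.16
"""


def derived_rates(row):
    """Recompute throughput and MB/s from the row's totals."""
    seconds = float(row["total_ms"]) / 1000.0
    tuples_per_sec = int(row["total_tuples"]) / seconds
    # Inferred: the MB/s column divides bytes by 1024 * 1024.
    mb_per_sec = int(row["total_bytes"]) / seconds / (1024 * 1024)
    return round(tuples_per_sec), round(mb_per_sec, 3)


rows = list(csv.DictReader(io.StringIO(RAW)))
recomputed = [derived_rates(r) for r in rows]
reported = [(int(r["tuples_per_sec"]), float(r["mb_per_sec"])) for r in rows]

print(recomputed == reported)  # True
```

Each row also carries 20 batches, so `total_tuples` equals `batch_size * num_batches` in all three configurations.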

…assifier descriptors (apache#5941)

### What changes were proposed in this PR?

Pin behavior of four previously-untested Sklearn linear classifier descriptors in `common/workflow-operator`. No production-code changes.

| Spec | Source class | Tests |
| --- | --- | --- |
| `SklearnLogisticRegressionOpDescSpec` | `SklearnLogisticRegressionOpDesc` | 5 |
| `SklearnLogisticRegressionCVOpDescSpec` | `SklearnLogisticRegressionCVOpDesc` | 5 |
| `SklearnPerceptronOpDescSpec` | `SklearnPerceptronOpDesc` | 5 |
| `SklearnPassiveAggressiveOpDescSpec` | `SklearnPassiveAggressiveOpDesc` | 5 |

**Behavior pinned**

| Surface | Contract |
| --- | --- |
| `operatorInfo` | exact model name + `Sklearn <name> Operator` description; Sklearn group; training/testing input ports + one blocking output |
| field defaults | `countVectorizer`/`tfidfTransformer` `false`; `target`/`text` `null` |
| `getOutputSchemas` | `model_name` (STRING) + `model` (BINARY) keyed by the declared output port |
| `generatePythonCode` | imports the matching sklearn estimator and builds the `make_pipeline` model |
| Round-trip | config fields preserved through the polymorphic `LogicalOp` base, with the correct `operatorType` discriminator |

### Any related issues, documentation, discussions?

Part of the ongoing `workflow-operator` unit-test coverage effort (follow-up to the Sklearn Naive Bayes coverage in apache#5925).

### How was this PR tested?

- `sbt "WorkflowOperator/testOnly *SklearnLogisticRegressionOpDescSpec *SklearnLogisticRegressionCVOpDescSpec *SklearnPerceptronOpDescSpec *SklearnPassiveAggressiveOpDescSpec"`: 20 tests, all green
- `sbt "WorkflowOperator/Test/scalafmtCheck"` and `sbt "WorkflowOperator/scalafixAll --check"`: clean
- CI to confirm

### Was this PR authored or co-authored using generative AI tooling?

Generated-by: Claude Code (Opus 4.8 [1M context])

…d classifier descriptors (apache#5939)

### What changes were proposed in this PR?

Pin behavior of four previously-untested Sklearn tree-based classifier descriptors in `common/workflow-operator`. No production-code changes.

| Spec | Source class | Tests |
| --- | --- | --- |
| `SklearnDecisionTreeOpDescSpec` | `SklearnDecisionTreeOpDesc` | 5 |
| `SklearnExtraTreeOpDescSpec` | `SklearnExtraTreeOpDesc` | 5 |
| `SklearnExtraTreesOpDescSpec` | `SklearnExtraTreesOpDesc` | 5 |
| `SklearnRandomForestOpDescSpec` | `SklearnRandomForestOpDesc` | 5 |

**Behavior pinned**

| Surface | Contract |
| --- | --- |
| `operatorInfo` | exact model name + `Sklearn <name> Operator` description; Sklearn group; training/testing input ports + one blocking output |
| field defaults | `countVectorizer`/`tfidfTransformer` `false`; `target`/`text` `null` |
| `getOutputSchemas` | `model_name` (STRING) + `model` (BINARY) keyed by the declared output port |
| `generatePythonCode` | imports the matching sklearn estimator and builds the `make_pipeline` model |
| Round-trip | config fields preserved through the polymorphic `LogicalOp` base, with the correct `operatorType` discriminator |

### Any related issues, documentation, discussions?

Part of the ongoing `workflow-operator` unit-test coverage effort (follow-up to the Sklearn Naive Bayes coverage in apache#5925).

### How was this PR tested?

- `sbt "WorkflowOperator/testOnly *SklearnDecisionTreeOpDescSpec *SklearnExtraTreeOpDescSpec *SklearnExtraTreesOpDescSpec *SklearnRandomForestOpDescSpec"`: 20 tests, all green
- `sbt "WorkflowOperator/Test/scalafmtCheck"` and `sbt "WorkflowOperator/scalafixAll --check"`: clean
- CI to confirm

### Was this PR authored or co-authored using generative AI tooling?

Generated-by: Claude Code (Opus 4.8 [1M context])

…eighbor classifier descriptors (apache#5945)

### What changes were proposed in this PR?

Pin behavior of four previously-untested Sklearn support-vector and neighbor classifier descriptors in `common/workflow-operator`. No production-code changes.

| Spec | Source class | Tests |
| --- | --- | --- |
| `SklearnSVMOpDescSpec` | `SklearnSVMOpDesc` | 5 |
| `SklearnLinearSVMOpDescSpec` | `SklearnLinearSVMOpDesc` | 5 |
| `SklearnKNNOpDescSpec` | `SklearnKNNOpDesc` | 5 |
| `SklearnNearestCentroidOpDescSpec` | `SklearnNearestCentroidOpDesc` | 5 |

**Behavior pinned**

| Surface | Contract |
| --- | --- |
| `operatorInfo` | exact model name + `Sklearn <name> Operator` description; Sklearn group; training/testing input ports + one blocking output |
| field defaults | `countVectorizer`/`tfidfTransformer` `false`; `target`/`text` `null` |
| `getOutputSchemas` | `model_name` (STRING) + `model` (BINARY) keyed by the declared output port |
| `generatePythonCode` | imports the matching sklearn estimator and builds the `make_pipeline` model |
| Round-trip | config fields preserved through the polymorphic `LogicalOp` base, with the correct `operatorType` discriminator |

### Any related issues, documentation, discussions?

Part of the ongoing `workflow-operator` unit-test coverage effort (follow-up to the Sklearn classifier coverage in apache#5925, apache#5939, apache#5940, apache#5941).

### How was this PR tested?

- `sbt "WorkflowOperator/testOnly *SklearnSVMOpDescSpec *SklearnLinearSVMOpDescSpec *SklearnKNNOpDescSpec *SklearnNearestCentroidOpDescSpec"`: 20 tests, all green
- `sbt "WorkflowOperator/Test/scalafmtCheck"` and `sbt "WorkflowOperator/scalafixAll --check"`: clean
- CI to confirm

### Was this PR authored or co-authored using generative AI tooling?

Generated-by: Claude Code (Opus 4.8 [1M context])

…classifier descriptors (apache#5940)

### What changes were proposed in this PR?

Pin behavior of three previously-untested Sklearn ensemble classifier descriptors in `common/workflow-operator`. No production-code changes.

| Spec | Source class | Tests |
| --- | --- | --- |
| `SklearnAdaptiveBoostingOpDescSpec` | `SklearnAdaptiveBoostingOpDesc` | 5 |
| `SklearnBaggingOpDescSpec` | `SklearnBaggingOpDesc` | 5 |
| `SklearnGradientBoostingOpDescSpec` | `SklearnGradientBoostingOpDesc` | 5 |

**Behavior pinned**

| Surface | Contract |
| --- | --- |
| `operatorInfo` | exact model name + `Sklearn <name> Operator` description; Sklearn group; training/testing input ports + one blocking output |
| field defaults | `countVectorizer`/`tfidfTransformer` `false`; `target`/`text` `null` |
| `getOutputSchemas` | `model_name` (STRING) + `model` (BINARY) keyed by the declared output port |
| `generatePythonCode` | imports the matching sklearn estimator and builds the `make_pipeline` model |
| Round-trip | config fields preserved through the polymorphic `LogicalOp` base, with the correct `operatorType` discriminator |

### Any related issues, documentation, discussions?

Part of the ongoing `workflow-operator` unit-test coverage effort (follow-up to the Sklearn Naive Bayes coverage in apache#5925).

### How was this PR tested?

- `sbt "WorkflowOperator/testOnly *SklearnAdaptiveBoostingOpDescSpec *SklearnBaggingOpDescSpec *SklearnGradientBoostingOpDescSpec"`: 15 tests, all green
- `sbt "WorkflowOperator/Test/scalafmtCheck"` and `sbt "WorkflowOperator/scalafixAll --check"`: clean
- CI to confirm

### Was this PR authored or co-authored using generative AI tooling?

Generated-by: Claude Code (Opus 4.8 [1M context])

…/dummy classifier descriptors (apache#5946)

### What changes were proposed in this PR?

Pin behavior of four previously-untested Sklearn classifier descriptors (ridge/SGD/dummy) in `common/workflow-operator`. No production-code changes.

| Spec | Source class | Tests |
| --- | --- | --- |
| `SklearnRidgeOpDescSpec` | `SklearnRidgeOpDesc` | 5 |
| `SklearnRidgeCVOpDescSpec` | `SklearnRidgeCVOpDesc` | 5 |
| `SklearnSDGOpDescSpec` | `SklearnSDGOpDesc` | 5 |
| `SklearnDummyClassifierOpDescSpec` | `SklearnDummyClassifierOpDesc` | 5 |

**Behavior pinned**

| Surface | Contract |
| --- | --- |
| `operatorInfo` | exact model name + `Sklearn <name> Operator` description; Sklearn group; training/testing input ports + one blocking output |
| field defaults | `countVectorizer`/`tfidfTransformer` `false`; `target`/`text` `null` |
| `getOutputSchemas` | `model_name` (STRING) + `model` (BINARY) keyed by the declared output port |
| `generatePythonCode` | imports the matching sklearn estimator (`RidgeClassifier`/`RidgeClassifierCV`/`SGDClassifier`/`DummyClassifier`) and builds the `make_pipeline` model |
| Round-trip | config fields preserved through the polymorphic `LogicalOp` base, with the correct `operatorType` discriminator |

### Any related issues, documentation, discussions?

Part of the ongoing `workflow-operator` unit-test coverage effort (follow-up to the Sklearn classifier coverage in apache#5925, apache#5939, apache#5940, apache#5941).

### How was this PR tested?

- `sbt "WorkflowOperator/testOnly *SklearnRidgeOpDescSpec *SklearnRidgeCVOpDescSpec *SklearnSDGOpDescSpec *SklearnDummyClassifierOpDescSpec"`: 20 tests, all green
- `sbt "WorkflowOperator/Test/scalafmtCheck"` and `sbt "WorkflowOperator/scalafixAll --check"`: clean
- CI to confirm

### Was this PR authored or co-authored using generative AI tooling?

Generated-by: Claude Code (Opus 4.8 [1M context])

test(workflow-operator): add unit test coverage for Sklearn Naive Bayes descriptors (Bernoulli, Complement, Gaussian, Multinomial)

7a72636

Copilot AI review requested due to automatic review settings June 24, 2026 04:07

github-actions Bot assigned aglinxinyuan Jun 24, 2026

github-actions Bot added the common label Jun 24, 2026

Copilot started reviewing on behalf of aglinxinyuan June 24, 2026 04:08

Copilot AI reviewed Jun 24, 2026

aglinxinyuan wants to merge 1 commit into apache:main from aglinxinyuan:test-sklearn-naive-bayes-descriptors
