test(workflow-operator): add unit test coverage for Sklearn Naive Bayes descriptors#5925

Open
aglinxinyuan wants to merge 1 commit into
apache:mainfrom
aglinxinyuan:test-sklearn-naive-bayes-descriptors

Conversation

@aglinxinyuan


What changes were proposed in this PR?

Pin behavior of the four previously-untested Sklearn Naive Bayes classifier descriptors in common/workflow-operator. No production-code changes.

| Spec | Source class | Tests |
| --- | --- | --- |
| `SklearnBernoulliNaiveBayesOpDescSpec` | `SklearnBernoulliNaiveBayesOpDesc` | 5 |
| `SklearnComplementNaiveBayesOpDescSpec` | `SklearnComplementNaiveBayesOpDesc` | 5 |
| `SklearnGaussianNaiveBayesOpDescSpec` | `SklearnGaussianNaiveBayesOpDesc` | 5 |
| `SklearnMultinomialNaiveBayesOpDescSpec` | `SklearnMultinomialNaiveBayesOpDesc` | 5 |

Behavior pinned

| Surface | Contract |
| --- | --- |
| `operatorInfo` | exact model name + `Sklearn <name> Operator` description; Sklearn group; training/testing input ports + one blocking output |
| field defaults | `countVectorizer`/`tfidfTransformer` `false`; `target`/`text` `null` |
| `getOutputSchemas` | `model_name` (STRING) + `model` (BINARY) keyed by the declared output port |
| `generatePythonCode` | imports and instantiates the matching sklearn estimator (e.g. `BernoulliNB`) via `make_pipeline` |
| Round-trip | config fields preserved through the polymorphic `LogicalOp` base, with the correct `operatorType` discriminator |
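
To make the `generatePythonCode` row concrete, here is a minimal Python sketch of the kind of substring check such a test can pin. The `generate_python_code` function and its three-line template are hypothetical stand-ins: the real generator is Scala code whose exact output is not shown in this PR. Only the estimator names, the `sklearn.naive_bayes` module, and `make_pipeline` correspond to real scikit-learn names.

```python
# Hypothetical stand-in for a descriptor's code generator. The real
# generatePythonCode is implemented in Scala and emits a longer script.
def generate_python_code(estimator: str, module: str = "sklearn.naive_bayes") -> str:
    return "\n".join(
        [
            f"from {module} import {estimator}",
            "from sklearn.pipeline import make_pipeline",
            f"model = make_pipeline({estimator}())",
        ]
    )


# The four estimators covered by this PR's descriptors.
NAIVE_BAYES_ESTIMATORS = ["BernoulliNB", "ComplementNB", "GaussianNB", "MultinomialNB"]


def satisfies_contract(estimator: str) -> bool:
    """Check the pinned contract: matching import, instantiation, make_pipeline."""
    code = generate_python_code(estimator)
    return (
        f"import {estimator}" in code
        and f"{estimator}()" in code
        and "make_pipeline" in code
    )


results = {name: satisfies_contract(name) for name in NAIVE_BAYES_ESTIMATORS}
print(results)
```

Checking for substrings rather than the full script keeps such a test stable when unrelated parts of the generated code change, at the cost of not catching regressions elsewhere in the output.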

Any related issues, documentation, discussions?

Part of the ongoing workflow-operator unit-test coverage effort.

How was this PR tested?

  • sbt "WorkflowOperator/testOnly *SklearnBernoulliNaiveBayesOpDescSpec *SklearnComplementNaiveBayesOpDescSpec *SklearnGaussianNaiveBayesOpDescSpec *SklearnMultinomialNaiveBayesOpDescSpec" — 20 tests, all green
  • sbt "WorkflowOperator/Test/scalafmtCheck" and sbt "WorkflowOperator/scalafixAll --check" — clean
  • CI to confirm

Was this PR authored or co-authored using generative AI tooling?

Generated-by: Claude Code (Opus 4.8 [1M context])

…es descriptors (Bernoulli, Complement, Gaussian, Multinomial)
Copilot AI review requested due to automatic review settings June 24, 2026 04:07
@github-actions


Automated Reviewer Suggestions

Based on the git blame history of the changed files, we recommend the following reviewers:

  • No candidates found from git blame history.

@codecov-commenter

codecov-commenter commented Jun 24, 2026


Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 54.60%. Comparing base (1c580e5) to head (7a72636).

Additional details and impacted files
@@            Coverage Diff            @@
##               main    #5925   +/-   ##
=========================================
  Coverage     54.60%   54.60%           
+ Complexity     2927     2925    -2     
=========================================
  Files          1109     1109           
  Lines         42828    42828           
  Branches       4608     4608           
=========================================
  Hits          23385    23385           
+ Misses        18081    18080    -1     
- Partials       1362     1363    +1     
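
The headline percentage can be re-derived from the counts in the diff above. The sketch below assumes the usual Codecov convention that only fully hit lines count as covered, so coverage is hits divided by total lines; the dictionary layout is just for illustration.

```python
# Counts copied from the coverage diff above (base = main, head = this PR).
base = {"hits": 23385, "misses": 18081, "partials": 1362, "lines": 42828}
head = {"hits": 23385, "misses": 18080, "partials": 1363, "lines": 42828}


def coverage_percent(report: dict) -> float:
    """Coverage as hits / lines; partial lines are not counted as hits."""
    return round(100 * report["hits"] / report["lines"], 2)


for name, report in (("base", base), ("head", head)):
    # Sanity check: the three buckets should account for every tracked line.
    total = report["hits"] + report["misses"] + report["partials"]
    print(name, coverage_percent(report), total == report["lines"])
```

One line moved from "miss" to "partial" between base and head, which leaves the hit count, and therefore the 54.60% figure, unchanged.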
| Flag | Coverage Δ | *Carryforward flag |
| --- | --- | --- |
| access-control-service | 70.44% <ø> (ø) | |
| agent-service | 34.36% <ø> (ø) | Carriedforward from 1c580e5 |
| amber | 56.73% <ø> (ø) | |
| computing-unit-managing-service | 1.65% <ø> (ø) | |
| config-service | 57.35% <ø> (ø) | |
| file-service | 58.59% <ø> (ø) | |
| frontend | 48.31% <ø> (ø) | Carriedforward from 1c580e5 |
| pyamber | 90.20% <ø> (ø) | Carriedforward from 1c580e5 |
| python | 90.76% <ø> (ø) | Carriedforward from 1c580e5 |
| workflow-compiling-service | 58.69% <ø> (ø) | |

*This pull request uses carry forward flags.

Copilot AI left a comment

Pull request overview

Adds Scala unit tests in common/workflow-operator to pin the current behavior/contract of the four Sklearn Naive Bayes operator descriptors (Bernoulli, Complement, Gaussian, Multinomial) without changing production code.

Changes:

  • Introduces new AnyFlatSpec test suites validating operatorInfo (name/description/group + port shape) and default config values.
  • Verifies getOutputSchemas emits the expected model_name (STRING) and model (BINARY) schema on the declared output port.
  • Pins generatePythonCode output (sklearn estimator import + pipeline usage) and JSON polymorphic round-trip via LogicalOp using the operatorType discriminator.
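
The polymorphic round-trip in the last bullet can be illustrated with a small Python analogue. This is a sketch of the general technique only (serialize with a type discriminator, deserialize through the base type, compare), not the project's Jackson-based Scala code. The registry, the `BernoulliNaiveBayesOp` class, and the discriminator value `"HypotheticalBernoulliNB"` are all invented for illustration; only the field names and their defaults come from the PR description.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional

REGISTRY = {}  # discriminator value -> concrete class (illustrative only)


def register(operator_type: str):
    """Class decorator that records a subclass under its discriminator."""
    def wrap(cls):
        cls.operator_type = operator_type
        REGISTRY[operator_type] = cls
        return cls
    return wrap


@dataclass
class LogicalOp:
    # Field names and defaults mirror the "field defaults" row of the PR.
    target: Optional[str] = None
    text: Optional[str] = None
    countVectorizer: bool = False
    tfidfTransformer: bool = False

    def to_json(self) -> str:
        return json.dumps({"operatorType": self.operator_type, **asdict(self)})

    @staticmethod
    def from_json(payload: str) -> "LogicalOp":
        fields = json.loads(payload)
        cls = REGISTRY[fields.pop("operatorType")]  # dispatch on discriminator
        return cls(**fields)


@register("HypotheticalBernoulliNB")  # made-up discriminator value
@dataclass
class BernoulliNaiveBayesOp(LogicalOp):
    pass


original = BernoulliNaiveBayesOp(target="label", text="body", countVectorizer=True)
restored = LogicalOp.from_json(original.to_json())
print(type(restored).__name__, restored == original)
```

Deserializing through the base type is the important part: it confirms that the discriminator alone is enough to recover the concrete descriptor class along with its configured fields.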

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| common/workflow-operator/src/test/scala/org/apache/texera/amber/operator/sklearn/SklearnBernoulliNaiveBayesOpDescSpec.scala | Adds contract/unit tests for BernoulliNB descriptor behavior and JSON round-trip. |
| common/workflow-operator/src/test/scala/org/apache/texera/amber/operator/sklearn/SklearnComplementNaiveBayesOpDescSpec.scala | Adds contract/unit tests for ComplementNB descriptor behavior and JSON round-trip. |
| common/workflow-operator/src/test/scala/org/apache/texera/amber/operator/sklearn/SklearnGaussianNaiveBayesOpDescSpec.scala | Adds contract/unit tests for GaussianNB descriptor behavior and JSON round-trip. |
| common/workflow-operator/src/test/scala/org/apache/texera/amber/operator/sklearn/SklearnMultinomialNaiveBayesOpDescSpec.scala | Adds contract/unit tests for MultinomialNB descriptor behavior and JSON round-trip. |


@github-actions


⚠️ Benchmark changes need a look

🟢 4 better · 🔴 1 worse · ⚪ 10 noise (<±5%) · 0 without baseline

Compared against main 1c580e5 benchmarked on this same runner, so the delta is largely free of cross-runner hardware noise. The "7d avg" column still reflects the gh-pages dashboard. Treat <±5% as noise unless repeated.

Dashboard · Run

| status | config | throughput | MB/s | latency (p50/p95/p99) | max Δ latest / 7d |
| --- | --- | --- | --- | --- | --- |
| 🔴 | bs=10 sw=10 sl=64 | 389 | 0.238 | 25,780/33,058/33,058 us | 🔴 +10.0% / 🔴 +8.4% |
| 🟢 | bs=100 sw=10 sl=64 | 833 | 0.508 | 116,571/139,934/139,934 us | 🟢 -11.4% / 🔴 -6.7% |
| | bs=1000 sw=10 sl=64 | 935 | 0.57 | 1,069,345/1,118,882/1,118,882 us | ⚪ within ±5% / 🔴 -10.3% |
Baseline details

Latest main 1c580e5 from same runner

| config | metric | PR | latest main | 7d avg | Δ latest | Δ 7d |
| --- | --- | --- | --- | --- | --- | --- |
| bs=10 sw=10 sl=64 | throughput | 389 tuples/sec | 395 tuples/sec | 410.82 tuples/sec | -1.5% | -5.3% |
| bs=10 sw=10 sl=64 | MB/s | 0.238 MB/s | 0.241 MB/s | 0.251 MB/s | -1.2% | -5.1% |
| bs=10 sw=10 sl=64 | p50 | 25,780 us | 23,427 us | 23,785 us | +10.0% | +8.4% |
| bs=10 sw=10 sl=64 | p95 | 33,058 us | 36,524 us | 34,980 us | -9.5% | -5.5% |
| bs=10 sw=10 sl=64 | p99 | 33,058 us | 36,524 us | 34,980 us | -9.5% | -5.5% |
| bs=100 sw=10 sl=64 | throughput | 833 tuples/sec | 819 tuples/sec | 891.94 tuples/sec | +1.7% | -6.6% |
| bs=100 sw=10 sl=64 | MB/s | 0.508 MB/s | 0.5 MB/s | 0.544 MB/s | +1.6% | -6.7% |
| bs=100 sw=10 sl=64 | p50 | 116,571 us | 119,613 us | 112,277 us | -2.5% | +3.8% |
| bs=100 sw=10 sl=64 | p95 | 139,934 us | 158,005 us | 139,802 us | -11.4% | +0.1% |
| bs=100 sw=10 sl=64 | p99 | 139,934 us | 158,005 us | 139,802 us | -11.4% | +0.1% |
| bs=1000 sw=10 sl=64 | throughput | 935 tuples/sec | 920 tuples/sec | 1,041 tuples/sec | +1.6% | -10.2% |
| bs=1000 sw=10 sl=64 | MB/s | 0.57 MB/s | 0.561 MB/s | 0.635 MB/s | +1.6% | -10.3% |
| bs=1000 sw=10 sl=64 | p50 | 1,069,345 us | 1,084,616 us | 972,714 us | -1.4% | +9.9% |
| bs=1000 sw=10 sl=64 | p95 | 1,118,882 us | 1,133,084 us | 1,023,057 us | -1.3% | +9.4% |
| bs=1000 sw=10 sl=64 | p99 | 1,118,882 us | 1,133,084 us | 1,023,057 us | -1.3% | +9.4% |
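
The "4 better · 1 worse · 10 noise" summary can be reproduced from the PR and latest-main columns. The sketch below is a reconstruction, not the benchmark bot's actual code: it assumes a plain relative delta, a 5% noise threshold (taken from the comment text), and that throughput and MB/s are higher-is-better while latency percentiles are lower-is-better.

```python
from collections import Counter

NOISE_THRESHOLD_PCT = 5.0  # from the comment: "Treat <±5% as noise"
HIGHER_IS_BETTER = {"throughput", "MB/s"}  # assumption; latencies are lower-is-better

# (config, metric, PR value, latest-main value), copied from the table above.
ROWS = [
    ("bs=10", "throughput", 389, 395),
    ("bs=10", "MB/s", 0.238, 0.241),
    ("bs=10", "p50", 25780, 23427),
    ("bs=10", "p95", 33058, 36524),
    ("bs=10", "p99", 33058, 36524),
    ("bs=100", "throughput", 833, 819),
    ("bs=100", "MB/s", 0.508, 0.5),
    ("bs=100", "p50", 116571, 119613),
    ("bs=100", "p95", 139934, 158005),
    ("bs=100", "p99", 139934, 158005),
    ("bs=1000", "throughput", 935, 920),
    ("bs=1000", "MB/s", 0.57, 0.561),
    ("bs=1000", "p50", 1069345, 1084616),
    ("bs=1000", "p95", 1118882, 1133084),
    ("bs=1000", "p99", 1118882, 1133084),
]


def delta_pct(pr: float, base: float) -> float:
    """Relative change of the PR value against the baseline, in percent."""
    return (pr - base) / base * 100


def classify(metric: str, pr: float, base: float) -> str:
    delta = delta_pct(pr, base)
    if abs(delta) < NOISE_THRESHOLD_PCT:
        return "noise"
    improved = delta > 0 if metric in HIGHER_IS_BETTER else delta < 0
    return "better" if improved else "worse"


summary = Counter(classify(metric, pr, base) for _, metric, pr, base in ROWS)
print(dict(summary))
```

Under these assumptions the counts come out to 4 better, 1 worse, and 10 noise, matching the summary line. The single regression is the bs=10 p50 latency; because this PR changes only test code, it is plausibly run-to-run variance, in line with the comment's advice to discount changes unless they repeat.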
Raw CSV
config_idx,batch_size,schema_width,string_len,num_batches,total_ms,total_tuples,total_bytes,tuples_per_sec,mb_per_sec,lat_p50_us,lat_p95_us,lat_p99_us
0,10,10,64,20,513.76,200,128000,389,0.238,25779.92,33057.68,33057.68
1,100,10,64,20,2400.65,2000,1280000,833,0.508,116571.14,139933.82,139933.82
2,1000,10,64,20,21400.53,20000,12800000,935,0.570,1069345.18,1118882.16,1118882.16
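
The derived columns in the raw CSV can be cross-checked against the primary ones. The sketch below recomputes `tuples_per_sec` and `mb_per_sec` from `total_ms`, `total_tuples`, and `total_bytes`. Dividing by 2**20 (MiB rather than 10**6 bytes) is an assumption, chosen because it reproduces the reported values; the payload-size formula is likewise inferred from the numbers, not taken from the benchmark's source.

```python
import csv
import io

RAW_CSV = """
config_idx,batch_size,schema_width,string_len,num_batches,total_ms,total_tuples,total_bytes,tuples_per_sec,mb_per_sec,lat_p50_us,lat_p95_us,lat_p99_us
0,10,10,64,20,513.76,200,128000,389,0.238,25779.92,33057.68,33057.68
1,100,10,64,20,2400.65,2000,1280000,833,0.508,116571.14,139933.82,139933.82
2,1000,10,64,20,21400.53,20000,12800000,935,0.570,1069345.18,1118882.16,1118882.16
""".strip()


def recompute(row: dict) -> tuple:
    """Re-derive throughput figures and payload size from the primary columns."""
    seconds = float(row["total_ms"]) / 1000
    tuples_per_sec = int(row["total_tuples"]) / seconds
    mib_per_sec = int(row["total_bytes"]) / seconds / 2**20  # assumed unit: MiB
    # Inferred: bytes = batch_size * num_batches * schema_width * string_len.
    expected_bytes = (
        int(row["batch_size"])
        * int(row["num_batches"])
        * int(row["schema_width"])
        * int(row["string_len"])
    )
    return round(tuples_per_sec), round(mib_per_sec, 3), expected_bytes


rows = list(csv.DictReader(io.StringIO(RAW_CSV)))
recomputed = [recompute(row) for row in rows]
for row, values in zip(rows, recomputed):
    print(row["batch_size"], values)
```

The recomputed throughput and MB/s agree with the reported columns for all three configurations, and the inferred payload size matches `total_bytes` in each row.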

xuang7 pushed a commit to xuang7/texera that referenced this pull request Jun 25, 2026
…assifier descriptors (apache#5941)

### What changes were proposed in this PR?

Pin behavior of four previously-untested Sklearn linear classifier
descriptors in `common/workflow-operator`. No production-code changes.

| Spec | Source class | Tests |
| --- | --- | --- |
| `SklearnLogisticRegressionOpDescSpec` | `SklearnLogisticRegressionOpDesc` | 5 |
| `SklearnLogisticRegressionCVOpDescSpec` | `SklearnLogisticRegressionCVOpDesc` | 5 |
| `SklearnPerceptronOpDescSpec` | `SklearnPerceptronOpDesc` | 5 |
| `SklearnPassiveAggressiveOpDescSpec` | `SklearnPassiveAggressiveOpDesc` | 5 |

**Behavior pinned**

| Surface | Contract |
| --- | --- |
| `operatorInfo` | exact model name + `Sklearn <name> Operator` description; Sklearn group; training/testing input ports + one blocking output |
| field defaults | `countVectorizer`/`tfidfTransformer` `false`; `target`/`text` `null` |
| `getOutputSchemas` | `model_name` (STRING) + `model` (BINARY) keyed by the declared output port |
| `generatePythonCode` | imports the matching sklearn estimator and builds the `make_pipeline` model |
| Round-trip | config fields preserved through the polymorphic `LogicalOp` base, with the correct `operatorType` discriminator |

### Any related issues, documentation, discussions?

Part of the ongoing `workflow-operator` unit-test coverage effort
(follow-up to the Sklearn Naive Bayes coverage in apache#5925).

### How was this PR tested?

- `sbt "WorkflowOperator/testOnly *SklearnLogisticRegressionOpDescSpec
*SklearnLogisticRegressionCVOpDescSpec *SklearnPerceptronOpDescSpec
*SklearnPassiveAggressiveOpDescSpec"` — 20 tests, all green
- `sbt "WorkflowOperator/Test/scalafmtCheck"` and `sbt
"WorkflowOperator/scalafixAll --check"` — clean
- CI to confirm

### Was this PR authored or co-authored using generative AI tooling?

Generated-by: Claude Code (Opus 4.8 [1M context])
xuang7 pushed a commit to xuang7/texera that referenced this pull request Jun 25, 2026
…d classifier descriptors (apache#5939)

### What changes were proposed in this PR?

Pin behavior of four previously-untested Sklearn tree-based classifier
descriptors in `common/workflow-operator`. No production-code changes.

| Spec | Source class | Tests |
| --- | --- | --- |
| `SklearnDecisionTreeOpDescSpec` | `SklearnDecisionTreeOpDesc` | 5 |
| `SklearnExtraTreeOpDescSpec` | `SklearnExtraTreeOpDesc` | 5 |
| `SklearnExtraTreesOpDescSpec` | `SklearnExtraTreesOpDesc` | 5 |
| `SklearnRandomForestOpDescSpec` | `SklearnRandomForestOpDesc` | 5 |

**Behavior pinned**

| Surface | Contract |
| --- | --- |
| `operatorInfo` | exact model name + `Sklearn <name> Operator` description; Sklearn group; training/testing input ports + one blocking output |
| field defaults | `countVectorizer`/`tfidfTransformer` `false`; `target`/`text` `null` |
| `getOutputSchemas` | `model_name` (STRING) + `model` (BINARY) keyed by the declared output port |
| `generatePythonCode` | imports the matching sklearn estimator and builds the `make_pipeline` model |
| Round-trip | config fields preserved through the polymorphic `LogicalOp` base, with the correct `operatorType` discriminator |

### Any related issues, documentation, discussions?

Part of the ongoing `workflow-operator` unit-test coverage effort
(follow-up to the Sklearn Naive Bayes coverage in apache#5925).

### How was this PR tested?

- `sbt "WorkflowOperator/testOnly *SklearnDecisionTreeOpDescSpec
*SklearnExtraTreeOpDescSpec *SklearnExtraTreesOpDescSpec
*SklearnRandomForestOpDescSpec"` — 20 tests, all green
- `sbt "WorkflowOperator/Test/scalafmtCheck"` and `sbt
"WorkflowOperator/scalafixAll --check"` — clean
- CI to confirm

### Was this PR authored or co-authored using generative AI tooling?

Generated-by: Claude Code (Opus 4.8 [1M context])
Mrudhulraj pushed a commit to Mrudhulraj/texera that referenced this pull request Jun 25, 2026
…eighbor classifier descriptors (apache#5945)

### What changes were proposed in this PR?

Pin behavior of four previously-untested Sklearn support-vector and
neighbor classifier descriptors in `common/workflow-operator`. No
production-code changes.

| Spec | Source class | Tests |
| --- | --- | --- |
| `SklearnSVMOpDescSpec` | `SklearnSVMOpDesc` | 5 |
| `SklearnLinearSVMOpDescSpec` | `SklearnLinearSVMOpDesc` | 5 |
| `SklearnKNNOpDescSpec` | `SklearnKNNOpDesc` | 5 |
| `SklearnNearestCentroidOpDescSpec` | `SklearnNearestCentroidOpDesc` | 5 |

**Behavior pinned**

| Surface | Contract |
| --- | --- |
| `operatorInfo` | exact model name + `Sklearn <name> Operator` description; Sklearn group; training/testing input ports + one blocking output |
| field defaults | `countVectorizer`/`tfidfTransformer` `false`; `target`/`text` `null` |
| `getOutputSchemas` | `model_name` (STRING) + `model` (BINARY) keyed by the declared output port |
| `generatePythonCode` | imports the matching sklearn estimator and builds the `make_pipeline` model |
| Round-trip | config fields preserved through the polymorphic `LogicalOp` base, with the correct `operatorType` discriminator |

### Any related issues, documentation, discussions?

Part of the ongoing `workflow-operator` unit-test coverage effort
(follow-up to the Sklearn classifier coverage in apache#5925, apache#5939, apache#5940,
apache#5941).

### How was this PR tested?

- `sbt "WorkflowOperator/testOnly *SklearnSVMOpDescSpec
*SklearnLinearSVMOpDescSpec *SklearnKNNOpDescSpec
*SklearnNearestCentroidOpDescSpec"` — 20 tests, all green
- `sbt "WorkflowOperator/Test/scalafmtCheck"` and `sbt
"WorkflowOperator/scalafixAll --check"` — clean
- CI to confirm

### Was this PR authored or co-authored using generative AI tooling?

Generated-by: Claude Code (Opus 4.8 [1M context])
zyratlo pushed a commit to zyratlo/texera that referenced this pull request Jun 25, 2026
…classifier descriptors (apache#5940)

### What changes were proposed in this PR?

Pin behavior of three previously-untested Sklearn ensemble classifier
descriptors in `common/workflow-operator`. No production-code changes.

| Spec | Source class | Tests |
| --- | --- | --- |
| `SklearnAdaptiveBoostingOpDescSpec` | `SklearnAdaptiveBoostingOpDesc` | 5 |
| `SklearnBaggingOpDescSpec` | `SklearnBaggingOpDesc` | 5 |
| `SklearnGradientBoostingOpDescSpec` | `SklearnGradientBoostingOpDesc` | 5 |

**Behavior pinned**

| Surface | Contract |
| --- | --- |
| `operatorInfo` | exact model name + `Sklearn <name> Operator` description; Sklearn group; training/testing input ports + one blocking output |
| field defaults | `countVectorizer`/`tfidfTransformer` `false`; `target`/`text` `null` |
| `getOutputSchemas` | `model_name` (STRING) + `model` (BINARY) keyed by the declared output port |
| `generatePythonCode` | imports the matching sklearn estimator and builds the `make_pipeline` model |
| Round-trip | config fields preserved through the polymorphic `LogicalOp` base, with the correct `operatorType` discriminator |

### Any related issues, documentation, discussions?

Part of the ongoing `workflow-operator` unit-test coverage effort
(follow-up to the Sklearn Naive Bayes coverage in apache#5925).

### How was this PR tested?

- `sbt "WorkflowOperator/testOnly *SklearnAdaptiveBoostingOpDescSpec
*SklearnBaggingOpDescSpec *SklearnGradientBoostingOpDescSpec"` — 15
tests, all green
- `sbt "WorkflowOperator/Test/scalafmtCheck"` and `sbt
"WorkflowOperator/scalafixAll --check"` — clean
- CI to confirm

### Was this PR authored or co-authored using generative AI tooling?

Generated-by: Claude Code (Opus 4.8 [1M context])
zyratlo pushed a commit to zyratlo/texera that referenced this pull request Jun 25, 2026
…/dummy classifier descriptors (apache#5946)

### What changes were proposed in this PR?

Pin behavior of four previously-untested Sklearn classifier descriptors
(ridge/SGD/dummy) in `common/workflow-operator`. No production-code
changes.

| Spec | Source class | Tests |
| --- | --- | --- |
| `SklearnRidgeOpDescSpec` | `SklearnRidgeOpDesc` | 5 |
| `SklearnRidgeCVOpDescSpec` | `SklearnRidgeCVOpDesc` | 5 |
| `SklearnSDGOpDescSpec` | `SklearnSDGOpDesc` | 5 |
| `SklearnDummyClassifierOpDescSpec` | `SklearnDummyClassifierOpDesc` | 5 |

**Behavior pinned**

| Surface | Contract |
| --- | --- |
| `operatorInfo` | exact model name + `Sklearn <name> Operator` description; Sklearn group; training/testing input ports + one blocking output |
| field defaults | `countVectorizer`/`tfidfTransformer` `false`; `target`/`text` `null` |
| `getOutputSchemas` | `model_name` (STRING) + `model` (BINARY) keyed by the declared output port |
| `generatePythonCode` | imports the matching sklearn estimator (`RidgeClassifier`/`RidgeClassifierCV`/`SGDClassifier`/`DummyClassifier`) and builds the `make_pipeline` model |
| Round-trip | config fields preserved through the polymorphic `LogicalOp` base, with the correct `operatorType` discriminator |

### Any related issues, documentation, discussions?

Part of the ongoing `workflow-operator` unit-test coverage effort
(follow-up to the Sklearn classifier coverage in apache#5925, apache#5939, apache#5940,
apache#5941).

### How was this PR tested?

- `sbt "WorkflowOperator/testOnly *SklearnRidgeOpDescSpec
*SklearnRidgeCVOpDescSpec *SklearnSDGOpDescSpec
*SklearnDummyClassifierOpDescSpec"` — 20 tests, all green
- `sbt "WorkflowOperator/Test/scalafmtCheck"` and `sbt
"WorkflowOperator/scalafixAll --check"` — clean
- CI to confirm

### Was this PR authored or co-authored using generative AI tooling?

Generated-by: Claude Code (Opus 4.8 [1M context])

3 participants