feat(workflow-operator): support COUNT(*) in the Aggregate operator#5896
feat(workflow-operator): support COUNT(*) in the Aggregate operator#5896tanishqgandhi1908 wants to merge 6 commits into
Conversation
Automated Reviewer SuggestionsBased on the
|
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## main #5896 +/- ##
============================================
+ Coverage 54.09% 54.61% +0.52%
- Complexity 2817 2924 +107
============================================
Files 1103 1109 +6
Lines 42650 42849 +199
Branches 4588 4612 +24
============================================
+ Hits 23070 23402 +332
+ Misses 18236 18071 -165
- Partials 1344 1376 +32
*This pull request uses carry forward flags. Click here to find out more. ☔ View full report in Codecov by Harness. 🚀 New features to boost your workflow:
|
|
| config | throughput | MB/s | latency | max Δ latest / 7d | |
|---|---|---|---|---|---|
| 🔴 | bs=10 sw=10 sl=64 | 399 | 0.244 | 23,582/34,176/34,176 us | 🔴 +13.3% / 🔴 +128.5% |
| 🟢 | bs=100 sw=10 sl=64 | 958 | 0.585 | 105,202/117,371/117,371 us | 🟢 -19.0% / 🔴 +17.1% |
| ⚪ | bs=1000 sw=10 sl=64 | 1,115 | 0.681 | 890,708/942,803/942,803 us | ⚪ within ±5% / ⚪ within ±5% |
Baseline details
Latest main 439ea72 from same runner
| config | metric | PR | latest main | 7d avg | Δ latest | Δ 7d |
|---|---|---|---|---|---|---|
| bs=10 sw=10 sl=64 | throughput | 399 tuples/sec | 431 tuples/sec | 799.7 tuples/sec | -7.4% | -50.1% |
| bs=10 sw=10 sl=64 | MB/s | 0.244 MB/s | 0.263 MB/s | 0.488 MB/s | -7.2% | -50.0% |
| bs=10 sw=10 sl=64 | p50 | 23,582 us | 20,808 us | 12,125 us | +13.3% | +94.5% |
| bs=10 sw=10 sl=64 | p95 | 34,176 us | 33,692 us | 14,959 us | +1.4% | +128.5% |
| bs=10 sw=10 sl=64 | p99 | 34,176 us | 33,692 us | 18,503 us | +1.4% | +84.7% |
| bs=100 sw=10 sl=64 | throughput | 958 tuples/sec | 962 tuples/sec | 1,069 tuples/sec | -0.4% | -10.3% |
| bs=100 sw=10 sl=64 | MB/s | 0.585 MB/s | 0.587 MB/s | 0.652 MB/s | -0.3% | -10.3% |
| bs=100 sw=10 sl=64 | p50 | 105,202 us | 100,294 us | 92,709 us | +4.9% | +13.5% |
| bs=100 sw=10 sl=64 | p95 | 117,371 us | 144,896 us | 100,227 us | -19.0% | +17.1% |
| bs=100 sw=10 sl=64 | p99 | 117,371 us | 144,896 us | 109,877 us | -19.0% | +6.8% |
| bs=1000 sw=10 sl=64 | throughput | 1,115 tuples/sec | 1,098 tuples/sec | 1,114 tuples/sec | +1.5% | +0.1% |
| bs=1000 sw=10 sl=64 | MB/s | 0.681 MB/s | 0.67 MB/s | 0.68 MB/s | +1.6% | +0.2% |
| bs=1000 sw=10 sl=64 | p50 | 890,708 us | 918,337 us | 899,197 us | -3.0% | -0.9% |
| bs=1000 sw=10 sl=64 | p95 | 942,803 us | 946,965 us | 945,162 us | -0.4% | -0.2% |
| bs=1000 sw=10 sl=64 | p99 | 942,803 us | 946,965 us | 986,231 us | -0.4% | -4.4% |
Raw CSV
config_idx,batch_size,schema_width,string_len,num_batches,total_ms,total_tuples,total_bytes,tuples_per_sec,mb_per_sec,lat_p50_us,lat_p95_us,lat_p99_us
0,10,10,64,20,500.74,200,128000,399,0.244,23582.21,34175.92,34175.92
1,100,10,64,20,2087.24,2000,1280000,958,0.585,105201.65,117371.40,117371.40
2,1000,10,64,20,17936.41,20000,12800000,1115,0.681,890708.20,942802.89,942802.89|
@chenlica, what do you think? Should we support COUNT(*)? |
There was a problem hiding this comment.
Pull request overview
Adds first-class COUNT(*) support to the Aggregate operator so users can count rows without selecting a column, with backend execution/schema support and a small Aggregate-specific UI rule to disable the Attribute field when count(*) is selected.
Changes:
- Backend: introduce
COUNT_STAR("count(*)"), implement row-count aggregation, and rewrite both COUNT variants toSUMfor the global aggregation stage. - Backend/validation: make
attributeconditionally required via JSON-schema (attributerequired for all functions exceptcount(*)). - Frontend + docs: disable/clear the Attribute field when
count(*)is selected and document the new behavior.
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| frontend/src/app/workspace/component/property-editor/operator-property-edit-frame/operator-property-edit-frame.component.ts | Disables + clears Aggregate “Attribute” field when count(*) is chosen. |
| docs/reference/operators/data-cleaning/aggregate/aggregate.md | Documents count(*) and clarifies COUNT vs COUNT(*). |
| common/workflow-operator/src/main/scala/org/apache/texera/amber/operator/aggregate/AggregationFunction.java | Adds COUNT_STAR("count(*)") enum value for JSON. |
| common/workflow-operator/src/main/scala/org/apache/texera/amber/operator/aggregate/AggregationOperation.scala | Adds COUNT(*) semantics, conditional attribute requirement in schema, and global-stage rewrite logic. |
| common/workflow-operator/src/main/scala/org/apache/texera/amber/operator/aggregate/AggregateOpExec.scala | Updates executor initialization to tolerate COUNT(*) without input-column lookup. |
| common/workflow-operator/src/main/scala/org/apache/texera/amber/operator/aggregate/AggregateOpDesc.scala | Updates schema propagation to tolerate COUNT(*) without input-column lookup. |
| common/workflow-operator/src/test/scala/org/apache/texera/amber/operator/aggregate/AggregateOpSpec.scala | Adds unit/integration coverage for COUNT(*) behavior and final-stage rewrite. |
| common/workflow-operator/src/test/scala/org/apache/texera/amber/operator/aggregate/AggregateOpDescSpec.scala | Adds schema-propagation coverage for COUNT(*) output typing. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
|
@mengw15 Please review it. |
|
maybe a bit more to ask, but can we hide the alternatively, you can add a |
Thanks! Quick reasoning on both, and happy to change either: I disabled (greyed out) the Attribute instead of hiding it so the rows don't change height/jump when switching functions, the layout stays consistent. But hiding is a one-line change can do if team prefers it. The Attribute box is a single column-picker shared by every row and filled from the input schema. If we add a * option to it, that * would also show up for sum, min, max, etc, since the dropdown is the same for all functions. Limiting * to count only would need extra per-row conditional logic. |
|
yeah but if we don't hide/remove/reset the value input box, it may have this experience: user selects |
|
That case is already handled, when you switch to |
ok so it is doing reset already. that will do. to hide it is optional and I will leave you to decide for the best experience! thanks |
Yicong-Huang
left a comment
There was a problem hiding this comment.
Left comments in line
|
@mengw15 Please review it after the pass of @Yicong-Huang . |
|
Same comment as yicong, we should do function count and * as a special column you can select there. honestly I think it will be eaiser if we make attribute an optional parameter and if you don't select anything it means count(*) |
What changes were proposed in this PR?
This PR adds support for
COUNT(*)in the Aggregate operator, so users can count all rows without having to pick a column.A dedicated
count(*)function is added alongside the existingcount:count— counts the non-null values of a selected column (unchanged behavior).count(*)— counts every row, including rows with nulls; no column needed.Backend
COUNT_STAR("count(*)")aggregation function;countStarAggcounts every row, andgetFinalrewrites both count variants toSUMfor the global stage.attributeis now required for every function exceptcount(*), enforced via a conditional JSON-schema rule. This gates execution (validated by Ajv), so a missing attribute oncount/sum/etc. correctly makes the operator invalid.Frontend (Aggregate only)
count(*)is selected, the Attribute field is disabled (greyed out, keepingeach aggregation row's layout consistent) and any previously-selected column is cleared.
Docs
Screenshots
count(*)selected — the Attribute field is disabled, and the result counts all rows:Any related issues, documentation, discussions?
Closes #3142.
How was this PR tested?
Automated (unit + integration,
AggregateOpSpec/AggregateOpDescSpec):count(*)counts every row including nulls, and ignores any attribute value thatleaks through.
countcounts only non-null values of its column.getAggregationAttribute/getFinalhandleCOUNT_STAR.AggregateOpDesc) and executor guard (AggregateOpExec) tolerate a blankcount(*)attribute without dereferencing a non-existent column; the result column is typedINTEGER.Manual (UI): verified the screenshots above — selecting
count(*)disables and clears the Attribute and counts all rows;count/sumwith an empty Attribute mark the operator invalid and disable Run; confirmed both with and without Group By. The frontend logic lives injsonSchemaMapInterceptalongside similar per-operator expression rules (e.g. FileScanOp), which are likewise validated manually rather than unit-tested.Was this PR authored or co-authored using generative AI tooling?
Generated-by: Claude Code (Claude Opus 4.8)