Skip to content

feat(workflow-operator): support COUNT(*) in the Aggregate operator#5896

Open
tanishqgandhi1908 wants to merge 6 commits into
apache:mainfrom
tanishqgandhi1908:feat/count-star
Open

feat(workflow-operator): support COUNT(*) in the Aggregate operator#5896
tanishqgandhi1908 wants to merge 6 commits into
apache:mainfrom
tanishqgandhi1908:feat/count-star

Conversation

@tanishqgandhi1908

Copy link
Copy Markdown
Contributor

What changes were proposed in this PR?

This PR adds support for COUNT(*) in the Aggregate operator, so users can count all rows without having to pick a column.

A dedicated count(*) function is added alongside the existing count:

  • count — counts the non-null values of a selected column (unchanged behavior).
  • count(*) — counts every row, including rows with nulls; no column needed.

Backend

  • New COUNT_STAR("count(*)") aggregation function; countStarAgg counts every row, and getFinal rewrites both count variants to SUM for the global stage.
  • attribute is now required for every function except count(*), enforced via a conditional JSON-schema rule. This gates execution (validated by Ajv), so a missing attribute on count/sum/etc. correctly makes the operator invalid.

Frontend (Aggregate only)

  • When count(*) is selected, the Attribute field is disabled (greyed out, keeping
    each aggregation row's layout consistent) and any previously-selected column is cleared.

Docs

  • Updated the Aggregate operator reference page.

Screenshots

count(*) selected — the Attribute field is disabled, and the result counts all rows:

image

Any related issues, documentation, discussions?

Closes #3142.

How was this PR tested?

Automated (unit + integration, AggregateOpSpec / AggregateOpDescSpec):

  • count(*) counts every row including nulls, and ignores any attribute value that
    leaks through.
  • count counts only non-null values of its column.
  • getAggregationAttribute / getFinal handle COUNT_STAR.
  • Schema-propagation guard (AggregateOpDesc) and executor guard (AggregateOpExec) tolerate a blank count(*) attribute without dereferencing a non-existent column; the result column is typed INTEGER.

Manual (UI): verified the screenshots above — selecting count(*) disables and clears the Attribute and counts all rows; count/sum with an empty Attribute mark the operator invalid and disable Run; confirmed both with and without Group By. The frontend logic lives in jsonSchemaMapIntercept alongside similar per-operator expression rules (e.g. FileScanOp), which are likewise validated manually rather than unit-tested.

Was this PR authored or co-authored using generative AI tooling?

Generated-by: Claude Code (Claude Opus 4.8)

@github-actions github-actions Bot added feature frontend Changes related to the frontend GUI docs Changes related to documentations common labels Jun 23, 2026
@github-actions

Copy link
Copy Markdown
Contributor

Automated Reviewer Suggestions

Based on the git blame history of the changed files, we recommend the following reviewers:

  • Contributors with relevant context: @Yicong-Huang, @aglinxinyuan, @Ma77Ball
    You can notify them by mentioning @Yicong-Huang, @aglinxinyuan, @Ma77Ball in a comment.

@codecov-commenter

codecov-commenter commented Jun 23, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 80.76923% with 5 lines in your changes missing coverage. Please review.
✅ Project coverage is 54.61%. Comparing base (439ea72) to head (0a9117c).
⚠️ Report is 18 commits behind head on main.
✅ All tests successful. No failed tests found.

Files with missing lines Patch % Lines
...it-frame/operator-property-edit-frame.component.ts 20.00% 3 Missing and 1 partial ⚠️
...mber/operator/aggregate/AggregationOperation.scala 94.73% 0 Missing and 1 partial ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##               main    #5896      +/-   ##
============================================
+ Coverage     54.09%   54.61%   +0.52%     
- Complexity     2817     2924     +107     
============================================
  Files          1103     1109       +6     
  Lines         42650    42849     +199     
  Branches       4588     4612      +24     
============================================
+ Hits          23070    23402     +332     
+ Misses        18236    18071     -165     
- Partials       1344     1376      +32     
Flag Coverage Δ *Carryforward flag
access-control-service 70.44% <ø> (ø)
agent-service 34.36% <ø> (ø) Carriedforward from 58fc27f
amber 56.68% <95.23%> (+1.06%) ⬆️
computing-unit-managing-service 1.65% <ø> (ø)
config-service 57.35% <ø> (ø)
file-service 58.59% <ø> (ø)
frontend 48.40% <20.00%> (+0.31%) ⬆️
pyamber 90.20% <ø> (ø) Carriedforward from 58fc27f
python 90.76% <ø> (ø) Carriedforward from 58fc27f
workflow-compiling-service 58.69% <ø> (ø)

*This pull request uses carry forward flags. Click here to find out more.

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@github-actions

github-actions Bot commented Jun 23, 2026

Copy link
Copy Markdown
Contributor

⚠️ Benchmark changes need a look

🟢 2 better · 🔴 3 worse · ⚪ 10 noise (<±5%) · 0 without baseline

Compared against main 439ea72 benchmarked on this same runner, so the delta is largely free of cross-runner hardware noise. The "7d avg" column still reflects the gh-pages dashboard. Treat <±5% as noise unless repeated.

Dashboard · Run

config throughput MB/s latency max Δ latest / 7d
🔴 bs=10 sw=10 sl=64 399 0.244 23,582/34,176/34,176 us 🔴 +13.3% / 🔴 +128.5%
🟢 bs=100 sw=10 sl=64 958 0.585 105,202/117,371/117,371 us 🟢 -19.0% / 🔴 +17.1%
bs=1000 sw=10 sl=64 1,115 0.681 890,708/942,803/942,803 us ⚪ within ±5% / ⚪ within ±5%
Baseline details

Latest main 439ea72 from same runner

config metric PR latest main 7d avg Δ latest Δ 7d
bs=10 sw=10 sl=64 throughput 399 tuples/sec 431 tuples/sec 799.7 tuples/sec -7.4% -50.1%
bs=10 sw=10 sl=64 MB/s 0.244 MB/s 0.263 MB/s 0.488 MB/s -7.2% -50.0%
bs=10 sw=10 sl=64 p50 23,582 us 20,808 us 12,125 us +13.3% +94.5%
bs=10 sw=10 sl=64 p95 34,176 us 33,692 us 14,959 us +1.4% +128.5%
bs=10 sw=10 sl=64 p99 34,176 us 33,692 us 18,503 us +1.4% +84.7%
bs=100 sw=10 sl=64 throughput 958 tuples/sec 962 tuples/sec 1,069 tuples/sec -0.4% -10.3%
bs=100 sw=10 sl=64 MB/s 0.585 MB/s 0.587 MB/s 0.652 MB/s -0.3% -10.3%
bs=100 sw=10 sl=64 p50 105,202 us 100,294 us 92,709 us +4.9% +13.5%
bs=100 sw=10 sl=64 p95 117,371 us 144,896 us 100,227 us -19.0% +17.1%
bs=100 sw=10 sl=64 p99 117,371 us 144,896 us 109,877 us -19.0% +6.8%
bs=1000 sw=10 sl=64 throughput 1,115 tuples/sec 1,098 tuples/sec 1,114 tuples/sec +1.5% +0.1%
bs=1000 sw=10 sl=64 MB/s 0.681 MB/s 0.67 MB/s 0.68 MB/s +1.6% +0.2%
bs=1000 sw=10 sl=64 p50 890,708 us 918,337 us 899,197 us -3.0% -0.9%
bs=1000 sw=10 sl=64 p95 942,803 us 946,965 us 945,162 us -0.4% -0.2%
bs=1000 sw=10 sl=64 p99 942,803 us 946,965 us 986,231 us -0.4% -4.4%
Raw CSV
config_idx,batch_size,schema_width,string_len,num_batches,total_ms,total_tuples,total_bytes,tuples_per_sec,mb_per_sec,lat_p50_us,lat_p95_us,lat_p99_us
0,10,10,64,20,500.74,200,128000,399,0.244,23582.21,34175.92,34175.92
1,100,10,64,20,2087.24,2000,1280000,958,0.585,105201.65,117371.40,117371.40
2,1000,10,64,20,17936.41,20000,12800000,1115,0.681,890708.20,942802.89,942802.89

@aglinxinyuan

Copy link
Copy Markdown
Contributor

@chenlica, what do you think? Should we support COUNT(*)?

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds first-class COUNT(*) support to the Aggregate operator so users can count rows without selecting a column, with backend execution/schema support and a small Aggregate-specific UI rule to disable the Attribute field when count(*) is selected.

Changes:

  • Backend: introduce COUNT_STAR("count(*)"), implement row-count aggregation, and rewrite both COUNT variants to SUM for the global aggregation stage.
  • Backend/validation: make attribute conditionally required via JSON-schema (attribute required for all functions except count(*)).
  • Frontend + docs: disable/clear the Attribute field when count(*) is selected and document the new behavior.

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
frontend/src/app/workspace/component/property-editor/operator-property-edit-frame/operator-property-edit-frame.component.ts Disables + clears Aggregate “Attribute” field when count(*) is chosen.
docs/reference/operators/data-cleaning/aggregate/aggregate.md Documents count(*) and clarifies COUNT vs COUNT(*).
common/workflow-operator/src/main/scala/org/apache/texera/amber/operator/aggregate/AggregationFunction.java Adds COUNT_STAR("count(*)") enum value for JSON.
common/workflow-operator/src/main/scala/org/apache/texera/amber/operator/aggregate/AggregationOperation.scala Adds COUNT(*) semantics, conditional attribute requirement in schema, and global-stage rewrite logic.
common/workflow-operator/src/main/scala/org/apache/texera/amber/operator/aggregate/AggregateOpExec.scala Updates executor initialization to tolerate COUNT(*) without input-column lookup.
common/workflow-operator/src/main/scala/org/apache/texera/amber/operator/aggregate/AggregateOpDesc.scala Updates schema propagation to tolerate COUNT(*) without input-column lookup.
common/workflow-operator/src/test/scala/org/apache/texera/amber/operator/aggregate/AggregateOpSpec.scala Adds unit/integration coverage for COUNT(*) behavior and final-stage rewrite.
common/workflow-operator/src/test/scala/org/apache/texera/amber/operator/aggregate/AggregateOpDescSpec.scala Adds schema-propagation coverage for COUNT(*) output typing.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 8 out of 8 changed files in this pull request and generated no new comments.

@chenlica

Copy link
Copy Markdown
Contributor

@mengw15 Please review it.

@chenlica chenlica requested review from mengw15 and removed request for chenlica June 23, 2026 07:02
@Yicong-Huang

Copy link
Copy Markdown
Contributor

maybe a bit more to ask, but can we hide the Attribute input box when count(*) is selected?

alternatively, you can add a * value in Attribute so that its semantic is func: count, value: "*".

@Yicong-Huang

Yicong-Huang commented Jun 23, 2026

Copy link
Copy Markdown
Contributor

@chenlica, what do you think? Should we support COUNT(*)?

This better to be discussed on issue #3142, not PR.

@tanishqgandhi1908

Copy link
Copy Markdown
Contributor Author

maybe a bit more to ask, but can we hide the Attribute input box when count(*) is selected?

alternatively, you can add a * value in Attribute so that its semantic is func: count, value: "*".

Thanks! Quick reasoning on both, and happy to change either:

I disabled (greyed out) the Attribute instead of hiding it so the rows don't change height/jump when switching functions, the layout stays consistent. But hiding is a one-line change can do if team prefers it.

The Attribute box is a single column-picker shared by every row and filled from the input schema. If we add a * option to it, that * would also show up for sum, min, max, etc, since the dropdown is the same for all functions. Limiting * to count only would need extra per-row conditional logic.

@Yicong-Huang

Copy link
Copy Markdown
Contributor

yeah but if we don't hide/remove/reset the value input box, it may have this experience:

user selects Min, age, then switch to Count(*), age which can be confusing. we want to avoid that situation.

@tanishqgandhi1908

tanishqgandhi1908 commented Jun 24, 2026

Copy link
Copy Markdown
Contributor Author

That case is already handled, when you switch to count(*) we reset the attribute, so it shows count(*) with an empty box, not count(*), age. The change to hide is minor, do let me know what you think!

@Yicong-Huang

Copy link
Copy Markdown
Contributor

That case is already handled, when you switch to count(*) we reset the attribute, so it shows count(*) with an empty box, not count(*), age. The change to hide is minor, do let me know what you think!

ok so it is doing reset already. that will do. to hide it is optional and I will leave you to decide for the best experience! thanks

@Yicong-Huang Yicong-Huang left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left comments in line

@chenlica

Copy link
Copy Markdown
Contributor

@mengw15 Please review it after the pass of @Yicong-Huang .

@zuozhiw

zuozhiw commented Jun 26, 2026

Copy link
Copy Markdown
Contributor

Same comment as yicong, we should do function count and * as a special column you can select there. honestly I think it will be eaiser if we make attribute an optional parameter and if you don't select anything it means count(*)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

common docs Changes related to documentations feature frontend Changes related to the frontend GUI

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Support COUNT(*) for Aggregate COUNT function

7 participants