Skip to content

refactor(execution-service): consolidate init-error reporting into WorkflowExecutionService#5922

Open
aglinxinyuan wants to merge 2 commits into
apache:mainfrom
aglinxinyuan:consolidate-execservice-error-reporting
Open

refactor(execution-service): consolidate init-error reporting into WorkflowExecutionService#5922
aglinxinyuan wants to merge 2 commits into
apache:mainfrom
aglinxinyuan:consolidate-execservice-error-reporting

Conversation

@aglinxinyuan

Copy link
Copy Markdown
Contributor

What changes were proposed in this PR?

Consolidates execution init-error reporting into a single site owned by WorkflowExecutionService, replacing the two-phase split #5781 introduced (per @Yicong-Huang's suggestion, tracked in #5921).

  • WorkflowExecutionService now registers its fatalErrors → WorkflowErrorEvent (and state) diff handler as the first construction action. Construction itself does no external work and cannot throw — it only assigns workflowSettings, creates a WebsocketInput (a PublishSubject), and registers the handler. All throwing work lives in executeWorkflow(), which runs after executionService.onNext(...) publishes the execution, so any failure there is recorded by errorHandler into the metadata store and surfaced by the handler that connectToExecution forwards.
  • WorkflowService.initExecutionService drops the pre-publish errorSubject fallback: the executionPublished gating and reportFatalErrorsToSubscribers are gone, and the catch is simply errorHandler(e). The now-unused errorSubject field and its connect() subscription are removed.
  • WorkflowServiceSpec is removed — it only tested the deleted reportFatalErrorsToSubscribers; the surfacing behavior is exercised by the integration/e2e suites.

Net: a single reporting path (the metadata-store diff handler), with no change to the init-error surfacing #5781 added — construction is provably side-effect-free, so the pre-publish window no longer has a failure mode.

Any related issues, documentation, discussions?

Resolves #5921 — the follow-up refactor agreed during the #5781 review.

How was this PR tested?

scalafmtCheckAll clean. This is a behavior-preserving refactor of init-error reporting: the single reporting path is the metadata-store diff handler that already surfaced post-publish errors, and construction is side-effect-free so no error can escape it. The full compile, scalafix, and the integration/e2e suites that exercise init-error surfacing run in CI.

Was this PR authored or co-authored using generative AI tooling?

Co-authored with Claude Opus 4.8 in compliance with ASF.

…rkflowExecutionService

Per apache#5921 (follow-up to apache#5781): make error reporting a single site owned by
WorkflowExecutionService instead of the two-phase split in
WorkflowService.initExecutionService.

- WorkflowExecutionService registers its fatalErrors -> WorkflowErrorEvent diff
  handler as the FIRST construction action. Construction does no external work
  and cannot throw (workflowSettings assignment, WebsocketInput creation, handler
  registration); all throwing work is in executeWorkflow(), which runs after the
  execution is published, so its failures surface through this same handler.
- WorkflowService.initExecutionService drops the pre-publish errorSubject
  fallback (reportFatalErrorsToSubscribers + the executionPublished gating); the
  catch is simply errorHandler(e). Removes the now-unused errorSubject field and
  its connect() subscription.
- Remove WorkflowServiceSpec (it only tested the removed reportFatalErrorsToSubscribers);
  the behavior is exercised by the integration/e2e suites.

Resolves apache#5921.
Copilot AI review requested due to automatic review settings June 24, 2026 03:29
@github-actions

Copy link
Copy Markdown
Contributor

Automated Reviewer Suggestions

Based on the git blame history of the changed files, we recommend the following reviewers:

  • Contributors with relevant context: @Yicong-Huang
    You can notify them by mentioning @Yicong-Huang in a comment.

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR refactors Amber’s workflow execution initialization error surfacing to use a single reporting path owned by WorkflowExecutionService (the execution metadata-store diff handler), removing the pre-publish websocket fallback previously implemented in WorkflowService.

Changes:

  • Removed WorkflowService’s pre-publish errorSubject fallback and simplified init-time exception handling to rely on the metadata-store diff handler.
  • Reordered WorkflowExecutionService construction to register the fatal-error/state diff handler before other initialization steps.
  • Deleted WorkflowServiceSpec which only covered the removed fallback helper.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

File Description
amber/src/main/scala/org/apache/texera/web/service/WorkflowService.scala Removes errorSubject + pre-publish fallback logic; relies on metadata-store diff handler for fatal error surfacing.
amber/src/main/scala/org/apache/texera/web/service/WorkflowExecutionService.scala Registers metadata-store diff handler as the first constructor action to ensure fatal error/state events always have an emitter.
amber/src/test/scala/org/apache/texera/web/service/WorkflowServiceSpec.scala Removes unit tests tied to the deleted fallback method.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@github-actions

github-actions Bot commented Jun 24, 2026

Copy link
Copy Markdown
Contributor

⚠️ Benchmark changes need a look

🟢 0 better · 🔴 10 worse · ⚪ 5 noise (<±5%) · 0 without baseline

Compared against main 1c580e5 benchmarked on this same runner, so the delta is largely free of cross-runner hardware noise. The "7d avg" column still reflects the gh-pages dashboard. Treat <±5% as noise unless repeated.

Dashboard · Run

config throughput MB/s latency max Δ latest / 7d
🔴 bs=10 sw=10 sl=64 413 0.252 23,470/36,442/36,442 us 🔴 +9.5% / ⚪ within ±5%
🔴 bs=100 sw=10 sl=64 815 0.497 121,650/155,397/155,397 us 🔴 +14.2% / 🔴 +11.2%
bs=1000 sw=10 sl=64 975 0.595 1,022,476/1,064,671/1,064,671 us ⚪ within ±5% / 🔴 -6.4%
Baseline details

Latest main 1c580e5 from same runner

config metric PR latest main 7d avg Δ latest Δ 7d
bs=10 sw=10 sl=64 throughput 413 tuples/sec 445 tuples/sec 410.82 tuples/sec -7.2% +0.5%
bs=10 sw=10 sl=64 MB/s 0.252 MB/s 0.271 MB/s 0.251 MB/s -7.0% +0.5%
bs=10 sw=10 sl=64 p50 23,470 us 21,607 us 23,785 us +8.6% -1.3%
bs=10 sw=10 sl=64 p95 36,442 us 33,274 us 34,980 us +9.5% +4.2%
bs=10 sw=10 sl=64 p99 36,442 us 33,274 us 34,980 us +9.5% +4.2%
bs=100 sw=10 sl=64 throughput 815 tuples/sec 874 tuples/sec 891.94 tuples/sec -6.8% -8.6%
bs=100 sw=10 sl=64 MB/s 0.497 MB/s 0.534 MB/s 0.544 MB/s -6.9% -8.7%
bs=100 sw=10 sl=64 p50 121,650 us 112,128 us 112,277 us +8.5% +8.3%
bs=100 sw=10 sl=64 p95 155,397 us 136,062 us 139,802 us +14.2% +11.2%
bs=100 sw=10 sl=64 p99 155,397 us 136,062 us 139,802 us +14.2% +11.2%
bs=1000 sw=10 sl=64 throughput 975 tuples/sec 945 tuples/sec 1,041 tuples/sec +3.2% -6.3%
bs=1000 sw=10 sl=64 MB/s 0.595 MB/s 0.577 MB/s 0.635 MB/s +3.1% -6.4%
bs=1000 sw=10 sl=64 p50 1,022,476 us 1,052,923 us 972,714 us -2.9% +5.1%
bs=1000 sw=10 sl=64 p95 1,064,671 us 1,099,197 us 1,023,057 us -3.1% +4.1%
bs=1000 sw=10 sl=64 p99 1,064,671 us 1,099,197 us 1,023,057 us -3.1% +4.1%
Raw CSV
config_idx,batch_size,schema_width,string_len,num_batches,total_ms,total_tuples,total_bytes,tuples_per_sec,mb_per_sec,lat_p50_us,lat_p95_us,lat_p99_us
0,10,10,64,20,484.53,200,128000,413,0.252,23469.58,36441.60,36441.60
1,100,10,64,20,2454.67,2000,1280000,815,0.497,121650.01,155397.26,155397.26
2,1000,10,64,20,20522.36,20000,12800000,975,0.595,1022475.66,1064670.56,1064670.56

@codecov-commenter

codecov-commenter commented Jun 24, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 50.00000% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 54.58%. Comparing base (1c580e5) to head (351fed2).
✅ All tests successful. No failed tests found.

Files with missing lines Patch % Lines
...rg/apache/texera/web/service/WorkflowService.scala 0.00% 2 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##               main    #5922      +/-   ##
============================================
- Coverage     54.60%   54.58%   -0.02%     
+ Complexity     2927     2920       -7     
============================================
  Files          1109     1109              
  Lines         42828    42821       -7     
  Branches       4608     4607       -1     
============================================
- Hits          23385    23373      -12     
- Misses        18081    18083       +2     
- Partials       1362     1365       +3     
Flag Coverage Δ *Carryforward flag
access-control-service 70.44% <ø> (ø) Carriedforward from 340f70a
agent-service 34.36% <ø> (ø) Carriedforward from 340f70a
amber 56.68% <50.00%> (-0.05%) ⬇️
computing-unit-managing-service 1.65% <ø> (ø) Carriedforward from 340f70a
config-service 57.35% <ø> (ø) Carriedforward from 340f70a
file-service 58.59% <ø> (ø) Carriedforward from 340f70a
frontend 48.31% <ø> (ø) Carriedforward from 340f70a
pyamber 90.20% <ø> (ø) Carriedforward from 340f70a
python 90.76% <ø> (ø) Carriedforward from 340f70a
workflow-compiling-service 58.69% <ø> (ø) Carriedforward from 340f70a

*This pull request uses carry forward flags. Click here to find out more.

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@Yicong-Huang

Copy link
Copy Markdown
Contributor

Need test coverage ;)

Address review (Copilot + @Yicong-Huang): deleting WorkflowServiceSpec left the
single-reporting-site invariant untested. Add WorkflowExecutionServiceSpec:
- a recorded fatal error surfaces as a WorkflowErrorEvent through the
  metadata-store diff handler registered at construction (the regression guard
  for init-error surfacing);
- a state change emits a WorkflowStateEvent (handler's other branch).

The construction-unused controllerConfig/resultService are passed as null on
purpose, so a future change that does external work during construction (which
would reopen the pre-publish gap) fails this test.
@aglinxinyuan

Copy link
Copy Markdown
Contributor Author

Added test coverage in 351fed2fWorkflowExecutionServiceSpec guards that a recorded fatal error surfaces as a WorkflowErrorEvent via the metadata-store handler (plus a state change → WorkflowStateEvent). 🙂

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Consolidate WorkflowExecutionService error reporting into a single site (error-free construction + init phase)

4 participants