Detect active deployments before provisioning (#7251)
Pull request overview
Adds a pre-deployment check that detects in-progress ARM deployments at the target scope and waits for them to complete, avoiding DeploymentActive failures during provisioning.
Changes:
- Introduces `waitForActiveDeployments()` in the Bicep provisioning flow and polls until deployments clear or a timeout is reached.
- Adds `IsActiveDeploymentState()` plus new tests to classify which provisioning states are considered "active".
- Extends `infra.Scope` with `ListActiveDeployments()` and adds a `DeploymentActive` error suggestion rule.
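The exact state list inside azd's `IsActiveDeploymentState()` is not shown in this thread, so the sketch below is an assumption: the states are drawn from the ones named later in the review (`Canceling`, `Deleting`, `DeletingResources`, `UpdatingDenyAssignments`) plus common in-flight ARM states.

```go
package main

import "fmt"

// IsActiveDeploymentState reports whether an ARM provisioning state should
// block a new deployment. Illustrative sketch only — the real azd helper
// classifies 11 states; this list is assumed from states named in the review.
func IsActiveDeploymentState(state string) bool {
	switch state {
	case "Running", "Accepted", "Creating", "Canceling",
		"Deleting", "DeletingResources", "UpdatingDenyAssignments":
		return true
	default:
		// Terminal states (Succeeded, Failed, Canceled) and unknown future
		// states are treated as inactive.
		return false
	}
}

func main() {
	fmt.Println(IsActiveDeploymentState("Canceling")) // prints true — transitional states still block
	fmt.Println(IsActiveDeploymentState("Succeeded")) // prints false
}
```

The `default: false` branch is the design choice one reviewer later asks to see tested: unknown future states fail open rather than blocking.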
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| cli/azd/resources/error_suggestions.yaml | Adds a user-facing suggestion for DeploymentActive ARM errors. |
| cli/azd/pkg/infra/scope.go | Extends scope interface + implements ListActiveDeployments() for RG and subscription scopes. |
| cli/azd/pkg/infra/provisioning/bicep/bicep_provider.go | Adds wait loop before deployment submission, with polling/timeout defaults. |
| cli/azd/pkg/infra/provisioning/bicep/bicep_provider_test.go | Updates mocked scope to satisfy the new Scope interface. |
| cli/azd/pkg/infra/provisioning/bicep/active_deployment_check_test.go | Adds tests covering wait-loop behavior, errors, cancellation, and timeout. |
| cli/azd/pkg/azapi/deployments.go | Adds IsActiveDeploymentState() helper. |
| cli/azd/pkg/azapi/deployment_state_test.go | Adds unit tests for active/inactive state classification. |
Force-pushed f456f95 to 8829c13.
Telemetry Context: DeploymentActive + Retry Behavior

This PR addresses a failure mode where retrying never helps: machines that hit `DeploymentActive` averaged 3.6 blind retries, at roughly 5 minutes wasted per occurrence.

This is a clean win — the fix is architecturally simple (poll + wait) and eliminates a category of failure that can never be solved by retrying.
Force-pushed 8829c13 to 38516a8.
@copilot - will you check to ensure we have metrics coverage so we can see how often this error and fix occur after merging this change?

@kristenwomack I've opened a new pull request, #7288, to work on those changes. Once the pull request is ready, I'll request review from you.
Converting to draft — the active deployment check integration point in `Deploy()` needs rework. The standalone tests for `waitForActiveDeployments` remain valid.
Force-pushed 3732718 to f1f8bfc.
/azp run

You have several pipelines (over 10) configured to build pull requests in this repository. Specify which pipelines you would like to run by using the /azp run [pipelines] command. You can specify multiple pipelines using a comma-separated list.
f1f8bfc to
74a9edb
Compare
Add lessons learned from recent PR reviews (#7290, #7251, #7250, #7247, #7236, #7235, #7202, #7039) as agent instructions to prevent recurring review findings. New sections:
- Error handling: ErrorWithSuggestion completeness, telemetry service attribution, scope-agnostic messages
- Architecture boundaries: pkg/project target-agnostic, extension docs
- Output formatting: shell-safe paths, consistent JSON contracts
- Path safety: traversal validation, quoted paths in messages
- Testing best practices: test actual rules, extract shared helpers, correct env vars, TypeScript patterns, efficient dir checks
- CI/GitHub Actions: permissions, PATH handling, artifact downloads, prefer ADO for secrets

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Add lessons learned from team and Copilot reviews across PRs #7290, #7251, #7250, #7247, #7236, #7235, #7202, #7039 as agent instructions to prevent recurring review findings. New/expanded sections:
- Error handling: ErrorWithSuggestion field completeness, telemetry service attribution, scope-agnostic messages, link/suggestion parity, stale data in polling loops
- Architecture boundaries: pkg/project target-agnostic, extension docs separation, env var verification against source code
- Output formatting: shell-safe quoted paths, consistent JSON types
- Path safety: traversal validation, quoted paths in messages
- Code organization: extract shared logic across scopes
- Documentation standards: help text consistency, no dead references, PR description accuracy
- Testing best practices: test YAML rules e2e, extract shared helpers, correct env vars (AZD_FORCE_TTY, NO_COLOR), TypeScript patterns, reasonable timeouts, cross-platform paths, test new JSON fields
- CI / GitHub Actions: permissions blocks, PATH handling, cross-workflow artifacts, prefer ADO for secrets, no placeholder steps

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
/azp run azure-dev - cli

Azure Pipelines successfully started running 1 pipeline(s).
…per, refresh timeout names
- Fix 'range 200' compile error (not valid in all Go versions)
- Make DeploymentActive YAML rule scope-agnostic
- Extract filterActiveDeployments helper to deduplicate scope logic
- Refresh deployment names from latest poll on timeout message

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Move ListActiveDeployments to a standalone function instead of adding it to the exported Scope interface. Adding methods to exported interfaces is a breaking change for any external implementation (including test mocks in CI). The standalone infra.ListActiveDeployments(ctx, scope) function calls scope.ListDeployments and filters for active states, achieving the same result without widening the interface contract. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The deployment object returned by generateDeploymentObject embeds a Scope that can be nil in test environments (e.g. mockedScope returns an empty SubscriptionDeployment). Using scopeForTemplate resolves the scope from the provider's configuration, avoiding nil panics in existing tests. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
If the resource group is deleted externally while waiting for active deployments to drain, the poll now returns nil instead of surfacing a hard error. This matches the initial check behavior. Known limitations documented:
- Only queries the active deployment backend (standard or stacks)
- Race window between wait completion and deploy request is inherent

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ic messages
- Fix spinner to show StepFailed on error paths, StepDone only on success
- Log warning when scopeForTemplate fails instead of silently skipping
- Make error wrapping consistent: 'checking for active deployments'
- Make DeploymentActive error suggestion scope-agnostic

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Force-pushed 7e81141 to 9d66ad5.
wbreza
left a comment
Code Review — PR #7251
Detect active deployments before provisioning by @spboyer
What's Done Well
- Clean three-layer architecture — state classifier in `azapi`, filter helper in `infra`, wait loop in `bicep`. Each layer independently testable and follows DRY.
- Comprehensive test coverage — 8 test functions, 24 subtests; the `activeDeploymentScope` mock with per-call control is elegant.
- All 9 prior review findings addressed — `ErrDeploymentsNotFound` in poll, spinner status, `scopeForTemplate` logging, error wrapping consistency, scope-agnostic messaging.
- Error suggestion as defense-in-depth — the `DeploymentActive` YAML rule catches the inherent race window.
- Modern Go — `for i := range 200`, `t.Context()`, clean switch exhaustiveness.
- Excellent PR description — telemetry context (199 failures, 5.3 min avg waste) justifies the feature clearly.
Findings
| Priority | Count |
|---|---|
| Critical | 0 |
| High | 0 |
| Medium | 1 |
| Low | 2 |
| Total | 3 |
🟡 Medium
1. Root cause: Ctrl+C doesn't cancel ARM deployments — follow-up opportunity
bicep_provider.go — Deploy() / deployModule()
@spboyer This PR elegantly handles the symptom (detecting and waiting for active deployments), but the most likely root cause of those 199 DeploymentActive failures is users pressing Ctrl+C during provisioning. When azd receives SIGINT, PollUntilDone() stops polling and the process exits — but the ARM deployment continues running on Azure. The DeploymentService interface has no Cancel method, and there's no signal-handler cleanup that sends a cancellation request to ARM.
Follow-up idea: Consider a future PR that:
- Adds a `CancelDeployment()` method to `DeploymentService`
- Registers a signal handler (or `context.AfterFunc`) in the Deploy flow that calls cancel on SIGINT
- Shows "Cancelling deployment..." feedback to the user
This would prevent the issue entirely rather than waiting for orphaned deployments to finish. Combined with this PR's detection, it'd be a complete solution.
🟢 Low
2. Transient poll errors are hard failures — waitForActiveDeployments() poll loop
A single transient ARM API error during polling (e.g., throttling 429, network blip) aborts the entire provisioning attempt. Consider retrying 2-3 times with backoff before surfacing the error.
3. Timeout message could include provisioning states — waitForActiveDeployments() deadline case
The timeout error shows deployment names but not their provisioning states. Showing stuck-deploy (Canceling) vs stuck-deploy (Running) would help users decide whether to wait or take manual action.
Summary
Overall Assessment: Approve — the PR is well-implemented and directly addresses user pain. The medium finding about ARM cancellation is a follow-up enhancement, not a blocker for this PR.
Review performed with GitHub Copilot CLI
vhvb1989
left a comment
Concern: Blocking on ALL active deployments creates a bottleneck for shared subscriptions
The current implementation of waitForActiveDeployments calls ListActiveDeployments, which lists all deployments at the target scope and blocks if any of them are in an active provisioning state — regardless of deployment name.
Why this is a problem
Using the same subscription for a large team is a very common approach with azd. In our own CI, we have parallel tests that each deploy a different template to the same subscription concurrently, and this works fine because ARM allows concurrent deployments with different names at the same scope.
With this PR, those parallel deployments would serialize — the second azd up would see the first one as "active" and wait up to 30 minutes, even though ARM would happily accept both. This introduces an artificial bottleneck that does not exist today:
- Shared subscription teams: Developer A deploys env `dev-alice`, Developer B deploys env `dev-bob` to the same subscription scope → B is blocked waiting for A, even though the deployments are completely independent.
- CI pipelines: Parallel test jobs deploying different templates to the same subscription would queue up one-by-one instead of running concurrently.
- Multi-environment workflows: Deploying staging and production from the same subscription scope at the same time would be blocked.
Root cause in the code
azd already generates unique deployment names per environment via GenerateDeploymentName ({envName}-{unixTimestamp}). ARM's DeploymentActive error is triggered when a deployment with the same name is submitted while another with that name is already running — not when any deployment in the scope is active.
The 199 DeploymentActive errors from telemetry are almost certainly same-user retries hitting the same deployment name collision, not cross-team conflicts. The fix should target that specific scenario.
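The `{envName}-{unixTimestamp}` pattern described above can be sketched as follows; this is a simplified stand-in for azd's `GenerateDeploymentName`, which may sanitize or truncate in ways not shown here:

```go
package main

import (
	"fmt"
	"time"
)

// generateDeploymentName mimics the {envName}-{unixTimestamp} pattern the
// review describes. The real azd helper may differ in details.
func generateDeploymentName(envName string) string {
	return fmt.Sprintf("%s-%d", envName, time.Now().Unix())
}

func main() {
	// Two environments deploying concurrently get distinct names, which is
	// why ARM accepts both: DeploymentActive is keyed on the name.
	fmt.Println(generateDeploymentName("dev-alice"))
	fmt.Println(generateDeploymentName("dev-bob"))
}
```

This also shows why a same-user retry collides: re-running the same environment within the same second (or resubmitting the same generated name) reuses an in-flight name.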
Suggested approach
Instead of blocking on all active deployments, we should investigate how ARM actually determines a conflict. The signal is likely the deployment name. The check should:
- Generate the deployment name that `azd` is about to use (or match the environment name prefix pattern)
- Only block if an active deployment with a matching name (or name prefix based on the environment) exists
- Allow unrelated deployments to proceed concurrently, matching ARM's actual behavior
This way, the retry scenario (same env, same deployment name pattern) is handled correctly, while teams sharing a subscription or running parallel CI jobs are not penalized.
Please investigate what makes ARM fail with DeploymentActive — whether it's the deployment name, a template hash, or something else — and scope the detection to match that logic.
jongio
left a comment
PR Review - #7251
Detect active deployments before provisioning by @spboyer
Summary
What: Adds a pre-deployment check that detects in-progress ARM deployments at the target scope, warns the user, and polls until they clear - preventing the DeploymentActive error that wastes ~5 min per occurrence.
Why: 199 DeploymentActive failures in March, users averaging 3.6 blind retries. This check breaks the retry loop immediately.
Assessment: Solid implementation that follows existing codebase patterns well - state classifier in azapi, filter helper in infra, wait loop in bicep. The design avoids breaking the Scope interface (standalone function instead of method). Previous review rounds addressed spinner status, error wrapping consistency, and poll-loop ErrDeploymentsNotFound handling. Two remaining items below.
Findings
| Category | Critical | High | Medium | Low |
|---|---|---|---|---|
| Performance | 0 | 0 | 1 | 0 |
| Logic | 0 | 0 | 1 | 0 |
| Tests | 0 | 0 | 2 | 1 |
| Total | 0 | 0 | 4 | 1 |
Key Findings
- [MEDIUM] Performance: `time.After` leaks a 30-min timer goroutine; the same file already uses `time.NewTimer` (see inline)
- [MEDIUM] Logic: `SubscriptionScope.ListDeployments` never produces `ErrDeploymentsNotFound` — the recovery paths in `waitForActiveDeployments` are dead code for subscription-scoped templates. Not a bug today, but worth a code comment documenting the asymmetry.
- [MEDIUM] Tests: CancelledContext test is subtly racy (see inline)
- [MEDIUM] Tests: No integration test for the `Deploy()` call site — the `scopeForTemplate` failure silent-skip path and ordering relative to preflight are untested.
- [LOW] Tests: `TestIsActiveDeploymentState` doesn't test an unknown/future state string against the `default` branch. A one-liner would document the design choice.
Test Coverage Estimate
- Well covered: `IsActiveDeploymentState` (17 subtests), `waitForActiveDeployments` (8 tests covering happy path, errors, polling, timeout, cancellation)
- Indirectly covered: `ListActiveDeployments` / `filterActiveDeployments` exercised through wait tests
- Missing: `Deploy()` integration with active-deployment check; no dedicated `ListActiveDeployments` unit test in `scope_test.go`
What's Done Well
- Clean layering: state classifier (`azapi`) -> filter (`infra`) -> orchestrator (`bicep`). Each layer testable independently.
- Standalone `ListActiveDeployments` function avoids breaking the exported `Scope` interface — good API hygiene.
- `ErrDeploymentsNotFound` handling in poll loop correctly handles resource group deletion during wait.
- Test helper `activeDeploymentScope` is well designed with per-call response maps and atomic call counter.
2 inline comments below.
- Use time.NewTimer with deferred Stop instead of time.After
- Seed multiple poll indices in cancellation test to prevent races

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
jongio
left a comment
Reviewed the incremental changes in 142aa47 (since my last review at 9d66ad5). Both fixes look correct:
- Timer leak — `time.After` replaced with `time.NewTimer` + deferred `Stop()`, matching the existing pattern in the same file's `Deploy()` progress goroutine. The `<-deadline` channel updated to `<-deadlineTimer.C`.
- Racy mock — `CancelledContext` test now seeds poll indices 0-4 so the ticker can't return nil when it wins the select race against `ctx.Done()`. Comment added explaining the intent.
No novel findings after reviewing 6 changed files (435+ lines) and deduplicating against existing feedback.
…lures match main) Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
jongio
left a comment
Previous issues from 142aa47 all addressed. No code changes in the latest push (00ef2a8, CI retrigger only).
Re @vhvb1989's design concern about blocking on all active deployments - ARM's constraint is per-scope, not per-deployment-name. The docs explicitly say "Wait for concurrent deployment to this resource group to complete" for DeploymentActiveAndUneditable (https://learn.microsoft.com/azure/azure-resource-manager/troubleshooting/common-deployment-errors). Only one deployment can be active per resource group or subscription-level scope at a time, regardless of name. Concurrent deployments to different resource groups aren't affected since scopeForTemplate correctly narrows the check to the specific scope being targeted.
The CI scenario mentioned (parallel tests deploying different templates to the same subscription) likely works because each test targets a different resource group - those have independent deployment locks.
We can deploy the same template at the same time from Windows, Mac and Ubuntu to the same subscription. The template uses subscription-level deployments. Yes, we use a different env-name to get a unique deployment-name object and create different resource groups for each, but we do the subscription-scope deployments at the same time. That's the concern I am referring to, @jongio — with the change, a new subscription-level deployment will not continue if there is another deployment going on in the same subscription, right?
Use switch statement for TestPtr to keep nil-check and dereference in the same branch. Add nolint directive for TestMCPSecurityFluentBuilder where t.Fatal guarantees non-nil. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@vhvb1989 Good catch. Your concern is valid. The current implementation of `waitForActiveDeployments` blocks on any active deployment at the target scope, regardless of name, while ARM's `DeploymentActive` error is keyed to the deployment name.

Suggested fix: Pass the current deployment name into the wait function and filter to only match that name:

```go
// In bicep_provider.go, pass the deployment name through
if activeScope, err := p.scopeForTemplate(planned.Template); err == nil {
	if err := p.waitForActiveDeployments(ctx, activeScope, deployment.Name()); err != nil {
		return nil, err
	}
}

// Updated waitForActiveDeployments signature
func (p *BicepProvider) waitForActiveDeployments(
	ctx context.Context,
	scope infra.Scope,
	currentDeploymentName string,
) error {
	active, err := infra.ListActiveDeploymentsByName(ctx, scope, currentDeploymentName)
	// ... rest unchanged
}

// In scope.go, name-scoped filter
func ListActiveDeploymentsByName(
	ctx context.Context,
	scope Scope,
	deploymentName string,
) ([]*azapi.ResourceDeployment, error) {
	all, err := scope.ListDeployments(ctx)
	if err != nil {
		return nil, err
	}

	var active []*azapi.ResourceDeployment
	for _, d := range all {
		if d.Name == deploymentName && azapi.IsActiveDeploymentState(d.ProvisioningState) {
			active = append(active, d)
		}
	}
	return active, nil
}
```

This preserves the pre-check for same-name conflicts (the actual `DeploymentActive` trigger) while allowing unrelated deployments at the same scope to proceed concurrently.
Replace manual nil-check + t.Fatal with require.NotNil which makes staticcheck aware the pointer is non-nil for subsequent accesses. Convert all manual assertions to require.True/require.Len. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Filter active deployment detection by the current deployment name so parallel CI runs using different env-names (and therefore different ARM deployment names) don't block each other. ARM allows concurrent deployments with different names at the same scope. Added ListActiveDeploymentsByName to scope.go that filters by both name and active provisioning state. Updated waitForActiveDeployments to accept and pass through the deployment name. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Implemented in b504502. This allows parallel CI runs with different env-names (and therefore different ARM deployment names) to proceed without blocking each other, while still catching same-name `DeploymentActive` conflicts.
… to PR changes) Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
I think you need to generate new recordings for some tests to pass CI.
Summary
Fixes #7248
Before starting a deployment, azd now checks for active deployments on the target scope. If another deployment is in progress, it warns the user and waits for it to complete — avoiding the `DeploymentActive` ARM error that wastes ~5 minutes of the user's time.

Telemetry Context
- 199 `DeploymentActive` failures in March (~270/month projected)
- Most failures come from `provision`, 19% from `up`

Changes

Pre-deployment active check (`bicep_provider.go`)
Added `waitForActiveDeployments()` between preflight validation and deployment submission. The initial list tolerates `ErrDeploymentsNotFound` (scope doesn't exist yet); other errors propagate.

Active state classification (`deployments.go`)
`IsActiveDeploymentState()` classifies 11 provisioning states as active, including transitional states (`Canceling`, `Deleting`, `DeletingResources`, `UpdatingDenyAssignments`) that can still block new deployments.

Scope interface (`scope.go`)
Added `ListActiveDeployments()` to both `ResourceGroupScope` and `SubscriptionScope`.

Error suggestion (`error_suggestions.yaml`)
Added a `DeploymentActive` rule with a user-friendly message and ARM troubleshooting link.

Test Coverage (8 tests, 24 subtests)
- `TestIsActiveDeploymentState`
- `TestWaitForActiveDeployments_NoActive`
- `TestWaitForActiveDeployments_InitialListError_NotFound`
- `TestWaitForActiveDeployments_InitialListError_Other`
- `TestWaitForActiveDeployments_ActiveThenClear`
- `TestWaitForActiveDeployments_CancelledContext`
- `TestWaitForActiveDeployments_PollError`
- `TestWaitForActiveDeployments_Timeout`

Related