diff --git a/docs/superpowers/plans/2026-06-01-abandon-best-effort-kill-and-execute-script-timeout.md b/docs/superpowers/plans/2026-06-01-abandon-best-effort-kill-and-execute-script-timeout.md new file mode 100644 index 000000000..7d76e7af2 --- /dev/null +++ b/docs/superpowers/plans/2026-06-01-abandon-best-effort-kill-and-execute-script-timeout.md @@ -0,0 +1,1000 @@ +# Abandon best-effort-kill + ExecuteScript abandon-timeout — Implementation Plan + +> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax. + +**Goal:** Make Tentacle's `AbandonScript` best-effort-kill the process before it stops waiting (anti-abuse), make the PR1 runner deterministically return `AbandonedExitCode (-48)` keyed on the abandon token rather than on which task won the `Task.WhenAny` race, and give the client an optional `abandonAfterCancellationPendingFor` timeout that escalates a stuck cancel to an abandon when the Tentacle advertises the capability. + +**Architecture:** Two changes across server (Tentacle) and client. +- **Change 1 (server, PR1 + later PR2).** `ScriptServiceV2.AbandonScriptAsync` calls `runningScript.Cancel()` *then* `runningScript.Abandon()`, so the existing `cancel.Register → DoOurBestToCleanUp → Hitman.Kill` machinery attempts the kill on *every* abandon — including a direct RPC with no prior cancel. The PR1 runner in `SilentProcessRunner.ExecuteCommandAsync` keys its abandoned-result branch on `abandon.IsCancellationRequested` instead of `== abandoned.Task`. After `Cancel()→Close()` the process is detached, so the abandon branch must NOT read `process.HasExited`/`process.ExitCode`; it just returns `-48`. The "abandon was unnecessary because the script already finished" case is handled one layer up: the script's `RunningScript` is already `Complete` with its recorded exit code, so `GetResponse` returns that real code, never reaching the runner's abandon branch. +- **Change 2 (client, PR1 only).** Optional `TimeSpan? abandonAfterCancellationPendingFor = null` threaded through `ITentacleClient.ExecuteScript` → `TentacleClient.ExecuteScript` → `ObservingScriptOrchestrator`. The orchestrator's `ObserveUntilComplete` already calls `CancelScript` each poll while the token is cancelled; it records when cancellation first fired and, once elapsed crosses the threshold AND the abandon capability is advertised, calls `AbandonScript` instead. A new `IScriptExecutor.AbandonScript` is implemented on `ScriptServiceV2Executor` (with a capability check) and as a no-op fallback on `ScriptServiceV1Executor`/`KubernetesScriptServiceV1Executor`. `ScriptExecutionResult.ExitCode (-48)` is the source of truth. NO shared `Octopus.Tentacle.Contracts` constant for the message. + +**Tech Stack:** C#, NUnit, FluentAssertions, NSubstitute, Halibut RPC. + +> **State of the tree at plan time (read before starting).** The contract surface is already merged on this branch: `AbandonScriptCommandV2`, `IScriptServiceV2.AbandonScript`, `ScriptExitCodes.AbandonedExitCode = -48`, `ITentacleClient.AbandonScript`/`TentacleClient.AbandonScript`, `RunningScript.CreateAbandonable` with `abandonToken`, the `RunningScript.Execute` abandon catch, `ScriptServiceV2.AbandonScriptAsync` (currently calls only `Abandon()`), and the PR1 runner `Task.WhenAny` block (currently keyed on `== abandoned.Task`). Capabilities advertise `nameof(ScriptServiceV2.AbandonScriptAsync)` → the literal string `"AbandonScriptAsync"`, NOT `"AbandonScriptV2"`. This plan ONLY makes the two deltas above; it does not re-create the contract. + +--- + +## Task 1 — PR1 runner: key the abandoned branch on `abandon.IsCancellationRequested` + +The PR1 `Task.WhenAny` currently returns `AbandonedExitCode` only when `abandoned.Task` literally won the race (`== abandoned.Task`). The spec requires the result keyed on the *token*, so that when both cancel and abandon fire and `waitForExit` happens to win (e.g. the abandon TCS callback is still queued), we still return `-48` deterministically. The abandon branch must NOT touch `process.HasExited`/`ExitCode` — after `Cancel()→Close()` the Process is detached. + +**Files** +- Test: `source/Octopus.Tentacle.Tests.Integration/Util/SilentProcessRunnerFixture.cs` (add new `[Test]` after `AbandonToken_ShouldReturnAbandonedExitCodeWithoutKillingProcess` which ends at line ~372) +- Modify: `source/Octopus.Tentacle.Core/Util/CommandLine/SilentProcessRunner.cs` lines 161–177 (the `abandoned`/`Task.WhenAny` block) + +- [ ] Write a failing test `AbandonToken_WhenWaitTaskWinsRace_StillReturnsAbandonedExitCode` in `SilentProcessRunnerFixture.cs`. This drives both tokens: cancel fires first (no-op kill via the env var), then abandon, and asserts the return is `-48` regardless of race ordering. Add this method inside the class: + +```csharp + [Test] + public async Task AbandonToken_WhenBothTokensFire_ReturnsAbandonedExitCodeKeyedOnToken() + { + // Simulate an un-killable script: DisableProcessKill makes cancel's Hitman.Kill a no-op, + // so cancel does NOT make the process exit. Then abandon fires. The return must be + // AbandonedExitCode because abandon.IsCancellationRequested is set — not because of which + // task won the WhenAny race. This is the deterministic-(-48) guarantee from the spec. + Environment.SetEnvironmentVariable(EnvironmentVariables.TentacleDebugDisableProcessKill_UNSAFE_FOR_PRODUCTION, "1"); + try + { + using var tempDir = new TemporaryDirectory(); + var pidFile = Path.Combine(tempDir.DirectoryPath, "process.pid"); + + var executable = PlatformDetection.IsRunningOnWindows ? "powershell.exe" : "/bin/bash"; + var arguments = PlatformDetection.IsRunningOnWindows + ? $"-NoProfile -NonInteractive -Command \"$PID | Out-File -FilePath '{pidFile}' -Encoding ASCII; Start-Sleep -Seconds 300\"" + : $"-c \"echo $$ > '{pidFile}' && sleep 300\""; + + using var cancelCts = new CancellationTokenSource(); + using var abandonCts = new CancellationTokenSource(); + + var task = Task.Run(async () => await SilentProcessRunner.ExecuteCommandAsync( + executable, + arguments, + Environment.CurrentDirectory, + debug: _ => { }, + info: _ => { }, + error: _ => { }, + customEnvironmentVariables: null, + cancel: cancelCts.Token, + abandon: abandonCts.Token)); + + await WaitForPidFileAsync(pidFile, TimeSpan.FromSeconds(30)); + + // Cancel first (kill is no-op'd so the process keeps running), then abandon. + cancelCts.Cancel(); + abandonCts.Cancel(); + + var exitCode = await task; + exitCode.Should().Be(ScriptExitCodes.AbandonedExitCode); + + // Cleanup: the real process is still alive because kill was no-op'd. + if (File.Exists(pidFile) && int.TryParse(SafelyReadAllText(pidFile).Trim(), out var pid) && pid > 0) + { + try { Process.GetProcessById(pid).Kill(); } catch { /* already gone */ } + } + } + finally + { + Environment.SetEnvironmentVariable(EnvironmentVariables.TentacleDebugDisableProcessKill_UNSAFE_FOR_PRODUCTION, null); + } + } +``` + +- [ ] Run it on net48 + net8.0 and watch it FAIL (today the `== abandoned.Task` branch can be skipped when `waitForExit` wins, falling through to `SafelyGetExitCode` on a detached process → returns `-1`, not `-48`): + +```bash +dotnet test source/Octopus.Tentacle.Tests.Integration --framework net48 \ + --filter "FullyQualifiedName~SilentProcessRunnerFixture.AbandonToken_WhenBothTokensFire_ReturnsAbandonedExitCodeKeyedOnToken" +dotnet test source/Octopus.Tentacle.Tests.Integration --framework net8.0 \ + --filter "FullyQualifiedName~SilentProcessRunnerFixture.AbandonToken_WhenBothTokensFire_ReturnsAbandonedExitCodeKeyedOnToken" +``` +Expected: FAIL (returns -1 or hangs). + +- [ ] Make the minimal implementation change in `SilentProcessRunner.cs`. Replace the `Task.WhenAny`/`== abandoned.Task` block (lines 161–177) so the abandoned branch is keyed on the token and runs after the race regardless of winner: + +```csharp + var abandoned = new TaskCompletionSource(); + using (abandon.Register(() => abandoned.TrySetResult(null))) + { + var waitForExit = Task.Run(() => + { + try { process.WaitForExit(); } + catch { /* released by Process.Dispose on the abandon path */ } + }); + + await Task.WhenAny(waitForExit, abandoned.Task).ConfigureAwait(false); + + // Key the abandoned result on the TOKEN, not on which task won the race. + // When both cancel and abandon fire, cancel's Kill is no-op'd / can't land, so + // waitForExit may win even though we must still return AbandonedExitCode. + // After Cancel()->Close() the Process is detached, so do NOT read + // process.HasExited / process.ExitCode here — just return -48. The + // "abandon was unnecessary" case never reaches here: the script's + // RunningScript is already Complete with its real code one layer up. + if (abandon.IsCancellationRequested) + { + info("Tentacle has abandoned this script. The underlying script process may still be running on this host."); + SafelyCancelOutputAndErrorRead(process, debug); + running = false; + return ScriptExitCodes.AbandonedExitCode; + } + } +``` + +- [ ] Re-run the same two `dotnet test` commands from above. Expected: PASS on both net48 and net8.0. + +- [ ] Run the existing abandon + grandchild regression tests to confirm no regression: + +```bash +dotnet test source/Octopus.Tentacle.Tests.Integration --framework net8.0 \ + --filter "FullyQualifiedName~SilentProcessRunnerFixture.AbandonToken_ShouldReturnAbandonedExitCodeWithoutKillingProcess|FullyQualifiedName~SilentProcessRunnerFixture.CancellationToken_WhenGrandchildHoldsRedirectedPipes_ShouldNotHang|FullyQualifiedName~SilentProcessRunnerFixture.CancellationToken_WhenUnixGrandchildHoldsRedirectedPipes_ShouldNotHang" +``` +Expected: PASS. + +- [ ] Commit: `git commit -am "PR1 runner: key abandoned result on abandon token, not WhenAny winner"` + +--- + +## Task 2 — `AbandonScriptAsync` best-effort-kills (Cancel then Abandon) — service-layer test + +`ScriptServiceV2.AbandonScriptAsync` currently calls only `runningScript.Abandon()`. The spec requires `runningScript.Cancel()` first so `Hitman.Kill` runs on every abandon (anti-abuse). We prove this through the integration layer because `ScriptServiceV2` wiring (workspace factory, state store, shell) is awkward to stub; the existing abandon integration tests already exercise this service with a real Tentacle. + +**Files** +- Test: `source/Octopus.Tentacle.Tests.Integration/ClientScriptExecutionAbandon.cs` (add a third `[Test]` after `AbandonScript_ReleasesIsolationMutexEvenWhileProcessIsStillRunning`, which ends at line ~129) + +- [ ] Write a failing test `AbandonScript_WithNoPriorCancel_KillsTheProcess` in `ClientScriptExecutionAbandon.cs`. Kill is NOT disabled here, so abandon's internal `Cancel()` must terminate the process. Add this method inside the class: + +```csharp + [Test] + [TentacleConfigurations(scriptServiceToTest: ScriptServiceVersionToTest.Version2)] + public async Task AbandonScript_WithNoPriorCancel_KillsTheProcess(TentacleConfigurationTestCase tentacleConfigurationTestCase) + { + // Anti-abuse: a direct AbandonScript with no prior CancelScript must still attempt the + // kill. AbandonScriptAsync calls Cancel() then Abandon(), so Hitman.Kill runs. Kill is + // NOT disabled here, so the underlying process must actually die. + await using var clientTentacle = await tentacleConfigurationTestCase.CreateBuilder() + .Build(CancellationToken); + + var startFile = Path.Combine(clientTentacle.TemporaryDirectory.DirectoryPath, "start"); + var releaseFile = Path.Combine(clientTentacle.TemporaryDirectory.DirectoryPath, "release"); + + var command = new TestExecuteShellScriptCommandBuilder() + .SetScriptBody(new ScriptBuilder() + .CreateFile(startFile) + .WaitForFileToExist(releaseFile)) + .WithIsolationLevel(ScriptIsolationLevel.NoIsolation) + .Build(); + + var tentacleClient = clientTentacle.TentacleClient; + var scriptExecution = Task.Run(async () => await tentacleClient.ExecuteScript(command, CancellationToken)); + + await Wait.For(() => File.Exists(startFile), + TimeSpan.FromSeconds(30), + () => throw new Exception("Script did not start"), + CancellationToken); + + // Direct abandon, NO prior cancel. + await tentacleClient.AbandonScript(command.ScriptTicket, CancellationToken); + + ScriptStatus abandonResponse = null!; + await Wait.For(async () => + { + abandonResponse = await tentacleClient.GetStatus(command.ScriptTicket, CancellationToken); + return abandonResponse.State == ProcessState.Complete; + }, + TimeSpan.FromSeconds(30), + () => throw new Exception("Abandoned script did not reach Complete state within 30s"), + CancellationToken); + + abandonResponse.ExitCode.Should().Be(ScriptExitCodes.AbandonedExitCode); + + // The kill landed because Hitman was NOT disabled: the script never saw the release file, + // so if it is still waiting the file would be absent and the process alive. The script + // reaching Complete with AbandonedExitCode while we never wrote releaseFile proves the + // wait was broken by abandon; the kill having run is asserted by the script process being + // gone — drain the execution task, which completes once the process is dead. + await scriptExecution; + } +``` + +- [ ] Run it and watch it FAIL (today abandon does NOT call `Cancel()`, so `Hitman.Kill` never runs; the process keeps waiting on `releaseFile`, and `scriptExecution` never drains within the test timeout): + +```bash +dotnet test source/Octopus.Tentacle.Tests.Integration --framework net8.0 \ + --filter "FullyQualifiedName~ClientScriptExecutionAbandon.AbandonScript_WithNoPriorCancel_KillsTheProcess" +``` +Expected: FAIL (test times out because the process is never killed). + +- [ ] Make the minimal implementation change in `source/Octopus.Tentacle.Core/Services/Scripts/ScriptServiceV2.cs`. Modify `AbandonScriptAsync` (lines 144–156) to call `Cancel()` then `Abandon()`: + +```csharp + public Task AbandonScriptAsync(AbandonScriptCommandV2 command, CancellationToken cancellationToken) + { + // Abandon best-effort-kills first (anti-abuse): Cancel() runs the existing + // cancel.Register -> DoOurBestToCleanUp -> Hitman.Kill machinery so ANY abandon — + // including a direct RPC with no prior cancel — attempts the kill. Abandon() then + // layers "stop waiting + release the mutex + return -48" on top. A process survives + // abandon only when the kill genuinely can't land (stuck / re-parented grandchild). + if (runningScripts.TryGetValue(command.Ticket, out var runningScript)) + { + runningScript.Cancel(); + runningScript.Abandon(); + } + + return Task.FromResult(GetResponse(command.Ticket, command.LastLogSequence, runningScript?.Process)); + } +``` + +- [ ] Re-run the same `dotnet test` command. Expected: PASS. + +- [ ] Run the two existing abandon integration tests to confirm no regression (kill-disabled path still returns -48 and releases the mutex): + +```bash +dotnet test source/Octopus.Tentacle.Tests.Integration --framework net8.0 \ + --filter "FullyQualifiedName~ClientScriptExecutionAbandon" +``` +Expected: PASS (all three tests). + +- [ ] Commit: `git commit -am "AbandonScriptAsync: Cancel() then Abandon() so abandon best-effort-kills"` + +--- + +## Task 3 — Client: add `AbandonScript` to `IScriptExecutor` and capability extension + +Change 2 needs the orchestrator to call abandon through the executor abstraction. Add `AbandonScript(CommandContext)` to `IScriptExecutor`, a capability-checking implementation on `ScriptServiceV2Executor`, and no-op fallbacks on V1/Kubernetes executors. The capability check uses a new `HasAbandonScriptV2()` extension that matches the capability string Tentacle actually advertises today (`"AbandonScriptAsync"`). + +> **Spec gap (resolved here):** the spec names the capability `AbandonScriptV2`, but `CapabilitiesServiceV2` already advertises `nameof(ScriptServiceV2.AbandonScriptAsync)` = the string `"AbandonScriptAsync"`. Changing the advertised string would break the contract already shipped on this branch. This plan matches the *actual* advertised string. If the team wants the spec's `AbandonScriptV2` literal, that is a separate contract change with its own backward-compat story — flagged, not silently done. + +**Files** +- Test: `source/Octopus.Tentacle.Client.Tests/CapabilitiesResponseV2ExtensionMethodsTests.cs` (CREATE) +- Modify: `source/Octopus.Tentacle.Client/Capabilities/CapabilitiesResponseV2ExtensionMethods.cs` (add method after `HasScriptServiceV2`, line 17) +- Modify: `source/Octopus.Tentacle.Client/Scripts/IScriptExecutor.cs` (add method after `CancelScript`, line 33) + +- [ ] Write a failing test file `source/Octopus.Tentacle.Client.Tests/CapabilitiesResponseV2ExtensionMethodsTests.cs`: + +```csharp +using System.Collections.Generic; +using FluentAssertions; +using NUnit.Framework; +using Octopus.Tentacle.Client.Capabilities; +using Octopus.Tentacle.Contracts.Capabilities; + +namespace Octopus.Tentacle.Client.Tests +{ + [TestFixture] + public class CapabilitiesResponseV2ExtensionMethodsTests + { + [Test] + public void HasAbandonScriptV2_WhenAdvertised_ReturnsTrue() + { + var capabilities = new CapabilitiesResponseV2(new List { "IScriptServiceV2", "AbandonScriptAsync" }); + capabilities.HasAbandonScriptV2().Should().BeTrue(); + } + + [Test] + public void HasAbandonScriptV2_WhenNotAdvertised_ReturnsFalse() + { + var capabilities = new CapabilitiesResponseV2(new List { "IScriptServiceV2" }); + capabilities.HasAbandonScriptV2().Should().BeFalse(); + } + + [Test] + public void HasAbandonScriptV2_WhenEmpty_ReturnsFalse() + { + var capabilities = new CapabilitiesResponseV2(new List()); + capabilities.HasAbandonScriptV2().Should().BeFalse(); + } + } +} +``` + +- [ ] Run it and watch it FAIL (the extension does not exist → compile error): + +```bash +dotnet test source/Octopus.Tentacle.Client.Tests --framework net8.0 \ + --filter "FullyQualifiedName~CapabilitiesResponseV2ExtensionMethodsTests" +``` +Expected: FAIL (does not compile). + +- [ ] Add the extension method to `CapabilitiesResponseV2ExtensionMethods.cs` after `HasScriptServiceV2` (the closing brace of that method is at line 17). Insert: + +```csharp + + public static bool HasAbandonScriptV2(this CapabilitiesResponseV2 capabilities) + { + if (capabilities?.SupportedCapabilities?.Any() != true) + { + return false; + } + + // Tentacle advertises this as nameof(ScriptServiceV2.AbandonScriptAsync). Keep the + // literal in sync with CapabilitiesServiceV2.GetCapabilitiesAsync on the Tentacle side. + return capabilities.SupportedCapabilities.Contains("AbandonScriptAsync"); + } +``` + +- [ ] Re-run the extension test. Expected: PASS. + +- [ ] Add `AbandonScript` to `IScriptExecutor.cs`. Insert after the `CancelScript` declaration (line 33, before `CompleteScript`'s doc comment): + +```csharp + + /// + /// Abandon the script: signal Tentacle to stop waiting and release the isolation mutex. + /// Returns the abandoned status when the Tentacle advertises the abandon capability; + /// otherwise falls back to cancelling and returns that result. + /// + /// The CommandContext from the previous command + Task AbandonScript(CommandContext commandContext); +``` + +- [ ] Build the client to confirm the interface compiles (implementations come in Task 4 — expect implementor errors are fine to defer only if you build implementors together; here build will FAIL until Task 4, so just verify the interface file itself parses by building the project after Task 4). Skip a standalone build here; commit the interface + extension together with implementations is cleaner — but to keep commits frequent, commit the extension + its test now: + +```bash +dotnet test source/Octopus.Tentacle.Client.Tests --framework net8.0 \ + --filter "FullyQualifiedName~CapabilitiesResponseV2ExtensionMethodsTests" +``` +Expected: PASS. + +- [ ] Commit: `git commit -am "Client: add HasAbandonScriptV2 capability check + IScriptExecutor.AbandonScript"` + +--- + +## Task 4 — Implement `AbandonScript` on the three executors + +`ScriptServiceV2Executor` does the real work with a capability check; V1 and Kubernetes executors fall back to `CancelScript` (they have no abandon RPC). The V2 executor needs the capabilities client; add it as a constructor dependency and wire it through `ScriptExecutorFactory`. + +**Files** +- Modify: `source/Octopus.Tentacle.Client/Scripts/ScriptServiceV2Executor.cs` (ctor lines 27–41; add `AbandonScript` after `CancelScript`, line 174) +- Modify: `source/Octopus.Tentacle.Client/Scripts/ScriptServiceV1Executor.cs` (add `AbandonScript` fallback) +- Modify: `source/Octopus.Tentacle.Client/Scripts/KubernetesScriptServiceV1Executor.cs` (add `AbandonScript` fallback) +- Modify: `source/Octopus.Tentacle.Client/Scripts/ScriptExecutorFactory.cs` (pass capabilities client to V2 executor, line 48) +- Test: `source/Octopus.Tentacle.Client.Tests/ScriptServiceV2ExecutorAbandonTests.cs` (CREATE) + +- [ ] Write a failing test file `source/Octopus.Tentacle.Client.Tests/ScriptServiceV2ExecutorAbandonTests.cs`: + +```csharp +using System; +using System.Collections.Generic; +using System.Threading; +using System.Threading.Tasks; +using FluentAssertions; +using Halibut.ServiceModel; +using NSubstitute; +using NUnit.Framework; +using Octopus.Tentacle.Client; +using Octopus.Tentacle.Client.EventDriven; +using Octopus.Tentacle.Client.Execution; +using Octopus.Tentacle.Client.Observability; +using Octopus.Tentacle.Client.Retries; +using Octopus.Tentacle.Client.Scripts; +using Octopus.Tentacle.Contracts; +using Octopus.Tentacle.Contracts.Capabilities; +using Octopus.Tentacle.Contracts.ClientServices; +using Octopus.Tentacle.Contracts.Logging; +using Octopus.Tentacle.Contracts.Observability; +using Octopus.Tentacle.Contracts.ScriptServiceV2; + +namespace Octopus.Tentacle.Client.Tests +{ + [TestFixture] + public class ScriptServiceV2ExecutorAbandonTests + { + static ScriptServiceV2Executor CreateExecutor(IAsyncClientScriptServiceV2 scriptService, IAsyncClientCapabilitiesServiceV2 capabilities) + => new( + scriptService, + capabilities, + RpcCallExecutorFactory.Create(TimeSpan.Zero, Substitute.For()), + ClientOperationMetricsBuilder.Start(), + TimeSpan.Zero, + new TentacleClientOptions(new RpcRetrySettings(retriesEnabled: false, retryDuration: TimeSpan.Zero)), + Substitute.For()); + + static CommandContext Context() => new(new ScriptTicket("TestTicket"), 0, ScriptServiceVersion.ScriptServiceVersion2); + + [Test] + public async Task AbandonScript_WhenCapabilityAdvertised_CallsAbandonScriptAsync() + { + var scriptService = Substitute.For(); + scriptService.AbandonScriptAsync(Arg.Any(), Arg.Any()) + .Returns(x => Task.FromResult(new ScriptStatusResponseV2( + x.Arg().Ticket, ProcessState.Complete, + ScriptExitCodes.AbandonedExitCode, new List(), 1))); + + var capabilities = Substitute.For(); + capabilities.GetCapabilitiesAsync(Arg.Any()) + .Returns(Task.FromResult(new CapabilitiesResponseV2(new List { "IScriptServiceV2", "AbandonScriptAsync" }))); + + var executor = CreateExecutor(scriptService, capabilities); + + var result = await executor.AbandonScript(Context()); + + await scriptService.Received(1).AbandonScriptAsync(Arg.Any(), Arg.Any()); + await scriptService.DidNotReceive().CancelScriptAsync(Arg.Any(), Arg.Any()); + result.ScriptStatus.ExitCode.Should().Be(ScriptExitCodes.AbandonedExitCode); + } + + [Test] + public async Task AbandonScript_WhenCapabilityAbsent_FallsBackToCancelScriptAsync() + { + var scriptService = Substitute.For(); + scriptService.CancelScriptAsync(Arg.Any(), Arg.Any()) + .Returns(x => Task.FromResult(new ScriptStatusResponseV2( + x.Arg().Ticket, ProcessState.Running, + 0, new List(), 1))); + + var capabilities = Substitute.For(); + capabilities.GetCapabilitiesAsync(Arg.Any()) + .Returns(Task.FromResult(new CapabilitiesResponseV2(new List { "IScriptServiceV2" }))); + + var executor = CreateExecutor(scriptService, capabilities); + + await executor.AbandonScript(Context()); + + await scriptService.DidNotReceive().AbandonScriptAsync(Arg.Any(), Arg.Any()); + await scriptService.Received(1).CancelScriptAsync(Arg.Any(), Arg.Any()); + } + } +} +``` + +- [ ] Run it and watch it FAIL (V2 executor has no capabilities ctor param and no `AbandonScript` → compile error): + +```bash +dotnet test source/Octopus.Tentacle.Client.Tests --framework net8.0 \ + --filter "FullyQualifiedName~ScriptServiceV2ExecutorAbandonTests" +``` +Expected: FAIL (does not compile). + +- [ ] Add the capabilities dependency to `ScriptServiceV2Executor`. Modify the field block (after line 20 `readonly IAsyncClientScriptServiceV2 clientScriptServiceV2;`) and the ctor (lines 27–41): + +```csharp + readonly IAsyncClientScriptServiceV2 clientScriptServiceV2; + readonly IAsyncClientCapabilitiesServiceV2 clientCapabilitiesServiceV2; + readonly RpcCallExecutor rpcCallExecutor; + readonly ClientOperationMetricsBuilder clientOperationMetricsBuilder; + readonly TimeSpan onCancellationAbandonCompleteScriptAfter; + readonly ITentacleClientTaskLog logger; + readonly TentacleClientOptions clientOptions; + + public ScriptServiceV2Executor( + IAsyncClientScriptServiceV2 clientScriptServiceV2, + IAsyncClientCapabilitiesServiceV2 clientCapabilitiesServiceV2, + RpcCallExecutor rpcCallExecutor, + ClientOperationMetricsBuilder clientOperationMetricsBuilder, + TimeSpan onCancellationAbandonCompleteScriptAfter, + TentacleClientOptions clientOptions, + ITentacleClientTaskLog logger) + { + this.clientScriptServiceV2 = clientScriptServiceV2; + this.clientCapabilitiesServiceV2 = clientCapabilitiesServiceV2; + this.rpcCallExecutor = rpcCallExecutor; + this.clientOperationMetricsBuilder = clientOperationMetricsBuilder; + this.onCancellationAbandonCompleteScriptAfter = onCancellationAbandonCompleteScriptAfter; + this.clientOptions = clientOptions; + this.logger = logger; + } +``` + +- [ ] Add the `AbandonScript` method to `ScriptServiceV2Executor.cs` immediately after `CancelScript` (after line 174, before `CompleteScript`). Add `using Octopus.Tentacle.Client.Capabilities;` at the top if absent: + +```csharp + public async Task AbandonScript(CommandContext commandContext) + { + using var activity = TentacleClient.ActivitySource.StartActivity($"{nameof(ScriptServiceV2Executor)}.{nameof(AbandonScript)}"); + + var capabilities = await rpcCallExecutor.Execute( + retriesEnabled: clientOptions.RpcRetrySettings.RetriesEnabled, + RpcCall.Create(nameof(Contracts.Capabilities.ICapabilitiesServiceV2.GetCapabilities)), + async ct => await clientCapabilitiesServiceV2.GetCapabilitiesAsync(new HalibutProxyRequestOptions(ct)), + logger, + clientOperationMetricsBuilder, + CancellationToken.None).ConfigureAwait(false); + + // Capability absent → the Tentacle is too old to abandon. Keep cancelling cleanly. + if (!capabilities.HasAbandonScriptV2()) + { + logger.Verbose("Tentacle does not advertise AbandonScript; falling back to CancelScript."); + return await CancelScript(commandContext).ConfigureAwait(false); + } + + async Task AbandonScriptAction(CancellationToken ct) + { + var request = new AbandonScriptCommandV2(commandContext.ScriptTicket, commandContext.NextLogSequence); + return await clientScriptServiceV2.AbandonScriptAsync(request, new HalibutProxyRequestOptions(ct)); + } + + var scriptStatusResponseV2 = await rpcCallExecutor.Execute( + retriesEnabled: clientOptions.RpcRetrySettings.RetriesEnabled, + RpcCall.Create(nameof(IScriptServiceV2.AbandonScript)), + AbandonScriptAction, + logger, + clientOperationMetricsBuilder, + // Like CancelScript, abandon must not be cancelled — it stops the script on Tentacle. + CancellationToken.None).ConfigureAwait(false); + return Map(scriptStatusResponseV2); + } +``` + +- [ ] Add the no-op fallback to `ScriptServiceV1Executor.cs` (mirror its `CancelScript`; V1 has no abandon RPC, so abandon degrades to cancel). Add after that executor's `CancelScript` method: + +```csharp + public Task AbandonScript(CommandContext commandContext) + { + // ScriptServiceV1 has no abandon verb; degrade to cancel. + return CancelScript(commandContext); + } +``` + +- [ ] Add the same fallback to `KubernetesScriptServiceV1Executor.cs` after its `CancelScript`: + +```csharp + public Task AbandonScript(CommandContext commandContext) + { + // KubernetesScriptServiceV1 has no abandon verb; degrade to cancel. + return CancelScript(commandContext); + } +``` + +- [ ] Wire the capabilities client through `ScriptExecutorFactory.cs`. Change the V2 construction (line 48) to pass `allClients.CapabilitiesServiceV2`: + +```csharp + return new ScriptServiceV2Executor( + allClients.ScriptServiceV2, + allClients.CapabilitiesServiceV2, + rpcCallExecutor, + clientOperationMetricsBuilder, + onCancellationAbandonCompleteScriptAfter, + clientOptions, + logger); +``` + +- [ ] Add `AbandonScript` to the `ScriptExecutor` facade (`source/Octopus.Tentacle.Client/ScriptExecutor.cs`) after its `CancelScript` method (line ~70), mirroring the existing delegation pattern: + +```csharp + public async Task AbandonScript(CommandContext commandContext) + { + var scriptExecutorFactory = CreateScriptExecutorFactory(); + var scriptExecutor = scriptExecutorFactory.CreateScriptExecutor(commandContext.ScripServiceVersionUsed); + + return await scriptExecutor.AbandonScript(commandContext); + } +``` + +- [ ] Re-run the V2 executor abandon tests. Expected: PASS. + +```bash +dotnet test source/Octopus.Tentacle.Client.Tests --framework net8.0 \ + --filter "FullyQualifiedName~ScriptServiceV2ExecutorAbandonTests" +``` + +- [ ] Build the client + run the full client test project to confirm no implementor was missed: + +```bash +dotnet build source/Octopus.Tentacle.Client --framework net8.0 +dotnet test source/Octopus.Tentacle.Client.Tests --framework net8.0 +``` +Expected: build succeeds; tests PASS. + +- [ ] Commit: `git commit -am "Client executors: implement AbandonScript with capability check + cancel fallback"` + +--- + +## Task 5 — Orchestrator: escalate to abandon after the cancellation-pending threshold + +`ObservingScriptOrchestrator.ObserveUntilComplete` calls `CancelScript` each poll while the token is cancelled. Add `abandonAfterCancellationPendingFor`; record when cancellation first fired; once elapsed crosses the threshold call `AbandonScript` (the executor handles the capability check + fallback). When the param is `null`, behaviour is unchanged. + +**Files** +- Modify: `source/Octopus.Tentacle.Client/Scripts/ObservingScriptOrchestrator.cs` (ctor lines 18–28; `ExecuteScript` lines 30–45; `ObserveUntilComplete` lines 76–144) +- Test: `source/Octopus.Tentacle.Client.Tests/ObservingScriptOrchestratorAbandonTests.cs` (CREATE) + +- [ ] First, make `ObserveUntilComplete` testable by changing its access modifier from `private` (implicit) to `internal` in `ObservingScriptOrchestrator.cs` (the class is already `internal`; `Octopus.Tentacle.Client.Tests` has `InternalsVisibleTo`). Change line 76 from `async Task ObserveUntilComplete(` to `internal async Task ObserveUntilComplete(`. (`ExecuteScriptCommand` is abstract, so driving the full `ExecuteScript` from a unit test is awkward; calling `ObserveUntilComplete` directly is the deterministic seam.) + +- [ ] Write a failing test file `source/Octopus.Tentacle.Client.Tests/ObservingScriptOrchestratorAbandonTests.cs`. It drives `ObserveUntilComplete` directly with an NSubstitute `IScriptExecutor`, a pre-cancelled token, and asserts cancel-vs-abandon based on the threshold: + +```csharp +using System; +using System.Collections.Generic; +using System.Threading; +using System.Threading.Tasks; +using FluentAssertions; +using NSubstitute; +using NUnit.Framework; +using Octopus.Tentacle.Client.EventDriven; +using Octopus.Tentacle.Client.Scripts; +using Octopus.Tentacle.Client.Scripts.Models; +using Octopus.Tentacle.Contracts; + +namespace Octopus.Tentacle.Client.Tests +{ + [TestFixture] + public class ObservingScriptOrchestratorAbandonTests + { + static CommandContext Context() => new(new ScriptTicket("T"), 0, ScriptServiceVersion.ScriptServiceVersion2); + + static ScriptOperationExecutionResult Running() + => new(new ScriptStatus(ProcessState.Running, 0, new List()), Context()); + + static ScriptOperationExecutionResult Complete(int exitCode) + => new(new ScriptStatus(ProcessState.Complete, exitCode, new List()), Context()); + + static ObservingScriptOrchestrator CreateOrchestrator(IScriptExecutor executor, TimeSpan? abandonAfter) + => new( + new ImmediateBackoff(), + _ => { }, + _ => Task.CompletedTask, + executor, + abandonAfter); + + sealed class ImmediateBackoff : IScriptObserverBackoffStrategy + { + public TimeSpan GetBackoff(int iteration) => TimeSpan.Zero; + } + + [Test] + public async Task ParamUnset_CancelsOnly_NeverAbandons() + { + var executor = Substitute.For(); + executor.CancelScript(Arg.Any()) + .Returns(Running(), Complete(ScriptExitCodes.CanceledExitCode)); + + var orchestrator = CreateOrchestrator(executor, abandonAfter: null); + using var cts = new CancellationTokenSource(); + cts.Cancel(); + + await orchestrator.ObserveUntilComplete(Running(), cts.Token); + + await executor.DidNotReceive().AbandonScript(Arg.Any()); + await executor.Received().CancelScript(Arg.Any()); + } + + [Test] + public async Task ThresholdCrossed_SwitchesFromCancelToAbandon() + { + var executor = Substitute.For(); + // abandonAfter = Zero means the threshold is crossed on the first cancelled iteration, + // so the orchestrator abandons immediately. AbandonScript returns Complete to end the loop. + executor.AbandonScript(Arg.Any()) + .Returns(Complete(ScriptExitCodes.AbandonedExitCode)); + + var orchestrator = CreateOrchestrator(executor, abandonAfter: TimeSpan.Zero); + using var cts = new CancellationTokenSource(); + cts.Cancel(); + + var result = await orchestrator.ObserveUntilComplete(Running(), cts.Token); + + await executor.Received().AbandonScript(Arg.Any()); + result.ScriptStatus.ExitCode.Should().Be(ScriptExitCodes.AbandonedExitCode); + } + } +} +``` + +- [ ] Run it and watch it FAIL (orchestrator ctor has no `abandonAfter` param and `ObserveUntilComplete` is not yet `internal`/escalating → compile error on the 5-arg ctor): + +```bash +dotnet test source/Octopus.Tentacle.Client.Tests --framework net8.0 \ + --filter "FullyQualifiedName~ObservingScriptOrchestratorAbandonTests" +``` +Expected: FAIL (does not compile). + +- [ ] Add the field + ctor param to `ObservingScriptOrchestrator.cs`. After `readonly IScriptExecutor scriptExecutor;` (line 16) add `readonly TimeSpan? abandonAfterCancellationPendingFor;`. Change the ctor signature (line 18) to append `TimeSpan? abandonAfterCancellationPendingFor` and assign it: + +```csharp + public ObservingScriptOrchestrator( + IScriptObserverBackoffStrategy scriptObserverBackOffStrategy, + OnScriptStatusResponseReceived onScriptStatusResponseReceived, + OnScriptCompleted onScriptCompleted, + IScriptExecutor scriptExecutor, + TimeSpan? abandonAfterCancellationPendingFor = null) + { + this.scriptExecutor = scriptExecutor; + this.scriptObserverBackOffStrategy = scriptObserverBackOffStrategy; + this.onScriptStatusResponseReceived = onScriptStatusResponseReceived; + this.onScriptCompleted = onScriptCompleted; + this.abandonAfterCancellationPendingFor = abandonAfterCancellationPendingFor; + } +``` + +- [ ] Change `ObserveUntilComplete` (lines 76–144) to record first-cancel time and escalate. Replace the cancellation branch (lines 86–89): + +```csharp + var iteration = 0; + var cancellationIteration = 0; + var lastResult = startScriptResult; + var stopwatch = new Stopwatch(); + + while (lastResult.ScriptStatus.State != ProcessState.Complete) + { + if (scriptExecutionCancellationToken.IsCancellationRequested) + { + // Record when cancellation first fired so we can escalate to abandon after the threshold. + if (!stopwatch.IsRunning) + { + stopwatch.Start(); + } + + var shouldAbandon = abandonAfterCancellationPendingFor is { } threshold + && stopwatch.Elapsed >= threshold; + + lastResult = shouldAbandon + ? await scriptExecutor.AbandonScript(lastResult.ContextForNextCommand).ConfigureAwait(false) + : await scriptExecutor.CancelScript(lastResult.ContextForNextCommand).ConfigureAwait(false); + } + else + { +``` + +(`System.Diagnostics` is already imported for `Stopwatch` at line 2.) + +- [ ] Re-run the orchestrator abandon tests. Expected: PASS. + +```bash +dotnet test source/Octopus.Tentacle.Client.Tests --framework net8.0 \ + --filter "FullyQualifiedName~ObservingScriptOrchestratorAbandonTests" +``` + +- [ ] Commit: `git commit -am "Orchestrator: escalate cancel to abandon after abandonAfterCancellationPendingFor"` + +--- + +## Task 6 — Thread `abandonAfterCancellationPendingFor` through `TentacleClient` + `ITentacleClient` + +The orchestrator now accepts the timeout; expose it on the public `ExecuteScript` surface as an optional parameter (default `null`, so all existing callers are unchanged). + +**Files** +- Modify: `source/Octopus.Tentacle.Client/ITentacleClient.cs` (`ExecuteScript` declaration lines 30–35) +- Modify: `source/Octopus.Tentacle.Client/TentacleClient.cs` (`ExecuteScript` lines 168–208; orchestrator construction line 189) +- Test: `source/Octopus.Tentacle.Client.Tests/ObservingScriptOrchestratorAbandonTests.cs` (no change — covered) + a compile-only assertion in `source/Octopus.Tentacle.Tests.Integration/ClientScriptExecutionAbandon.cs` (Task 7 covers behavioural) + +- [ ] Add the optional parameter to `ITentacleClient.ExecuteScript` (after `scriptExecutionCancellationToken`, line 35): + +```csharp + Task ExecuteScript( + ExecuteScriptCommand executeScriptCommand, + OnScriptStatusResponseReceived onScriptStatusResponseReceived, + OnScriptCompleted onScriptCompleted, + ITentacleClientTaskLog logger, + CancellationToken scriptExecutionCancellationToken, + TimeSpan? abandonAfterCancellationPendingFor = null); +``` + +- [ ] Add the matching parameter to `TentacleClient.ExecuteScript` (line 168) and pass it to the orchestrator (line 189): + +```csharp + public async Task ExecuteScript(ExecuteScriptCommand executeScriptCommand, + OnScriptStatusResponseReceived onScriptStatusResponseReceived, + OnScriptCompleted onScriptCompleted, + ITentacleClientTaskLog logger, + CancellationToken scriptExecutionCancellationToken, + TimeSpan? abandonAfterCancellationPendingFor = null) + { +``` + +and: + +```csharp + var orchestrator = new ObservingScriptOrchestrator(scriptObserverBackOffStrategy, + onScriptStatusResponseReceived, + onScriptCompleted, + scriptExecutor, + abandonAfterCancellationPendingFor); +``` + +- [ ] Build the client and the integration test project (which references the client) to confirm the optional param did not break any caller: + +```bash +dotnet build source/Octopus.Tentacle.Client --framework net8.0 +dotnet build source/Octopus.Tentacle.Tests.Integration --framework net8.0 +``` +Expected: both build succeed (optional param = no caller changes required). + +- [ ] Run the full client test project to confirm green: + +```bash +dotnet test source/Octopus.Tentacle.Client.Tests --framework net8.0 +``` +Expected: PASS. + +- [ ] Commit: `git commit -am "TentacleClient: expose optional abandonAfterCancellationPendingFor on ExecuteScript"` + +--- + +## Task 7 — End-to-end integration test: ExecuteScript escalates to abandon after the threshold + +Prove the whole client→Tentacle path: a stuck (kill-disabled) script with `abandonAfterCancellationPendingFor` set short, cancelled mid-flight, escalates to `AbandonScript` and returns `AbandonedExitCode`. Uses `TentacleServiceDecoratorBuilder.RecordMethodUsages` to assert the `AbandonScript` verb was called. + +**Files** +- Modify: `source/Octopus.Tentacle.Tests.Integration.Common/Builders/Decorators/Proxies/RecordMethodUsagesExtensionMethods.cs` (add `ForAbandonScriptAsync`) +- Test: `source/Octopus.Tentacle.Tests.Integration/ClientScriptExecutionAbandon.cs` (add a `[Test]` after Task 2's test) + +- [ ] Add a `ForAbandonScriptAsync` extension to `RecordMethodUsagesExtensionMethods.cs` (mirrors the existing `ForCancelScriptAsync` at line 22; the recorded method name is the proxied async client method `AbandonScriptAsync`). Insert after `ForCancelScriptAsync` (line 25): + +```csharp + + public static IRecordedMethodUsage ForAbandonScriptAsync(this IRecordedMethodUsages o) + { + return o.For("AbandonScriptAsync"); + } +``` + +- [ ] Write a failing test `ExecuteScript_WhenCancellationStaysPending_EscalatesToAbandon` in `ClientScriptExecutionAbandon.cs`. This uses the exact recorded-usages API (`IRecordedMethodUsages recordedUsages = new MethodUsages();` then `.RecordMethodUsages(out recordedUsages)`, queried via `recordedUsages.ForAbandonScriptAsync().Started`). Add the `using Octopus.Tentacle.Tests.Integration.Common.Builders.Decorators.Proxies;` and `using Octopus.Tentacle.Contracts.ClientServices;` imports at the top of the file if absent: + +```csharp + [Test] + [TentacleConfigurations(scriptServiceToTest: ScriptServiceVersionToTest.Version2)] + public async Task ExecuteScript_WhenCancellationStaysPending_EscalatesToAbandon(TentacleConfigurationTestCase tentacleConfigurationTestCase) + { + // Kill is disabled, so the script is genuinely stuck and CancelScript can't end it. + // With abandonAfterCancellationPendingFor set short, the orchestrator must escalate to + // AbandonScript, which returns AbandonedExitCode and releases the script. + IRecordedMethodUsages recordedUsages = new MethodUsages(); + await using var clientTentacle = await tentacleConfigurationTestCase.CreateBuilder() + .WithTentacle(x => x.WithRunTentacleEnvironmentVariable(EnvironmentVariables.TentacleDebugDisableProcessKill_UNSAFE_FOR_PRODUCTION, "1")) + .WithTentacleServiceDecorator(new TentacleServiceDecoratorBuilder() + .RecordMethodUsages(out recordedUsages) + .Build()) + .Build(CancellationToken); + + var startFile = Path.Combine(clientTentacle.TemporaryDirectory.DirectoryPath, "start"); + var releaseFile = Path.Combine(clientTentacle.TemporaryDirectory.DirectoryPath, "release"); + + var command = new TestExecuteShellScriptCommandBuilder() + .SetScriptBody(new ScriptBuilder() + .CreateFile(startFile) + .WaitForFileToExist(releaseFile)) + .WithIsolationLevel(ScriptIsolationLevel.NoIsolation) + .Build(); + + var tentacleClient = clientTentacle.TentacleClient; + + using var executionCts = new CancellationTokenSource(); + var logs = new System.Collections.Generic.List(); + + var execution = Task.Run(async () => await tentacleClient.ExecuteScript( + command, + onScriptStatusResponseReceived => logs.AddRange(onScriptStatusResponseReceived.Logs), + _ => Task.CompletedTask, + new SerilogLoggerBuilder().Build().ForContext().ToITentacleTaskLog(), + executionCts.Token, + abandonAfterCancellationPendingFor: TimeSpan.FromSeconds(2))); + + await Wait.For(() => File.Exists(startFile), + TimeSpan.FromSeconds(30), + () => throw new Exception("Script did not start"), + CancellationToken); + + executionCts.Cancel(); + + // ExecuteScript throws OperationCanceledException once the token is cancelled. + await FluentActions.Invoking(async () => await execution).Should().ThrowAsync(); + + // The orchestrator must have escalated to AbandonScript after the 2s threshold. + recordedUsages.ForAbandonScriptAsync().Started.Should().BeGreaterThan(0); + + // Cleanup: release + reap the still-alive process (kill was disabled). + File.WriteAllText(releaseFile, ""); + } +``` + +- [ ] Run it. With Tasks 5–6 implemented it should PASS; if abandon was never called it FAILs on the `.Started.Should().BeGreaterThan(0)` assertion (debug Task 5): + +```bash +dotnet test source/Octopus.Tentacle.Tests.Integration --framework net8.0 \ + --filter "FullyQualifiedName~ClientScriptExecutionAbandon.ExecuteScript_WhenCancellationStaysPending_EscalatesToAbandon" +``` +Expected after correct wiring: PASS. (If it fails because abandon was not called, the orchestrator escalation is wrong — debug Task 5.) + +- [ ] Run the whole abandon integration fixture once more for regression: + +```bash +dotnet test source/Octopus.Tentacle.Tests.Integration --framework net8.0 \ + --filter "FullyQualifiedName~ClientScriptExecutionAbandon" +``` +Expected: PASS (all tests). + +- [ ] Commit: `git commit -am "Integration: ExecuteScript escalates stuck cancel to abandon after threshold"` + +--- + +## Task 8 — Full build across all TFMs (net48 trap guard) + +The Core project multi-targets `net48;net8.0;net8.0-windows`. The runner change in Task 1 keeps the generic `TaskCompletionSource` (the non-generic `TaskCompletionSource` does NOT exist on net48), so this must still build on net48. + +**Files** +- (no source change) + +- [ ] Build the Core project on all three TFMs: + +```bash +dotnet build source/Octopus.Tentacle.Core/Octopus.Tentacle.Core.csproj --framework net48 +dotnet build source/Octopus.Tentacle.Core/Octopus.Tentacle.Core.csproj --framework net8.0 +dotnet build source/Octopus.Tentacle.Core/Octopus.Tentacle.Core.csproj --framework net8.0-windows +``` +Expected: all three succeed. + +- [ ] Build the integration test project on net48 + net8.0 (it runs the runner tests on both): + +```bash +dotnet build source/Octopus.Tentacle.Tests.Integration --framework net48 +dotnet build source/Octopus.Tentacle.Tests.Integration --framework net8.0 +``` +Expected: both succeed. + +- [ ] Run the runner fixture on net48 to confirm the keyed-on-token change behaves there too: + +```bash +dotnet test source/Octopus.Tentacle.Tests.Integration --framework net48 \ + --filter "FullyQualifiedName~SilentProcessRunnerFixture.AbandonToken" +``` +Expected: PASS. + +- [ ] No commit needed (no source change). If any TFM failed, fix inline and commit `git commit -am "Fix net48 build for abandon runner change"`. + +--- + +## Task 9 (final) — Apply Change 1 to PR2 branch (#1244) + +PR2 (`jimpelletier/eft-3295-pr2-async-unlink`, #1244) already has the async `WaitForExitAsync` migration and already keys its abandon-catch on `abandon.IsCancellationRequested` — so the *runner* half of Change 1 is already correct there. What PR2 is still missing is the `AbandonScriptAsync` best-effort-kill: its `AbandonScriptAsync` calls only `Abandon()`. Apply the same Cancel-then-Abandon change. PR2's grandchild tests must STAY as "cancel then abandon" (do NOT flip them back to "cancel alone"). + +**Files (on branch `jimpelletier/eft-3295-pr2-async-unlink`)** +- Modify: `source/Octopus.Tentacle.Core/Services/Scripts/ScriptServiceV2.cs` (`AbandonScriptAsync`) +- Test: `source/Octopus.Tentacle.Tests.Integration/ClientScriptExecutionAbandon.cs` (cherry-pick Task 2's `AbandonScript_WithNoPriorCancel_KillsTheProcess`) + +- [ ] Switch to PR2 and confirm the runner is already token-keyed (no runner change needed): + +```bash +git switch jimpelletier/eft-3295-pr2-async-unlink +grep -n "abandon.IsCancellationRequested\|runningScript.Cancel\|runningScript.Abandon" \ + source/Octopus.Tentacle.Core/Util/CommandLine/SilentProcessRunner.cs \ + source/Octopus.Tentacle.Core/Services/Scripts/ScriptServiceV2.cs +``` +Expected: the runner shows `catch (OperationCanceledException) when (abandon.IsCancellationRequested)`; `AbandonScriptAsync` shows only `runningScript.Abandon()` (the gap to fix). + +- [ ] Cherry-pick Task 2's failing test onto PR2: add `AbandonScript_WithNoPriorCancel_KillsTheProcess` (identical body to Task 2) to `ClientScriptExecutionAbandon.cs`. Run it and watch it FAIL: + +```bash +dotnet test source/Octopus.Tentacle.Tests.Integration --framework net8.0 \ + --filter "FullyQualifiedName~ClientScriptExecutionAbandon.AbandonScript_WithNoPriorCancel_KillsTheProcess" +``` +Expected: FAIL (no kill, test times out). + +- [ ] Apply the same `AbandonScriptAsync` change as Task 2 (Cancel then Abandon) to PR2's `ScriptServiceV2.cs`: + +```csharp + if (runningScripts.TryGetValue(command.Ticket, out var runningScript)) + { + runningScript.Cancel(); + runningScript.Abandon(); + } +``` + +- [ ] Re-run the cherry-picked test. Expected: PASS. + +- [ ] Confirm PR2's grandchild tests STAY as "cancel then abandon" — do NOT modify them. Run them to confirm they still pass with the async runner + best-effort-kill abandon: + +```bash +dotnet test source/Octopus.Tentacle.Tests.Integration --framework net8.0 \ + --filter "FullyQualifiedName~SilentProcessRunnerFixture" +``` +Expected: PASS (grandchild tests require cancel-then-abandon on PR2 by design — do not regress them to PR1's "cancel alone" shape). + +- [ ] Build PR2's Core on all TFMs (net48 guard for PR2's `WaitForExitAsync` polyfill): + +```bash +dotnet build source/Octopus.Tentacle.Core/Octopus.Tentacle.Core.csproj --framework net48 +dotnet build source/Octopus.Tentacle.Core/Octopus.Tentacle.Core.csproj --framework net8.0 +dotnet build source/Octopus.Tentacle.Core/Octopus.Tentacle.Core.csproj --framework net8.0-windows +``` +Expected: all succeed. + +- [ ] Commit on PR2: `git commit -am "PR2: AbandonScriptAsync Cancel() then Abandon() so abandon best-effort-kills"` + +- [ ] Switch back to PR1: `git switch jimpelletier/eft-3295-tentacle-script-abandonment-to-release-the-mutex` + +--- + +## Notes / spec gaps surfaced during planning + +- **Capability string mismatch.** Spec Section 1 names the capability `AbandonScriptV2`. The code on this branch advertises `nameof(ScriptServiceV2.AbandonScriptAsync)` = `"AbandonScriptAsync"`. This plan matches the *deployed* string in both the Tentacle advertisement and the client `HasAbandonScriptV2()` check. Renaming to `AbandonScriptV2` is a separate contract decision and is intentionally NOT done here. +- **Most of the contract is already merged.** `AbandonScriptCommandV2`, `IScriptServiceV2.AbandonScript`, `AbandonedExitCode = -48`, `ITentacleClient.AbandonScript`, `RunningScript` abandon catch, and the PR1 runner block already exist. The two deltas in this plan are the only behavioural changes Change 1 and Change 2 require. +- **`process.HasExited` is deliberately NOT read in the PR1 abandon branch.** After `Cancel()→Close()` detaches the Process, reading `HasExited`/`ExitCode` would throw or lie. The "abandon unnecessary" race is handled one layer up by `GetResponse` returning the already-`Complete` `RunningScript`'s real code. diff --git a/docs/superpowers/specs/2026-05-21-tentacle-script-abandon-design.md b/docs/superpowers/specs/2026-05-21-tentacle-script-abandon-design.md new file mode 100644 index 000000000..5a1b8620d --- /dev/null +++ b/docs/superpowers/specs/2026-05-21-tentacle-script-abandon-design.md @@ -0,0 +1,475 @@ +# Tentacle script abandon — design + +**Status:** Draft, ready for implementation planning. Contract aligned with the parallel server-side session. +**Ticket:** [EFT-3295](https://linear.app/octopus/issue/EFT-3295/tentacle-script-abandonment-to-release-the-mutex) +**ADR:** [ADR-042 — Defer server-task Abandoned state](https://github.com/OctopusDeploy/adr/pull/226) +**Parallel work:** Server-side (ProcessExecution layer) is being designed in a separate session and consumes the contract proposed here. + +--- + +## Problem + +When a Tentacle script is hung in a way that resists `Process.Kill` (Philips' case: PowerShell stuck inside CrowdStrike + Rapid7 fighting over the same process; kernel-level uninterruptible wait), today's flow ends with: + +- `ScriptIsolationMutex` stays held → subsequent deployments to that Tentacle queue forever. +- The .NET threadpool thread inside `RunningScript.Execute()` stays parked on `process.WaitForExit()` (synchronous). +- The customer's only recovery is RDP-in-and-kill or reboot. Not acceptable for Philips. + +Server-side detects that cancellation hasn't propagated within its own timeout and tells Tentacle to **abandon** the script. Tentacle stops waiting, releases the mutex, logs honestly, and accepts new work. + +**Abandon now best-effort-kills the process** (see ADR-42 reversal below). The runaway OS process survives only when the kill genuinely can't land — exactly the stuck / re-parented-grandchild case the ticket is about. + +## Abandon semantics (the agreed model) + +- **Cancel** = best-effort kill, then **WAIT** for the script to finish and report its real outcome. +- **Abandon** = best-effort kill, then **DON'T wait** — release the isolation mutex and return `AbandonedExitCode` (-48) even if the script didn't die. + +Both attempt the kill. Only abandon stops waiting and releases the mutex. A process survives abandon only when the kill genuinely can't land (stuck / re-parented grandchild). This closes the abuse vector where someone could use abandon to leave a perfectly killable process running unmanaged. + +## Scope + +In scope: +- `IScriptServiceV2` only (Listening + Polling Tentacles). +- New Halibut RPC verb `AbandonScript`, new exit code `AbandonedExitCode = -48`. +- New optional client parameter `abandonAfterCancellationPendingFor` on `TentacleClient.ExecuteScript`. +- Gated by server-side feature flag (`AbandonTentacleScriptOnCancellationTimeoutFeatureToggle`) for the first release. No Tentacle-side flag — capability advertisement is binary on build version. + +Out of scope: +- SSH targets (different lock model; ticket explicitly defers). +- Kubernetes agent (`IKubernetesScriptServiceV1`): different mechanism, separate stuck-pod work already in flight (`KubernetesPendingPodWatchDog`). Server's capability negotiation handles "don't try abandon on Kubernetes targets" cleanly. +- Old `IScriptService` (V1): no signal that any active Tentacle still negotiates V1. +- Server-task Abandoned UI state — deferred by ADR-042; task continues to surface as Cancelled. + +## Section 0 — Two PRs, sequenced + +The work ships as two PRs in sequence so the async migration is reviewable and mergeable independently of the abandon feature, and so the behavioural change to grandchild handling is gated behind universal server abandon support. + +``` +main ← PR1 (#1226) ← PR2 (#1244, draft) +``` + +### PR1 (#1226) — ships first + +The wait in `SilentProcessRunner` is a **synchronous `process.WaitForExit()` run on a `Task`**, raced against an abandon signal via `Task.WhenAny`. The abandon signal is a `TaskCompletionSource` completed when the abandon token fires. (The non-generic `TaskCompletionSource` does NOT exist on net48, so the generic form is required.) + +**Cancel behaves exactly like main.** `cancel.Register` runs `DoOurBestToCleanUp`, which does best-effort `Hitman.Kill` **and `process.Close()`**. `Close` releases the redirected pipe handles, so the synchronous wait returns even when a re-parented grandchild holds stdout/stderr open. So cancel — including the grandchild case — is handled exactly as it is today, **with no server dependency**. Cancel of a genuinely un-killable script hangs the wait until abandon fires. + +`Close` is safe here precisely because the wait is synchronous; it would race the `Exited` event of the async `WaitForExitAsync`, which PR1 does not use. + +PR1 also adds the full abandon contract surface, the client parameter, capability advertisement, and tests. + +### PR2 (#1244) — draft, ships only after all servers support abandon + +Switches the wait to async `await process.WaitForExitAsync(...)` for performance: no thread is blocked per running script. With the async wait, `Close()` can no longer be used to unblock the grandchild case — it races the `Exited` event the async wait depends on — so `process.Close()` is removed from `DoOurBestToCleanUp`. Consequence: **cancel no longer unsticks a re-parented grandchild on its own; grandchildren require abandon (cancel-then-abandon)**. Cancel returns the real kill exit code instead of -1. + +PR2 must ship only after every server in the fleet supports the abandon escalation, because it makes server-side abandon the only recovery path for the grandchild case. + +The two grandchild tests flip from "cancel alone does not hang" (PR1) to "cancel then abandon" (PR2). + +## Section 1 — Contract surface + +Add a method to existing `IScriptServiceV2`. Do NOT introduce V3 — the convention here is method-addition + capability negotiation. The server-side RPC contract below is **unchanged** by the ADR-42 reversal. + +```csharp +// source/Octopus.Tentacle.Contracts/ScriptServiceV2/IScriptServiceV2.cs +public interface IScriptServiceV2 +{ + ScriptStatusResponseV2 StartScript(StartScriptCommandV2 command); + ScriptStatusResponseV2 GetStatus(ScriptStatusRequestV2 request); + ScriptStatusResponseV2 CancelScript(CancelScriptCommandV2 command); + ScriptStatusResponseV2 AbandonScript(AbandonScriptCommandV2 command); // NEW + void CompleteScript(CompleteScriptCommandV2 command); +} + +// NEW: source/Octopus.Tentacle.Contracts/ScriptServiceV2/AbandonScriptCommandV2.cs +public class AbandonScriptCommandV2 +{ + public AbandonScriptCommandV2(ScriptTicket ticket, long lastLogSequence) { /* … */ } + public ScriptTicket Ticket { get; } + public long LastLogSequence { get; } +} +``` + +**Capability advertisement.** Tentacle's `CapabilitiesServiceV2` advertises `AbandonScriptV2` once the build supports it. Binary on build version, no Tentacle-side toggle. Server's existing `BackwardsCompatibleAsyncCapabilitiesV2Decorator` handles "Tentacle doesn't advertise it → don't call it" for older Tentacles. Server-side `AbandonTentacleScriptOnCancellationTimeoutFeatureToggle` is the only feature-flag off-switch. + +**Why a new verb (not a "force" flag on Cancel).** Different semantics: Cancel = "best-effort kill, then wait for the real outcome". Abandon = "best-effort kill, then stop waiting; release the mutex; return -48". Two verbs map cleanly to ProcessExecution's two-step escalation (cancel first, abandon if cancel doesn't propagate). + +## Section 2 — Where abandon's kill lives, and mutex release + +**The core constraint.** `RunningScript.Execute()` acquires `ScriptIsolationMutex` inside a `using` block that wraps the call into `SilentProcessRunner`. That call blocks (PR1) or awaits (PR2) on the process wait. When the wait never returns: +1. The mutex is welded shut (the `using`'s Dispose never runs). +2. In PR1 the threadpool thread is parked; in PR2 no thread is parked, but the awaiter never completes. + +Abandon solves both by completing the wait via the abandon signal regardless of whether the OS process exited. + +**Where the kill lives.** `ScriptServiceV2.AbandonScriptAsync` calls only `runningScript.Abandon()`. The kill happens in `SilentProcessRunner`'s **abandon branch**: when the abandon token is observed it calls `DoOurBestToCleanUp` (best-effort `Hitman.Kill`, idempotent if cancel already ran it) and then returns `AbandonedExitCode`. Doing the kill there — in the runner, sequentially in the abandon branch — closes the abuse vector for ANY abandon (including a direct RPC with no prior cancel), and is **race-free**. + +> **Why not fire the cancel token from `AbandonScriptAsync` (the originally-proposed shape)?** Firing `Cancel()` then `Abandon()` races: cancel's `Hitman.Kill` makes a killable process exit, which resolves the runner's wait to the *cancel* exit code (`-1`) before the abandon token is observed — the runner returns `-1` instead of `-48`. Reversing the order makes `-48` deterministic but then the kill races the runner disposing the process. Doing the kill sequentially inside the abandon branch avoids both races. (Proven by `AbandonScript_WithNoPriorCancel_KillsTheProcess`.) + +**Deterministic -48 when both tokens are set.** In PR1's `Task.WhenAny`, key the abandoned result on `abandon.IsCancellationRequested` rather than on which task won the race. PR2's async abandon-catch is already deterministic on `abandon.IsCancellationRequested`. + +**PR1 wait shape (synchronous wait raced against abandon):** + +```csharp +using (cancel.Register(() => DoOurBestToCleanUp(process, error))) // Hitman.Kill + process.Close() +{ + process.BeginOutputReadLine(); + process.BeginErrorReadLine(); + + var abandonTcs = new TaskCompletionSource(); + using (abandon.Register(() => abandonTcs.TrySetResult(null))) + { + var waitTask = Task.Run(() => process.WaitForExit()); + await Task.WhenAny(waitTask, abandonTcs.Task).ConfigureAwait(false); + + if (abandon.IsCancellationRequested && !process.HasExited) + { + info("Tentacle has abandoned this script. The underlying script process may still be running on this host."); + SafelyCancelRead(process.CancelErrorRead, debug); + SafelyCancelRead(process.CancelOutputRead, debug); + return ScriptExitCodes.AbandonedExitCode; + } + + // process exited (naturally, or via cancel-triggered Kill+Close releasing the pipes) + SafelyCancelRead(process.CancelErrorRead, debug); + SafelyCancelRead(process.CancelOutputRead, debug); + return SafelyGetExitCode(process); + } +} +``` + +**PR2 wait shape (async wait, no `process.Close()`):** the `Task.Run(WaitForExit)` + `WhenAny` is replaced by `await process.WaitForExitAsync(abandon)`, and `process.Close()` is removed from `DoOurBestToCleanUp`. The abandon-catch keys on `abandon.IsCancellationRequested && !process.HasExited` exactly as above. Cancel returns the real kill exit code via the natural `Exited`-event completion. + +**Diff shape.** The wait method and its callers migrate to async (`ExecuteCommand` → `ExecuteCommandAsync`, returning `Task`). A search across the repo found ~20 call sites; every one migrates. PR1 introduces the async signature with the synchronous-wait-on-a-Task body and a net48 polyfill where `WaitForExitAsync` is unavailable; PR2 swaps the body to the true async wait. + +Production code: +- `source/Octopus.Tentacle.Core/Util/CommandLine/SilentProcessRunner.cs` — the method itself. Async signature, two-token (`cancel`, `abandon`). +- `source/Octopus.Tentacle/Util/ISilentProcessRunner.cs` — interface and in-process wrapper become async. +- `source/Octopus.Tentacle/Util/CommandLineRunner.cs` — caller migration. +- `source/Octopus.Tentacle.Core/Services/Scripts/RunningScript.cs` — `RunScript` → `RunScriptAsync`; ctor takes `abandonToken` alongside `runningScriptToken`; `Execute()` awaits the new path. +- `source/Octopus.Tentacle.Core/Services/Scripts/ScriptServiceV2.cs` — `LaunchShell` passes `abandonToken` from the wrapper. `RunningScriptWrapper` gains `abandonTokenSource`. `AbandonScriptAsync` calls `Abandon()` (the best-effort kill lives in the runner's abandon branch — see above). +- `source/Octopus.Tentacle.Contracts/ScriptServiceV2/` — new `AbandonScriptCommandV2.cs`, interface method on `IScriptServiceV2.cs` (per Section 1). +- `source/Octopus.Tentacle.Contracts/ScriptExitCodes.cs` — add `AbandonedExitCode = -48`. +- Capabilities advertisement (`AbandonScriptV2`). + +Immediate sync callers migrated with `.GetAwaiter().GetResult()` and a sync-boundary comment ("We're on a plain thread-pool worker — when the async work finishes it can resume on any free thread, so the block resolves normally"): +- `PowerShellPrerequisite.Check()`, `KubernetesDirectoryInformationProvider.GetDriveBytesUsingDu()`, `SystemCtlHelper.RunServiceCommand()` (×2), `LinuxServiceConfigurator` (×3), `WindowsServiceConfigurator.Sc()`, `CommandLineRunner.Execute(CommandLineInvocation, ...)`. + +Kubernetes integration test scaffolding (all caller-migration, no logic change): +- `KubeCtlTool.cs`, `DockerImageLoader.cs` (×2), `KubernetesAgentInstaller.cs` (×3), `KubernetesClusterInstaller.cs` (×4), `HelmDownloader.cs`, `ToolDownloader.cs` (all under `source/Octopus.Tentacle.Kubernetes.Tests.Integration/`). + +Tentacle integration test scaffolding (caller migration): +- `PowerShellStartupDetectionTests.cs` (×3), `Util/SilentProcessRunnerFixture.cs`, `Support/TentacleFetchers/LinuxTentacleFetcher.cs` (all under `source/Octopus.Tentacle.Tests.Integration/`). + +**What happens to stdout/stderr after abandon.** Returning `AbandonedExitCode` unwinds the method. The outer `using (var process = new Process())` disposes the Process, which closes our end of the redirected pipes. The OS process may get EPIPE on its next stdout/stderr write. The script's runtime keeps doing whatever it's doing; many scripts ignore broken-pipe errors, and scripts that fail on them already had nowhere to log anyway. The alternative — leaving the Process and its pipes pinned in memory indefinitely — is a resource-accumulation problem. + +**Async correctness watch-outs for the implementation plan:** +- Every new async method gets `.ConfigureAwait(false)`. +- No `.Result` / `.Wait()` on the new path; surface any caller that can't easily go async rather than block-on-async. +- Verify no deadlock under the Tentacle's synchronisation context (none expected, but worth confirming). + +**Rejected alternatives** (documented for the reviewer's benefit): +- **Orphan the Task + release mutex via external Dispose.** Releases the mutex but leaks a threadpool worker per abandon. Tentacle eventually starves the threadpool. +- **Manual `Thread` instead of `Task`.** Same leak problem, just trades threadpool for kernel thread handles + stack memory. +- **`Thread.Abort` / `Thread.Interrupt` / `TerminateThread` P/Invoke.** No safe managed mechanism to release a thread parked in unmanaged code; can corrupt Tentacle's own state. +- **Out-of-process script worker.** Cleanly isolates the stuck-process problem but is a massive refactor far outside EFT-3295's scope. Worth a separate proposal someday. + +## Section 3 — Client parameter (PR1) + +Add one optional parameter, threaded through the client orchestration: + +```csharp +TimeSpan? abandonAfterCancellationPendingFor = null +``` + +It is added to `TentacleClient.ExecuteScript`, `ITentacleClient.ExecuteScript`, and `ObservingScriptOrchestrator`. + +The orchestrator's poll loop (`ObservingScriptOrchestrator.ObserveUntilComplete`) already calls `CancelScript` every iteration while the `CancellationToken` is cancelled. When the parameter is set: +1. Record when cancellation first fired. +2. Once elapsed crosses the threshold AND the `AbandonScriptV2` capability is advertised, call `AbandonScript` instead of `CancelScript`. +3. If the capability is absent, keep cancelling (skip the abandon attempt cleanly). + +Add an `AbandonScript` method to the V2 script executor (`ScriptServiceV2Executor`) with the capability check. `ScriptExecutionResult.ExitCode` is the source of truth (-48 = abandoned). + +**No shared constant in `Octopus.Tentacle.Contracts`.** An earlier ask for a shared abandoned-message constant was retracted; the Tentacle's own script-log line is the only customer-facing message. + +## Section 4 — State, exit code, log wording + +- **Exit code:** `ScriptExitCodes.AbandonedExitCode = -48`. Distinct from `CanceledExitCode (-43)`. Server-side telemetry can tell abandoned from cancelled even though task UI surfaces both as "Cancelled" per ADR-042. +- **State on GetStatus after abandon:** `(ProcessState.Complete, AbandonedExitCode, latestLogs)`. Same shape as Cancel returns today. +- **Honest log line:** `"Tentacle has abandoned this script. The underlying script process may still be running on this host."` Written once, into the workspace script log, near the end of the abandon path. Still accurate after the ADR-42 reversal: the process survives only when the best-effort kill couldn't land. +- **Workspace cleanup on subsequent `CompleteScript`:** targeted best-effort. `CompleteScript` reads the stateStore and checks the persisted exit code. If `AbandonedExitCode`, wrap `workspace.Delete` in try/catch, log a `Warn` to systemLog naming the leaked directory, return success. For any other exit code, `workspace.Delete` is called as today and exceptions propagate. The relaxed-deletion policy applies only to the rare abandon case; bugs that leak handles on normal-completion paths can't hide under a blanket try/catch. No janitor — OS-level state on the host is the customer's problem per the ticket. +- **Idempotency — actual-status return (NOT silent no-op):** + - Abandon called twice on the same already-abandoned ticket → returns the cached `(Complete, AbandonedExitCode, logs)` response. + - Abandon called on a ticket that completed naturally before the abandon arrived (race case the server-side session flagged) → returns `(Complete, realExitCode, logs)` with the **real exit code**, distinct from `AbandonedExitCode`. The server uses this distinction to log *"Script had already completed before abandon was needed"* instead of *"Tentacle abandoned the script"*. + - Abandon called on an unknown ticket (never started, or already cleaned up via `CompleteScript`) → returns `(Complete, UnknownScriptExitCode, [])`, matching Cancel's behaviour for the same case. +- **Race with natural completion:** the wrapper's existing `StartScriptMutex` (or a new dedicated lock) serialises abandon entry. If state is already Complete, abandon returns the cached status per the rules above. + +## Section 5 — Automated test strategy + +### 5.1 `SilentProcessRunner` unit tests + +Style: matches existing `SilentProcessRunnerFixture.cs`. Use short-lived helper scripts/exes as process subjects. + +| Test | Trigger | Verify | +|---|---|---| +| Normal exit | Run a process that exits 0 | Returns 0; no abandon log line captured by the `info` callback spy. | +| Cancel kills process | Long-running process; fire cancel token | Within 1s: process is killed (`process.HasExited == true`). PR1: return is the kill-induced exit code (Linux 137; Windows process-defined) via the `Close`-released pipe path. PR2: return is the real kill exit code. No abandon log line. | +| Abandon while running | Long-running process; fire abandon token | Within ~100ms: returns `AbandonedExitCode`, `info` callback received exactly one call containing "Tentacle has abandoned this script". Assert `process.HasExited == false`; clean up by killing externally. | +| Abandon AFTER natural exit (race) | Process that exits in ~50ms; fire abandon token at the moment exit fires | Return value is the process's real exit code, not `AbandonedExitCode`. No abandon log line. Verifies the `abandon.IsCancellationRequested && !process.HasExited` guard. | +| Both tokens fire | Long-running process; fire cancel; with `cancel.Register` no-op'd to simulate un-killable, fire abandon | `info` callback gets the abandon log line; return value is `AbandonedExitCode`. Verifies deterministic -48 when both tokens are set. | + +**Grandchild (re-parented) tests — differ by PR.** A child process spawns a grandchild that inherits stdout/stderr and outlives the child. +- **PR1:** "cancel alone does not hang" — `cancel.Register`'s `process.Close()` releases the inherited pipe handles, so the synchronous wait returns even though the grandchild holds them open. No abandon needed. +- **PR2:** "cancel then abandon" — `process.Close()` is gone, the async wait depends on the `Exited` event, so cancel alone leaves the wait pending; abandon is required to complete it and return -48. + +**Async-specific timing assertion (PR2):** `WaitForExitAsync(token)` returns within ~50ms of cancellation. Wrap the await in `Stopwatch.StartNew()`; assert elapsed < 100ms. Proves the async wait is independent of process exit. + +**Thread-leak regression test (PR2):** start 50 stuck processes via `ExecuteCommandAsync` (all `await`ed in parallel), fire abandon on all; capture `Process.GetCurrentProcess().Threads.Count` before and 1s after; assert delta ≤ 5 (allow for threadpool jitter). The async path should produce zero parked threads at steady state. (Under PR1 the synchronous-wait-on-a-Task body still parks a worker per running script, so this assertion is PR2-only.) + +### 5.2 `ScriptServiceV2` service-layer tests + +Style: matches existing service-layer fixtures using in-memory script shells and stub workspace factories. + +| Test | Trigger | Verify | +|---|---|---| +| **Mutex release (load-bearing)** | Start `FullIsolation` script; abandon it; immediately start second `FullIsolation` script | Second `StartScript` returns with `State == Running` within 1s. Reading `ScriptIsolationMutex.TaskLock.Report()` between abandon and second-start shows the lock free in that window. | +| Abandon before StartScript | Call AbandonScript with a ticket never seen | Returns `(Complete, UnknownScriptExitCode)`. Matches existing Cancel behaviour for unknown ticket. | +| Abandon after CompleteScript | Start → Complete → Abandon | Returns `(Complete, UnknownScriptExitCode)` (wrapper already removed; stateStore gone). | +| Abandon then Cancel | Abandon, then Cancel same ticket | Cancel returns the cached abandoned response unchanged. Asserts via response equality. | +| **Cancel then Abandon (real flow)** | Long-running script; cancel; `cancel.Register` no-op'd to simulate un-killable; abandon | Final GetStatus returns `(Complete, AbandonedExitCode, logs)`. Log content includes the honest line. Subsequent same-ticket StartScript returns the cached state. | +| **Abandon best-effort-kills (anti-abuse)** | Long-running but killable script; call AbandonScript directly with no prior cancel | The process is killed (the runner's abandon branch ran `DoOurBestToCleanUp → Hitman.Kill`). Final state is `(Complete, AbandonedExitCode, logs)`. Confirms a direct abandon does not leave a killable process running unmanaged. | +| Abandon during StartScript launch | Concurrent: StartScript holding `StartScriptMutex`, AbandonScript called | Abandon serialises behind StartScript via the existing wrapper mutex. Final state is consistent (no half-abandoned wrapper). | +| Capability advertisement | Tentacle build with the abandon feature; query `CapabilitiesServiceV2.GetCapabilities()` | Response includes `AbandonScriptV2`. Builds without the feature do not advertise it. | + +### 5.3 Client orchestrator tests + +| Test | Trigger | Verify | +|---|---|---| +| Param unset → cancel only | `abandonAfterCancellationPendingFor = null`; cancel the token mid-execution | Orchestrator calls `CancelScript` each poll; never calls `AbandonScript`. | +| Threshold crossed + capability present | Param set to a short span; capability advertised; cancellation pending past the span | Orchestrator switches from `CancelScript` to `AbandonScript` after the threshold. `RecordMethodUsages` shows the `AbandonScript` call. | +| Threshold crossed + capability absent | Param set; `AbandonScriptV2` NOT advertised; cancellation pending past the span | Orchestrator keeps calling `CancelScript`; never calls `AbandonScript`; no error. | + +### 5.4 Integration tests (real shells, real processes) + +Style: matches `Octopus.Tentacle.Tests.Integration/ClientScriptExecutionIsolationMutex.cs` (real Tentacle, real script, mutex semantics under test). + +**Timing flakiness: use the existing builders, not raw shell + `Thread.Sleep`.** +- `ScriptBuilder` (`Octopus.Tentacle.CommonTestUtils/Builders/ScriptBuilder.cs`): `.CreateFile(path)` signals "script reached this line"; `.WaitForFileToExist(path)` blocks the script on an event, not a sleep race. +- `TestExecuteShellScriptCommandBuilder` (`Octopus.Tentacle.Tests.Integration/Util/Builders/`): `.SetScriptBody(ScriptBuilder)`, `.WithIsolationLevel(...)`, `.WithIsolationMutexName(...)`, `.Build()`. +- `TentacleConfigurationTestCase.CreateBuilder()` and `ClientAndTentacleBuilder` set up real Tentacle + Halibut. +- `TentacleServiceDecoratorBuilder.RecordMethodUsages(...)` decorates the script service so the test can assert call counts for the new `AbandonScript` verb and capability negotiation. +- `Wait.For(condition, timeout, onFail, ct)` is the event-driven polling helper. Always preferred over `Task.Delay`. + +**Pattern to follow:** mirror `ClientScriptExecutionIsolationMutex.cs`. Stuck-script tests use `ScriptBuilder.WaitForFileToExist(...)` as the "kernel-blocked" simulant rather than `sleep 600`. For the un-killable variant, combine the file-wait with the `Tentacle.Debug.DisableProcessKill` flag so `Hitman` becomes a no-op for the test's duration. + +| Test | Trigger | Verify | +|---|---|---| +| PowerShell + cancel (kill works) | Real PowerShell, `Start-Sleep -Seconds 600`, fire Cancel, normal kill path | Final response is `(Complete, CanceledExitCode)` via the existing path. **Negative check:** abandon log line NOT present. Confirms Cancel isn't regressed into the abandon path. | +| PowerShell + abandon (kill mocked off) | Real PowerShell, sleep; `Hitman` mocked to no-op; fire Cancel; wait; fire AbandonScript | Within 2s of abandon: response is `(Complete, AbandonedExitCode, [...honest log line...])`; mutex is free (start a second `FullIsolation` script that acquires within 1s); the real PowerShell process is still alive (verified via `Process.GetProcessById`). Test cleanup: kill the leftover PowerShell. | +| **Multi-level-deep hang (ticket-mandated)** | bootstrap → Calamari-shim → user script, with `Hitman` no-op flag set | All verifications from the previous row pass through the multi-level launch chain. Confirms abandon works when the stuck process is not the immediate child of Tentacle. | +| **Re-parented grandchild — PR1** | Child spawns a grandchild that inherits stdout/stderr and outlives it; fire Cancel | Cancel alone returns (does NOT hang): `process.Close()` releases the inherited pipes. No abandon needed. | +| **Re-parented grandchild — PR2** | Same setup; fire Cancel, then AbandonScript | Cancel alone leaves the wait pending; abandon completes it and returns `(Complete, AbandonedExitCode)`. The two grandchild tests flip from PR1's "cancel alone does not hang" to PR2's "cancel then abandon". | +| Windows workspace cleanup with open handles | Run the abandon path; leave the simulated zombie holding the workspace log file open; call CompleteScript | CompleteScript returns without exception. Tentacle systemLog contains a `Warn` naming the leaked workspace directory. Workspace dir on disk still exists (assert via `Directory.Exists`). No exception bubbles up. | +| Polling Tentacle variant | Configure test fixture as Polling | All verifications from the kill-mocked-off row pass against a Polling Tentacle. | + +**End-to-end async thread audit (PR2).** Capture `Process.GetCurrentProcess().Threads.Count` 5s into a stuck-script scenario; assert no thread parked attributable to the script pipeline. Most reliable proxy: total thread count not higher than baseline + epsilon. + +**Normal-path timing regression check.** Run a 100-iteration benchmark of normal short-script execution (`Write-Host "x"`); compare median wall-clock time vs. a baseline build. **Verify:** median delta within margin of error. + +## Section 6 — Manual testing plan + +Manual scenarios on a real test Tentacle. All scenarios assume the parallel server-side build is deployed. + +### Setup + +- Test Octopus Server with EFT-3295 server-side build. +- Windows Tentacle (primary) + Linux Tentacle (smoke). +- Debug Tentacle build with `Tentacle.Debug.DisableProcessKill=true` making `Hitman.TryKillProcessAndChildrenRecursively` a no-op — simulant for "kill doesn't work" without engineering real kernel-level waits. +- Server-side feature flag `AbandonTentacleScriptOnCancellationTimeoutFeatureToggle` (default ON, configured on the test Octopus Server). + +### Where to find things (reference for verification steps below) + +- **Tentacle systemLog (Windows):** `C:\Octopus\Logs\OctopusTentacle.txt` (confirm via `Tentacle show-configuration`). +- **Tentacle systemLog (Linux):** `/etc/octopus//Logs/OctopusTentacle.txt`. +- **Tentacle workspace root:** `/Work/`. Each script gets a subdirectory named after its `ScriptTicket`. Inside: `bootstrapRunner.log`, `Output.log`, `script.ps1`/`Bootstrap.sh`, the state store file. +- **Script log in UI:** Octopus Server → the task → expand the deployment step. This is what the customer sees and what gets the honest abandon line. +- **Thread count (Windows):** PowerShell `(Get-Process Tentacle).Threads.Count`, or Process Explorer's Threads tab. +- **Thread count (Linux):** `ps -o nlwp= -p $(pgrep -f Tentacle)`. +- **Capability advertisement:** Tentacle systemLog at startup contains `Negotiated capabilities: [...]`. Or enable Halibut verbose tracing server-side and inspect the `CapabilitiesResponseV2` payload. +- **Mutex state in Tentacle log:** grep for `acquiring isolation mutex` / `Lock acquired` / `Releasing lock` with the relevant task ID. + +### M1 — Regression smoke (flag ON, normal script) + +Deploy `Write-Host "hello"; Start-Sleep 5; Write-Host "done"`. + +**Verify (all must pass):** +1. Octopus UI task status → **Success** (green tick). +2. Script log in UI shows `hello` and `done`; no abandon line. +3. Tentacle systemLog: `grep "abandon" OctopusTentacle.txt` → zero matches for this task ID. +4. Tentacle systemLog shows the normal acquire/release pair for this task ID. +5. Thread count (sampled 5s after task completes) → within ±2 of pre-test baseline. + +### M2 — Cancel still works (flag ON, killable script) + +`DisableProcessKill=false`. Deploy `Start-Sleep -Seconds 300`. Wait ~10s. Click **Cancel** in Octopus UI. + +**Verify:** +1. UI task status transitions to **Cancelled** within 30s. +2. Tentacle systemLog shows the kill attempt followed by mutex release for this task ID. +3. PowerShell process is gone (match by PID captured from Tentacle log at script start). +4. `grep "abandon" OctopusTentacle.txt` → zero matches for this task ID. Cancel path was used, not abandon. +5. Deploy a second project to the same Tentacle → starts immediately (mutex released by the normal Cancel path). + +### M3 — The Philips scenario (flag ON, unkillable script) + +`Tentacle.Debug.DisableProcessKill=true`. Restart Tentacle. Capture thread-count baseline. Deploy `Start-Sleep -Seconds 600`. Note the PowerShell PID from the Tentacle log. Click **Cancel** after ~10s. Wait for the server-side abandon timeout (2 min for first release). + +**Verify (all must pass; this is the load-bearing scenario):** + +1. **Server side called Abandon.** Server log shows an `AbandonScript` call for this task's ticket, timestamped after the Cancel attempt + the abandon timeout. +2. **Honest log line in the customer-visible task log.** Confirm `Tentacle has abandoned this script. The underlying script process may still be running on this host.` is present in the script log section. +3. **Tentacle systemLog records the abandon path.** Shows: AbandonScript invocation received, best-effort kill attempted (no-op'd here), abandon signal fired, mutex released, wrapper removed. +4. **Mutex released — load-bearing check.** Immediately deploy a second trivial project. **Pass:** starts within 5s. **Fail:** queues with "Waiting for the script in task...". +5. **Task UI status = Cancelled** (no separate "Abandoned" state — per ADR-042). +6. **Thread count returned to baseline** (PR2). Sample 10s after the abandon. **Pass:** within ±2 of baseline. (Under PR1 a worker is parked per running script in the normal case, so use the PR2 build for this check.) +7. **The PowerShell process is still alive on the host.** `Get-Process -Id ` returns the process — because the best-effort kill was no-op'd. Kill it manually at end of test for cleanup. (With `DisableProcessKill=false` abandon would have killed it; this scenario specifically simulates the kill not landing.) +8. **Exit code in the task log = -48 (AbandonedExitCode)**, distinct from `-43` (CanceledExitCode). + +### M4 — Repeated abandon (thread/handle-leak check under repetition) + +Capture baseline thread count and Tentacle working-set memory. Run M3 ten times back-to-back (script the loop). + +**Verify:** +1. Sample thread count after each iteration (PR2 build). **Pass:** within ±5 of baseline across all ten runs. **Fail:** monotonic growth. +2. Sample working-set memory after each iteration. **Pass:** within ~50MB of baseline. **Fail:** grows by more than ~10MB per iteration. +3. After all ten runs, deploy a normal project. **Pass:** runs normally. +4. Kill leftover `powershell.exe` / `sleep` processes manually at end of test. + +### M5 — Server-side flag off (Tentacle behaves as today) + +Set `AbandonTentacleScriptOnCancellationTimeoutFeatureToggle` to OFF in the test Octopus Server. Restart Server. Leave Tentacle untouched. + +**Verify:** +1. **Server doesn't dispatch Abandon.** Repeat the M3 setup. Wait past the 2-minute point. Server log: zero `AbandonScript` matches for this task ID. +2. **Tentacle still advertises the capability.** `CapabilitiesResponseV2` still contains `AbandonScriptV2`. The flag lives on the Server, not on Tentacle. +3. **Tentacle stays wedged.** Subsequent deployment queues with "Waiting for the script in task...". Confirms today's behaviour is preserved when Server has the feature off. +4. Recovery: restart Tentacle (the existing workaround). + +### M6 — Workspace cleanup with open handles (Windows-specific) + +Run M3 to completion. Note the `ScriptTicket` from the Tentacle log. + +**Verify:** +1. **Workspace dir still exists.** Listing shows log files present; open handles prevent deletion. +2. **systemLog records the failure.** A `Warn`-level entry names the directory that could not be deleted, with the underlying I/O exception message. +3. **No propagated exception to Server.** `CompleteScript` returns normally; Server log shows successful completion. +4. **Tentacle continues to function.** Deploy a third project (not to the wedged workspace). **Pass:** runs normally. +5. **Manual cleanup works after the zombie process is killed.** Confirms the leak is bounded. + +### M7 — Polling Tentacle variant + +Register a Polling Tentacle. Repeat M3 setup and execution. + +**Verify:** +1. All M3 verification points pass with no Polling-specific differences. The Halibut RPC path is the same — only connection initiation direction differs. +2. **Polling-specific check:** during the abandon, Tentacle's polling loop continues. **Pass:** polling not blocked by the abandon flow. +3. After abandon, the Polling Tentacle picks up the next deployment. **Pass:** new deployment dispatched and runs (mutex released). + +### M8 — Linux smoke + +On a Linux Tentacle, deploy a Bash script: `sleep 600`. Repeat M2 (kill works) and M3 (kill mocked off). + +**Verify:** +1. **M2 on Linux:** `ps -p ` shows the bash/sleep process gone after Cancel. Same outcomes as Windows M2. +2. **M3 on Linux:** all M3 verification points pass. Thread count via `ps -o nlwp= -p $(pgrep -f Tentacle)`. Workspace location: `/etc/octopus//Work//`. +3. **Linux file-handle behaviour differs:** Linux generally allows deletion of files held open by other processes. For M6's analogue on Linux, workspace deletion is more likely to succeed even with the zombie running. Note in test result. +4. Confirms the implementation isn't accidentally Windows-only. + +### M9 — Server escalation ordering + +Server escalation is hardcoded at **2 minutes** post-Cancel for the first release. Ask the server-side session for a debug-build override constant to run this faster. + +**Verify the killable case (no escalation expected):** +1. Run M2 (killable + cancel). Wait at least 3 minutes. +2. Server log: zero `AbandonScript` matches for this task ID. Cancel succeeded inside the 2-minute window. +3. Tentacle log: zero abandon entries for this task ID. + +**Verify the unkillable case (escalation expected):** +4. Run M3 (kill mocked off + cancel). Wait through the 2-minute timeout. +5. Server log: exactly one `AbandonScript` match, timestamped ~2 minutes after the Cancel. +6. Tentacle log: one abandon entry for this task ID. + +**Verify the actual-status race case** (server-side session's idempotency concern): +7. Set up M3, but let the script complete naturally just before the 2-minute timer fires (~110-second script). +8. Server fires AbandonScript anyway because the completion event hasn't reached it yet. +9. Tentacle returns `(Complete, realExitCode, logs)` — NOT `AbandonedExitCode`. +10. Server task log entry: *Script had already completed before abandon was needed.* + +**Bug indicators to flag back to the server session:** +- Server calls AbandonScript on every Cancel (even killable cases) → server's escalation predicate is wrong. +- Server retries AbandonScript multiple times for the same ticket → server idempotency broken. +- Server calls AbandonScript before the 2-minute window → timer is wrong. +- Server calls AbandonScript with the Tentacle capability missing → capability gating broken. + +### Sign-off criteria + +To turn the feature flag on by default in a future release: M1–M5 pass on Windows; M3 + M4 pass on Linux; M7 passes on Polling; M9 confirms server escalation policy; M6 confirms workspace leak is bounded and logged. + +## ADR-42 / "not killed" reversal + +The original ticket and ADR-42 said the runaway process is **not** killed. This design **reverses** that: abandon now best-effort-kills the process (anti-abuse — see Abandon semantics). The Tentacle script-log line remains accurate because the process survives only when the kill genuinely couldn't land. + +Follow-ups required outside this Tentacle work: +- **ADR-42** — update to reflect best-effort-kill-on-abandon. +- **EFT-3295 ticket scope** — update the "do not kill" statement. +- **Docs PR #3175** — update the customer-facing wording to match best-effort-kill-on-abandon. + +**Unchanged:** the server-side RPC contract — `AbandonScript(AbandonScriptCommandV2) → ScriptStatusResponseV2`, exit code -48 — is identical to what was already agreed. + +## Risks and rollout + +- **Feature flag off by default** for the first release. Customer-by-customer opt-in. +- **PR2 sequencing:** PR2 (#1244) must ship only after every server in the fleet supports the abandon escalation, because PR2 removes `process.Close()` and makes server abandon the only recovery for the re-parented-grandchild case. +- **Sequence:** after EFT V1 cleanup closes (target end May 2026), before Task Cap 320, targeting Philips' July self-host release. +- **Telemetry:** count of AbandonScript calls per Tentacle per day. Spike = signal that either Cancel is broken or this feature is masking a different bug. +- **Soak test pre-release:** 1000 normal scripts with the server-side flag ON, verify no resource leak vs. flag OFF baseline. + +## Open questions for external reviewer + +(None remaining. Workspace cleanup policy resolved 2026-05-21 — targeted best-effort gated on `AbandonedExitCode` in the stateStore. No janitor; OS-level state on the host is the customer's responsibility per the ticket.) + +## Coordination — locked with the server-side session (2026-05-21) + +Aligned via Linear thread on EFT-3295 (commenter Jim, both sessions). Items below are locked unless explicitly noted. + +**Contract (final shape):** + +- `ScriptStatusResponseV2 AbandonScript(AbandonScriptCommandV2 command)` on `IScriptServiceV2`. +- `AbandonScriptCommandV2 { ScriptTicket Ticket; long LastLogSequence; }` — same shape as `CancelScriptCommandV2`. Server-side dropped its initial `ServerTaskId` and "cancellation correlation id" proposal; `ScriptTicket` is sufficient. +- Capability name: `AbandonScriptV2`. + +**Idempotency (final):** Tentacle returns actual current status. Already-completed script returns `(Complete, realExitCode, logs)` — distinct from `AbandonedExitCode`, so the server's task log entry can record that the abandon was unnecessary. Unknown/already-cleaned-up ticket returns `(Complete, UnknownScriptExitCode, [])`, matching Cancel's existing shape. + +**Capability check is the primary gate.** Server uses `BackwardsCompatibleAsyncCapabilitiesV2Decorator` to query `AbandonScriptV2` once per session. Capability absent → server does not schedule the abandon dispatch at all. The RPC-fail-then-log path stays as a defensive fallback for capability-cache staleness. + +**One off-switch, server-side:** `AbandonTentacleScriptOnCancellationTimeoutFeatureToggle` (default ON). Governs whether server escalates to AbandonScript at all. No Tentacle-side flag — Tentacle's capability advertisement is binary on build version. + +**Escalation timing (locked for first release):** 2 minutes. Both V1 and V2 execution pipelines escalate to AbandonScript on their next status-poll once cancellation has been pending that long. Hardcoded on the server toggle class, not configurable. Server-side trigger is a polling-loop check, not a delayed NSB message; no new server-side timers. The Tentacle-side contract is unchanged either way. + +**Execution-pipeline scope (server-side, 2026-05-21):** V1 *and* V2 server-side execution pipelines call AbandonScript via the same contract. Philips is V1 self-host so V1 is the urgent path. Doesn't change anything Tentacle is building. + +**Post-abandon flow:** + +1. Server calls `AbandonScript` → gets `ScriptStatusResponseV2`. +2. Server publishes `TentacleScriptAbandonedEvent`. +3. Existing post-cancel path proceeds (eventually calls `CompleteScript` downstream). + +**Task log wording:** + +- Tentacle script log (this doc's Section 4): *Tentacle has abandoned this script. The underlying script process may still be running on this host.* +- Server task log (server session's surface). Server session's working proposal: + - On dispatch: *Cancellation hasn't taken effect on Tentacle after 2 minutes. Abandoning the script to release the script-isolation mutex.* + - On Tentacle returning `AbandonedExitCode`: *Tentacle abandoned the script.* + - On Tentacle returning a real exit code (abandon unnecessary): *Script had already completed before abandon was needed.* +- I pushed back on the dispatch wording — "script-isolation mutex" exposes internal terminology to the customer. Suggested rewrite: *Cancellation hasn't taken effect on Tentacle after 2 minutes. Abandoning the script so this target can accept new deployments.* Server session's call which to ship with. diff --git a/source/Octopus.Manager.Tentacle.Tests/Builders/SetupTentacleWizardModelBuilder.cs b/source/Octopus.Manager.Tentacle.Tests/Builders/SetupTentacleWizardModelBuilder.cs index 0e71b9e1a..b86c59576 100644 --- a/source/Octopus.Manager.Tentacle.Tests/Builders/SetupTentacleWizardModelBuilder.cs +++ b/source/Octopus.Manager.Tentacle.Tests/Builders/SetupTentacleWizardModelBuilder.cs @@ -32,6 +32,8 @@ public SetupTentacleWizardModelBuilder() commandLineRunner = Substitute.For(); commandLineRunner.Execute(Arg.Any(), Arg.Any()).Returns(true); commandLineRunner.Execute(Arg.Any>(), Arg.Any()).Returns(true); + commandLineRunner.ExecuteAsync(Arg.Any(), Arg.Any()).Returns(true); + commandLineRunner.ExecuteAsync(Arg.Any>(), Arg.Any()).Returns(true); } public SetupTentacleWizardModelBuilder WithTelemetryService(ITelemetryService telemetryService) diff --git a/source/Octopus.Manager.Tentacle/PreReq/PowerShellPrerequisite.cs b/source/Octopus.Manager.Tentacle/PreReq/PowerShellPrerequisite.cs index 92c3768f1..4f13e91b4 100644 --- a/source/Octopus.Manager.Tentacle/PreReq/PowerShellPrerequisite.cs +++ b/source/Octopus.Manager.Tentacle/PreReq/PowerShellPrerequisite.cs @@ -1,5 +1,7 @@ using System; using System.IO; +using System.Threading; +using System.Threading.Tasks; using Octopus.Tentacle.Diagnostics; using Octopus.Tentacle.Util; @@ -9,9 +11,24 @@ public class PowerShellPrerequisite : IPrerequisite { public string StatusMessage => "Checking that Windows PowerShell 2.0 is installed..."; + // Why this is sync: IPrerequisite.Check() is part of a sync interface used by + // the WPF installer's prerequisite plumbing. Making it async would mean + // converting the whole IPrerequisite chain, which is a wider refactor than + // this PR. + // + // Why blocking on the async call is safe: PreReqWindow.Start dispatches each + // prerequisite via DispatchHelper.Background, which queues us via + // ThreadPool.QueueUserWorkItem. That's a plain thread-pool worker with no + // SynchronizationContext, so there's nothing for the awaited continuation + // to wait on. + // See https://blog.stephencleary.com/2012/07/dont-block-on-async-code.html public PrerequisiteCheckResult Check() + => CheckAsync().GetAwaiter().GetResult(); + + async Task CheckAsync() { - return CheckPowerShellIsInstalled(out var commandLineOutput) + var (installed, commandLineOutput) = await CheckPowerShellIsInstalledAsync(); + return installed ? PrerequisiteCheckResult.Successful() : PrerequisiteCheckResult.Failed("Windows PowerShell 2.0 or above does not appear to be installed and on the System Path on this machine. Please install Windows PowerShell or add it to the System Path then re-run this setup tool.", helpLink: "http://g.octopushq.com/HowDoIInstallPowerShell", @@ -19,21 +36,21 @@ public PrerequisiteCheckResult Check() commandLineOutput: commandLineOutput); } - static bool CheckPowerShellIsInstalled(out string commandLineOutput) + static async Task<(bool installed, string commandLineOutput)> CheckPowerShellIsInstalledAsync() { var stdOut = new StringWriter(); var stdErr = new StringWriter(); const string powerShellExe = "powershell.exe"; const string arguments = "-NonInteractive -NoProfile -Command \"Write-Output $PSVersionTable.PSVersion\""; - commandLineOutput = $"{powerShellExe} {arguments}"; + var commandLineOutput = $"{powerShellExe} {arguments}"; // Despite our old check conforming to Microsoft's recommendations // for PS version checking around the 1.0/2.0 era, and extending // to detect 3.0, it failed to detect 4. Going the direct route: try { - SilentProcessRunnerExtended.ExecuteCommand( + await SilentProcessRunnerExtended.ExecuteCommandAsync( powerShellExe, arguments, ".", @@ -45,12 +62,12 @@ static bool CheckPowerShellIsInstalled(out string commandLineOutput) commandLineOutput = $"{commandLineOutput}{Environment.NewLine}{stdOut}{Environment.NewLine}{stdErr}"; - return outputText.Contains("Major"); + return (outputText.Contains("Major"), commandLineOutput); } catch (Exception ex) { commandLineOutput = $"{commandLineOutput}{Environment.NewLine}{ex}"; - return false; + return (false, commandLineOutput); } } } diff --git a/source/Octopus.Manager.Tentacle/TentacleConfiguration/SetupWizard/ReviewAndRunScriptTabViewModel.cs b/source/Octopus.Manager.Tentacle/TentacleConfiguration/SetupWizard/ReviewAndRunScriptTabViewModel.cs index 2b8eef4c9..1eb1be871 100644 --- a/source/Octopus.Manager.Tentacle/TentacleConfiguration/SetupWizard/ReviewAndRunScriptTabViewModel.cs +++ b/source/Octopus.Manager.Tentacle/TentacleConfiguration/SetupWizard/ReviewAndRunScriptTabViewModel.cs @@ -46,7 +46,7 @@ public async Task GenerateAndExecuteScript() { var script = GenerateScript(); ContributeSensitiveValues(logger); - success = commandLineRunner.Execute(script, logger); + success = await commandLineRunner.ExecuteAsync(script, logger); } catch (Exception ex) { diff --git a/source/Octopus.Tentacle.Client.Tests/CapabilitiesResponseV2ExtensionMethodsTests.cs b/source/Octopus.Tentacle.Client.Tests/CapabilitiesResponseV2ExtensionMethodsTests.cs new file mode 100644 index 000000000..16ad82db6 --- /dev/null +++ b/source/Octopus.Tentacle.Client.Tests/CapabilitiesResponseV2ExtensionMethodsTests.cs @@ -0,0 +1,33 @@ +using System.Collections.Generic; +using FluentAssertions; +using NUnit.Framework; +using Octopus.Tentacle.Client.Capabilities; +using Octopus.Tentacle.Contracts.Capabilities; + +namespace Octopus.Tentacle.Client.Tests +{ + [TestFixture] + public class CapabilitiesResponseV2ExtensionMethodsTests + { + [Test] + public void HasAbandonScriptV2_WhenAdvertised_ReturnsTrue() + { + var capabilities = new CapabilitiesResponseV2(new List { "IScriptServiceV2", "AbandonScriptAsync" }); + capabilities.HasAbandonScriptV2().Should().BeTrue(); + } + + [Test] + public void HasAbandonScriptV2_WhenNotAdvertised_ReturnsFalse() + { + var capabilities = new CapabilitiesResponseV2(new List { "IScriptServiceV2" }); + capabilities.HasAbandonScriptV2().Should().BeFalse(); + } + + [Test] + public void HasAbandonScriptV2_WhenEmpty_ReturnsFalse() + { + var capabilities = new CapabilitiesResponseV2(new List()); + capabilities.HasAbandonScriptV2().Should().BeFalse(); + } + } +} diff --git a/source/Octopus.Tentacle.Client.Tests/ObservingScriptOrchestratorAbandonTests.cs b/source/Octopus.Tentacle.Client.Tests/ObservingScriptOrchestratorAbandonTests.cs new file mode 100644 index 000000000..8348bdd78 --- /dev/null +++ b/source/Octopus.Tentacle.Client.Tests/ObservingScriptOrchestratorAbandonTests.cs @@ -0,0 +1,74 @@ +using System; +using System.Collections.Generic; +using System.Threading; +using System.Threading.Tasks; +using FluentAssertions; +using NSubstitute; +using NUnit.Framework; +using Octopus.Tentacle.Client.EventDriven; +using Octopus.Tentacle.Client.Scripts; +using Octopus.Tentacle.Contracts; + +namespace Octopus.Tentacle.Client.Tests +{ + [TestFixture] + public class ObservingScriptOrchestratorAbandonTests + { + static CommandContext Context() => new(new ScriptTicket("T"), 0, ScriptServiceVersion.ScriptServiceVersion2); + + static ScriptOperationExecutionResult Running() + => new(new ScriptStatus(ProcessState.Running, 0, new List()), Context()); + + static ScriptOperationExecutionResult Complete(int exitCode) + => new(new ScriptStatus(ProcessState.Complete, exitCode, new List()), Context()); + + static ObservingScriptOrchestrator CreateOrchestrator(IScriptExecutor executor, TimeSpan? abandonAfter) + => new( + new ImmediateBackoff(), + _ => { }, + _ => Task.CompletedTask, + executor, + abandonAfter); + + sealed class ImmediateBackoff : IScriptObserverBackoffStrategy + { + public TimeSpan GetBackoff(int iteration) => TimeSpan.Zero; + } + + [Test] + public async Task ParamUnset_CancelsOnly_NeverAbandons() + { + var executor = Substitute.For(); + executor.CancelScript(Arg.Any()) + .Returns(Running(), Complete(ScriptExitCodes.CanceledExitCode)); + + var orchestrator = CreateOrchestrator(executor, abandonAfter: null); + using var cts = new CancellationTokenSource(); + cts.Cancel(); + + await orchestrator.ObserveUntilComplete(Running(), cts.Token); + + await executor.DidNotReceive().AbandonScript(Arg.Any()); + await executor.Received().CancelScript(Arg.Any()); + } + + [Test] + public async Task ThresholdCrossed_SwitchesFromCancelToAbandon() + { + var executor = Substitute.For(); + // abandonAfter = Zero means the threshold is crossed on the first cancelled iteration, + // so the orchestrator abandons immediately. AbandonScript returns Complete to end the loop. + executor.AbandonScript(Arg.Any()) + .Returns(Complete(ScriptExitCodes.AbandonedExitCode)); + + var orchestrator = CreateOrchestrator(executor, abandonAfter: TimeSpan.Zero); + using var cts = new CancellationTokenSource(); + cts.Cancel(); + + var result = await orchestrator.ObserveUntilComplete(Running(), cts.Token); + + await executor.Received().AbandonScript(Arg.Any()); + result.ScriptStatus.ExitCode.Should().Be(ScriptExitCodes.AbandonedExitCode); + } + } +} diff --git a/source/Octopus.Tentacle.Client.Tests/ScriptServiceV2ExecutorAbandonTests.cs b/source/Octopus.Tentacle.Client.Tests/ScriptServiceV2ExecutorAbandonTests.cs new file mode 100644 index 000000000..4d3dca224 --- /dev/null +++ b/source/Octopus.Tentacle.Client.Tests/ScriptServiceV2ExecutorAbandonTests.cs @@ -0,0 +1,82 @@ +using System; +using System.Collections.Generic; +using System.Threading; +using System.Threading.Tasks; +using FluentAssertions; +using Halibut.ServiceModel; +using NSubstitute; +using NUnit.Framework; +using Octopus.Tentacle.Client; +using Octopus.Tentacle.Client.EventDriven; +using Octopus.Tentacle.Client.Execution; +using Octopus.Tentacle.Client.Observability; +using Octopus.Tentacle.Client.Retries; +using Octopus.Tentacle.Client.Scripts; +using Octopus.Tentacle.Contracts; +using Octopus.Tentacle.Contracts.Capabilities; +using Octopus.Tentacle.Contracts.ClientServices; +using Octopus.Tentacle.Contracts.Logging; +using Octopus.Tentacle.Contracts.Observability; +using Octopus.Tentacle.Contracts.ScriptServiceV2; + +namespace Octopus.Tentacle.Client.Tests +{ + [TestFixture] + public class ScriptServiceV2ExecutorAbandonTests + { + static ScriptServiceV2Executor CreateExecutor(IAsyncClientScriptServiceV2 scriptService, IAsyncClientCapabilitiesServiceV2 capabilities) + => new( + scriptService, + capabilities, + RpcCallExecutorFactory.Create(TimeSpan.Zero, Substitute.For()), + ClientOperationMetricsBuilder.Start(), + TimeSpan.Zero, + new TentacleClientOptions(new RpcRetrySettings(RetriesEnabled: false, RetryDuration: TimeSpan.Zero)), + Substitute.For()); + + static CommandContext Context() => new(new ScriptTicket("TestTicket"), 0, ScriptServiceVersion.ScriptServiceVersion2); + + [Test] + public async Task AbandonScript_WhenCapabilityAdvertised_CallsAbandonScriptAsync() + { + var scriptService = Substitute.For(); + scriptService.AbandonScriptAsync(Arg.Any(), Arg.Any()) + .Returns(x => Task.FromResult(new ScriptStatusResponseV2( + x.Arg().Ticket, ProcessState.Complete, + ScriptExitCodes.AbandonedExitCode, new List(), 1))); + + var capabilities = Substitute.For(); + capabilities.GetCapabilitiesAsync(Arg.Any()) + .Returns(Task.FromResult(new CapabilitiesResponseV2(new List { "IScriptServiceV2", "AbandonScriptAsync" }))); + + var executor = CreateExecutor(scriptService, capabilities); + + var result = await executor.AbandonScript(Context()); + + await scriptService.Received(1).AbandonScriptAsync(Arg.Any(), Arg.Any()); + await scriptService.DidNotReceive().CancelScriptAsync(Arg.Any(), Arg.Any()); + result.ScriptStatus.ExitCode.Should().Be(ScriptExitCodes.AbandonedExitCode); + } + + [Test] + public async Task AbandonScript_WhenCapabilityAbsent_FallsBackToCancelScriptAsync() + { + var scriptService = Substitute.For(); + scriptService.CancelScriptAsync(Arg.Any(), Arg.Any()) + .Returns(x => Task.FromResult(new ScriptStatusResponseV2( + x.Arg().Ticket, ProcessState.Running, + 0, new List(), 1))); + + var capabilities = Substitute.For(); + capabilities.GetCapabilitiesAsync(Arg.Any()) + .Returns(Task.FromResult(new CapabilitiesResponseV2(new List { "IScriptServiceV2" }))); + + var executor = CreateExecutor(scriptService, capabilities); + + await executor.AbandonScript(Context()); + + await scriptService.DidNotReceive().AbandonScriptAsync(Arg.Any(), Arg.Any()); + await scriptService.Received(1).CancelScriptAsync(Arg.Any(), Arg.Any()); + } + } +} diff --git a/source/Octopus.Tentacle.Client/Capabilities/CapabilitiesResponseV2ExtensionMethods.cs b/source/Octopus.Tentacle.Client/Capabilities/CapabilitiesResponseV2ExtensionMethods.cs index bc49a1e95..ad9558c87 100644 --- a/source/Octopus.Tentacle.Client/Capabilities/CapabilitiesResponseV2ExtensionMethods.cs +++ b/source/Octopus.Tentacle.Client/Capabilities/CapabilitiesResponseV2ExtensionMethods.cs @@ -15,5 +15,17 @@ public static bool HasScriptServiceV2(this CapabilitiesResponseV2 capabilities) return capabilities.SupportedCapabilities.Contains(nameof(IScriptServiceV2)); } + + public static bool HasAbandonScriptV2(this CapabilitiesResponseV2 capabilities) + { + if (capabilities?.SupportedCapabilities?.Any() != true) + { + return false; + } + + // Tentacle advertises this as nameof(ScriptServiceV2.AbandonScriptAsync). Keep the + // literal in sync with CapabilitiesServiceV2.GetCapabilitiesAsync on the Tentacle side. + return capabilities.SupportedCapabilities.Contains("AbandonScriptAsync"); + } } } \ No newline at end of file diff --git a/source/Octopus.Tentacle.Client/ITentacleClient.cs b/source/Octopus.Tentacle.Client/ITentacleClient.cs index c79a285fc..297211eff 100644 --- a/source/Octopus.Tentacle.Client/ITentacleClient.cs +++ b/source/Octopus.Tentacle.Client/ITentacleClient.cs @@ -8,6 +8,7 @@ using Octopus.Tentacle.Client.Scripts.Models; using Octopus.Tentacle.Contracts; using Octopus.Tentacle.Contracts.Logging; +using Octopus.Tentacle.Contracts.ScriptServiceV2; namespace Octopus.Tentacle.Client { @@ -31,7 +32,8 @@ Task ExecuteScript( OnScriptStatusResponseReceived onScriptStatusResponseReceived, OnScriptCompleted onScriptCompleted, ITentacleClientTaskLog logger, - CancellationToken scriptExecutionCancellationToken); + CancellationToken scriptExecutionCancellationToken, + TimeSpan? abandonAfterCancellationPendingFor = null); /// /// Start the script. @@ -59,6 +61,17 @@ Task StartScript(ExecuteScriptCommand command, /// The result, which includes the CommandContext for the next command Task CancelScript(CommandContext commandContext, ITentacleClientTaskLog logger); + /// + /// Abandon a running script. Signals Tentacle to release the script's isolation mutex + /// and clean up its workspace without waiting for the underlying process to exit. + /// Used as an escape hatch when CancelScript cannot terminate a stuck process. + /// + /// The ticket of the script to abandon + /// Used to output user orientated log messages + /// Cancels the RPC call + /// The current status snapshot of the script at the time abandon was processed + Task AbandonScript(ScriptTicket scriptTicket, ITentacleClientTaskLog logger, CancellationToken cancellationToken); + /// /// Complete the script. /// diff --git a/source/Octopus.Tentacle.Client/ScriptExecutor.cs b/source/Octopus.Tentacle.Client/ScriptExecutor.cs index 572db403a..1ce5e1f2c 100644 --- a/source/Octopus.Tentacle.Client/ScriptExecutor.cs +++ b/source/Octopus.Tentacle.Client/ScriptExecutor.cs @@ -69,7 +69,15 @@ public async Task CancelScript(CommandContext co return await scriptExecutor.CancelScript(commandContext); } - + + public async Task AbandonScript(CommandContext commandContext) + { + var scriptExecutorFactory = CreateScriptExecutorFactory(); + var scriptExecutor = scriptExecutorFactory.CreateScriptExecutor(commandContext.ScripServiceVersionUsed); + + return await scriptExecutor.AbandonScript(commandContext); + } + public async Task CompleteScript(CommandContext commandContext, CancellationToken cancellationToken) { var scriptExecutorFactory = CreateScriptExecutorFactory(); diff --git a/source/Octopus.Tentacle.Client/Scripts/IScriptExecutor.cs b/source/Octopus.Tentacle.Client/Scripts/IScriptExecutor.cs index 0544ef9ed..48fb6536c 100644 --- a/source/Octopus.Tentacle.Client/Scripts/IScriptExecutor.cs +++ b/source/Octopus.Tentacle.Client/Scripts/IScriptExecutor.cs @@ -32,6 +32,14 @@ Task StartScript(ExecuteScriptCommand command, /// The result, which includes the CommandContext for the next command Task CancelScript(CommandContext commandContext); + /// + /// Abandon the script: signal Tentacle to stop waiting and release the isolation mutex. + /// Returns the abandoned status when the Tentacle advertises the abandon capability; + /// otherwise falls back to cancelling and returns that result. + /// + /// The CommandContext from the previous command + Task AbandonScript(CommandContext commandContext); + /// /// Complete the script. /// diff --git a/source/Octopus.Tentacle.Client/Scripts/KubernetesScriptServiceV1Executor.cs b/source/Octopus.Tentacle.Client/Scripts/KubernetesScriptServiceV1Executor.cs index b5b1b1e44..16dfc559b 100644 --- a/source/Octopus.Tentacle.Client/Scripts/KubernetesScriptServiceV1Executor.cs +++ b/source/Octopus.Tentacle.Client/Scripts/KubernetesScriptServiceV1Executor.cs @@ -184,6 +184,12 @@ async Task CancelScriptAction(CancellationToke return Map(kubernetesScriptStatusResponseV1); } + public Task AbandonScript(CommandContext commandContext) + { + // KubernetesScriptServiceV1 has no abandon verb; degrade to cancel. + return CancelScript(commandContext); + } + public async Task CompleteScript(CommandContext lastStatusResponse, CancellationToken scriptExecutionCancellationToken) { using var activity = TentacleClient.ActivitySource.StartActivity($"{nameof(KubernetesScriptServiceV1Executor)}.{nameof(CompleteScript)}"); diff --git a/source/Octopus.Tentacle.Client/Scripts/ObservingScriptOrchestrator.cs b/source/Octopus.Tentacle.Client/Scripts/ObservingScriptOrchestrator.cs index 209941900..6233f1e1f 100644 --- a/source/Octopus.Tentacle.Client/Scripts/ObservingScriptOrchestrator.cs +++ b/source/Octopus.Tentacle.Client/Scripts/ObservingScriptOrchestrator.cs @@ -14,17 +14,20 @@ sealed class ObservingScriptOrchestrator readonly OnScriptStatusResponseReceived onScriptStatusResponseReceived; readonly OnScriptCompleted onScriptCompleted; readonly IScriptExecutor scriptExecutor; + readonly TimeSpan? abandonAfterCancellationPendingFor; public ObservingScriptOrchestrator( IScriptObserverBackoffStrategy scriptObserverBackOffStrategy, OnScriptStatusResponseReceived onScriptStatusResponseReceived, OnScriptCompleted onScriptCompleted, - IScriptExecutor scriptExecutor) + IScriptExecutor scriptExecutor, + TimeSpan? abandonAfterCancellationPendingFor = null) { this.scriptExecutor = scriptExecutor; this.scriptObserverBackOffStrategy = scriptObserverBackOffStrategy; this.onScriptStatusResponseReceived = onScriptStatusResponseReceived; this.onScriptCompleted = onScriptCompleted; + this.abandonAfterCancellationPendingFor = abandonAfterCancellationPendingFor; } public async Task ExecuteScript(ExecuteScriptCommand command, CancellationToken scriptExecutionCancellationToken) @@ -73,19 +76,31 @@ async Task ObserveUntilCompleteThenFinish( return observingUntilCompleteResult.ScriptStatus; } - async Task ObserveUntilComplete( + internal async Task ObserveUntilComplete( ScriptOperationExecutionResult startScriptResult, CancellationToken scriptExecutionCancellationToken) { var iteration = 0; var cancellationIteration = 0; var lastResult = startScriptResult; + var stopwatch = new Stopwatch(); while (lastResult.ScriptStatus.State != ProcessState.Complete) { if (scriptExecutionCancellationToken.IsCancellationRequested) { - lastResult = await scriptExecutor.CancelScript(lastResult.ContextForNextCommand).ConfigureAwait(false); + // Record when cancellation first fired so we can escalate to abandon after the threshold. + if (!stopwatch.IsRunning) + { + stopwatch.Start(); + } + + var shouldAbandon = abandonAfterCancellationPendingFor is { } threshold + && stopwatch.Elapsed >= threshold; + + lastResult = shouldAbandon + ? await scriptExecutor.AbandonScript(lastResult.ContextForNextCommand).ConfigureAwait(false) + : await scriptExecutor.CancelScript(lastResult.ContextForNextCommand).ConfigureAwait(false); } else { diff --git a/source/Octopus.Tentacle.Client/Scripts/ScriptExecutorFactory.cs b/source/Octopus.Tentacle.Client/Scripts/ScriptExecutorFactory.cs index a67824d6f..d5b0af94a 100644 --- a/source/Octopus.Tentacle.Client/Scripts/ScriptExecutorFactory.cs +++ b/source/Octopus.Tentacle.Client/Scripts/ScriptExecutorFactory.cs @@ -47,6 +47,7 @@ public IScriptExecutor CreateScriptExecutor(ScriptServiceVersion scriptServiceTo { return new ScriptServiceV2Executor( allClients.ScriptServiceV2, + allClients.CapabilitiesServiceV2, rpcCallExecutor, clientOperationMetricsBuilder, onCancellationAbandonCompleteScriptAfter, diff --git a/source/Octopus.Tentacle.Client/Scripts/ScriptServiceV1Executor.cs b/source/Octopus.Tentacle.Client/Scripts/ScriptServiceV1Executor.cs index b777b8e1d..8d0d32c81 100644 --- a/source/Octopus.Tentacle.Client/Scripts/ScriptServiceV1Executor.cs +++ b/source/Octopus.Tentacle.Client/Scripts/ScriptServiceV1Executor.cs @@ -164,6 +164,12 @@ public async Task CancelScript(CommandContext co return Map(response); } + public Task AbandonScript(CommandContext commandContext) + { + // ScriptServiceV1 has no abandon verb; degrade to cancel. + return CancelScript(commandContext); + } + public async Task CompleteScript(CommandContext lastStatusResponse, CancellationToken scriptExecutionCancellationToken) { try diff --git a/source/Octopus.Tentacle.Client/Scripts/ScriptServiceV2Executor.cs b/source/Octopus.Tentacle.Client/Scripts/ScriptServiceV2Executor.cs index c3fc68898..abc3270a3 100644 --- a/source/Octopus.Tentacle.Client/Scripts/ScriptServiceV2Executor.cs +++ b/source/Octopus.Tentacle.Client/Scripts/ScriptServiceV2Executor.cs @@ -3,6 +3,7 @@ using System.Threading.Tasks; using Halibut; using Halibut.ServiceModel; +using Octopus.Tentacle.Client.Capabilities; using Octopus.Tentacle.Client.EventDriven; using Octopus.Tentacle.Client.Execution; using Octopus.Tentacle.Client.Observability; @@ -18,6 +19,7 @@ namespace Octopus.Tentacle.Client.Scripts class ScriptServiceV2Executor : IScriptExecutor { readonly IAsyncClientScriptServiceV2 clientScriptServiceV2; + readonly IAsyncClientCapabilitiesServiceV2 clientCapabilitiesServiceV2; readonly RpcCallExecutor rpcCallExecutor; readonly ClientOperationMetricsBuilder clientOperationMetricsBuilder; readonly TimeSpan onCancellationAbandonCompleteScriptAfter; @@ -26,6 +28,7 @@ class ScriptServiceV2Executor : IScriptExecutor public ScriptServiceV2Executor( IAsyncClientScriptServiceV2 clientScriptServiceV2, + IAsyncClientCapabilitiesServiceV2 clientCapabilitiesServiceV2, RpcCallExecutor rpcCallExecutor, ClientOperationMetricsBuilder clientOperationMetricsBuilder, TimeSpan onCancellationAbandonCompleteScriptAfter, @@ -33,6 +36,7 @@ public ScriptServiceV2Executor( ITentacleClientTaskLog logger) { this.clientScriptServiceV2 = clientScriptServiceV2; + this.clientCapabilitiesServiceV2 = clientCapabilitiesServiceV2; this.rpcCallExecutor = rpcCallExecutor; this.clientOperationMetricsBuilder = clientOperationMetricsBuilder; this.onCancellationAbandonCompleteScriptAfter = onCancellationAbandonCompleteScriptAfter; @@ -173,6 +177,42 @@ async Task CancelScriptAction(CancellationToken ct) return Map(scriptStatusResponseV2); } + public async Task AbandonScript(CommandContext commandContext) + { + using var activity = TentacleClient.ActivitySource.StartActivity($"{nameof(ScriptServiceV2Executor)}.{nameof(AbandonScript)}"); + + var capabilities = await rpcCallExecutor.Execute( + retriesEnabled: clientOptions.RpcRetrySettings.RetriesEnabled, + RpcCall.Create(nameof(Contracts.Capabilities.ICapabilitiesServiceV2.GetCapabilities)), + async ct => await clientCapabilitiesServiceV2.GetCapabilitiesAsync(new HalibutProxyRequestOptions(ct)), + logger, + clientOperationMetricsBuilder, + CancellationToken.None).ConfigureAwait(false); + + // Capability absent → the Tentacle is too old to abandon. Keep cancelling cleanly. + if (!capabilities.HasAbandonScriptV2()) + { + logger.Verbose("Tentacle does not advertise AbandonScript; falling back to CancelScript."); + return await CancelScript(commandContext).ConfigureAwait(false); + } + + async Task AbandonScriptAction(CancellationToken ct) + { + var request = new AbandonScriptCommandV2(commandContext.ScriptTicket, commandContext.NextLogSequence); + return await clientScriptServiceV2.AbandonScriptAsync(request, new HalibutProxyRequestOptions(ct)); + } + + var scriptStatusResponseV2 = await rpcCallExecutor.Execute( + retriesEnabled: clientOptions.RpcRetrySettings.RetriesEnabled, + RpcCall.Create(nameof(IScriptServiceV2.AbandonScript)), + AbandonScriptAction, + logger, + clientOperationMetricsBuilder, + // Like CancelScript, abandon must not be cancelled — it stops the script on Tentacle. + CancellationToken.None).ConfigureAwait(false); + return Map(scriptStatusResponseV2); + } + public async Task CompleteScript(CommandContext lastStatusResponse, CancellationToken scriptExecutionCancellationToken) { using var activity = TentacleClient.ActivitySource.StartActivity($"{nameof(ScriptServiceV2Executor)}.{nameof(CompleteScript)}"); diff --git a/source/Octopus.Tentacle.Client/TentacleClient.cs b/source/Octopus.Tentacle.Client/TentacleClient.cs index 8176cccc0..ca8003d95 100644 --- a/source/Octopus.Tentacle.Client/TentacleClient.cs +++ b/source/Octopus.Tentacle.Client/TentacleClient.cs @@ -16,6 +16,7 @@ using Octopus.Tentacle.Contracts.Capabilities; using Octopus.Tentacle.Contracts.Logging; using Octopus.Tentacle.Contracts.Observability; +using Octopus.Tentacle.Contracts.ScriptServiceV2; using ITentacleClientObserver = Octopus.Tentacle.Contracts.Observability.ITentacleClientObserver; namespace Octopus.Tentacle.Client @@ -168,7 +169,8 @@ public async Task ExecuteScript(ExecuteScriptCommand exec OnScriptStatusResponseReceived onScriptStatusResponseReceived, OnScriptCompleted onScriptCompleted, ITentacleClientTaskLog logger, - CancellationToken scriptExecutionCancellationToken) + CancellationToken scriptExecutionCancellationToken, + TimeSpan? abandonAfterCancellationPendingFor = null) { using var activity = ActivitySource.StartActivity($"{nameof(TentacleClient)}.{nameof(ExecuteScript)}"); activity?.AddTag("octopus.tentacle.script.files", string.Join(",", executeScriptCommand.Files.Select(f => f.Name))); @@ -188,7 +190,8 @@ public async Task ExecuteScript(ExecuteScriptCommand exec var orchestrator = new ObservingScriptOrchestrator(scriptObserverBackOffStrategy, onScriptStatusResponseReceived, onScriptCompleted, - scriptExecutor); + scriptExecutor, + abandonAfterCancellationPendingFor); var result = await orchestrator.ExecuteScript(executeScriptCommand, scriptExecutionCancellationToken); @@ -260,6 +263,36 @@ public async Task CancelScript(CommandContext co return await scriptExecutor.CancelScript(commandContext); } + public async Task AbandonScript(ScriptTicket scriptTicket, ITentacleClientTaskLog logger, CancellationToken cancellationToken) + { + using var activity = ActivitySource.StartActivity($"{nameof(TentacleClient)}.{nameof(AbandonScript)}"); + activity?.AddTag("octopus.tentacle.script.ticket", scriptTicket.TaskId); + + var operationMetricsBuilder = ClientOperationMetricsBuilder.Start(); + + async Task AbandonScriptAction(CancellationToken ct) + { + var request = new AbandonScriptCommandV2(scriptTicket, lastLogSequence: 0); + return await allClients.ScriptServiceV2.AbandonScriptAsync(request, new HalibutProxyRequestOptions(ct)); + } + + try + { + return await rpcCallExecutor.Execute( + retriesEnabled: clientOptions.RpcRetrySettings.RetriesEnabled, + RpcCall.Create(nameof(IScriptServiceV2.AbandonScript)), + AbandonScriptAction, + logger, + operationMetricsBuilder, + cancellationToken).ConfigureAwait(false); + } + catch (Exception e) + { + operationMetricsBuilder.Failure(e, cancellationToken); + throw; + } + } + public async Task CompleteScript(CommandContext commandContext, ITentacleClientTaskLog logger, CancellationToken scriptExecutionCancellationToken) { using var activity = ActivitySource.StartActivity($"{nameof(TentacleClient)}.{nameof(CompleteScript)}"); diff --git a/source/Octopus.Tentacle.Contracts/ClientServices/IAsyncClientScriptServiceV2.cs b/source/Octopus.Tentacle.Contracts/ClientServices/IAsyncClientScriptServiceV2.cs index 996f915c1..0c89048e4 100644 --- a/source/Octopus.Tentacle.Contracts/ClientServices/IAsyncClientScriptServiceV2.cs +++ b/source/Octopus.Tentacle.Contracts/ClientServices/IAsyncClientScriptServiceV2.cs @@ -10,6 +10,7 @@ public interface IAsyncClientScriptServiceV2 Task StartScriptAsync(StartScriptCommandV2 command, HalibutProxyRequestOptions proxyRequestOptions); Task GetStatusAsync(ScriptStatusRequestV2 request, HalibutProxyRequestOptions proxyRequestOptions); Task CancelScriptAsync(CancelScriptCommandV2 command, HalibutProxyRequestOptions proxyRequestOptions); + Task AbandonScriptAsync(AbandonScriptCommandV2 command, HalibutProxyRequestOptions proxyRequestOptions); Task CompleteScriptAsync(CompleteScriptCommandV2 command, HalibutProxyRequestOptions proxyRequestOptions); } } \ No newline at end of file diff --git a/source/Octopus.Tentacle.Contracts/ScriptExitCodes.cs b/source/Octopus.Tentacle.Contracts/ScriptExitCodes.cs index 2e0ce1b15..2319410c4 100644 --- a/source/Octopus.Tentacle.Contracts/ScriptExitCodes.cs +++ b/source/Octopus.Tentacle.Contracts/ScriptExitCodes.cs @@ -12,6 +12,7 @@ public static class ScriptExitCodes public const int UnknownScriptExitCode = -45; public const int UnknownResultExitCode = -46; public const int PowerShellNeverStartedExitCode = -47; + public const int AbandonedExitCode = -48; //Kubernetes Agent public const int KubernetesScriptPodNotFound = -81; diff --git a/source/Octopus.Tentacle.Contracts/ScriptServiceV2/AbandonScriptCommandV2.cs b/source/Octopus.Tentacle.Contracts/ScriptServiceV2/AbandonScriptCommandV2.cs new file mode 100644 index 000000000..66efba446 --- /dev/null +++ b/source/Octopus.Tentacle.Contracts/ScriptServiceV2/AbandonScriptCommandV2.cs @@ -0,0 +1,17 @@ +using System; + +namespace Octopus.Tentacle.Contracts.ScriptServiceV2 +{ + public class AbandonScriptCommandV2 + { + public AbandonScriptCommandV2(ScriptTicket ticket, long lastLogSequence) + { + Ticket = ticket; + LastLogSequence = lastLogSequence; + } + + public ScriptTicket Ticket { get; } + + public long LastLogSequence { get; } + } +} diff --git a/source/Octopus.Tentacle.Contracts/ScriptServiceV2/IScriptServiceV2.cs b/source/Octopus.Tentacle.Contracts/ScriptServiceV2/IScriptServiceV2.cs index 3effc17b8..9858111fa 100644 --- a/source/Octopus.Tentacle.Contracts/ScriptServiceV2/IScriptServiceV2.cs +++ b/source/Octopus.Tentacle.Contracts/ScriptServiceV2/IScriptServiceV2.cs @@ -7,6 +7,7 @@ public interface IScriptServiceV2 ScriptStatusResponseV2 StartScript(StartScriptCommandV2 command); ScriptStatusResponseV2 GetStatus(ScriptStatusRequestV2 request); ScriptStatusResponseV2 CancelScript(CancelScriptCommandV2 command); + ScriptStatusResponseV2 AbandonScript(AbandonScriptCommandV2 command); void CompleteScript(CompleteScriptCommandV2 command); } } \ No newline at end of file diff --git a/source/Octopus.Tentacle.Core/Services/Scripts/RunningScript.cs b/source/Octopus.Tentacle.Core/Services/Scripts/RunningScript.cs index fca5dfa15..7109d0ff1 100644 --- a/source/Octopus.Tentacle.Core/Services/Scripts/RunningScript.cs +++ b/source/Octopus.Tentacle.Core/Services/Scripts/RunningScript.cs @@ -22,18 +22,20 @@ public class RunningScript: IRunningScript readonly IShell shell; readonly string taskId; readonly CancellationToken runningScriptToken; + readonly CancellationToken abandonToken; readonly IReadOnlyDictionary environmentVariables; readonly ILog log; readonly ScriptIsolationMutex scriptIsolationMutex; readonly TimeSpan powerShellStartupTimeout; - public RunningScript(IShell shell, + RunningScript(IShell shell, IScriptWorkspace workspace, IScriptStateStore? stateStore, IScriptLog scriptLog, string taskId, ScriptIsolationMutex scriptIsolationMutex, CancellationToken runningScriptToken, + CancellationToken abandonToken, IReadOnlyDictionary environmentVariables, TimeSpan powerShellStartupTimeout, ILog log @@ -44,6 +46,7 @@ ILog log this.stateStore = stateStore; this.taskId = taskId; this.runningScriptToken = runningScriptToken; + this.abandonToken = abandonToken; this.environmentVariables = environmentVariables; this.log = log; this.scriptIsolationMutex = scriptIsolationMutex; @@ -52,7 +55,7 @@ ILog log this.powerShellStartupTimeout = powerShellStartupTimeout; } - public RunningScript(IShell shell, + RunningScript(IShell shell, IScriptWorkspace workspace, IScriptLog scriptLog, string taskId, @@ -60,10 +63,34 @@ public RunningScript(IShell shell, CancellationToken runningScriptToken, IReadOnlyDictionary environmentVariables, TimeSpan powerShellStartupTimeout, - ILog log) : this(shell, workspace, null, scriptLog, taskId, scriptIsolationMutex, runningScriptToken, environmentVariables, powerShellStartupTimeout, log) + ILog log) : this(shell, workspace, null, scriptLog, taskId, scriptIsolationMutex, runningScriptToken, CancellationToken.None, environmentVariables, powerShellStartupTimeout, log) { } + public static RunningScript Create(IShell shell, + IScriptWorkspace workspace, + IScriptLog scriptLog, + string taskId, + ScriptIsolationMutex scriptIsolationMutex, + CancellationToken runningScriptToken, + IReadOnlyDictionary environmentVariables, + TimeSpan powerShellStartupTimeout, + ILog log) + => new RunningScript(shell, workspace, null, scriptLog, taskId, scriptIsolationMutex, runningScriptToken, CancellationToken.None, environmentVariables, powerShellStartupTimeout, log); + + public static RunningScript CreateAbandonable(IShell shell, + IScriptWorkspace workspace, + IScriptStateStore? stateStore, + IScriptLog scriptLog, + string taskId, + ScriptIsolationMutex scriptIsolationMutex, + CancellationToken runningScriptToken, + CancellationToken abandonToken, + IReadOnlyDictionary environmentVariables, + TimeSpan powerShellStartupTimeout, + ILog log) + => new RunningScript(shell, workspace, stateStore, scriptLog, taskId, scriptIsolationMutex, runningScriptToken, abandonToken, environmentVariables, powerShellStartupTimeout, log); + public ProcessState State { get; private set; } public int ExitCode { get; private set; } @@ -96,14 +123,22 @@ public async Task Execute() exitCode = workspace.ShouldMonitorPowerShellStartup() ? await RunPowershellScriptWithMonitoring(shellPath, writer, runningScriptToken) - : RunScript(shellPath, writer, runningScriptToken); + : await RunScriptAsync(shellPath, writer, runningScriptToken, abandonToken); } } + // Fires when the caller abandoned the script: leave the OS process running and signal the distinct AbandonedExitCode so the server can tell it apart from a cancel. + catch (OperationCanceledException) when (abandonToken.IsCancellationRequested) + { + writer.WriteOutput(ProcessOutputSource.StdOut, "Script execution abandoned."); + exitCode = ScriptExitCodes.AbandonedExitCode; + } + // Fires when the caller cancelled the script and the underlying process honored the cancellation token. catch (OperationCanceledException) { writer.WriteOutput(ProcessOutputSource.StdOut, "Script execution canceled."); exitCode = ScriptExitCodes.CanceledExitCode; } + // Fires when acquiring the isolation mutex timed out before the script could start. catch (TimeoutException) { writer.WriteOutput(ProcessOutputSource.StdOut, "Script execution timed out."); @@ -147,7 +182,7 @@ async Task RunPowershellScriptWithMonitoring(string shellPath, IScriptLogWr var monitor = new PowerShellStartupMonitor(workspace.WorkingDirectory, powerShellStartupTimeout, log, taskId); var monitoringTask = monitor.WaitForStartup(monitoringTaskCts.Token); - var scriptTask = Task.Run(() => RunScript(shellPath, writer, scriptTaskCts.Token), scriptTaskCts.Token); + var scriptTask = Task.Run(async () => await RunScriptAsync(shellPath, writer, scriptTaskCts.Token, abandonToken), scriptTaskCts.Token); var completedTask = await Task.WhenAny(monitoringTask, scriptTask); @@ -222,11 +257,11 @@ void RecordScriptHasCompleted(int exitCode) } } - int RunScript(string shellPath, IScriptLogWriter writer, CancellationToken cancellationToken) + async Task RunScriptAsync(string shellPath, IScriptLogWriter writer, CancellationToken cancellationToken, CancellationToken abandon) { try { - var exitCode = SilentProcessRunner.ExecuteCommand( + var exitCode = await SilentProcessRunner.ExecuteCommandAsync( shellPath, shell.FormatCommandArguments(workspace.BootstrapScriptFilePath, workspace.ScriptArguments, false), workspace.WorkingDirectory, @@ -234,7 +269,8 @@ int RunScript(string shellPath, IScriptLogWriter writer, CancellationToken cance LogScriptOutputTo(writer, ProcessOutputSource.StdOut), LogScriptOutputTo(writer, ProcessOutputSource.StdErr), environmentVariables, - cancellationToken); + cancellationToken, + abandon); return exitCode; } diff --git a/source/Octopus.Tentacle.Core/Services/Scripts/ScriptServiceV2.cs b/source/Octopus.Tentacle.Core/Services/Scripts/ScriptServiceV2.cs index 0a200fc19..669922d9e 100644 --- a/source/Octopus.Tentacle.Core/Services/Scripts/ScriptServiceV2.cs +++ b/source/Octopus.Tentacle.Core/Services/Scripts/ScriptServiceV2.cs @@ -72,7 +72,7 @@ public async Task StartScriptAsync(StartScriptCommandV2 { IScriptWorkspace workspace; - // If the state already exists then this runningScript is already running/has already run and we should not run it again + // StartScript may be called multiple times for the same ticket (e.g. if server retries the tentacle command), so we must guard against actually starting the script twice. if (runningScript.ScriptStateStore.Exists()) { var state = runningScript.ScriptStateStore.Load(); @@ -103,7 +103,8 @@ public async Task StartScriptAsync(StartScriptCommandV2 command.TaskId, workspace, runningScript.ScriptStateStore, - runningScript.CancellationToken); + runningScript.CancellationToken, + runningScript.AbandonToken); runningScript.Process = process; @@ -140,22 +141,70 @@ public async Task CancelScriptAsync(CancelScriptCommandV return GetResponse(command.Ticket, command.LastLogSequence, runningScript?.Process); } - public async Task CompleteScriptAsync(CompleteScriptCommandV2 command, CancellationToken cancellationToken) + public Task AbandonScriptAsync(AbandonScriptCommandV2 command, CancellationToken cancellationToken) { - await Task.CompletedTask; + // Fire the abandon token. SilentProcessRunner's abandon branch best-effort-kills the + // process (anti-abuse) and then stops waiting + releases the mutex + returns -48, so the + // wait returns and GetResponse() below may carry the final exit code (otherwise the server + // gets it on a subsequent GetStatus). The kill lives in the runner — done sequentially in + // the abandon branch — rather than firing Cancel() here, because firing the cancel token + // races the runner's wait resolution (the kill could resolve the wait to the cancel exit + // code before the abandon token is observed). A process survives abandon only when the kill + // genuinely can't land (stuck / re-parented grandchild). + if (runningScripts.TryGetValue(command.Ticket, out var runningScript)) + { + runningScript.Abandon(); + } + + return Task.FromResult(GetResponse(command.Ticket, command.LastLogSequence, runningScript?.Process)); + } + public async Task CompleteScriptAsync(CompleteScriptCommandV2 command, CancellationToken cancellationToken) + { if (runningScripts.TryRemove(command.Ticket, out var runningScript)) { runningScript.Dispose(); } var workspace = workspaceFactory.GetWorkspace(command.Ticket, WorkspaceReadinessCheck.Skip); - await workspace.Delete(cancellationToken); + + // For abandoned scripts (see AbandonScriptCommandV2 and + // https://octopus.com/docs/infrastructure/deployment-targets/tentacle/tentacle-script-abandonment) + // the underlying OS process is, by design, still alive + // and unable to be killed by Tentacle. It may still hold open file handles inside + // the workspace (logs being written to, working files, etc.). workspace.Delete() + // will fail in that case on Windows due to sharing violations and may partially + // delete on Linux. We need to tolerate the failure, which will leave the workspace + // on disk to hopefully be cleaned up by another mechanism (manual cleanup, + // instance restart) etc. This is the best we can do. For all other completion paths + // the process has exited and Delete should succeed; surface any failure there. + if (WasAbandoned(workspace)) + { + try + { + await workspace.Delete(cancellationToken); + } + catch (Exception ex) + { + log.Warn(ex, $"Could not delete abandoned workspace at {workspace.WorkingDirectory}. Leaving on disk; the underlying script process may still hold open file handles."); + } + } + else + { + await workspace.Delete(cancellationToken); + } } - RunningScript LaunchShell(ScriptTicket ticket, string serverTaskId, IScriptWorkspace workspace, IScriptStateStore stateStore, CancellationToken cancellationToken) + bool WasAbandoned(IScriptWorkspace workspace) { - var runningScript = new RunningScript(shell, workspace, stateStore, workspace.CreateLog(), serverTaskId, scriptIsolationMutex, cancellationToken, environmentVariables, powerShellStartupTimeout, log); + var stateStore = scriptStateStoreFactory.Create(workspace); + return stateStore.Exists() + && stateStore.Load().ExitCode == ScriptExitCodes.AbandonedExitCode; + } + + RunningScript LaunchShell(ScriptTicket ticket, string serverTaskId, IScriptWorkspace workspace, IScriptStateStore stateStore, CancellationToken cancellationToken, CancellationToken abandonToken) + { + var runningScript = RunningScript.CreateAbandonable(shell, workspace, stateStore, workspace.CreateLog(), serverTaskId, scriptIsolationMutex, cancellationToken, abandonToken, environmentVariables, powerShellStartupTimeout, log); _ = Task.Run(async () => await runningScript.Execute()); return runningScript; } @@ -204,13 +253,14 @@ public bool IsRunningScript(ScriptTicket ticket) class RunningScriptWrapper : IDisposable { - readonly CancellationTokenSource cancellationTokenSource = new (); + readonly CancellationTokenSource cancellationTokenSource = new(); + readonly CancellationTokenSource abandonTokenSource = new(); public RunningScriptWrapper(ScriptStateStore scriptStateStore) { ScriptStateStore = scriptStateStore; - CancellationToken = cancellationTokenSource.Token; + AbandonToken = abandonTokenSource.Token; } public RunningScript? Process { get; set; } @@ -218,15 +268,15 @@ public RunningScriptWrapper(ScriptStateStore scriptStateStore) public SemaphoreSlim StartScriptMutex { get; } = new(1, 1); public CancellationToken CancellationToken { get; } + public CancellationToken AbandonToken { get; } - public void Cancel() - { - cancellationTokenSource.Cancel(); - } + public void Cancel() => cancellationTokenSource.Cancel(); + public void Abandon() => abandonTokenSource.Cancel(); public void Dispose() { cancellationTokenSource.Dispose(); + abandonTokenSource.Dispose(); } } } diff --git a/source/Octopus.Tentacle.Core/Util/CommandLine/NewFile1.txt b/source/Octopus.Tentacle.Core/Util/CommandLine/NewFile1.txt new file mode 100644 index 000000000..e69de29bb diff --git a/source/Octopus.Tentacle.Core/Util/CommandLine/SilentProcessRunner.cs b/source/Octopus.Tentacle.Core/Util/CommandLine/SilentProcessRunner.cs index 0fcf47f9e..a166ba4db 100644 --- a/source/Octopus.Tentacle.Core/Util/CommandLine/SilentProcessRunner.cs +++ b/source/Octopus.Tentacle.Core/Util/CommandLine/SilentProcessRunner.cs @@ -8,25 +8,29 @@ using System.Runtime.Versioning; using System.Text; using System.Threading; +using System.Threading.Tasks; +using Octopus.Tentacle.Contracts; using Octopus.Tentacle.Core.Diagnostics; +using Octopus.Tentacle.Core.Util; namespace Octopus.Tentacle.Util { public static class SilentProcessRunner { - public static int ExecuteCommand( + public static Task ExecuteCommandAsync( string executable, string arguments, string workingDirectory, Action debug, Action info, Action error, - CancellationToken cancel) + CancellationToken cancel, + CancellationToken abandon) { - return ExecuteCommand(executable, arguments, workingDirectory, debug, info, error, customEnvironmentVariables: null, cancel: cancel); + return ExecuteCommandAsync(executable, arguments, workingDirectory, debug, info, error, customEnvironmentVariables: null, cancel: cancel, abandon: abandon); } - public static int ExecuteCommand( + public static async Task ExecuteCommandAsync( string executable, string arguments, string workingDirectory, @@ -34,7 +38,8 @@ public static int ExecuteCommand( Action info, Action error, IReadOnlyDictionary? customEnvironmentVariables = null, - CancellationToken cancel = default) + CancellationToken cancel = default, + CancellationToken abandon = default) { if (executable == null) throw new ArgumentNullException(nameof(executable)); @@ -135,15 +140,44 @@ void WriteData(Action action, ManualResetEventSlim resetEvent, DataRecei })) { if (cancel.IsCancellationRequested) - DoOurBestToCleanUp(process, error); + DoOurBestToCleanUp(process, error); process.BeginOutputReadLine(); process.BeginErrorReadLine(); - process.WaitForExit(); + // Only `abandon` breaks this wait — cancel does not. + // + // Cancel kills the process via cancel.Register; we then wait here for it to exit + // naturally, so a script that honours the kill returns its real exit code (e.g. 137 + // for SIGKILL, 143 for SIGTERM). A script that will NOT exit — genuinely stuck, OR a + // re-parented grandchild holding our redirected stdout/stderr pipes open so stream + // EOF never arrives — keeps waiting here. The only way to stop waiting on such a + // script is for the caller to abandon it: the abandon token cancels this await. + // + // Unlike PR1, this async wait uses WaitForExitAsync (the Exited event), so Close() + // cannot be used to unblock the grandchild case — it races the Exited event the + // wait depends on. That is why grandchildren now REQUIRE abandon, not cancel alone. + try + { + await WaitForProcessExitAsync(process, abandon).ConfigureAwait(false); + } + catch (OperationCanceledException) when (abandon.IsCancellationRequested) + { + // Abandon best-effort-kills (anti-abuse): kill it if we can, then stop waiting + // and release. Doing the kill here — sequentially, in the abandon branch — is + // race-free. The kill is idempotent if cancel already ran it. The process + // survives abandon only when the kill genuinely can't land (stuck / re-parented + // grandchild). From the caller's perspective abandon is a race against natural + // exit, so returning AbandonedExitCode is acceptable even if the process happened + // to finish at the same moment — that's why we don't check process.HasExited. + DoOurBestToCleanUp(process, error); + info("Tentacle has abandoned this script. The underlying script process may still be running on this host."); + SafelyCancelOutputAndErrorRead(process, debug); + running = false; + return ScriptExitCodes.AbandonedExitCode; + } - SafelyCancelRead(process.CancelErrorRead, debug); - SafelyCancelRead(process.CancelOutputRead, debug); + SafelyCancelOutputAndErrorRead(process, debug); SafelyWaitForAllOutput(outputResetEvent, cancel, debug); SafelyWaitForAllOutput(errorResetEvent, cancel, debug); @@ -180,9 +214,20 @@ static void SafelyWaitForAllOutput(ManualResetEventSlim outputResetEvent, CancellationToken cancel, Action debug) { + // outputResetEvent.Wait is waiting for the OutputDataReceived/ErrorDataReceived + // handlers to signal EOF on the stream (when it receives a null + // DataReceivedEventArgs.Data, .NET's EOF marker). This does NOT close the pipe, + // it just gives the OS up to 5 seconds to deliver the EOF. + // + // If a re-parented grandchild is holding the pipe open, EOF never arrives, the wait + // times out, and we proceed without the final flush of buffered output. The pipe is + // released later by Process.Dispose() at end of ExecuteCommandAsync via the + // `using (var process = new Process())` block. + // + // 5 seconds is somewhat arbitrary — the process has already exited by the time we + // reach here, so under normal circumstances EOF arrives within milliseconds. try { - //5 seconds is a bit arbitrary, but the process should have already exited by now, so unwise to wait too long outputResetEvent.Wait(TimeSpan.FromSeconds(5), cancel); } catch (OperationCanceledException ex) @@ -191,6 +236,17 @@ static void SafelyWaitForAllOutput(ManualResetEventSlim outputResetEvent, } } + static void SafelyCancelOutputAndErrorRead(Process process, Action debug) + { + // Cancel the output/error readers so a late OutputDataReceived/ErrorDataReceived + // callback doesn't try to write to a workspace log that's already been disposed by + // the using-block above; that write would throw ObjectDisposedException. Called in + // both the normal completion path and the abandon path; extracted here so the two + // callers stay consistent. + SafelyCancelRead(process.CancelErrorRead, debug); + SafelyCancelRead(process.CancelOutputRead, debug); + } + static void SafelyCancelRead(Action action, Action debug) { try @@ -221,22 +277,64 @@ static void DoOurBestToCleanUp(Process process, Action error) error($"Failed to kill the launched process: {killProcessException}"); } } - finally + + // NOTE: deliberately no process.Close() here. The async WaitForExitAsync wait depends on the + // Process's Exited event; Close() tears that down (StopWatchingForExit) and races the wait's + // completion. PR1 could Close() because it used a synchronous WaitForExit; this async path + // cannot. The still-pending wait is released when the using(process) disposes on return. + } + + // Single place we block waiting for the spawned process to exit. + // On .NET Framework we use a TaskCompletionSource polyfill because + // Process.WaitForExitAsync doesn't exist there; on .NET 8+ we use the + // framework method directly. + static Task WaitForProcessExitAsync(Process process, CancellationToken cancellationToken) + { +#if NETFRAMEWORK + return WaitForProcessExitAsyncNetFrameworkPolyfill(process, cancellationToken); +#else + return process.WaitForExitAsync(cancellationToken); +#endif + } + +#if NETFRAMEWORK + static Task WaitForProcessExitAsyncNetFrameworkPolyfill(Process process, CancellationToken cancellationToken) + { + // EnableRaisingEvents must be true for the process.Exited handler below to fire. + // On .NET 8+ Process.WaitForExitAsync sets this itself; here on netframework we + // have to set it ourselves before subscribing. + // https://learn.microsoft.com/en-us/dotnet/api/system.diagnostics.process.waitforexitasync + process.EnableRaisingEvents = true; + + var tcs = new TaskCompletionSource(TaskCreationOptions.RunContinuationsAsynchronously); + CancellationTokenRegistration registration = default; + + void OnExited(object? sender, EventArgs e) + { + registration.Dispose(); + tcs.TrySetResult(null); + } + + process.Exited += OnExited; + + // Guard against race: process may have already exited before we subscribed. + if (process.HasExited) { - try - { - // When cancelling, close the file handles. - // If the child finishes, but the grandchild holds stdout/stderr and never completes, - // THEN we won't issue a kill to the grandchild but will wait for the grandchild to - // close the handles. - process.Close(); - } - catch (Exception ex) + tcs.TrySetResult(null); + } + + if (cancellationToken.CanBeCanceled) + { + registration = cancellationToken.Register(() => { - error($"Failed to close process resources: {ex.Message}"); - } + process.Exited -= OnExited; + tcs.TrySetCanceled(cancellationToken); + }); } + + return tcs.Task; } +#endif [DllImport("kernel32.dll", SetLastError = true)] #pragma warning disable PC003 // Native API not available in UWP @@ -251,11 +349,18 @@ class Hitman { public static void TryKillProcessAndChildrenRecursively(Process process) { + if (!string.IsNullOrEmpty(Environment.GetEnvironmentVariable(EnvironmentVariables.TentacleDebugDisableProcessKill_UNSAFE_FOR_PRODUCTION))) + { + // Test-only no-op: simulate "kill was attempted but didn't terminate the process". + // Only activated when the test harness sets this env var on the Tentacle process. + return; + } + #if NETFRAMEWORK TryKillWindowsProcessAndChildrenRecursively(process.Id); #endif #if !NETFRAMEWORK - // Since .NET Core 3.0 there is support for killing a process and it's children + // Since .NET Core 3.0 there is support for killing a process and it's children process.Kill(true); #endif } diff --git a/source/Octopus.Tentacle.Core/Util/EnvironmentVariables.cs b/source/Octopus.Tentacle.Core/Util/EnvironmentVariables.cs index 2b5f83a59..faf01eb00 100644 --- a/source/Octopus.Tentacle.Core/Util/EnvironmentVariables.cs +++ b/source/Octopus.Tentacle.Core/Util/EnvironmentVariables.cs @@ -29,6 +29,7 @@ public static class EnvironmentVariables public const string TentacleMachineConfigurationHomeDirectory = "TentacleMachineConfigurationHomeDirectory"; public const string TentaclePollingConnectionCount = "TentaclePollingConnectionCount"; public const string TentaclePowerShellStartupTimeout = "TentaclePowerShellStartupTimeout"; + public const string TentacleDebugDisableProcessKill_UNSAFE_FOR_PRODUCTION = "TentacleDebugDisableProcessKill_UNSAFE_FOR_PRODUCTION"; public const string NfsWatchdogDirectory = "watchdog_directory"; public static string TentacleUseTcpNoDelay = "TentacleUseTcpNoDelay"; public static string TentacleUseAsyncListener = "TentacleUseAsyncListener"; diff --git a/source/Octopus.Tentacle.Kubernetes.Tests.Integration/KubernetesAgent/KubernetesClusterOneTimeSetUp.cs b/source/Octopus.Tentacle.Kubernetes.Tests.Integration/KubernetesAgent/KubernetesClusterOneTimeSetUp.cs index 3854e7c53..4c46902c5 100644 --- a/source/Octopus.Tentacle.Kubernetes.Tests.Integration/KubernetesAgent/KubernetesClusterOneTimeSetUp.cs +++ b/source/Octopus.Tentacle.Kubernetes.Tests.Integration/KubernetesAgent/KubernetesClusterOneTimeSetUp.cs @@ -17,7 +17,7 @@ public async Task OneTimeSetUp() installer = new KubernetesClusterInstaller(KubernetesTestsGlobalContext.Instance.TemporaryDirectory, kindExePath, helmExePath, kubeCtlPath, KubernetesTestsGlobalContext.Instance.Logger); await installer.InstallLatestSupported(); - KubernetesTestsGlobalContext.Instance.TentacleImageAndTag = SetupHelpers.GetTentacleImageAndTag(kindExePath, installer); + KubernetesTestsGlobalContext.Instance.TentacleImageAndTag = await SetupHelpers.GetTentacleImageAndTag(kindExePath, installer); KubernetesTestsGlobalContext.Instance.SetToolExePaths(helmExePath, kubeCtlPath); KubernetesTestsGlobalContext.Instance.KubeConfigPath = installer.KubeConfigPath; } diff --git a/source/Octopus.Tentacle.Kubernetes.Tests.Integration/KubernetesClientCompatibilityTests.cs b/source/Octopus.Tentacle.Kubernetes.Tests.Integration/KubernetesClientCompatibilityTests.cs index a0c6ae878..5d060cb32 100644 --- a/source/Octopus.Tentacle.Kubernetes.Tests.Integration/KubernetesClientCompatibilityTests.cs +++ b/source/Octopus.Tentacle.Kubernetes.Tests.Integration/KubernetesClientCompatibilityTests.cs @@ -165,7 +165,7 @@ async Task SetupCluster(ClusterVersion clusterVersion) clusterInstaller = new KubernetesClusterInstaller(testContext.TemporaryDirectory, kindExePath, helmExePath, kubeCtlPath, testContext.Logger); await clusterInstaller.Install(clusterVersion); - testContext.TentacleImageAndTag = SetupHelpers.GetTentacleImageAndTag(kindExePath, clusterInstaller); + testContext.TentacleImageAndTag = await SetupHelpers.GetTentacleImageAndTag(kindExePath, clusterInstaller); testContext.SetToolExePaths(helmExePath, kubeCtlPath); testContext.KubeConfigPath = clusterInstaller.KubeConfigPath; } diff --git a/source/Octopus.Tentacle.Kubernetes.Tests.Integration/Setup/DockerImageLoader.cs b/source/Octopus.Tentacle.Kubernetes.Tests.Integration/Setup/DockerImageLoader.cs index cf28823aa..9c7fa6412 100644 --- a/source/Octopus.Tentacle.Kubernetes.Tests.Integration/Setup/DockerImageLoader.cs +++ b/source/Octopus.Tentacle.Kubernetes.Tests.Integration/Setup/DockerImageLoader.cs @@ -18,16 +18,16 @@ public DockerImageLoader(TemporaryDirectory temporaryDirectory, ILogger logger, this.kindExePath = kindExePath; } - public string? LoadMostRecentImageIntoKind(string clusterName) + public async Task LoadMostRecentImageIntoKind(string clusterName) { - var mostRecentTag = FindMostRecentTag(); + var mostRecentTag = await FindMostRecentTag(); return !string.IsNullOrWhiteSpace(mostRecentTag) - ? LoadImageIntoKind(mostRecentTag, clusterName) + ? await LoadImageIntoKind(mostRecentTag, clusterName) : null; } - string? FindMostRecentTag() + async Task FindMostRecentTag() { var sb = new StringBuilder(); var tags = new List(); @@ -36,7 +36,7 @@ public DockerImageLoader(TemporaryDirectory temporaryDirectory, ILogger logger, .WriteTo.StringBuilder(sb) .CreateLogger(); - var exitCode = SilentProcessRunner.ExecuteCommand( + var exitCode = await SilentProcessRunner.ExecuteCommandAsync( "docker", "images octopusdeploy/kubernetes-agent-tentacle --format \"{{.Tag}}\"", temporaryDirectory.DirectoryPath, @@ -46,8 +46,7 @@ public DockerImageLoader(TemporaryDirectory temporaryDirectory, ILogger logger, sprLogger.Information(line); tags.Add(line); }, - sprLogger.Error, - CancellationToken.None + sprLogger.Error ); if (exitCode != 0) @@ -55,11 +54,11 @@ public DockerImageLoader(TemporaryDirectory temporaryDirectory, ILogger logger, logger.Error("Failed to get latest image tag from docker"); throw new InvalidOperationException($"Failed to get latest image tag from docker. Logs: {sb}"); } - + return tags.FirstOrDefault(); } - string LoadImageIntoKind(string mostRecentTag, string clusterName) + async Task LoadImageIntoKind(string mostRecentTag, string clusterName) { var image = $"octopusdeploy/kubernetes-agent-tentacle:{mostRecentTag}"; @@ -69,14 +68,13 @@ string LoadImageIntoKind(string mostRecentTag, string clusterName) .WriteTo.StringBuilder(sb) .CreateLogger(); - var exitCode = SilentProcessRunner.ExecuteCommand( + var exitCode = await SilentProcessRunner.ExecuteCommandAsync( kindExePath, $"load docker-image {image} --name={clusterName}", temporaryDirectory.DirectoryPath, sprLogger.Debug, sprLogger.Information, - sprLogger.Error, - CancellationToken.None + sprLogger.Error ); if (exitCode != 0) diff --git a/source/Octopus.Tentacle.Kubernetes.Tests.Integration/Setup/KubernetesAgentInstaller.cs b/source/Octopus.Tentacle.Kubernetes.Tests.Integration/Setup/KubernetesAgentInstaller.cs index 79de31e5f..9828a8581 100644 --- a/source/Octopus.Tentacle.Kubernetes.Tests.Integration/Setup/KubernetesAgentInstaller.cs +++ b/source/Octopus.Tentacle.Kubernetes.Tests.Integration/Setup/KubernetesAgentInstaller.cs @@ -56,14 +56,13 @@ public async Task InstallAgent(int listeningPort, string? tentacleImageA .MinimumLevel.Debug() .CreateLogger(); - var exitCode = SilentProcessRunner.ExecuteCommand( + var exitCode = await SilentProcessRunner.ExecuteCommandAsync( helmExePath, arguments, temporaryDirectory.DirectoryPath, sprLogger.Debug, sprLogger.Information, - sprLogger.Error, - CancellationToken.None); + sprLogger.Error); sw.Stop(); @@ -169,7 +168,7 @@ async Task GetAgentThumbprint() .MinimumLevel.Debug() .CreateLogger(); - var exitCode = SilentProcessRunner.ExecuteCommand( + var exitCode = await SilentProcessRunner.ExecuteCommandAsync( kubeCtlExePath, //get the generated thumbprint from the config map $"get cm tentacle-config --namespace {Namespace} --kubeconfig=\"{kubeConfigPath}\" -o \"jsonpath={{.data['Tentacle\\.CertificateThumbprint']}}\"", @@ -180,8 +179,7 @@ async Task GetAgentThumbprint() sprLogger.Information(x); thumbprint = x; }, - sprLogger.Error, - CancellationToken.None); + sprLogger.Error); if (exitCode != 0) { @@ -219,14 +217,21 @@ public void Dispose() NamespaceFlag, AgentName); - var exitCode = SilentProcessRunner.ExecuteCommand( + // Why this is sync: IDisposable.Dispose() must return sync. + // + // Why blocking on the async call is safe: this only runs under NUnit, + // which dispatches us on a worker thread with no SynchronizationContext. + // + // Why low risk: this is test code. The worst case for a wrong call here + // is a hung test, not a production incident. + // See https://blog.stephencleary.com/2012/07/dont-block-on-async-code.html + var exitCode = SilentProcessRunner.ExecuteCommandAsync( helmExePath, uninstallArgs, temporaryDirectory.DirectoryPath, logger.Debug, logger.Information, - logger.Error, - CancellationToken.None); + logger.Error).GetAwaiter().GetResult(); if (exitCode != 0) { diff --git a/source/Octopus.Tentacle.Kubernetes.Tests.Integration/Setup/KubernetesClusterInstaller.cs b/source/Octopus.Tentacle.Kubernetes.Tests.Integration/Setup/KubernetesClusterInstaller.cs index 31b196718..d231d0ff4 100644 --- a/source/Octopus.Tentacle.Kubernetes.Tests.Integration/Setup/KubernetesClusterInstaller.cs +++ b/source/Octopus.Tentacle.Kubernetes.Tests.Integration/Setup/KubernetesClusterInstaller.cs @@ -53,15 +53,14 @@ async Task InstallCluster(ClusterVersion clusterVersion) var sw = new Stopwatch(); sw.Restart(); - var exitCode = SilentProcessRunner.ExecuteCommand( + var exitCode = await SilentProcessRunner.ExecuteCommandAsync( kindExePath, //we give the cluster a unique name $"create cluster --name={clusterName} --config=\"{configFilePath}\" --kubeconfig=\"{kubeConfigName}\"", tempDir.DirectoryPath, logger.Debug, logger.Information, - logger.Error, - CancellationToken.None); + logger.Error); sw.Stop(); @@ -92,15 +91,14 @@ async Task SetLocalhostRouting() .WriteTo.StringBuilder(sb) .CreateLogger(); - var exitCode = SilentProcessRunner.ExecuteCommand( + var exitCode = await SilentProcessRunner.ExecuteCommandAsync( kubeCtlPath, //we give the cluster a unique name $"apply -n default -f \"{manifestFilePath}\" --kubeconfig=\"{KubeConfigPath}\"", tempDir.DirectoryPath, sprLogger.Debug, sprLogger.Information, - sprLogger.Error, - CancellationToken.None); + sprLogger.Error); if (exitCode != 0) { @@ -143,14 +141,13 @@ async Task InstallNfsCsiDriver() .WriteTo.StringBuilder(sb) .CreateLogger(); - var exitCode = SilentProcessRunner.ExecuteCommand( + var exitCode = await SilentProcessRunner.ExecuteCommandAsync( helmExePath, installArgs, tempDir.DirectoryPath, sprLogger.Debug, sprLogger.Information, - sprLogger.Error, - CancellationToken.None); + sprLogger.Error); if (exitCode != 0) { @@ -174,15 +171,19 @@ string BuildNfsCsiDriverInstallArguments() public void Dispose() { - var exitCode = SilentProcessRunner.ExecuteCommand( + // We're in IDisposable.Dispose(). Dispose() must return synchronously, so we + // block on the async call with .GetAwaiter().GetResult(). This is sync-over-async + // but is safe because the NUnit test runner dispatches us on a worker thread + // without a captured SynchronizationContext, so no deadlock. + // See https://blog.stephencleary.com/2012/07/dont-block-on-async-code.html + var exitCode = SilentProcessRunner.ExecuteCommandAsync( kindExePath, //delete the cluster for this test run $"delete cluster --name={clusterName}", tempDir.DirectoryPath, logger.Debug, logger.Information, - logger.Error, - CancellationToken.None); + logger.Error).GetAwaiter().GetResult(); if (exitCode != 0) { diff --git a/source/Octopus.Tentacle.Kubernetes.Tests.Integration/Setup/SetupHelpers.cs b/source/Octopus.Tentacle.Kubernetes.Tests.Integration/Setup/SetupHelpers.cs index f5ed3dca0..5e05de365 100644 --- a/source/Octopus.Tentacle.Kubernetes.Tests.Integration/Setup/SetupHelpers.cs +++ b/source/Octopus.Tentacle.Kubernetes.Tests.Integration/Setup/SetupHelpers.cs @@ -46,7 +46,7 @@ public static TentacleClient BuildTentacleClient(Uri uri, string? thumbprint, Ha builder.Build()); } - public static string? GetTentacleImageAndTag(string kindExePath, KubernetesClusterInstaller clusterInstaller) + public static async Task GetTentacleImageAndTag(string kindExePath, KubernetesClusterInstaller clusterInstaller) { if (clusterInstaller == null) { @@ -68,9 +68,9 @@ public static TentacleClient BuildTentacleClient(Uri uri, string? thumbprint, Ha { //if we should use the latest locally build image, load the tag from docker and load it into kind var imageLoader = new DockerImageLoader(KubernetesTestsGlobalContext.Instance.TemporaryDirectory, KubernetesTestsGlobalContext.Instance.Logger, kindExePath); - imageAndTag = imageLoader.LoadMostRecentImageIntoKind(clusterInstaller.ClusterName); + imageAndTag = await imageLoader.LoadMostRecentImageIntoKind(clusterInstaller.ClusterName); } - + if(imageAndTag is not null) KubernetesTestsGlobalContext.Instance.Logger.Information("Using tentacle image: {ImageAndTag}", imageAndTag); diff --git a/source/Octopus.Tentacle.Kubernetes.Tests.Integration/Setup/Tooling/HelmDownloader.cs b/source/Octopus.Tentacle.Kubernetes.Tests.Integration/Setup/Tooling/HelmDownloader.cs index f28857be9..8c8126f96 100644 --- a/source/Octopus.Tentacle.Kubernetes.Tests.Integration/Setup/Tooling/HelmDownloader.cs +++ b/source/Octopus.Tentacle.Kubernetes.Tests.Integration/Setup/Tooling/HelmDownloader.cs @@ -27,7 +27,7 @@ protected override string BuildDownloadUrl(Architecture processArchitecture, Ope static string GetArchitectureLabel(Architecture processArchitecture) => processArchitecture == Architecture.Arm64 ? "arm64" : "amd64"; - protected override string PostDownload(string targetDirectory, string downloadFilePath, Architecture processArchitecture, OperatingSystem operatingSystem) + protected override async Task PostDownload(string targetDirectory, string downloadFilePath, Architecture processArchitecture, OperatingSystem operatingSystem) { var architecture = GetArchitectureLabel(processArchitecture); var osName = GetOsName(operatingSystem); @@ -43,7 +43,7 @@ protected override string PostDownload(string targetDirectory, string downloadFi else { //everything else is tar.gz - ExtractTarGzip(downloadFilePath, extractionDir); + await ExtractTarGzip(downloadFilePath, extractionDir); } //move the extracted helm executable to the root target directory @@ -66,7 +66,7 @@ static string GetOsName(OperatingSystem operatingSystem) _ => throw new ArgumentOutOfRangeException(nameof(operatingSystem), operatingSystem, null) }; - void ExtractTarGzip(string gzArchiveName, string destFolder) + async Task ExtractTarGzip(string gzArchiveName, string destFolder) { if (!Directory.Exists(destFolder)) { @@ -79,14 +79,13 @@ void ExtractTarGzip(string gzArchiveName, string destFolder) // Falling back to good old fashioned `tar` does the job nicely :) using var tmp = new TemporaryDirectory(); - var exitCode = SilentProcessRunner.ExecuteCommand( + var exitCode = await SilentProcessRunner.ExecuteCommandAsync( "tar", $"xzvf \"{gzArchiveName}\" -C \"{destFolder}\"", tmp.DirectoryPath, Logger.Debug, Logger.Information, - Logger.Error, - CancellationToken.None); + Logger.Error); if (exitCode != 0) { diff --git a/source/Octopus.Tentacle.Kubernetes.Tests.Integration/Setup/Tooling/ToolDownloader.cs b/source/Octopus.Tentacle.Kubernetes.Tests.Integration/Setup/Tooling/ToolDownloader.cs index f89ae9037..b91c4409b 100644 --- a/source/Octopus.Tentacle.Kubernetes.Tests.Integration/Setup/Tooling/ToolDownloader.cs +++ b/source/Octopus.Tentacle.Kubernetes.Tests.Integration/Setup/Tooling/ToolDownloader.cs @@ -37,19 +37,18 @@ public async Task Download(string targetDirectory, CancellationToken can Logger.Information("Downloading {DownloadUrl} to {DownloadFilePath}", downloadUrl, downloadFilePath); await OctopusPackageDownloader.DownloadPackage(downloadUrl, downloadFilePath, Logger, cancellationToken); - downloadFilePath = PostDownload(targetDirectory, downloadFilePath, RuntimeInformation.ProcessArchitecture, os); + downloadFilePath = await PostDownload(targetDirectory, downloadFilePath, RuntimeInformation.ProcessArchitecture, os); //if this is not running on windows, chmod the tool to be executable if (os is not OperatingSystem.Windows) { - var exitCode = SilentProcessRunner.ExecuteCommand( + var exitCode = await SilentProcessRunner.ExecuteCommandAsync( "chmod", $"+x \"{downloadFilePath}\"", targetDirectory, Logger.Debug, Logger.Information, - Logger.Error, - CancellationToken.None); + Logger.Error); if (exitCode != 0) { @@ -62,12 +61,12 @@ public async Task Download(string targetDirectory, CancellationToken can protected abstract string BuildDownloadUrl(Architecture processArchitecture, OperatingSystem operatingSystem); - protected virtual string PostDownload(string downloadDirectory, string downloadFilePath, Architecture processArchitecture, OperatingSystem operatingSystem) + protected virtual Task PostDownload(string downloadDirectory, string downloadFilePath, Architecture processArchitecture, OperatingSystem operatingSystem) { var targetFilename = Path.Combine(downloadDirectory, ExecutableName); File.Move(downloadFilePath, targetFilename); - return targetFilename; + return Task.FromResult(targetFilename); } static OperatingSystem GetOperationSystem() diff --git a/source/Octopus.Tentacle.Kubernetes.Tests.Integration/Tooling/KubeCtlTool.cs b/source/Octopus.Tentacle.Kubernetes.Tests.Integration/Tooling/KubeCtlTool.cs index cae324e54..0b3354159 100644 --- a/source/Octopus.Tentacle.Kubernetes.Tests.Integration/Tooling/KubeCtlTool.cs +++ b/source/Octopus.Tentacle.Kubernetes.Tests.Integration/Tooling/KubeCtlTool.cs @@ -25,10 +25,10 @@ public KubeCtlTool(TemporaryDirectory temporaryDirectory, string kubeCtlExePath, public Task ExecuteNamespacedCommand(string command, CancellationToken cancellationToken = default) { - return Task.Run(() => ExecuteCommand($"{command} --namespace {ns}", cancellationToken), cancellationToken); + return ExecuteCommand($"{command} --namespace {ns}", cancellationToken); } - KubeCtlCommandResult ExecuteCommand(string command, CancellationToken cancellationToken = default) + async Task ExecuteCommand(string command, CancellationToken cancellationToken = default) { var sb = new StringBuilder(); var sprLogger = new LoggerConfiguration() @@ -40,7 +40,7 @@ KubeCtlCommandResult ExecuteCommand(string command, CancellationToken cancellati var stdOut = new List(); var stdErr = new List(); - var exitCode = SilentProcessRunner.ExecuteCommand( + var exitCode = await SilentProcessRunner.ExecuteCommandAsync( kubeCtlExePath, $"{command} --kubeconfig=\"{kubeConfigPath}\"", temporaryDirectory.DirectoryPath, @@ -55,7 +55,7 @@ KubeCtlCommandResult ExecuteCommand(string command, CancellationToken cancellati sprLogger.Error(y); stdErr.Add(y); }, - cancellationToken); + cancel: cancellationToken); return new (exitCode, stdOut, stdErr); } diff --git a/source/Octopus.Tentacle.Tests.Integration.Common/Builders/Decorators/Proxies/RecordMethodUsagesExtensionMethods.cs b/source/Octopus.Tentacle.Tests.Integration.Common/Builders/Decorators/Proxies/RecordMethodUsagesExtensionMethods.cs index 643999ffb..b2a4a943b 100644 --- a/source/Octopus.Tentacle.Tests.Integration.Common/Builders/Decorators/Proxies/RecordMethodUsagesExtensionMethods.cs +++ b/source/Octopus.Tentacle.Tests.Integration.Common/Builders/Decorators/Proxies/RecordMethodUsagesExtensionMethods.cs @@ -24,6 +24,11 @@ public static IRecordedMethodUsage ForCancelScriptAsync(this IRecordedMethodUsag return o.For("CancelScriptAsync"); } + public static IRecordedMethodUsage ForAbandonScriptAsync(this IRecordedMethodUsages o) + { + return o.For("AbandonScriptAsync"); + } + public static IRecordedMethodUsage ForCompleteScriptAsync(this IRecordedMethodUsages o) { return o.For("CompleteScriptAsync"); diff --git a/source/Octopus.Tentacle.Tests.Integration.Common/Builders/Decorators/ScriptServiceV2DecoratorBuilder.cs b/source/Octopus.Tentacle.Tests.Integration.Common/Builders/Decorators/ScriptServiceV2DecoratorBuilder.cs index 17b499a06..5cd747cf0 100644 --- a/source/Octopus.Tentacle.Tests.Integration.Common/Builders/Decorators/ScriptServiceV2DecoratorBuilder.cs +++ b/source/Octopus.Tentacle.Tests.Integration.Common/Builders/Decorators/ScriptServiceV2DecoratorBuilder.cs @@ -188,6 +188,11 @@ public async Task CancelScriptAsync(CancelScriptCommandV return await cancelScriptFunc(inner, command, options); } + public async Task AbandonScriptAsync(AbandonScriptCommandV2 command, HalibutProxyRequestOptions options) + { + return await inner.AbandonScriptAsync(command, options); + } + public async Task CompleteScriptAsync(CompleteScriptCommandV2 command, HalibutProxyRequestOptions options) { await completeScriptAction(inner, command, options); diff --git a/source/Octopus.Tentacle.Tests.Integration/CapabilitiesServiceV2Test.cs b/source/Octopus.Tentacle.Tests.Integration/CapabilitiesServiceV2Test.cs index 6f9b7300a..f52ef1b59 100644 --- a/source/Octopus.Tentacle.Tests.Integration/CapabilitiesServiceV2Test.cs +++ b/source/Octopus.Tentacle.Tests.Integration/CapabilitiesServiceV2Test.cs @@ -9,6 +9,7 @@ using Octopus.Tentacle.Contracts.Capabilities; using Octopus.Tentacle.Contracts.KubernetesScriptServiceV1; using Octopus.Tentacle.Contracts.ScriptServiceV2; +using Octopus.Tentacle.Core.Services.Scripts; using Octopus.Tentacle.Tests.Integration.Common.Builders.Decorators; using Octopus.Tentacle.Tests.Integration.Support; using Octopus.Tentacle.Tests.Integration.Util.Builders; @@ -31,15 +32,10 @@ public async Task CapabilitiesFromAnOlderTentacleWhichHasNoCapabilitiesService_W capabilities.Should().Contain(nameof(IScriptService)); capabilities.Should().Contain(nameof(IFileTransferService)); - //all versions have ScriptServiceV1 & IFileTransferService - var expectedCapabilitiesCount = 2; if (version.HasScriptServiceV2()) { capabilities.Should().Contain(nameof(IScriptServiceV2)); - expectedCapabilitiesCount++; } - - capabilities.Count.Should().Be(expectedCapabilitiesCount); } [Test] @@ -55,17 +51,23 @@ public async Task CapabilitiesServiceDoesNotReturnKubernetesScriptServiceForNonK capabilities.Should().Contain(nameof(IScriptService)); capabilities.Should().Contain(nameof(IFileTransferService)); - //all versions have ScriptServiceV1 & IFileTransferService - var expectedCapabilitiesCount = 2; if (version.HasScriptServiceV2()) { capabilities.Should().Contain(nameof(IScriptServiceV2)); - expectedCapabilitiesCount++; } capabilities.Should().NotContain(nameof(IKubernetesScriptServiceV1)); + } + + [Test] + [TentacleConfigurations] + public async Task LatestTentacle_AdvertisesAbandonScriptCapability(TentacleConfigurationTestCase tentacleConfigurationTestCase) + { + await using var clientAndTentacle = await tentacleConfigurationTestCase.CreateLegacyBuilder().Build(CancellationToken); + + var capabilities = (await clientAndTentacle.TentacleClient.CapabilitiesServiceV2.GetCapabilitiesAsync(new(CancellationToken))).SupportedCapabilities; - capabilities.Count.Should().Be(expectedCapabilitiesCount); + capabilities.Should().Contain(nameof(ScriptServiceV2.AbandonScriptAsync)); } [Test] diff --git a/source/Octopus.Tentacle.Tests.Integration/ClientScriptExecutionAbandon.cs b/source/Octopus.Tentacle.Tests.Integration/ClientScriptExecutionAbandon.cs new file mode 100644 index 000000000..ac946b292 --- /dev/null +++ b/source/Octopus.Tentacle.Tests.Integration/ClientScriptExecutionAbandon.cs @@ -0,0 +1,238 @@ +using System; +using System.IO; +using System.Linq; +using System.Threading.Tasks; +using FluentAssertions; +using NUnit.Framework; +using Octopus.Tentacle.Client; +using Octopus.Tentacle.Contracts; +using Octopus.Tentacle.Contracts.ClientServices; +using Octopus.Tentacle.Core.Util; +using Octopus.Tentacle.Tests.Integration.Common.Builders.Decorators; +using Octopus.Tentacle.Tests.Integration.Common.Builders.Decorators.Proxies; +using Octopus.Tentacle.Tests.Integration.Support; +using Octopus.Tentacle.Tests.Integration.Util; +using Octopus.Tentacle.Tests.Integration.Util.Builders; + +namespace Octopus.Tentacle.Tests.Integration +{ + [IntegrationTestTimeout] + public class ClientScriptExecutionAbandon : IntegrationTest + { + [Test] + [TentacleConfigurations(scriptServiceToTest: ScriptServiceVersionToTest.Version2)] + public async Task AbandonScript_WhenCancelFailsToKillProcess_ReturnsAbandonedExitCode(TentacleConfigurationTestCase tentacleConfigurationTestCase) + { + // TentacleDebugDisableProcessKill_UNSAFE_FOR_PRODUCTION=1 makes Hitman a no-op, so + // CancelScript cannot actually terminate the underlying script process. The script + // becomes genuinely "stuck" from Tentacle's perspective. AbandonScript should then + // return promptly with AbandonedExitCode without waiting for the process to exit. + await using var clientTentacle = await tentacleConfigurationTestCase.CreateBuilder() + .WithTentacle(x => x.WithRunTentacleEnvironmentVariable(EnvironmentVariables.TentacleDebugDisableProcessKill_UNSAFE_FOR_PRODUCTION, "1")) + .Build(CancellationToken); + + var startFile = Path.Combine(clientTentacle.TemporaryDirectory.DirectoryPath, "start"); + var releaseFile = Path.Combine(clientTentacle.TemporaryDirectory.DirectoryPath, "release"); + + var firstCommand = new TestExecuteShellScriptCommandBuilder() + .SetScriptBody(new ScriptBuilder() + .CreateFile(startFile) + .WaitForFileToExist(releaseFile)) + .WithIsolationLevel(ScriptIsolationLevel.NoIsolation) + .Build(); + + var tentacleClient = clientTentacle.TentacleClient; + + var scriptExecution = Task.Run(async () => await tentacleClient.ExecuteScript(firstCommand, CancellationToken)); + + await Wait.For(() => File.Exists(startFile), + TimeSpan.FromSeconds(30), + () => throw new Exception("Script did not start"), + CancellationToken); + + // Cancel: Hitman is a no-op so the process keeps running. + await tentacleClient.CancelScript(firstCommand.ScriptTicket); + await Task.Delay(TimeSpan.FromSeconds(1)); + + // Abandon: fires the abandon token. The RPC returns the current status snapshot + // immediately, so we poll GetStatus until the script reaches Complete state. + await tentacleClient.AbandonScript(firstCommand.ScriptTicket, CancellationToken); + + ScriptStatus abandonResponse = null!; + await Wait.For(async () => + { + abandonResponse = await tentacleClient.GetStatus(firstCommand.ScriptTicket, CancellationToken); + return abandonResponse.State == ProcessState.Complete; + }, + TimeSpan.FromSeconds(30), + () => throw new Exception("Abandoned script did not reach Complete state within 30s"), + CancellationToken); + abandonResponse.ExitCode.Should().Be(ScriptExitCodes.AbandonedExitCode); + + // Release the script process so it exits cleanly and stops leaking. + File.WriteAllText(releaseFile, ""); + await scriptExecution; + } + + [Test] + [TentacleConfigurations(scriptServiceToTest: ScriptServiceVersionToTest.Version2)] + public async Task AbandonScript_ReleasesIsolationMutexEvenWhileProcessIsStillRunning(TentacleConfigurationTestCase tentacleConfigurationTestCase) + { + // The whole reason Tentacle needs an abandon RPC is to release the isolation mutex + // when CancelScript can't unstick the script. This test proves that contract: a + // FullIsolation script gets stuck (because TentacleDebugDisableProcessKill_UNSAFE_FOR_PRODUCTION + // makes cancel a no-op), abandon is called, and a second FullIsolation script with + // the same mutex name must then be able to acquire the mutex and run. + await using var clientTentacle = await tentacleConfigurationTestCase.CreateBuilder() + .WithTentacle(x => x.WithRunTentacleEnvironmentVariable(EnvironmentVariables.TentacleDebugDisableProcessKill_UNSAFE_FOR_PRODUCTION, "1")) + .Build(CancellationToken); + + var startFile = Path.Combine(clientTentacle.TemporaryDirectory.DirectoryPath, "start"); + var releaseFile = Path.Combine(clientTentacle.TemporaryDirectory.DirectoryPath, "release"); + + const string sharedMutex = "abandon-test-mutex"; + + var firstCommand = new TestExecuteShellScriptCommandBuilder() + .SetScriptBody(new ScriptBuilder() + .CreateFile(startFile) + .WaitForFileToExist(releaseFile)) + .WithIsolationLevel(ScriptIsolationLevel.FullIsolation) + .WithIsolationMutexName(sharedMutex) + .Build(); + + var tentacleClient = clientTentacle.TentacleClient; + + var firstScriptExecution = Task.Run(async () => await tentacleClient.ExecuteScript(firstCommand, CancellationToken)); + + await Wait.For(() => File.Exists(startFile), + TimeSpan.FromSeconds(30), + () => throw new Exception("First script did not start"), + CancellationToken); + + await tentacleClient.CancelScript(firstCommand.ScriptTicket); + await Task.Delay(TimeSpan.FromSeconds(1)); + + await tentacleClient.AbandonScript(firstCommand.ScriptTicket, CancellationToken); + + // Second FullIsolation script with the SAME mutex name. If the abandon released + // the mutex, this script can acquire it and run to completion. Otherwise it would + // block waiting for the (still-alive) first script's mutex hold. + var secondStartFile = Path.Combine(clientTentacle.TemporaryDirectory.DirectoryPath, "second-start"); + var secondCommand = new TestExecuteShellScriptCommandBuilder() + .SetScriptBody(new ScriptBuilder().CreateFile(secondStartFile)) + .WithIsolationLevel(ScriptIsolationLevel.FullIsolation) + .WithIsolationMutexName(sharedMutex) + .Build(); + + var (secondResult, _) = await tentacleClient.ExecuteScript(secondCommand, CancellationToken); + secondResult.ExitCode.Should().Be(0); + File.Exists(secondStartFile).Should().BeTrue("second script should have run after the mutex was released"); + + File.WriteAllText(releaseFile, ""); + await firstScriptExecution; + } + + [Test] + [TentacleConfigurations(scriptServiceToTest: ScriptServiceVersionToTest.Version2)] + public async Task AbandonScript_WithNoPriorCancel_KillsTheProcess(TentacleConfigurationTestCase tentacleConfigurationTestCase) + { + // Anti-abuse: a direct AbandonScript with no prior CancelScript must still attempt the + // kill. The runner's abandon branch best-effort-kills (DoOurBestToCleanUp) before returning + // -48. Kill is NOT disabled here, so the underlying process must actually die (the execution drains). + await using var clientTentacle = await tentacleConfigurationTestCase.CreateBuilder() + .Build(CancellationToken); + + var startFile = Path.Combine(clientTentacle.TemporaryDirectory.DirectoryPath, "start"); + var releaseFile = Path.Combine(clientTentacle.TemporaryDirectory.DirectoryPath, "release"); + + var command = new TestExecuteShellScriptCommandBuilder() + .SetScriptBody(new ScriptBuilder() + .CreateFile(startFile) + .WaitForFileToExist(releaseFile)) + .WithIsolationLevel(ScriptIsolationLevel.NoIsolation) + .Build(); + + var tentacleClient = clientTentacle.TentacleClient; + var scriptExecution = Task.Run(async () => await tentacleClient.ExecuteScript(command, CancellationToken)); + + await Wait.For(() => File.Exists(startFile), + TimeSpan.FromSeconds(30), + () => throw new Exception("Script did not start"), + CancellationToken); + + // Direct abandon, NO prior cancel. + await tentacleClient.AbandonScript(command.ScriptTicket, CancellationToken); + + ScriptStatus abandonResponse = null!; + await Wait.For(async () => + { + abandonResponse = await tentacleClient.GetStatus(command.ScriptTicket, CancellationToken); + return abandonResponse.State == ProcessState.Complete; + }, + TimeSpan.FromSeconds(30), + () => throw new Exception("Abandoned script did not reach Complete state within 30s"), + CancellationToken); + + abandonResponse.ExitCode.Should().Be(ScriptExitCodes.AbandonedExitCode); + + // We never wrote releaseFile, so the script only completes because the abandon branch + // killed it. Draining the execution confirms the process is gone. + await scriptExecution; + } + + [Test] + [TentacleConfigurations(scriptServiceToTest: ScriptServiceVersionToTest.Version2)] + public async Task ExecuteScript_WhenCancellationStaysPending_EscalatesToAbandon(TentacleConfigurationTestCase tentacleConfigurationTestCase) + { + // Kill is disabled, so the script is genuinely stuck and CancelScript can't end it. + // With abandonAfterCancellationPendingFor set short, the orchestrator must escalate to + // AbandonScript, which returns AbandonedExitCode and releases the script. + IRecordedMethodUsages recordedUsages = new MethodUsages(); + await using var clientTentacle = await tentacleConfigurationTestCase.CreateBuilder() + .WithTentacle(x => x.WithRunTentacleEnvironmentVariable(EnvironmentVariables.TentacleDebugDisableProcessKill_UNSAFE_FOR_PRODUCTION, "1")) + .WithTentacleServiceDecorator(new TentacleServiceDecoratorBuilder() + .RecordMethodUsages(out recordedUsages) + .Build()) + .Build(CancellationToken); + + var startFile = Path.Combine(clientTentacle.TemporaryDirectory.DirectoryPath, "start"); + var releaseFile = Path.Combine(clientTentacle.TemporaryDirectory.DirectoryPath, "release"); + + var command = new TestExecuteShellScriptCommandBuilder() + .SetScriptBody(new ScriptBuilder() + .CreateFile(startFile) + .WaitForFileToExist(releaseFile)) + .WithIsolationLevel(ScriptIsolationLevel.NoIsolation) + .Build(); + + var tentacleClient = clientTentacle.TentacleClient; + + using var executionCts = new System.Threading.CancellationTokenSource(); + var logs = new System.Collections.Generic.List(); + + var execution = Task.Run(async () => await tentacleClient.ExecuteScript( + command, + onScriptStatusResponseReceived => logs.AddRange(onScriptStatusResponseReceived.Logs), + _ => Task.CompletedTask, + new SerilogLoggerBuilder().Build().ForContext().ToITentacleTaskLog(), + executionCts.Token, + abandonAfterCancellationPendingFor: TimeSpan.FromSeconds(2))); + + await Wait.For(() => File.Exists(startFile), + TimeSpan.FromSeconds(30), + () => throw new Exception("Script did not start"), + CancellationToken); + + executionCts.Cancel(); + + // ExecuteScript throws OperationCanceledException once the token is cancelled. + await FluentActions.Invoking(async () => await execution).Should().ThrowAsync(); + + // The orchestrator must have escalated to AbandonScript after the 2s threshold. + recordedUsages.ForAbandonScriptAsync().Started.Should().BeGreaterThan(0); + + // Cleanup: release + reap the still-alive process (kill was disabled). + File.WriteAllText(releaseFile, ""); + } + } +} diff --git a/source/Octopus.Tentacle.Tests.Integration/PowerShellStartupDetectionTests.cs b/source/Octopus.Tentacle.Tests.Integration/PowerShellStartupDetectionTests.cs index 814523b42..05fce368e 100644 --- a/source/Octopus.Tentacle.Tests.Integration/PowerShellStartupDetectionTests.cs +++ b/source/Octopus.Tentacle.Tests.Integration/PowerShellStartupDetectionTests.cs @@ -271,14 +271,13 @@ public async Task WhenPowerShellNeverStarts_WeShouldDetectTheScriptDidNotStart_A var args = shell.FormatCommandArguments(workspace.BootstrapScriptFilePath, null, allowInteractive: false); var directOutput = new List(); - var directExitCode = SilentProcessRunner.ExecuteCommand( + var directExitCode = await SilentProcessRunner.ExecuteCommandAsync( shell.GetFullPath(), args, workspace.WorkingDirectory, _ => { }, line => directOutput.Add(line), - line => directOutput.Add(line), - CancellationToken.None); + line => directOutput.Add(line)); var directOutputText = string.Join("\n", directOutput); Logger.Information("Direct invocation output:\n{Output}", directOutputText); @@ -338,14 +337,13 @@ public async Task WhenPowerShellNeverStarts_AndWorkspaceIsDeletedBeforeScriptRun var args = shell.FormatCommandArguments(bootstrapScriptFilePath, null, allowInteractive: false); var directOutput = new List(); - var directExitCode = SilentProcessRunner.ExecuteCommand( + var directExitCode = await SilentProcessRunner.ExecuteCommandAsync( shell.GetFullPath(), args, workspace.WorkingDirectory, _ => { }, line => directOutput.Add(line), - line => directOutput.Add(line), - CancellationToken.None); + line => directOutput.Add(line)); var directOutputText = string.Join("\n", directOutput); Logger.Information("Direct invocation output:\n{Output}", directOutputText); @@ -370,15 +368,16 @@ static IShell GetShellForCurrentPlatform() // First check if pwsh is available try { - var result = SilentProcessRunner.ExecuteCommand( + var result = SilentProcessRunner.ExecuteCommandAsync( "which", "pwsh", Environment.CurrentDirectory, _ => { }, _ => { }, _ => { }, - new Dictionary(), - CancellationToken.None); + customEnvironmentVariables: new Dictionary()) + // Safe: static helper, no synchronisation context. + .GetAwaiter().GetResult(); if (result == 0) { diff --git a/source/Octopus.Tentacle.Tests.Integration/Startup/LinuxConfigureServiceHelperFixture.cs b/source/Octopus.Tentacle.Tests.Integration/Startup/LinuxConfigureServiceHelperFixture.cs index bb8e45aef..7750865fb 100644 --- a/source/Octopus.Tentacle.Tests.Integration/Startup/LinuxConfigureServiceHelperFixture.cs +++ b/source/Octopus.Tentacle.Tests.Integration/Startup/LinuxConfigureServiceHelperFixture.cs @@ -4,6 +4,7 @@ using System.Linq; using System.Reflection; using System.Text; +using System.Threading.Tasks; using FluentAssertions; using NUnit.Framework; using Octopus.Tentacle.CommonTestUtils.Diagnostics; @@ -21,22 +22,22 @@ public class LinuxConfigureServiceHelperFixture { [Test] [RequiresSudoOnLinux] - public void CanInstallServiceAsRoot() + public async Task CanInstallServiceAsRoot() { - CanInstallService(null, null); + await CanInstallService(null, null); } [Test] [RequiresSudoOnLinux] - public void CanInstallServiceAsUser() + public async Task CanInstallServiceAsUser() { var user = new LinuxTestUserPrincipal("octo-shared-svc-test"); - CanInstallService(user.UserName, user.Password); + await CanInstallService(user.UserName, user.Password); } [Test] [RequiresSudoOnLinux] - public void CannotWriteToServiceFileAsUser() + public async Task CannotWriteToServiceFileAsUser() { const string serviceName = "OctopusShared.ServiceHelperTest"; const string instance = "TestCannotWriteToServiceFileInstance"; @@ -47,7 +48,7 @@ public void CannotWriteToServiceFileAsUser() WriteUnixFile(scriptPath); var chmodCmd = new CommandLineInvocation("/bin/bash", $"-c \"chmod 777 {scriptPath}\""); - chmodCmd.ExecuteCommand(); + await chmodCmd.ExecuteCommandAsync(); var configureServiceHelper = new LinuxServiceConfigurator(log); @@ -66,11 +67,11 @@ public void CannotWriteToServiceFileAsUser() serviceConfigurationState); var statCmd = new CommandLineInvocation("/bin/bash", $"-c \"stat -c '%A' /etc/systemd/system/{instance}.service\""); - var result = statCmd.ExecuteCommand(); + var result = await statCmd.ExecuteCommandAsync(); result.Infos.Single().Should().Be("-rw-r--r--"); // Service file should only be writeable for the root user } - void CanInstallService(string username, string password) + async Task CanInstallService(string username, string password) { const string serviceName = "OctopusShared.ServiceHelperTest"; const string instance = "TestInstance"; @@ -81,7 +82,7 @@ void CanInstallService(string username, string password) WriteUnixFile(scriptPath); var commandLineInvocation = new CommandLineInvocation("/bin/bash", $"-c \"chmod 777 {scriptPath}\""); - commandLineInvocation.ExecuteCommand(); + await commandLineInvocation.ExecuteCommandAsync(); var configureServiceHelper = new LinuxServiceConfigurator(log); @@ -100,19 +101,19 @@ void CanInstallService(string username, string password) serviceConfigurationState); //Check that the systemd unit service file has been written - Assert.IsTrue(DoesServiceUnitFileExist(instance), "The service unit file has not been created"); + Assert.IsTrue(await DoesServiceUnitFileExist(instance), "The service unit file has not been created"); - var status = GetServiceStatus(instance); + var status = await GetServiceStatus(instance); status["ActiveState"].Should().Be("active"); status["SubState"].Should().Be("running"); status["LoadState"].Should().Be("loaded"); status["User"].Should().Be(username ?? "root"); //Check that the Service is running - Assert.IsTrue(IsServiceRunning(instance), "The service is not running"); + Assert.IsTrue(await IsServiceRunning(instance), "The service is not running"); //Check that the service is enabled to run on startup - Assert.IsTrue(IsServiceEnabled(instance), "The service has not been enabled to run on startup"); + Assert.IsTrue(await IsServiceEnabled(instance), "The service has not been enabled to run on startup"); var stopServiceConfigurationState = new ServiceConfigurationState { @@ -127,13 +128,13 @@ void CanInstallService(string username, string password) stopServiceConfigurationState); //Check that the Service has stopped - Assert.IsFalse(IsServiceRunning(instance), "The service has not been stopped"); + Assert.IsFalse(await IsServiceRunning(instance), "The service has not been stopped"); //Check that the service is disabled - Assert.IsFalse(IsServiceEnabled(instance), "The service has not been disabled"); + Assert.IsFalse(await IsServiceEnabled(instance), "The service has not been disabled"); //Check that the service is disabled - Assert.IsFalse(DoesServiceUnitFileExist(instance), "The service unit file still exists"); + Assert.IsFalse(await DoesServiceUnitFileExist(instance), "The service unit file still exists"); } void WriteUnixFile(string path) @@ -148,10 +149,10 @@ void WriteUnixFile(string path) } } - Dictionary GetServiceStatus(string serviceName) + async Task> GetServiceStatus(string serviceName) { var commandLineInvocation = new CommandLineInvocation("/bin/bash", $"-c \"systemctl show {serviceName}\""); - var result = commandLineInvocation.ExecuteCommand(); + var result = await commandLineInvocation.ExecuteCommandAsync(); Console.WriteLine($"Status of service {serviceName}"); foreach (var info in result.Infos) Console.WriteLine(info); @@ -160,28 +161,28 @@ Dictionary GetServiceStatus(string serviceName) .ToDictionary(x => x[0], x => x[1]); } - bool IsServiceRunning(string serviceName) + async Task IsServiceRunning(string serviceName) { - var result = RunBashCommand($"systemctl is-active --quiet {serviceName}"); + var result = await RunBashCommand($"systemctl is-active --quiet {serviceName}"); return result.ExitCode == 0; } - bool IsServiceEnabled(string serviceName) + async Task IsServiceEnabled(string serviceName) { - var result = RunBashCommand($"systemctl is-enabled --quiet {serviceName}"); + var result = await RunBashCommand($"systemctl is-enabled --quiet {serviceName}"); return result.ExitCode == 0; } - bool DoesServiceUnitFileExist(string serviceName) + async Task DoesServiceUnitFileExist(string serviceName) { - var result = RunBashCommand($"ls /etc/systemd/system | grep {serviceName}.service"); + var result = await RunBashCommand($"ls /etc/systemd/system | grep {serviceName}.service"); return result.ExitCode == 0; } - CmdResult RunBashCommand(string command) + async Task RunBashCommand(string command) { var commandLineInvocation = new CommandLineInvocation("/bin/bash", $"-c \"{command}\""); - return commandLineInvocation.ExecuteCommand(); + return await commandLineInvocation.ExecuteCommandAsync(); } } } diff --git a/source/Octopus.Tentacle.Tests.Integration/Support/TentacleFetchers/LinuxTentacleFetcher.cs b/source/Octopus.Tentacle.Tests.Integration/Support/TentacleFetchers/LinuxTentacleFetcher.cs index b6613ff5b..2bf480ebd 100644 --- a/source/Octopus.Tentacle.Tests.Integration/Support/TentacleFetchers/LinuxTentacleFetcher.cs +++ b/source/Octopus.Tentacle.Tests.Integration/Support/TentacleFetchers/LinuxTentacleFetcher.cs @@ -44,11 +44,11 @@ async Task DownloadAndExtractFromUrl(string directoryPath, string url) var extractionDirectory = new DirectoryInfo(Path.Combine(directoryPath, "extracted")); - ExtractTarGzip(downloadFilePath, extractionDirectory.FullName, logger); + await ExtractTarGzipAsync(downloadFilePath, extractionDirectory.FullName, logger); return Path.Combine(extractionDirectory.FullName, "tentacle", "Tentacle"); } - public static void ExtractTarGzip(string gzArchiveName, string destFolder, ILogger logger) + public static async Task ExtractTarGzipAsync(string gzArchiveName, string destFolder, ILogger logger) { if (!Directory.Exists(destFolder)) { @@ -62,14 +62,13 @@ public static void ExtractTarGzip(string gzArchiveName, string destFolder, ILogg using var tmp = new TemporaryDirectory(); Action log = s => logger.Information(s); - var exitCode = SilentProcessRunner.ExecuteCommand( + var exitCode = await SilentProcessRunner.ExecuteCommandAsync( "tar", $"xzvf \"{gzArchiveName}\" -C \"{destFolder}\"", tmp.DirectoryPath, log, log, - log, - CancellationToken.None); + log); if (exitCode != 0) { diff --git a/source/Octopus.Tentacle.Tests.Integration/Support/TentacleFetchers/NugetTentacleFetcher.cs b/source/Octopus.Tentacle.Tests.Integration/Support/TentacleFetchers/NugetTentacleFetcher.cs index dc6f3851e..fff2c6bc9 100644 --- a/source/Octopus.Tentacle.Tests.Integration/Support/TentacleFetchers/NugetTentacleFetcher.cs +++ b/source/Octopus.Tentacle.Tests.Integration/Support/TentacleFetchers/NugetTentacleFetcher.cs @@ -61,7 +61,7 @@ async Task DownloadAndExtractFromUrl(string directoryPath, Version versi var tentacleFolder = Path.Combine(directoryPath, "tentacle"); if (tentacleArtifact.EndsWith(".tar.gz")) { - LinuxTentacleFetcher.ExtractTarGzip(tentacleArtifact, tentacleFolder, logger); + await LinuxTentacleFetcher.ExtractTarGzipAsync(tentacleArtifact, tentacleFolder, logger); } else { diff --git a/source/Octopus.Tentacle.Tests.Integration/Support/TestDecoratorsAreCalledInTheCorrectOrder.cs b/source/Octopus.Tentacle.Tests.Integration/Support/TestDecoratorsAreCalledInTheCorrectOrder.cs index 09cbc58fb..fc3fd090b 100644 --- a/source/Octopus.Tentacle.Tests.Integration/Support/TestDecoratorsAreCalledInTheCorrectOrder.cs +++ b/source/Octopus.Tentacle.Tests.Integration/Support/TestDecoratorsAreCalledInTheCorrectOrder.cs @@ -105,6 +105,11 @@ public Task CancelScriptAsync(CancelScriptCommandV2 comm throw new NotImplementedException(); } + public Task AbandonScriptAsync(AbandonScriptCommandV2 command, HalibutProxyRequestOptions proxyRequestOptions) + { + throw new NotImplementedException(); + } + public async Task CompleteScriptAsync(CompleteScriptCommandV2 command, HalibutProxyRequestOptions proxyRequestOptions) { await Task.CompletedTask; diff --git a/source/Octopus.Tentacle.Tests.Integration/Util/Builders/TentacleClientExtensionMethods.cs b/source/Octopus.Tentacle.Tests.Integration/Util/Builders/TentacleClientExtensionMethods.cs index c0975c788..bf6ba7343 100644 --- a/source/Octopus.Tentacle.Tests.Integration/Util/Builders/TentacleClientExtensionMethods.cs +++ b/source/Octopus.Tentacle.Tests.Integration/Util/Builders/TentacleClientExtensionMethods.cs @@ -5,10 +5,12 @@ using System.Threading.Tasks; using Halibut; using Octopus.Tentacle.Client; +using Octopus.Tentacle.Client.EventDriven; using Octopus.Tentacle.Client.Scripts; using Octopus.Tentacle.Client.Scripts.Models; using Octopus.Tentacle.Contracts; using Octopus.Tentacle.Contracts.Logging; +using Octopus.Tentacle.Contracts.ScriptServiceV2; using Octopus.Tentacle.Tests.Integration.Support; using Octopus.Tentacle.Tests.Integration.Support.ExtensionMethods; @@ -56,6 +58,42 @@ public static async Task UploadFile( return result; } + public static async Task AbandonScript( + this TentacleClient tentacleClient, + ScriptTicket scriptTicket, + CancellationToken token, + ITentacleClientTaskLog? log = null) + { + return await tentacleClient.AbandonScript(scriptTicket, + new SerilogLoggerBuilder().Build().ForContext().ToITentacleTaskLog().Chain(log), + token).ConfigureAwait(false); + } + + public static async Task CancelScript( + this TentacleClient tentacleClient, + ScriptTicket scriptTicket, + ITentacleClientTaskLog? log = null) + { + var commandContext = new CommandContext(scriptTicket, 0, ScriptServiceVersion.ScriptServiceVersion2); + var result = await tentacleClient.CancelScript(commandContext, + new SerilogLoggerBuilder().Build().ForContext().ToITentacleTaskLog().Chain(log)) + .ConfigureAwait(false); + return result.ScriptStatus; + } + + public static async Task GetStatus( + this TentacleClient tentacleClient, + ScriptTicket scriptTicket, + CancellationToken token, + ITentacleClientTaskLog? log = null) + { + var commandContext = new CommandContext(scriptTicket, 0, ScriptServiceVersion.ScriptServiceVersion2); + var result = await tentacleClient.GetStatus(commandContext, + new SerilogLoggerBuilder().Build().ForContext().ToITentacleTaskLog().Chain(log), + token).ConfigureAwait(false); + return result.ScriptStatus; + } + public static async Task DownloadFile( this TentacleClient tentacleClient, string remotePath, diff --git a/source/Octopus.Tentacle.Tests.Integration/Util/LinuxTestUserPrincipal.cs b/source/Octopus.Tentacle.Tests.Integration/Util/LinuxTestUserPrincipal.cs index d3f738261..241457316 100644 --- a/source/Octopus.Tentacle.Tests.Integration/Util/LinuxTestUserPrincipal.cs +++ b/source/Octopus.Tentacle.Tests.Integration/Util/LinuxTestUserPrincipal.cs @@ -22,7 +22,13 @@ public LinuxTestUserPrincipal(string username) static void RunCommand(string arguments, bool failOnNonZeroExitCode = true) { var commandLineInvocation = new CommandLineInvocation("/bin/bash", arguments); - var result = commandLineInvocation.ExecuteCommand(); + // We're in a synchronous test helper called from the LinuxTestUserPrincipal + // constructor. Constructors must return synchronously, so we block on the + // async call with .GetAwaiter().GetResult(). This is sync-over-async but is + // safe because the NUnit test runner dispatches us on a worker thread without + // a captured SynchronizationContext, so no deadlock. + // See https://blog.stephencleary.com/2012/07/dont-block-on-async-code.html + var result = commandLineInvocation.ExecuteCommandAsync().GetAwaiter().GetResult(); foreach (var line in result.Errors) Console.WriteLine(line); diff --git a/source/Octopus.Tentacle.Tests.Integration/Util/RunningScriptFixture.cs b/source/Octopus.Tentacle.Tests.Integration/Util/RunningScriptFixture.cs index c0e3b05ad..2c6b9ceec 100644 --- a/source/Octopus.Tentacle.Tests.Integration/Util/RunningScriptFixture.cs +++ b/source/Octopus.Tentacle.Tests.Integration/Util/RunningScriptFixture.cs @@ -67,7 +67,7 @@ public void SetUpLocal() scriptLog = new TestScriptLog(); cancellationTokenSource = new CancellationTokenSource(TimeSpan.FromSeconds(10)); scriptIsolationMutex = new ScriptIsolationMutex(); - runningScript = new RunningScript(shell, + runningScript = RunningScript.Create(shell, workspace, scriptLog, taskId, @@ -169,7 +169,7 @@ public async Task CancellationToken_ShouldKillTheProcess() ? (new PowerShell(), "Start-Sleep -seconds") : (new Bash() as IShell, "sleep"); - var script = new RunningScript(shell, + var script = RunningScript.Create(shell, workspace, scriptLog, taskId, diff --git a/source/Octopus.Tentacle.Tests.Integration/Util/SilentProcessRunnerFixture.cs b/source/Octopus.Tentacle.Tests.Integration/Util/SilentProcessRunnerFixture.cs index da9247fd6..01fa0c168 100644 --- a/source/Octopus.Tentacle.Tests.Integration/Util/SilentProcessRunnerFixture.cs +++ b/source/Octopus.Tentacle.Tests.Integration/Util/SilentProcessRunnerFixture.cs @@ -7,6 +7,7 @@ using FluentAssertions; using NUnit.Framework; using Octopus.Tentacle.CommonTestUtils; +using Octopus.Tentacle.Contracts; using Octopus.Tentacle.Tests.Integration.Support; using Octopus.Tentacle.Tests.Integration.Support.TestAttributes; using Octopus.Tentacle.Util; @@ -125,34 +126,34 @@ public void CancellationToken_ShouldForceKillTheProcess() [Test] [WindowsTest] - public async Task CancellationToken_WhenGrandchildHoldsRedirectedPipes_ShouldNotHang() + public async Task CancelThenAbandon_WhenGrandchildHoldsRedirectedPipes_StopsWaiting() { using var tempDir = new TemporaryDirectory(); var grandchildPidFile = Path.Combine(tempDir.DirectoryPath, "grandchild.pid"); - // This test reproduces the cancel-time hang. - // The issue was SilentProcessRunner will wait for stdout/stderr pipes to be closed. - // The pipes can be inherited by grandchildren, and remain open even after the - // child process has died. + // The scenario: + // * We launch a process that spawns a child, which spawns a long-running grandchild + // and then immediately exits. + // * Because the child exited BEFORE our cancel fires, the grandchild has been + // re-parented (PPID broken). Process.Kill(entireProcessTree:true) follows PPID + // links, so it does NOT find the grandchild — Kill kills nothing beyond the + // (already-dead) child. + // * The grandchild inherited our redirected stdout/stderr pipes and holds them open, + // so the stream readers never see EOF and WaitForExitAsync never completes on its own. // - // Normally Process.Kill(entireProcessTree:true) would kill the entire process tree. - // However if the child process dies THEN we issue the kill command, we do NOT see any - // process under the child and so the Kill() command completes. This leaves the grandchild - // running, holding our pipes we are waiting on. + // Under this design only `abandon` breaks that wait — cancel does not. So cancel alone + // cannot stop such a script; the caller must abandon, which returns AbandonedExitCode and + // leaves the grandchild running. This test asserts cancel alone keeps waiting, then abandon + // stops it promptly. // - // The test stacks three processes: PowerShell (the child) launches cmd.exe (a - // throwaway middle layer) which does `start /b ping` to background ping (the - // grandchild) and then exits. cmd exiting before we cancel is what breaks the - // PPID chain — without that, ping would still be a direct child of PowerShell - // and Kill(true) would find it. - // - // Two non-obvious bits below, both load-bearing: + // Two non-obvious bits in the PowerShell script below, both load-bearing: // * $psi.RedirectStandardInput = $true — we don't use stdin, but redirecting // any stream is what flips bInheritHandles=true in .NET's Process.Start. That // is what makes cmd (and by extension ping) inherit our pipe write-ends. // Without this the grandchild doesn't hold our pipes and there is no bug to // reproduce. - // * The WMI lookup — we need the grandchild's PID so the test can clean it up. + // * The WMI lookup — we need the grandchild's PID so the test can clean it up + // afterwards (otherwise we'd leak a long-running ping on the CI host). var psScript = @" $pidFile = 'PIDFILE_PLACEHOLDER' $pingPath = Join-Path $env:WINDIR 'System32\PING.EXE' @@ -184,30 +185,39 @@ public async Task CancellationToken_WhenGrandchildHoldsRedirectedPipes_ShouldNot try { - using (var cts = new CancellationTokenSource()) + using (var cancelCts = new CancellationTokenSource()) + using (var abandonCts = new CancellationTokenSource()) { - var task = Task.Run(() => Execute( + var task = Task.Run(async () => await SilentProcessRunner.ExecuteCommandAsync( "powershell.exe", $"-NoProfile -NonInteractive -EncodedCommand {encoded}", "", - out _, - out _, - out _, - cts.Token)); + debug: _ => { }, + info: _ => { }, + error: _ => { }, + customEnvironmentVariables: null, + cancel: cancelCts.Token, + abandon: abandonCts.Token)); // Wait for the grandchild to actually be spawned before cancelling - await WaitForGrandchildSpawnAsync(grandchildPidFile, TimeSpan.FromSeconds(60)); - - var sw = Stopwatch.StartNew(); - cts.Cancel(); - - var completed = task.Wait(TimeSpan.FromSeconds(30)); - sw.Stop(); - - completed.Should().BeTrue( - $"ExecuteCommand should return shortly after cancellation even when a grandchild " + - $"holds the redirected pipes. Without proactively closing the redirected streams " + - $"after Kill, Process.WaitForExit() blocks indefinitely. Elapsed since cancel: {sw.Elapsed.TotalSeconds:F1}s"); + await WaitForPidFileAsync(grandchildPidFile, TimeSpan.FromSeconds(60)); + + // Cancel alone cannot stop the wait: Kill(entireProcessTree) can't reach the + // re-parented grandchild, and the grandchild holds our redirected pipes open so + // stream EOF never arrives. WaitForExitAsync keeps waiting — until abandon. + cancelCts.Cancel(); + var stoppedByCancel = task.Wait(TimeSpan.FromSeconds(5)); + stoppedByCancel.Should().BeFalse( + "cancel cannot stop a script whose re-parented grandchild holds the redirected " + + "pipes open; abandon is required to stop waiting"); + + // Abandon stops the wait promptly and returns the distinct exit code, leaving the + // grandchild running (pipes released by Process.Dispose at end of method). + abandonCts.Cancel(); + var stoppedByAbandon = task.Wait(TimeSpan.FromSeconds(30)); + stoppedByAbandon.Should().BeTrue( + "abandon should stop the wait promptly even while the grandchild holds the pipes"); + (await task).Should().Be(ScriptExitCodes.AbandonedExitCode); } } finally @@ -217,24 +227,21 @@ public async Task CancellationToken_WhenGrandchildHoldsRedirectedPipes_ShouldNot } [Test] - public async Task CancellationToken_WhenUnixGrandchildHoldsRedirectedPipes_ShouldNotHang() + public async Task CancelThenAbandon_WhenUnixGrandchildHoldsRedirectedPipes_StopsWaiting() { if (PlatformDetection.IsRunningOnWindows) Assert.Ignore("Unix-only repro (Mac/Linux). The Windows equivalent is covered by the [WindowsTest] above."); - // This test reproduces the cancel-time hang. - // The issue is SilentProcessRunner will wait for stdout/stderr pipes to be closed. - // The pipes can be inherited by grandchildren, and remain open even after the - // child process has died. + // * We start sh, which backgrounds a long sleep (the grandchild) and exits immediately. + // The grandchild is re-parented to init/launchd and inherits our redirected + // stdout/stderr pipes, holding them open — so stream EOF never arrives. + // * Process.Kill(entireProcessTree:true) follows PPID links, so by cancel time the + // orphaned grandchild is invisible to Kill — it keeps running and keeps the pipes open. // - // Normally Process.Kill(entireProcessTree:true) would kill the entire process tree. - // However if the child process dies THEN we issue the kill command, we do NOT see any - // process under the child and so the Kill() command completes. This leaves the grandchild - // running, holding our pipes we are waiting on. - // - // The test is simple we run "sh", the child process, and tell it to yeet - // sleep, the grandchild, into the background. The grandchild now keeps - // running holding on to our pipes we will wait for an EOF on. + // Under this design cancel does NOT break the wait — only `abandon` does. So cancelling a + // script whose grandchild holds the pipes cannot stop it on its own; the caller must + // abandon, which returns AbandonedExitCode and leaves the grandchild running. This test + // asserts exactly that: cancel alone keeps waiting, then abandon stops it promptly. using var tempDir = new TemporaryDirectory(); var grandchildPidFile = Path.Combine(tempDir.DirectoryPath, "grandchild.pid"); @@ -242,40 +249,155 @@ public async Task CancellationToken_WhenUnixGrandchildHoldsRedirectedPipes_Shoul try { - using (var cts = new CancellationTokenSource()) + using (var cancelCts = new CancellationTokenSource()) + using (var abandonCts = new CancellationTokenSource()) { - var task = Task.Run(() => Execute( + var task = Task.Run(async () => await SilentProcessRunner.ExecuteCommandAsync( "/bin/sh", - $"-c \"{grandchild}\"", // We run this in background and `sh` does not wait for it. + $"-c \"{grandchild}\"", // backgrounds the grandchild; `sh` does not wait for it. "", - out _, - out _, - out _, - cts.Token)); + debug: _ => { }, + info: _ => { }, + error: _ => { }, + customEnvironmentVariables: null, + cancel: cancelCts.Token, + abandon: abandonCts.Token)); + + await WaitForPidFileAsync(grandchildPidFile, TimeSpan.FromSeconds(60)); + + // Cancel alone cannot stop the wait: Kill(entireProcessTree) can't reach the + // re-parented grandchild, and the grandchild holds our redirected pipes open so + // stream EOF never arrives. WaitForExitAsync keeps waiting — until abandon. + cancelCts.Cancel(); + var stoppedByCancel = task.Wait(TimeSpan.FromSeconds(5)); + stoppedByCancel.Should().BeFalse( + "cancel cannot stop a script whose re-parented grandchild holds the redirected " + + "pipes open; abandon is required to stop waiting"); + + // Abandon stops the wait promptly and returns the distinct exit code, leaving the + // grandchild running (pipes released by Process.Dispose at end of method). + abandonCts.Cancel(); + var stoppedByAbandon = task.Wait(TimeSpan.FromSeconds(30)); + stoppedByAbandon.Should().BeTrue( + "abandon should stop the wait promptly even while the grandchild holds the pipes"); + (await task).Should().Be(ScriptExitCodes.AbandonedExitCode); + } + } + finally + { + TryKillGrandchild(grandchildPidFile); + } + } - await WaitForGrandchildSpawnAsync(grandchildPidFile, TimeSpan.FromSeconds(30)); + [Test] + public async Task AbandonToken_ReturnsAbandonedExitCode() + { + // Abandon stops waiting and returns the distinct AbandonedExitCode. Abandon also + // best-effort-kills the process — the kill itself is asserted by + // ClientScriptExecutionAbandon.AbandonScript_WithNoPriorCancel_KillsTheProcess, and the + // un-killable (survives) case by AbandonScript_WhenCancelFailsToKillProcess. (We do NOT + // disable kill here: setting that env var is process-wide and leaks into the Tentacle + // subprocesses other integration tests spawn.) + using var tempDir = new TemporaryDirectory(); + var pidFile = Path.Combine(tempDir.DirectoryPath, "process.pid"); + + var abandonCommand = PlatformDetection.IsRunningOnWindows ? "powershell.exe" : "/bin/bash"; + var arguments = PlatformDetection.IsRunningOnWindows + ? $"-NoProfile -NonInteractive -Command \"$PID | Out-File -FilePath '{pidFile}' -Encoding ASCII; Start-Sleep -Seconds 300\"" + : $"-c \"echo $$ > '{pidFile}' && sleep 300\""; - var sw = Stopwatch.StartNew(); - cts.Cancel(); + using var cancelCts = new CancellationTokenSource(); + using var abandonCts = new CancellationTokenSource(); - // Cancel should be super quick, if it takes a long time, then we have an issue where we are waiting for the grandchild. - var completed = task.Wait(TimeSpan.FromSeconds(30)); - sw.Stop(); + var infoMessages = new StringBuilder(); - completed.Should().BeTrue( - $"ExecuteCommand should return shortly after cancellation even when a Unix " + - $"grandchild (reparented to init/launchd) holds the redirected pipes. " + - $"Without proactively closing the redirected streams after Kill, " + - $"Process.WaitForExit() blocks indefinitely. Elapsed since cancel: {sw.Elapsed.TotalSeconds:F1}s"); + var task = Task.Run(async () => await SilentProcessRunner.ExecuteCommandAsync( + abandonCommand, + arguments, + Environment.CurrentDirectory, + debug: _ => { }, + info: msg => { lock (infoMessages) infoMessages.AppendLine(msg); }, + error: _ => { }, + customEnvironmentVariables: null, + cancel: cancelCts.Token, + abandon: abandonCts.Token)); + + // Wait deterministically for the process to write its PID before we abandon + await WaitForPidFileAsync(pidFile, TimeSpan.FromSeconds(30)); + abandonCts.Cancel(); + + try + { + var exitCode = await task; + exitCode.Should().Be(ScriptExitCodes.AbandonedExitCode); + infoMessages.ToString().Should().Contain("Tentacle has abandoned this script"); + } + finally + { + // Force-kill the script process if it's somehow still around, to avoid leaking on CI + if (File.Exists(pidFile) && int.TryParse(SafelyReadAllText(pidFile).Trim(), out var pid) && pid > 0) + { + try { Process.GetProcessById(pid).Kill(); } catch { /* already gone */ } } } + } + + [Test] + public async Task AbandonToken_WhenCancelAlsoRequested_StillReturnsAbandonedExitCode() + { + // Once abandon is requested, the runner returns AbandonedExitCode even though cancel is + // also requested — the abandon branch resolves the wait and returns -48. + // + // We fire abandon FIRST so this is deterministic. The exit code is NOT deterministic if + // cancel and abandon race on a *killable* process: cancel's kill makes WaitForExitAsync + // complete with the killed exit code before abandon is observed. That race does not happen + // in production — cancel and abandon arrive as separate, sequential RPCs, and the server + // only sends abandon AFTER cancel has failed to unstick the script. The deterministic + // both-tokens case (cancel fired, kill genuinely fails, abandon -> -48) is covered + // end-to-end by ClientScriptExecutionAbandon.AbandonScript_WhenCancelFailsToKillProcess. + using var tempDir = new TemporaryDirectory(); + var pidFile = Path.Combine(tempDir.DirectoryPath, "process.pid"); + + var executable = PlatformDetection.IsRunningOnWindows ? "powershell.exe" : "/bin/bash"; + var arguments = PlatformDetection.IsRunningOnWindows + ? $"-NoProfile -NonInteractive -Command \"$PID | Out-File -FilePath '{pidFile}' -Encoding ASCII; Start-Sleep -Seconds 300\"" + : $"-c \"echo $$ > '{pidFile}' && sleep 300\""; + + using var cancelCts = new CancellationTokenSource(); + using var abandonCts = new CancellationTokenSource(); + + var task = Task.Run(async () => await SilentProcessRunner.ExecuteCommandAsync( + executable, + arguments, + Environment.CurrentDirectory, + debug: _ => { }, + info: _ => { }, + error: _ => { }, + customEnvironmentVariables: null, + cancel: cancelCts.Token, + abandon: abandonCts.Token)); + + await WaitForPidFileAsync(pidFile, TimeSpan.FromSeconds(30)); + + // Abandon first (resolves the wait to -48), then cancel is also requested. + abandonCts.Cancel(); + cancelCts.Cancel(); + + try + { + var exitCode = await task; + exitCode.Should().Be(ScriptExitCodes.AbandonedExitCode); + } finally { - TryKillGrandchild(grandchildPidFile); + if (File.Exists(pidFile) && int.TryParse(SafelyReadAllText(pidFile).Trim(), out var pid) && pid > 0) + { + try { Process.GetProcessById(pid).Kill(); } catch { /* already gone */ } + } } } - static async Task WaitForGrandchildSpawnAsync(string pidFile, TimeSpan timeout) + static async Task WaitForPidFileAsync(string pidFile, TimeSpan timeout) { var deadline = DateTime.UtcNow + timeout; while (DateTime.UtcNow < deadline) @@ -285,8 +407,8 @@ static async Task WaitForGrandchildSpawnAsync(string pidFile, TimeSpan timeout) await Task.Delay(50); } throw new TimeoutException( - $"Test setup failed: the grandchild PID was never written to '{pidFile}'. " + - $"The grandchild-pipe scenario is not being exercised."); + $"Test setup failed: a valid PID was never written to '{pidFile}'. " + + $"The scenario under test is not being exercised."); } static string SafelyReadAllText(string path) @@ -418,7 +540,18 @@ static int Execute( var debug = new StringBuilder(); var info = new StringBuilder(); var error = new StringBuilder(); - var exitCode = SilentProcessRunner.ExecuteCommand( + + // Why this is sync: Execute is a test helper that returns int and uses + // out parameters — both force the signature to be sync. It's invoked + // directly from sync NUnit test methods. + // + // Why blocking on the async call is safe: NUnit dispatches us on a + // worker thread with no SynchronizationContext. + // + // Why low risk: this is test code. The worst case for a wrong call here + // is a hung test, not a production incident. + // See https://blog.stephencleary.com/2012/07/dont-block-on-async-code.html + var exitCode = SilentProcessRunner.ExecuteCommandAsync( command, arguments, workingDirectory, @@ -437,7 +570,7 @@ static int Execute( Console.WriteLine($"{DateTime.UtcNow} ERR: {x}"); error.Append(x); }, - cancel); + cancel: cancel).GetAwaiter().GetResult(); debugMessages = debug; infoMessages = info; diff --git a/source/Octopus.Tentacle.Tests/Capabilities/CapabilitiesServiceV2Fixture.cs b/source/Octopus.Tentacle.Tests/Capabilities/CapabilitiesServiceV2Fixture.cs index 0c6326a05..e219353cd 100644 --- a/source/Octopus.Tentacle.Tests/Capabilities/CapabilitiesServiceV2Fixture.cs +++ b/source/Octopus.Tentacle.Tests/Capabilities/CapabilitiesServiceV2Fixture.cs @@ -6,6 +6,7 @@ using Octopus.Tentacle.Contracts; using Octopus.Tentacle.Contracts.KubernetesScriptServiceV1; using Octopus.Tentacle.Contracts.ScriptServiceV2; +using Octopus.Tentacle.Core.Services.Scripts; using Octopus.Tentacle.Kubernetes; using Octopus.Tentacle.Services.Capabilities; @@ -20,8 +21,7 @@ public async Task CapabilitiesAreReturned() .GetCapabilitiesAsync(CancellationToken.None)) .SupportedCapabilities; - capabilities.Should().BeEquivalentTo(nameof(IScriptService), nameof(IFileTransferService), nameof(IScriptServiceV2)); - capabilities.Count.Should().Be(3); + capabilities.Should().BeEquivalentTo(nameof(IScriptService), nameof(IFileTransferService), nameof(IScriptServiceV2), nameof(ScriptServiceV2.AbandonScriptAsync)); capabilities.Should().NotContainMatch("IKubernetesScriptService*"); } @@ -36,11 +36,30 @@ public async Task OnlyKubernetesScriptServicesAreReturnedWhenRunningAsKubernetes .SupportedCapabilities; capabilities.Should().BeEquivalentTo(nameof(IFileTransferService), nameof(IKubernetesScriptServiceV1)); - capabilities.Count.Should().Be(2); capabilities.Should().NotContainMatch("IScriptService*"); Environment.SetEnvironmentVariable(KubernetesConfig.NamespaceVariableName, null); } + + [Test] + public async Task GetCapabilities_OnNonKubernetesTentacle_AdvertisesAbandonScriptV2() + { + var service = new CapabilitiesServiceV2(); + var response = await service.GetCapabilitiesAsync(CancellationToken.None); + response.SupportedCapabilities.Should().Contain(nameof(ScriptServiceV2.AbandonScriptAsync)); + } + + [Test] + public async Task GetCapabilities_OnKubernetesTentacle_DoesNotAdvertiseAbandonScriptV2() + { + Environment.SetEnvironmentVariable(KubernetesConfig.NamespaceVariableName, "ABC"); + + var service = new CapabilitiesServiceV2(); + var response = await service.GetCapabilitiesAsync(CancellationToken.None); + response.SupportedCapabilities.Should().NotContain(nameof(ScriptServiceV2.AbandonScriptAsync)); + + Environment.SetEnvironmentVariable(KubernetesConfig.NamespaceVariableName, null); + } } } \ No newline at end of file diff --git a/source/Octopus.Tentacle.Tests/Integration/ScriptServiceV2Fixture.cs b/source/Octopus.Tentacle.Tests/Integration/ScriptServiceV2Fixture.cs index d5c771eef..9317a874e 100644 --- a/source/Octopus.Tentacle.Tests/Integration/ScriptServiceV2Fixture.cs +++ b/source/Octopus.Tentacle.Tests/Integration/ScriptServiceV2Fixture.cs @@ -1,4 +1,4 @@ -using System; +using System; using System.Collections.Concurrent; using System.Collections.Generic; using System.Diagnostics; @@ -16,6 +16,7 @@ using Octopus.Tentacle.Core.Diagnostics; using Octopus.Tentacle.Core.Services.Scripts; using Octopus.Tentacle.Core.Services.Scripts.Locking; +using Octopus.Tentacle.Core.Services.Scripts.Logging; using Octopus.Tentacle.Core.Services.Scripts.Security.Masking; using Octopus.Tentacle.Core.Services.Scripts.Shell; using Octopus.Tentacle.Core.Services.Scripts.StateStore; @@ -29,30 +30,12 @@ namespace Octopus.Tentacle.Tests.Integration [TestFixture] public class ScriptServiceV2Fixture { - ScriptServiceV2 service = null!; - ScriptWorkspaceFactory workspaceFactory = null!; - ScriptStateStoreFactory stateStoreFactory = null!; - - [SetUp] - public void SetUp() - { - var homeConfiguration = Substitute.For(); - homeConfiguration.HomeDirectory.Returns(Environment.CurrentDirectory); - - var octopusPhysicalFileSystem = new OctopusPhysicalFileSystem(Substitute.For()); - workspaceFactory = new ScriptWorkspaceFactory(octopusPhysicalFileSystem, homeConfiguration, new SensitiveValueMasker()); - stateStoreFactory = new ScriptStateStoreFactory(octopusPhysicalFileSystem); - service = new ScriptServiceV2( - PlatformDetection.IsRunningOnWindows ? (IShell)new PowerShell() : new Bash(), - workspaceFactory, - stateStoreFactory, - new ScriptIsolationMutex(), - Substitute.For()); - } - [Test] public async Task ShouldExecuteAScriptSuccessfully() { + var builder = new ScriptServiceV2Builder(); + var service = builder.Build(); + var windowsScript = "& ping.exe localhost -n 1"; var bashScript = "ping localhost -c 1"; @@ -63,7 +46,7 @@ public async Task ShouldExecuteAScriptSuccessfully() .Build(); var startScriptResponse = await service.StartScriptAsync(startScriptCommand, CancellationToken.None); - var (logs, finalResponse) = await RunUntilScriptCompletes(startScriptCommand, startScriptResponse); + var (logs, finalResponse) = await RunUntilScriptCompletes(service, startScriptCommand, startScriptResponse); finalResponse.State.Should().Be(ProcessState.Complete); finalResponse.ExitCode.Should().Be(0); @@ -73,6 +56,9 @@ public async Task ShouldExecuteAScriptSuccessfully() [Test] public async Task ShouldReturnANonZeroExitCodeForAFailingScript() { + var builder = new ScriptServiceV2Builder(); + var service = builder.Build(); + var windowsScript = "& ping.exe nope -n 1"; var bashScript = "ping nope -c 1"; @@ -83,7 +69,7 @@ public async Task ShouldReturnANonZeroExitCodeForAFailingScript() .Build(); var startScriptResponse = await service.StartScriptAsync(startScriptCommand, CancellationToken.None); - var (logs, finalResponse) = await RunUntilScriptCompletes(startScriptCommand, startScriptResponse); + var (logs, finalResponse) = await RunUntilScriptCompletes(service, startScriptCommand, startScriptResponse); finalResponse.State.Should().Be(ProcessState.Complete); finalResponse.ExitCode.Should().NotBe(0); @@ -93,6 +79,9 @@ public async Task ShouldReturnANonZeroExitCodeForAFailingScript() [Test] public async Task ShouldExecuteMultipleScriptsConcurrently() { + var builder = new ScriptServiceV2Builder(); + var service = builder.Build(); + var bashScript = "sleep 10"; var windowsScript = "Start-Sleep -Seconds 10"; @@ -114,7 +103,7 @@ public async Task ShouldExecuteMultipleScriptsConcurrently() }); await Task.WhenAll(tasks); - + var startDuration = started.Elapsed; startDuration.Should().BeLessThan(TimeSpan.FromSeconds(5)); @@ -122,7 +111,7 @@ public async Task ShouldExecuteMultipleScriptsConcurrently() foreach (var script in scripts) { - var (logs, finalResponse) = await RunUntilScriptCompletes(script.Command, script.Response); + var (logs, finalResponse) = await RunUntilScriptCompletes(service, script.Command, script.Response); finalResponse.State.Should().Be(ProcessState.Complete); finalResponse.ExitCode.Should().Be(0); @@ -136,6 +125,9 @@ public async Task ShouldExecuteMultipleScriptsConcurrently() [Test] public async Task ShouldStartExecuteAScriptQuickly() { + var builder = new ScriptServiceV2Builder(); + var service = builder.Build(); + var script = "echo \"finished\""; var startScriptCommand = new StartScriptCommandV2Builder() @@ -170,6 +162,9 @@ public async Task ShouldStartExecuteAScriptQuickly() [Test] public async Task ShouldExecuteALongRunningScriptSuccessfully() { + var builder = new ScriptServiceV2Builder(); + var service = builder.Build(); + var bashScript = "sleep 10"; var windowsScript = "Start-Sleep -Seconds 10"; @@ -180,7 +175,7 @@ public async Task ShouldExecuteALongRunningScriptSuccessfully() .Build(); var startScriptResponse = await service.StartScriptAsync(startScriptCommand, CancellationToken.None); - var (_, finalResponse) = await RunUntilScriptCompletes(startScriptCommand, startScriptResponse); + var (_, finalResponse) = await RunUntilScriptCompletes(service, startScriptCommand, startScriptResponse); finalResponse.State.Should().Be(ProcessState.Complete); finalResponse.ExitCode.Should().Be(0); @@ -189,6 +184,9 @@ public async Task ShouldExecuteALongRunningScriptSuccessfully() [Test] public async Task StartScriptShouldWaitForAShortScriptToFinish() { + var builder = new ScriptServiceV2Builder(); + var service = builder.Build(); + var startScriptCommand = new StartScriptCommandV2Builder() .WithScriptBody("echo \"finished\"") .WithDurationStartScriptCanWaitForScriptToFinish(TimeSpan.FromSeconds(5)) @@ -207,6 +205,9 @@ public async Task StartScriptShouldWaitForAShortScriptToFinish() [Test] public async Task StartScriptShouldNotWaitForALongScriptToFinish() { + var builder = new ScriptServiceV2Builder(); + var service = builder.Build(); + var bashScript = "sleep 10"; var windowsScript = "Start-Sleep -Seconds 10"; @@ -217,7 +218,7 @@ public async Task StartScriptShouldNotWaitForALongScriptToFinish() .Build(); var startScriptResponse = await service.StartScriptAsync(startScriptCommand, CancellationToken.None); - await RunUntilScriptCompletes(startScriptCommand, startScriptResponse); + await RunUntilScriptCompletes(service, startScriptCommand, startScriptResponse); startScriptResponse.State.Should().Be(ProcessState.Running); startScriptResponse.ExitCode.Should().Be(0); @@ -227,6 +228,9 @@ public async Task StartScriptShouldNotWaitForALongScriptToFinish() [Test] public async Task StartScriptShouldNotStartTheScriptForTheSameScriptTicketMoreThanOnce_SequentialRequests() { + var builder = new ScriptServiceV2Builder(); + var service = builder.Build(); + // Arrange var scriptTicket = new ScriptTicket(Guid.NewGuid().ToString()); var script1 = GetStartScriptCommandForScriptThatCreatesAFile(scriptTicket); @@ -236,7 +240,7 @@ public async Task StartScriptShouldNotStartTheScriptForTheSameScriptTicketMoreTh await service.StartScriptAsync(script1.StartScriptCommand, CancellationToken.None); var startScriptResponse = await service.StartScriptAsync(script2.StartScriptCommand, CancellationToken.None); - var (_, finalResponse) = await RunUntilScriptCompletes(script2.StartScriptCommand, startScriptResponse); + var (_, finalResponse) = await RunUntilScriptCompletes(service, script2.StartScriptCommand, startScriptResponse); // Assert finalResponse.State.Should().Be(ProcessState.Complete); @@ -249,6 +253,9 @@ public async Task StartScriptShouldNotStartTheScriptForTheSameScriptTicketMoreTh [Test] public async Task StartScriptShouldNotStartTheScriptForTheSameScriptTicketMoreThanOnce_ConcurrentRequests() { + var builder = new ScriptServiceV2Builder(); + var service = builder.Build(); + // Arrange var scriptTicket = new ScriptTicket(Guid.NewGuid().ToString()); @@ -271,6 +278,7 @@ public async Task StartScriptShouldNotStartTheScriptForTheSameScriptTicketMoreTh await Task.WhenAll(tasks); var (_, finalResponse) = await RunUntilScriptCompletes( + service, scripts[0].StartScriptCommand, new ScriptStatusResponseV2(scripts[0].StartScriptCommand.ScriptTicket, ProcessState.Pending, 0, new List(), 0)); @@ -292,6 +300,9 @@ public async Task StartScriptShouldNotStartTheScriptForTheSameScriptTicketMoreTh [Test] public async Task CancelScriptShouldCancelAnExecutingScript() { + var builder = new ScriptServiceV2Builder(); + var service = builder.Build(); + var bashScript = "sleep 60"; var windowsScript = "Start-Sleep -Seconds 60"; @@ -337,6 +348,10 @@ public async Task CancelScriptShouldCancelAnExecutingScript() [Test] public async Task CompleteScriptShouldCleanupTheWorkspace() { + var builder = new ScriptServiceV2Builder(); + var service = builder.Build(); + var workspaceFactory = (ScriptWorkspaceFactory)builder.WorkspaceFactory; + var script = "echo \"finished\""; var startScriptCommand = new StartScriptCommandV2Builder() @@ -351,13 +366,14 @@ public async Task CompleteScriptShouldCleanupTheWorkspace() var startScriptResponse = await service.StartScriptAsync(startScriptCommand, CancellationToken.None); Directory.Exists(workspaceDirectory).Should().BeTrue(); - await RunUntilScriptCompletes(startScriptCommand, startScriptResponse); + await RunUntilScriptCompletes(service, startScriptCommand, startScriptResponse); Directory.Exists(workspaceDirectory).Should().BeFalse(); } [Test] public async Task GetStatusShouldReturnAnExitCodeOf45ForAnUnknownScriptTicket() { + var service = new ScriptServiceV2Builder().Build(); var response = await service.GetStatusAsync(new ScriptStatusRequestV2(new ScriptTicket("nope"), 0), CancellationToken.None); response.ExitCode.Should().Be(-45); @@ -366,6 +382,7 @@ public async Task GetStatusShouldReturnAnExitCodeOf45ForAnUnknownScriptTicket() [Test] public async Task CancelScriptShouldReturnAnExitCodeOf45ForAnUnknownScriptTicket() { + var service = new ScriptServiceV2Builder().Build(); var response = await service.CancelScriptAsync(new CancelScriptCommandV2(new ScriptTicket("nope"), 0), CancellationToken.None); response.ExitCode.Should().Be(-45); @@ -374,43 +391,53 @@ public async Task CancelScriptShouldReturnAnExitCodeOf45ForAnUnknownScriptTicket [Test] public async Task CompleteScriptShouldNotErrorForAnUnknownScriptTicket() { + var service = new ScriptServiceV2Builder().Build(); await service.CompleteScriptAsync(new CompleteScriptCommandV2(new ScriptTicket("nope")), CancellationToken.None); } [Test] public async Task GetStatusShouldReturnAnExitCodeOf46ForAScriptWithAnUnknownResult() { + var builder = new ScriptServiceV2Builder(); + var service = builder.Build(); + var request = new ScriptStatusRequestV2(new ScriptTicket($"did-not-finish-{Guid.NewGuid()}"), 0); var ticket = request.Ticket; - SetupScriptState(ticket); + SetupScriptState(builder.WorkspaceFactory, builder.StateStoreFactory, ticket); var response = await service.GetStatusAsync(request, CancellationToken.None); response.ExitCode.Should().Be(-46); - await CleanupWorkspace(ticket, CancellationToken.None); + await CleanupWorkspace(builder.WorkspaceFactory, ticket, CancellationToken.None); } [Test] public async Task CancelScriptShouldReturnAnExitCodeOf46ForAScriptWithAnUnknownResult() { + var builder = new ScriptServiceV2Builder(); + var service = builder.Build(); + var request = new CancelScriptCommandV2(new ScriptTicket($"did-not-finish-{Guid.NewGuid()}"), 0); var ticket = request.Ticket; - SetupScriptState(ticket); + SetupScriptState(builder.WorkspaceFactory, builder.StateStoreFactory, ticket); var response = await service.CancelScriptAsync(request, CancellationToken.None); response.ExitCode.Should().Be(-46); - await CleanupWorkspace(ticket, CancellationToken.None); + await CleanupWorkspace(builder.WorkspaceFactory, ticket, CancellationToken.None); } [Test] public async Task CompleteScriptShouldNotErrorForAScriptWithAnUnknownResult() { + var builder = new ScriptServiceV2Builder(); + var service = builder.Build(); + var request = new CompleteScriptCommandV2(new ScriptTicket($"did-not-finish-{Guid.NewGuid()}")); var ticket = request.Ticket; - SetupScriptState(ticket); + SetupScriptState(builder.WorkspaceFactory, builder.StateStoreFactory, ticket); await service.CompleteScriptAsync(request, CancellationToken.None); } @@ -418,6 +445,9 @@ public async Task CompleteScriptShouldNotErrorForAScriptWithAnUnknownResult() [Test] public async Task ShouldStoreTheStateOfTheScriptInTheScriptStateStore() { + var builder = new ScriptServiceV2Builder(); + var service = builder.Build(); + var testStarted = DateTimeOffset.UtcNow; var bashScript = "sleep 10"; @@ -429,13 +459,13 @@ public async Task ShouldStoreTheStateOfTheScriptInTheScriptStateStore() .WithDurationStartScriptCanWaitForScriptToFinish(null) .Build(); - var scriptStateStore = SetupScriptStateStore(startScriptCommand.ScriptTicket); + var scriptStateStore = SetupScriptStateStore(builder.WorkspaceFactory, builder.StateStoreFactory, startScriptCommand.ScriptTicket); var startScriptResponse = await service.StartScriptAsync(startScriptCommand, CancellationToken.None); var runningScriptState = scriptStateStore.Load(); - var (logs, finalResponse) = await RunUntilScriptFinishes(startScriptCommand, startScriptResponse); + var (logs, finalResponse) = await RunUntilScriptFinishes(service, startScriptCommand, startScriptResponse); var testFinished = DateTimeOffset.UtcNow; var finishedScriptState = scriptStateStore.Load(); @@ -461,6 +491,8 @@ public async Task ShouldStoreTheStateOfTheScriptInTheScriptStateStore() [Test] public async Task ScriptTicketCasingShouldNotAffectCommands() { + var service = new ScriptServiceV2Builder().Build(); + // Arrange var startScriptCommand = new StartScriptCommandV2Builder() .WithScriptBody("echo \"finished\"") @@ -483,22 +515,364 @@ public async Task ScriptTicketCasingShouldNotAffectCommands() lowerCaseResponse.ExitCode.Should().Be(0); } + [Test] + public async Task AbandonScript_OnUnknownTicket_ReturnsCompleteWithUnknownScriptExitCode() + { + var service = new ScriptServiceV2Builder().Build(); + + var ticket = new ScriptTicket("unknown-ticket-" + Guid.NewGuid().ToString("N")); + var response = await service.AbandonScriptAsync(new AbandonScriptCommandV2(ticket, 0), CancellationToken.None); + + response.State.Should().Be(ProcessState.Complete); + response.ExitCode.Should().Be(ScriptExitCodes.UnknownScriptExitCode); + } + + [Test] + public async Task AbandonScript_OnRunningScript_FiresAbandonToken_ReturnsAbandonedExitCode() + { + var service = new ScriptServiceV2Builder().Build(); + + var startCommand = new StartScriptCommandV2Builder() + .WithScriptBodyForCurrentOs("Start-Sleep -Seconds 60", "sleep 60") + .WithIsolation(ScriptIsolationLevel.NoIsolation) + .WithDurationStartScriptCanWaitForScriptToFinish(null) + .Build(); + + await service.StartScriptAsync(startCommand, CancellationToken.None); + + // Wait for the script to reach Running state + ScriptStatusResponseV2 status; + var deadline = DateTime.UtcNow.AddSeconds(30); + do + { + status = await service.GetStatusAsync(new ScriptStatusRequestV2(startCommand.ScriptTicket, 0), CancellationToken.None); + if (status.State == ProcessState.Running) break; + await Task.Delay(50); + } while (DateTime.UtcNow < deadline); + status.State.Should().Be(ProcessState.Running, "script should have reached Running state within 30 seconds"); + + // Fire abandon + await service.AbandonScriptAsync(new AbandonScriptCommandV2(startCommand.ScriptTicket, 0), CancellationToken.None); + + // Poll until the script completes (the abandon token causes the process runner to return AbandonedExitCode) + ScriptStatusResponseV2 finalResponse; + var completionDeadline = DateTime.UtcNow.AddSeconds(30); + do + { + finalResponse = await service.GetStatusAsync(new ScriptStatusRequestV2(startCommand.ScriptTicket, 0), CancellationToken.None); + if (finalResponse.State == ProcessState.Complete) break; + await Task.Delay(100); + } while (DateTime.UtcNow < completionDeadline); + + finalResponse.State.Should().Be(ProcessState.Complete); + finalResponse.ExitCode.Should().Be(ScriptExitCodes.AbandonedExitCode); + } + + [Test] + public async Task AbandonScript_OnAlreadyCompletedScript_ReturnsRealExitCode() + { + var service = new ScriptServiceV2Builder().Build(); + + var startCommand = new StartScriptCommandV2Builder() + .WithScriptBody("echo \"finished\"") + .WithIsolation(ScriptIsolationLevel.NoIsolation) + .WithDurationStartScriptCanWaitForScriptToFinish(null) + .Build(); + + await service.StartScriptAsync(startCommand, CancellationToken.None); + + // Wait for the script to complete + ScriptStatusResponseV2 status; + var deadline = DateTime.UtcNow.AddSeconds(30); + do + { + status = await service.GetStatusAsync(new ScriptStatusRequestV2(startCommand.ScriptTicket, 0), CancellationToken.None); + if (status.State == ProcessState.Complete) break; + await Task.Delay(50); + } while (DateTime.UtcNow < deadline); + status.State.Should().Be(ProcessState.Complete); + + var abandonResponse = await service.AbandonScriptAsync(new AbandonScriptCommandV2(startCommand.ScriptTicket, 0), CancellationToken.None); + abandonResponse.ExitCode.Should().Be(0, "real exit code should be returned, not AbandonedExitCode"); + } + + [Test] + public async Task CompleteScript_AfterAbandon_WhenWorkspaceDeleteFails_LogsWarnAndReturnsNormally() + { + var deleteException = new IOException("file in use"); + var builder = new ScriptServiceV2Builder(); + var (throwingFactory, mockLog) = BuildFactoryWithThrowingDelete(builder.WorkspaceFactory, deleteException); + var serviceUnderTest = builder + .WithWorkspaceFactory(throwingFactory) + .WithLog(mockLog) + .Build(); + + var startCommand = new StartScriptCommandV2Builder() + .WithScriptBodyForCurrentOs("Start-Sleep -Seconds 60", "sleep 60") + .WithIsolation(ScriptIsolationLevel.NoIsolation) + .WithDurationStartScriptCanWaitForScriptToFinish(null) + .Build(); + + await serviceUnderTest.StartScriptAsync(startCommand, CancellationToken.None); + + // Wait for Running + ScriptStatusResponseV2 status; + var runningDeadline = DateTime.UtcNow.AddSeconds(30); + do + { + status = await serviceUnderTest.GetStatusAsync(new ScriptStatusRequestV2(startCommand.ScriptTicket, 0), CancellationToken.None); + if (status.State == ProcessState.Running) break; + await Task.Delay(50); + } while (DateTime.UtcNow < runningDeadline); + status.State.Should().Be(ProcessState.Running, "script should have reached Running state within 30 seconds"); + + await serviceUnderTest.AbandonScriptAsync(new AbandonScriptCommandV2(startCommand.ScriptTicket, 0), CancellationToken.None); + + // Poll until Complete + var completeDeadline = DateTime.UtcNow.AddSeconds(30); + do + { + status = await serviceUnderTest.GetStatusAsync(new ScriptStatusRequestV2(startCommand.ScriptTicket, 0), CancellationToken.None); + if (status.State == ProcessState.Complete) break; + await Task.Delay(50); + } while (DateTime.UtcNow < completeDeadline); + status.State.Should().Be(ProcessState.Complete); + status.ExitCode.Should().Be(ScriptExitCodes.AbandonedExitCode); + + Func complete = async () => await serviceUnderTest.CompleteScriptAsync(new CompleteScriptCommandV2(startCommand.ScriptTicket), CancellationToken.None); + + await complete.Should().NotThrowAsync(); + mockLog.Received().Warn(deleteException, Arg.Is(m => m.Contains("Could not delete") && m.Contains(startCommand.ScriptTicket.TaskId))); + } + + [Test] + public async Task CompleteScript_AfterNormalCompletion_WhenWorkspaceDeleteFails_PropagatesException() + { + var deleteException = new IOException("file in use"); + var builder = new ScriptServiceV2Builder(); + var (throwingFactory, mockLog) = BuildFactoryWithThrowingDelete(builder.WorkspaceFactory, deleteException); + var serviceUnderTest = builder + .WithWorkspaceFactory(throwingFactory) + .WithLog(mockLog) + .Build(); + + var startCommand = new StartScriptCommandV2Builder() + .WithScriptBody("echo \"finished\"") + .WithIsolation(ScriptIsolationLevel.NoIsolation) + .WithDurationStartScriptCanWaitForScriptToFinish(null) + .Build(); + + await serviceUnderTest.StartScriptAsync(startCommand, CancellationToken.None); + + // Poll until natural completion + ScriptStatusResponseV2 status; + var deadline = DateTime.UtcNow.AddSeconds(30); + do + { + status = await serviceUnderTest.GetStatusAsync(new ScriptStatusRequestV2(startCommand.ScriptTicket, 0), CancellationToken.None); + if (status.State == ProcessState.Complete) break; + await Task.Delay(50); + } while (DateTime.UtcNow < deadline); + status.State.Should().Be(ProcessState.Complete); + status.ExitCode.Should().Be(0, "the script exited cleanly, not via abandon"); + + Func complete = async () => await serviceUnderTest.CompleteScriptAsync(new CompleteScriptCommandV2(startCommand.ScriptTicket), CancellationToken.None); + + await complete.Should().ThrowAsync(); + } + + /// + /// Builds an IScriptWorkspaceFactory decorator over the supplied workspaceFactory whose returned + /// workspaces forward every member except Delete(CancellationToken), which throws the supplied + /// exception. Also returns a mock ISystemLog for assertion. + /// + static (IScriptWorkspaceFactory factory, ISystemLog log) BuildFactoryWithThrowingDelete(IScriptWorkspaceFactory workspaceFactory, Exception deleteException) + { + var throwingFactory = new DeleteThrowingScriptWorkspaceFactory(workspaceFactory, deleteException); + var fakeLog = Substitute.For(); + return (throwingFactory, fakeLog); + } + + /// + /// Builder for ScriptServiceV2 SUT construction. Defaults match what the previous [SetUp] + /// produced; tests opt into mock overrides via the chainable With* methods. The + /// WorkspaceFactory and StateStoreFactory properties materialize on first access so tests + /// can grab them before Build() (e.g. to wrap the default factory in a decorator). + /// + class ScriptServiceV2Builder + { + IScriptWorkspaceFactory? workspaceFactory; + ScriptStateStoreFactory? stateStoreFactory; + ISystemLog? log; + IShell? shell; + ScriptIsolationMutex? mutex; + OctopusPhysicalFileSystem? cachedFileSystem; + + public IScriptWorkspaceFactory WorkspaceFactory => workspaceFactory ??= BuildDefaultWorkspaceFactory(); + public ScriptStateStoreFactory StateStoreFactory => stateStoreFactory ??= new ScriptStateStoreFactory(FileSystem); + + OctopusPhysicalFileSystem FileSystem => cachedFileSystem ??= new OctopusPhysicalFileSystem(Substitute.For()); + + ScriptWorkspaceFactory BuildDefaultWorkspaceFactory() + { + var homeConfiguration = Substitute.For(); + homeConfiguration.HomeDirectory.Returns(Environment.CurrentDirectory); + return new ScriptWorkspaceFactory(FileSystem, homeConfiguration, new SensitiveValueMasker()); + } + + public ScriptServiceV2Builder WithWorkspaceFactory(IScriptWorkspaceFactory factory) + { + workspaceFactory = factory; + return this; + } + + public ScriptServiceV2Builder WithStateStoreFactory(ScriptStateStoreFactory factory) + { + stateStoreFactory = factory; + return this; + } + + public ScriptServiceV2Builder WithLog(ISystemLog log) + { + this.log = log; + return this; + } + + public ScriptServiceV2Builder WithShell(IShell shell) + { + this.shell = shell; + return this; + } + + public ScriptServiceV2Builder WithMutex(ScriptIsolationMutex mutex) + { + this.mutex = mutex; + return this; + } + + public ScriptServiceV2 Build() + { + return new ScriptServiceV2( + shell ?? (PlatformDetection.IsRunningOnWindows ? (IShell)new PowerShell() : new Bash()), + WorkspaceFactory, + StateStoreFactory, + mutex ?? new ScriptIsolationMutex(), + log ?? Substitute.For()); + } + } + + /// + /// IScriptWorkspaceFactory decorator that wraps every workspace it returns in a + /// DeleteThrowingScriptWorkspace so that Delete throws the configured exception while all + /// other members forward to the real workspace. + /// + class DeleteThrowingScriptWorkspaceFactory : IScriptWorkspaceFactory + { + readonly IScriptWorkspaceFactory inner; + readonly Exception deleteException; + + public DeleteThrowingScriptWorkspaceFactory(IScriptWorkspaceFactory inner, Exception deleteException) + { + this.inner = inner; + this.deleteException = deleteException; + } + + public IScriptWorkspace GetWorkspace(ScriptTicket ticket, WorkspaceReadinessCheck readinessCheck) + => new DeleteThrowingScriptWorkspace(inner.GetWorkspace(ticket, readinessCheck), deleteException); + + public async Task PrepareWorkspace( + ScriptTicket ticket, + string scriptBody, + Dictionary scripts, + ScriptIsolationLevel isolationLevel, + TimeSpan scriptMutexAcquireTimeout, + string? scriptMutexName, + string[]? scriptArguments, + List files, + CancellationToken cancellationToken) + { + var workspace = await inner.PrepareWorkspace(ticket, scriptBody, scripts, isolationLevel, scriptMutexAcquireTimeout, scriptMutexName, scriptArguments, files, cancellationToken); + return new DeleteThrowingScriptWorkspace(workspace, deleteException); + } + + public List GetUncompletedWorkspaces() + => inner.GetUncompletedWorkspaces().Select(w => (IScriptWorkspace)new DeleteThrowingScriptWorkspace(w, deleteException)).ToList(); + } + + /// + /// IScriptWorkspace decorator that forwards every member to an inner real workspace, except + /// Delete(CancellationToken), which throws the configured exception. Used to exercise the + /// CompleteScript abandon-aware tolerance of Delete failures without disturbing anything else + /// StartScript / RunningScript may touch on the workspace. + /// + class DeleteThrowingScriptWorkspace : IScriptWorkspace + { + readonly IScriptWorkspace inner; + readonly Exception deleteException; + + public DeleteThrowingScriptWorkspace(IScriptWorkspace inner, Exception deleteException) + { + this.inner = inner; + this.deleteException = deleteException; + } + + public ScriptTicket ScriptTicket => inner.ScriptTicket; + public string WorkingDirectory => inner.WorkingDirectory; + public string BootstrapScriptFilePath => inner.BootstrapScriptFilePath; + public string LogFilePath => inner.LogFilePath; + + public string[]? ScriptArguments + { + get => inner.ScriptArguments; + set => inner.ScriptArguments = value; + } + + public ScriptIsolationLevel IsolationLevel + { + get => inner.IsolationLevel; + set => inner.IsolationLevel = value; + } + + public TimeSpan ScriptMutexAcquireTimeout + { + get => inner.ScriptMutexAcquireTimeout; + set => inner.ScriptMutexAcquireTimeout = value; + } + + public string? ScriptMutexName + { + get => inner.ScriptMutexName; + set => inner.ScriptMutexName = value; + } + + public bool ShouldMonitorPowerShellStartup() => inner.ShouldMonitorPowerShellStartup(); + public void BootstrapScript(string scriptBody) => inner.BootstrapScript(scriptBody); + public string ResolvePath(string fileName) => inner.ResolvePath(fileName); + public IScriptLog CreateLog() => inner.CreateLog(); + public void WriteFile(string filename, string contents) => inner.WriteFile(filename, contents); + public void CopyFile(string sourceFilePath, string destFileName, bool overwrite) => inner.CopyFile(sourceFilePath, destFileName, overwrite); + public void CheckReadiness() => inner.CheckReadiness(); + public string? TryReadFile(string filename) => inner.TryReadFile(filename); + + public Task Delete(CancellationToken cancellationToken) => throw deleteException; + } + // TODO - Test the stateStore is updated. - private void SetupScriptState(ScriptTicket ticket) + static void SetupScriptState(IScriptWorkspaceFactory workspaceFactory, ScriptStateStoreFactory stateStoreFactory, ScriptTicket ticket) { - var stateWorkspace = SetupScriptStateStore(ticket); + var stateWorkspace = SetupScriptStateStore(workspaceFactory, stateStoreFactory, ticket); stateWorkspace.Create(); } - private ScriptStateStore SetupScriptStateStore(ScriptTicket ticket) + static ScriptStateStore SetupScriptStateStore(IScriptWorkspaceFactory workspaceFactory, ScriptStateStoreFactory stateStoreFactory, ScriptTicket ticket) { var workspace = workspaceFactory.GetWorkspace(ticket, WorkspaceReadinessCheck.Perform); var stateWorkspace = stateStoreFactory.Create(workspace); return stateWorkspace; } - private async Task CleanupWorkspace(ScriptTicket ticket, CancellationToken cancellationToken) + static async Task CleanupWorkspace(IScriptWorkspaceFactory workspaceFactory, ScriptTicket ticket, CancellationToken cancellationToken) { var workspace = workspaceFactory.GetWorkspace(ticket, WorkspaceReadinessCheck.Skip); await workspace.Delete(cancellationToken); @@ -527,9 +901,9 @@ private async Task CleanupWorkspace(ScriptTicket ticket, CancellationToken cance return (startScriptCommand, new FileInfo(filePath)); } - async Task<(List, ScriptStatusResponseV2)> RunUntilScriptCompletes(StartScriptCommandV2 startScriptCommand, ScriptStatusResponseV2 response) + static async Task<(List, ScriptStatusResponseV2)> RunUntilScriptCompletes(ScriptServiceV2 service, StartScriptCommandV2 startScriptCommand, ScriptStatusResponseV2 response) { - var (logs, lastResponse) = await RunUntilScriptFinishes(startScriptCommand, response); + var (logs, lastResponse) = await RunUntilScriptFinishes(service, startScriptCommand, response); await service.CompleteScriptAsync(new CompleteScriptCommandV2(startScriptCommand.ScriptTicket), CancellationToken.None); @@ -538,7 +912,7 @@ private async Task CleanupWorkspace(ScriptTicket ticket, CancellationToken cance return (logs, lastResponse); } - async Task<(List logs, ScriptStatusResponseV2 response)> RunUntilScriptFinishes(StartScriptCommandV2 startScriptCommand, ScriptStatusResponseV2 response) + static async Task<(List logs, ScriptStatusResponseV2 response)> RunUntilScriptFinishes(ScriptServiceV2 service, StartScriptCommandV2 startScriptCommand, ScriptStatusResponseV2 response) { var logs = new List(response.Logs); @@ -557,7 +931,7 @@ private async Task CleanupWorkspace(ScriptTicket ticket, CancellationToken cance return (logs, response); } - void WriteLogsToConsole(List logs) + static void WriteLogsToConsole(List logs) { foreach (var log in logs) { @@ -576,4 +950,4 @@ public StartScriptCommandAndResponse(StartScriptCommandV2 command) public ScriptStatusResponseV2? Response { get; set; } } } -} \ No newline at end of file +} diff --git a/source/Octopus.Tentacle.Tests/Kubernetes/KubernetesDirectoryInformationProviderFixture.cs b/source/Octopus.Tentacle.Tests/Kubernetes/KubernetesDirectoryInformationProviderFixture.cs index 60799927e..b4b7e337e 100644 --- a/source/Octopus.Tentacle.Tests/Kubernetes/KubernetesDirectoryInformationProviderFixture.cs +++ b/source/Octopus.Tentacle.Tests/Kubernetes/KubernetesDirectoryInformationProviderFixture.cs @@ -1,6 +1,7 @@ using System; using System.Collections.Generic; using System.Threading; +using System.Threading.Tasks; using FluentAssertions; using Microsoft.Extensions.Caching.Memory; using Microsoft.Extensions.Internal; @@ -24,27 +25,29 @@ public void DuOutputParses() { const ulong usedSize = 500 * Megabyte; var spr = Substitute.For(); - spr.When(x => x.ExecuteCommand("du", "-s -B 1 /octopus", "/", Arg.Any>(), Arg.Any>())) - .Do(x => + spr.ExecuteCommandAsync("du", "-s -B 1 /octopus", "/", Arg.Any>(), Arg.Any>()) + .Returns(callInfo => { - x.ArgAt>(3).Invoke($"{usedSize}\t/octopus"); + callInfo.ArgAt>(3).Invoke($"{usedSize}\t/octopus"); + return Task.FromResult(0); }); var memoryCache = new MemoryCache(new MemoryCacheOptions()); var sut = new KubernetesDirectoryInformationProvider(Substitute.For(), spr, memoryCache); sut.GetPathUsedBytes("/octopus").Should().Be(usedSize); } - + [Test] public void DuOutputParsesWithMultipleLines() { const ulong usedSize = 500 * Megabyte; var spr = Substitute.For(); - spr.When(x => x.ExecuteCommand("du", "-s -B 1 /octopus", "/", Arg.Any>(), Arg.Any>())) - .Do(x => + spr.ExecuteCommandAsync("du", "-s -B 1 /octopus", "/", Arg.Any>(), Arg.Any>()) + .Returns(callInfo => { - x.ArgAt>(3).Invoke($"500\t/octopus/extradir"); - x.ArgAt>(3).Invoke($"{usedSize}\t/octopus"); - x.ArgAt>(3).Invoke($"{usedSize+1000}\tTotal"); + callInfo.ArgAt>(3).Invoke($"500\t/octopus/extradir"); + callInfo.ArgAt>(3).Invoke($"{usedSize}\t/octopus"); + callInfo.ArgAt>(3).Invoke($"{usedSize+1000}\tTotal"); + return Task.FromResult(0); }); var memoryCache = new MemoryCache(new MemoryCacheOptions()); var sut = new KubernetesDirectoryInformationProvider(Substitute.For(), spr, memoryCache); @@ -56,18 +59,18 @@ public void IfDuFailsWeStillGetData() { const ulong usedSize = 500 * Megabyte; var spr = Substitute.For(); - spr.When(x => x.ExecuteCommand("du", "-s -B 1 /octopus", "/", Arg.Any>(), Arg.Any>())) - .Do(x => + spr.ExecuteCommandAsync("du", "-s -B 1 /octopus", "/", Arg.Any>(), Arg.Any>()) + .Returns(callInfo => { - x.ArgAt>(3).Invoke($"500\t/octopus"); - x.ArgAt>(3).Invoke($"{usedSize}\t/octopus"); + callInfo.ArgAt>(3).Invoke($"500\t/octopus"); + callInfo.ArgAt>(3).Invoke($"{usedSize}\t/octopus"); + return Task.FromResult(1); }); - spr.ReturnsForAll(1); var memoryCache = new MemoryCache(new MemoryCacheOptions()); var sut = new KubernetesDirectoryInformationProvider(Substitute.For(), spr, memoryCache); sut.GetPathUsedBytes("/octopus").Should().Be(usedSize); } - + [Test] public void IfDuFailsWeLogCorrectly() { @@ -75,22 +78,22 @@ public void IfDuFailsWeLogCorrectly() var systemLog = new InMemoryLog(); var spr = Substitute.For(); - spr.When(x => x.ExecuteCommand("du", "-s -B 1 /octopus", "/", Arg.Any>(), Arg.Any>())) - .Do(x => + spr.ExecuteCommandAsync("du", "-s -B 1 /octopus", "/", Arg.Any>(), Arg.Any>()) + .Returns(callInfo => { // stdout - x.ArgAt>(3).Invoke("500\t/octopus"); - x.ArgAt>(3).Invoke($"{usedSize}\t/octopus"); - + callInfo.ArgAt>(3).Invoke("500\t/octopus"); + callInfo.ArgAt>(3).Invoke($"{usedSize}\t/octopus"); + // stderr - x.ArgAt>(4).Invoke("no permission for foo"); - x.ArgAt>(4).Invoke("also no permission for bar"); + callInfo.ArgAt>(4).Invoke("no permission for foo"); + callInfo.ArgAt>(4).Invoke("also no permission for bar"); + return Task.FromResult(1); }); - spr.ReturnsForAll(1); var memoryCache = new MemoryCache(new MemoryCacheOptions()); var sut = new KubernetesDirectoryInformationProvider(systemLog, spr, memoryCache); sut.GetPathUsedBytes("/octopus").Should().Be(usedSize); - + systemLog.GetLogsForCategory(LogCategory.Warning).Should().Contain("Could not reliably get disk space using du. Getting best approximation..."); systemLog.GetLogsForCategory(LogCategory.Info).Should().Contain($"Du stdout returned 500\t/octopus, {usedSize}\t/octopus"); systemLog.GetLogsForCategory(LogCategory.Info).Should().Contain("Du stderr returned no permission for foo, also no permission for bar"); @@ -100,56 +103,61 @@ public void IfDuFailsWeLogCorrectly() public void IfDuFailsCompletelyReturnNull() { var spr = Substitute.For(); - spr.ReturnsForAll(1); + spr.ExecuteCommandAsync(Arg.Any(), Arg.Any(), Arg.Any(), Arg.Any>(), Arg.Any>()) + .Returns(Task.FromResult(1)); var memoryCache = new MemoryCache(new MemoryCacheOptions()); var sut = new KubernetesDirectoryInformationProvider(Substitute.For(), spr, memoryCache); sut.GetPathUsedBytes("/octopus").Should().Be(null); } - + [Test] public void ReturnedValueShouldBeCached() { var spr = Substitute.For(); - spr.ReturnsForAll(1); + spr.ExecuteCommandAsync(Arg.Any(), Arg.Any(), Arg.Any(), Arg.Any>(), Arg.Any>()) + .Returns(Task.FromResult(1)); var baseTime = DateTimeOffset.UtcNow; var clock = new TestClock(baseTime); var memoryCache = new MemoryCache(new MemoryCacheOptions(){ Clock = clock}); var sut = new KubernetesDirectoryInformationProvider(Substitute.For(), spr, memoryCache); sut.GetPathUsedBytes("/octopus").Should().Be(null); - + const ulong usedSize = 500 * Megabyte; - spr.When(x => x.ExecuteCommand("du", "-s -B 1 /octopus", "/", Arg.Any>(), Arg.Any>())) - .Do(x => + spr.ExecuteCommandAsync("du", "-s -B 1 /octopus", "/", Arg.Any>(), Arg.Any>()) + .Returns(callInfo => { - x.ArgAt>(3).Invoke($"123\t/octopus"); - x.ArgAt>(3).Invoke($"{usedSize}\t/octopus"); + callInfo.ArgAt>(3).Invoke($"123\t/octopus"); + callInfo.ArgAt>(3).Invoke($"{usedSize}\t/octopus"); + return Task.FromResult(0); }); clock.UtcNow = baseTime + TimeSpan.FromSeconds(29); sut.GetPathUsedBytes("/octopus").Should().Be(null); } - + [Test] public void DuCacheExpiresAfter30Seconds() { var spr = Substitute.For(); - spr.ReturnsForAll(1); + spr.ExecuteCommandAsync(Arg.Any(), Arg.Any(), Arg.Any(), Arg.Any>(), Arg.Any>()) + .Returns(Task.FromResult(1)); var baseTime = DateTimeOffset.UtcNow; var clock = new TestClock(baseTime); var memoryCache = new MemoryCache(new MemoryCacheOptions(){ Clock = clock}); var sut = new KubernetesDirectoryInformationProvider(Substitute.For(), spr, memoryCache); sut.GetPathUsedBytes("/octopus").Should().Be(null); - + const ulong usedSize = 500 * Megabyte; - spr.When(x => x.ExecuteCommand("du", "-s -B 1 /octopus", "/", Arg.Any>(), Arg.Any>())) - .Do(x => + spr.ExecuteCommandAsync("du", "-s -B 1 /octopus", "/", Arg.Any>(), Arg.Any>()) + .Returns(callInfo => { - x.ArgAt>(3).Invoke($"123\t/octopus"); - x.ArgAt>(3).Invoke($"{usedSize}\t/octopus"); + callInfo.ArgAt>(3).Invoke($"123\t/octopus"); + callInfo.ArgAt>(3).Invoke($"{usedSize}\t/octopus"); + return Task.FromResult(0); }); clock.UtcNow = baseTime + TimeSpan.FromSeconds(30); - + sut.GetPathUsedBytes("/octopus").Should().Be(usedSize); } diff --git a/source/Octopus.Tentacle.Tests/Util/LinuxTestUserPrincipal.cs b/source/Octopus.Tentacle.Tests/Util/LinuxTestUserPrincipal.cs index 83da2cb93..219fa5b6a 100644 --- a/source/Octopus.Tentacle.Tests/Util/LinuxTestUserPrincipal.cs +++ b/source/Octopus.Tentacle.Tests/Util/LinuxTestUserPrincipal.cs @@ -19,10 +19,19 @@ public LinuxTestUserPrincipal(string username) public string UserName { get; } + // Why this is sync: RunCommand is called from the constructor, which can't + // be async. + // + // Why blocking on the async call is safe: this only runs under NUnit, which + // dispatches us on a worker thread with no SynchronizationContext. + // + // Why low risk: this is test code. The worst case for a wrong call here is + // a hung test, not a production incident. + // See https://blog.stephencleary.com/2012/07/dont-block-on-async-code.html static void RunCommand(string arguments, bool failOnNonZeroExitCode = true) { var commandLineInvocation = new CommandLineInvocation("/bin/bash", arguments); - var result = commandLineInvocation.ExecuteCommand(); + var result = commandLineInvocation.ExecuteCommandAsync().GetAwaiter().GetResult(); foreach (var line in result.Errors) Console.WriteLine(line); diff --git a/source/Octopus.Tentacle/Kubernetes/KubernetesDirectoryInformationProvider.cs b/source/Octopus.Tentacle/Kubernetes/KubernetesDirectoryInformationProvider.cs index 9154bdfff..7609ff106 100644 --- a/source/Octopus.Tentacle/Kubernetes/KubernetesDirectoryInformationProvider.cs +++ b/source/Octopus.Tentacle/Kubernetes/KubernetesDirectoryInformationProvider.cs @@ -1,6 +1,7 @@ using System; using System.Collections.Generic; using System.Linq; +using System.Threading.Tasks; using Octopus.Client.Extensions; using Microsoft.Extensions.Caching.Memory; using Octopus.Tentacle.Core.Diagnostics; @@ -11,15 +12,16 @@ namespace Octopus.Tentacle.Kubernetes public interface IKubernetesDirectoryInformationProvider { public ulong? GetPathUsedBytes(string directoryPath); + public Task GetPathUsedBytesAsync(string directoryPath); public ulong? GetPathTotalBytes(); } - + public class KubernetesDirectoryInformationProvider : IKubernetesDirectoryInformationProvider { readonly ISystemLog log; readonly ISilentProcessRunner silentProcessRunner; readonly IMemoryCache directoryInformationCache; - + //30s gives us fairly up to date information, but doesn't impact performance too much. //For 50 concurrent Cloud deployments: //No cache: 30min ea. @@ -36,12 +38,25 @@ public KubernetesDirectoryInformationProvider(ISystemLog log, ISilentProcessRunn this.directoryInformationCache = directoryInformationCache; } + // Why this is sync: the only caller is EnsureDiskHasEnoughFreeSpace, which + // overrides a sync method on the IOctopusFileSystem chain. Making that + // whole chain async is a wider refactor than this PR. New code should + // call GetPathUsedBytesAsync directly instead of going through here. + // + // Why blocking on the async call is safe: .GetAwaiter().GetResult() can + // deadlock when the calling thread has a SynchronizationContext. The + // Kubernetes agent is a console app and doesn't set one up, so there's + // nothing for the awaited continuation to wait on. + // See https://blog.stephencleary.com/2012/07/dont-block-on-async-code.html public ulong? GetPathUsedBytes(string directoryPath) + => GetPathUsedBytesAsync(directoryPath).GetAwaiter().GetResult(); + + public async Task GetPathUsedBytesAsync(string directoryPath) { - return directoryInformationCache.GetOrCreate(directoryPath, e => + return await directoryInformationCache.GetOrCreateAsync(directoryPath, async e => { e.SetAbsoluteExpiration(CacheExpiry); - return GetDriveBytesUsingDu(directoryPath); + return await GetDriveBytesUsingDuAsync(directoryPath); }); } @@ -49,13 +64,13 @@ public KubernetesDirectoryInformationProvider(ISystemLog log, ISilentProcessRunn { return KubernetesUtilities.GetResourceBytes(KubernetesConfig.PersistentVolumeSize); } - - - ulong? GetDriveBytesUsingDu(string directoryPath) + + + async Task GetDriveBytesUsingDuAsync(string directoryPath) { var stdOut = new List(); var stdErr = new List(); - var exitCode = silentProcessRunner.ExecuteCommand("du", $"-s -B 1 {directoryPath}", "/", stdOut.Add, stdErr.Add); + var exitCode = await silentProcessRunner.ExecuteCommandAsync("du", $"-s -B 1 {directoryPath}", "/", stdOut.Add, stdErr.Add); if (exitCode != 0) { @@ -71,4 +86,4 @@ public KubernetesDirectoryInformationProvider(ISystemLog log, ISilentProcessRunn return null; } } -} \ No newline at end of file +} diff --git a/source/Octopus.Tentacle/Kubernetes/KubernetesPhysicalFileSystem.cs b/source/Octopus.Tentacle/Kubernetes/KubernetesPhysicalFileSystem.cs index e1bc2378d..45d0570a5 100644 --- a/source/Octopus.Tentacle/Kubernetes/KubernetesPhysicalFileSystem.cs +++ b/source/Octopus.Tentacle/Kubernetes/KubernetesPhysicalFileSystem.cs @@ -1,5 +1,6 @@ using System; using System.IO; +using System.Threading.Tasks; using Octopus.Tentacle.Core.Diagnostics; using Octopus.Tentacle.Core.Util; using Octopus.Tentacle.Util; @@ -46,6 +47,21 @@ public override void EnsureDiskHasEnoughFreeSpace(string directoryPath, long req public (ulong freeSpaceBytes, ulong totalSpaceBytes)? GetStorageInformation() { var bytesUsed = directoryInformationProvider.GetPathUsedBytes(HomeDir); + return BuildStorageInformation(bytesUsed); + } + + public async Task<(ulong freeSpaceBytes, ulong totalSpaceBytes)?> GetStorageInformationAsync() + { + var bytesUsed = await directoryInformationProvider.GetPathUsedBytesAsync(HomeDir); + return BuildStorageInformation(bytesUsed); + } + + // Shared sync assembler used by both GetStorageInformation (sync, for the + // IOctopusFileSystem override path) and GetStorageInformationAsync (for + // CreateScriptContainer). Pulled out to keep the post-fetch logic DRY + // across the sync/async pair we need. + (ulong freeSpaceBytes, ulong totalSpaceBytes)? BuildStorageInformation(ulong? bytesUsed) + { var bytesTotal = directoryInformationProvider.GetPathTotalBytes(); if (bytesUsed.HasValue && bytesTotal.HasValue) { diff --git a/source/Octopus.Tentacle/Kubernetes/KubernetesScriptPodCreator.cs b/source/Octopus.Tentacle/Kubernetes/KubernetesScriptPodCreator.cs index 894a7d7ab..7a1439ad1 100644 --- a/source/Octopus.Tentacle/Kubernetes/KubernetesScriptPodCreator.cs +++ b/source/Octopus.Tentacle/Kubernetes/KubernetesScriptPodCreator.cs @@ -300,7 +300,7 @@ void LogVerboseToBothLogs(string message, InMemoryTentacleScriptLog tentacleScri protected async Task CreateScriptContainer(StartKubernetesScriptCommandV1 command, string podName, string scriptName, string homeDir, string workspacePath, string[]? scriptArguments, InMemoryTentacleScriptLog tentacleScriptLog, V1Container? containerSpec) { - var spaceInformation = kubernetesPhysicalFileSystem.GetStorageInformation(); + var spaceInformation = await kubernetesPhysicalFileSystem.GetStorageInformationAsync(); var commandString = string.Join(" ", new[] { diff --git a/source/Octopus.Tentacle/Services/Capabilities/CapabilitiesServiceV2.cs b/source/Octopus.Tentacle/Services/Capabilities/CapabilitiesServiceV2.cs index 7dda62791..a4537f26a 100644 --- a/source/Octopus.Tentacle/Services/Capabilities/CapabilitiesServiceV2.cs +++ b/source/Octopus.Tentacle/Services/Capabilities/CapabilitiesServiceV2.cs @@ -6,6 +6,7 @@ using Octopus.Tentacle.Contracts.KubernetesScriptServiceV1; using Octopus.Tentacle.Contracts.ScriptServiceV2; using Octopus.Tentacle.Core.Services; +using Octopus.Tentacle.Core.Services.Scripts; using Octopus.Tentacle.Util; namespace Octopus.Tentacle.Services.Capabilities @@ -24,7 +25,7 @@ public async Task GetCapabilitiesAsync(CancellationToken } //non-kubernetes agent tentacles only support the standard script services - return new CapabilitiesResponseV2(new List { nameof(IScriptService), nameof(IFileTransferService), nameof(IScriptServiceV2) }); + return new CapabilitiesResponseV2(new List { nameof(IScriptService), nameof(IFileTransferService), nameof(IScriptServiceV2), nameof(ScriptServiceV2.AbandonScriptAsync) }); } } } \ No newline at end of file diff --git a/source/Octopus.Tentacle/Services/Scripts/ScriptService.cs b/source/Octopus.Tentacle/Services/Scripts/ScriptService.cs index e1b952202..82b482d6f 100644 --- a/source/Octopus.Tentacle/Services/Scripts/ScriptService.cs +++ b/source/Octopus.Tentacle/Services/Scripts/ScriptService.cs @@ -93,7 +93,7 @@ public async Task CompleteScriptAsync(CompleteScriptComman RunningScript LaunchShell(ScriptTicket ticket, string serverTaskId, IScriptWorkspace workspace, CancellationTokenSource cancel) { - var runningScript = new RunningScript(shell, workspace, workspace.CreateLog(), serverTaskId, scriptIsolationMutex, cancel.Token, new Dictionary(), PowerShellStartupDetection.PowerShellStartupTimeout, log); + var runningScript = RunningScript.Create(shell, workspace, workspace.CreateLog(), serverTaskId, scriptIsolationMutex, cancel.Token, new Dictionary(), PowerShellStartupDetection.PowerShellStartupTimeout, log); _ = Task.Run(async () => await runningScript.Execute()); return runningScript; } diff --git a/source/Octopus.Tentacle/Startup/LinuxServiceConfigurator.cs b/source/Octopus.Tentacle/Startup/LinuxServiceConfigurator.cs index f57809cd0..128996c1d 100644 --- a/source/Octopus.Tentacle/Startup/LinuxServiceConfigurator.cs +++ b/source/Octopus.Tentacle/Startup/LinuxServiceConfigurator.cs @@ -1,8 +1,9 @@ -using System; +using System; using System.Collections.Generic; using System.IO; using System.Text; using System.Text.RegularExpressions; +using System.Threading.Tasks; using Octopus.Tentacle.Core.Diagnostics; using Octopus.Tentacle.Util; @@ -47,22 +48,38 @@ public void ConfigureServiceByConfigPath(string thisServiceName, serviceConfigurationState); } + // Used by ServiceCommand (an AbstractCommand) to install/configure the + // Tentacle as a Linux systemd service. + // + // Why this is sync: AbstractCommand.Start() is sync because ICommand.Start() + // is sync. When Tentacle runs as a Windows service we host AbstractCommands + // via Topshelf, whose runtime callback API is also sync — so the call path + // has to return sync end-to-end. + // + // Why blocking on the async call is safe: the console-app main thread has + // no SynchronizationContext. Topshelf's OnStart callback runs on a fresh + // `new Thread(...)` worker that also has none. Either way, nothing for the + // awaited continuation to wait on. + // See https://blog.stephencleary.com/2012/07/dont-block-on-async-code.html void ConfigureService(string thisServiceName, string exePath, string? instance, string? configPath, string serviceDescription, ServiceConfigurationState serviceConfigurationState) + => ConfigureServiceAsync(thisServiceName, exePath, instance, configPath, serviceDescription, serviceConfigurationState).GetAwaiter().GetResult(); + + async Task ConfigureServiceAsync(string thisServiceName, string exePath, string? instance, string? configPath, string serviceDescription, ServiceConfigurationState serviceConfigurationState) { //Check if system has bash and systemd - CheckSystemPrerequisites(); + await CheckSystemPrerequisitesAsync(); var cleanedInstanceName = SanitizeString(instance ?? thisServiceName); var systemdUnitFilePath = $"/etc/systemd/system/{cleanedInstanceName}.service"; if (serviceConfigurationState.Restart) - RestartService(cleanedInstanceName); + await RestartServiceAsync(cleanedInstanceName); if (serviceConfigurationState.Stop) - StopService(cleanedInstanceName); + await StopServiceAsync(cleanedInstanceName); if (serviceConfigurationState.Uninstall) - UninstallService(cleanedInstanceName, systemdUnitFilePath); + await UninstallServiceAsync(cleanedInstanceName, systemdUnitFilePath); var serviceDependencies = new List(); serviceDependencies.AddRange(new[] {"network.target"}); @@ -72,7 +89,7 @@ void ConfigureService(string thisServiceName, string exePath, string? instance, var userName = serviceConfigurationState.Username ?? "root"; if (serviceConfigurationState.Install) - InstallService(cleanedInstanceName, + await InstallServiceAsync(cleanedInstanceName, instance, configPath, exePath, @@ -82,7 +99,7 @@ void ConfigureService(string thisServiceName, string exePath, string? instance, serviceDependencies); if (serviceConfigurationState.Reconfigure) - ReconfigureService(cleanedInstanceName, + await ReconfigureServiceAsync(cleanedInstanceName, instance, configPath, exePath, @@ -92,42 +109,42 @@ void ConfigureService(string thisServiceName, string exePath, string? instance, serviceDependencies); if (serviceConfigurationState.Start) - StartService(cleanedInstanceName); + await StartServiceAsync(cleanedInstanceName); } - void RestartService(string serviceName) + async Task RestartServiceAsync(string serviceName) { log.Info($"Restarting service: {serviceName}"); - if (systemCtlHelper.RestartService(serviceName)) + if (await systemCtlHelper.RestartServiceAsync(serviceName)) log.Info("Service has been restarted"); else log.Error("The service could not be restarted"); } - void StopService(string serviceName) + async Task StopServiceAsync(string serviceName) { log.Info($"Stopping service: {serviceName}"); - if (systemCtlHelper.StopService(serviceName)) + if (await systemCtlHelper.StopServiceAsync(serviceName)) log.Info("Service stopped"); else log.Error("The service could not be stopped"); } - void StartService(string serviceName) + async Task StartServiceAsync(string serviceName) { - if (systemCtlHelper.StartService(serviceName, true)) + if (await systemCtlHelper.StartServiceAsync(serviceName, true)) log.Info($"Service started: {serviceName}"); else log.Error($"Could not start the systemd service: {serviceName}"); } - void UninstallService(string instance, string systemdUnitFilePath) + async Task UninstallServiceAsync(string instance, string systemdUnitFilePath) { log.Info($"Removing systemd service: {instance}"); try { - systemCtlHelper.StopService(instance); - systemCtlHelper.DisableService(instance); + await systemCtlHelper.StopServiceAsync(instance); + await systemCtlHelper.DisableServiceAsync(instance); File.Delete(systemdUnitFilePath); log.Info("Service uninstalled"); } @@ -138,7 +155,7 @@ void UninstallService(string instance, string systemdUnitFilePath) } } - void InstallService(string serviceName, + async Task InstallServiceAsync(string serviceName, string? instance, string? configPath, string exePath, @@ -149,8 +166,8 @@ void InstallService(string serviceName, { try { - WriteUnitFile(systemdUnitFilePath, GenerateSystemdUnitFile(instance, configPath, serviceDescription, exePath, userName, serviceDependencies)); - systemCtlHelper.EnableService(serviceName, true); + await WriteUnitFileAsync(systemdUnitFilePath, GenerateSystemdUnitFile(instance, configPath, serviceDescription, exePath, userName, serviceDependencies)); + await systemCtlHelper.EnableServiceAsync(serviceName, true); log.Info($"Service installed: {serviceName}"); } catch (Exception e) @@ -160,7 +177,7 @@ void InstallService(string serviceName, } } - void ReconfigureService(string serviceName, + async Task ReconfigureServiceAsync(string serviceName, string? instance, string? configPath, string exePath, @@ -173,13 +190,13 @@ void ReconfigureService(string serviceName, { log.Info($"Attempting to remove old service: {serviceName}"); //remove service - systemCtlHelper.StopService(serviceName); - systemCtlHelper.DisableService(serviceName); + await systemCtlHelper.StopServiceAsync(serviceName); + await systemCtlHelper.DisableServiceAsync(serviceName); File.Delete(systemdUnitFilePath); //re-add service - WriteUnitFile(systemdUnitFilePath, GenerateSystemdUnitFile(instance, configPath, serviceDescription, exePath, userName, serviceDependencies)); - systemCtlHelper.EnableService(serviceName, true); + await WriteUnitFileAsync(systemdUnitFilePath, GenerateSystemdUnitFile(instance, configPath, serviceDescription, exePath, userName, serviceDependencies)); + await systemCtlHelper.EnableServiceAsync(serviceName, true); log.Info($"Service installed: {serviceName}"); } catch (Exception e) @@ -189,48 +206,48 @@ void ReconfigureService(string serviceName, } } - void WriteUnitFile(string path, string contents) + async Task WriteUnitFileAsync(string path, string contents) { File.WriteAllText(path, contents); var commandLineInvocation = new CommandLineInvocation("/bin/bash", $"-c \"chmod 644 {path}\""); - var result = commandLineInvocation.ExecuteCommand(); + var result = await commandLineInvocation.ExecuteCommandAsync(); if (result.ExitCode == 0) return; result.Validate(); } - void CheckSystemPrerequisites() + async Task CheckSystemPrerequisitesAsync() { if (!File.Exists("/bin/bash")) throw new ControlledFailureException( "Could not detect bash. bash is required to run tentacle."); - if (!HaveSudoPrivileges()) + if (!await HaveSudoPrivilegesAsync()) throw new ControlledFailureException( "Requires elevated privileges. Please run command as sudo."); - if (!IsSystemdInstalled()) + if (!await IsSystemdInstalledAsync()) throw new ControlledFailureException( "Could not detect systemd. systemd is required to run Tentacle as a service"); } - bool IsSystemdInstalled() + async Task IsSystemdInstalledAsync() { var commandLineInvocation = new CommandLineInvocation("/bin/bash", "-c \"command -v systemctl >/dev/null\""); - var result = commandLineInvocation.ExecuteCommand(); + var result = await commandLineInvocation.ExecuteCommandAsync(); return result.ExitCode == 0; } - bool HaveSudoPrivileges() + async Task HaveSudoPrivilegesAsync() { var commandLineInvocation = new CommandLineInvocation("/bin/bash", "-c \"sudo -vn 2> /dev/null\""); - var result = commandLineInvocation.ExecuteCommand(); + var result = await commandLineInvocation.ExecuteCommandAsync(); return result.ExitCode == 0; } - string GenerateSystemdUnitFile(string? instance, + string GenerateSystemdUnitFile(string? instance, string? configPath, string serviceDescription, string exePath, string userName, IEnumerable serviceDependencies) { @@ -246,7 +263,7 @@ string GenerateSystemdUnitFile(string? instance, if (!string.IsNullOrEmpty(instance)) { stringBuilder.Append($" --instance={instance}"); - } + } else if (!string.IsNullOrEmpty(configPath)) { stringBuilder.Append($" --config={configPath}"); @@ -267,4 +284,4 @@ string GenerateSystemdUnitFile(string? instance, static string SanitizeString(string str) => Regex.Replace(str.Replace("/", ""), @"\s+", "-"); } -} \ No newline at end of file +} diff --git a/source/Octopus.Tentacle/Startup/WindowsServiceConfigurator.cs b/source/Octopus.Tentacle/Startup/WindowsServiceConfigurator.cs index 1fc5dd0eb..6a3f0358b 100644 --- a/source/Octopus.Tentacle/Startup/WindowsServiceConfigurator.cs +++ b/source/Octopus.Tentacle/Startup/WindowsServiceConfigurator.cs @@ -1,4 +1,4 @@ -using System; +using System; using System.Collections.Generic; using System.Diagnostics.CodeAnalysis; using System.IO; @@ -7,6 +7,7 @@ using System.ServiceProcess; using System.Text; using System.Threading; +using System.Threading.Tasks; using Octopus.Tentacle.Core.Diagnostics; using Octopus.Tentacle.Util; using Polly; @@ -58,12 +59,33 @@ public void ConfigureServiceByConfigPath(string thisServiceName, serviceConfigurationState); } + // Used by ServiceCommand (an AbstractCommand) to install/configure the + // Tentacle as a Windows Service. + // + // Why this is sync: AbstractCommand.Start() is sync because ICommand.Start() + // is sync. When Tentacle runs as a Windows service we host AbstractCommands + // via Topshelf, whose runtime callback API is also sync — so the call path + // has to return sync end-to-end. + // + // Why blocking on the async call is safe: the console-app main thread has + // no SynchronizationContext. Topshelf's OnStart callback runs on a fresh + // `new Thread(...)` worker that also has none. Either way, nothing for the + // awaited continuation to wait on. + // See https://blog.stephencleary.com/2012/07/dont-block-on-async-code.html void ConfigureService(string thisServiceName, string exePath, string? instance, string? configPath, string serviceDescription, ServiceConfigurationState serviceConfigurationState) + => ConfigureServiceAsync(thisServiceName, exePath, instance, configPath, serviceDescription, serviceConfigurationState).GetAwaiter().GetResult(); + + async Task ConfigureServiceAsync(string thisServiceName, + string exePath, + string? instance, + string? configPath, + string serviceDescription, + ServiceConfigurationState serviceConfigurationState) { windowsLocalAdminRightsChecker.AssertIsRunningElevated(); var services = ServiceController.GetServices(); @@ -81,7 +103,7 @@ void ConfigureService(string thisServiceName, if (serviceConfigurationState.Uninstall) { - UninstallService(thisServiceName, controller); + await UninstallServiceAsync(thisServiceName, controller); } var serviceDependencies = new List(); @@ -96,18 +118,18 @@ void ConfigureService(string thisServiceName, if (serviceConfigurationState.Install) { - controller = InstallService(thisServiceName, exePath, instance, configPath, + controller = await InstallServiceAsync(thisServiceName, exePath, instance, configPath, serviceDescription, serviceConfigurationState, controller, serviceDependencies); } if (serviceConfigurationState.Reconfigure) { - ReconfigureService(thisServiceName, exePath, instance, configPath, serviceDescription, serviceDependencies); + await ReconfigureServiceAsync(thisServiceName, exePath, instance, configPath, serviceDescription, serviceDependencies); } if ((serviceConfigurationState.Install || serviceConfigurationState.Reconfigure) && !string.IsNullOrWhiteSpace(serviceConfigurationState.Username)) { - ConfigureCredentialsForService(thisServiceName, serviceConfigurationState); + await ConfigureCredentialsForServiceAsync(thisServiceName, serviceConfigurationState); } if (serviceConfigurationState.Start) @@ -116,7 +138,7 @@ void ConfigureService(string thisServiceName, } } - void ConfigureCredentialsForService(string thisServiceName, ServiceConfigurationState serviceConfigurationState) + async Task ConfigureCredentialsForServiceAsync(string thisServiceName, ServiceConfigurationState serviceConfigurationState) { if (!string.IsNullOrWhiteSpace(serviceConfigurationState.Password)) { @@ -137,13 +159,13 @@ void ConfigureCredentialsForService(string thisServiceName, ServiceConfiguration } else { - Sc($"config \"{thisServiceName}\" obj= \"{serviceConfigurationState.Username}\""); + await ScAsync($"config \"{thisServiceName}\" obj= \"{serviceConfigurationState.Username}\""); } log.Info("Service credentials set"); } - void ReconfigureService(string thisServiceName, + async Task ReconfigureServiceAsync(string thisServiceName, string exePath, string? instance, string? configPath, @@ -154,13 +176,13 @@ void ReconfigureService(string thisServiceName, var command = exePath.EndsWith(".dll") ? $"config \"{thisServiceName}\" binpath= \"dotnet \\\"{exePath}\\\" run {instanceIdentifier} DisplayName= \"{thisServiceName}\" depend= {string.Join("/", serviceDependencies)} start= auto" : $"config \"{thisServiceName}\" binpath= \"\\\"{exePath}\\\" run {instanceIdentifier} DisplayName= \"{thisServiceName}\" depend= {string.Join("/", serviceDependencies)} start= auto"; - Sc(command); - Sc($"description \"{thisServiceName}\" \"{serviceDescription}\""); + await ScAsync(command); + await ScAsync($"description \"{thisServiceName}\" \"{serviceDescription}\""); log.Info("Service reconfigured"); } - ServiceController? InstallService(string thisServiceName, + async Task InstallServiceAsync(string thisServiceName, string exePath, string? instance, string? configPath, @@ -181,8 +203,8 @@ void ReconfigureService(string thisServiceName, ? $"create \"{thisServiceName}\" binpath= \"dotnet \\\"{exePath}\\\" run {instanceIdentifier} DisplayName= \"{thisServiceName}\" depend= {string.Join("/", serviceDependencies)} start= auto" : $"create \"{thisServiceName}\" binpath= \"\\\"{exePath}\\\" run {instanceIdentifier} DisplayName= \"{thisServiceName}\" depend= {string.Join("/", serviceDependencies)} start= auto"; - Sc(command); - Sc($"description \"{thisServiceName}\" \"{serviceDescription}\""); + await ScAsync(command); + await ScAsync($"description \"{thisServiceName}\" \"{serviceDescription}\""); } log.Info("Service installed"); @@ -207,7 +229,7 @@ static string InstanceIdentifier(string? instance, string? configPath) throw new InvalidOperationException("Either the instance name of configuration path must be provided to configure a service"); } - void UninstallService(string thisServiceName, ServiceController? controller) + async Task UninstallServiceAsync(string thisServiceName, ServiceController? controller) { if (controller == null) { @@ -215,7 +237,7 @@ void UninstallService(string thisServiceName, ServiceController? controller) } else { - Sc($"delete \"{thisServiceName}\""); + await ScAsync($"delete \"{thisServiceName}\""); log.Info("Service uninstalled"); } @@ -333,7 +355,7 @@ void WaitForControllerStatus(ServiceController controller, ServiceControllerStat 150); } - void Sc(string arguments) + async Task ScAsync(string arguments) { var outputBuilder = new StringBuilder(); var argumentsToLog = string.Join(" ", arguments); @@ -342,7 +364,7 @@ void Sc(string arguments) var sc = Path.Combine(system32, "sc.exe"); logFileOnlyLogger.Info($"Executing sc.exe {argumentsToLog}"); - var exitCode = SilentProcessRunnerExtended.ExecuteCommand(sc, + var exitCode = await SilentProcessRunnerExtended.ExecuteCommandAsync(sc, arguments, Environment.CurrentDirectory, output => outputBuilder.AppendLine(output), diff --git a/source/Octopus.Tentacle/Util/CommandLineRunner.cs b/source/Octopus.Tentacle/Util/CommandLineRunner.cs index cc6da549a..b19903eff 100644 --- a/source/Octopus.Tentacle/Util/CommandLineRunner.cs +++ b/source/Octopus.Tentacle/Util/CommandLineRunner.cs @@ -1,5 +1,7 @@ -using System; +using System; using System.Collections.Generic; +using System.Threading; +using System.Threading.Tasks; using Octopus.Tentacle.Core.Diagnostics; namespace Octopus.Tentacle.Util @@ -7,13 +9,11 @@ namespace Octopus.Tentacle.Util public class CommandLineRunner : ICommandLineRunner { public bool Execute(IEnumerable commandLineInvocations, ILog log) - { - return Execute(commandLineInvocations, + => Execute(commandLineInvocations, log.Verbose, log.Info, log.Error, log.Error); - } public bool Execute(IEnumerable commandLineInvocations, Action debug, @@ -35,15 +35,57 @@ public bool Execute(CommandLineInvocation invocation, ILog log) log.Error, log.Error); + // We're at the ICommandLineRunner sync entry point, consumed by Octopus.Manager.Tentacle + // (WPF). The WPF installer calls Execute from ThreadPool.QueueUserWorkItem (a sync + // delegate), so this is the sync-over-async bridge: a one-line wrapper over the public + // async implementation. Safe because the installer dispatches us on a plain thread-pool + // worker. No captured SynchronizationContext, so no deadlock. Async callers should call + // ExecuteAsync directly. + // See https://blog.stephencleary.com/2012/07/dont-block-on-async-code.html public bool Execute(CommandLineInvocation invocation, Action debug, Action info, Action error, Action exception) + => ExecuteAsync(invocation, debug, info, error, exception).GetAwaiter().GetResult(); + + public Task ExecuteAsync(IEnumerable commandLineInvocations, ILog log) + => ExecuteAsync(commandLineInvocations, + log.Verbose, + log.Info, + log.Error, + log.Error); + + public async Task ExecuteAsync(IEnumerable commandLineInvocations, + Action debug, + Action info, + Action error, + Action exception) + { + foreach (var invocation in commandLineInvocations) + if (!await ExecuteAsync(invocation, debug, info, error, exception)) + return false; + + return true; + } + + public Task ExecuteAsync(CommandLineInvocation invocation, ILog log) + => ExecuteAsync(invocation, + log.Info, + log.Info, + log.Error, + log.Error); + + public async Task ExecuteAsync(CommandLineInvocation invocation, + Action debug, + Action info, + Action error, + Action exception) { try { - var exitCode = SilentProcessRunner.ExecuteCommand(invocation.Executable, + var exitCode = await SilentProcessRunner.ExecuteCommandAsync( + invocation.Executable, (invocation.Arguments ?? "") + " " + (invocation.SystemArguments ?? ""), Environment.CurrentDirectory, debug, @@ -75,4 +117,4 @@ public bool Execute(CommandLineInvocation invocation, return true; } } -} \ No newline at end of file +} diff --git a/source/Octopus.Tentacle/Util/ICommandLineRunner.cs b/source/Octopus.Tentacle/Util/ICommandLineRunner.cs index a1244a681..ab0b1d7db 100644 --- a/source/Octopus.Tentacle/Util/ICommandLineRunner.cs +++ b/source/Octopus.Tentacle/Util/ICommandLineRunner.cs @@ -1,5 +1,6 @@ -using System; +using System; using System.Collections.Generic; +using System.Threading.Tasks; using Octopus.Tentacle.Core.Diagnostics; namespace Octopus.Tentacle.Util @@ -21,5 +22,21 @@ bool Execute(CommandLineInvocation invocation, Action info, Action error, Action exception); + + Task ExecuteAsync(IEnumerable commandLineInvocations, ILog log); + + Task ExecuteAsync(IEnumerable commandLineInvocations, + Action debug, + Action info, + Action error, + Action exception); + + Task ExecuteAsync(CommandLineInvocation commandLineInvocation, ILog log); + + Task ExecuteAsync(CommandLineInvocation invocation, + Action debug, + Action info, + Action error, + Action exception); } -} \ No newline at end of file +} diff --git a/source/Octopus.Tentacle/Util/ISilentProcessRunner.cs b/source/Octopus.Tentacle/Util/ISilentProcessRunner.cs index 06a61e558..fa6d21f45 100644 --- a/source/Octopus.Tentacle/Util/ISilentProcessRunner.cs +++ b/source/Octopus.Tentacle/Util/ISilentProcessRunner.cs @@ -1,49 +1,52 @@ -using System; +using System; using System.Collections.Generic; using System.Threading; +using System.Threading.Tasks; using Octopus.Tentacle.Startup; namespace Octopus.Tentacle.Util { public interface ISilentProcessRunner { - public int ExecuteCommand( + Task ExecuteCommandAsync( string executable, string arguments, string workingDirectory, Action info, Action error, - CancellationToken cancel = default); + CancellationToken cancel = default, + CancellationToken abandon = default); - public int ExecuteCommand( + Task ExecuteCommandAsync( string executable, string arguments, string workingDirectory, Action debug, Action info, Action error, - CancellationToken cancel = default); + CancellationToken cancel = default, + CancellationToken abandon = default); } public class SilentProcessRunnerWrapper : ISilentProcessRunner { - public int ExecuteCommand(string executable, string arguments, string workingDirectory, Action info, Action error, CancellationToken cancel = default) + public Task ExecuteCommandAsync(string executable, string arguments, string workingDirectory, Action info, Action error, CancellationToken cancel = default, CancellationToken abandon = default) { - return SilentProcessRunnerExtended.ExecuteCommand(executable, arguments, workingDirectory, info, error, cancel); + return SilentProcessRunnerExtended.ExecuteCommandAsync(executable, arguments, workingDirectory, info, error, cancel, abandon); } - public int ExecuteCommand(string executable, string arguments, string workingDirectory, Action debug, Action info, Action error, CancellationToken cancel = default) + public Task ExecuteCommandAsync(string executable, string arguments, string workingDirectory, Action debug, Action info, Action error, CancellationToken cancel = default, CancellationToken abandon = default) { - return SilentProcessRunner.ExecuteCommand(executable, arguments, workingDirectory, debug, info, error, cancel: cancel); + return SilentProcessRunner.ExecuteCommandAsync(executable, arguments, workingDirectory, debug, info, error, cancel: cancel, abandon: abandon); } } public static class SilentProcessRunnerExtended { - public static CmdResult ExecuteCommand(this CommandLineInvocation invocation) - => ExecuteCommand(invocation, Environment.CurrentDirectory); + public static async Task ExecuteCommandAsync(this CommandLineInvocation invocation) + => await ExecuteCommandAsync(invocation, Environment.CurrentDirectory); - public static CmdResult ExecuteCommand(this CommandLineInvocation invocation, string workingDirectory) + public static async Task ExecuteCommandAsync(this CommandLineInvocation invocation, string workingDirectory) { if (workingDirectory == null) throw new ArgumentNullException(nameof(workingDirectory)); @@ -52,7 +55,7 @@ public static CmdResult ExecuteCommand(this CommandLineInvocation invocation, st var infos = new List(); var errors = new List(); - var exitCode = ExecuteCommand( + var exitCode = await ExecuteCommandAsync( invocation.Executable, arguments, workingDirectory, @@ -63,20 +66,22 @@ public static CmdResult ExecuteCommand(this CommandLineInvocation invocation, st return new CmdResult(exitCode, infos, errors); } - public static int ExecuteCommand( + public static Task ExecuteCommandAsync( string executable, string arguments, string workingDirectory, Action info, Action error, - CancellationToken cancel = default) - => SilentProcessRunner.ExecuteCommand(executable, + CancellationToken cancel = default, + CancellationToken abandon = default) + => SilentProcessRunner.ExecuteCommandAsync(executable, arguments, workingDirectory, LogFileOnlyLogger.Current.Info, info, error, customEnvironmentVariables: null, - cancel: cancel); + cancel: cancel, + abandon: abandon); } -} \ No newline at end of file +} diff --git a/source/Octopus.Tentacle/Util/SystemCtlHelper.cs b/source/Octopus.Tentacle/Util/SystemCtlHelper.cs index 8ff4fd632..d97fb3a03 100644 --- a/source/Octopus.Tentacle/Util/SystemCtlHelper.cs +++ b/source/Octopus.Tentacle/Util/SystemCtlHelper.cs @@ -1,5 +1,6 @@ using System; using System.Linq; +using System.Threading.Tasks; using Octopus.Tentacle.Core.Diagnostics; namespace Octopus.Tentacle.Util @@ -13,26 +14,26 @@ public SystemCtlHelper(ISystemLog log) this.log = log; } - public bool StartService(string serviceName, bool logFailureAsError = false) - => RunServiceCommand("start", serviceName, logFailureAsError); + public Task StartServiceAsync(string serviceName, bool logFailureAsError = false) + => RunServiceCommandAsync("start", serviceName, logFailureAsError); - public bool RestartService(string serviceName, bool logFailureAsError = false) - => RunServiceCommand("restart", serviceName, logFailureAsError); + public Task RestartServiceAsync(string serviceName, bool logFailureAsError = false) + => RunServiceCommandAsync("restart", serviceName, logFailureAsError); - public bool StopService(string serviceName, bool logFailureAsError = false) - => RunServiceCommand("stop", serviceName, logFailureAsError); + public Task StopServiceAsync(string serviceName, bool logFailureAsError = false) + => RunServiceCommandAsync("stop", serviceName, logFailureAsError); - public bool EnableService(string serviceName, bool logFailureAsError = false) - => RunServiceCommand("enable", serviceName, logFailureAsError); + public Task EnableServiceAsync(string serviceName, bool logFailureAsError = false) + => RunServiceCommandAsync("enable", serviceName, logFailureAsError); - public bool DisableService(string serviceName, bool logFailureAsError = false) - => RunServiceCommand("disable", serviceName, logFailureAsError); + public Task DisableServiceAsync(string serviceName, bool logFailureAsError = false) + => RunServiceCommandAsync("disable", serviceName, logFailureAsError); - bool RunServiceCommand(string command, string serviceName, bool logFailureAsError) + async Task RunServiceCommandAsync(string command, string serviceName, bool logFailureAsError) { // Try without sudo first var commandLineInvocation = new CommandLineInvocation("/bin/bash", $"-c \"systemctl {command} {serviceName}\""); - var result = commandLineInvocation.ExecuteCommand(); + var result = await commandLineInvocation.ExecuteCommandAsync(); if (result.ExitCode == 0) return true; // Check if failure is due to authentication/permission issues @@ -48,9 +49,9 @@ bool RunServiceCommand(string command, string serviceName, bool logFailureAsErro { log.Info($"Permission denied. Retrying 'systemctl {command} {serviceName}' with sudo..."); var sudoInvocation = new CommandLineInvocation("/bin/bash", $"-c \"sudo systemctl {command} {serviceName}\""); - result = sudoInvocation.ExecuteCommand(); + result = await sudoInvocation.ExecuteCommandAsync(); if (result.ExitCode == 0) return true; - + usedSudo = true; }