Skip to content

Conversation

@fioan89
Copy link
Collaborator

@fioan89 fioan89 commented Dec 2, 2025

Netflix reported that only seems to reproduce on Linux (we've only tested Ubuntu so far).
I can’t reproduce it on macOS. First, here’s some context:

  1. Polling workspaces:
    Coder Toolbox polls the deployment every 5 seconds for workspace updates.
    These updates (new workspaces, deletions,status changes) are stored in a
    cached “environments” list (an oversimplified explanation). When a URI is executed,
    we reset the content of the list and run the login sequence, which re-initializes
    the HTTP poller and CLI using the new deployment URL and token. A new polling loop
    then begins populating the environments list again.

  2. Cache monitoring:
    Toolbox watches this cached list for changes—especially status changes, which determine
    when an SSH connection can be established.

In Netflix’s case, they launched Toolbox, created a workspace from the Dashboard, and the
poller added it to the environments list. When the workspace switched from starting to ready,
they used a URI to connect to it. The URI reset the list, then the poller repopulated it. But
because the list had the same IDs (but new object references), Toolbox didn’t detect any changes.
As a result, it never triggered the SSH connection. This issue only reproduces on Linux, but it
might explain some of the sporadic macOS failures Atif mentioned in the past.

I need to dig deeper into the Toolbox bytecode to determine whether this is a Toolbox bug, but
it does seem like Toolbox wasn’t designed to switch cleanly between multiple deployments and/or users.
The current Coder plugin behavior—always performing a full login sequence on every URI—is also ...sub-optimal.
It only really makes sense in these scenarios:

  1. Toolbox started with deployment A, but the URI targets deployment B.
  2. Toolbox started with deployment A/user X, but the URI targets deployment A/user Y.

But this design is inefficient for the most common case: connecting via URI to a workspace on the
same deployment and same user. While working on the fix, I realized that scenario (2) is not realistic.
On the same host machine, why would multiple users log into the same deployment via Toolbox? The whole
fix revolves around the idea of just recreating the http client and updating the CLI with the new token
instead of going through the full authentication steps when the URI deployment URL is the same as the
currently opened URL

The fix focuses on simply recreating the HTTP client and updating the CLI token when the URI URL matches the existing deployment URL, instead of running a full login.

This PR splits responsibilities more cleanly:

  • CoderProtocolHandler now only finds the workspace and agent and handles IDE installation and launch.
  • the logic for creating a new HTTP client, updating the CLI, cleaning up old resources (polling loop, environment cache), and handling deployment URL changes is separated out.

The benefits would be:

  • shared logic for cleanup and re-initialization, with less coupling and clearer, more maintainable code.
  • a clean way to check whether the URI’s deployment URL matches the current one and react appropriately when they differ.

Small improvement where we get rid of emitting environment and
ssh connection trigger events from new coroutines. StateFlow in
Kotlin is a hot, conflated flow that keeps only the most recent value.
In other words we can immediately update the value without needing to
launch a new coroutine, and we won't block the current thread.
…ened

Netflix reported that only seems to reproduce on Linux (we've only tested Ubuntu so far).
I can’t reproduce it on macOS. First, here’s some context:
1. Polling workspaces:
Coder Toolbox polls the deployment every 5 seconds for workspace updates.
These updates (new workspaces, deletions,status changes) are stored in a
cached “environments” list (an oversimplified explanation). When a URI is executed,
we reset the content of the list and run the login sequence, which re-initializes
the HTTP poller and CLI using the new deployment URL and token. A new polling loop
then begins populating the environments list again.

2. Cache monitoring:
Toolbox watches this cached list for changes—especially status changes, which determine
when an SSH connection can be established.

In Netflix’s case, they launched Toolbox, created a workspace from the Dashboard, and the
poller added it to the environments list. When the workspace switched from starting to ready,
they used a URI to connect to it. The URI reset the list, then the poller repopulated it. But
because the list had the same IDs (but new object references), Toolbox didn’t detect any changes.
As a result, it never triggered the SSH connection. This issue only reproduces on Linux, but it
might explain some of the sporadic macOS failures Atif mentioned in the past.

I need to dig deeper into the Toolbox bytecode to determine whether this is a Toolbox bug, but
it does seem like Toolbox wasn’t designed to switch cleanly between multiple deployments and/or users.
The current Coder plugin behavior—always performing a full login sequence on every URI—is also ...sub-optimal.
It only really makes sense in these scenarios:

1. Toolbox started with deployment A, but the URI targets deployment B.
2. Toolbox started with deployment A/user X, but the URI targets deployment A/user Y.

But this design is inefficient for the most common case: connecting via URI to a workspace on the
same deployment and same user. While working on the fix, I realized that scenario (2) is not realistic.
On the same host machine, why would multiple users log into the same deployment via Toolbox? The whole
fix revolves around the idea of just recreating the http client and updating the CLI with the new token
instead of going through the full authentication steps when the URI deployment URL is the same as the
currently opened URL
With this commit we split the responsibilities. `CoderProtocolHander` is now only
responsible searching the workspace and agent and orchestrating the IDE installation and launching.
The code around initializing the http client and cli with a new URL and a new token, plus cleaning
up the old resources like the polling loop and the list of environments.
There are two major benefits to this approach:
- allows us to easily share/reuse the logic around cleaning up resources and re-initializing the
http client and cli without passing so many callbacks to CoderProtocolHandler (less coupling, code that
is cleaner and easier to read, easier to maintain and test)
- provides a nice and easy way to check whether the URI url is the same as the one from current
deployment and properly react if they differ.
@fioan89 fioan89 marked this pull request as ready for review December 8, 2025 22:05
@fioan89 fioan89 requested a review from code-asher December 8, 2025 22:05
@fioan89 fioan89 requested review from f0ssel and jcjiang December 8, 2025 22:07
When Toolbox is closed and URI is executed the handler waits for the
plugin to fully initialize with the last successful deployment after
which it goes on with the URI handling logic.
We can improve this situation by skipping the last successful
deployment which anyway. This reduces a lot the time to an actual IDE connection.
Auto connect was running after logout on deployments that were opened
when Toolbox was launched by a URI. This fix forces URI handler to disable
auto connect for the URI connections.
Copy link
Member

@code-asher code-asher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very nice 👌

* Also called as part of our own logout.
*/
override fun close() {
softClose()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just kind of thinking out loud, not something we have to do now or in this PR, but it looks like the pattern we have is to close the client and cli and then immediately set/unset them, so I wonder if we would benefit from formalizing that, like if we had something like update(newClient, newCli) that closed the old ones if any and set new ones, so it is never possible to accidentally be in a sort of desynced state where you have a closed client/cli but have not updated or unset the client/cli.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be possible, I'll add it on my TODO list.

@fioan89 fioan89 merged commit a2c028e into main Dec 10, 2025
6 checks passed
@fioan89 fioan89 deleted the fix-uri-handling-on-linux branch December 10, 2025 21:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants