fix(websocket): Fix the bug related to multi-threaded access to trans… (IDFGH-17155)#997

Open
udoudou wants to merge 1 commit into espressif:master from udoudou:bugfix/esp_websocket_client_thread_safe

Conversation


@udoudou udoudou commented Jan 27, 2026

The close operation on the client->transport resource is performed under the protection of client->lock, but esp_websocket_client_send_with_exact_opcode checks the state without holding that lock when deciding whether the transport can be accessed. This can lead to unexpected behavior:

  1. If tx_lock is not enabled, a call to esp_transport_ws_send_raw may block abnormally when the transport has already been closed.

  2. If tx_lock is enabled, esp_transport_close and esp_transport_ws_send_raw may be called concurrently. Since close releases resources, this can lead to unpredictable behavior.
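The race above is a classic check-then-use pattern. Below is a minimal, self-contained sketch in plain C with pthreads (not the FreeRTOS primitives the client actually uses; all types and names here are illustrative, not the real esp_websocket_client structures) contrasting the unsafe unlocked check with a check performed under the same lock that the close path takes:

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the real client state; illustrative only. */
typedef enum { STATE_CONNECTED, STATE_CLOSED } state_t;

typedef struct {
    pthread_mutex_t lock;   /* stands in for client->lock */
    state_t state;
    bool transport_valid;   /* stands in for client->transport */
} client_t;

/* Unsafe pattern (before the fix): the state check happens without the
 * lock, so another thread may close the transport between the check and
 * the actual use of the transport. */
static int send_unsafe(client_t *c)
{
    if (c->state != STATE_CONNECTED) {
        return -1;
    }
    /* ... a concurrent close may free the transport right here ... */
    return c->transport_valid ? 0 : -1;
}

/* Fixed pattern: state check and transport use happen while holding the
 * same lock the close path takes, so they cannot interleave with it. */
static int send_safe(client_t *c)
{
    pthread_mutex_lock(&c->lock);
    int ret = -1;
    if (c->state == STATE_CONNECTED && c->transport_valid) {
        ret = 0;    /* the transport write would happen here, under the lock */
    }
    pthread_mutex_unlock(&c->lock);
    return ret;
}

/* Close path: releases the transport under the same lock. */
static void close_transport(client_t *c)
{
    pthread_mutex_lock(&c->lock);
    c->state = STATE_CLOSED;
    c->transport_valid = false;
    pthread_mutex_unlock(&c->lock);
}
```

With the lock held across both the check and the use, `send_safe` can never observe a half-closed transport, which is the property the patch aims for.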

Description

To fix this issue, I adjusted how mutual exclusion over these resources is implemented.

1. Remove tx_lock.
tx_lock was originally introduced to let receiving and sending proceed in parallel, improving transmission efficiency. However, that does not require a separate lock. The receive direction is accessed only by esp_websocket_client_task, while the send direction may be accessed concurrently by several user tasks and by esp_websocket_client_task. Therefore esp_websocket_client_task needs no lock protection while receiving; the lock is taken only when esp_websocket_client_task or a user task sends, in order to check that the transport is still available.

2. Make esp_websocket_client_abort_connection close the transport asynchronously.
esp_websocket_client_abort_connection now only updates the state, so user tasks no longer close the transport directly, which would conflict with esp_websocket_client_task receiving on that same transport.
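A minimal sketch of this asynchronous-close idea, again using pthreads and invented names rather than the real client internals: abort only flips the state, and the owning task loop performs the deferred close, so reads in that task need no lock.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Illustrative state machine; only WAIT_ABORT mirrors the PR's new
 * WEBSOCKET_STATE_WAIT_ABORT_CONNECT idea. */
typedef enum {
    WS_STATE_CONNECTED,
    WS_STATE_WAIT_ABORT,   /* close requested, not yet performed */
    WS_STATE_CLOSED
} ws_state_t;

typedef struct {
    pthread_mutex_t lock;
    ws_state_t state;
    bool transport_open;
} ws_client_t;

/* May be called from any thread: request the close, do not perform it. */
static void ws_abort_connection(ws_client_t *c)
{
    pthread_mutex_lock(&c->lock);
    if (c->state == WS_STATE_CONNECTED) {
        c->state = WS_STATE_WAIT_ABORT;
    }
    pthread_mutex_unlock(&c->lock);
}

/* Called only from the client task loop: perform the deferred close.
 * Because only this task ever closes the transport, its receive path
 * cannot race with the close and needs no lock. */
static void ws_task_iteration(ws_client_t *c)
{
    pthread_mutex_lock(&c->lock);
    if (c->state == WS_STATE_WAIT_ABORT) {
        c->transport_open = false;  /* esp_transport_close() would go here */
        c->state = WS_STATE_CLOSED;
    }
    pthread_mutex_unlock(&c->lock);
}
```

The trade-off, discussed further down in this thread, is that the close now happens only when the task loop next runs, so cleanup is no longer synchronous with the caller.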

The APIs involved in resource contention are mainly esp_websocket_client_send_bin, esp_websocket_client_send_bin_partial, esp_websocket_client_send_text, esp_websocket_client_send_text_partial, esp_websocket_client_send_cont_msg, esp_websocket_client_send_fin, esp_websocket_client_send_with_opcode, esp_websocket_client_close, and esp_websocket_client_close_with_code. All of them ultimately go through esp_websocket_client_send_with_exact_opcode. The contended resources are the transport send path, errormsg_buffer, and state while the state is WEBSOCKET_STATE_CONNECTED.

After this change, esp_websocket_client_send_with_exact_opcode accesses and modifies these resources only under lock protection. In esp_websocket_client_task, while the state is WEBSOCKET_STATE_CONNECTED, every transport send, every transition to a non-connected state, and every access to errormsg_buffer is likewise performed under the lock. The transport is closed exclusively by esp_websocket_client_task, and only in a non-WEBSOCKET_STATE_CONNECTED state, which guarantees that user-task sends cannot run concurrently with the close and that esp_websocket_client_task needs no lock protection while receiving.

Related

Testing


Checklist

Before submitting a Pull Request, please ensure the following:

  • 🚨 This PR does not introduce breaking changes.
  • All CI checks (GH Actions) pass.
  • Documentation is updated as needed.
  • Tests are updated or added as necessary.
  • Code is well-commented, especially in complex areas.
  • Git history is clean — commits are squashed to the minimum necessary.

Note

Refactors websocket client concurrency and connection lifecycle handling.

  • Removes tx_lock Kconfig/options and all associated separate TX locking; uses a single client->lock to guard all transport sends/state/error buffer access
  • Introduces WEBSOCKET_STATE_WAIT_ABORT_CONNECT and changes esp_websocket_client_abort_connection() to be asynchronous (sets state, dispatches disconnect); actual esp_transport_close() now performed by the client task
  • Ensures PING/PONG/CLOSE and send paths acquire client->lock and re-check state before accessing transport; adds locked aborts on read/write failures
  • Adjusts main task state machine: splits abort handling, reconnect/wait logic, and server-initiated close flow; removes previous WAIT_TIMEOUT handling tied to immediate close
  • Cleans up Kconfig (drops tx lock options) and updates SPDX year

Written by Cursor Bugbot for commit 3bbe73a.


CLAassistant commented Jan 27, 2026

CLA assistant check
All committers have signed the CLA.

@github-actions github-actions bot changed the title fix(websocket): Fix the bug related to multi-threaded access to trans… fix(websocket): Fix the bug related to multi-threaded access to trans… (IDFGH-17155) Jan 27, 2026
@espressif-bot espressif-bot added the Status: Opened Issue is new label Jan 27, 2026
@udoudou udoudou force-pushed the bugfix/esp_websocket_client_thread_safe branch from e01a525 to d379c8c on January 27, 2026 11:11

@cursor cursor bot left a comment


Cursor Bugbot has reviewed your changes and found 2 potential issues.


@udoudou udoudou force-pushed the bugfix/esp_websocket_client_thread_safe branch from d379c8c to 3bbe73a on January 27, 2026 11:53
ESP_LOGE(TAG, "Error receive data");
esp_websocket_client_abort_connection(client, WEBSOCKET_ERROR_TYPE_TCP_TRANSPORT);
}
xSemaphoreGiveRecursive(client->lock);
Collaborator

@gabsuren gabsuren Feb 4, 2026


Won't removing the lock be unsafe for wss:// (SSL/TLS) connections?
The underlying SSL library maintains internal state that cannot be safely accessed by mbedtls_ssl_read and mbedtls_ssl_write simultaneously without external synchronization.
By releasing client->lock during esp_websocket_client_recv(), we allow a race condition here.

Contributor Author


The read interface is only called within the esp_websocket_client_task. Furthermore, the shutdown of the transport is now performed within the esp_websocket_client_task. Therefore, no additional lock protection is required.

Collaborator


I agree that recv is only called by the main task, so we don't have multiple threads trying to read at the same time.

However, my concern is about concurrent Read and Write.

With this change, the main task calls esp_transport_read() without holding the lock. While it is blocked there waiting for data, a user thread can call esp_websocket_client_send(), acquire the lock (since the main task doesn't have it), and call esp_transport_write().

This means esp_transport_read and esp_transport_write will execute simultaneously on the same transport handle.

For TCP (ws://): This is likely fine as the kernel handles concurrent socket access.
For SSL (wss://): This is generally unsafe. The underlying mbedtls context is updated by both read and write operations. If esp_transport_ssl doesn't have its own internal mutex (which I believe it doesn't), accessing the same SSL context from two threads at once will cause undefined behavior or crashes.
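For reference, the classic remedy for a context that is not safe for concurrent read and write is to funnel both operations through one mutex, at the cost of exactly the read/write parallelism this PR tries to gain. A toy sketch (pthreads, invented names; it only demonstrates that a shared lock prevents overlapping entry into the "context", it is not the esp_transport or mbedtls API):

```c
#include <assert.h>
#include <pthread.h>

static pthread_mutex_t io_lock = PTHREAD_MUTEX_INITIALIZER;
static int in_transport = 0;     /* threads currently inside the "SSL context" */
static int max_concurrent = 0;   /* should never exceed 1 if the lock works */

/* Stands in for mbedtls_ssl_read()/mbedtls_ssl_write(), which must not
 * run concurrently on one context without external synchronization. */
static void fake_transport_op(void)
{
    in_transport++;
    if (in_transport > max_concurrent) {
        max_concurrent = in_transport;
    }
    in_transport--;
}

/* Both the "reader" and the "writer" thread run this: every entry into
 * the shared context is serialized by io_lock. */
static void *locked_io(void *arg)
{
    (void)arg;
    for (int i = 0; i < 10000; i++) {
        pthread_mutex_lock(&io_lock);
        fake_transport_op();
        pthread_mutex_unlock(&io_lock);
    }
    return NULL;
}
```

If the blocking read were performed without io_lock while writers take it, both threads could be inside the context at once, which is the race being discussed for wss://.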

Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had not considered this point; I will look into this area and make further improvements.

ESP_LOGI(TAG, "Reconnect after %d ms", client->wait_timeout_ms);
client->state = WEBSOCKET_STATE_WAIT_TIMEOUT;
}
client->state = WEBSOCKET_STATE_WAIT_ABORT_CONNECT;
Collaborator


Could you please clarify the behavior regarding delayed cleanup?
Since esp_websocket_client_abort_connection now sets the state to WEBSOCKET_STATE_WAIT_ABORT_CONNECT instead of closing the transport immediately, it seems the actual closure is deferred to the main task loop, right?

If the main task is currently blocked inside esp_transport_read() (which runs without the lock), it will not process this state change until the read returns or times out (up to network_timeout_ms, default 10s).

This means related resources will remain open for that duration even after the user explicitly calls esp_websocket_client_stop() or _destroy(). This could prevent immediate resource reclamation or fast reconnection attempts if the user expects synchronous cleanup.

Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In fact, the original implementation could not immediately reclaim resources or reconnect quickly either, because esp_transport_read was protected by the lock: even if a user thread called an interface to close the transport directly, it still had to wait for esp_transport_read to return and release the lock before it could acquire the lock and perform the close. So I think the impact is small.

Furthermore, this design reduces the implementation to a single lock, which should be easier to manage and maintain, with no risk of multiple locks deadlocking against each other.

Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To speed up closing, my current idea is to obtain the socket file descriptor (fd) behind the transport and close that fd directly, so that blocked read/write operations in other tasks return immediately. The socket fd layer itself is designed to allow this, but once wrapped by esp_transport the race-condition risk needs careful evaluation. I currently see one potential hazard: after a user thread closes socket fd A, a third task might create a new socket and be assigned the same fd A, and esp_websocket_client_task would then directly or indirectly operate on the stale cached fd A. The current optimization does not yet account for this case.
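One POSIX-level alternative to closing the fd from another thread is shutdown(): it terminates the connection and unblocks a reader, but keeps the fd number allocated until its owner later calls close(), so the number cannot be recycled by a third task in the meantime. Whether esp_transport exposes an equivalent hook is exactly the open question here; this sketch only demonstrates the socket-layer behavior, with request_cancel as an invented helper name:

```c
#include <sys/socket.h>
#include <unistd.h>

/* Request cancellation of a blocked read on `fd` from another thread.
 * shutdown() makes the blocked read() return 0 (EOF) immediately, yet
 * the fd stays valid, and thus not reusable, until the owning task
 * eventually calls close(fd). */
static void request_cancel(int fd)
{
    shutdown(fd, SHUT_RDWR);
}
```

This sidesteps the fd-reuse race described above, at the cost of still needing the owning task to perform the final close().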

Collaborator

@gabsuren gabsuren Feb 11, 2026


I understand that the original implementation also has latency because stop() had to wait for the lock.

However, the difference is that stop() was synchronous. When it returned, the user was guaranteed that the resources (FD) were released.

With the new asynchronous cleanup (deferred to the task loop), stop() returns immediately, but the socket remains open. If the user code does:

esp_websocket_client_stop(client);
// ... destroy ...
esp_websocket_client_init(&new_config); // RESTART

The new init/start might fail because the old socket is still held by the blocked reader task.

Regarding the FD close optimization: I agree that closing the FD directly from another thread is very risky due to the reuse race condition you mentioned. It is safer to avoid that unless esp_transport provides a safe cancellation API.

@gabsuren
Collaborator

gabsuren commented Feb 4, 2026

@udoudou Thanks for the PR and for tackling the lock contention issue!
The concurrency handling looks great for improving throughput.
I left a few notes regarding SSL/TLS thread safety and delayed cleanup risks.

Best regards,
Suren

