fix(websocket): Fix the bug related to multi-threaded access to transport and close contention (IDFGH-17155) #997
    ESP_LOGE(TAG, "Error receive data");
    esp_websocket_client_abort_connection(client, WEBSOCKET_ERROR_TYPE_TCP_TRANSPORT);
    }
    xSemaphoreGiveRecursive(client->lock);
Won't removing the lock be unsafe for wss:// (SSL/TLS) connections?
The underlying SSL library maintains internal state that cannot be safely accessed by mbedtls_ssl_read and mbedtls_ssl_write simultaneously without external synchronization.
By releasing client->lock during esp_websocket_client_recv(), we allow a race condition here.
The read interface is only called within the esp_websocket_client_task. Furthermore, the shutdown of the transport is now performed within the esp_websocket_client_task. Therefore, no additional lock protection is required.
I agree that recv is only called by the main task, so we don't have multiple threads trying to read at the same time.
However, my concern is about concurrent Read and Write.
With this change, the main task calls esp_transport_read() without holding the lock. While it is blocked there waiting for data, a user thread can call esp_websocket_client_send(), acquire the lock (since the main task doesn't have it), and call esp_transport_write().
This means esp_transport_read and esp_transport_write will execute simultaneously on the same transport handle.
For TCP (ws://): This is likely fine as the kernel handles concurrent socket access.
For SSL (wss://): This is generally unsafe. The underlying mbedtls context is updated by both read and write operations. If esp_transport_ssl doesn't have its own internal mutex (which I believe it doesn't), accessing the same SSL context from two threads at once will cause undefined behavior or crashes.
I hadn't yet considered this point; I will look into this area and make further improvements.
    ESP_LOGI(TAG, "Reconnect after %d ms", client->wait_timeout_ms);
    client->state = WEBSOCKET_STATE_WAIT_TIMEOUT;
    }
    client->state = WEBSOCKET_STATE_WAIT_ABORT_CONNECT;
Could you please clarify the behavior regarding delayed cleanup?
Since esp_websocket_client_abort_connection now sets the state to WEBSOCKET_STATE_WAIT_ABORT_CONNECT instead of closing the transport immediately, it seems the actual closure is deferred to the main task loop, right?
If the main task is currently blocked inside esp_transport_read() (which runs without the lock), it will not process this state change until the read returns or times out (up to network_timeout_ms, default 10s).
This means the related resources will remain open for that duration even after the user explicitly calls esp_websocket_client_stop() or _destroy(). This could prevent immediate resource reclamation or fast reconnection attempts if the user expects synchronous cleanup.
In fact, the original implementation also couldn't reclaim resources immediately or reconnect quickly, because esp_transport_read was protected by the lock: even if a user thread called an interface to close the transport directly, it still had to wait for esp_transport_read to return and release the lock before it could acquire the lock and perform the close. So I think this change has little impact.
Furthermore, this design reduces the scheme to a single lock, which should be easier to manage and maintain, with no risk of multiple locks deadlocking each other.
To speed up closing, my current idea is to obtain the socket file descriptor (fd) backing the transport and close that fd directly, which would let read/write operations in other tasks return immediately. The socket fd layer itself is designed to allow this, but after the esp_transport abstraction the race-condition risk still needs careful evaluation. I currently suspect one risk: after a user thread closes socket fd A, a third task might create a new socket and be handed the same fd A; esp_websocket_client_task would then operate, directly or indirectly, on the stale cached fd A. The current optimization does not account for that.
I understand that the original implementation also has latency because stop() had to wait for the lock.
However, the difference is that stop() was synchronous. When it returned, the user was guaranteed that the resources (FD) were released.
With the new asynchronous cleanup (deferred to the task loop), stop() returns immediately, but the socket remains open. If the user code does:
    esp_websocket_client_stop(client);
    // ... destroy ...
    esp_websocket_client_init(&new_config);  // RESTART
The new init/start might fail because the old socket is still held by the blocked reader task.
Regarding the FD close optimization: I agree that closing the FD directly from another thread is very risky due to the reuse race condition you mentioned. It is safer to avoid that unless esp_transport provides a safe cancellation API.
@udoudou Thanks for the PR and for tackling the lock contention issue! Best regards,
The `close` operation on the `client->transport` resource is performed under the protection of `client->lock`. `esp_websocket_client_send_with_exact_opcode` checks the state without holding the lock when determining whether the transport can be accessed. This can lead to some unexpected behaviors:

- If `tx_lock` is not enabled, calling `esp_transport_ws_send_raw` can block abnormally if the transport is already closed.
- If `tx_lock` is enabled, `esp_transport_close` and `esp_transport_ws_send_raw` may be called concurrently. Since `close` releases resources, this may lead to unpredictable behavior.

Description

To fix this issue, I adjusted the implementation of the resource mutual-exclusion protection.

1. `tx_lock` has been removed

The original purpose of introducing `tx_lock` was to allow receiving and sending to proceed in parallel, improving transmission efficiency. However, this does not require a separate `tx_lock`: the receive direction is accessed only by `esp_websocket_client_task`, while the send direction may be accessed concurrently by several user tasks and by `esp_websocket_client_task`. Therefore, `esp_websocket_client_task` needs no lock protection when receiving; the lock is taken only when `esp_websocket_client_task` or a user task sends, to check whether the transport is available.

2. `esp_websocket_client_abort_connection` now closes the transport asynchronously

`esp_websocket_client_abort_connection` now only modifies the state, preventing user tasks from closing the transport directly and conflicting with `esp_websocket_client_task`'s receiving on the transport.

The APIs involved in resource-access contention mainly include `esp_websocket_client_send_bin`, `esp_websocket_client_send_bin_partial`, `esp_websocket_client_send_text`, `esp_websocket_client_send_text_partial`, `esp_websocket_client_send_cont_msg`, `esp_websocket_client_send_fin`, `esp_websocket_client_send_with_opcode`, `esp_websocket_client_close`, and `esp_websocket_client_close_with_code`. All of them ultimately go through `esp_websocket_client_send_with_exact_opcode`. The main issue is contention over transport sends, `errormsg_buffer`, and `state` while the state is `WEBSOCKET_STATE_CONNECTED`. After the modification, `esp_websocket_client_send_with_exact_opcode` accesses and modifies these resources under lock protection. In `esp_websocket_client_task`, while the state is `WEBSOCKET_STATE_CONNECTED`, any transport send, any transition to a disconnected state, and any access to `errormsg_buffer` are performed under lock protection. The transport is closed only by `esp_websocket_client_task`, and only in a non-`WEBSOCKET_STATE_CONNECTED` state, ensuring that user-task sends never run concurrently with the close and that `esp_websocket_client_task` needs no lock protection when receiving.
Related
Testing
Checklist
Before submitting a Pull Request, please ensure the following:
Note

Refactors websocket client concurrency and connection lifecycle handling.

- Removes the `tx_lock` Kconfig options and all associated separate TX locking; uses a single `client->lock` to guard all transport sends, state, and error-buffer access
- Adds `WEBSOCKET_STATE_WAIT_ABORT_CONNECT` and makes `esp_websocket_client_abort_connection()` asynchronous (sets state, dispatches disconnect); the actual `esp_transport_close()` is now performed by the client task
- Send paths now take `client->lock` and re-check `state` before accessing the transport; adds locked aborts on read/write failures