fix(websocket): Fix the bug related to multi-threaded access to transport and close contention (IDFGH-17155) #997
    ESP_LOGE(TAG, "Error receive data");
    esp_websocket_client_abort_connection(client, WEBSOCKET_ERROR_TYPE_TCP_TRANSPORT);
    }
    xSemaphoreGiveRecursive(client->lock);
Won't removing the lock be unsafe for wss:// (SSL/TLS) connections?
The underlying SSL library maintains internal state that cannot be safely accessed by mbedtls_ssl_read and mbedtls_ssl_write simultaneously without external synchronization.
By releasing client->lock during esp_websocket_client_recv(), we allow a race condition here.
The read interface is only called within the esp_websocket_client_task. Furthermore, the shutdown of the transport is now performed within the esp_websocket_client_task. Therefore, no additional lock protection is required.
I agree that recv is only called by the main task, so we don't have multiple threads trying to read at the same time.
However, my concern is about concurrent Read and Write.
With this change, the main task calls esp_transport_read() without holding the lock. While it is blocked there waiting for data, a user thread can call esp_websocket_client_send(), acquire the lock (since the main task doesn't have it), and call esp_transport_write().
This means esp_transport_read and esp_transport_write will execute simultaneously on the same transport handle.
For TCP (ws://): This is likely fine as the kernel handles concurrent socket access.
For SSL (wss://): This is generally unsafe. The underlying mbedtls context is updated by both read and write operations. If esp_transport_ssl doesn't have its own internal mutex (which I believe it doesn't), accessing the same SSL context from two threads at once will cause undefined behavior or crashes.
I hadn't yet considered this point; I will look into this area and make further improvements.
    ESP_LOGI(TAG, "Reconnect after %d ms", client->wait_timeout_ms);
    client->state = WEBSOCKET_STATE_WAIT_TIMEOUT;
    }
    client->state = WEBSOCKET_STATE_WAIT_ABORT_CONNECT;
Could you please clarify the behavior regarding delayed cleanup?
Since esp_websocket_client_abort_connection now sets the state to WEBSOCKET_STATE_WAIT_ABORT_CONNECT instead of closing the transport immediately, it seems the actual closure is deferred to the main task loop, right?
If the main task is currently blocked inside esp_transport_read() (which runs without the lock), it will not process this state change until the read returns or times out (up to network_timeout_ms, default 10s).
This means the related resources will remain open for that duration even after the user explicitly calls esp_websocket_client_stop() or _destroy(). This could prevent immediate resource reclamation or fast reconnection attempts if the user expects synchronous cleanup.
In fact, the original implementation also couldn't reclaim resources immediately or reconnect quickly, because esp_transport_read was protected by the lock: even if a user thread called an interface to close the transport directly, it still had to wait for esp_transport_read to return and release the lock before it could acquire the lock and perform the close. So I think this change has little impact.
Furthermore, this design reduces the scheme to a single lock, which should be easier to manage and maintain, with no risk of multiple locks deadlocking each other.
To speed up closing, my current idea is to obtain the socket file descriptor (fd) backing the transport and close that fd directly, which would let read/write operations in other tasks return immediately. The socket fd layer itself is designed to allow this, but after the esp_transport abstraction the race-condition risk still needs careful evaluation. I currently suspect one risk: after a user thread closes socket fd A, a third task might create a new socket and be handed the same fd A; esp_websocket_client_task would then operate, directly or indirectly, on the stale cached fd A. The current optimization does not account for that.
I understand that the original implementation also has latency because stop() had to wait for the lock.
However, the difference is that stop() was synchronous. When it returned, the user was guaranteed that the resources (FD) were released.
With the new asynchronous cleanup (deferred to the task loop), stop() returns immediately, but the socket remains open. If the user code does:
    esp_websocket_client_stop(client);
    // ... destroy ...
    esp_websocket_client_init(&new_config);  // RESTART
The new init/start might fail because the old socket is still held by the blocked reader task.
Regarding the FD close optimization: I agree that closing the FD directly from another thread is very risky due to the reuse race condition you mentioned. It is safer to avoid that unless esp_transport provides a safe cancellation API.
@udoudou Thanks for the PR and for tackling the lock contention issue! Best regards,
The `close` operation on the `client->transport` resource is performed under the protection of `client->lock`. `esp_websocket_client_send_with_exact_opcode` checks the state without holding the lock when determining whether the transport can be accessed. This can lead to some unexpected behaviors:

- If `tx_lock` is not enabled, calling `esp_transport_ws_send_raw` can block abnormally if the transport is already closed.
- If `tx_lock` is enabled, `esp_transport_close` and `esp_transport_ws_send_raw` may be called concurrently. Since `close` releases resources, this may lead to unpredictable behavior.

Description

To fix this issue, I adjusted the implementation of the resource mutual-exclusion protection.

1. `tx_lock` has been removed

The original purpose of introducing `tx_lock` was to allow receiving and sending to proceed in parallel, improving transmission efficiency. However, this does not require a separate `tx_lock`: the receive direction is accessed only by `esp_websocket_client_task`, while the send direction may be accessed concurrently by several user tasks and by `esp_websocket_client_task`. Therefore, `esp_websocket_client_task` needs no lock protection when receiving; the lock is taken only when `esp_websocket_client_task` or a user task sends, to check whether the transport is available.

2. `esp_websocket_client_abort_connection` now closes the transport asynchronously

`esp_websocket_client_abort_connection` now only modifies the state, preventing user tasks from closing the transport directly and conflicting with `esp_websocket_client_task`'s receiving on the transport.

The APIs involved in resource-access contention mainly include `esp_websocket_client_send_bin`, `esp_websocket_client_send_bin_partial`, `esp_websocket_client_send_text`, `esp_websocket_client_send_text_partial`, `esp_websocket_client_send_cont_msg`, `esp_websocket_client_send_fin`, `esp_websocket_client_send_with_opcode`, `esp_websocket_client_close`, and `esp_websocket_client_close_with_code`. All of them ultimately go through `esp_websocket_client_send_with_exact_opcode`. The main issue is contention over transport sends, `errormsg_buffer`, and `state` while the state is `WEBSOCKET_STATE_CONNECTED`. After the modification, `esp_websocket_client_send_with_exact_opcode` accesses and modifies these resources under lock protection. In `esp_websocket_client_task`, while the state is `WEBSOCKET_STATE_CONNECTED`, any transport send, any transition to a disconnected state, and any access to `errormsg_buffer` are performed under lock protection. The transport is closed only by `esp_websocket_client_task`, and only in a non-`WEBSOCKET_STATE_CONNECTED` state, ensuring that user-task sends never run concurrently with the close and that `esp_websocket_client_task` needs no lock protection when receiving.
Related
Testing
Checklist
Before submitting a Pull Request, please ensure the following:
Note

Refactors websocket client concurrency and connection lifecycle handling.

- Removes the `tx_lock` Kconfig options and all associated separate TX locking; uses a single `client->lock` to guard all transport sends, state, and error-buffer access
- Adds `WEBSOCKET_STATE_WAIT_ABORT_CONNECT` and makes `esp_websocket_client_abort_connection()` asynchronous (sets state, dispatches disconnect); the actual `esp_transport_close()` is now performed by the client task
- Send paths now take `client->lock` and re-check `state` before accessing the transport; adds locked aborts on read/write failures