Local resume task queue per task group. by MrGuin · Pull Request #2398 · apache/brpc

MrGuin · 2023-09-27T07:57:40Z

What problem does this PR solve?

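One problem this PR addresses is the performance collapse caused by a flood of signal_task calls when brpc runs with many worker threads. We observed that as bthread_concurrency grows, a large share of CPU time goes to TaskControl::signal_task and ParkingLot::signal, and more specifically to the atomic variable ParkingLot::_pending_signal. signal_task is a hot function: it is called on every bthread switch, every time a new bthread is created, and every time a waiting bthread resumes execution. The current brpc implementation of signal_task is fairly crude: every call signals the parking lots indiscriminately, so the increment of _pending_signal becomes a huge hot spot that consumes more than half of total CPU.
This is our CPU profile on a 64-core machine with bthread_concurrency set to 48: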

As the code comment says:

Current algorithm does not guarantee enough threads will be created to match caller's requests. But in another side, there's also many useless signalings according to current impl. Capping the concurrency is a good balance between performance and timeliness of scheduling.

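For worker wake-up, the change is to record the number of workers waiting on each parking lot and to call signal_task only when some worker is waiting. In addition, in steal_task a worker busy-polls for a short time before waiting on the parking lot, so that it waits less often.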
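The two ideas in the paragraph above (skip the signal when nobody is parked, and poll briefly before parking) can be sketched as follows. This is a minimal illustration under stated assumptions, not the PR's code: `CountingParkingLot`, its members, and the `spin` parameter are hypothetical names, and a `std::mutex` plus `std::condition_variable` stand in for the futex that brpc's ParkingLot uses. It also assumes that `has_work()` reads a properly synchronized queue.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <mutex>
#include <thread>

// Hypothetical illustration only; this is not brpc's ParkingLot.
class CountingParkingLot {
public:
    // Producer side. Returns true if a wake-up was issued, false if it was
    // skipped because no worker is parked.
    bool signal() {
        if (waiters_.load() == 0) {
            skipped_.fetch_add(1);  // cheap path: no shared wake-up state touched
            return false;
        }
        {
            std::lock_guard<std::mutex> lk(mu_);
            ++pending_;
        }
        cv_.notify_one();
        return true;
    }

    // Worker side. Polls has_work() for up to `spin`, then parks.
    // Returns false if work was found by polling, true if the worker parked
    // and consumed a signal.
    template <typename HasWork>
    bool wait(HasWork has_work, std::chrono::microseconds spin) {
        const auto deadline = std::chrono::steady_clock::now() + spin;
        while (std::chrono::steady_clock::now() < deadline) {
            if (has_work()) return false;
            std::this_thread::yield();
        }
        // Publish the waiter before the final check, so a producer that
        // enqueues after this check observes waiters_ > 0 and signals.
        waiters_.fetch_add(1);
        if (has_work()) {
            waiters_.fetch_sub(1);
            return false;
        }
        std::unique_lock<std::mutex> lk(mu_);
        cv_.wait(lk, [this] { return pending_ > 0; });
        --pending_;
        waiters_.fetch_sub(1);
        return true;
    }

    int waiters() const { return waiters_.load(); }
    int skipped() const { return skipped_.load(); }

private:
    std::atomic<int> waiters_{0};   // number of workers about to park or parked
    std::atomic<int> skipped_{0};   // signals skipped because nobody was waiting
    std::mutex mu_;
    std::condition_variable cv_;
    int pending_ = 0;               // signals issued but not yet consumed
};
```

The waiter count is incremented before the last `has_work()` check so that a producer either sees the waiter or the worker sees the task; checking in the opposite order could lose a wake-up.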

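The other change concerns how re-woken tasks are handled. In our scenario, an RPC bthread waits for our service to finish processing and is then woken up. In the existing butex_wake implementation, a brpc worker thread switches to the woken bthread immediately, but when a non-worker pthread calls butex_wake, the woken bthread is only put into the low-priority _remote_rq. We believe woken bthreads should be handled first, so each TaskGroup gets an additional _resume_rq that holds bthreads woken by external pthreads, and wait_task checks it first.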
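The queue priority described above can be sketched roughly like this. It is a simplified illustration, not the PR's implementation: `TaskGroupSketch`, `LockedQueue`, `TaskId`, and the method names are hypothetical, a mutex-protected deque stands in for whatever queue the PR uses, and the local run queue and stealing are left out.

```cpp
#include <cassert>
#include <deque>
#include <mutex>
#include <optional>

using TaskId = int;  // hypothetical stand-in for a bthread id

// Simple mutex-protected FIFO, used here for both cross-thread queues.
class LockedQueue {
public:
    void push(TaskId t) {
        std::lock_guard<std::mutex> lk(mu_);
        q_.push_back(t);
    }
    std::optional<TaskId> pop() {
        std::lock_guard<std::mutex> lk(mu_);
        if (q_.empty()) return std::nullopt;
        TaskId t = q_.front();
        q_.pop_front();
        return t;
    }

private:
    std::mutex mu_;
    std::deque<TaskId> q_;
};

// Hypothetical per-worker task group with a dedicated resume queue.
struct TaskGroupSketch {
    LockedQueue resume_rq;  // bthreads woken by non-worker pthreads
    LockedQueue remote_rq;  // bthreads started from non-worker pthreads

    void on_external_wake(TaskId t) { resume_rq.push(t); }
    void on_external_start(TaskId t) { remote_rq.push(t); }

    // Resumed tasks are drained before newly submitted remote tasks.
    std::optional<TaskId> next_task() {
        if (auto t = resume_rq.pop()) return t;
        return remote_rq.pop();
    }
};
```

With tasks 1 and 3 submitted remotely and task 2 woken externally, `next_task()` yields 2 first and then 1 and 3 in submission order.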

What is changed and the side effects?

Changed:

Side effects:

Performance effects:
Breaking backward compatibility:

Check List:

Please make sure your changes are compilable.
When providing us with a new feature, it is best to add related tests.
Please follow Contributor Covenant Code of Conduct.

wwbmmm · 2023-10-12T07:12:36Z

Could you describe what functionality this PR has implemented and the general implementation approach?

JimChengLin · 2023-10-14T02:40:58Z

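As I understand it, this MR does two main things:

1. It adds a waiter counter to ParkingLot; when there is no waiter it skips the futex wake, which saves a syscall. This is an optimization with no side effects. One small suggestion: when there is no waiter, there is no need to count the call as no signal either. At that point all workers are already running flat out, so accumulating signals serves no purpose.

2. It replaces the tg's remote task queue with a global queue. I have thought about this for a while. My concern is that this is a trade-off rather than a pure optimization. bthread currently steals by randomly probing all tgs. The downside is that when the tgs hold few tasks, steal scans back and forth. The upside is that when there are many tasks, cache-line contention is lower. A global queue could of course be split into multiple queues internally, but the trade-off is the same.
The steal strategy has two dimensions: one is load, the other is whether the goal is throughput or response time. Under low load you should use a global data structure; otherwise you should shard. If throughput is the goal you should batch wake-ups and dispatch; otherwise you should wake waiters aggressively. I don't see a way to satisfy all four at once.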
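To make the sharding side of this trade-off concrete, here is a small single-threaded sketch of stealing by probing per-group queues from a random starting point. It is a simplification with hypothetical names (`ShardedQueues`, `StealResult`), it omits locking, and it is not brpc's actual steal_task; the `probes` field only exists to show how many queues had to be scanned.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <optional>
#include <random>
#include <vector>

// Result of one steal attempt: the task found (if any) and how many
// shards were inspected to find it.
struct StealResult {
    std::optional<int> task;
    std::size_t probes;
};

// Hypothetical set of per-group queues; no locking, illustration only.
class ShardedQueues {
public:
    ShardedQueues(std::size_t n, unsigned seed) : shards_(n), rng_(seed) {}

    void push(std::size_t shard, int task) { shards_[shard].push_back(task); }

    // Start at a random shard and walk all shards once.
    StealResult steal() {
        std::uniform_int_distribution<std::size_t> pick(0, shards_.size() - 1);
        const std::size_t start = pick(rng_);
        for (std::size_t i = 0; i < shards_.size(); ++i) {
            std::deque<int>& q = shards_[(start + i) % shards_.size()];
            if (!q.empty()) {
                int t = q.front();
                q.pop_front();
                return {t, i + 1};
            }
        }
        return {std::nullopt, shards_.size()};  // every shard was empty
    }

private:
    std::vector<std::deque<int>> shards_;
    std::mt19937 rng_;
};
```

When only one of the shards holds a task, a steal may inspect up to all of them before finding it, which is the low-load cost mentioned above; with many tasks spread across shards, most steals hit on the first probe and touch different queues.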

Huawei recently had a paper at OSDI 2023: https://www.usenix.org/conference/osdi23/presentation/wang-jiawei
There is also one from NSDI 2022: https://www.usenix.org/system/files/nsdi22-paper-mcclure_2.pdf

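TL;DR: I think neither is quite good enough to serve as the default scheduler in production. The BWoS PR to tokio has been open for a long time and still has not been merged.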

MrGuin · 2023-10-16T04:03:06Z

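@wwbmmm These are some changes we made to brpc that were meant for our own fork; I opened the PR here by mistake, sorry. Let's take this opportunity to discuss it. I have added the context.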

wwbmmm · 2023-10-18T03:31:46Z

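For worker wake-up, the change is to record the number of workers waiting on each parking lot and to call signal_task only when some worker is waiting. In addition, in steal_task a worker busy-polls for a short time before waiting on the parking lot, so that it waits less often.

This is a fairly general optimization and the code change is small. Could you submit it as a separate PR?

The other change concerns how re-woken tasks are handled. In our scenario, an RPC bthread waits for our service to finish processing and is then woken up. In the existing butex_wake implementation, a brpc worker thread switches to the woken bthread immediately, but when a non-worker pthread calls butex_wake, the woken bthread is only put into the low-priority _remote_rq. We believe woken bthreads should be handled first, so each TaskGroup gets an additional _resume_rq that holds bthreads woken by external pthreads, and wait_task checks it first.

This requirement seems somewhat specific to your use case. Have you measured the actual performance gain from this change?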

chenBright · 2024-06-19T08:41:50Z

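For worker wake-up, the change is to record the number of workers waiting on each parking lot and to call signal_task only when some worker is waiting. In addition, in steal_task a worker busy-polls for a short time before waiting on the parking lot, so that it waits less often.

@MrGuin Has this change been running stably in production? Could you push to get it merged into the community version?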

MrGuin · 2024-06-20T10:01:41Z

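For worker wake-up, the change is to record the number of workers waiting on each parking lot and to call signal_task only when some worker is waiting. In addition, in steal_task a worker busy-polls for a short time before waiting on the parking lot, so that it waits less often.

@MrGuin Has this change been running stably in production? Could you push to get it merged into the community version?

OK, I'll find time to open a PR for this. We have been using it internally all along, and so far it has been reasonably stable. On the c7g.16xlarge instance type, forty to fifty worker threads can reach two to three million QPS. Performance scales normally with the number of workers and no longer collapses.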

zhengJade · 2024-08-27T02:32:35Z

Is there a plan to merge this? If not for now, I'll write it myself.

lzxddz and others added 14 commits June 14, 2023 14:08

add resume_rq for remote task and update wait_task(),sched(),ending_s…

abe5a5a

…ched(),ready_to_run_remote() of TaskGroup

add remote queue size bvar

0145eed

add bvar consume command and socket write latency; remove busy loop i…

b0e2b9e

…n wait_task

include fix

bd54270

remove duplicate header

8158abd

Update src/bthread/parking_lot.cpp

d48aa3c

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>

Update src/bthread/moodycamelqueue.h

86048e8

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>

Merge pull request #4 from monographdb/resume_q_zkl

d77640f

Add resume_rq for remote task and improve wait_task.

Add no_signal parameter to notify_one. (#5)

bc3eaab

* add no_signal parameter to notify_one
* define guard for bthread_cond_signal

Update minimum virsion requirements for dependancies. (#6)

03bc4be

The latest code relies on:
* C++11 -> C++17
* Glog minimum version >= 0.6.0

change static resume_rq to shared_ptr get from singleton object (#8)

73fb5a9

Add memory header file

0e8e5a4

include headers (#9)

62a3c88

set default behaviour for bthread_cond_signal to no signal (#7)

c9b7ad5

MrGuin force-pushed the resume_q_by_tg branch from 1477516 to 14bcfab Compare October 16, 2023 06:37

MrGuin force-pushed the resume_q_by_tg branch 2 times, most recently from de2ac82 to f9ce7cd Compare November 10, 2023 09:35

fix the problem that butex_wake does not signal pending tasks (#13)

2986f4d

MrGuin force-pushed the resume_q_by_tg branch from f9ce7cd to d86060e Compare December 13, 2023 08:42

MrGuin added 4 commits December 14, 2023 17:41

Redis transaction support. (#12)

846f5ac

* change redis txn and support watch
* update redis multi unit test

local resume_rq each task group

a70b93f

wait_task busy loop before waiting on PL

6686bfe

add bvar ready_to_run_skip_signal_task_per_second

7e81e6e

MrGuin force-pushed the resume_q_by_tg branch from d86060e to 7e81e6e Compare December 21, 2023 05:22

MrGuin added 2 commits January 9, 2024 15:50

change wait_task busy poll time from 100ms to 15ms

8009b20

check waiting_worker_num in signal_task

6269bf1

chenBright mentioned this pull request Jun 19, 2024

Some questions about the signal_task logic #2667

Open

chenBright mentioned this pull request Dec 13, 2024

Questions about bthread signal & wait #2849

Open

MrGuin wants to merge 21 commits into apache:master from monographdb:resume_q_by_tg

8 participants
