
Conversation


@cli99 cli99 commented Nov 13, 2024

What does this PR do?

This PR makes the following changes:

  1. In `get_memory_optimizer_state_and_gradient_per_layer`, update `memory_optimizer_state_others_per_layer` to account for layernorm parameters not being sharded across the tensor parallel processes (see the first sketch after this list).
  2. In `get_latency_fwd_per_tp_comm`, use `get_intra_node_bandwidth`, which adjusts the peak bandwidth by `intra_node_memory_efficiency`, when computing the allreduce latency (see the second sketch after this list).
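
A minimal sketch of the accounting behind change 1 (the names and signature here are hypothetical, not the repository's actual code): because layernorm weights are replicated on every tensor parallel rank rather than sharded, only the sharded weights' optimizer state shrinks with the TP size.

```python
# Sketch only: illustrates the sharded-vs-replicated split, assuming Adam-style
# optimizer state (fp32 momentum + fp32 variance = 8 bytes per parameter).
OPTIMIZER_STATE_BYTES_PER_PARAM = 8


def optimizer_state_memory_per_layer(
    num_params_sharded: int,    # attention/MLP params, split across TP ranks
    num_params_layernorm: int,  # layernorm params, replicated on every TP rank
    tp_size: int,
) -> int:
    """Per-GPU optimizer state memory for one transformer layer, in bytes.

    Sharded weights divide their optimizer state across the tensor parallel
    group; layernorm weights are replicated, so their state is NOT divided.
    """
    sharded = num_params_sharded * OPTIMIZER_STATE_BYTES_PER_PARAM / tp_size
    replicated = num_params_layernorm * OPTIMIZER_STATE_BYTES_PER_PARAM
    return int(sharded + replicated)
```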
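And a sketch of the latency model behind change 2, assuming the common ring-allreduce cost model (again, hypothetical names, not the repository's actual signatures): the peak intra-node bandwidth is derated by `intra_node_memory_efficiency` before it enters the latency formula.

```python
def latency_fwd_per_tp_comm(
    message_bytes: int,
    tp_size: int,
    intra_node_bandwidth: float,          # peak bytes/s, e.g. NVLink
    intra_node_memory_efficiency: float,  # achievable fraction of peak, e.g. 0.8
) -> float:
    """Ring allreduce latency for one TP communication, in seconds.

    A ring allreduce moves 2 * (n - 1) / n of the message per rank; the peak
    link bandwidth is scaled by intra_node_memory_efficiency to obtain the
    effective bandwidth used in the latency calculation.
    """
    effective_bw = intra_node_bandwidth * intra_node_memory_efficiency
    return 2 * (tp_size - 1) / tp_size * message_bytes / effective_bw
```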

What issue(s) does this change relate to?

Related to #27
