Question about the training time #14

@Hugo-cell111

Hi! I noticed the following in the paper:

We train the 160M model on 1B-token datasets using a single NVIDIA A100
40GB GPU. For experiments with the 160M, 470M, and 1B models on 10B and 50B-token datasets,
we utilize 8 NVIDIA A100 40GB GPUs. All data scoring steps, including proxy data annotation,
scorer training, and data scoring, are performed on a single NVIDIA A100 80GB GPU.

In the "Annotate proxy data" step, the PMP-Solver is trained. How long does this step take on an H800 or an A100 80GB GPU? And can I speed it up with multi-GPU parallelism using the DeepSpeed framework? Thanks!
