PytorchStreamReader failed reading file data.pkl: file read failed #37

Description

@andreanans

DeepDeWedge training very often crashes with seemingly random file read errors when running on multiple GPUs, for example:

“Error loading .//subtomos/fitting_subtomos/subtomo0/229.pt
Error message is: PytorchStreamReader failed reading file data.pkl: file read failed”
Each time this happens, the job waits for a while and retries, repeating until it can read the file and continue. Sometimes, however, the job simply crashes after N failed attempts. The file is not missing: it was written by a previous job, so this is not a case of a write failing while the current job is running.
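For reference, the retry-then-crash behaviour described above can be sketched roughly as follows. This is a hypothetical illustration, not DeepDeWedge's actual loading code: the names `load_with_retry`, `loader`, `max_attempts` and `delay_s` are made up, and catching `RuntimeError` assumes that is the exception type the failed read surfaces as.

```python
import time


def load_with_retry(path, loader, max_attempts=5, delay_s=1.0):
    """Call loader(path), retrying on failure and giving up after max_attempts.

    Hypothetical sketch: `loader` is any callable that reads a file,
    for example torch.load.
    """
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return loader(path)
        except (OSError, RuntimeError) as err:  # assumed failure types
            last_error = err
            print(f"Error loading {path}\nError message is: {err}")
            if attempt < max_attempts:
                time.sleep(delay_s)  # wait before the next attempt
    # All attempts failed: this corresponds to the job crashing.
    raise RuntimeError(
        f"Giving up on {path} after {max_attempts} attempts"
    ) from last_error
```

With PyTorch installed, a call might look like `load_with_retry(path, torch.load, max_attempts=10, delay_s=2.0)`. A loop like this only works around transient read failures; it does not explain why the reads fail in the first place.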

Is this issue caused by multiple GPUs trying to access the same file? Is there anything we can do to avoid this error?