
Conversation

@bobleesj
Collaborator

@bobleesj bobleesj commented Jan 9, 2026

Pretty brittle CUDA-accelerated loading of compressed .h5 files.

On a single L40S GPU (~40 GB), it takes about 0.5 s to load and decompress ~1 GB on disk into ~10 GB in memory.

The current hardware limit is disk-to-GPU memory transfer speed (~80-90% of total wall time).

Tested on Arina 4D-STEM data collected at NCEM with @smribet:

[Screenshot: Arina 4D-STEM load test, 2026-01-08]

Performance:

[Screenshot: performance results, 2026-01-08]

API:

from quantem.hpc.io import load, bin

data = load(HDF5_PATH)  # decompress the .h5 directly into GPU memory
bin(data, 2)  # bin the loaded data (factor of 2)
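For context, the general disk-to-VRAM staging pattern behind an interface like this might look roughly as follows. This is a hedged sketch, not the actual implementation in this PR: the function name and structure are hypothetical, and the filter-specific GPU decompression kernel is omitted.

import h5py
import numpy as np
import cupy as cp

def load_raw_chunks_to_gpu(path, dataset_name):
    """Stage the raw, still-compressed chunks of an HDF5 dataset in VRAM.
    Hypothetical sketch of the general pattern, not this PR's code."""
    gpu_chunks = []
    with h5py.File(path, "r") as f:
        dset = f[dataset_name]
        # iter_chunks() walks the chunk grid without decompressing on the CPU.
        for chunk_slices in dset.iter_chunks():
            offset = tuple(s.start for s in chunk_slices)
            # read_direct_chunk() returns the filter mask and the chunk's
            # bytes exactly as stored on disk (i.e., still compressed).
            _filter_mask, raw = dset.id.read_direct_chunk(offset)
            # Host-to-device copy of the compressed bytes; per the numbers
            # above, this transfer is where most of the wall time goes.
            gpu_chunks.append(cp.asarray(np.frombuffer(raw, dtype=np.uint8)))
    return gpu_chunks  # a decompression kernel would then run on these buffers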

I will keep this as a draft PR for now while I test it with other recently collected datasets and improve the API.

example_cuda_gpu_load.ipynb

@bobleesj bobleesj changed the title CUDA-kernel for disk to VRAM .h5 load/bin (~10x faster) CUDA kernel for loading compressed .h5 from disk to VRAM Jan 9, 2026
no __init__ needed for test file.
@arthurmccray
Collaborator

arthurmccray commented Jan 9, 2026

This looks great! Adding this comment to more widely raise a couple of questions we discussed in the group meeting:

  • This uses the cupy custom CUDA kernel method of launching (a minimal launch sketch follows this list), and we don't currently have cupy as a dependency.
    • We've messed about a little in the past with writing custom kernels that are torch-compatible and support gradient calculation (see this somewhat stale repo for an idea), but never came to a final conclusion.
    • I think it would be ideal if we could make a torch-only method of running our own kernels (with or without backpropagation), but I honestly don't know how difficult that would be.
  • We do need to be a little careful with licensing, especially for code such as CUDA kernels that we're adapting from other sources.
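For concreteness, here is a minimal sketch of the cupy launch mechanism being discussed (assuming cp.RawKernel; the add-one kernel is a toy illustration, not code from this PR):

import cupy as cp

# Toy elementwise kernel compiled and launched via cupy's RawKernel.
add_one = cp.RawKernel(r'''
extern "C" __global__
void add_one(const float* x, float* y, int n) {
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < n) y[i] = x[i] + 1.0f;
}
''', 'add_one')

x = cp.arange(1024, dtype=cp.float32)
y = cp.empty_like(x)
threads = 128
blocks = (x.size + threads - 1) // threads
add_one((blocks,), (threads,), (x, y, cp.int32(x.size)))  # (grid, block, args)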

It makes a lot of sense to me to figure out how we want to implement custom CUDA kernels, as I suspect we will have more use cases in the future, and we should probably do so sooner rather than later.

@bobleesj
Collaborator Author

@arthurmccray Thanks so much for the feedback.

custom kernels that are torch compatible and with gradient calculation

Indeed, integrating a CUDA kernel with PyTorch requires a bit of logistics (mainly passing gradient info, as you've mentioned). I've experimented with this before and can do a follow-up; a rough sketch of the plumbing is below.
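For reference, the usual gradient plumbing looks roughly like this (a minimal sketch; the placeholder functions stand in for actual CUDA kernel launches and are hypothetical, not from this PR):

import torch

# Placeholder "kernels" -- in practice these would be custom CUDA kernel
# launches (via a compiled extension, cupy interop, etc.).
def _forward_kernel(x):
    return x * 2.0

def _backward_kernel(grad_out):
    return grad_out * 2.0

class CustomKernelOp(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)  # stash inputs the backward pass may need
        return _forward_kernel(x)

    @staticmethod
    def backward(ctx, grad_out):
        (x,) = ctx.saved_tensors  # unused here, but this is where it'd go
        return _backward_kernel(grad_out)

device = "cuda" if torch.cuda.is_available() else "cpu"
x = torch.randn(4, device=device, requires_grad=True)
CustomKernelOp.apply(x).sum().backward()  # gradients flow through the op
print(x.grad)  # tensor([2., 2., 2., 2.])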

implement custom CUDA kernels,

Yes, I think we should leverage the available hardware. Given that the end-user API doesn't change (just log in to mallard, done, gain a 2-10x speedup), we should push for this.

I will also report back on the other topics you mentioned during the dev meeting.

@bobleesj bobleesj closed this Jan 11, 2026
@bobleesj bobleesj deleted the hpc-load branch January 11, 2026 06:56
@bobleesj bobleesj restored the hpc-load branch January 11, 2026 06:57
@bobleesj bobleesj deleted the hpc-load branch January 11, 2026 06:57
@bobleesj bobleesj restored the hpc-load branch January 11, 2026 06:59
@bobleesj bobleesj reopened this Jan 11, 2026