A suite of ML training benchmarks for Tigris object storage and TAG (Tigris Acceleration Gateway). Each benchmark targets a different data format or training topology used in modern ML pipelines.
| # | Name | Format | Dataset |
|---|---|---|---|
| 2 | ViT + ImageNet-21K | Individual JPEGs (~5.5M) | 500 GB ImageNet-21K |
| 3 | CLIP + WebDataset | TAR shards (4–256 MB) | 500 GB DataComp |
| 4 | DINOv2 + Lance Embedding Pipeline | Lance columnar | 500 GB synthetic embeddings |
| 5 | Multi-Node Distributed ViT | WebDataset TAR shards | 500 GB DataComp |
| 6 | s3torchbenchmarking (upstream) | Dataset + checkpointing | Synthetic |
| og | ViT + S3IterableDataset | Individual JPEGs | 10 GB synthetic |
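Benchmarks 3 and 5 stream those TAR shards with WebDataset. As a minimal sketch of what consuming such shards looks like (the shard URL pattern and bucket name here are hypothetical, not the benchmark's actual layout):

```python
import webdataset as wds

# Hypothetical shard pattern; the real bucket/prefix may differ.
urls = "pipe:aws s3 cp s3://example-bucket/shard-{000000..000127}.tar -"

dataset = (
    wds.WebDataset(urls)
    .decode("pil")            # decode image members with PIL
    .to_tuple("jpg", "json")  # yield (image, metadata) pairs
)

for image, meta in dataset:
    pass  # hand off to a DataLoader / training loop
```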
See RUNBOOK.md for a full step-by-step execution guide.
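Tigris speaks the S3 API, so the scripts are expected to pick up standard AWS-style credentials. A minimal sketch, assuming the underlying S3 client honors `AWS_ENDPOINT_URL_S3` (key values are placeholders):

```bash
# Hypothetical credential setup; substitute your own Tigris keys.
export AWS_ACCESS_KEY_ID=tid_xxxxxxxx
export AWS_SECRET_ACCESS_KEY=tsec_xxxxxxxx
export AWS_ENDPOINT_URL_S3=https://fly.storage.tigris.dev
export AWS_REGION=auto
```

With credentials exported, the quick start is: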
```bash
# On a g5.8xlarge with GPU, NVMe, and Tigris credentials configured:
cd benchmark-2/

# Upload dataset (~12–24 hours for 500 GB)
python upload_imagenet21k.py --subset-size 500 --bucket s3torch-benchmark-2

# Run entitlement benchmark (raw I/O ceiling, no GPU bottleneck)
python benchmark.py --config configs/entitlement.yaml

# Run training benchmark
python benchmark.py --config configs/training.yaml
```

Repository layout:

```
tag-benchmarks/
├── README.md                 # This file
├── RUNBOOK.md                # Step-by-step Benchmark 2 execution guide
├── benchmark-2/              # ViT + ImageNet-21K
│   ├── benchmark.py
│   ├── upload_imagenet21k.py
│   ├── collect_results.py
│   ├── monitor.py
│   └── configs/
├── benchmark-3/              # CLIP + WebDataset
│   ├── benchmark.py
│   ├── prepare_webdataset.py
│   └── configs/
├── benchmark-4/              # DINOv2 + Lance (embedding pipeline)
│   ├── benchmark.py
│   ├── prepare_lance.py
│   └── configs/
├── benchmark-5/              # Multi-node distributed ViT
│   ├── benchmark.py
│   ├── setup_node.sh
│   ├── launch_distributed.sh
│   └── configs/
├── benchmark-6/              # s3torchbenchmarking (Hydra-based)
│   ├── src/
│   ├── conf/
│   └── utils/
└── benchmark-og/             # AWS S3 data loading benchmark replication
    ├── benchmark.py
    └── configs/
```
Why g5.8xlarge? It has a useful ratio of NVMe (900 GB) to RAM (128 GB): TAG's NVMe cache holds the full 500 GB dataset, while the OS page cache can hold only ~25% of it, making TAG's contribution clearly measurable. Bigger instances (p4d, p5) have so much RAM that the page cache masks TAG's benefit unless datasets are multi-TB.
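The cache math, spelled out with the sizes from the paragraph above:

```python
ram_gb, nvme_gb, dataset_gb = 128, 900, 500

assert nvme_gb >= dataset_gb  # TAG's NVMe cache fits the whole dataset
print(f"page cache coverage: {ram_gb / dataset_gb:.0%}")  # ~26% of the dataset
```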
Why entitlement tests? GPU-bound training benchmarks mask I/O differences — the A10G saturates at a fixed throughput regardless of storage backend. The entitlement test (no-op model) removes the GPU bottleneck and measures raw data pipeline throughput. This mirrors MLPerf Storage methodology.
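A minimal sketch of an entitlement-style loop (hypothetical; the actual benchmark.py will differ in wiring and reporting): drain the DataLoader with no model attached and report raw samples/s.

```python
import time
from torch.utils.data import DataLoader

def measure_entitlement(dataset, batch_size=256, num_workers=8):
    # No-op consumer: pull batches as fast as the storage/decode
    # pipeline can deliver them, with no forward/backward pass in the way.
    loader = DataLoader(dataset, batch_size=batch_size, num_workers=num_workers)
    n, t0 = 0, time.perf_counter()
    for images, _labels in loader:  # assumes (image, label) samples
        n += images.size(0)         # count samples; do no compute
    dt = time.perf_counter() - t0
    print(f"entitlement: {n / dt:.1f} samples/s over {dt:.1f}s")
```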
Why 500 GB datasets? 500 GB is roughly 4× system RAM (128 GB), close to the 5× sizing rule MLPerf Storage uses. Smaller datasets risk the page cache masking TAG's contribution.