Data Mounting
How C3 stores data
C3 keeps all data (datasets you upload and artifacts your jobs produce) on a centralised storage server with high-bandwidth links to GPU nodes around the world. Think of it like a warehouse on a motorway network: transfers between the warehouse and the GPUs are fast, but uploading from your local machine is the bottleneck, since home and office connections to remote servers are much slower than server-to-server transfers. The good news is you only need to upload once; after that, C3 moves data between its storage and GPUs at full speed.
Paths
Data in C3 can be organised by project, by job, or both:
| Path | What it contains |
|---|---|
| `/datasets/{name}/` | Uploaded datasets |
| `/jobs/{jobId}/` | Job output artifacts |
| `/projects/{project}/data/{name}/` | Datasets scoped to a project |
| `/projects/{project}/jobs/{jobId}/` | Job artifacts scoped to a project |
You can use whichever path style suits your workflow. `/jobs/{jobId}/` resolves the project automatically.
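The automatic project resolution for the short `/jobs/{jobId}/` form can be sketched in Python. This is an illustrative model, not C3's implementation: the `job_index` lookup table standing in for C3's server-side job registry is an assumption.

```python
def resolve_job_path(job_id: str, job_index: dict[str, str]) -> str:
    """Expand the short /jobs/{jobId}/ form into its project-scoped
    equivalent, given a jobId -> project lookup (hypothetical stand-in
    for C3's server-side job registry)."""
    project = job_index[job_id]
    return f"/projects/{project}/jobs/{job_id}/"
```

Both path styles name the same artifacts; the short form is just a convenience when you know the job ID but not the project.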
Upload a dataset
```shell
c3 data cp ./local-data/ /datasets/my-dataset/
```
This uploads your data to C3's centralised storage. You only need to do this once. After the initial upload, every c3 deploy that references this dataset gets rapid access to it directly from the storage network, with no re-upload needed.
C3 uses content-addressed deduplication: each file is hashed (SHA256) before upload, and if the content already exists, the upload is skipped. This means re-uploading a dataset with minor changes only transfers the files that actually changed, and overall storage usage can be lower than standard methods since identical files are never stored twice (see How deduplication works below).
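The client-side upload check can be sketched as follows. This is a minimal model of the behaviour described above, not C3's actual code; `plan_upload` and the in-memory `existing_blobs` set are assumptions standing in for the real storage index.

```python
import hashlib

def sha256_of(content: bytes) -> str:
    """Return the hex SHA256 digest used as the blob key."""
    return hashlib.sha256(content).hexdigest()

def plan_upload(files: dict[str, bytes], existing_blobs: set[str]) -> list[str]:
    """Return the paths whose content is not yet in storage.

    `files` maps path -> content; `existing_blobs` is the set of SHA256
    keys already on the server (a hypothetical stand-in for C3's index).
    Only files whose hash is unknown need to be transferred.
    """
    return [path for path, content in files.items()
            if sha256_of(content) not in existing_blobs]
```

Re-uploading a dataset where only one file changed would therefore transfer just that file, since every unchanged file's hash is already present on the server.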
Browse data
Use `c3 data ls` to browse datasets, versions, and files:

```shell
c3 data ls /datasets/                          # List all datasets
c3 data ls /datasets/my-dataset/               # List versions
c3 data ls -l /datasets/my-dataset/@latest/    # List files in latest version
```
Mount a dataset in a job
Reference the dataset in your `.c3` config:

```yaml
datasets:
  - ref: /datasets/my-dataset
    mount: /data/my-dataset
```
Once referenced, C3 handles moving the data to whichever GPU your job lands on. From your script's perspective, the files are simply local at the mount path. You read them like any other files:
```python
import numpy as np

data = np.loadtxt("/data/my-dataset/measurements.csv", delimiter=",")
```
Mount path rules
- Mount paths must be absolute (start with `/`). Relative paths are rejected at submission time with a clear error.
- If `mount` is omitted, it is auto-derived as `/data/<dataset-name>` (e.g., `/datasets/cifar10` becomes `/data/cifar10`).
- In `.c3` YAML, a relative mount like `mydata` is auto-prefixed to `/data/mydata`.
Local directories
You can reference a local directory as a dataset. C3 auto-uploads it before submitting the job:
```yaml
datasets:
  - ref: ./local-data
    mount: /data/train
```
This is equivalent to running `c3 data cp ./local-data/ /datasets/...` yourself, but handled automatically.
Versioning
Every upload creates a new version. Your jobs always get exactly the data they expect:
```shell
c3 data log /datasets/my-dataset/

VERSION  CREATED              FILES  SIZE
v3       2024-01-15 10:00:00  1000   2.5GB
v2       2024-01-10 09:00:00  1000   2.4GB
v1       2024-01-05 08:00:00  500    1.2GB
```
Jobs reference the latest version by default, or you can pin to a specific version for reproducibility.
How deduplication works
All data in C3 (datasets, workspaces, and job artifacts) uses the same content-addressed storage. Every file is stored as a blob keyed by its SHA256 hash, and a manifest lists which blobs make up each dataset, workspace, or set of job artifacts.
This means:
- Cross-job dedup: If two jobs produce identical output files, the data is stored once
- Workspace dedup: Re-deploying the same code skips uploading unchanged files
- Cross-dataset dedup: Identical files shared across datasets use the same storage
- Instant re-uploads: `c3 data cp` only uploads files that have actually changed
Deduplication is automatic and transparent. Artifacts still appear per-job (each job has its own listing), but identical files across jobs share storage behind the scenes.
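The blob-plus-manifest scheme described above can be sketched as follows. This is a toy in-memory model of the storage layout, not C3's implementation; the `store` function and its dict-based blob store are assumptions for illustration.

```python
import hashlib

def sha256(content: bytes) -> str:
    return hashlib.sha256(content).hexdigest()

def store(blobs: dict[str, bytes], files: dict[str, bytes]) -> dict[str, str]:
    """Store each file as a blob keyed by its SHA256 hash and return the
    manifest (path -> blob key). Identical content reuses the same blob,
    so two manifests can point at one stored copy."""
    manifest = {}
    for path, content in files.items():
        key = sha256(content)
        blobs.setdefault(key, content)   # stored once, shared thereafter
        manifest[path] = key
    return manifest
```

In this model, two jobs that each produce an identical output file get separate manifests (so each job still has its own listing) but only one blob is stored, which is exactly the per-job view over shared storage described above.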
Data commands
| Command | Description |
|---|---|
| `c3 data ls /path/` | List files, datasets, or job artifacts |
| `c3 data cp SRC DST` | Copy files (upload or download) |
| `c3 data rm -r /path/` | Delete a dataset (requires `-r` for datasets) |
| `c3 data du /path/` | Show disk usage |
| `c3 data log /path/` | Show version history |