Data Mounting

How C3 stores data

C3 keeps all data (datasets you upload and artifacts your jobs produce) on a centralised storage server with high-bandwidth connections to GPU nodes around the world. Think of it as a warehouse on a motorway network: the links between the warehouse and the GPUs are fast. Uploading from your local machine is the bottleneck, since home and office connections to remote servers are much slower than server-to-server transfers. The good news is that you only need to upload once; after that, C3 moves data between its storage and the GPUs at full speed.

Paths

Data in C3 can be organised by project, by job, or both:

Path                               What it contains
/datasets/{name}/                  Uploaded datasets
/jobs/{jobId}/                     Job output artifacts
/projects/{project}/data/{name}/   Datasets scoped to a project
/projects/{project}/jobs/{jobId}/  Job artifacts scoped to a project

You can use whichever path style suits your workflow. /jobs/{jobId}/ resolves the project automatically.

Upload a dataset

c3 data cp ./local-data/ /datasets/my-dataset/

This uploads your data to C3's centralised storage. You only need to do this once. After the initial upload, every c3 deploy that references this dataset gets rapid access to it directly from the storage network, with no re-upload needed.

C3 uses content-addressed deduplication: each file is hashed (SHA256) before upload, and if the content already exists, the upload is skipped. This means re-uploading a dataset with minor changes only transfers the files that actually changed, and overall storage usage can be lower than with storage that keeps full copies, since identical files are never stored twice (see How deduplication works below).
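The skip-if-exists check can be sketched as follows. This is an illustrative model, not the C3 client's actual code; the helper names and the idea of the server exposing a set of known hashes are assumptions:

```python
import hashlib
from pathlib import Path

def sha256_file(path: Path) -> str:
    """Compute a file's content address by hashing it in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def plan_upload(local_dir: Path, remote_hashes: set[str]) -> list[Path]:
    """Return only the files whose content the server doesn't already hold."""
    return [
        p for p in sorted(local_dir.rglob("*"))
        if p.is_file() and sha256_file(p) not in remote_hashes
    ]
```

A file edited locally gets a new hash and is re-uploaded; untouched files are skipped entirely, which is why a second `c3 data cp` of a mostly unchanged dataset is fast.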

Browse data

Use c3 data ls to browse datasets, versions, and files:

c3 data ls /datasets/                        # List all datasets
c3 data ls /datasets/my-dataset/             # List versions
c3 data ls -l /datasets/my-dataset/@latest/  # List files in latest version

Mount a dataset in a job

Reference the dataset in your .c3 config:

datasets:
  - ref: /datasets/my-dataset
    mount: /data/my-dataset

Once referenced, C3 handles moving the data to whichever GPU your job lands on. From your script's perspective, the files are simply local at the mount path. You read them like any other files:

import numpy as np

data = np.loadtxt("/data/my-dataset/measurements.csv", delimiter=",")

Mount path rules

  • Mount paths must be absolute (start with /). Relative paths are rejected at submission time with a clear error.
  • If mount is omitted, it is auto-derived as /data/<dataset-name> (e.g., /datasets/cifar10 becomes /data/cifar10).
  • In .c3 YAML, a relative mount like mydata is auto-prefixed to /data/mydata.

Local directories

You can reference a local directory as a dataset. C3 auto-uploads it before submitting the job:

datasets:
  - ref: ./local-data
    mount: /data/train

This is equivalent to running c3 data cp ./local-data/ /datasets/... yourself, but handled automatically.

Versioning

Every upload creates a new version. Your jobs always get exactly the data they expect:

c3 data log /datasets/my-dataset/
VERSION   CREATED               FILES   SIZE
v3        2024-01-15 10:00:00   1000    2.5GB
v2        2024-01-10 09:00:00   1000    2.4GB
v1        2024-01-05 08:00:00   500     1.2GB

Jobs reference the latest version by default, or you can pin to a specific version for reproducibility.
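Pinning might look like this in a .c3 config. The `@vN` selector here mirrors the `@latest` syntax shown in the c3 data ls examples above, but the exact form is an assumption; check your C3 version's reference:

```yaml
datasets:
  - ref: /datasets/my-dataset@v2   # pin to version v2 (assumed @vN selector)
    mount: /data/my-dataset
```

Pinned jobs keep reading the same bytes even after later uploads create v3, v4, and so on.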

How deduplication works

All data in C3 (datasets, workspaces, and job artifacts) uses the same content-addressed storage. Every file is stored as a blob keyed by its SHA256 hash, and a manifest lists which blobs make up each dataset, workspace, or set of job artifacts.

This means:

  • Cross-job dedup: If two jobs produce identical output files, the data is stored once
  • Workspace dedup: Re-deploying the same code skips uploading unchanged files
  • Cross-dataset dedup: Identical files shared across datasets use the same storage
  • Instant re-uploads: c3 data cp only uploads files that have actually changed

Deduplication is automatic and transparent. Artifacts still appear per-job (each job has its own listing), but identical files across jobs share storage behind the scenes.
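The blob-plus-manifest layout can be modelled in a few lines. This is a toy in-memory sketch, not C3's implementation; the class and method names are invented for illustration:

```python
import hashlib

class BlobStore:
    """Content-addressed store: blobs keyed by SHA256, one manifest per item."""

    def __init__(self):
        self.blobs = {}      # hash -> bytes, each unique content stored once
        self.manifests = {}  # item name -> {filename: hash}

    def put(self, item: str, files: dict[str, bytes]) -> int:
        """Store `files` under `item`; return how many new blobs were written."""
        manifest, new = {}, 0
        for name, content in files.items():
            h = hashlib.sha256(content).hexdigest()
            if h not in self.blobs:
                self.blobs[h] = content
                new += 1
            manifest[name] = h
        self.manifests[item] = manifest
        return new
```

Two jobs that produce an identical output file each get their own manifest entry (their own listing), but the second `put` writes zero new blobs, which is exactly the cross-job dedup described above.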

Data commands

Command               Description
c3 data ls /path/     List files, datasets, or job artifacts
c3 data cp SRC DST    Copy files (upload or download)
c3 data rm -r /path/  Delete data (-r required for datasets)
c3 data du /path/     Show disk usage
c3 data log /path/    Show version history