Cloud Storage (S3 and GCS)
Yoda's DataFusion backend can store OLAP data as Parquet files on Amazon S3 or Google Cloud Storage instead of the local filesystem. This is useful for large analytical datasets that exceed local disk capacity, or for multi-host deployments where OLAP data needs to live in shared object storage.
Feature flag
Enable the cloud-storage feature on the yoda dependency in your Cargo.toml:
```toml
[dependencies]
yoda = { version = "1", features = ["cloud-storage"] }
```

Or, when building from a checkout of the workspace:
```sh
cargo build --features cloud-storage
```

DataFusion only
Cloud storage modes are available exclusively with the DataFusion OLAP backend. DuckDB uses its own storage path and does not support the cloud-storage feature.
Storage modes
The DataFusion backend supports five StorageMode variants:
| Mode | TOML value | Description |
|---|---|---|
| `InMemory` | `"memory"` / `"inmemory"` | Default. Data lost on shutdown. |
| `ArrowIpc` | `"arrow_ipc"` / `"ipc"` | Local `.arrow` files. Fast, no compression. Requires `path`. |
| `Parquet` | `"parquet"` | Local `.parquet` files. Compressed, predicate pushdown. Requires `path`. |
| `S3Parquet` | `"s3-parquet"` | Parquet on Amazon S3. Requires `url`. `cloud-storage` feature. |
| `GcsParquet` | `"gcs-parquet"` | Parquet on Google Cloud Storage. Requires `url`. `cloud-storage` feature. |
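The table above maps TOML strings onto enum variants. As an illustrative sketch (the real definition, with full doc-comments, lives in `crates/yoda-datafusion/src/storage.rs`, and the parser in `crates/yoda-tui/src/config.rs`), the mapping might look like:

```rust
// Sketch of the StorageMode enum and its TOML-string parser.
// Variant and function names mirror the docs; details are assumptions.
#[derive(Debug, PartialEq)]
enum StorageMode {
    InMemory,
    ArrowIpc,
    Parquet,
    S3Parquet,
    GcsParquet,
}

fn parse_storage_mode(s: &str) -> Option<StorageMode> {
    match s.to_ascii_lowercase().as_str() {
        "memory" | "inmemory" => Some(StorageMode::InMemory),
        "arrow_ipc" | "ipc" => Some(StorageMode::ArrowIpc),
        "parquet" => Some(StorageMode::Parquet),
        "s3-parquet" => Some(StorageMode::S3Parquet),
        "gcs-parquet" => Some(StorageMode::GcsParquet),
        _ => None,
    }
}
```

Unrecognized strings yield `None`, which the config loader would surface as a configuration error.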
S3 backend
URL format
```
s3://<bucket>/<prefix>
```

Example: `s3://my-analytics-bucket/yoda-data`
Per-table subdirectories are created beneath the prefix automatically.
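The per-table layout can be modeled as simple URL joining. This is a hypothetical helper for illustration only (the actual path handling lives inside the DataFusion backend):

```rust
// Hypothetical helper: derive a table's object-storage location from the
// configured base url. A trailing slash on the prefix is tolerated.
fn table_url(base: &str, table: &str) -> String {
    format!("{}/{}", base.trim_end_matches('/'), table)
}
```

So with `url = "s3://my-analytics-bucket/yoda-data"`, the `orders` table's objects live under `s3://my-analytics-bucket/yoda-data/orders`.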
Credentials
Credentials are resolved by the object_store crate using standard AWS environment variables:
```sh
export AWS_ACCESS_KEY_ID="AKIA..."
export AWS_SECRET_ACCESS_KEY="..."
export AWS_REGION="us-east-1"
# Optional: for IAM role / instance profile, no static credentials needed
```

The full credential chain follows object_store conventions (env vars → instance metadata → ECS task role → IAM assumed role).
GCS backend
URL format
```
gs://<bucket>/<prefix>
```

Example: `gs://my-gcs-bucket/yoda-data`
Credentials
Credentials follow object_store defaults:
```sh
# Service account key file
export GOOGLE_APPLICATION_CREDENTIALS="/path/to/service-account.json"
# Or use Application Default Credentials (gcloud auth application-default login)
```

Read and write characteristics
Both cloud modes share the same DataFusion ListingTable integration:
- Reads: DataFusion's query engine reads Parquet objects natively from object storage, with column pruning and predicate pushdown applied where possible.
- Bulk inserts (`load_arrow` / CDC sync): a new Parquet object is appended under the table prefix. Efficient for analytical workloads with infrequent bulk loads.
- UPDATE / DELETE: requires a read-modify-write cycle: the full table is downloaded, mutated in memory, and re-uploaded as a single consolidated object. Not suitable for high-frequency point updates.
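The UPDATE/DELETE cost can be made concrete with a toy model (not the real implementation — a `BTreeMap` stands in for the object store, and tuples stand in for Parquet row groups):

```rust
use std::collections::BTreeMap;

// Toy model of the read-modify-write cycle: every object under the table
// prefix is downloaded, rows are mutated in memory, and one consolidated
// object is written back. Row = (id, amount).
type Row = (i64, f64);

fn update_amount(
    store: &mut BTreeMap<String, Vec<Row>>, // object key -> rows
    prefix: &str,
    id: i64,
    new_amount: f64,
) {
    // 1. "Download": pull every object under the table prefix.
    let keys: Vec<String> = store
        .keys()
        .filter(|k| k.starts_with(prefix))
        .cloned()
        .collect();
    let mut rows: Vec<Row> = Vec::new();
    for k in &keys {
        rows.extend(store.remove(k).unwrap());
    }
    // 2. Mutate in memory.
    for row in rows.iter_mut() {
        if row.0 == id {
            row.1 = new_amount;
        }
    }
    // 3. "Re-upload" as a single consolidated object.
    store.insert(format!("{prefix}/part-0.parquet"), rows);
}
```

Note that the whole table transits the network for a single-row change, which is why high-frequency point updates are a poor fit for the cloud modes.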
Object-store latency
Network round-trips dominate query time for cloud modes. For single-host deployments with a small dataset, local Parquet or ArrowIpc storage will be significantly faster. Choose cloud storage when the dataset size exceeds local disk, or when the OLAP mirror must be shared across hosts.
TOML configuration
S3
```toml
[engine]
oltp_path = "app.db"
olap_backend = "datafusion"
sync_mode = "destructive"

[engine.datafusion_storage]
mode = "s3-parquet"
url = "s3://my-analytics-bucket/yoda-data"
```

GCS
```toml
[engine]
oltp_path = "app.db"
olap_backend = "datafusion"
sync_mode = "temporal"

[engine.datafusion_storage]
mode = "gcs-parquet"
url = "gs://my-gcs-bucket/yoda-data"
```

Full S3 analytics example
```toml
[engine]
oltp_path = "app.db"
olap_backend = "datafusion"
sync_mode = "temporal"
sync_interval_ms = 1000
sync_batch_size = 2000
read_pool_size = 4
log_format = "json"

[engine.datafusion_storage]
mode = "s3-parquet"
url = "s3://acme-analytics/prod/yoda"

[[tables]]
name = "orders"
ddl = "CREATE TABLE IF NOT EXISTS orders (id INTEGER PRIMARY KEY, amount REAL, status TEXT, created_at INTEGER)"

[[tables.columns]]
name = "id"
type = "int64"
nullable = false
primary_key = true

[[tables.columns]]
name = "amount"
type = "float64"
nullable = true

[[tables.columns]]
name = "status"
type = "utf8"
nullable = true

[[tables.columns]]
name = "created_at"
type = "int64"
nullable = true
```

Run with appropriate AWS credentials in the environment:
```sh
AWS_ACCESS_KEY_ID=... AWS_SECRET_ACCESS_KEY=... AWS_REGION=us-east-1 \
  yd serve --config htap.toml
```

Combining with temporal mode
Cloud storage works with both sync_mode = "destructive" and sync_mode = "temporal". Temporal mode on S3 is a natural fit for long-retention audit history: Parquet files accumulate append-only, and point-in-time queries work identically to local storage (see Temporal mode).
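The append-only accumulation makes point-in-time selection cheap: a temporal query at time T only needs the objects written at or before T. A sketch, assuming (hypothetically) that each sync batch lands as a timestamp-named object such as `part-<epoch_ms>.parquet`:

```rust
// Sketch of point-in-time object selection over an append-only layout.
// objects: (write timestamp in ms, object name), assumed naming scheme.
fn visible_at(objects: &[(u64, &str)], query_ts: u64) -> Vec<String> {
    objects
        .iter()
        .filter(|(ts, _)| *ts <= query_ts)
        .map(|(_, name)| name.to_string())
        .collect()
}
```

Objects written after the query timestamp are simply skipped, so history accumulates on S3 without affecting older point-in-time results.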
Source
- `crates/yoda-datafusion/src/storage.rs` — `StorageMode` enum definition with full doc-comments on read/write characteristics.
- `crates/yoda-tui/src/config.rs` — `DataFusionStorageToml` and `parse_storage_mode` (TOML mode string parsing).