Cloud Storage (S3 and GCS)

Yoda's DataFusion backend can store OLAP data as Parquet files on Amazon S3 or Google Cloud Storage instead of the local filesystem. This is useful for large analytical datasets that exceed local disk capacity, or for multi-host deployments where OLAP data needs to live in shared object storage.

Feature flag

Enable the cloud-storage feature on the yoda dependency in your Cargo.toml:

toml
[dependencies]
yoda = { version = "1", features = ["cloud-storage"] }

Or, when building from a checkout of the workspace:

sh
cargo build --features cloud-storage

DataFusion only

Cloud storage modes are available exclusively with the DataFusion OLAP backend. DuckDB uses its own storage path and does not support the cloud-storage feature.

Storage modes

The DataFusion backend supports five StorageMode variants:

| Mode | TOML value | Description |
| --- | --- | --- |
| `InMemory` | `"memory"` / `"inmemory"` | Default. Data lost on shutdown. |
| `ArrowIpc` | `"arrow_ipc"` / `"ipc"` | Local `.arrow` files. Fast, no compression. Requires `path`. |
| `Parquet` | `"parquet"` | Local `.parquet` files. Compressed, predicate pushdown. Requires `path`. |
| `S3Parquet` | `"s3-parquet"` | Parquet on Amazon S3. Requires `url`. `cloud-storage` feature. |
| `GcsParquet` | `"gcs-parquet"` | Parquet on Google Cloud Storage. Requires `url`. `cloud-storage` feature. |
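
As a rough illustration of the mapping above, here is a minimal sketch of a `StorageMode` enum and its TOML-string parser. The real definitions live in `crates/yoda-datafusion/src/storage.rs` and `crates/yoda-tui/src/config.rs`; the names and shape here are assumptions for illustration, covering only the accepted values listed in the table.

```rust
// Illustrative sketch of a storage-mode enum and its TOML value parsing.
// Variant names mirror the table above; the real implementation may differ.
#[derive(Debug, PartialEq)]
enum StorageMode {
    InMemory,
    ArrowIpc,
    Parquet,
    S3Parquet,
    GcsParquet,
}

// Map the accepted TOML strings (including aliases) to a variant.
fn parse_storage_mode(s: &str) -> Option<StorageMode> {
    match s {
        "memory" | "inmemory" => Some(StorageMode::InMemory),
        "arrow_ipc" | "ipc" => Some(StorageMode::ArrowIpc),
        "parquet" => Some(StorageMode::Parquet),
        "s3-parquet" => Some(StorageMode::S3Parquet),
        "gcs-parquet" => Some(StorageMode::GcsParquet),
        _ => None, // unknown mode string
    }
}

fn main() {
    assert_eq!(parse_storage_mode("s3-parquet"), Some(StorageMode::S3Parquet));
    assert_eq!(parse_storage_mode("ipc"), Some(StorageMode::ArrowIpc));
    assert_eq!(parse_storage_mode("duckdb"), None);
}
```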

S3 backend

URL format

text
s3://<bucket>/<prefix>

Example: s3://my-analytics-bucket/yoda-data

Per-table subdirectories are created beneath the prefix automatically.
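
To make the layout concrete, the following sketch derives a per-table object key beneath the configured prefix. The exact naming scheme (the `part-NNNNN.parquet` suffix in particular) is an assumption for illustration, not the library's actual convention.

```rust
// Hypothetical key layout: <prefix>/<table>/part-<n>.parquet.
// The real backend chooses its own object names; this only shows
// how per-table subdirectories nest under the configured prefix.
fn table_object_key(prefix: &str, table: &str, part: u32) -> String {
    format!(
        "{}/{}/part-{:05}.parquet",
        prefix.trim_end_matches('/'), // tolerate a trailing slash in the prefix
        table,
        part
    )
}

fn main() {
    let key = table_object_key("yoda-data", "orders", 0);
    assert_eq!(key, "yoda-data/orders/part-00000.parquet");
}
```

So with `url = "s3://my-analytics-bucket/yoda-data"`, objects for the `orders` table would land under `s3://my-analytics-bucket/yoda-data/orders/`.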

Credentials

Credentials are resolved by the object_store crate using standard AWS environment variables:

sh
export AWS_ACCESS_KEY_ID="AKIA..."
export AWS_SECRET_ACCESS_KEY="..."
export AWS_REGION="us-east-1"
# Optional: for IAM role / instance profile, no static credentials needed

The full credential chain follows object_store conventions (env vars → instance metadata → ECS task role → IAM assumed role).

GCS backend

URL format

text
gs://<bucket>/<prefix>

Example: gs://my-gcs-bucket/yoda-data

Credentials

Credentials follow object_store defaults:

sh
# Service account key file
export GOOGLE_APPLICATION_CREDENTIALS="/path/to/service-account.json"

# Or use Application Default Credentials (gcloud auth application-default login)

Read and write characteristics

Both cloud modes share the same DataFusion ListingTable integration:

  • Reads: DataFusion's query engine reads Parquet objects natively from object storage, with column pruning and predicate pushdown applied where possible.
  • Bulk inserts (load_arrow / CDC sync): a new Parquet object is appended under the table prefix. Efficient for analytical workloads with infrequent bulk loads.
  • UPDATE / DELETE: requires a read-modify-write cycle; the full table is downloaded, mutated in memory, and re-uploaded as a single consolidated object. Not suitable for high-frequency point updates.
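
The UPDATE/DELETE path can be pictured with a toy model: the whole table object is fetched, rows are rewritten in memory, and one consolidated object is put back. In this sketch a `HashMap` stands in for the bucket and rows are plain strings; the real backend operates on Parquet record batches via `object_store`, so everything here is illustrative.

```rust
use std::collections::HashMap;

// Toy read-modify-write cycle: "download" the table object, mutate all
// rows in memory, then "re-upload" a single consolidated object. Every
// point update pays for the whole table, which is why this path is a
// poor fit for high-frequency mutations.
fn update_table(
    bucket: &mut HashMap<String, Vec<String>>,
    key: &str,
    mutate: impl Fn(&mut Vec<String>),
) {
    let mut rows = bucket.remove(key).unwrap_or_default(); // download
    mutate(&mut rows);                                     // rewrite in memory
    bucket.insert(key.to_string(), rows);                  // re-upload one object
}

fn main() {
    let mut bucket = HashMap::new();
    bucket.insert(
        "yoda-data/orders/data.parquet".to_string(),
        vec!["order-1".to_string(), "order-2".to_string()],
    );
    // A single-row DELETE still rewrites the entire table object.
    update_table(&mut bucket, "yoda-data/orders/data.parquet", |rows| {
        rows.retain(|r| r.as_str() != "order-1");
    });
    assert_eq!(
        bucket["yoda-data/orders/data.parquet"],
        vec!["order-2".to_string()]
    );
}
```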

Object-store latency

Network round-trips dominate query time for cloud modes. For single-host deployments with a small dataset, local Parquet or ArrowIpc storage will be significantly faster. Choose cloud storage when the dataset size exceeds local disk, or when the OLAP mirror must be shared across hosts.

TOML configuration

S3

toml
[engine]
oltp_path    = "app.db"
olap_backend = "datafusion"
sync_mode    = "destructive"

[engine.datafusion_storage]
mode = "s3-parquet"
url  = "s3://my-analytics-bucket/yoda-data"

GCS

toml
[engine]
oltp_path    = "app.db"
olap_backend = "datafusion"
sync_mode    = "temporal"

[engine.datafusion_storage]
mode = "gcs-parquet"
url  = "gs://my-gcs-bucket/yoda-data"

Full S3 analytics example

toml
[engine]
oltp_path        = "app.db"
olap_backend     = "datafusion"
sync_mode        = "temporal"
sync_interval_ms = 1000
sync_batch_size  = 2000
read_pool_size   = 4
log_format       = "json"

[engine.datafusion_storage]
mode = "s3-parquet"
url  = "s3://acme-analytics/prod/yoda"

[[tables]]
name = "orders"
ddl  = "CREATE TABLE IF NOT EXISTS orders (id INTEGER PRIMARY KEY, amount REAL, status TEXT, created_at INTEGER)"

  [[tables.columns]]
  name        = "id"
  type        = "int64"
  nullable    = false
  primary_key = true

  [[tables.columns]]
  name     = "amount"
  type     = "float64"
  nullable = true

  [[tables.columns]]
  name     = "status"
  type     = "utf8"
  nullable = true

  [[tables.columns]]
  name     = "created_at"
  type     = "int64"
  nullable = true

Run with appropriate AWS credentials in the environment:

sh
AWS_ACCESS_KEY_ID=... AWS_SECRET_ACCESS_KEY=... AWS_REGION=us-east-1 \
  yd serve --config htap.toml

Combining with temporal mode

Cloud storage works with both sync_mode = "destructive" and sync_mode = "temporal". Temporal mode on S3 is a natural fit for long-retention audit history: Parquet files accumulate append-only, and point-in-time queries work identically to local storage (see Temporal mode).

Source

  • crates/yoda-datafusion/src/storage.rs — StorageMode enum definition with full doc-comments on read/write characteristics.
  • crates/yoda-tui/src/config.rs — DataFusionStorageToml and parse_storage_mode (TOML mode string parsing).

Released under the Apache-2.0 License.