
OLAP Backends

Yoda supports two OLAP engines: Apache DataFusion (default) and DuckDB. Both implement the same OlapEngine trait, so switching backends requires only a config change once the matching Cargo feature is enabled.
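To illustrate why a shared trait makes the backend a configuration detail, here is a minimal sketch. The names `Engine`, `BackendKind`, `DataFusionEngine`, `DuckDbEngine`, and `build_engine` are hypothetical stand-ins, not Yoda's API; the real OlapEngine trait is async and has a larger surface.

```rust
// Hypothetical stand-ins for illustration only; not Yoda's actual types.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum BackendKind {
    DataFusion,
    DuckDb,
}

/// A tiny trait playing the role that OlapEngine plays in the text.
pub trait Engine {
    /// Backend name, e.g. for logging.
    fn name(&self) -> &'static str;
    /// Whether BEGIN/COMMIT do real work or are no-ops.
    fn supports_transactions(&self) -> bool;
}

pub struct DataFusionEngine;
pub struct DuckDbEngine;

impl Engine for DataFusionEngine {
    fn name(&self) -> &'static str {
        "datafusion"
    }
    fn supports_transactions(&self) -> bool {
        false
    }
}

impl Engine for DuckDbEngine {
    fn name(&self) -> &'static str {
        "duckdb"
    }
    fn supports_transactions(&self) -> bool {
        true
    }
}

/// Build the engine selected by configuration. Callers only ever see
/// `dyn Engine`, so nothing else changes when the backend does.
pub fn build_engine(kind: BackendKind) -> Box<dyn Engine> {
    match kind {
        BackendKind::DataFusion => Box::new(DataFusionEngine),
        BackendKind::DuckDb => Box::new(DuckDbEngine),
    }
}
```

Code that calls `build_engine(BackendKind::DuckDb)` and then works through the returned trait object does not need to know which engine it received.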


Quick Comparison

| Feature | DataFusion | DuckDB |
|---|---|---|
| Feature flag | datafusion-backend (default) | duckdb-backend |
| C++ dependency | None (pure Rust) | Yes (bundled via duckdb-sys) |
| Async model | Natively async | spawn_blocking wrappers |
| Transactions | No (no-ops) | Yes (full ACID) |
| Bulk-load path | Arrow batch → load_arrow | Arrow Appender API (zero-copy) |
| Storage modes | InMemory / ArrowIpc / Parquet / S3 / GCS | InMemory or single .duckdb file |
| Streaming results | Native (RecordBatchBoxStream) | Collect-then-stream |
| Primary key enforcement | No | Yes (in destructive mode) |
| Compile time | Fast | Slower (C++ bundled build) |

DataFusion

DataFusion is the default OLAP engine. It is a pure-Rust, natively async columnar query engine built on the Apache Arrow in-memory format.

Enable it (or keep the default):

toml
yoda = "1"
# or explicitly:
yoda = { version = "1", features = ["datafusion-backend"] }

Storage Modes

DataFusion's storage is configurable via HtapConfig::datafusion_storage (StorageMode enum):

rust
use yoda::{HtapConfig, StorageMode};

// In-memory (default) — no persistence
let config = HtapConfig {
    datafusion_storage: StorageMode::InMemory,
    ..HtapConfig::default()
};

// Arrow IPC files — fast durable writes
let config = HtapConfig {
    olap_in_memory: false,
    datafusion_storage: StorageMode::ArrowIpc {
        path: "/var/lib/myapp/olap-arrow".into(),
    },
    ..HtapConfig::default()
};

// Parquet — compressed, predicate pushdown
let config = HtapConfig {
    olap_in_memory: false,
    datafusion_storage: StorageMode::Parquet {
        path: "/var/lib/myapp/olap-parquet".into(),
    },
    ..HtapConfig::default()
};

Cloud backends require the cloud-storage feature:

rust
// S3 (requires cloud-storage feature + AWS_* env vars)
let config = HtapConfig {
    datafusion_storage: StorageMode::S3Parquet {
        url: "s3://my-bucket/analytics/".to_string(),
    },
    ..HtapConfig::default()
};

// GCS (requires cloud-storage feature + GOOGLE_* env vars)
let config = HtapConfig {
    datafusion_storage: StorageMode::GcsParquet {
        url: "gs://my-bucket/analytics/".to_string(),
    },
    ..HtapConfig::default()
};

Bulk Loading

CDC INSERT batches are loaded via OlapEngine::load_arrow(), which accepts an Arrow RecordBatch directly, so DataFusion appends the batch without constructing any SQL strings. It falls back to SQL only for column types not yet handled by the Arrow builder (Date, Timestamp, Decimal, List, Struct).
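The path selection described above can be sketched as a simple check over a batch's column types. This is an illustration under assumptions: `ColumnType`, `LoadPath`, and both functions are hypothetical, the "handled" types (Int64, Float64, Utf8, Boolean) are guesses, and whether Yoda decides per batch or per column is not stated in this page.

```rust
/// Hypothetical column type tags; Yoda's real schema types are not shown here.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ColumnType {
    Int64,
    Float64,
    Utf8,
    Boolean,
    Date,
    Timestamp,
    Decimal,
    List,
    Struct,
}

#[derive(Debug, PartialEq, Eq)]
pub enum LoadPath {
    ArrowBatch,
    SqlFallback,
}

/// True for the types the text lists as not yet handled by the Arrow builder.
pub fn needs_sql_fallback(ty: ColumnType) -> bool {
    matches!(
        ty,
        ColumnType::Date
            | ColumnType::Timestamp
            | ColumnType::Decimal
            | ColumnType::List
            | ColumnType::Struct
    )
}

/// Assumption: a single unsupported column sends the whole batch down the SQL path.
pub fn choose_load_path(columns: &[ColumnType]) -> LoadPath {
    if columns.iter().copied().any(needs_sql_fallback) {
        LoadPath::SqlFallback
    } else {
        LoadPath::ArrowBatch
    }
}
```

Under these assumptions, a table of integer and string columns stays on the Arrow path, while adding one Decimal column moves it to the SQL fallback.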

Streaming Queries

DataFusion implements OlapEngine::query_stream() natively via DataFrame::execute_stream(), returning a RecordBatchBoxStream that emits batches without buffering the entire result set. The Arrow Flight SQL server uses this for all data transfer.
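The difference between native streaming and collect-then-stream can be shown with a synchronous analogy. The sketch below uses plain iterators instead of an async stream, and every name in it (`Batch`, `batch_source`, `stream_native`, `collect_then_stream`) is hypothetical. A shared counter records how many batches the source has produced at any point.

```rust
use std::cell::Cell;
use std::rc::Rc;

/// Hypothetical batch: just a vector of row ids.
pub type Batch = Vec<u32>;

/// Lazily produces `n` batches and counts how many have been materialised.
pub fn batch_source(n: u32, produced: Rc<Cell<u32>>) -> impl Iterator<Item = Batch> {
    (0..n).map(move |i| {
        produced.set(produced.get() + 1);
        vec![i * 2, i * 2 + 1]
    })
}

/// Native streaming: hand the lazy source straight to the consumer,
/// so each batch is produced only when it is pulled.
pub fn stream_native(src: impl Iterator<Item = Batch>) -> impl Iterator<Item = Batch> {
    src
}

/// Collect-then-stream: buffer every batch first, then iterate over the buffer.
/// Peak memory is the whole result set.
pub fn collect_then_stream(src: impl Iterator<Item = Batch>) -> impl Iterator<Item = Batch> {
    let buffered: Vec<Batch> = src.collect();
    buffered.into_iter()
}
```

Pulling one batch from `stream_native` materialises one batch; calling `collect_then_stream` materialises all of them before the first one is handed out.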

No Transactions

DataFusion's transaction support is a no-op. In SyncMode::Temporal, the UPDATE (close previous version) + INSERT (new version) pair is therefore not atomic: a crash between the two leaves the previous version closed and the new version missing until the engine resumes and reprocesses from the last committed sequence number.
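The following sketch simulates that window and the recovery by replay. It is a toy model, not Yoda's implementation: `Version`, `UpdateEvent`, `TemporalTable`, and the idea of using the sequence number as `valid_from` are all assumptions made for illustration. The important property is that both steps are idempotent, so reprocessing an event after a crash converges to the same state.

```rust
/// Hypothetical temporal row: one version of one key.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Version {
    pub key: u32,
    pub value: String,
    pub valid_from: u64,
    pub valid_to: Option<u64>,
}

#[derive(Debug, Clone)]
pub struct UpdateEvent {
    pub seq: u64,
    pub key: u32,
    pub value: String,
}

#[derive(Default)]
pub struct TemporalTable {
    pub rows: Vec<Version>,
    pub last_committed_seq: u64,
}

impl TemporalTable {
    /// Step 1: close any open version of `key` that started before `at`.
    fn close_open(&mut self, key: u32, at: u64) {
        for r in self.rows.iter_mut() {
            if r.key == key && r.valid_to.is_none() && r.valid_from < at {
                r.valid_to = Some(at);
            }
        }
    }

    /// Step 2: insert the new version unless a previous attempt already did.
    fn insert_version(&mut self, ev: &UpdateEvent) {
        let exists = self.rows.iter().any(|r| r.key == ev.key && r.valid_from == ev.seq);
        if !exists {
            self.rows.push(Version {
                key: ev.key,
                value: ev.value.clone(),
                valid_from: ev.seq,
                valid_to: None,
            });
        }
    }

    /// Apply both steps without a transaction. `crash_after_close` simulates
    /// the process dying between them; returns false in that case.
    pub fn apply(&mut self, ev: &UpdateEvent, crash_after_close: bool) -> bool {
        self.close_open(ev.key, ev.seq);
        if crash_after_close {
            return false;
        }
        self.insert_version(ev);
        self.last_committed_seq = ev.seq;
        true
    }

    /// Number of currently open versions for `key`.
    pub fn open_versions(&self, key: u32) -> usize {
        self.rows.iter().filter(|r| r.key == key && r.valid_to.is_none()).count()
    }
}
```

After a simulated crash, `open_versions` for the key is zero and `last_committed_seq` still points at the previous event; applying the same event again restores exactly one open version.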


DuckDB

The DuckDB backend embeds the full DuckDB columnar engine via duckdb-sys (bundled C++ build). It provides broad SQL compatibility, MVCC-based concurrent reads, and ACID transactions.

Enable DuckDB:

toml
yoda = { version = "1", features = ["duckdb-backend"] }
# or both backends:
yoda = { version = "1", features = ["full"] }

Configure the engine:

rust
use yoda::{HtapConfig, OlapBackendType};

// In-memory DuckDB
let config = HtapConfig {
    olap_backend: OlapBackendType::DuckDb,
    olap_in_memory: true,
    ..HtapConfig::default()
};

// Persistent DuckDB — single file
let config = HtapConfig {
    olap_backend: OlapBackendType::DuckDb,
    olap_in_memory: false,
    olap_path: Some("/var/lib/myapp/olap.duckdb".to_string()),
    ..HtapConfig::default()
};

Bulk Loading

DuckDB's load_arrow() uses the native Appender API (append_record_batch) for zero-copy Arrow ingestion. All Arrow types — including Date, Timestamp, Binary, and Decimal — are handled natively without SQL literal serialisation. This is the fastest bulk-load path across both backends.

Thread Safety

duckdb::Connection is Send but !Sync, so it cannot be shared between threads by reference. Yoda wraps each connection in a Mutex with unsafe impl Send/Sync and dispatches all operations through spawn_blocking. The engine maintains a single write connection and a read pool (default: 4) using DuckDB's MVCC for concurrent read isolation.
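The single-writer, pooled-reader layout can be sketched with the standard library alone. Everything here is hypothetical: `FakeConn` stands in for a database connection, `ConnPool` and `parallel_reads` are illustrative names, round-robin reader selection is an assumption, and OS threads stand in for spawn_blocking tasks. Because `FakeConn` is an ordinary Send type, this sketch needs no unsafe code.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Arc, Mutex};
use std::thread;

/// Hypothetical stand-in for a database connection handle.
pub struct FakeConn {
    pub id: usize,
    pub queries_run: usize,
}

impl FakeConn {
    /// Pretend to run a query; returns the id of the connection that served it.
    pub fn run(&mut self, _sql: &str) -> usize {
        self.queries_run += 1;
        self.id
    }
}

/// One mutex-guarded write connection plus a fixed pool of read connections.
pub struct ConnPool {
    writer: Mutex<FakeConn>,
    readers: Vec<Mutex<FakeConn>>,
    next_reader: AtomicUsize,
}

impl ConnPool {
    pub fn new(read_pool_size: usize) -> Self {
        assert!(read_pool_size > 0, "read pool must not be empty");
        ConnPool {
            writer: Mutex::new(FakeConn { id: 0, queries_run: 0 }),
            readers: (1..=read_pool_size)
                .map(|id| Mutex::new(FakeConn { id, queries_run: 0 }))
                .collect(),
            next_reader: AtomicUsize::new(0),
        }
    }

    /// All writes are serialised through the single writer (id 0).
    pub fn write(&self, sql: &str) -> usize {
        self.writer.lock().unwrap().run(sql)
    }

    /// Reads rotate round-robin over the pool (ids 1..=N).
    pub fn read(&self, sql: &str) -> usize {
        let i = self.next_reader.fetch_add(1, Ordering::Relaxed) % self.readers.len();
        self.readers[i].lock().unwrap().run(sql)
    }

    /// Total number of queries served by the read pool.
    pub fn total_reads(&self) -> usize {
        self.readers.iter().map(|r| r.lock().unwrap().queries_run).sum()
    }
}

/// Issue `n` reads from `n` threads and return the serving connection ids.
pub fn parallel_reads(pool: Arc<ConnPool>, n: usize) -> Vec<usize> {
    let handles: Vec<_> = (0..n)
        .map(|_| {
            let p = Arc::clone(&pool);
            thread::spawn(move || p.read("SELECT 1"))
        })
        .collect();
    handles.into_iter().map(|h| h.join().unwrap()).collect()
}
```

With `ConnPool::new(4)`, writes always land on connection 0 and reads are spread across connections 1 to 4, which mirrors the default pool size quoted above.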

Transactions

DuckDB supports full ACID transactions. In SyncMode::Temporal, the close + insert pair for each UPDATE or DELETE is wrapped in a single BEGIN … COMMIT, making it fully atomic. This is the key advantage of DuckDB over DataFusion for temporal workloads.
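To make the atomicity guarantee concrete, here is a toy model of a transaction that publishes the close and the insert together or not at all. `Row`, `Store`, `Txn`, and `temporal_update` are hypothetical names, and copying the whole table on BEGIN is a simplification chosen for brevity; it is not how DuckDB's MVCC works.

```rust
/// Hypothetical row in a temporal table; `open` marks the current version.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Row {
    pub key: u32,
    pub value: String,
    pub open: bool,
}

#[derive(Default)]
pub struct Store {
    pub committed: Vec<Row>,
}

/// A transaction works on a private copy until it is committed.
pub struct Txn<'a> {
    store: &'a mut Store,
    working: Vec<Row>,
}

impl Store {
    /// BEGIN: snapshot the committed rows into a working copy.
    pub fn begin(&mut self) -> Txn<'_> {
        let working = self.committed.clone();
        Txn { store: self, working }
    }
}

impl<'a> Txn<'a> {
    /// Close the open version of `key` inside the transaction.
    pub fn close_open(&mut self, key: u32) {
        for r in self.working.iter_mut().filter(|r| r.key == key && r.open) {
            r.open = false;
        }
    }

    /// Insert the new open version inside the transaction.
    pub fn insert(&mut self, key: u32, value: &str) {
        self.working.push(Row { key, value: value.to_string(), open: true });
    }

    /// COMMIT: publish every change at once.
    pub fn commit(self) {
        self.store.committed = self.working;
    }
    // Dropping a Txn without calling commit discards the working copy (rollback).
}

/// Temporal update as one transaction. `fail_midway` simulates a failure
/// after the close step; returns false and leaves the store untouched.
pub fn temporal_update(store: &mut Store, key: u32, value: &str, fail_midway: bool) -> bool {
    let mut txn = store.begin();
    txn.close_open(key);
    if fail_midway {
        return false; // txn is dropped here, nothing is published
    }
    txn.insert(key, value);
    txn.commit();
    true
}
```

A failed `temporal_update` leaves the previously committed version open and unchanged, so a reader never observes a key with a closed version and no successor.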

datafusion_storage is Ignored

When olap_backend = OlapBackendType::DuckDb, the datafusion_storage field has no effect. DuckDB uses either in-memory mode or a single .duckdb file, controlled by olap_in_memory and olap_path.
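The precedence rule above can be written down as a small resolution function. This is a sketch with hypothetical mirror types (`Config`, `Backend`, `Storage`, `Effective`); it is not Yoda's HtapConfig. Two behaviours are assumptions: returning an error when a persistent DuckDB store has no olap_path, and ignoring olap_in_memory on the DataFusion side.

```rust
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum Backend {
    DataFusion,
    DuckDb,
}

/// Simplified mirror of the storage modes; cloud variants omitted.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum Storage {
    InMemory,
    ArrowIpc { path: String },
    Parquet { path: String },
}

/// Hypothetical mirror of the relevant configuration fields.
pub struct Config {
    pub backend: Backend,
    pub olap_in_memory: bool,
    pub olap_path: Option<String>,
    pub datafusion_storage: Storage,
}

/// Where the OLAP data actually ends up.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum Effective {
    Memory,
    DuckDbFile(String),
    DataFusion(Storage),
}

/// Resolve the effective storage. For DuckDB, `datafusion_storage` is never read.
pub fn effective_storage(cfg: &Config) -> Result<Effective, String> {
    match cfg.backend {
        Backend::DuckDb if cfg.olap_in_memory => Ok(Effective::Memory),
        Backend::DuckDb => cfg
            .olap_path
            .clone()
            .map(Effective::DuckDbFile)
            // Assumption: a missing path is treated as a configuration error.
            .ok_or_else(|| "olap_path must be set for a persistent DuckDB store".to_string()),
        Backend::DataFusion => Ok(match &cfg.datafusion_storage {
            Storage::InMemory => Effective::Memory,
            other => Effective::DataFusion(other.clone()),
        }),
    }
}
```

A DuckDB configuration that also carries a Parquet `datafusion_storage` value still resolves to the .duckdb file, which is the behaviour the paragraph describes.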


When to Pick Which

Choose DataFusion when:

  • You want zero C/C++ dependencies (CI, cross-compilation, WebAssembly targets).
  • You need cloud object storage (S3, GCS) natively.
  • Your workload is primarily append-only (INSERTs): the Arrow batch path loads these without SQL string construction for most column types.
  • Temporal mode atomicity is not critical (or you run DuckDB for temporal and DataFusion for destructive).

Choose DuckDB when:

  • You need SyncMode::Temporal with atomic UPDATE/DELETE transitions.
  • You need primary-key enforcement on the OLAP mirror.
  • You want a single durable file for the OLAP store with a familiar SQL dialect.
  • You need the DuckDB extension ecosystem (spatial, JSON, HTTPFS, etc.) via raw queries.
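The two checklists can be condensed into a small decision helper. This is only one way to encode them: `Needs`, `Pick`, and `pick_backend` are hypothetical, and reporting a conflict when requirements point at both backends is a design choice of this sketch rather than anything Yoda does.

```rust
/// Hypothetical requirement flags drawn from the two lists above.
#[derive(Debug, Default, Clone, Copy)]
pub struct Needs {
    pub no_cpp_toolchain: bool,
    pub cloud_object_storage: bool,
    pub atomic_temporal: bool,
    pub primary_keys: bool,
    pub duckdb_extensions: bool,
}

#[derive(Debug, PartialEq, Eq)]
pub enum Pick {
    DataFusion,
    DuckDb,
    /// Requirements point at both backends; a human has to trade them off.
    Conflict,
}

pub fn pick_backend(n: Needs) -> Pick {
    let wants_datafusion = n.no_cpp_toolchain || n.cloud_object_storage;
    let wants_duckdb = n.atomic_temporal || n.primary_keys || n.duckdb_extensions;
    match (wants_datafusion, wants_duckdb) {
        (true, true) => Pick::Conflict,
        (false, true) => Pick::DuckDb,
        // No DuckDB-specific requirement: stay on the default backend.
        _ => Pick::DataFusion,
    }
}
```

For example, a deployment that needs both S3 storage and atomic temporal updates yields `Pick::Conflict`, which matches the note above about running DuckDB for temporal tables and DataFusion for destructive ones.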


Released under the Apache-2.0 License.