OLAP Backends

Yoda supports two OLAP engines: Apache DataFusion (default) and DuckDB. Both implement the same OlapEngine trait, so switching backends requires only enabling the matching feature flag and changing the config; application code stays the same.
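To illustrate why a shared trait makes the backend a configuration detail, here is a minimal, std-only sketch. The trait methods, the engine structs, the `build_engine` helper, and the `OlapBackendType::DataFusion` variant are hypothetical stand-ins; Yoda's real OlapEngine trait has a different and larger surface.

```rust
// Hypothetical, simplified stand-ins for Yoda's types (assumptions of
// this sketch, not the real API).
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum OlapBackendType {
    DataFusion,
    DuckDb,
}

/// Reduced trait: just enough to show dynamic dispatch over backends.
pub trait OlapEngine {
    fn name(&self) -> &'static str;
    fn supports_transactions(&self) -> bool;
}

pub struct DataFusionEngine;
pub struct DuckDbEngine;

impl OlapEngine for DataFusionEngine {
    fn name(&self) -> &'static str {
        "datafusion"
    }
    // Transactions are no-ops on DataFusion.
    fn supports_transactions(&self) -> bool {
        false
    }
}

impl OlapEngine for DuckDbEngine {
    fn name(&self) -> &'static str {
        "duckdb"
    }
    fn supports_transactions(&self) -> bool {
        true
    }
}

/// Select an engine from the configured backend. Callers hold a
/// `Box<dyn OlapEngine>` and never name the concrete type.
pub fn build_engine(backend: OlapBackendType) -> Box<dyn OlapEngine> {
    match backend {
        OlapBackendType::DataFusion => Box::new(DataFusionEngine),
        OlapBackendType::DuckDb => Box::new(DuckDbEngine),
    }
}
```

Code that calls `build_engine(OlapBackendType::DuckDb)` and then works only through the trait does not change when the configured backend does.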

Quick Comparison

Feature	DataFusion	DuckDB
Feature flag	`datafusion-backend` (default)	`duckdb-backend`
C++ dependency	None — pure Rust	Yes — bundled via `duckdb-sys`
Async model	Natively async	`spawn_blocking` wrappers
Transactions	No (no-ops)	Yes — full ACID
Bulk-load path	Arrow batch → `load_arrow`	Arrow `Appender` API (zero-copy)
Storage modes	InMemory / ArrowIpc / Parquet / S3 / GCS	InMemory or single `.duckdb` file
Streaming results	Native (`RecordBatchBoxStream`)	Collect-then-stream
Primary key enforcement	No	Yes (in destructive mode)
Compile time	Fast	Slower (C++ bundled build)

DataFusion

DataFusion is the default OLAP engine. It is a pure-Rust, natively async columnar query engine built on the Apache Arrow in-memory format.

Enable it (or keep the default):

toml

yoda = "1"
# or explicitly (use one form only; TOML rejects duplicate keys):
# yoda = { version = "1", features = ["datafusion-backend"] }

Storage Modes

DataFusion's storage is configurable via HtapConfig::datafusion_storage (StorageMode enum):

rust

use yoda::{HtapConfig, StorageMode};

// In-memory (default) — no persistence
let config = HtapConfig {
    datafusion_storage: StorageMode::InMemory,
    ..HtapConfig::default()
};

// Arrow IPC files — fast durable writes
let config = HtapConfig {
    olap_in_memory: false,
    datafusion_storage: StorageMode::ArrowIpc {
        path: "/var/lib/myapp/olap-arrow".into(),
    },
    ..HtapConfig::default()
};

// Parquet — compressed, predicate pushdown
let config = HtapConfig {
    olap_in_memory: false,
    datafusion_storage: StorageMode::Parquet {
        path: "/var/lib/myapp/olap-parquet".into(),
    },
    ..HtapConfig::default()
};

Cloud backends require the cloud-storage feature:

rust

// S3 (requires cloud-storage feature + AWS_* env vars)
let config = HtapConfig {
    datafusion_storage: StorageMode::S3Parquet {
        url: "s3://my-bucket/analytics/".to_string(),
    },
    ..HtapConfig::default()
};

// GCS (requires cloud-storage feature + GOOGLE_* env vars)
let config = HtapConfig {
    datafusion_storage: StorageMode::GcsParquet {
        url: "gs://my-bucket/analytics/".to_string(),
    },
    ..HtapConfig::default()
};

Bulk Loading

CDC INSERT batches are loaded via OlapEngine::load_arrow(), which accepts an Arrow RecordBatch directly. DataFusion appends the batch without constructing any SQL strings. It falls back to SQL only for column types that the Arrow builder does not yet handle (Date, Timestamp, Decimal, List, Struct).
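The fallback rule can be pictured as a per-schema check. The sketch below is an assumption about how such a decision could look, not Yoda's implementation: `ColumnType`, `LoadPath`, and both functions are hypothetical names, and treating the listed non-fallback types as Arrow-handled is also assumed.

```rust
/// Hypothetical column type tags; the last five are the types the text
/// lists as not yet handled by the Arrow builder.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum ColumnType {
    Int64,
    Float64,
    Utf8,
    Boolean,
    Date,
    Timestamp,
    Decimal,
    List,
    Struct,
}

#[derive(Debug, PartialEq)]
pub enum LoadPath {
    /// Batch goes straight to `load_arrow`.
    Arrow,
    /// Batch is serialised into SQL statements.
    SqlFallback,
}

/// True for the column types that force the SQL path.
pub fn needs_sql_fallback(ty: ColumnType) -> bool {
    matches!(
        ty,
        ColumnType::Date
            | ColumnType::Timestamp
            | ColumnType::Decimal
            | ColumnType::List
            | ColumnType::Struct
    )
}

/// One unsupported column is enough to send the whole batch down the
/// SQL path (an assumption of this sketch).
pub fn choose_load_path(schema: &[ColumnType]) -> LoadPath {
    if schema.iter().any(|&ty| needs_sql_fallback(ty)) {
        LoadPath::SqlFallback
    } else {
        LoadPath::Arrow
    }
}
```

Under these assumptions, a table of integers and strings stays on the Arrow path, while adding a single Timestamp column moves it to the SQL fallback.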

Streaming Queries

DataFusion implements OlapEngine::query_stream() natively via DataFrame::execute_stream(), returning a RecordBatchBoxStream that emits batches without buffering the entire result set. The Arrow Flight SQL server uses this for all data transfer.

No Transactions

DataFusion's transaction support is a no-op. In SyncMode::Temporal, the UPDATE (close previous version) + INSERT (new version) pair is therefore not atomic. A crash between the two leaves the mirror in an intermediate state, with the previous version already closed and the new version not yet inserted, until the engine resumes and reprocesses from the last committed sequence number.
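The recovery behaviour described here depends on reprocessing being idempotent. The following std-only sketch simulates that with an in-memory version table; `TemporalMirror`, `UpdateEvent`, `Version`, and the use of the sequence number as the version timestamp are all hypothetical simplifications, not Yoda's data model.

```rust
#[derive(Debug, Clone, PartialEq)]
pub struct Version {
    pub key: u64,
    pub value: String,
    pub valid_from: u64,
    pub valid_to: Option<u64>, // None = current (open) version
}

#[derive(Debug, Clone)]
pub struct UpdateEvent {
    pub seq: u64,
    pub key: u64,
    pub value: String,
}

#[derive(Debug, Default)]
pub struct TemporalMirror {
    pub rows: Vec<Version>,
    pub committed_seq: u64,
}

impl TemporalMirror {
    /// Step 1: close the open version of `key`. Skips rows that are
    /// already closed, so running it twice changes nothing.
    fn close_open(&mut self, key: u64, at: u64) {
        for row in self.rows.iter_mut() {
            if row.key == key && row.valid_to.is_none() && row.valid_from < at {
                row.valid_to = Some(at);
            }
        }
    }

    /// Step 2: insert the new version unless a previous attempt did.
    fn insert_version(&mut self, ev: &UpdateEvent) {
        let exists = self
            .rows
            .iter()
            .any(|r| r.key == ev.key && r.valid_from == ev.seq);
        if !exists {
            self.rows.push(Version {
                key: ev.key,
                value: ev.value.clone(),
                valid_from: ev.seq,
                valid_to: None,
            });
        }
    }

    /// Apply one update without a transaction. `crash_after_close`
    /// simulates a failure between the two steps.
    pub fn apply(&mut self, ev: &UpdateEvent, crash_after_close: bool) -> Result<(), String> {
        self.close_open(ev.key, ev.seq);
        if crash_after_close {
            return Err(format!("simulated crash at seq {}", ev.seq));
        }
        self.insert_version(ev);
        self.committed_seq = ev.seq;
        Ok(())
    }

    /// Resume: reprocess every event after the last committed sequence.
    pub fn replay(&mut self, log: &[UpdateEvent]) {
        let from = self.committed_seq;
        for ev in log.iter().filter(|e| e.seq > from) {
            self.apply(ev, false).expect("replay does not simulate crashes");
        }
    }

    pub fn open_versions(&self, key: u64) -> usize {
        self.rows
            .iter()
            .filter(|r| r.key == key && r.valid_to.is_none())
            .count()
    }
}
```

In this model, a crash after the close step leaves the key with no open version; `replay` restores exactly one, and calling it again is a no-op because `committed_seq` has advanced.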

DuckDB

The DuckDB backend embeds the full DuckDB columnar engine via duckdb-sys (C++ bundled build). It provides broad SQL compatibility, MVCC-based concurrent reads, and ACID transactions.

Enable DuckDB:

toml

yoda = { version = "1", features = ["duckdb-backend"] }
# or both backends (use one form only; TOML rejects duplicate keys):
# yoda = { version = "1", features = ["full"] }

Configure the engine:

rust

use yoda::{HtapConfig, OlapBackendType};

// In-memory DuckDB
let config = HtapConfig {
    olap_backend: OlapBackendType::DuckDb,
    olap_in_memory: true,
    ..HtapConfig::default()
};

// Persistent DuckDB — single file
let config = HtapConfig {
    olap_backend: OlapBackendType::DuckDb,
    olap_in_memory: false,
    olap_path: Some("/var/lib/myapp/olap.duckdb".to_string()),
    ..HtapConfig::default()
};

Bulk Loading

DuckDB's load_arrow() uses the native Appender API (append_record_batch) for zero-copy Arrow ingestion. All Arrow types — including Date, Timestamp, Binary, and Decimal — are handled natively without SQL literal serialisation. This is the fastest bulk-load path across both backends.

Thread Safety

duckdb::Connection is !Send. Yoda wraps each connection in a Mutex with unsafe impl Send/Sync and dispatches all operations through spawn_blocking. The engine maintains a single write connection and a read pool (default: 4) using DuckDB's MVCC for concurrent read isolation.
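The wrapping pattern can be sketched with the standard library alone. Everything below is a stand-in: `FakeConnection` imitates a connection type that is not `Send`, `std::thread::spawn` takes the place of `spawn_blocking`, and only the single write connection is modelled (no read pool). It is not Yoda's code.

```rust
use std::marker::PhantomData;
use std::sync::{Arc, Mutex};
use std::thread;

/// Hypothetical connection type. The raw-pointer marker opts it out of
/// the `Send` and `Sync` auto traits.
pub struct FakeConnection {
    rows: Vec<i64>,
    _not_send: PhantomData<*const ()>,
}

impl FakeConnection {
    pub fn new() -> Self {
        Self { rows: Vec::new(), _not_send: PhantomData }
    }
    pub fn insert(&mut self, v: i64) {
        self.rows.push(v);
    }
    pub fn count(&self) -> usize {
        self.rows.len()
    }
}

/// Wrapper asserting that the connection may move between threads.
/// SAFETY (assumption of this sketch): every access goes through the
/// Mutex in `Engine`, so only one thread touches the connection at a time.
pub struct SendConnection(FakeConnection);
unsafe impl Send for SendConnection {}

#[derive(Clone)]
pub struct Engine {
    writer: Arc<Mutex<SendConnection>>,
}

impl Engine {
    pub fn new() -> Self {
        Self {
            writer: Arc::new(Mutex::new(SendConnection(FakeConnection::new()))),
        }
    }

    /// Run a write on another thread and wait for it, standing in for
    /// dispatching the call through `spawn_blocking`.
    pub fn insert_blocking(&self, v: i64) {
        let writer = Arc::clone(&self.writer);
        thread::spawn(move || {
            let mut guard = writer.lock().unwrap();
            guard.0.insert(v);
        })
        .join()
        .unwrap();
    }

    pub fn count(&self) -> usize {
        self.writer.lock().unwrap().0.count()
    }
}
```

Because `Mutex<T>` is `Sync` whenever `T` is `Send`, the wrapper in this sketch needs only the `Send` assertion; the lock serialises all access to the underlying connection.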

Transactions

DuckDB supports full ACID transactions. In SyncMode::Temporal, the close + insert pair for each UPDATE or DELETE is wrapped in a single BEGIN … COMMIT, making it fully atomic. This is the key advantage of DuckDB over DataFusion for temporal workloads.

`datafusion_storage` is Ignored

When olap_backend = OlapBackendType::DuckDb, the datafusion_storage field has no effect. DuckDB uses either in-memory mode or a single .duckdb file, controlled by olap_in_memory and olap_path.
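The precedence rule can be written out as a small resolution function. This is a sketch under stated assumptions: the structs mirror only the fields named on this page, `EffectiveStore` and `effective_store` are hypothetical, the `OlapBackendType::DataFusion` variant name is assumed, and returning an error when `olap_path` is missing is a guess about behaviour rather than something the page specifies.

```rust
#[derive(Debug, Clone, PartialEq)]
pub enum StorageMode {
    InMemory,
    Parquet { path: String },
}

#[derive(Debug, Clone, Copy, PartialEq)]
pub enum OlapBackendType {
    DataFusion,
    DuckDb,
}

/// Reduced mirror of the config fields discussed in the text.
pub struct HtapConfig {
    pub olap_backend: OlapBackendType,
    pub olap_in_memory: bool,
    pub olap_path: Option<String>,
    pub datafusion_storage: StorageMode,
}

#[derive(Debug, PartialEq)]
pub enum EffectiveStore {
    DuckDbMemory,
    DuckDbFile(String),
    DataFusion(StorageMode),
}

/// Resolve which store is actually used. For DuckDB, only
/// `olap_in_memory` and `olap_path` are consulted; `datafusion_storage`
/// is never read. How `olap_in_memory` interacts with DataFusion is not
/// modelled here.
pub fn effective_store(cfg: &HtapConfig) -> Result<EffectiveStore, String> {
    match cfg.olap_backend {
        OlapBackendType::DuckDb => {
            if cfg.olap_in_memory {
                Ok(EffectiveStore::DuckDbMemory)
            } else {
                cfg.olap_path
                    .clone()
                    .map(EffectiveStore::DuckDbFile)
                    // Assumed behaviour: a persistent store needs a path.
                    .ok_or_else(|| "olap_path is required when olap_in_memory is false".to_string())
            }
        }
        OlapBackendType::DataFusion => {
            Ok(EffectiveStore::DataFusion(cfg.datafusion_storage.clone()))
        }
    }
}
```

With the DuckDB backend selected, a config that also sets `datafusion_storage` to a Parquet path still resolves to the `.duckdb` file given in `olap_path`.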

When to Pick Which

Choose DataFusion when:

You want zero C/C++ dependencies (CI, cross-compilation, WebAssembly targets).
You need cloud object storage (S3, GCS) natively.
Your workload is primarily append-only (INSERTs): load_arrow appends Arrow batches directly, without SQL string construction, for most column types.
Temporal mode atomicity is not critical (or you run DuckDB for temporal and DataFusion for destructive).

Choose DuckDB when:

You need SyncMode::Temporal with atomic UPDATE/DELETE transitions.
You need primary-key enforcement on the OLAP mirror.
You want a single durable file for the OLAP store with a familiar SQL dialect.
You need the DuckDB extension ecosystem (spatial, JSON, HTTPFS, etc.) via raw queries.

Next Steps

Sync Modes — DuckDB atomicity matters most for temporal mode
Configuration Reference — StorageMode, OlapBackendType, and persistence settings
Arrow Flight SQL — streaming OLAP queries over gRPC
