
Architecture

Yoda is a fully in-process HTAP (Hybrid Transactional/Analytical Processing) engine. It embeds a low-latency SQLite write path alongside a high-throughput analytical engine (DuckDB or Apache DataFusion) and bridges the two with trigger-based Change Data Capture (CDC), so analytical queries see a near-real-time view of your OLTP data without a separate server, ETL job, or network hop.

Standard HTAP Mode

Client


HtapEngine (facade — crates/yoda/)

  ├─ SqlParserRouter ──────────────────► AST-based routing (sqlparser-rs)
  │                                       ├── writes / DDL / simple SELECT → OLTP
  │                                       └── aggregates / JOIN / CTE / window → OLAP

  ├─ RusqliteEngine (OLTP)
  │     ├── write connection  (WAL mode, dedicated OS thread)
  │     ├── read pool         (round-robin, 4 connections by default)
  │     └── CDC connection    ──► _yoda_cdc_log (trigger-populated)

  ├─ OlapBackend (enum — crates/yoda-olap/)
  │     ├── DuckDbEngine      (feature: duckdb-backend)
  │     └── DataFusionEngine  (feature: datafusion-backend, default)

  └─ CdcSyncEngine (crates/yoda-sync/)
        ├── polls _yoda_cdc_log (seq watermark)
        ├── SyncMode::Destructive — mirror (UPDATE/DELETE in-place)
        ├── SyncMode::Temporal   — SCD Type 2 (append-only history)
        ├── bulk INSERT via Arrow batch → OlapEngine::load_arrow()
        └── background loop with CancellationToken shutdown
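
The difference between the two sync modes can be illustrated with a small, self-contained sketch. This is not Yoda's implementation: `Change`, `Mirror`, `History`, and `as_of` are hypothetical names, and the sketch assumes that temporal history is a plain append-only list of versions keyed by the CDC sequence number.

```rust
use std::collections::BTreeMap;

/// Hypothetical, simplified change event.
#[derive(Debug, Clone, PartialEq)]
pub enum Change {
    Upsert { pk: u64, value: String },
    Delete { pk: u64 },
}

/// Destructive mode: keep only the current state (UPDATE/DELETE in place).
#[derive(Default)]
pub struct Mirror {
    pub rows: BTreeMap<u64, String>,
}

impl Mirror {
    pub fn apply(&mut self, change: &Change) {
        match change {
            Change::Upsert { pk, value } => {
                self.rows.insert(*pk, value.clone());
            }
            Change::Delete { pk } => {
                self.rows.remove(pk);
            }
        }
    }
}

/// One version of a row; `value == None` marks a deletion (tombstone).
#[derive(Debug, Clone, PartialEq)]
pub struct Version {
    pub pk: u64,
    pub value: Option<String>,
    pub valid_from: u64,
}

/// Temporal mode: every change appends a new version, nothing is overwritten.
#[derive(Default)]
pub struct History {
    pub versions: Vec<Version>,
}

impl History {
    /// Appends a version; callers are assumed to pass increasing `seq` values.
    pub fn apply(&mut self, seq: u64, change: &Change) {
        let version = match change {
            Change::Upsert { pk, value } => Version {
                pk: *pk,
                value: Some(value.clone()),
                valid_from: seq,
            },
            Change::Delete { pk } => Version { pk: *pk, value: None, valid_from: seq },
        };
        self.versions.push(version);
    }

    /// Returns the value of `pk` as it was at sequence number `seq`.
    pub fn as_of(&self, pk: u64, seq: u64) -> Option<&str> {
        self.versions
            .iter()
            .filter(|v| v.pk == pk && v.valid_from <= seq)
            .last()
            .and_then(|v| v.value.as_deref())
    }
}
```

After an insert, an update, and a delete of the same key, the mirror is empty while the history holds three versions and can still answer what the row looked like at each point.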

Sidecar Mode

In sidecar mode the local OLTP write path is optional. Instead, a TimestampCdcConsumer polls an external database (SQLite or PostgreSQL) for rows whose updated_at timestamp exceeds the last recorded watermark:

External DB (SQLite / PostgreSQL)
  │  SELECT … WHERE updated_at > watermark ORDER BY updated_at, pk

TimestampCdcConsumer<S: SourceConnector>  (crates/yoda-sidecar/)
  │  emits CdcEvent stream (Insert / Update / Delete)
  │  watermark persisted in RocksDB (optional)

CdcSyncEngine  (temporal or destructive)
  │  same DML pipeline as standard mode

OlapBackend (DuckDB / DataFusion)

No schema changes are required on the source database — only that each tracked table has an updated_at column (and optionally a deleted_at column for soft-delete detection).
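
As a rough illustration of timestamp-based polling, the sketch below runs the same `updated_at > watermark ORDER BY updated_at, pk` logic over an in-memory slice. `SourceRow`, `Event`, and `Poller` are hypothetical stand-ins rather than the types in yoda-sidecar, and telling Insert from Update via a set of already-seen keys is an assumption made for this example.

```rust
use std::collections::HashSet;

/// Hypothetical source row: only the columns the poller needs.
#[derive(Debug, Clone)]
pub struct SourceRow {
    pub pk: u64,
    pub updated_at: u64,         // e.g. Unix seconds
    pub deleted_at: Option<u64>, // soft-delete marker, if the table has one
}

#[derive(Debug, Clone, PartialEq)]
pub enum Event {
    Insert(u64),
    Update(u64),
    Delete(u64),
}

#[derive(Default)]
pub struct Poller {
    pub watermark: u64,
    seen: HashSet<u64>, // assumption: keys already emitted as Insert
}

impl Poller {
    /// One polling cycle: select rows newer than the watermark, order them
    /// by (updated_at, pk), emit events, and advance the watermark.
    pub fn poll(&mut self, table: &[SourceRow]) -> Vec<Event> {
        let watermark = self.watermark;
        let mut changed: Vec<&SourceRow> =
            table.iter().filter(|r| r.updated_at > watermark).collect();
        changed.sort_by_key(|r| (r.updated_at, r.pk));

        let mut events = Vec::with_capacity(changed.len());
        for row in changed {
            let event = if row.deleted_at.is_some() {
                self.seen.remove(&row.pk);
                Event::Delete(row.pk)
            } else if self.seen.insert(row.pk) {
                Event::Insert(row.pk)
            } else {
                Event::Update(row.pk)
            };
            events.push(event);
            self.watermark = row.updated_at;
        }
        events
    }
}
```

One consequence of the strict `>` comparison is that a row committed with a timestamp equal to the current watermark after a poll has finished would not be picked up by later polls, so timestamp resolution on the source matters.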

Why Two Engines?

SQLite (via rusqlite) excels at low-latency point writes: WAL mode allows sub-millisecond single-row commits, and because the database is embedded there is no network overhead. DuckDB and DataFusion excel at vectorised, scan-heavy queries over large tables: column-oriented storage, predicate pushdown, and SIMD execution can make aggregates orders of magnitude faster than on a row-store database.

Yoda lets each engine do what it does best. The SqlParserRouter makes the routing invisible to the caller — one query() call, no manual dispatch.
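
To make the routing rule concrete, here is a deliberately simplified sketch. It swaps the AST-based analysis (sqlparser-rs) for a keyword scan so that it stays dependency-free; `Target` and `route` are hypothetical names, and the keyword list is an assumption rather than Yoda's actual rule set.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Target {
    Oltp,
    Olap,
}

/// Keyword-based approximation of the routing rule:
/// writes, DDL, and simple SELECTs go to OLTP;
/// aggregates, JOINs, CTEs, and window functions go to OLAP.
pub fn route(sql: &str) -> Target {
    let upper = sql.to_ascii_uppercase();
    let tokens: Vec<&str> = upper
        .split(|c: char| !(c.is_ascii_alphanumeric() || c == '_'))
        .filter(|t| !t.is_empty())
        .collect();

    match tokens.first().copied() {
        Some("WITH") => return Target::Olap, // treat CTEs as analytical
        Some("SELECT") => {}                 // inspect the rest of the query
        _ => return Target::Oltp,            // INSERT, UPDATE, DELETE, DDL, ...
    }

    // Assumed set of keywords that mark a SELECT as analytical.
    const ANALYTICAL: [&str; 9] = [
        "JOIN", "GROUP", "HAVING", "OVER", "COUNT", "SUM", "AVG", "MIN", "MAX",
    ];
    if tokens.iter().any(|t| ANALYTICAL.contains(t)) {
        Target::Olap
    } else {
        Target::Oltp
    }
}
```

A keyword scan misroutes queries whose string literals or column names happen to match a keyword, which is one reason to route on a parsed AST instead.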

Why SQLite for OLTP?

  • Serverless: no separate process, no port, no credentials.
  • WAL mode: readers do not block the writer, and the writer does not block readers.
  • Trigger-based CDC: SQLite's AFTER INSERT/UPDATE/DELETE triggers write compact JSON arrays to _yoda_cdc_log with about 10% less overhead than json_object() equivalents.
  • forbid(unsafe_code): the async wrapper (yoda-tokio-rusqlite) uses dedicated OS threads + crossbeam channels to make rusqlite::Connection (which is Send but !Sync) safely usable from async code with zero unsafe blocks.
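
The dedicated-thread pattern from the last bullet can be sketched with the standard library alone. The example below uses `std::sync::mpsc` in place of crossbeam channels and a `RefCell`-based `FakeConn` in place of a real SQLite connection; `FakeConn`, `ConnHandle`, and `call` are hypothetical names, and `call` blocks on the reply where an async wrapper would await it.

```rust
use std::cell::RefCell;
use std::sync::mpsc;
use std::thread;

/// Stand-in for a connection that must stay on one thread
/// (the RefCell makes it !Sync, so it cannot be shared by reference).
pub struct FakeConn {
    log: RefCell<Vec<String>>,
}

impl FakeConn {
    /// Records the statement and returns how many have run so far.
    pub fn execute(&self, sql: &str) -> usize {
        let mut log = self.log.borrow_mut();
        log.push(sql.to_string());
        log.len()
    }
}

/// A unit of work shipped to the connection thread.
type Job = Box<dyn FnOnce(&FakeConn) + Send + 'static>;

/// Cheap, cloneable handle; the connection itself never leaves its thread.
#[derive(Clone)]
pub struct ConnHandle {
    tx: mpsc::Sender<Job>,
}

impl ConnHandle {
    pub fn open() -> Self {
        let (tx, rx) = mpsc::channel::<Job>();
        thread::spawn(move || {
            // The connection is created and used only on this thread.
            let conn = FakeConn { log: RefCell::new(Vec::new()) };
            // The loop ends when every ConnHandle has been dropped.
            for job in rx {
                job(&conn);
            }
        });
        ConnHandle { tx }
    }

    /// Runs `f` on the connection thread and returns its result.
    pub fn call<R, F>(&self, f: F) -> R
    where
        F: FnOnce(&FakeConn) -> R + Send + 'static,
        R: Send + 'static,
    {
        let (reply_tx, reply_rx) = mpsc::channel();
        let job: Job = Box::new(move |conn: &FakeConn| {
            let _ = reply_tx.send(f(conn));
        });
        self.tx.send(job).expect("connection thread stopped");
        reply_rx.recv().expect("connection thread dropped the reply")
    }
}
```

Because only closures and results cross the channel, no `unsafe` is needed: the type system checks that everything sent is `Send`, and the connection is only ever touched by the thread that created it.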

Why DuckDB or DataFusion for OLAP?

DuckDB brings a battle-tested columnar SQL engine with ACID transaction support, native Arrow Appender bulk-loading (zero-copy), and MVCC-based concurrent reads. All operations run on blocking threads via spawn_blocking.

DataFusion is a pure-Rust, natively async engine with pluggable storage backends (InMemory, Arrow IPC, Parquet, S3, GCS). It integrates naturally with Tokio and streams results without buffering the entire result set (RecordBatchBoxStream). It is the default because it has zero C/C++ dependencies.

See OLAP Backends for a detailed comparison and guidance on which to pick.

How CDC Works

  1. When register_table is called, Yoda installs three SQLite triggers on the target table — AFTER INSERT, AFTER UPDATE, and AFTER DELETE.
  2. Each trigger appends one row to _yoda_cdc_log: a monotonically increasing sequence number, a Unix timestamp, the operation code (I/U/D), the table name, and a JSON array snapshot of the row data.
  3. CdcSyncEngine polls _yoda_cdc_log using a stored watermark (last_synced_seq). On each cycle it fetches up to sync_batch_size events, converts them to OLAP DML, and advances the watermark.
  4. Consecutive INSERTs to the same table are batched into a single Arrow RecordBatch and loaded via OlapEngine::load_arrow() — an Arrow-native path that avoids SQL string construction entirely (5–7× faster than individual INSERT statements for bulk workloads).
  5. Processed events are pruned from _yoda_cdc_log after each successful cycle (prune_after_sync = true by default) to keep the log table small.
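
Steps 3 to 5 can be sketched as a single sync cycle over an in-memory log. This is an illustration under stated assumptions, not the CdcSyncEngine code: `LogRow`, `OlapWork`, `Syncer`, and `run_cycle` are hypothetical, the log is assumed to be ordered by sequence number, and the OLAP load is assumed to succeed before the watermark advances.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Op {
    Insert,
    Update,
    Delete,
}

/// Hypothetical in-memory shape of one _yoda_cdc_log row.
#[derive(Debug, Clone)]
pub struct LogRow {
    pub seq: u64,
    pub op: Op,
    pub table: String,
    pub row_json: String,
}

/// Work handed to the OLAP side: bulk loads or single DML statements.
#[derive(Debug, PartialEq)]
pub enum OlapWork {
    BulkInsert { table: String, rows: Vec<String> },
    Single { op: Op, table: String, row: String },
}

pub struct Syncer {
    pub last_synced_seq: u64,
    pub sync_batch_size: usize,
    pub prune_after_sync: bool,
}

impl Syncer {
    /// Fetches up to `sync_batch_size` events past the watermark, groups
    /// consecutive INSERTs per table, advances the watermark, then prunes.
    pub fn run_cycle(&mut self, log: &mut Vec<LogRow>) -> Vec<OlapWork> {
        let watermark = self.last_synced_seq;
        let batch: Vec<LogRow> = log
            .iter()
            .filter(|e| e.seq > watermark)
            .take(self.sync_batch_size)
            .cloned()
            .collect();

        let mut work: Vec<OlapWork> = Vec::new();
        for e in &batch {
            if e.op == Op::Insert {
                // Extend the previous bulk insert if it targets the same table.
                if let Some(OlapWork::BulkInsert { table, rows }) = work.last_mut() {
                    if *table == e.table {
                        rows.push(e.row_json.clone());
                        continue;
                    }
                }
                work.push(OlapWork::BulkInsert {
                    table: e.table.clone(),
                    rows: vec![e.row_json.clone()],
                });
            } else {
                work.push(OlapWork::Single {
                    op: e.op,
                    table: e.table.clone(),
                    row: e.row_json.clone(),
                });
            }
        }

        // Assumption: the OLAP side applied `work` successfully at this point.
        if let Some(last) = batch.last() {
            self.last_synced_seq = last.seq;
            if self.prune_after_sync {
                let synced = self.last_synced_seq;
                log.retain(|e| e.seq > synced);
            }
        }
        work
    }
}
```

Grouping only consecutive INSERTs preserves the original ordering relative to UPDATE and DELETE events on the same table, at the cost of smaller batches when operations are interleaved.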

For crash-durable CDC event buffering, set rocksdb_cdc_path — SQLite triggers still fire into _yoda_cdc_log, but a bridge drains them atomically into RocksDB before the sync engine reads them. See RocksDB CDC for details.

Key Crates

Crate                  Role
yoda                   HtapEngine facade, HtapConfig, integration tests
yoda-core              Shared traits and types (CdcEvent, SyncMode, QueryTarget, …)
yoda-tokio-rusqlite    Async SQLite wrapper (dedicated thread per connection)
yoda-oltp-rusqlite     RusqliteEngine + CDC trigger setup
yoda-sync              CdcSyncEngine, SqlParserRouter, CDC-to-DML converter
yoda-datafusion        DataFusion OLAP engine with pluggable StorageMode
yoda-duckdb            DuckDB OLAP engine with Arrow Appender bulk-load
yoda-sidecar           TimestampCdcConsumer, watermark store, SourceConnector trait
yoda-flight            Arrow Flight SQL gRPC server (flight-sql feature)
yoda-tui               yd CLI + TUI dashboard

Released under the Apache-2.0 License.