Change Data Capture (CDC)

Change Data Capture (CDC) replicates individual inserts, updates, and deletes from a source system into your warehouse in real time -- instead of re-extracting entire tables on every run.

Skippr reads native change logs (PostgreSQL WAL, MySQL binlog, MongoDB change streams, DynamoDB Streams, Kafka Debezium envelopes) and applies each mutation to the destination with exactly-once semantics.

How it works

Source Database

  │  log reader ── reads native change log (WAL / binlog / stream)

Skippr WAL (local)

  │  segment commit ── Arrow IPC + CDC metadata (mutation kind, order token)

Committed Segment

  │  idempotent apply ── MERGE with order-token guard + tombstone check

Destination Warehouse
  ├── _skippr_order_token column (stale-write rejection)
  └── _skippr_tombstones_{table} (anti-resurrect protection)

Each change carries a mutation kind (insert, update, or delete) and a lexicographically sortable order token derived from the source's native log position (e.g. PostgreSQL LSN, MySQL binlog file + position, MongoDB resume token).
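To make the idea of a lexicographically sortable order token concrete, here is a hypothetical sketch (not Skippr's actual implementation) of encoding a PostgreSQL LSN so that plain string comparison agrees with WAL order. The function name and fixed-width encoding are illustrative assumptions.

```python
# Hypothetical sketch: derive a lexicographically sortable order token
# from a PostgreSQL LSN. An LSN like "16/B374D848" is two hex numbers;
# zero-padding each half to a fixed width makes string comparison
# agree with log position order.

def lsn_to_order_token(lsn: str) -> str:
    """Encode a PostgreSQL LSN ("hi/lo" in hex) as a fixed-width token."""
    hi, lo = lsn.split("/")
    return f"{int(hi, 16):016x}{int(lo, 16):016x}"

earlier = lsn_to_order_token("16/B374D848")
later = lsn_to_order_token("17/00000010")
assert earlier < later  # string order matches WAL order
```

Other sources follow the same pattern: MySQL tokens can combine the binlog file sequence with the byte position, and MongoDB resume tokens are already ordered byte strings.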

At the destination, Skippr uses these to guarantee correctness:

  • Upsert-if-newer -- a row is only written if its order token is greater than the existing token. Stale or replayed writes are silently discarded.
  • Tombstone anti-resurrect -- deletes are recorded in a per-table tombstone table. A later insert for a deleted key is blocked unless its order token proves it occurred after the delete.
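The two rules above can be sketched in a few lines of Python. This is an illustrative model only: the dicts stand in for the destination table and the `_skippr_tombstones_{table}` table, and the function and return values are hypothetical, not Skippr's API.

```python
# Illustrative model of upsert-if-newer + tombstone anti-resurrect.
rows = {}        # key -> (order_token, payload); stands in for the destination table
tombstones = {}  # key -> order_token of the delete; stands in for the tombstone table

def apply_change(key, kind, token, payload=None):
    # Tombstone anti-resurrect: block writes that predate a recorded delete.
    if kind != "delete" and key in tombstones and token <= tombstones[key]:
        return "blocked_by_tombstone"
    if kind == "delete":
        tombstones[key] = max(token, tombstones.get(key, ""))
        rows.pop(key, None)
        return "deleted"
    # Upsert-if-newer: discard stale or replayed writes.
    if key in rows and token <= rows[key][0]:
        return "discarded_stale"
    rows[key] = (token, payload)
    return "applied"

apply_change(1, "insert", "0005", {"id": 1})  # applied
apply_change(1, "delete", "0007")             # deleted; tombstone recorded
apply_change(1, "insert", "0006", {"id": 1})  # blocked: older than the delete
apply_change(1, "insert", "0008", {"id": 1})  # applied: provably after the delete
```

Note that replaying the same change a second time is a no-op under either rule, which is what makes the apply step idempotent.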

Enabling CDC

CDC requires two pieces of configuration in your skippr.yaml:

  1. Set cdc_enabled: true on the source
  2. Add a cdc: pipeline block with your business key columns

Skippr automatically infers and enforces the strongest CDC guarantee your source/sink pair supports -- you never need to specify a guarantee level.

```yaml
project: my_cdc_pipeline

source:
  kind: postgres
  host: localhost
  port: 5432
  user: replicator
  password: ${POSTGRES_PASSWORD}
  database: mydb
  cdc_enabled: true

warehouse:
  kind: snowflake
  database: ANALYTICS
  schema: RAW
  warehouse: COMPUTE_WH

cdc:
  business_key_columns:
    - id
```

See CDC Configuration for the full reference.

Supported sources

| Source     | CDC Mechanism                       | Details                    |
|------------|-------------------------------------|----------------------------|
| PostgreSQL | WAL logical replication (pgoutput)  | CDC Sources -- PostgreSQL  |
| MySQL      | Binlog replication                  | CDC Sources -- MySQL       |
| MongoDB    | Change streams                      | CDC Sources -- MongoDB     |
| DynamoDB   | DynamoDB Streams                    | CDC Sources -- DynamoDB    |
| Kafka      | Debezium envelope parsing           | CDC Sources -- Kafka       |

Supported destinations

All warehouse destinations support CDC with exactly-once MERGE semantics:

| Destination | MERGE Strategy                          | Details                         |
|-------------|-----------------------------------------|---------------------------------|
| Snowflake   | MERGE DML                               | CDC Destinations -- Snowflake   |
| BigQuery    | MERGE DML                               | CDC Destinations -- BigQuery    |
| PostgreSQL  | Staging table + INSERT ... ON CONFLICT  | CDC Destinations -- PostgreSQL  |
| Redshift    | Staging table + MERGE                   | CDC Destinations -- Redshift    |
| ClickHouse  | ReplacingMergeTree                      | CDC Destinations -- ClickHouse  |
| Databricks  | Unity Catalog MERGE                     | CDC Destinations -- Databricks  |
| Synapse     | MERGE via Tiberius                      | CDC Destinations -- Synapse     |
| MotherDuck  | DuckDB MERGE                            | CDC Destinations -- MotherDuck  |

Further reading