
Realtime CDC streams for DuckLake as a DuckDB extension
Built a DuckDB community extension for realtime CDC streams from DuckLake.
Repo:
https://github.com/ekkuleivonen/ducklake-cdc-extension
It turns DuckLake snapshot changes into durable row-level change streams that can be consumed from SQL or Python.
Current features:
- row-level DML CDC
- DDL/schema change subscriptions
- durable consumers with checkpointing
- per-consumer filtering
- replay from snapshots
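A rough mental model for the consumer-related features above (checkpointing, per-consumer filtering, replay), as a toy in-memory Python sketch. This is not the extension's code or API: `Change`, `ToyConsumer`, and every method name here are hypothetical, and the in-memory list only stands in for a change log derived from snapshots.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Change:
    """One row-level change, tagged with the snapshot that produced it (hypothetical shape)."""
    snapshot_id: int
    table: str
    op: str   # "insert", "update" or "delete"
    row: dict


def accept_all(change: Change) -> bool:
    """Default filter: deliver every change."""
    return True


@dataclass
class ToyConsumer:
    """Hypothetical consumer: remembers how far it has read and what it wants to see."""
    name: str
    wants: Callable[[Change], bool] = accept_all  # per-consumer filter
    checkpoint: int = 0                           # last acknowledged snapshot id

    def read(self, log: List[Change]) -> List[Change]:
        # Deliver only changes newer than the checkpoint that pass the filter.
        return [c for c in log if c.snapshot_id > self.checkpoint and self.wants(c)]

    def commit(self, snapshot_id: int) -> None:
        # Advance the checkpoint; it never moves backwards on commit.
        self.checkpoint = max(self.checkpoint, snapshot_id)

    def replay_from(self, snapshot_id: int) -> None:
        # Rewind so that changes from snapshot_id onward are delivered again.
        self.checkpoint = snapshot_id - 1


log = [
    Change(1, "orders", "insert", {"id": 1, "total": 10}),
    Change(2, "customers", "insert", {"id": 7}),
    Change(3, "orders", "update", {"id": 1, "total": 12}),
]

sink = ToyConsumer("orders_sink", wants=lambda c: c.table == "orders")

batch = sink.read(log)               # the two "orders" changes (snapshots 1 and 3)
sink.commit(batch[-1].snapshot_id)   # acknowledge after processing the batch
caught_up = sink.read(log)           # nothing new past the checkpoint

sink.replay_from(3)                  # rewind to re-deliver snapshot 3
replayed = sink.read(log)
```

Committing only after a batch has been processed means a crash between `read` and `commit` re-delivers that batch on restart, which is the usual at-least-once trade-off for this pattern. Whether the extension behaves exactly this way is not something this sketch claims.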
Example:
SELECT *
FROM cdc_dml_changes_read('lake', 'orders_sink');
Use cases I built this for:
- realtime sinks
- cache invalidation
- event-driven lakehouse automation
- search indexing
- lightweight streaming pipelines without external CDC infra
One design goal was to avoid requiring a separate Python daemon or Kafka setup for simple CDC workflows around DuckLake.
Still early and evolving, but would love feedback from people building on DuckDB/DuckLake.