I can't find an answer online anywhere, so I'm asking here in case anyone has had a similar experience or has a workaround.
The managers of the source database have imposed a constraint: the ingestion process can only hit the database once per day, not once per environment per day. So basically I have to ingest into bronze once, in production, and then copy that data into the UAT and dev environments within Databricks.
My solution was a job that runs `CREATE OR REPLACE TABLE uat.schema.table SHALLOW CLONE prod.schema.table` for every table, once per day, after the ingestion into prod is finished. This way the data is available everywhere, with proper isolation between the environments (outside of this job), without the need to physically copy the data.
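For context, the daily job boils down to statements like this, one pair per table (catalog/schema/table names below are placeholders, not my real ones):

```sql
-- Run after the prod ingestion completes; repeated for every bronze table
CREATE OR REPLACE TABLE uat.bronze.orders
  SHALLOW CLONE prod.bronze.orders;

CREATE OR REPLACE TABLE dev.bronze.orders
  SHALLOW CLONE prod.bronze.orders;
```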
The issue is that this data is consumed by Spark declarative pipelines. Although the ingestion is append-only, occasional VACUUM and OPTIMIZE operations delete physical files. After the cloning happens, the lower-environment table has a single record in its history for the CLONE operation, which "merges" all operations that happened during the day - writes, VACUUM and OPTIMIZE alike. When the UAT/dev pipelines then try to run, they fail with DELTA_SOURCE_TABLE_IGNORE_CHANGES: "there have been non-append changes to the data".
It's easily reproducible: take a table, shallow clone it, and run a streaming pipeline on the clone (in my case it uses the Auto CDC function). Now do an append-only write to the original, then run OPTIMIZE on it (in a way that deletes at least one file). Run `CREATE OR REPLACE ... SHALLOW CLONE` again, then try to run the pipeline again - you get the error. Note that a pipeline defined against the base table runs fine; only the clone fails, even though it's the same data (a literal clone).
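The repro, sketched as SQL (table names are placeholders, and the streaming pipeline itself is defined elsewhere with Auto CDC reading from the dev table):

```sql
-- 1. Base table receiving append-only writes
CREATE TABLE prod.bronze.events (id INT, payload STRING);
INSERT INTO prod.bronze.events VALUES (1, 'a');

-- 2. First clone; the streaming pipeline reads from this and succeeds
CREATE OR REPLACE TABLE dev.bronze.events
  SHALLOW CLONE prod.bronze.events;

-- 3. Append to the base, then OPTIMIZE so at least one file is removed
INSERT INTO prod.bronze.events VALUES (2, 'b');
OPTIMIZE prod.bronze.events;

-- 4. Re-clone: the day's append and OPTIMIZE collapse into one CLONE commit
CREATE OR REPLACE TABLE dev.bronze.events
  SHALLOW CLONE prod.bronze.events;

-- 5. Running the pipeline against dev.bronze.events now fails with
--    DELTA_SOURCE_TABLE_IGNORE_CHANGES, while the same pipeline against
--    prod.bronze.events keeps working.
```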
It seems the issue comes from the clone operation not preserving history metadata about the multiple operations it merges into the single CLONE entry in the cloned table's history. So I can understand where the error comes from - the pipeline has no way to prove the clone's history is append-only. But this feels kinda... wrong? It should be able to run.
Is this the intended behavior of clone - that it just doesn't work with streaming/declarative pipelines/Auto CDC? Or am I doing something stupid, or is this an anti-pattern?