u/ArrowBacon

SCD2 overkill?

I'm currently designing a medallion implementation. We've settled on a pattern where bronze is raw data, silver is source aligned but cleansed (eg standard data types, schema drift logic etc), and gold has two parts: 1) enterprise data model (merging sources) and 2) star schemas for reporting, based on the EDM.

I am then looking at history requirements and think we may need SCD2 implemented at silver (for source aligned history, and warehouse backup), at the EDM (for enterprise wide history) and in the star schema (analytical history).

This feels slightly like overkill but I can't see a way to reduce effort without losing the ability to recreate all the layers. Any advice please?

reddit.com
u/ArrowBacon — 1 day ago

Hi, I'm part of a team currently designing a greenfield Fabric instance at my organisation. We definitely want to pursue a metadata driven ELT framework, at least for the ingest>Bronze>Silver parts of the medallion.

My initial thought was this should be done using a Fabric SQL DB with tables to help orchestrate the jobs, record outputs etc. A colleague has suggested using YAML or JSON files instead. My current feeling is the file-based approach has value in keeping the definitions "together" rather than spread across multiple tables, has good version control and could be promoted easily between environments. But it also lacks the validation that can easily be enforced in a relational database.

Does anyone have practical experience and a view on this? Bronze/Silver will both be Lakehouses.

reddit.com
u/ArrowBacon — 14 days ago