u/FiftyShadesOfBlack

I've been brought on as a data engineering consultant for a small to mid-sized company who has a poorly built architecture in Databricks. There's currently no documentation or clear architecture, so I've been spending weeks trying to untangle everything.

They now want me to start implementing data quality checks because as of now there's no testing within the process at all and they're unsure if their outputs are even correct. Currently the data they want me to test are just raw files uploaded into Databricks tables on an irregular schedule, all with different granularity and logic that will require more complex checks than just null checks and unique primary keys. What is the best starting point for this? They have jobs and jobs that run jobs but no pipelines established, and I don't think I have the power to change that yet, so I think that takes DLT off the table unless I can prove it's worth the refactor.

My first thought was integrating pyspark testing scripts to run within the jobs, but there has to be a more sophisticated way to do this?

reddit.com
u/FiftyShadesOfBlack — 9 days ago

I've been brought on as a data engineering consultant for a small to mid-sized company who has a poorly built architecture in Databricks. There's currently no documentation or clear architecture, so I've been spending weeks trying to untangle everything.

They now want me to start implementing data quality checks because as of now there's no testing within the process at all and they're unsure if their outputs are even correct. Currently the data they want me to test are just raw files uploaded into Databricks tables on an irregular schedule, all with different granularity and logic that will require more complex checks than just null checks and unique primary keys. What is the best starting point for this? They have jobs and jobs that run jobs but no pipelines established, and I don't think I have the power to change that yet, so I think that takes DLT off the table unless I can prove it's worth the refactor.

My first thought was integrating pyspark testing scripts to run within the jobs, but there has to be a more sophisticated way to do this?

reddit.com
u/FiftyShadesOfBlack — 9 days ago