u/botsunny

r/mlops

How do I bring feature engineering pipelines to production?

I'm relatively new to MLOps and I've been tasked with productionising feature engineering code (mostly written in SQL) into Lakeflow Spark Declarative Pipelines (SDP) on Databricks.

The current workflow is tedious: the DS decides the model is ready and hands me the feature logic (huge, complex SQL with many joins and aggregations, covering every feature they've ever researched), and based on the features the model actually needs, I slim the SQL down to output only those features. This is necessary because the project requires features to be served within one hour of raw data being ingested, and a "master" pipeline that continuously computes all features to meet that window turned out to be extremely expensive.
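For context, here's roughly the shape of one slimmed-down, per-model pipeline as a Lakeflow Declarative Pipelines Python definition (a sketch only — all table and column names are invented; the real logic comes from the DS's SQL):

```python
import dlt

# Sketch of one per-model feature table; names invented for illustration.
# In a triggered pipeline, @dlt.table definitions can materialize as
# materialized views that refresh incrementally.
@dlt.table(
    name="model_a_features",
    comment="Only the features model A actually consumes",
)
def model_a_features():
    # dlt.read() references another table defined in the same pipeline;
    # the slimming step amounts to keeping only the columns model A needs.
    wide = dlt.read("feature_intermediates")
    return wide.select("user_id", "order_count_7d", "avg_basket_7d")
```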

As you can guess, with this workflow, whenever the DS updates their model or adds a feature, I have to edit the pipeline code by hand. Even a single added feature can be a lot of work, since its computation may involve many intermediate operations and/or CTEs, and I have to trace back through the original complex logic, which is a PITA.

I'm still new to this, so I'd like to hear any advice or solutions this community may have for approaching this problem, preferably ones that integrate smoothly with Databricks.

ChatGPT suggested a framework where the DS adds feature metadata to a feature registry, each model gets a config file listing its features, and a parser reads the config and auto-generates the pipeline by piecing the feature engineering operations together.
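As far as I can tell, the mechanical part of that idea is straightforward — take the dependency closure of the features a model lists and emit only those steps. Something like this sketch (everything invented: names, registry shape, SQL snippets; `graphlib` is Python stdlib):

```python
from graphlib import TopologicalSorter  # stdlib (Python 3.9+)

# Hypothetical registry: each feature/intermediate declares the SQL that
# computes it and the registry entries it reads from. All names invented.
FEATURE_REGISTRY = {
    "orders_clean": {
        "sql": "SELECT * FROM raw.orders WHERE status <> 'cancelled'",
        "depends_on": [],
    },
    "order_count_7d": {
        "sql": "SELECT user_id, COUNT(*) AS order_count_7d "
               "FROM orders_clean GROUP BY user_id",
        "depends_on": ["orders_clean"],
    },
    "avg_basket_7d": {
        "sql": "SELECT user_id, AVG(total) AS avg_basket_7d "
               "FROM orders_clean GROUP BY user_id",
        "depends_on": ["orders_clean"],
    },
}

def required_steps(model_features: list[str]) -> list[str]:
    """Transitive closure of everything the model's features depend on,
    returned dependencies-first, so a generator can emit one pipeline
    table per step and skip everything else."""
    needed, stack = set(), list(model_features)
    while stack:
        feat = stack.pop()
        if feat not in needed:
            needed.add(feat)
            stack.extend(FEATURE_REGISTRY[feat]["depends_on"])
    order = TopologicalSorter(
        {f: FEATURE_REGISTRY[f]["depends_on"] for f in needed}
    )
    return list(order.static_order())

# The per-model config would be YAML/JSON; inlined here for the sketch.
print(required_steps(["order_count_7d"]))
# -> ['orders_clean', 'order_count_7d']  (avg_basket_7d is never built)
```

If the generator only emits the steps in that list, unneeded features are skipped by construction — the problem becomes one of granularity, i.e. how finely the DS's SQL is broken into registry entries.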

Sounds great on paper, except I can't wrap my head around how a parser could reliably carve up the SQL I actually have without pulling in too many unneeded features (features are often computed together), especially since the code is very complex and I still have to reduce the joins and nesting in each file so that the pipeline's materialized views can refresh incrementally.
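To illustrate the kind of parsing I mean: something like sqlglot (an open-source SQL parser) can report which tables a query actually reads, which seems closer to "reliable" than regexing my way through the files (toy example; the query is invented):

```python
from sqlglot import parse_one, exp

# Toy feature query, invented for illustration.
sql = """
WITH recent AS (
    SELECT user_id, total
    FROM orders_clean
    WHERE order_date >= date_sub(current_date(), 7)
)
SELECT user_id, AVG(total) AS avg_basket_7d
FROM recent
GROUP BY user_id
"""

tree = parse_one(sql, read="databricks")
ctes = {cte.alias for cte in tree.find_all(exp.CTE)}        # defined in-query
tables = {t.name for t in tree.find_all(exp.Table)} - ctes  # real upstreams
print(tables)  # {'orders_clean'}
```

That would at least let the registry's `depends_on` lists be derived and checked automatically instead of maintained by hand, though I'd still have to split the monolithic SQL into registry-sized pieces first.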
