
Showcasing Medallion - a data pipeline library with caching and storage
When building scraping pipelines, I often have to do multiple processing steps and store the output of each step.
Check out my tool on GitHub: github.com/mcklmo/medallion
I have not used Databricks or PySpark personally, but as far as I understand, they solve essentially the same problem.
I created this library for my own use and re-use it across personal projects. The name medallion is inspired by the Medallion Data Architecture, which traditionally involves Bronze, Silver, and Gold data layers. However, my tool does not enforce any particular number of processing steps - you can add as many as you like.
It saves me a lot of time writing boilerplate. I only have to write the actual scraping code for each step in the pipeline (obtaining source data and transforming input into structured output) and wire it up with the library. The tool orchestrates the steps as a chain, asserts that each step's input type matches the previous step's output type, stores intermediate results for each step, and uses a cache to save computational costs (particularly useful for expensive transformations, such as those involving LLMs).
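To make the idea concrete, here is a minimal sketch of the pattern described above - chained steps with type checks between them, persisted intermediates, and a content-based cache. This is my own illustration, not medallion's actual API; the `Step` and `Pipeline` names and their signatures are assumptions for this example.

```python
import hashlib
import pickle
from pathlib import Path

class Step:
    """One processing step: a function plus its declared input/output types.
    (Hypothetical sketch, not medallion's real interface.)"""
    def __init__(self, name, fn, in_type, out_type):
        self.name, self.fn = name, fn
        self.in_type, self.out_type = in_type, out_type

class Pipeline:
    def __init__(self, steps, store=Path("artifacts")):
        # Assert that each step's input type matches the previous output type.
        for prev, nxt in zip(steps, steps[1:]):
            assert prev.out_type is nxt.in_type, (
                f"{prev.name} outputs {prev.out_type}, "
                f"but {nxt.name} expects {nxt.in_type}"
            )
        self.steps, self.store = steps, store
        store.mkdir(exist_ok=True)

    def run(self, data):
        for step in self.steps:
            # Cache key derived from the step name and its input.
            key = hashlib.sha256(pickle.dumps((step.name, data))).hexdigest()
            path = self.store / f"{step.name}-{key[:12]}.pkl"
            if path.exists():
                # Cache hit: reuse the stored intermediate result.
                data = pickle.loads(path.read_bytes())
            else:
                data = step.fn(data)
                # Persist the intermediate result for this step.
                path.write_bytes(pickle.dumps(data))
        return data

# Example chain: raw scrape -> cleaned rows -> aggregate summary.
bronze = Step("bronze", lambda urls: [u.upper() for u in urls], list, list)
silver = Step("silver", lambda rows: {"count": len(rows)}, list, dict)
pipe = Pipeline([bronze, silver])
print(pipe.run(["a", "b"]))  # {'count': 2}
```

Keying the cache on the step name plus its input means an expensive step (e.g. an LLM call) is skipped on re-runs as long as its input has not changed.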
The tool supports entry points from Python code via the 'PipeLine' type, or from the CLI via the 'medallion' command.
Check it out if you're curious! I'd love to hear your feedback or reflections.