I built a small library for DataFrame schema enforcement - dfguard. Would love to hear your thoughts
For any data engineer/SWE who works a lot with DataFrames: schema checks are boring but often necessary. I was looking at pandera for a small project but got annoyed that it has its own type system. If I'm writing PySpark, I already know pyspark.sql.types - why should I have to learn pandera's equivalent? (A few libs follow this approach.) And libs like great_expectations felt like overkill.
I wanted something light that enforces schema checks at function call time using the types I already use. And I did NOT want to explicitly call some schema validation function repeatedly - the project ends up peppered with them everywhere. A project-level setting should enable schema checks wherever the appropriate type annotation is present.
So I built dfguard (PyPI: https://pypi.org/project/dfguard/). It checks that a DataFrame passed to a function matches the expected schema, using whatever types your library already uses.
PySpark, pandas, and Polars are supported. It looks only at the DataFrame's schema metadata (not the data) and validates it when a function is called, based on type annotations.
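For anyone unfamiliar with annotation-driven checks, here's the general mechanism in a minimal hand-rolled pandas sketch - this is NOT dfguard's code or API, just an illustration of the idea (the `check_schema` decorator and its dtype-string format are made up for this example):

```python
import functools
import pandas as pd

def check_schema(expected: dict[str, str]):
    """Hypothetical sketch: validate a DataFrame argument's dtypes at call time."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(df: pd.DataFrame, *args, **kwargs):
            for col, dtype in expected.items():
                if col not in df.columns:
                    raise TypeError(f"missing column: {col}")
                if str(df[col].dtype) != dtype:
                    raise TypeError(f"{col}: expected {dtype}, got {df[col].dtype}")
            return func(df, *args, **kwargs)
        return wrapper
    return decorator

@check_schema({"id": "int64", "name": "object"})
def process(df: pd.DataFrame) -> int:
    return len(df)
```

A call like `process(pd.DataFrame({"id": [1], "name": ["a"]}))` passes, while a mismatched dtype raises at call time - before any data is touched.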
Some things I enjoyed or learned while building:
- If you have a packaged data pipeline, dfg.arm() in your package's __init__.py covers every dfguard schema-annotated DataFrame argument - no decorator on each function.
- pandas was annoying - dtype is 'object' for strings, lists, dicts, everything. Ended up recommending `pd.ArrowDtype` for users who need precise nested types in pandas.
- Docs have examples for Airflow and Kedro if you're using those.
pip install 'dfguard[pandas]' pyarrow
pip install 'dfguard[polars]'
pip install 'dfguard[pyspark]'
This quickstart should cover everything for anyone who's interested in trying it out.
Curious to hear any thoughts, or if there's a feature you'd like to see added. If you try it out, I'd be ecstatic.
Edit --
For anyone curious about how easy it is (the quickstart page has minimal examples for most things):
Only 3 things to do:
1. Import the lib.
2. Declare the schema of your DataFrames with the types compatible with your DataFrame library (2 ways to do it, depending on circumstance). There's also a function to assign an existing DataFrame's schema to a dfguard object so you can use it directly, though I wouldn't recommend that for data pipelines.
3. Decorate the function if you're using notebooks/scripts, or, if you have a packaged data pipeline, include one line in your __init__.py. That's it!
Good practice: keep schema.py files in your packages and import all the schemas for the DataFrames you want enforced.
By default, extra columns are allowed; subset=False in the "enforce" functions makes it strict.
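For context on what the two modes amount to, here's a conceptual pandas sketch - not dfguard's code; the `subset` flag name comes from the post, everything else (the `schema_matches` helper, the dtype-string format) is illustrative:

```python
import pandas as pd

def schema_matches(df: pd.DataFrame, expected: dict[str, str], subset: bool = True) -> bool:
    """subset=True: df may carry extra columns; subset=False: exact column set required."""
    declared = {c: str(df[c].dtype) for c in df.columns if c in expected}
    if declared != expected:          # missing column or dtype mismatch
        return False
    if not subset and set(df.columns) != set(expected):
        return False                  # strict mode rejects extra columns
    return True

df = pd.DataFrame({"id": [1], "name": ["a"], "extra": [0.5]})
print(schema_matches(df, {"id": "int64", "name": "object"}))                # True
print(schema_matches(df, {"id": "int64", "name": "object"}, subset=False))  # False
```

The subset default is the friendlier one for pipelines, since upstream steps often add columns the downstream contract doesn't care about.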
Shameless plug: if you like it, consider starring the repo.