While building a local data warehouse on GitHub Archive data, we ran into a slightly frustrating DuckDB behavior: columns that happen to contain only NULL values in the first file DuckDB reads get inferred as the generic JSON type, and when real values show up in later files, the staging layer breaks.
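A minimal repro sketch of the behavior as we understand it (the filenames and the `payload` column are made up for illustration):

```python
import json
import duckdb

# First file: "payload" is entirely NULL here, so its type gets
# inferred as the generic JSON type (hypothetical minimal repro).
with open("2024-01-01-0.json", "w") as f:
    f.write(json.dumps({"id": "1", "type": "PushEvent", "payload": None}) + "\n")

# A later file where real values for "payload" show up.
with open("2024-01-01-1.json", "w") as f:
    f.write(json.dumps({"id": "2", "type": "PushEvent", "payload": {"size": 3}}) + "\n")

con = duckdb.connect()
# Staged one file at a time, the inferred schemas disagree:
# "payload" is JSON in the first DESCRIBE and STRUCT(...) in the second.
print(con.sql("DESCRIBE SELECT * FROM read_json_auto('2024-01-01-0.json')"))
print(con.sql("DESCRIBE SELECT * FROM read_json_auto('2024-01-01-1.json')"))
```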
Our fix is a synthetically generated canonical sample: a single JSON file with at least one non-null value for every column. It gets passed to read_json_auto alongside every real archive file, ensuring stable type inference from the start, and a WHERE clause filters it back out of the actual results.
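A rough sketch of the trick, reusing the files from above; the `_canonical` sentinel value is hypothetical, and the real sample carries a non-null value for all ~700 attributes rather than the three shown here:

```python
import json
import duckdb

# Hypothetical canonical sample: one row with a non-null value for
# every column, so inference never sees an all-NULL column. We tag it
# with a made-up sentinel value in "type" so it can be filtered out.
with open("_canonical_sample.json", "w") as f:
    f.write(json.dumps({"id": "0", "type": "_canonical",
                        "payload": {"size": 0}}) + "\n")

con = duckdb.connect()
# Read the sample alongside a real archive file, then drop it again.
print(con.sql("""
    SELECT *
    FROM read_json_auto(['_canonical_sample.json', '2024-01-01-1.json'])
    WHERE type <> '_canonical'
"""))
```

Putting the sample first in the file list means it sits inside the rows DuckDB samples for inference, so every column already has a concrete value before any real file is considered.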
The project itself is a fully local data warehouse on GitHub Archive: highly variable JSON events with ~700 attributes, schema discovery in Python that auto-generates dbt models, and a star schema with mart tables that Rill can explore interactively at sub-second speed across 25M+ events.
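For context, a stripped-down illustration of the idea behind the discovery step (not the actual generator): ask DuckDB for the unified schema with the canonical sample included, then emit a dbt staging model that pins an explicit cast on every column so later files can't drift.

```python
import duckdb

con = duckdb.connect()

# Infer the unified schema once, canonical sample included, so every
# column has a concrete type (file list reuses the sketches above).
cols = con.sql("""
    DESCRIBE SELECT *
    FROM read_json_auto(['_canonical_sample.json', '2024-01-01-*.json'])
""").fetchall()

# Generate a staging model with one explicit cast per discovered column.
select_list = ",\n    ".join(
    f'"{name}"::{dtype} AS "{name}"' for name, dtype, *_ in cols
)
print(f"SELECT\n    {select_list}\nFROM {{{{ source('gharchive', 'events') }}}}")
```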
This feels like a problem others must have hit too. Curious how you've handled it, or whether there's a cleaner DuckDB-native solution we missed.