u/mmscoin

New open-source repo for standardization of messy data into geoparquet
▲ 16 r/remotesensing+2 crossposts

I've been working on an open-source geospatial ETL prototype called Dymium. It focuses on standardizing fragmented geological datasets into ML-ready GeoParquet outputs.

Current pipeline handles:

  • MRDS (USGS Mineral Resources Data System) ingestion and normalization
  • geological PDF extraction
  • cross-source dataset fusion
  • spatial geology enrichment
  • GeoParquet export
  • lightweight Streamlit visualization
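
As a rough illustration of what the normalization step involves (a sketch only, not Dymium's actual code): each source names its columns differently, so records are mapped onto one canonical schema before export. The alias table, the canonical field names, and the `normalize_record` helper below are all hypothetical.

```python
from typing import Any, Optional

# Hypothetical alias table: canonical field -> column names seen across sources.
FIELD_ALIASES = {
    "site_name": ("site_name", "name", "deposit_name"),
    "latitude": ("latitude", "lat", "y"),
    "longitude": ("longitude", "lon", "long", "x"),
    "commodity": ("commodity", "commod1", "primary_commodity"),
}


def _to_float(value: Any) -> Optional[float]:
    """Coerce a raw value to float, returning None when it cannot be parsed."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return None


def normalize_record(raw: dict, source: str) -> dict:
    """Map one raw record onto the canonical schema, keeping its provenance."""
    # Compare column names case-insensitively and ignore stray whitespace.
    lowered = {str(k).strip().lower(): v for k, v in raw.items()}
    out = {"source": source}
    for canonical, aliases in FIELD_ALIASES.items():
        value = None
        for alias in aliases:
            if lowered.get(alias) not in (None, ""):
                value = lowered[alias]
                break
        out[canonical] = value
    # Coordinates become floats; unparseable values become None, not a guess.
    out["latitude"] = _to_float(out["latitude"])
    out["longitude"] = _to_float(out["longitude"])
    return out


record = normalize_record(
    {"Name": "Test Pit", "LAT": "45.5", "Long": "-120.25", "commod1": "Copper"},
    source="mrds",
)
```

Rows normalized this way could then be loaded into a geopandas `GeoDataFrame` and written out with its `to_parquet` method, assuming geopandas and pyarrow are installed.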

My main motivation was seeing how much mineral/geological data is still trapped across inconsistent schemas, PDFs, shapefiles, and legacy formats.

It's still very early-stage and intentionally scoped around the data-standardization layer rather than full modeling. The README includes current limitations, uncertainty handling examples, and demo outputs.

I'd appreciate feedback from GIS, geospatial, and data engineering people, especially around:

  • schema normalization approaches
  • GeoParquet workflows
  • geology layer enrichment
  • ingestion validation
  • interoperability issues across jurisdictions
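
To give the ingestion-validation question something concrete to react to, here is a minimal sketch of one possible approach, assuming WGS84 latitude/longitude in decimal degrees. It is not taken from the repo; the `validate_point` helper and the issue labels are hypothetical. It returns a list of flags instead of raising, so questionable records can be kept and marked rather than dropped.

```python
import math
from typing import List, Optional


def validate_point(lat: Optional[float], lon: Optional[float]) -> List[str]:
    """Return a list of issue labels for one coordinate pair (empty if clean)."""
    if lat is None or lon is None or math.isnan(lat) or math.isnan(lon):
        return ["missing_coordinates"]
    issues = []
    if not -90.0 <= lat <= 90.0:
        issues.append("latitude_out_of_range")
    if not -180.0 <= lon <= 180.0:
        issues.append("longitude_out_of_range")
    # (0, 0) is usually a placeholder for "unknown", not a real site.
    if lat == 0.0 and lon == 0.0:
        issues.append("null_island")
    return issues


clean = validate_point(45.5, -120.25)
suspect = validate_point(120.0, 45.0)  # latitude beyond 90, possibly swapped axes
```

A flag like `latitude_out_of_range` combined with a longitude that would be a valid latitude is a common sign of swapped axes, which is worth checking before fusing sources.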

Repo:
https://github.com/Nebula-Dust/Dymium

u/mmscoin — 3 days ago

I'm an MLOps engineer looking to train models on processing, recycling, and data integration for neodymium and praseodymium (NdPr), to improve things like the classification of NdPr-bearing waste streams or the prediction of recovery difficulty. DM me if you're interested!

u/mmscoin — 12 days ago