u/Certain_Leader9946

I maintain Apache Spark Connect for Golang so I added streaming and built a Data Lake ORM
r/apachespark

Quick context for people who haven't touched it: Apache Spark Connect is the gRPC surface Spark exposes so you can run Spark SQL against a cluster without bundling a JVM in your app. The official Go client is apache/spark-connect-go, and I've been contributing upstream for a while. I shipped SPARK-52780, which adds streaming reads, so you can pull large result sets into your application code without OOMing, and build Go streaming systems on top of it.

I've built successful products on my own fork of spark-connect-go against Databricks, and I thought it was worth sharing the fruit of my labour.

I also think a mindset that's catching on is using Spark for 'data contracts'. This works now because Spark Connect is push-based and commit semantics have improved, so the technical reasons for a bronze layer to exist are less justifiable. We're still writing bronze layers because the pattern calcified.

That is, most of the "dirty data" and "broken pipeline" work I get paid to fix as a contractor is janitorial cleanup of a landing zone that didn't need to exist in the first place. Validate at the application boundary using the type system, write straight to silver, and the whole bronze tier becomes dead weight.
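A minimal sketch of what "validate at the application boundary" means in Go (the type and function names here are mine, not lake-orm's): the struct is the contract, and a validation hook gates the write, so bad records are rejected before they land rather than parked in bronze for later cleanup.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// Event is the contract: if a value of this type exists and
// validates, it is silver-quality by construction.
type Event struct {
	UserID string
	Amount int64 // cents, to avoid float rounding in money fields
}

func (e Event) Validate() error {
	if strings.TrimSpace(e.UserID) == "" {
		return errors.New("event: empty user id")
	}
	if e.Amount < 0 {
		return errors.New("event: negative amount")
	}
	return nil
}

// writeSilver gates the write on validation: invalid records are
// rejected at the boundary instead of dumped into a landing zone.
func writeSilver(sink *[]Event, e Event) error {
	if err := e.Validate(); err != nil {
		return err
	}
	*sink = append(*sink, e)
	return nil
}

func main() {
	var silver []Event
	fmt.Println(writeSilver(&silver, Event{UserID: "u1", Amount: 250})) // <nil>
	fmt.Println(writeSilver(&silver, Event{UserID: "", Amount: 10}))    // event: empty user id
	fmt.Println(len(silver))                                            // 1
}
```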

So one of the spin-out projects that came from this is lake-orm. The vision is to stop losing sleep over the same bug. Nearly every system I've worked with that touches a data lake writes the same struct-to-Parquet plumbing, the same ingestion code, the same validation glue, the same "oops, I realised my metadata is dirty and we need to clean it". Append, merge by key, then find out someone wrote something bad to 'bronze' that fails a data-quality check nobody really thought about.
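The "append, merge by key" plumbing is exactly the kind of thing that gets hand-rolled over and over. A generic last-write-wins upsert in Go might look like this (my own sketch of the pattern, not lake-orm's implementation):

```go
package main

import "fmt"

// mergeByKey upserts incoming records into existing ones with
// last-write-wins semantics per key — the MERGE logic teams keep
// rewriting around Parquet appends.
func mergeByKey[T any](existing, incoming []T, key func(T) string) []T {
	out := make([]T, len(existing))
	copy(out, existing)
	index := make(map[string]int, len(out))
	for i, r := range out {
		index[key(r)] = i
	}
	for _, r := range incoming {
		if i, ok := index[key(r)]; ok {
			out[i] = r // existing key: update in place
		} else {
			index[key(r)] = len(out)
			out = append(out, r) // new key: append
		}
	}
	return out
}

type User struct {
	ID   string
	Name string
}

func main() {
	existing := []User{{"a", "Alice"}, {"b", "Bob"}}
	incoming := []User{{"b", "Bobby"}, {"c", "Carol"}}
	merged := mergeByKey(existing, incoming, func(u User) string { return u.ID })
	fmt.Println(len(merged))     // 3
	fmt.Println(merged[1].Name)  // Bobby
}
```

Declaring this once per model, instead of per pipeline, is the "stop rewriting the same code" argument in miniature.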

In my mind the ORM wants to provide a batteries-included approach, which to me just means 'stop rewriting the same code for data pipelines and just declare the models you care about'. In most situations I've seen working in the wild of mega-orgs, this pattern works. Teams really just want to define almost document-like object storage quickly; data quality matters more than the semantics, and where the semantics count, what matters most is that they're clear and that the data is partitioned in a sensible way.

Sometimes I've worked in 500-100B row systems where the big blocker to hitting the ground running on day one is just grokking the twenty-table join behind a concept nobody documented. So I want to shift future clients towards a contract-driven approach, which aligns with the other half of my career building lean typed data platforms (often in SQLC). I have fairly strong opinions as an engineer about this, and I'm happy to answer questions about my general thought process here.

So anyway, for the base simple case: you provide Go structs tagged with spark:"..." and validators, and they become Iceberg or Delta tables on object storage. Writes go direct to silver via an object-storage fast path. Reads stream back with constant memory. Joins and aggregates use a CQRS-shaped output struct. The whole thing works with Databricks, and it's a not-for-profit passion project I thought was worth shouting about. I'm not asking you to use the ORM, or to like it, but I'm really passionate about the job I do and I wanted to let you know it exists. Contributions are super welcome.
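To illustrate the declarative shape (the spark:"..." tag is from the post; the reflection-based schema reader below is my own toy, not lake-orm's API): the struct plus its tags is the whole table declaration, and tooling can derive columns and partitioning from it.

```go
package main

import (
	"fmt"
	"reflect"
	"strings"
)

// Trade is a model in the style the post describes: the struct
// and its spark:"..." tags declare the lake table.
type Trade struct {
	Symbol string  `spark:"symbol,partition"`
	Price  float64 `spark:"price"`
	Qty    int64   `spark:"qty"`
}

// schemaOf reads the spark tags off a struct type and returns the
// column names — a toy version of tag-driven schema derivation.
func schemaOf(v any) []string {
	t := reflect.TypeOf(v)
	cols := make([]string, 0, t.NumField())
	for i := 0; i < t.NumField(); i++ {
		tag := t.Field(i).Tag.Get("spark")
		if tag == "" {
			continue // untagged fields are not columns
		}
		cols = append(cols, strings.Split(tag, ",")[0])
	}
	return cols
}

func main() {
	fmt.Println(schemaOf(Trade{})) // [symbol price qty]
}
```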

Both repos:

Let me know your thoughts on either project. Happy Coding!

EDIT: PS, for the PySpark haters out there, I made DataFrames typed in a way analogous to Spark's Dataset API.
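Go generics make the "typed DataFrame" idea concrete; a Dataset[T]-style wrapper could look like this (an illustrative shape only, not the project's actual API): transformations are checked by the compiler instead of failing at runtime on misspelled column names.

```go
package main

import "fmt"

// Dataset is a typed view over rows, analogous to Spark's
// Dataset[T]: the element type travels through transformations.
type Dataset[T any] struct{ rows []T }

// Filter keeps rows matching the predicate; the predicate is
// type-checked against T at compile time.
func (d Dataset[T]) Filter(pred func(T) bool) Dataset[T] {
	out := make([]T, 0, len(d.rows))
	for _, r := range d.rows {
		if pred(r) {
			out = append(out, r)
		}
	}
	return Dataset[T]{rows: out}
}

// Map is a free function because Go methods can't introduce new
// type parameters; it changes the element type from T to U.
func Map[T, U any](d Dataset[T], f func(T) U) Dataset[U] {
	out := make([]U, 0, len(d.rows))
	for _, r := range d.rows {
		out = append(out, f(r))
	}
	return Dataset[U]{rows: out}
}

type Order struct {
	ID    string
	Total int64
}

func main() {
	ds := Dataset[Order]{rows: []Order{{"o1", 100}, {"o2", 5}}}
	big := ds.Filter(func(o Order) bool { return o.Total >= 50 })
	ids := Map(big, func(o Order) string { return o.ID })
	fmt.Println(ids.rows) // [o1]
}
```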
