u/Away-Excitement-5997

▲ 1 r/learndatascience+1 crossposts

When companies store massive amounts of data, they often use something called a data lake, which is basically dumping files like Parquet or CSV into cheap cloud storage. Sounds great in theory, but in practice it turns into a swamp pretty fast.

Updating a single row can take 47 minutes because the system has to rewrite entire files. There are no real transactions, so readers can see half-finished writes. There is no audit trail and no way to roll back if something breaks.

This post explains these 5 problems and how a tool called Apache Hudi fixes them by adding a smart layer on top of your lake. The goal is to help you understand the real problems that come up when working with data at scale and how engineers solve them.
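
For anyone curious what that "smart layer" looks like in practice, here is a rough PySpark sketch of a Hudi upsert. The table name, fields, and S3 path are made up for illustration, and it assumes the Hudi Spark bundle is on the classpath:

```python
# Minimal sketch: upserting one changed record into a Hudi table.
# Instead of rewriting whole files by hand, Hudi locates the file group the
# record lives in and commits the change atomically on its timeline.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("hudi-upsert-demo")
    # Hudi needs the Kryo serializer and its Spark bundle on the classpath
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .getOrCreate()
)

# One updated record (hypothetical schema)
updates = spark.createDataFrame(
    [("user-42", "new-email@example.com", 1717000000)],
    ["user_id", "email", "ts"],
)

(
    updates.write.format("hudi")
    .option("hoodie.table.name", "users")
    .option("hoodie.datasource.write.recordkey.field", "user_id")  # record key
    .option("hoodie.datasource.write.precombine.field", "ts")      # newest version wins
    .option("hoodie.datasource.write.operation", "upsert")
    .mode("append")
    .save("s3://my-bucket/lake/users")  # hypothetical path
)
```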

u/Away-Excitement-5997 — 6 days ago
▲ 1 r/learndatascience+1 crossposts

A short explainer breaking down the two storage types in Apache Hudi, Copy-on-Write (CoW) and Merge-on-Read (MoR), and when to pick each one.

CoW rewrites the entire base file on every upsert, which makes reads fast but writes expensive. MoR appends delta logs and merges them at query time, so writes are cheap but reads pay the cost later. Compaction is what brings MoR back in line by merging those deltas into a fresh base file.
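
Roughly how that choice shows up in a Spark write: the only thing that changes is the table type option. This is a sketch with made-up table names and paths, reusing the `spark` session from the upsert example above:

```python
# Same upsert, two table types. COPY_ON_WRITE rewrites base Parquet files on
# every write (fast reads, pricey writes); MERGE_ON_READ appends delta logs
# and merges them at query/compaction time (cheap writes, slower reads).

# `df` stands in for an incoming batch of events (hypothetical schema)
df = spark.createDataFrame([("e-1", "click", 1717000001)], ["event_id", "kind", "ts"])

common = {
    "hoodie.table.name": "events",
    "hoodie.datasource.write.recordkey.field": "event_id",
    "hoodie.datasource.write.precombine.field": "ts",
    "hoodie.datasource.write.operation": "upsert",
}

# Copy-on-Write table
(df.write.format("hudi").options(**common)
   .option("hoodie.datasource.write.table.type", "COPY_ON_WRITE")
   .mode("append").save("s3://my-bucket/lake/events_cow"))

# Merge-on-Read table (compaction later folds the delta logs into new base files)
(df.write.format("hudi").options(**common)
   .option("hoodie.datasource.write.table.type", "MERGE_ON_READ")
   .mode("append").save("s3://my-bucket/lake/events_mor"))
```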

The post also covers how the Hudi timeline works and why it matters for time travel and versioning.
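
As a rough illustration of the timeline idea: every commit becomes an instant on the timeline, and you can read the table as of any earlier instant. Path and timestamp here are made up:

```python
# Latest snapshot of the table
latest = spark.read.format("hudi").load("s3://my-bucket/lake/events_cow")

# Time travel: read the table as it looked at an earlier point on the timeline
as_of_earlier = (
    spark.read.format("hudi")
    .option("as.of.instant", "2024-06-01 00:00:00")  # or a commit instant id
    .load("s3://my-bucket/lake/events_cow")
)
```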

u/Away-Excitement-5997 — 8 days ago