r/databricks

Data Quality on Databricks design

Hey, I'm deciding between DQX and Deequ for data quality on Databricks, or maybe even using both. I think Deequ is amazing because of AnomalyCheck, which lets us compare batch to batch and keep the data flow consistent over time (very underappreciated), while DQX shines at row-level detection. How did you design your data quality on Databricks?

I was thinking of using DQX for in-transit data quality checks that hard-fail, and Deequ's AnomalyCheck for observation/dashboards/notifications.
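For anyone sketching the same split, a minimal example of what the combination could look like, assuming df is the incoming batch, an active spark session, and the databricks-labs-dqx and pydeequ packages (treat the exact check schema and signatures as things to verify against the current docs, both APIs move):

    # DQX side: row-level rules, split valid rows from quarantined ones.
    from databricks.labs.dqx.engine import DQEngine
    from databricks.sdk import WorkspaceClient

    checks = [{  # illustrative rule; exact check schema per the DQX docs
        "criticality": "error",
        "check": {"function": "is_not_null", "arguments": {"column": "order_id"}},
    }]
    valid_df, quarantine_df = DQEngine(WorkspaceClient()) \
        .apply_checks_by_metadata_and_split(df, checks)

    # Deequ side: persist a Size metric each run and flag a >2x batch-to-batch jump.
    from pydeequ.analyzers import Size
    from pydeequ.anomaly_detection import RelativeRateOfChangeStrategy
    from pydeequ.repository import FileSystemMetricsRepository, ResultKey
    from pydeequ.verification import VerificationSuite

    repo = FileSystemMetricsRepository(spark, "dbfs:/dq/metrics.json")
    result = (VerificationSuite(spark)
              .onData(valid_df)
              .useRepository(repo)
              .saveOrAppendResult(ResultKey(spark, ResultKey.current_milli_time()))
              .addAnomalyCheck(RelativeRateOfChangeStrategy(maxRateIncrease=2.0), Size())
              .run())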

u/ptab0211 — 14 hours ago

What are you using for Spark agents with Databricks at scale?

Mid-sized org, around 300 people. Running multiple Databricks workspaces across AWS and Azure, hundreds of Spark jobs daily.

Debugging slow jobs, skew, small files, memory spills, and shuffle issues takes too much manual time. Spark UI and Databricks monitoring cover the basics, but at this scale someone always has to notice something is wrong before an investigation starts. That lag is becoming a real problem.

Started looking into Spark agents to automate detection and surface bottlenecks without someone manually going through every stage. The volume justifies it, but I'm not sure what's production-ready vs still experimental.

What are people using for Spark agents at this scale? Does it hold up across multiple workspaces or does it fall apart once jobs get into the hundreds daily?
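Not an answer on agents, but for the "someone has to notice" lag specifically, even a simple poller over the Jobs API catches a lot before reaching for anything fancier. A rough sketch with the Databricks Python SDK (the 60-minute threshold is arbitrary, and you'd run one client per workspace profile):

    # Flag recently completed runs that ran unusually long.
    from databricks.sdk import WorkspaceClient

    w = WorkspaceClient()  # one client per workspace
    for run in w.jobs.list_runs(completed_only=True, limit=25):
        # durations are in milliseconds
        duration_min = (run.run_duration or run.execution_duration or 0) / 60_000
        if duration_min > 60:  # arbitrary threshold for illustration
            print(f"slow run {run.run_id} (job {run.job_id}): {duration_min:.0f} min")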

u/New-Reception46 — 22 hours ago

The real gap isn't connecting Claude to Databricks, it's the 3,000 tokens it costs every time you do

u/imsuryya — 2 days ago

Suggest must-review or must-do Databricks end-to-end solutions for someone aiming for a Data Architect role

Hi all,

Can you suggest any GitHub projects or YouTube videos that cover complex, interesting end-to-end Databricks projects, to gain experience and exposure similar to a prod environment?

u/EatAndRun_Mommy — 1 day ago

The Next Era of the Open Lakehouse: Apache Iceberg™ v3 in Public Preview on Databricks

u/Youssef_Mrini — 1 day ago

DABs direct deployment engine plan-file tool

Hello,

I originally built this for my team's local DevEx and CI/CD stacks, but figured others might get some use out of it too. With the direct deployment engine for Databricks Asset Bundles heading towards GA, I think it's a good time to share it.

dagshund, a visualizer for DAB plan files.

Zero runtime dependencies. Pipe in a plan.json like this:

databricks bundle plan -o json | uvx dagshund

and depending on what you want you may get:

  • a colored diff on stdout for quick sanity checks
  • a markdown summary you can drop into a PR comment
  • a self-contained interactive HTML report with a resource graph and per-job task DAGs (React Flow + ELK layout, no external assets, just open the file). Live examples here.

The CI piece came out of always feeling a bit blind when deploying large bundles without a terraform-style plan. It also exposes detailed exit codes so you can gate pipelines on drift or dangerous actions (resource replaces & destroys).
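For the gating, the idea in a small Python sketch (the exit-code values below are placeholders; the real mapping is whatever the dagshund docs say):

    # Fail CI when the plan contains destructive actions.
    import subprocess, sys

    plan = subprocess.run(["databricks", "bundle", "plan", "-o", "json"],
                          capture_output=True, text=True, check=True)
    gate = subprocess.run(["uvx", "dagshund"], input=plan.stdout, text=True)
    if gate.returncode >= 2:  # placeholder: suppose >=2 means replaces/destroys
        sys.exit("bundle plan contains destructive actions, blocking deploy")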

The HTML report is for anyone who'd rather see a graph than scroll through a wall of diff text, or who wants more detail on job DAGs than a terminal can fit. It also surfaces lateral dependencies and hierarchy relationships that the plan file mentions but state doesn't track directly, so it's generally just more detailed. We also use them as build artifacts in our PRs.

Docs and full feature list on GitLab, or on the GitHub mirror.

Fair disclaimer: since the direct deployment engine isn't GA yet, expect dagshund to follow along if the plan format shifts before release; sometimes that may take a day or two.

Cheers!

u/Lost-Relative-1631 — 1 day ago

Delta Lake Secrets: What Happens After You Run Write, Update or Merge

I wrote a practical deep dive on Delta Lake that explains what actually happens behind the scenes—not just the basic theory.

Most tutorials stop at “Delta supports ACID and Time Travel,” but I wanted to understand how it really works.

In this blog, I covered:

• _delta_log and transaction logs
• Why Delta never deletes old files immediately
• Checkpoints and snapshot mechanism
• Data skipping and how Z-Ordering improves performance
• History, Restore, and Time Travel
• Merge, Update, Delete operations
• Convert Parquet to Delta
• Optimize and the small file problem
• Real PySpark examples for every concept

I tried to explain everything in a simple, practical way with real examples instead of documentation-style theory.
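For a flavor, a minimal PySpark sketch of two of those ideas, time travel and the history that _delta_log makes possible (my own sketch, assuming an existing DataFrame df and an active spark session; the blog goes deeper on each):

    # Every write becomes a JSON commit in _delta_log; history() reads it back.
    from delta.tables import DeltaTable

    df.write.format("delta").mode("overwrite").save("/tmp/delta_demo")

    # Time travel: read the table as of an earlier version.
    v0 = spark.read.format("delta").option("versionAsOf", 0).load("/tmp/delta_demo")

    DeltaTable.forPath(spark, "/tmp/delta_demo").history() \
        .select("version", "operation", "operationMetrics").show(truncate=False)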

https://medium.com/@wnccpdfvz/why-delta-lake-is-faster-than-traditional-data-lakes-5c865f67b66b

u/Sea_Driver_924 — 16 hours ago

Ingest codebeamer data in databricks

Hi everyone, Codebeamer is an Application Lifecycle Management tool. I have been assigned to get all the data out of this application into Databricks, i.e. ingesting everything into Delta tables and creating a data model fit for analytics. I have been given only the Codebeamer APIs to work with. If anyone has worked on similar projects, let me know your insights, suggestions, and best practices. Thanks in advance.
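For what it's worth, the usual shape for API-only ingestion like this is: page through the REST endpoints, land the raw JSON in bronze Delta tables, and only then model for analytics. A rough sketch of that shape (the endpoint, auth, pagination parameters, and field names below are hypothetical placeholders, not the actual Codebeamer API; check its docs):

    # Paginate a REST endpoint and land raw records in a bronze Delta table.
    import requests

    ENDPOINT = "https://codebeamer.example.com/api/v3/items"  # hypothetical URL
    rows, page = [], 1
    while True:
        resp = requests.get(ENDPOINT, auth=("user", "api-token"),
                            params={"page": page, "pageSize": 100})  # hypothetical params
        resp.raise_for_status()
        batch = resp.json().get("items", [])  # hypothetical response field
        if not batch:
            break
        rows.extend(batch)
        page += 1

    # Keep bronze raw; flatten and model in silver afterwards.
    spark.createDataFrame(rows).write.format("delta") \
        .mode("append").saveAsTable("bronze.codebeamer_items")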

u/razzritu4 — 23 hours ago

How to query batch job runs + number of rows inserted to bronze (+ updated, deleted for silver)?
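One angle, as a minimal sketch (not the only way): Delta's DESCRIBE HISTORY exposes per-commit operationMetrics, which carry exactly these counts; join them to your job-run metadata however you track it (the table name below is a placeholder):

    # WRITE commits expose numOutputRows; MERGE exposes numTargetRowsInserted/Updated/Deleted.
    hist = spark.sql("DESCRIBE HISTORY bronze.my_table")  # placeholder table
    hist.selectExpr(
        "version", "timestamp", "operation",
        "operationMetrics['numOutputRows'] as rows_written",
        "operationMetrics['numTargetRowsInserted'] as rows_inserted",
        "operationMetrics['numTargetRowsUpdated'] as rows_updated",
        "operationMetrics['numTargetRowsDeleted'] as rows_deleted",
    ).show(truncate=False)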

u/Sea_Basil_6501 — 1 day ago