u/YeeduPlatform

▲ 6 · r/Yeedu · +1 crosspost

There’s a clear shift happening in the data world: newer engines like DuckDB, Polars, and DataFusion are deliberately avoiding the JVM. This isn’t just an implementation choice—it reflects a deeper change in how performance is defined. Systems like Apache Spark were built around scaling out across clusters. But modern engines are optimizing for something else: how efficiently a single core can execute work. 

That shift leads directly to hardware-aware execution. Native engines (C++/Rust) can fully exploit SIMD, operate on columnar memory, and avoid garbage collection entirely. The JVM, while great for general-purpose systems, introduces overhead exactly where analytical workloads are sensitive—GC pauses, object-heavy memory layouts, and limited vectorization. The result is a hard ceiling on per-core performance that scaling alone can’t fix. 
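The memory-layout point can be made concrete with a toy Python comparison (purely illustrative, not any engine's actual code): row-oriented records of boxed, GC-tracked objects versus one contiguous columnar buffer, which is the Arrow-style layout that lets native engines scan with SIMD and skip garbage collection entirely.

```python
import array
import time

N = 1_000_000

# Row-oriented: a list of per-row dicts. Every value is a boxed Python
# object the garbage collector has to track, and a column scan must
# chase a pointer per row.
rows = [{"id": i, "amount": i * 0.5} for i in range(N)]

# Columnar: one contiguous buffer of unboxed doubles -- the layout
# (a la Arrow) that native engines can stream through with SIMD.
amounts = array.array("d", (i * 0.5 for i in range(N)))

t0 = time.perf_counter()
row_sum = sum(r["amount"] for r in rows)
row_time = time.perf_counter() - t0

t0 = time.perf_counter()
col_sum = sum(amounts)  # single sequential scan over packed memory
col_time = time.perf_counter() - t0

assert row_sum == col_sum
print(f"row-oriented: {row_time:.3f}s  columnar: {col_time:.3f}s")
```

Even inside interpreted Python the columnar scan is noticeably faster; compiled engines widen the gap further because the packed buffer maps directly onto vector instructions.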

What’s emerging isn’t “Spark vs new engines,” but a separation of roles: 

  • Spark for orchestration, distribution, and fault tolerance  
  • Native execution layers (like Velox, or Arrow-based engines) for execution


The interesting question now isn’t whether the JVM is good or bad—it’s whether execution efficiency is becoming more important than abstraction in modern data systems. 

reddit.com
u/YeeduPlatform — 8 days ago
▲ 3 r/Yeedu

If you’ve been looking up how to reduce Spark costs, why Databricks bills spike, or how to scale data pipelines without overspending, you’ve probably hit this pattern: everything works, jobs are stable, data is growing… and yet your cloud bill grows faster than your platform. So the real question is: are you actually scaling workloads, or just paying more to run the same patterns at larger volume? 

  • Does more data automatically trigger bigger clusters instead of smarter execution?  
  • Are your jobs finishing slower than they should for the hardware you’re paying for?  
  • Is “keeping clusters ready” quietly turning into always-on cost leakage?  
  • Does running pipelines more frequently feel like a budget trade-off?  
  • Are you choosing platforms based on convenience instead of cost-to-performance fit?


At some point, optimization tweaks stop working because the issue isn’t configuration — it’s how your Spark workloads are structured and executed. Until that shifts, cost growth isn’t accidental… it’s inevitable. 

u/YeeduPlatform — 14 days ago
▲ 7 · r/Yeedu · +1 crosspost

Most of us have run into (yet another) Spark job that looked perfectly fine but kept dragging in production. The Spark UI showed a few tasks running forever while everything else finished quickly. Turned out to be data skew, again.

While digging into it, I ended up writing down what I wish I had checked earlier before running the job. Mostly around how data skew actually shows up during execution, why it’s so easy to miss in dev, and what signals to look for before things go sideways. 
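The "check it before the job runs" idea boils down to probing the key distribution ahead of the expensive join or groupBy. Here's a minimal sketch in plain Python (in practice you'd run the equivalent count aggregation in Spark itself); the threshold and helper name are my own, not from the post:

```python
from collections import Counter

def skew_report(keys, top_n=3, ratio_threshold=10.0):
    """Flag join/groupBy keys whose row counts dwarf the median key.

    A max/median ratio well above ~10x usually means a handful of
    straggler tasks will dominate the stage -- exactly the
    'few tasks run forever' pattern visible in the Spark UI.
    """
    counts = Counter(keys)
    sizes = sorted(counts.values(), reverse=True)
    median = sizes[len(sizes) // 2]
    ratio = sizes[0] / max(median, 1)
    return {
        "hot_keys": counts.most_common(top_n),
        "max_over_median": ratio,
        "skewed": ratio >= ratio_threshold,
    }

# A skewed distribution: one key holds the vast majority of the rows.
sample = ["user_0"] * 9_000 + [f"user_{i}" for i in range(1, 101)] * 10
report = skew_report(sample)
print(report["hot_keys"][0], report["skewed"])
```

The same probe is cheap to run in dev against a sample, which is where skew is otherwise easy to miss.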

Lol. I wrote a blog post about it and published it on Medium as well.

u/YeeduPlatform — 16 days ago
▲ 1 r/Yeedu

The DBU rate looks fine on paper until autoscaling kicks in. Databricks' Intelligent Workload Management scales aggressively and you have no real control over it. Add the idle warm pool baked into every DBU (you're paying for readiness even between jobs) and the mandatory Premium tier just to access Serverless — and what looked like ~$1,700/month for a 50 jobs/day workload became $5,000+ in practice. No Spot Instance access either, so that 60–90% EC2 discount AWS offers? Gone.

We ended up benchmarking Yeedu Warm Start as an alternative. Flat license, jobs run in your own VPC, and their Turbo Engine cuts job duration ~5x so you're burning far less compute per run. At low job volumes Databricks still wins on paper, but the crossover hits around 270–300 jobs/day — which for most active data teams is 6–12 months of normal growth away. Sharing in case others are trying to make sense of a surprise bill.
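The crossover logic is just a break-even between flat and per-job pricing. The sketch below uses placeholder numbers I chose for illustration (neither Yeedu's nor Databricks' real pricing), picked only so the result lands in the 270–300 jobs/day range the post describes:

```python
def breakeven_jobs_per_day(flat_monthly_license, cost_per_job, days=30):
    """Jobs/day above which a flat license beats pay-per-job billing.

    Pay-per-job monthly cost: jobs_per_day * days * cost_per_job
    Flat monthly cost:        flat_monthly_license
    The crossover is where the two are equal.
    """
    return flat_monthly_license / (cost_per_job * days)

# Hypothetical inputs (NOT real pricing): a $10,000/month flat license
# vs an effective $1.20 per job on a pay-per-use platform.
crossover = breakeven_jobs_per_day(10_000, 1.20)
print(round(crossover))  # 278 jobs/day under these assumptions
```

The useful part is the shape, not the numbers: below the crossover pay-per-use wins, above it the flat license does, so what matters is how fast your job volume is growing toward that line.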

u/YeeduPlatform — 20 days ago
▲ 5 r/Yeedu

There’s a pattern that shows up in almost every data stack after a year or two of growth. 

A Spark job needs to call an API, so someone drops the key into a config file. A notebook needs access to a storage bucket, so the engineer exports credentials as environment variables. A pipeline fails in production and someone shares a temporary token in Slack so the job can be rerun. 

None of these decisions feel risky in isolation. They solve an immediate problem. The pipeline runs. The incident closes. 

But over time the platform accumulates hundreds of these small decisions. Credentials end up scattered across notebooks, job configs, environment variables, CI pipelines, and random scripts. Nobody is entirely sure which pipelines depend on which keys. Rotating credentials becomes stressful because you might break something that nobody remembers owning. 

At that point the problem isn’t security policy. It’s infrastructure design. 

The thing most teams eventually realize is that credentials behave a lot like data assets. They need ownership, access control, lifecycle management, and a clear place in the platform architecture. Treating them as configuration details is what creates the chaos in the first place. 

We ran into this while thinking about how credentials should work inside a data platform. The design question wasn’t just where to store secrets, but how they should behave across teams and environments. 

A few principles ended up shaping the system: 

Credentials should have clear scope boundaries. Some are personal tokens that should never be shared. Some belong to a team workspace. Others are infrastructure credentials that every workspace relies on. 

Defaults should exist, but teams should be able to override them locally without breaking other environments. 

And most importantly, credentials should be referenced, not embedded. Pipelines and catalogs shouldn’t store secrets themselves — they should point to them. 
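The "referenced, not embedded" principle can be sketched as a tiny resolver: the pipeline config stores only a secret *reference*, and the value is looked up at runtime. The scheme names and function are hypothetical illustrations, not Yeedu's actual API:

```python
import os

def resolve_secret(ref, vault=None):
    """Resolve a secret reference like 'vault://team-a/warehouse-key'.

    Configs and catalogs only ever store the reference, so rotating the
    underlying value never touches pipeline code.
    """
    scheme, _, path = ref.partition("://")
    if scheme == "env":
        value = os.environ.get(path)
    elif scheme == "vault":
        value = (vault or {}).get(path)  # stand-in for a real Vault client
    else:
        raise ValueError(f"unknown secret scheme: {scheme!r}")
    if value is None:
        # Validate up front, so a bad reference fails before the job runs.
        raise KeyError(f"secret not found: {ref}")
    return value

# The pipeline config points at the secret; it never contains it.
config = {"warehouse_password": "vault://team-a/warehouse-key"}
fake_vault = {"team-a/warehouse-key": "s3cr3t"}
password = resolve_secret(config["warehouse_password"], vault=fake_vault)
print(password == "s3cr3t")
```

The validation step doubles as the scope boundary: personal, workspace, and infrastructure secrets can live behind different resolvers without the pipeline definition changing at all.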

That thinking eventually turned into a secrets management system inside Yeedu with scoped credentials, Vault-backed storage, and validation checks before credentials get used in jobs or catalogs. 

We wrote a deeper breakdown of how it works here: https://yeedu.com/posts/secrets-management-in-yeedu  

Curious how others are handling this in practice. 

For teams running Spark, Databricks, Snowflake, or multi-cloud data platforms — what actually solved the secrets sprawl problem for you? 

Centralized vault integrations? Platform-native secret stores? Or are credentials still quietly living inside pipeline configs somewhere? 

u/YeeduPlatform — 2 months ago