u/ahshahid

TPC-DS benchmark as a measure of performance in Spark and similar engines - promo

I have been running the TPC-DS benchmark for performance measurement using Apache Spark and my fork. TPC-DS runs read-only SQL queries on prepared data. How that data is prepared is crucial to the numbers. Since TPC-DS puts no requirements on data preparation, each vendor/engine fine-tunes it to its strengths. Nothing wrong there, but the cost of preparing the data is overlooked.

When preparing data with the TPC-DS toolkit, the tables created are by default partitioned on the date column, unless the flag is explicitly set to false.

When I ran the TPC-DS benchmark on two m1 nodes at the 3 TB scale factor, generating the partitioned data took in excess of 6 hours, with sporadic OOMs.

The same 3 TB of data, when generated without partitioning on the date column but sorted locally on the date column while writing each split, took around 40-50 minutes.
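A rough sketch of why the two write strategies differ so much in cost. This is a pure-Python simulation, not Spark code; the file-count model (a partitioned write opens one output file per distinct partition value seen by each task, while a locally sorted plain write emits one file per task) reflects Spark's usual behavior without extra tuning, but the numbers are illustrative only:

```python
# Simulate output-file counts for two write strategies over the same rows.
# Each "task" holds a shuffled slice of rows; the date column has many
# distinct values, which is what makes a partitioned write explode.

def partitioned_write_files(tasks):
    # write.partitionBy("date"): every task opens one file per distinct
    # date it sees -> up to tasks * distinct_dates small files.
    return sum(len({row["date"] for row in task}) for task in tasks)

def sorted_write_files(tasks):
    # sortWithinPartitions("date") then a plain write: each task emits
    # a single locally sorted file, however many dates it holds.
    return len(tasks)

# 4 tasks, each seeing rows for 365 distinct dates.
tasks = [[{"date": d} for d in range(365)] for _ in range(4)]
print(partitioned_write_files(tasks))  # 1460 small files
print(sorted_write_files(tasks))       # 4 files
```

The small-files explosion (plus the shuffle needed to avoid it) is where the hours and the OOMs tend to come from.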

As for the TPC-DS run itself, stock Spark on the partitioned data took about 2200 seconds, while my Spark fork on the non-partitioned data took about 2300 seconds.

Another point is the relevance of the TPC-DS benchmark itself for Spark and related engines.

TPC-DS queries are straightforward SQL, while real-world queries written with the DataFrame APIs can be, and often are, extremely complex: so complex that they cannot even be represented as a SQL string. Given how one can join DataFrames or keep adding projections, a SQL string representation, if it could be created at all, would have abnormally deep nesting, far beyond the 6-7 levels of nesting usually allowed by SQL databases, as far as I know.
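A toy illustration of the nesting problem. Rendering each chained DataFrame-style projection as SQL wraps the previous query in another nested SELECT (the `add_projection` helper and the alias names are hypothetical, just to make the growth visible):

```python
# Each chained "withColumn"-style projection, rendered as SQL, wraps the
# previous query in one more nested subselect.

def add_projection(sql, new_col_expr, alias):
    return f"SELECT *, {new_col_expr} AS {alias} FROM ({sql}) t"

sql = "SELECT a, b FROM base"
for i in range(20):                      # 20 chained projections
    sql = add_projection(sql, f"a + {i}", f"c{i}")

print(sql.count("FROM ("))  # nesting depth 20, far beyond typical limits
```

Twenty chained projections is modest by pipeline standards, yet already produces a deeper subquery nest than many SQL engines accept.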

Are there better benchmarks that take real-world usage into account?

reddit.com
u/ahshahid — 5 days ago

Promo: KwikQuery's TabbyDB-4.1.1 available for download

The trial version of KwikQuery's TabbyDB-4.1.1, which is 100% compatible with Spark 4.1.1, is available on the website for download.

The differences between TabbyDB 4.0.1 and 4.1.1 (from the perspective of enhancements made in TabbyDB) are:

  1. Logic to mark a chain of projects as immutable, so that optimizer rules are applied only once to projects containing huge/repetitive expressions, as described in the earlier post.

  2. A further enhancement to the runtime performance of pushed-down broadcast keys in broadcast hash joins, by minimizing the iterations over the pushed keys. (Note that pushing down the broadcast keys of a hash join is a feature present only in TabbyDB.) The improvement in 4.1.1 over 4.0.1 has raised the overall TPC-DS gain from 14% to 17%, compared with open-source Spark of the respective versions.

reddit.com
u/ahshahid — 6 days ago

Promo: Two quiet Spark optimizer inefficiencies fixed in our Spark fork (TabbyDB)

If you've ever run complex analytical queries on Spark with wide projections full of nested expressions, you may have been losing performance to two issues. I have seen them cause query compilation times to run into hours.

Problem 1: CollapseProject and duplicated subexpressions

Spark's CollapseProject rule merges a chain of Project nodes into one. That's generally good — smaller tree, simpler plan. But it can leave you with a single Project where multiple Alias expressions each independently evaluate the same expensive subexpression. Spark does have CSE logic to catch this, but it only works when there are multiple Project nodes to reason about. If your project arrives already in collapsed form, Spark has nothing to split, and the duplicated work silently makes it into physical execution — evaluated redundantly for every single row.
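A minimal pure-Python model of the problem (not Spark's actual rule code; `expensive` stands in for any costly deterministic expression). Collapsing `Project(c = b + 1, d = b * 2)` over `Project(b = f(x))` inlines `f(x)` into both outputs, so it runs twice per row:

```python
# Count evaluations of a shared subexpression under the two plan shapes.

calls = {"expensive": 0}

def expensive(x):
    calls["expensive"] += 1      # stands in for a costly deterministic expr
    return x * x

def eval_row_collapsed(x):
    # collapsed single Project: each alias re-evaluates the subexpression
    c = expensive(x) + 1
    d = expensive(x) * 2
    return c, d

def eval_row_chained(x):
    # chain of two Projects: the shared subexpression is computed once
    b = expensive(x)
    return b + 1, b * 2

eval_row_collapsed(3)
print(calls["expensive"])        # 2 evaluations for one row
calls["expensive"] = 0
eval_row_chained(3)
print(calls["expensive"])        # 1 evaluation
```

Multiply that extra evaluation by billions of rows and dozens of duplicated aliases, and the runtime cost becomes significant.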

Problem 2: Per-rule change tracking that barely helps

Spark tracks whether each optimizer rule actually modified the plan, so it can in theory skip unchanged rules on subsequent passes. In practice this gives almost no benefit, because the state is kept in the tree nodes. If any rule in a batch changes the tree, all the ancestor nodes are also recreated, losing that state, and the whole batch restarts from rule 1 — including rules that have nothing left to do. For expensive rules that do full plan-tree traversals (NullPropagation, ConstantFolding, etc.), that's a lot of wasted optimizer compilation time on large plans.

What we did in TabbyDB

We solve both with a single mechanism:

  1. Let CollapseProject fully flatten everything first — minimal tree, even if subexpressions are duplicated
  2. Then do a controlled expansion: detect replicated deterministic subexpressions and rewrite into a chain of projects, where each subexpression is computed exactly once and referenced by all consumers downstream
  3. Apply the expensive optimizer rules and other batch rules to these expanded blocks once, then mark them immutable
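The controlled-expansion step (step 2) can be sketched on a toy expression IR. This is my own illustrative model, not TabbyDB code: expressions are nested tuples, a flat projection maps alias to expression, the `_cse*` names are made up, and the non-deterministic exclusion mirrors the rule described below:

```python
# Detect deterministic subexpressions used more than once across a flat
# projection's outputs, hoist each into a lower Project computed once,
# and rewrite the upper Project to reference the hoisted columns.
from collections import Counter

NONDETERMINISTIC = {"rand", "now"}    # excluded from hoisting

def subexprs(e):
    if isinstance(e, tuple):
        yield e
        for child in e[1:]:
            yield from subexprs(child)

def deterministic(e):
    return not (isinstance(e, tuple) and (e[0] in NONDETERMINISTIC
                or any(not deterministic(c) for c in e[1:])))

def expand(projection):
    counts = Counter(s for expr in projection.values()
                       for s in subexprs(expr) if deterministic(s))
    shared = {s: f"_cse{i}"
              for i, s in enumerate(s for s, n in counts.items() if n > 1)}

    def rewrite(e):
        if e in shared:
            return shared[e]               # reference the hoisted column
        if isinstance(e, tuple):
            return (e[0],) + tuple(rewrite(c) for c in e[1:])
        return e

    lower = {alias: s for s, alias in shared.items()}   # computed once
    upper = {a: rewrite(e) for a, e in projection.items()}
    return lower, upper                    # a two-level chain of projects

proj = {"c": ("add", ("f", "x"), 1), "d": ("mul", ("f", "x"), 2)}
lower, upper = expand(proj)
print(lower)   # {'_cse0': ('f', 'x')}
print(upper)   # {'c': ('add', '_cse0', 1), 'd': ('mul', '_cse0', 2)}
```

In the real optimizer the resulting chain of projects would then be normalized once and marked immutable, as step 3 describes.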

The immutability guarantee is the key. On every subsequent batch iteration, those blocks are skipped entirely by the expensive rules — because we know they're fully normalized and nothing can change. The rest of the plan keeps being optimized normally. It doesn't matter if another rule elsewhere modifies the tree; the immutable blocks are never re-traversed.

Result: less redundant computation at runtime (CSE actually works on already-collapsed projects), and less wasted optimizer time per batch iteration.

Only deterministic expressions qualify — rand(), now(), non-deterministic UDFs etc. are excluded. Query results are identical.

I had opened a ticket covering this idea, along with a PushDownPredicates optimization (a separate perf issue):

https://issues.apache.org/jira/browse/SPARK-36786

I implemented this long ago, but never published the PR or ported the code to newer Spark versions. It is now part of KwikQuery's TabbyDB 4.1.1 release (will put it up as a download in a day or two), which is based on Apache Spark's 4.1.1 release.

Apart from that, TabbyDB 4.1.1 further improves the TPC-DS numbers over OSS Spark from 14% to 17%, as tested recently on 3 TB of data with 2 nodes, for non-partitioned tables.

reddit.com
u/ahshahid — 13 days ago