TPC-DS benchmark as a measure of performance in Spark and similar engines
I have been running the TPC-DS benchmark for performance measurement using Apache Spark and my fork of it. TPC-DS runs read-only SQL on prepared data, and how that data is prepared is crucial to the numbers. Since TPC-DS puts no requirements on data preparation, each vendor/engine fine-tunes it to its strengths. Nothing wrong there, but the cost of preparing the data is overlooked.
When preparing data with the TPC-DS toolkit, the tables created are by default partitioned on the date column, unless the flag is explicitly set to false.
When I ran the TPC-DS benchmark on two m1 machine nodes at a 3 TB scale factor, generating the partitioned data took in excess of 6 hours, with sporadic OOMs.
Generating the same 3 TB of data without partitioning on the date column, but sorting each split locally on the date column while writing it, took around 40-50 minutes.
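The two preparation strategies can be sketched roughly as follows. This is an illustrative sketch, not the toolkit's actual code: the `storeSales` DataFrame, the output paths, and the choice of `ss_sold_date_sk` as the date column are assumptions for the example.

```scala
// Strategy 1 (default): partition the table on the date column at write
// time. This fans the write out into one directory per distinct date
// value, which at 3 TB scale can mean many small files and the kind of
// memory pressure described above.
storeSales.write
  .partitionBy("ss_sold_date_sk")
  .parquet("/data/tpcds/partitioned/store_sales")

// Strategy 2: no directory partitioning; instead, sort rows on the date
// column within each split before writing, so each output file is still
// locally clustered by date and file-level min/max statistics can be
// used for data skipping at query time.
storeSales
  .sortWithinPartitions("ss_sold_date_sk")
  .write
  .parquet("/data/tpcds/sorted/store_sales")
```

The second approach keeps the write a single pass per task, avoiding the fan-out of the partitioned write while preserving most of the date locality that the queries benefit from.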
The TPC-DS run times were comparable: 2200 sec for stock Spark on the partitioned data vs 2300 sec for my fork on the non-partitioned data.
Another point is the relevance of the TPC-DS benchmark itself for Spark and related engines.
TPC-DS queries are straightforward SQL, while real-world queries written with the DataFrame API can be, and frequently are, extremely complex: so complex that they cannot even be represented as a single SQL string. The way one can join DataFrames or keep adding projections means that a SQL string representation, if it could be created at all, would have abnormally deep nesting, far beyond the 6-7 levels usually allowed by SQL databases, as far as I know.
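A toy illustration of the point (plain Python, not Spark itself): if each chained DataFrame-style operation is rendered by wrapping the previous query in a subquery, the SQL string nests one level deeper per call, so a long chain quickly exceeds what a database parser will accept. The `wrap` helper, table, and column names here are all hypothetical.

```python
def wrap(sql: str, predicate: str) -> str:
    """Wrap an existing query in one more subquery applying a filter."""
    return f"SELECT * FROM ({sql}) t WHERE {predicate}"

# Start from a base scan, then chain ten filter-like operations,
# as a DataFrame pipeline might.
query = "SELECT * FROM store_sales"
for i in range(10):
    query = wrap(query, f"col_{i} > 0")

# Nesting depth equals the number of chained operations.
depth = query.count("SELECT") - 1
print(depth)  # → 10
```

Real DataFrame plans are worse: joins and reused intermediate DataFrames form a DAG, not a chain, so there may be no faithful single-string SQL rendering at all.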
Are there better benchmarks that take real-world usage into account?