performance - How to know which count query is the fastest? -


i've been exploring query optimizations in recent releases of spark sql 2.3.0-snapshot , noticed different physical plans semantically-identical queries.

let's assume i've got count number of rows in following dataset:

val q = spark.range(1) 

i count number of rows follows:

  1. q.count
  2. q.collect.size
  3. q.rdd.count
  4. q.queryexecution.tordd.count

my initial thought it's constant operation (surely due local dataset) somehow have been optimized spark sql , give result immediately, esp. 1st 1 spark sql in full control of query execution.

having had @ physical plans of queries led me believe effective query last:

q.queryexecution.tordd.count 

the reasons being that:

  1. it avoids deserializing rows internalrow binary format
  2. the query codegened
  3. there's 1 job single stage

the physical plan simple that.

details job

is reasoning correct? if so, answer different if read dataset external data source (e.g. files, jdbc, kafka)?

the main question factors take consideration whether query more efficient others (per example)?


the other execution plans completeness.

q.count

q.count

q.collect.size

q.collect.size

q.rdd.count

q.rdd.count

i did testing on val q = spark.range(100000000):

  1. q.count: ~50 ms
  2. q.collect.size: stopped query after minute or so...
  3. q.rdd.count: ~1100 ms
  4. q.queryexecution.tordd.count: ~600 ms

some explanation:

option 1 far fastest because uses both partial aggregation , whole stage code generation. whole stage code generation allows jvm clever , drastic optimizations (see: https://databricks.com/blog/2017/02/16/processing-trillion-rows-per-second-single-machine-can-nested-loop-joins-fast.html).

option 2. slow , materializes on driver, bad idea.

option 3. option 4, first converts internal row regular row, , quite expensive.

option 4. fast without whole stage code generation.


Comments

Popular posts from this blog

android - InAppBilling registering BroadcastReceiver in AndroidManifest -

python Tkinter Capturing keyboard events save as one single string -

sql server - Why does Linq-to-SQL add unnecessary COUNT()? -