performance - How to know which count query is the fastest?
I've been exploring query optimizations in the recent releases of Spark SQL (2.3.0-SNAPSHOT) and noticed that semantically-identical queries produce different physical plans.
Let's assume I want to count the number of rows in the following dataset:
val q = spark.range(1)
I could count the number of rows as follows:
q.count
q.collect.size
q.rdd.count
q.queryExecution.toRdd.count
My initial thought was that it would be a constant operation (surely due to the local dataset) that would somehow have been optimized by Spark SQL to give the result immediately, especially the first one, where Spark SQL is in full control of the query execution.
Having had a look at the physical plans of the queries, I came to believe that the most effective query is the last one:
q.queryExecution.toRdd.count
The reasons being that:
- it avoids deserializing rows from their InternalRow binary format
- the query is codegened
- there's only one job with a single stage
The physical plan is as simple as that.
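The plans can be compared directly with `explain`. A minimal sketch, assuming a spark-shell session with the usual `spark` binding (the method calls are standard Dataset/QueryExecution API; output details vary by Spark version):

```scala
// Assumes a running SparkSession bound to `spark` (e.g. in spark-shell).
val q = spark.range(1)

// Print the physical plan; subtrees prefixed with `*` in the output
// are whole-stage codegened.
q.explain()

// The executed plan object itself, for programmatic inspection.
println(q.queryExecution.executedPlan)

// The RDD lineage behind q.rdd, which shows the extra deserialization
// step that q.queryExecution.toRdd does not have.
println(q.rdd.toDebugString)
```

Comparing `q.rdd.toDebugString` against `q.queryExecution.toRdd.toDebugString` makes the difference between the two RDD-based counts visible.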
Is my reasoning correct? If so, would the answer be different if I read the dataset from an external data source (e.g. files, JDBC, Kafka)?
The main question is: what factors should I take into consideration to decide whether one query is more efficient than the others (per this example)?
The other execution plans, for completeness:
q.count
q.collect.size
q.rdd.count
I did some testing on:
val q = spark.range(100000000)
- q.count: ~50 ms
- q.collect.size: I stopped the query after a minute or so...
- q.rdd.count: ~1100 ms
- q.queryExecution.toRdd.count: ~600 ms
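The timings above can be reproduced with a simple wall-clock helper. A sketch, assuming a spark-shell session; `time` is a hypothetical helper defined here, not a Spark API, and the absolute numbers will vary by machine:

```scala
// Hypothetical micro-benchmark helper; assumes a SparkSession in `spark`.
def time[T](label: String)(body: => T): T = {
  val start = System.nanoTime()
  val result = body
  println(s"$label: ${(System.nanoTime() - start) / 1000000} ms")
  result
}

val q = spark.range(100000000)
time("q.count")(q.count)
time("q.rdd.count")(q.rdd.count)
time("q.queryExecution.toRdd.count")(q.queryExecution.toRdd.count)
```

Single-shot wall-clock timings like this ignore JIT warm-up, so running each query a few times before trusting the numbers is advisable.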
Some explanation:
Option 1 is by far the fastest because it uses both partial aggregation and whole-stage code generation. Whole-stage code generation allows the JVM to get clever and make drastic optimizations (see: https://databricks.com/blog/2017/02/16/processing-trillion-rows-per-second-single-machine-can-nested-loop-joins-fast.html).
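You can confirm that the count is whole-stage codegened by dumping the generated Java source. A sketch, assuming spark-shell; `debugCodegen` comes from the `org.apache.spark.sql.execution.debug` implicits:

```scala
// Assumes a SparkSession in `spark` (e.g. spark-shell).
import org.apache.spark.sql.execution.debug._

val q = spark.range(100000000)

// Prints the generated Java code for each whole-stage-codegen subtree;
// the aggregate computing the count appears among them.
q.debugCodegen()
```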
Option 2 is quite slow because it materializes all the rows on the driver, which is generally a bad idea.
Option 3 is similar to option 4, but it first converts each internal row into a regular row, and that is quite expensive.
Option 4 is about as fast as you can get without whole-stage code generation.