performance - How to know which count query is the fastest?
I've been exploring query optimizations in the recent releases of Spark SQL (2.3.0-SNAPSHOT) and noticed that semantically-identical queries produce different physical plans.
Let's assume I want to count the number of rows in the following dataset:
val q = spark.range(1)
I could count the number of rows as follows:
q.count
q.collect.size
q.rdd.count
q.queryExecution.toRdd.count
My initial thought was that it would be a constant operation (surely due to the local dataset) that would somehow have been optimized by Spark SQL to give the result immediately, especially the first one, where Spark SQL is in full control of the query execution.
Having had a look at the physical plans of the queries, I came to believe that the most effective query is the last one:
q.queryExecution.toRdd.count
The reasons being that:
- it avoids deserializing rows from their InternalRow binary format
- the query is codegened
- there's only one job with a single stage
The physical plan is as simple as that.
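The plans can be compared directly with `explain`. A minimal sketch, assuming a spark-shell session with the usual `spark` binding (the method calls are standard Dataset/QueryExecution API; output details vary by Spark version):

```scala
// Assumes a running SparkSession bound to `spark` (e.g. in spark-shell).
val q = spark.range(1)

// Print the physical plan; subtrees prefixed with `*` in the output
// are whole-stage codegened.
q.explain()

// The executed plan object itself, for programmatic inspection.
println(q.queryExecution.executedPlan)

// The RDD lineage behind q.rdd, which shows the extra deserialization
// step that q.queryExecution.toRdd does not have.
println(q.rdd.toDebugString)
```

Comparing `q.rdd.toDebugString` against `q.queryExecution.toRdd.toDebugString` makes the difference between the two RDD-based counts visible.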
Is my reasoning correct? If so, would the answer be different if I read the dataset from an external data source (e.g. files, JDBC, Kafka)?
The main question is: what factors should I take into consideration to decide whether one query is more efficient than the others (per this example)?
The other execution plans, for completeness:
q.count
q.collect.size
q.rdd.count
I did some testing on:
val q = spark.range(100000000)
- q.count: ~50 ms
- q.collect.size: I stopped the query after a minute or so...
- q.rdd.count: ~1100 ms
- q.queryExecution.toRdd.count: ~600 ms
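The timings above can be reproduced with a simple wall-clock helper. A sketch, assuming a spark-shell session; `time` is a hypothetical helper defined here, not a Spark API, and the absolute numbers will vary by machine:

```scala
// Hypothetical micro-benchmark helper; assumes a SparkSession in `spark`.
def time[T](label: String)(body: => T): T = {
  val start = System.nanoTime()
  val result = body
  println(s"$label: ${(System.nanoTime() - start) / 1000000} ms")
  result
}

val q = spark.range(100000000)
time("q.count")(q.count)
time("q.rdd.count")(q.rdd.count)
time("q.queryExecution.toRdd.count")(q.queryExecution.toRdd.count)
```

Single-shot wall-clock timings like this ignore JIT warm-up, so running each query a few times before trusting the numbers is advisable.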
Some explanation:
Option 1 is by far the fastest because it uses both partial aggregation and whole-stage code generation. Whole-stage code generation allows the JVM to get clever and make drastic optimizations (see: https://databricks.com/blog/2017/02/16/processing-trillion-rows-per-second-single-machine-can-nested-loop-joins-fast.html).
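You can confirm that the count is whole-stage codegened by dumping the generated Java source. A sketch, assuming spark-shell; `debugCodegen` comes from the `org.apache.spark.sql.execution.debug` implicits:

```scala
// Assumes a SparkSession in `spark` (e.g. spark-shell).
import org.apache.spark.sql.execution.debug._

val q = spark.range(100000000)

// Prints the generated Java code for each whole-stage-codegen subtree;
// the aggregate computing the count appears among them.
q.debugCodegen()
```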
Option 2 is quite slow because it materializes all the rows on the driver, which is generally a bad idea.
Option 3 is similar to option 4, but it first converts each internal row into a regular row, and that is quite expensive.
Option 4 is about as fast as you can get without whole-stage code generation.