Cache query performance in Spark


If I'm trying to cache a huge DataFrame (e.g. a 100 GB table), when I perform a query on the cached DataFrame, will it perform a full table scan? How does Spark index the data? The Spark documentation says:

Spark SQL can cache tables using an in-memory columnar format by calling spark.catalog.cacheTable("tableName") or dataFrame.cache(). Then Spark SQL will scan only required columns and will automatically tune compression to minimize memory usage and GC pressure. You can call spark.catalog.uncacheTable("tableName") to remove the table from memory.

http://spark.apache.org/docs/latest/sql-programming-guide.html#caching-data-in-memory
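
For reference, here is a minimal sketch of the API the quoted passage describes, assuming a SparkSession named spark and a table or temporary view registered as "events" (the name is made up for illustration):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("cache-example").getOrCreate()

// Assume a table or temporary view named "events" is already registered.
spark.catalog.cacheTable("events")   // cache by name, or equivalently:
// spark.table("events").cache()

// Caching is lazy: the first action materializes the in-memory columnar data.
spark.table("events").count()

// Drop the cached data from memory when it is no longer needed.
spark.catalog.uncacheTable("events")
```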

I didn't fully understand the statement above. It would be helpful if you could explain the statement below in detail, or how Spark optimizes a query on a large cached DataFrame:

"Then Spark SQL will scan only required columns and will automatically tune compression"

When I perform a query on the cached DataFrame, will it perform a full table scan? How does Spark index the data?

While some minor optimizations are possible, Spark doesn't index the data at all. In the general case you should assume that Spark will perform a full data scan.
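
You can see this in a sketch that continues the hypothetical "events" table from above (the id column is also made up): the physical plan of a filtered query against the cached DataFrame is still a scan of the in-memory relation, not an index lookup.

```scala
import org.apache.spark.sql.functions.col

// Even with a very selective predicate, explain() will typically show an
// InMemoryTableScan node, i.e. a scan over the cached columnar batches --
// there is no index to consult.
spark.table("events").filter(col("id") === 42).explain()
```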

It can, however, apply projections. If a query uses only a subset of columns, Spark can access just these, as required.
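
For example (again using the hypothetical cached "events" table, with made-up column names), comparing the plans of a narrow and a wide projection shows that only the selected columns appear in the in-memory scan:

```scala
// Only the projected column is listed in the InMemoryTableScan node, so only
// that column's batches have to be read and decompressed.
spark.table("events").select("user_id").explain()

// Selecting every column forces all cached columns to be materialized.
spark.table("events").select("*").explain()
```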

Columnar stores are good candidates for compression, and Spark supports a number of compression schemes (RunLengthEncoding, DictionaryEncoding, BooleanBitSet, IntDelta, LongDelta). Depending on the type of the column and the computed statistics, Spark can automatically choose an appropriate compression format or skip compression altogether.
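
This automatic selection is controlled by documented session options; a sketch:

```scala
// When true (the default), Spark SQL selects a compression codec per column
// based on statistics of the data.
spark.conf.set("spark.sql.inMemoryColumnarStorage.compressed", "true")

// Size of the batches used for columnar caching; larger batches improve
// compression at the cost of more memory per batch (default 10000).
spark.conf.set("spark.sql.inMemoryColumnarStorage.batchSize", "10000")
```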

In general, the compression schemes used by columnar stores allow queries on compressed data, and some (like RLE) can be used for efficient selections. At the same time, compression increases the amount of data that can be stored in memory and accessed without fetching data from disk.
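
As a toy illustration of why run-length encoding helps selections (plain Scala, not Spark's internal code): the predicate is evaluated once per run instead of once per row, and the encoded column is much smaller than the raw values.

```scala
// Run-length encode a column of values into (value, runLength) pairs.
def rle[T](values: Seq[T]): Seq[(T, Int)] =
  values.foldLeft(List.empty[(T, Int)]) {
    case ((v, n) :: rest, x) if v == x => (v, n + 1) :: rest
    case (acc, x)                      => (x, 1) :: acc
  }.reverse

// Count rows matching a predicate directly on the compressed representation:
// the predicate runs once per run, not once per row.
def countWhere[T](encoded: Seq[(T, Int)])(p: T => Boolean): Int =
  encoded.collect { case (v, n) if p(v) => n }.sum

val column  = Seq("US", "US", "US", "DE", "DE", "US", "US")
val encoded = rle(column)                     // List((US,3), (DE,2), (US,2))
val matches = countWhere(encoded)(_ == "US")  // 5
```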

