Cache query performance in Spark
If I'm trying to cache a huge DataFrame (e.g. a 100 GB table), will a query on the cached DataFrame perform a full table scan? How does Spark index the data? The Spark documentation says:
Spark SQL can cache tables using an in-memory columnar format by calling spark.catalog.cacheTable("tableName") or dataFrame.cache(). Then Spark SQL will scan only required columns and will automatically tune compression to minimize memory usage and GC pressure. You can call spark.catalog.uncacheTable("tableName") to remove the table from memory.
http://spark.apache.org/docs/latest/sql-programming-guide.html#caching-data-in-memory
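The calls mentioned in the quote are used roughly like this (a minimal sketch; "tableName" is just a placeholder for a table registered in the catalog):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("cache-example").getOrCreate()

// Cache a catalog table in the in-memory columnar format
spark.catalog.cacheTable("tableName")

// Or cache a DataFrame directly; caching is lazy, so an action is
// needed to actually materialize the cached data
val df = spark.table("tableName")
df.cache()
df.count()

// Remove the table from memory when it is no longer needed
spark.catalog.uncacheTable("tableName")
```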
I didn't understand the statement above. It would be helpful if someone could explain the following sentence in detail, or explain how Spark optimizes a query on a large cached DataFrame:

"Then Spark SQL will scan only required columns and will automatically tune compression"
"When a query is performed on a cached DataFrame, will it perform a full table scan? How does Spark index the data?"
While minor optimizations are possible, Spark doesn't index the data at all. In the general case you should assume that Spark will perform a full data scan.
It can, however, apply projections: if a query uses only a subset of columns, Spark can access just those columns, as required.
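As a sketch of what that looks like in practice (the columns here are made up), comparing physical plans of queries on a cached DataFrame shows an InMemoryTableScan node whose output is limited to the columns the query actually needs:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("cached-scan").getOrCreate()
import spark.implicits._

// Hypothetical wide table cached in memory
val wide = spark.range(0, 1000000)
  .selectExpr("id", "id * 2 AS a", "id * 3 AS b", "cast(id AS string) AS c")
  .cache()
wide.count()  // materialize the cache

// Only columns `id` and `a` are read from the columnar cache;
// the physical plan shows an InMemoryTableScan limited to those columns
wide.select("id", "a").explain()

// A filter still scans the cached batches -- there is no index --
// but it only needs to read the columns the query touches
wide.filter($"a" > 100).select("id").explain()
```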
Columnar stores are good candidates for compression, and Spark supports a number of compression schemes (RunLengthEncoding, DictionaryEncoding, BooleanBitSet, IntDelta, LongDelta). Depending on the type of the column and the computed statistics, Spark can automatically choose an appropriate compression format or skip compression altogether.
In general, the compression schemes used by columnar stores allow queries on compressed data, and some (like RLE) can be used for efficient selections. At the same time, they increase the amount of data that can be stored in memory and accessed without fetching data from disk.
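For completeness, the in-memory columnar cache can be tuned with two documented configuration options (shown here with their default values):

```scala
// Automatically compress the in-memory columnar storage based on
// statistics of the data (default: true)
spark.conf.set("spark.sql.inMemoryColumnarStorage.compressed", "true")

// Number of rows per columnar batch; larger batches improve memory
// utilisation and compression, at the risk of OOMs (default: 10000)
spark.conf.set("spark.sql.inMemoryColumnarStorage.batchSize", "10000")
```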