How to get a cumulative sum of all preceding numbers across multiple partitions in a DataFrame with Scala/Spark?


I have a DataFrame, and I want the cumulative sum of all preceding numbers in the 'value' column. Input:

+-----+-----+
|value|ts   |
+-----+-----+
|    1|    0|
|    7|    1|
|    3|    2|
|    3|    3|
|    5|    4|
|   21|    5|
|    7|    6|
|    3|    7|
|    9|    8|
|   10|    9|
+-----+-----+

This is my code:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val arr = Array((1, 1), (7, 2), (3, 3), (3, 4), (5, 5), (21, 6), (7, 7), (3, 8), (9, 9), (10, 10))
println("before sum, number of partitions: " + df.rdd.getNumPartitions)
df = df.withColumn("sumValue", sum(col("value")).over(Window.orderBy(col("ts"))))
println("after sum, number of partitions: " + df.rdd.getNumPartitions)
df.show()

This is the expected result:

+-----+---+--------+
|value| ts|sumValue|
+-----+---+--------+
|    1|  1|       1|
|    7|  2|       8|
|    3|  3|      11|
|    3|  4|      14|
|    5|  5|      19|
|   21|  6|      40|
|    7|  7|      47|
|    3|  8|      50|
|    9|  9|      59|
|   10| 10|      69|
+-----+---+--------+

But this way, all of the data gets collected into a single partition. How can I get the same result across multiple partitions?

Your expected result cannot be achieved using multiple partitions, because each sumValue depends on the values of all previous rows. Transformations are applied to the data present on each executor, and this transformation depends on the whole dataset, not on parts of it. Spark itself warns about this when you use a window without partitionBy: "No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation."

Your expected result can be achieved, without partitioning, in the following way.

Given the input DataFrame

+-----+---+
|value|ts |
+-----+---+
|1    |1  |
|7    |2  |
|3    |3  |
|3    |4  |
|5    |5  |
|21   |6  |
|7    |7  |
|3    |8  |
|9    |9  |
|10   |10 |
+-----+---+
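(For reference, here is a minimal sketch of how such a sample DataFrame could be constructed, assuming a SparkSession in scope named spark, as in a spark-shell:)

import spark.implicits._

// hypothetical construction of the sample data; toDF names the columns
val df = Seq(
  (1, 1), (7, 2), (3, 3), (3, 4), (5, 5),
  (21, 6), (7, 7), (3, 8), (9, 9), (10, 10)
).toDF("value", "ts")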

you can use a cumulative sum with a window function:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// running sum over all rows from the start of the ordered window up to the current row
val windowSpec = Window.orderBy("ts").rowsBetween(Long.MinValue, 0)
val finalDF = df.withColumn("sumValue", sum(col("value")).over(windowSpec))
finalDF.show(false)

which should give the result

+-----+---+--------+
|value|ts |sumValue|
+-----+---+--------+
|1    |1  |1       |
|7    |2  |8       |
|3    |3  |11      |
|3    |4  |14      |
|5    |5  |19      |
|21   |6  |40      |
|7    |7  |47      |
|3    |8  |50      |
|9    |9  |59      |
|10   |10 |69      |
+-----+---+--------+
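As a side note, on Spark 2.1 and later the same frame can be written with the named boundary constants instead of Long.MinValue, which reads a little more clearly (a sketch, equivalent to the window above):

import org.apache.spark.sql.expressions.Window

// same frame: from the first row of the ordered window up to the current row
val windowSpec2 = Window.orderBy("ts")
  .rowsBetween(Window.unboundedPreceding, Window.currentRow)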

Updated

Looking at the comments and the title of the question, you must have a column (or columns) that you can use to partition by.

So, given a DataFrame like

+-----+-----+---+
|group|value|ts |
+-----+-----+---+
|a    |1    |1  |
|a    |7    |2  |
|a    |3    |3  |
|a    |3    |4  |
|a    |5    |5  |
|a    |21   |6  |
|a    |7    |7  |
|a    |3    |8  |
|a    |9    |9  |
|a    |10   |10 |
|b    |10   |11 |
|b    |10   |12 |
|b    |10   |13 |
|b    |10   |14 |
|b    |10   |15 |
+-----+-----+---+
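(Again, a minimal sketch of how this grouped sample could be built, assuming import spark.implicits._ as before:)

// hypothetical grouped sample data matching the table above
val df = Seq(
  ("a", 1, 1), ("a", 7, 2), ("a", 3, 3), ("a", 3, 4), ("a", 5, 5),
  ("a", 21, 6), ("a", 7, 7), ("a", 3, 8), ("a", 9, 9), ("a", 10, 10),
  ("b", 10, 11), ("b", 10, 12), ("b", 10, 13), ("b", 10, 14), ("b", 10, 15)
).toDF("group", "value", "ts")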

you can use partitionBy as below:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// same running sum, but computed separately within each group
val windowSpec = Window.partitionBy("group").orderBy("ts").rowsBetween(Long.MinValue, 0)
val finalDF = df.withColumn("sumValue", sum(col("value")).over(windowSpec))
finalDF.show(false)

and get the result:

+-----+-----+---+--------+
|group|value|ts |sumValue|
+-----+-----+---+--------+
|b    |10   |11 |10      |
|b    |10   |12 |20      |
|b    |10   |13 |30      |
|b    |10   |14 |40      |
|b    |10   |15 |50      |
|a    |1    |1  |1       |
|a    |7    |2  |8       |
|a    |3    |3  |11      |
|a    |3    |4  |14      |
|a    |5    |5  |19      |
|a    |21   |6  |40      |
|a    |7    |7  |47      |
|a    |3    |8  |50      |
|a    |9    |9  |59      |
|a    |10   |10 |69      |
+-----+-----+---+--------+
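Because the window is now partitioned by group, Spark no longer has to move everything to one partition: the running sum for each group is computed independently, so the groups can be processed in parallel. A quick way to check this on the finalDF above:

// after a partitioned window, the data is spread across the shuffle partitions
// (bounded by spark.sql.shuffle.partitions), rather than collapsed into one
println("partitions after grouped window: " + finalDF.rdd.getNumPartitions)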

I hope this answer is helpful.

