How to get a cumulative sum of all preceding numbers across multiple partitions in a DataFrame with Scala/Spark?
I have a DataFrame and I want, for each row, the sum of the 'value' column over all rows up to and including that row (a running total). Input:
+-----+-----+
|value|ts   |
+-----+-----+
|    1|    0|
|    7|    1|
|    3|    2|
|    3|    3|
|    5|    4|
|   21|    5|
|    7|    6|
|    3|    7|
|    9|    8|
|   10|    9|
+-----+-----+
This is the code I am running:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window

val arr = Array((1,1), (7,2), (3,3), (3,4), (5,5), (21,6), (7,7), (3,8), (9,9), (10,10))
println("before sum, number of partitions: " + df.rdd.getNumPartitions)
df = df.withColumn("sumvalue", sum(col("value")).over(Window.orderBy(col("ts"))))
println("after sum, number of partitions: " + df.rdd.getNumPartitions)
df.show()
This is the expected result:
+-----+---+--------+
|value| ts|sumvalue|
+-----+---+--------+
|    1|  1|       1|
|    7|  2|       8|
|    3|  3|      11|
|    3|  4|      14|
|    5|  5|      19|
|   21|  6|      40|
|    7|  7|      47|
|    3|  8|      50|
|    9|  9|      59|
|   10| 10|      69|
+-----+---+--------+
But this way, all of the data gets collected into a single partition. How can I get the same result while keeping multiple partitions?
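For reference, df can be created from those (value, ts) pairs roughly like this (a minimal sketch assuming a local SparkSession; the original setup is not shown in the question):

import org.apache.spark.sql.SparkSession

// Assumed setup: a local SparkSession just for testing
val spark = SparkSession.builder().appName("cumulative-sum-example").master("local[*]").getOrCreate()
import spark.implicits._

// Build the input DataFrame from the (value, ts) pairs used in the snippet above
var df = Seq((1, 1), (7, 2), (3, 3), (3, 4), (5, 5), (21, 6), (7, 7), (3, 8), (9, 9), (10, 10))
  .toDF("value", "ts")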
Your expected result cannot be achieved using multiple partitions, because sumvalue depends on the values of all previous rows. Transformations are performed on the data that lives on the same executor, and this transformation depends on the whole dataset, not on separate parts of it. Your expected result can, however, be achieved without partitioning in the following way.
Given the input DataFrame
+-----+---+
|value|ts |
+-----+---+
|1    |1  |
|7    |2  |
|3    |3  |
|3    |4  |
|5    |5  |
|21   |6  |
|7    |7  |
|3    |8  |
|9    |9  |
|10   |10 |
+-----+---+
you can use a cumulative sum with a window function:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window

val windowSpec = Window.orderBy("ts").rowsBetween(Long.MinValue, 0)
val finalDF = df.withColumn("sumvalue", sum(col("value")).over(windowSpec))
finalDF.show(false)
which should give the following result:
+-----+---+--------+
|value|ts |sumvalue|
+-----+---+--------+
|1    |1  |1       |
|7    |2  |8       |
|3    |3  |11      |
|3    |4  |14      |
|5    |5  |19      |
|21   |6  |40      |
|7    |7  |47      |
|3    |8  |50      |
|9    |9  |59      |
|10   |10 |69      |
+-----+---+--------+
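Note that this is exactly why all of the data ends up in one partition: an ordered window with no partitionBy forces Spark to shuffle every row into a single partition (Spark typically logs a warning about no partition being defined for the window operation). A quick sketch to confirm this, reusing finalDF from the snippet above:

// With no partitionBy on the window, the exchange is a single partition,
// so the resulting DataFrame usually reports exactly 1 partition.
println(finalDF.rdd.getNumPartitions)   // typically prints 1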
Updated:
Looking at the comments and the title of the question, you must have a column (or columns) that you can use to partition by.
So, given the DataFrame
+-----+-----+---+
|group|value|ts |
+-----+-----+---+
|a    |1    |1  |
|a    |7    |2  |
|a    |3    |3  |
|a    |3    |4  |
|a    |5    |5  |
|a    |21   |6  |
|a    |7    |7  |
|a    |3    |8  |
|a    |9    |9  |
|a    |10   |10 |
|b    |10   |11 |
|b    |10   |12 |
|b    |10   |13 |
|b    |10   |14 |
|b    |10   |15 |
+-----+-----+---+
you can use partitionBy as shown below:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window

val windowSpec = Window.partitionBy("group").orderBy("ts").rowsBetween(Long.MinValue, 0)
val finalDF = df.withColumn("sumvalue", sum(col("value")).over(windowSpec))
finalDF.show(false)
and the result would be:
+-----+-----+---+--------+
|group|value|ts |sumvalue|
+-----+-----+---+--------+
|b    |10   |11 |10      |
|b    |10   |12 |20      |
|b    |10   |13 |30      |
|b    |10   |14 |40      |
|b    |10   |15 |50      |
|a    |1    |1  |1       |
|a    |7    |2  |8       |
|a    |3    |3  |11      |
|a    |3    |4  |14      |
|a    |5    |5  |19      |
|a    |21   |6  |40      |
|a    |7    |7  |47      |
|a    |3    |8  |50      |
|a    |9    |9  |59      |
|a    |10   |10 |69      |
+-----+-----+---+--------+
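Because the window now partitions by group, the shuffle hash-partitions the rows by that column, so the cumulative sum of each group can be computed independently and the data is no longer forced into a single partition. A rough way to build this sample and verify it (a sketch; the construction of the grouped DataFrame and the variable names are my own assumptions, not part of the original answer):

import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window

// Hypothetical construction of the grouped sample DataFrame (requires spark.implicits._)
val grouped = Seq(
  ("a", 1, 1), ("a", 7, 2), ("a", 3, 3), ("a", 3, 4), ("a", 5, 5),
  ("a", 21, 6), ("a", 7, 7), ("a", 3, 8), ("a", 9, 9), ("a", 10, 10),
  ("b", 10, 11), ("b", 10, 12), ("b", 10, 13), ("b", 10, 14), ("b", 10, 15)
).toDF("group", "value", "ts")

val spec = Window.partitionBy("group").orderBy("ts").rowsBetween(Long.MinValue, 0)
val result = grouped.withColumn("sumvalue", sum(col("value")).over(spec))

// The exchange hash-partitions on "group", so the result is not collapsed into one
// partition; the exact count depends on spark.sql.shuffle.partitions (and AQE).
println(result.rdd.getNumPartitions)
result.show(false)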
I hope you find this answer helpful.