Spark: Filter data with groupBy count


I have a DataFrame a_df like:

+------+----+-----+
|   uid|year|month|
+------+----+-----+
|     1|2017|   03|
|     1|2017|   05|
|     2|2017|   01|
|     3|2017|   02|
|     3|2017|   04|
|     3|2017|   05|
+------+----+-----+

I want to keep only the rows whose uid occurs more than 2 times. Expected result:

+------+----+-----+
|   uid|year|month|
+------+----+-----+
|     3|2017|   02|
|     3|2017|   04|
|     3|2017|   05|
+------+----+-----+

How can I get this result in Scala? My solution:

import org.apache.spark.sql.functions.count

// uids that occur more than twice
val condition_uid = a_df.groupBy("uid")
  .agg(count("*").alias("cnt"))
  .filter("cnt > 2")
  .select("uid")

// inner join keeps only the rows for those uids
val results_df = a_df.join(condition_uid, Seq("uid"))
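For reference, here is a minimal self-contained sketch that builds the sample a_df so the snippets in this post can be run as written; the local SparkSession and the string type of the month column (to keep the leading zero) are assumptions for illustration:

import org.apache.spark.sql.SparkSession

// assumed local session, for illustration only
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("uid-count-filter")
  .getOrCreate()
import spark.implicits._

// sample data from the question; month is a string to preserve the leading zero
val a_df = Seq(
  (1, 2017, "03"), (1, 2017, "05"),
  (2, 2017, "01"),
  (3, 2017, "02"), (3, 2017, "04"), (3, 2017, "05")
).toDF("uid", "year", "month")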

Is there a better answer?

I think using a window function is the perfect solution, since it does not require rejoining the DataFrame.

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.count

// no orderBy: the frame is the whole partition, so every row sees the total count for its uid
val window = Window.partitionBy("uid")

a_df.withColumn("count", count("uid").over(window))
  .filter($"count" > 2)
  .drop("count")
  .show

Output:

+---+----+-----+
|uid|year|month|
+---+----+-----+
|  3|2017|   02|
|  3|2017|   04|
|  3|2017|   05|
+---+----+-----+
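Note that the window deliberately has no orderBy. Adding one turns count into a running count over the default frame (RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, with ties on the ordering key counted together), which silently changes the result. A minimal illustration of the pitfall, reusing the a_df from the sketch above:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.count

// with an orderBy, count becomes a running count up to the current row
val ordered = Window.partitionBy("uid").orderBy("year", "month")

a_df.withColumn("running_cnt", count("uid").over(ordered)).show
// uid 3's rows get running_cnt 1, 2, 3 here, so filtering on "> 2"
// would keep only its last row instead of all three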
