Spark: Filter data with groupBy count
I have a DataFrame a_df like this:
+------+----+-----+
|   uid|year|month|
+------+----+-----+
|     1|2017|   03|
|     1|2017|   05|
|     2|2017|   01|
|     3|2017|   02|
|     3|2017|   04|
|     3|2017|   05|
+------+----+-----+

I want to keep only the rows whose uid occurs more than 2 times. Expected result:
+------+----+-----+
|   uid|year|month|
+------+----+-----+
|     3|2017|   02|
|     3|2017|   04|
|     3|2017|   05|
+------+----+-----+

How can I get this result in Scala? My solution:
import org.apache.spark.sql.functions.count

// uids that appear more than twice
val condition_uid = a_df.groupBy("uid")
  .agg(count("*").alias("cnt"))
  .filter("cnt > 2")
  .select("uid")

// keep only the rows belonging to those uids
val results_df = a_df.join(condition_uid, Seq("uid"))

Is there a better answer?
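For reference, here is a minimal self-contained sketch of this groupBy-and-join approach. The local SparkSession setup, the object name FilterByGroupCount, and the inline sample data (recreated from the table above) are my own assumptions, not part of the original question:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.count

object FilterByGroupCount {
  def main(args: Array[String]): Unit = {
    // Local session just for this sketch
    val spark = SparkSession.builder()
      .appName("filter-by-group-count")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Sample data recreated from the question's table
    val a_df = Seq(
      (1, "2017", "03"), (1, "2017", "05"),
      (2, "2017", "01"),
      (3, "2017", "02"), (3, "2017", "04"), (3, "2017", "05")
    ).toDF("uid", "year", "month")

    // uids occurring more than twice, then an inner join keeps their rows
    val condition_uid = a_df.groupBy("uid")
      .agg(count("*").alias("cnt"))
      .filter("cnt > 2")
      .select("uid")

    val results_df = a_df.join(condition_uid, Seq("uid"))
    results_df.show()  // only the uid = 3 rows remain

    spark.stop()
  }
}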
I think using a window function is the perfect solution here, since you don't have to join the DataFrame back to itself.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.count

val window = Window.partitionBy("uid").orderBy("year")

// With orderBy, the default frame is a running one ("unbounded preceding
// to current row" by range); here every row in a partition shares the
// same year, so each row still sees the full count for its uid.
df.withColumn("count", count("uid").over(window))
  .filter($"count" > 2)
  .drop("count")
  .show()

Intermediate counts for uid 1 and 2 (uid 3's rows each get count 3, so they are the only ones the filter keeps):
+---+----+-----+-----+
|uid|year|month|count|
+---+----+-----+-----+
|  1|2017|   03|    2|
|  1|2017|   05|    2|
|  2|2017|   01|    1|
+---+----+-----+-----+
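One design note: with orderBy in the window spec, count is evaluated over a growing frame, which only coincides with the total per uid when the ordering column has equal values within a partition (as it does here). A sketch of the same idea using partitionBy alone, which always counts the whole partition; the SparkSession setup, object name WindowCountFilter, and sample data are assumptions reused from the sketch above:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.count

object WindowCountFilter {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("window-count-filter")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    val df = Seq(
      (1, "2017", "03"), (1, "2017", "05"),
      (2, "2017", "01"),
      (3, "2017", "02"), (3, "2017", "04"), (3, "2017", "05")
    ).toDF("uid", "year", "month")

    // No orderBy: the frame is the whole partition, so count("uid")
    // yields the total number of rows per uid on every row.
    val byUid = Window.partitionBy("uid")

    df.withColumn("count", count("uid").over(byUid))
      .filter($"count" > 2)
      .drop("count")
      .show()  // only the uid = 3 rows remain

    spark.stop()
  }
}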