Spark: Filter data with groupBy count
I have a DataFrame a_df like this:
+------+----+-----+
|   uid|year|month|
+------+----+-----+
|     1|2017|   03|
|     1|2017|   05|
|     2|2017|   01|
|     3|2017|   02|
|     3|2017|   04|
|     3|2017|   05|
+------+----+-----+

I want to keep only the rows whose uid occurs more than 2 times. Expected result:
+------+----+-----+
|   uid|year|month|
+------+----+-----+
|     3|2017|   02|
|     3|2017|   04|
|     3|2017|   05|
+------+----+-----+

How can I get this result in Scala? My solution:
import org.apache.spark.sql.functions.count

// uids that appear more than twice
val condition_uid = a_df.groupBy("uid")
  .agg(count("*").alias("cnt"))
  .filter("cnt > 2")
  .select("uid")

// keep only the rows belonging to those uids
val results_df = a_df.join(condition_uid, Seq("uid"))

Is there a better answer?
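For reference, here is a minimal self-contained sketch of this groupBy-and-join approach. The local SparkSession setup, the object name FilterByGroupCount, and the inline sample data (recreated from the table above) are my own assumptions, not part of the original question:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.count

object FilterByGroupCount {
  def main(args: Array[String]): Unit = {
    // Local session just for this sketch
    val spark = SparkSession.builder()
      .appName("filter-by-group-count")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Sample data recreated from the question's table
    val a_df = Seq(
      (1, "2017", "03"), (1, "2017", "05"),
      (2, "2017", "01"),
      (3, "2017", "02"), (3, "2017", "04"), (3, "2017", "05")
    ).toDF("uid", "year", "month")

    // uids occurring more than twice, then an inner join keeps their rows
    val condition_uid = a_df.groupBy("uid")
      .agg(count("*").alias("cnt"))
      .filter("cnt > 2")
      .select("uid")

    val results_df = a_df.join(condition_uid, Seq("uid"))
    results_df.show()  // only the uid = 3 rows remain

    spark.stop()
  }
}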
I think using a window function is the perfect solution here, since you don't have to join the DataFrame back to itself.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.count

val window = Window.partitionBy("uid").orderBy("year")

// With orderBy, the default frame is a running one ("unbounded preceding
// to current row" by range); here every row in a partition shares the
// same year, so each row still sees the full count for its uid.
df.withColumn("count", count("uid").over(window))
  .filter($"count" > 2)
  .drop("count")
  .show()

Intermediate counts for uid 1 and 2 (uid 3's rows each get count 3, so they are the only ones the filter keeps):
+---+----+-----+-----+
|uid|year|month|count|
+---+----+-----+-----+
|  1|2017|   03|    2|
|  1|2017|   05|    2|
|  2|2017|   01|    1|
+---+----+-----+-----+
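One design note: with orderBy in the window spec, count is evaluated over a growing frame, which only coincides with the total per uid when the ordering column has equal values within a partition (as it does here). A sketch of the same idea using partitionBy alone, which always counts the whole partition; the SparkSession setup, object name WindowCountFilter, and sample data are assumptions reused from the sketch above:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.count

object WindowCountFilter {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("window-count-filter")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    val df = Seq(
      (1, "2017", "03"), (1, "2017", "05"),
      (2, "2017", "01"),
      (3, "2017", "02"), (3, "2017", "04"), (3, "2017", "05")
    ).toDF("uid", "year", "month")

    // No orderBy: the frame is the whole partition, so count("uid")
    // yields the total number of rows per uid on every row.
    val byUid = Window.partitionBy("uid")

    df.withColumn("count", count("uid").over(byUid))
      .filter($"count" > 2)
      .drop("count")
      .show()  // only the uid = 3 rows remain

    spark.stop()
  }
}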