Spark: Filter data with groupBy count
I have a dataframe a_df like:
+------+----+-----+
|   uid|year|month|
+------+----+-----+
|     1|2017|   03|
|     1|2017|   05|
|     2|2017|   01|
|     3|2017|   02|
|     3|2017|   04|
|     3|2017|   05|
+------+----+-----+
I want to keep only the rows whose uid occurs more than 2 times. Expected result:
+------+----+-----+
|   uid|year|month|
+------+----+-----+
|     3|2017|   02|
|     3|2017|   04|
|     3|2017|   05|
+------+----+-----+
How can I get this result in Scala? My solution:
import org.apache.spark.sql.functions.count

val condition_uid = a_df.groupBy("uid")
  .agg(count("*").alias("cnt"))
  .filter("cnt > 2")
  .select("uid")
val results_df = a_df.join(condition_uid, Seq("uid"))
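For reference, here is a minimal sketch to reproduce the sample data and verify this solution in spark-shell. The string types for uid, year, and month are an assumption, chosen to preserve the zero-padded values shown above:

// Assumes a spark-shell session where `spark` is in scope.
import spark.implicits._

val a_df = Seq(
  ("1", "2017", "03"), ("1", "2017", "05"),
  ("2", "2017", "01"),
  ("3", "2017", "02"), ("3", "2017", "04"), ("3", "2017", "05")
).toDF("uid", "year", "month")

With this in scope, results_df.show prints exactly the expected table above.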
Is there a better answer?
I think using a window function is the perfect solution here, since it does not require re-joining the dataframe.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.count

// No orderBy: with an orderBy, the default window frame turns count into a
// running count instead of the total count per uid.
val window = Window.partitionBy("uid")
a_df.withColumn("count", count("uid").over(window))
  .filter($"count" > 2)
  .drop("count")
  .show
Output:

+---+----+-----+
|uid|year|month|
+---+----+-----+
|  3|2017|   02|
|  3|2017|   04|
|  3|2017|   05|
+---+----+-----+
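If you prefer SQL syntax, the same window-based filter can be expressed directly. A sketch, assuming the a_df from the setup above and a SparkSession named spark:

a_df.createOrReplaceTempView("a")
spark.sql("""
  SELECT uid, year, month
  FROM (SELECT *, count(*) OVER (PARTITION BY uid) AS cnt FROM a) t
  WHERE cnt > 2
""").show

Note that the window still shuffles the data by uid, so the main saving over the groupBy-plus-join answer is avoiding the second shuffle for the join.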