pyspark - Getting difference between value and its lag in Spark
I have a SparkR DataFrame, shown below. I want to create a monthdiff column giving the months between dates, grouped by each name. How can I do this?
# Set up the data frame
team <- data.frame(name = c("thomas", "thomas", "thomas", "thomas", "bill", "bill", "bill"),
                   dates = c('2017-01-05', '2017-02-23', '2017-03-16', '2017-04-08', '2017-06-08', '2017-07-24', '2017-09-05'))

# Create the Spark DataFrame
team <- createDataFrame(team)

# Convert dates to date type
team <- withColumn(team, 'dates', cast(team$dates, 'date'))

Here's what I've tried so far, all resulting in errors:
team <- agg(groupBy(team, 'name'), monthdiff=c(NA, months_between(team$dates, lag(team$dates))))
team <- agg(groupBy(team, 'name'), monthdiff=months_between(team$dates, lag(team$dates)))
team <- agg(groupBy(team, 'name'), monthdiff=months_between(select(team, 'dates'), lag(select(team, 'dates'))))

Expected output:
name    | dates      | monthdiff
--------------------------------
thomas  | 2017-01-05 | NA
thomas  | 2017-02-23 | 1
thomas  | 2017-03-16 | 1
thomas  | 2017-04-08 | 1
bill    | 2017-06-08 | NA
bill    | 2017-07-24 | 1
bill    | 2017-09-05 | 2
Based on this post, I adapted the code to SparkR for an answer.
# Create a 'lagdates' variable holding the lag of dates within each name
window <- orderBy(windowPartitionBy("name"), team$dates)
team <- withColumn(team, 'lagdates', over(lag(team$dates), window))

# Get months_between dates and lagdates
team <- withColumn(team, 'monthdiff', round(months_between(team$dates, team$lagdates)))

name   | dates      | lagdates   | monthdiff
---------------------------------------------
bill   | 2017-06-08 | null       | null
bill   | 2017-07-24 | 2017-06-08 | 2
bill   | 2017-09-05 | 2017-07-24 | 1
thomas | 2017-01-05 | null       | null
thomas | 2017-02-23 | 2017-01-05 | 2
thomas | 2017-03-16 | 2017-02-23 | 1
thomas | 2017-04-08 | 2017-03-16 | 1
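Note that the monthdiff values above do not match the expected output in the question (for example 2 instead of 1 for bill's 2017-07-24 row). The plain-Python sketch below tries to show why, without needing a Spark install. It assumes the documented months_between rule for date-only inputs: a whole number when both dates fall on the same day of the month or are both the last day of their months, otherwise a fraction based on 31-day months. The function names `months_between` and `calendar_month_diff` are just illustrative helpers here, not Spark APIs, and Python's `round` stands in for Spark's `round` (they only differ on exact .5 ties, which do not occur in this data).

```python
import calendar
from datetime import date


def months_between(end: date, start: date) -> float:
    """Approximate Spark's months_between for date-only inputs (assumed rule)."""
    month_diff = (end.year - start.year) * 12 + (end.month - start.month)
    end_is_last = end.day == calendar.monthrange(end.year, end.month)[1]
    start_is_last = start.day == calendar.monthrange(start.year, start.month)[1]
    # Same day of month, or both month-ends: whole number of months.
    if end.day == start.day or (end_is_last and start_is_last):
        return float(month_diff)
    # Otherwise the day difference is counted in 31-day months.
    return round(month_diff + (end.day - start.day) / 31.0, 8)


def calendar_month_diff(end: date, start: date) -> int:
    """Difference in calendar months, ignoring the day of the month."""
    return (end.year - start.year) * 12 + (end.month - start.month)


# (current date, lagged date) pairs taken from the table above
pairs = [
    (date(2017, 7, 24), date(2017, 6, 8)),   # bill
    (date(2017, 9, 5), date(2017, 7, 24)),   # bill
    (date(2017, 2, 23), date(2017, 1, 5)),   # thomas
    (date(2017, 3, 16), date(2017, 2, 23)),  # thomas
    (date(2017, 4, 8), date(2017, 3, 16)),   # thomas
]

for end, start in pairs:
    raw = months_between(end, start)
    print(end, start, raw, round(raw), calendar_month_diff(end, start))
```

For these rows, rounding the fractional result gives 2, 1, 2, 1, 1, which is what the answer's table shows, while the calendar-month difference gives 1, 2, 1, 1, 1, which is what the question's expected output lists. So if the expected table is really what's wanted, the day of the month probably has to be ignored (or dropped) before taking the difference, rather than rounding months_between.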