R - Identify time gap in large data
I'm working on a function to identify gaps in a series of start/end dates. The output should be FALSE if a start date is more than 1 day after all of the previous end dates for the same id.
Data:
df <- data.frame(
  'id'    = c('1', '1', '1', '1', '1', '1'),
  'start' = as.Date(c('2010-01-01', '2010-01-03', '2010-01-05',
                      '2010-01-09', '2010-02-01', '2010-02-10')),
  'end'   = as.Date(c('2010-01-03', '2010-01-22', '2010-01-07',
                      '2010-01-12', '2010-02-10', '2010-02-12'))
)
Desired output:
  id      start        end continuous
1  1 2010-01-01 2010-01-03      FALSE
2  1 2010-01-03 2010-01-22       TRUE
3  1 2010-01-05 2010-01-07       TRUE
4  1 2010-01-09 2010-01-12       TRUE
5  1 2010-02-01 2010-02-10      FALSE
6  1 2010-02-10 2010-02-12       TRUE
This code gets the desired result on a small dataset:
df$continuous <- sapply(split(df, df$id), function(x) {
  lapply(1:nrow(x), function(y) {
    # TRUE if this start is within 1 day of any earlier end date
    any(x$start[y] - x$end[-(y:NROW(x$end))] <= 1)
  })
})
However, applying it to a bigger dataset (>100,000 observations) with many different ids still creates wrong output. Example:
id      start        end continuous
 2 2015-01-15 2015-01-15      FALSE
 2 2015-01-16 2015-01-17       TRUE
 2 2015-01-16 2015-01-17      FALSE  # wrong, should be TRUE
 2 2015-01-17 2015-01-19       TRUE
 2 2015-01-20 2015-01-22       TRUE
 2 2015-01-22 2015-01-23      FALSE  # wrong, should be TRUE
 2 2015-01-26 2015-01-26       TRUE
 2 2015-01-26 2015-01-30       TRUE
 2 2015-01-26 2015-01-26      FALSE  # wrong, should be TRUE
 2 2015-02-01 2015-02-06       TRUE  # wrong, should be FALSE
Does anyone know why?
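For comparison, here is a minimal base R sketch of the rule described above (a start counts as continuous when it is at most 1 day after the latest earlier end date within the same id). It is only an illustration under stated assumptions, not a diagnosis of the code in the question: the name flag_continuous and the tol argument are hypothetical, "previous" is assumed to mean earlier rows after sorting by id, start and end, and the result is written back in the original row order so it lines up with the data frame even when there are many ids.

```r
# Hypothetical helper (not from the original post): flags rows whose start
# is at most `tol` days after the latest end date seen earlier in the same id.
flag_continuous <- function(df, tol = 1) {
  # Assumption: "previous" rows are those that come earlier once the data
  # are sorted by id, then start, then end.
  ord <- order(df$id, df$start, df$end)
  d <- df[ord, ]

  # Running maximum of the end dates within each id (as day counts).
  run_max <- ave(as.numeric(d$end), d$id, FUN = cummax)

  # Shift by one row within each id so a row only sees earlier rows;
  # the first row of each id has no predecessor and gets NA.
  prev_max <- ave(run_max, d$id, FUN = function(v) c(NA, v[-length(v)]))

  cont <- !is.na(prev_max) & (as.numeric(d$start) - prev_max <= tol)

  # Write the flags back in the original row order.
  out <- logical(nrow(df))
  out[ord] <- cont
  out
}

# Small example data from the question.
df <- data.frame(
  id    = c("1", "1", "1", "1", "1", "1"),
  start = as.Date(c("2010-01-01", "2010-01-03", "2010-01-05",
                    "2010-01-09", "2010-02-01", "2010-02-10")),
  end   = as.Date(c("2010-01-03", "2010-01-22", "2010-01-07",
                    "2010-01-12", "2010-02-10", "2010-02-12"))
)

df$continuous <- flag_continuous(df)
print(df)
```

Because the comparison only needs the running maximum of earlier end dates, this avoids the row-by-row loop over all previous rows, which may matter at >100,000 observations.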