R - Identify time gap in large data


I'm working on a function to identify gaps in a series of start/end dates. The output should be FALSE if a start date falls more than 1 day after all of the previous end dates.

Data:

    df <- data.frame(
      'id'    = c('1', '1', '1', '1', '1', '1'),
      'start' = as.Date(c('2010-01-01', '2010-01-03', '2010-01-05',
                          '2010-01-09', '2010-02-01', '2010-02-10')),
      'end'   = as.Date(c('2010-01-03', '2010-01-22', '2010-01-07',
                          '2010-01-12', '2010-02-10', '2010-02-12'))
    )

Desired output:

      id      start        end continuous
    1  1 2010-01-01 2010-01-03      FALSE
    2  1 2010-01-03 2010-01-22       TRUE
    3  1 2010-01-05 2010-01-07       TRUE
    4  1 2010-01-09 2010-01-12       TRUE
    5  1 2010-02-01 2010-02-10      FALSE
    6  1 2010-02-10 2010-02-12       TRUE

This code gets the desired result on a small dataset:

    df$continuous <-
      sapply(split(df, df$id),
             function(x) {
               # For each row y, compare its start with the end dates of all earlier rows
               lapply(1:nrow(x),
                      function(y) {
                        any(x$start[y] - x$end[-(y:NROW(x$end))] <= 1)
                      })
             })

However, applying it to a bigger dataset (>100,000 observations) with many different IDs still creates wrong output. Example:

    id      start        end  continuous
     2 2015-01-15 2015-01-15       FALSE
     2 2015-01-16 2015-01-17        TRUE
     2 2015-01-16 2015-01-17       FALSE  # wrong, should be TRUE
     2 2015-01-17 2015-01-19        TRUE
     2 2015-01-20 2015-01-22        TRUE
     2 2015-01-22 2015-01-23       FALSE  # wrong, should be TRUE
     2 2015-01-26 2015-01-26        TRUE
     2 2015-01-26 2015-01-30        TRUE
     2 2015-01-26 2015-01-26       FALSE  # wrong, should be TRUE
     2 2015-02-01 2015-02-06        TRUE  # wrong, should be FALSE

Does anyone know why?
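
One thing that might be worth checking, offered as a sketch rather than a confirmed diagnosis: as far as I can tell, `sapply(split(df, df$id), ...)` returns its results grouped by `id`, so assigning them straight back to `df$continuous` only lines up when the rows are already ordered by `id` (and by `start` within each `id`). The version below avoids relying on row order by sorting internally and writing the flags back to the original positions. The helper name `flag_continuous` and its `max_gap` argument are hypothetical and not part of the code above. It assumes `start` and `end` are `Date` columns with no missing values.

```r
# Sample data from the question (single id, already sorted).
df <- data.frame(
  id    = c('1', '1', '1', '1', '1', '1'),
  start = as.Date(c('2010-01-01', '2010-01-03', '2010-01-05',
                    '2010-01-09', '2010-02-01', '2010-02-10')),
  end   = as.Date(c('2010-01-03', '2010-01-22', '2010-01-07',
                    '2010-01-12', '2010-02-10', '2010-02-12'))
)

# Hypothetical helper: TRUE when a row starts within `max_gap` days of the
# latest end date among the earlier rows of the same id, FALSE otherwise
# (including the first row of each id, which has no earlier row).
flag_continuous <- function(df, max_gap = 1) {
  # Sort by id, then start, so "earlier rows" really are earlier periods.
  ord <- order(df$id, df$start)
  id  <- df$id[ord]
  s   <- as.numeric(df$start[ord])  # days since 1970-01-01
  e   <- as.numeric(df$end[ord])

  # Running maximum of the end dates within each id.
  run_max <- ave(e, id, FUN = cummax)
  # Shift down by one row within each id; the first row gets NA.
  prev_max <- ave(run_max, id, FUN = function(v) c(NA, v[-length(v)]))

  res <- !is.na(prev_max) & (s - prev_max <= max_gap)

  # Put the flags back in the original row order of df.
  out <- logical(nrow(df))
  out[ord] <- res
  out
}

df$continuous <- flag_continuous(df)
print(df)
```

The running maximum stands in for the comparison against every earlier end date: `any(start - previous_ends <= 1)` holds exactly when `start - max(previous_ends) <= 1`. That keeps the work roughly linear in the number of rows per `id` after sorting, which may matter with more than 100,000 observations.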

