r - Variable binning based on column value -


this difficult describe (thus vague title) i'll give example data want. have 2 dataframes

set.seed(5)     df = data.frame(id  = sort(rep(c("2006_01", "2006_02", "2006_03", "2006_04"), length.out = 100)),              increment = rep(seq(from = 1, = 25, = 1), length.out = 100),             var1 = rnorm(100)) 

and

set.seed(23) df2 = data.frame(id  = sort(rep(c("2006_01", "2006_02", "2006_03", "2006_04"), length.out = 200)),               distance = rep(seq(from = 1, = 50, = 1), length.out = 200),               var2 = rnorm(200)) 

they both transects $increment , $distance being measures of how far along transect measurements taken. longer 1 has twice many measurements because higher resolution, representing same transect. join them based on proportion of distance through transect. calculated proportion column each following code:

df = ddply(df, "id", transform, proportion = increment/max(increment)) 

and

df2 = ddply(df2, "id", transform, proportion = distance/max(distance)) 

the output want is, join 2 dataframes (keeping var1 , var2). because df2 higher resolution, way join df bin it, real data isn't regular sample data, need dynamically bin based on proportion columns it's calculating means in df2 proportion values fall between each set of proportion values in df

to try , summarize, i'm trying join sets of data have different resolutions calculating mean values of higher resolution data set of points fall within resolution of lower resolution set.

-edit try , add desired output-

the first few lines of output dataframe following:

id        increment        var1           var2 2006_001         1  -0.84085548     -0.1207349 2006_001         2   1.38435934       1.353328 2006_001         3  -1.25549186       1.052048 2006_001         4   0.07014277      0.3705596 

one method through tidyverse packages.

library(dplyr) library(tidyr) # nest, unnest library(purrr) # pmap 

since 2 data.frames different dimensions, find it's nice nest data column.

df2 <- df2 %>%   group_by(id) %>%   mutate(     proportion = (distance - min(distance)) / diff(range(distance))   ) %>%   nest(.key = "dist") df2 # # tibble: 4 × 2 #        id              dist #    <fctr>            <list> # 1 2006_01 <tibble [50 × 3]> # 2 2006_02 <tibble [50 × 3]> # 3 2006_03 <tibble [50 × 3]> # 4 2006_04 <tibble [50 × 3]>  df3 <- df %>%   group_by(id) %>%   mutate(     proportion = (increment - min(increment)) / diff(range(increment))   ) %>%   nest(.key = "incr") %>%   left_join(df2, = "id") %>%   mutate(     incr = pmap(list(incr, dist),                 function(a, b) {                   zz <- tail(a$proportion, n = -1) -                     (tail(a$proportion, n = -1) - head(a$proportion, n = -1)) / 2                   a$var2 <- as.numeric(                     by(b$var2, cut(b$proportion, c(-1, zz, 2), labels = false), mean)                   )                                   })   ) 

now things set in df2, start doing same in df, , combine them side-by-side:

df3 <- df %>%   group_by(id) %>%   mutate(     proportion = (increment - min(increment)) / diff(range(increment))   ) %>%   nest(.key = "incr") %>%   left_join(df2, = "id") df3 # # tibble: 4 × 3 #        id              incr              dist #    <fctr>            <list>            <list> # 1 2006_01 <tibble [25 × 3]> <tibble [50 × 3]> # 2 2006_02 <tibble [25 × 3]> <tibble [50 × 3]> # 3 2006_03 <tibble [25 × 3]> <tibble [50 × 3]> # 4 2006_04 <tibble [25 × 3]> <tibble [50 × 3]> 

note allows associate x rows of 1 data.frame y rows of another, when joined via id.

df3 %>%   mutate(     incr = pmap(list(incr, dist),                 function(a, b) {                   # offset between breaks, ...                   breaks <- tail(a$proportion, n = -1) -                     (tail(a$proportion, n = -1) - head(a$proportion, n = -1)) / 2                   # ... bookends ensure 100% membership                   breaks <- c(-1, breaks, 2)                   a$var2 <- as.numeric(                     by(b$var2, cut(b$proportion, breaks), mean)                   )                                   })   ) %>%   select(id, incr) %>%   unnest() %>%   select(-proportion) # # tibble: 100 × 4 #         id increment        var1        var2 #     <fctr>     <dbl>       <dbl>       <dbl> # 1  2006_01         1 -0.84085548 -0.12073489 # 2  2006_01         2  1.38435934  1.35332759 # 3  2006_01         3 -1.25549186  1.05204780 # 4  2006_01         4  0.07014277  0.37055960 # 5  2006_01         5  1.71144087  0.81060839 # 6  2006_01         6 -0.60290798 -0.41412345 # 7  2006_01         7 -0.47216639  0.09643082 # 8  2006_01         8 -0.63537131 -0.45411977 # 9  2006_01         9 -0.28577363 -0.48124606 # 10 2006_01        10  0.13810822  0.34763251 # # ... 90 more rows 

Comments

Popular posts from this blog

android - InAppBilling registering BroadcastReceiver in AndroidManifest -

python Tkinter Capturing keyboard events save as one single string -

sql server - Why does Linq-to-SQL add unnecessary COUNT()? -