r - Variable binning based on column value -
this difficult describe (thus vague title) i'll give example data want. have 2 dataframes
set.seed(5) df = data.frame(id = sort(rep(c("2006_01", "2006_02", "2006_03", "2006_04"), length.out = 100)), increment = rep(seq(from = 1, = 25, = 1), length.out = 100), var1 = rnorm(100))
and
set.seed(23) df2 = data.frame(id = sort(rep(c("2006_01", "2006_02", "2006_03", "2006_04"), length.out = 200)), distance = rep(seq(from = 1, = 50, = 1), length.out = 200), var2 = rnorm(200))
they both transects $increment
, $distance
being measures of how far along transect measurements taken. longer 1 has twice many measurements because higher resolution, representing same transect. join them based on proportion of distance through transect. calculated proportion column each following code:
df = ddply(df, "id", transform, proportion = increment/max(increment))
and
df2 = ddply(df2, "id", transform, proportion = distance/max(distance))
the output want is, join 2 dataframes (keeping var1
, var2
). because df2
higher resolution, way join df
bin it, real data isn't regular sample data, need dynamically bin based on proportion
columns it's calculating means in df2
proportion values fall between each set of proportion values in df
to try , summarize, i'm trying join sets of data have different resolutions calculating mean values of higher resolution data set of points fall within resolution of lower resolution set.
-edit try , add desired output-
the first few lines of output dataframe following:
id increment var1 var2 2006_001 1 -0.84085548 -0.1207349 2006_001 2 1.38435934 1.353328 2006_001 3 -1.25549186 1.052048 2006_001 4 0.07014277 0.3705596
one method through tidyverse packages.
library(dplyr) library(tidyr) # nest, unnest library(purrr) # pmap
since 2 data.frames different dimensions, find it's nice nest
data column.
df2 <- df2 %>% group_by(id) %>% mutate( proportion = (distance - min(distance)) / diff(range(distance)) ) %>% nest(.key = "dist") df2 # # tibble: 4 × 2 # id dist # <fctr> <list> # 1 2006_01 <tibble [50 × 3]> # 2 2006_02 <tibble [50 × 3]> # 3 2006_03 <tibble [50 × 3]> # 4 2006_04 <tibble [50 × 3]> df3 <- df %>% group_by(id) %>% mutate( proportion = (increment - min(increment)) / diff(range(increment)) ) %>% nest(.key = "incr") %>% left_join(df2, = "id") %>% mutate( incr = pmap(list(incr, dist), function(a, b) { zz <- tail(a$proportion, n = -1) - (tail(a$proportion, n = -1) - head(a$proportion, n = -1)) / 2 a$var2 <- as.numeric( by(b$var2, cut(b$proportion, c(-1, zz, 2), labels = false), mean) ) }) )
now things set in df2
, start doing same in df
, , combine them side-by-side:
df3 <- df %>% group_by(id) %>% mutate( proportion = (increment - min(increment)) / diff(range(increment)) ) %>% nest(.key = "incr") %>% left_join(df2, = "id") df3 # # tibble: 4 × 3 # id incr dist # <fctr> <list> <list> # 1 2006_01 <tibble [25 × 3]> <tibble [50 × 3]> # 2 2006_02 <tibble [25 × 3]> <tibble [50 × 3]> # 3 2006_03 <tibble [25 × 3]> <tibble [50 × 3]> # 4 2006_04 <tibble [25 × 3]> <tibble [50 × 3]>
note allows associate x
rows of 1 data.frame y
rows of another, when joined via id
.
df3 %>% mutate( incr = pmap(list(incr, dist), function(a, b) { # offset between breaks, ... breaks <- tail(a$proportion, n = -1) - (tail(a$proportion, n = -1) - head(a$proportion, n = -1)) / 2 # ... bookends ensure 100% membership breaks <- c(-1, breaks, 2) a$var2 <- as.numeric( by(b$var2, cut(b$proportion, breaks), mean) ) }) ) %>% select(id, incr) %>% unnest() %>% select(-proportion) # # tibble: 100 × 4 # id increment var1 var2 # <fctr> <dbl> <dbl> <dbl> # 1 2006_01 1 -0.84085548 -0.12073489 # 2 2006_01 2 1.38435934 1.35332759 # 3 2006_01 3 -1.25549186 1.05204780 # 4 2006_01 4 0.07014277 0.37055960 # 5 2006_01 5 1.71144087 0.81060839 # 6 2006_01 6 -0.60290798 -0.41412345 # 7 2006_01 7 -0.47216639 0.09643082 # 8 2006_01 8 -0.63537131 -0.45411977 # 9 2006_01 9 -0.28577363 -0.48124606 # 10 2006_01 10 0.13810822 0.34763251 # # ... 90 more rows
Comments
Post a Comment