r - Best practice for handling different datasets with the same type of data but different column names


For example, suppose I want to build a package for analyzing customer transactions. In an ideal world, every transactions dataset would look like this:

   transactionid customerid transactiondate
1:             1          1      2017-01-01
2:             2          2      2017-01-15
3:             3          1      2017-05-20
4:             4          3      2017-06-11

Then I could write nice functions like this:

num_customers <- function(transactions) {
  # number of distinct customers in a transactions table
  length(unique(transactions$customerid))
}
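For instance, with a throwaway data.frame shaped like the table above (a minimal sketch; ideal_transactions is a hypothetical name, not from the original question):

ideal_transactions <- data.frame(
  transactionid   = 1:4,
  customerid      = c(1L, 2L, 1L, 3L),
  transactiondate = as.Date(c("2017-01-01", "2017-01-15", "2017-05-20", "2017-06-11"))
)
num_customers(ideal_transactions)
# [1] 3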

In reality, the column names people use vary (e.g. "customerid", "customer_id", or "cust_id" might be used at different companies).

My question is, what is the best way for me to deal with this? I plan on relying heavily on data.table, and my instinct is to make users provide a mapping from their column names to the ones I use, stored as an attribute of the table, like this:

library(data.table)

mytransactions <- data.table(
  transaction_id   = c(1L, 2L, 3L, 4L),
  customer_id      = c(1L, 2L, 1L, 3L),
  transaction_date = as.Date(c("2017-01-01", "2017-01-15", "2017-05-20", "2017-06-11"))
)

# record how this table's column names map onto the standard ones
setattr(
  mytransactions,
  name  = "colmap",
  value = c(transactionid = "transaction_id", customerid = "customer_id", transactiondate = "transaction_date")
)
attributes(mytransactions)

However, unfortunately, when I subset the data the attribute gets removed:

attributes(mytransactions[1:2])  # "colmap" is no longer among the attributes
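For context, the idea was that package functions would look up the mapping from that attribute, along the lines of this sketch (num_customers_mapped is a hypothetical name, not from the question):

num_customers_mapped <- function(transactions) {
  colmap <- attr(transactions, "colmap")
  length(unique(transactions[[ colmap[["customerid"]] ]]))
}

num_customers_mapped(mytransactions)  # 3
# num_customers_mapped(mytransactions[1:2]) would fail, because "colmap" is NULL on the subset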

If you expect your data to have a specific shape and set of attributes, define a class. That's easy to do in R using the S3 system, since all you need to do is change the class attribute.
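A minimal illustration of that point, independent of this particular problem:

x <- list(a = 1)
class(x) <- c("my_class", class(x))
inherits(x, "my_class")
# [1] TRUE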

The best way is to let users create S3 objects through a function. To keep the original "feel" of adapting existing datasets, have users provide a dataset and say which of their columns correspond to the ones the package expects. Default argument values can keep your package code succinct and reward standards-respecting users.

transaction_table <- function(dataset,
                              cust_id    = "customerid",
                              trans_id   = "transactionid",
                              trans_date = "transactiondate") {
  # map the user's column names onto the standard ones
  keep_columns <- c(
    customerid      = cust_id,
    transactionid   = trans_id,
    transactiondate = trans_date
  )
  out_table <- dataset[, keep_columns, with = FALSE]
  setnames(out_table, names(keep_columns))
  setattr(out_table, "class", c("transaction_table", class(out_table)))
  out_table
}

standardized <- transaction_table(
  mytransactions,
  cust_id    = "customer_id",
  trans_id   = "transaction_id",
  trans_date = "transaction_date"
)
standardized
#    customerid transactionid transactiondate
# 1:          1             1      2017-01-01
# 2:          2             2      2017-01-15
# 3:          1             3      2017-05-20
# 4:          3             4      2017-06-11
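With the names standardized, package functions like the question's num_customers can simply assume the canonical columns; a rough sketch that also checks the class up front:

num_customers <- function(transactions) {
  stopifnot(inherits(transactions, "transaction_table"))
  length(unique(transactions$customerid))
}

num_customers(standardized)
# [1] 3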

As a bonus, you can take full advantage of the S3 system by defining class-specific methods for generic functions.

print.transaction_table <- function(x, ...) {
  time_range <- range(x[["transactiondate"]])
  formatted_range <- strftime(time_range)
  cat("Transactions from", formatted_range[1], "to", formatted_range[2], "\n")
  NextMethod()
}

print(standardized)
# Transactions from 2017-01-01 to 2017-06-11
#    customerid transactionid transactiondate
# 1:          1             1      2017-01-01
# 2:          2             2      2017-01-15
# 3:          1             3      2017-05-20
# 4:          3             4      2017-06-11
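The same pattern extends to other generics; for example, a hypothetical summary method (not part of the original answer) could report a few headline figures:

summary.transaction_table <- function(object, ...) {
  cat("Customers:", length(unique(object[["customerid"]])), "\n")
  cat("Transactions:", nrow(object), "\n")
  invisible(object)
}

summary(standardized)
# Customers: 3
# Transactions: 4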
