R - Best practice for handling different datasets with the same type of data but different column names
For example, suppose I want to build a package for analyzing customer transactions. In a nice world, every transactions dataset would look like

       transactionid customerid transactiondate
    1:             1          1      2017-01-01
    2:             2          2      2017-01-15
    3:             3          1      2017-05-20
    4:             4          3      2017-06-11

Then I could make nice functions like

    num_customers <- function(transactions) {
      length(unique(transactions$customerid))
    }

In reality, the column names people use vary (e.g. "customerid" or "cust_id" might be used by different companies).

My question is, what is the best way for me to deal with this? I plan on relying heavily on data.table, so my instinct is to make users provide a mapping from their column names to the ones I use, stored as an attribute of the table, like
    library(data.table)

    mytransactions <- data.table(
      transaction_id = c(1L, 2L, 3L, 4L),
      customer_id = c(1L, 2L, 1L, 3L),
      transaction_date = as.Date(c("2017-01-01", "2017-01-15", "2017-05-20", "2017-06-11"))
    )

    # Attach the mapping from standard names to this dataset's names
    setattr(
      mytransactions,
      name = "colmap",
      value = c(
        transactionid = "transaction_id",
        customerid = "customer_id",
        transactiondate = "transaction_date"
      )
    )
    attributes(mytransactions)

However, unfortunately, as soon as I subset the data, the attribute gets removed.

    attributes(mytransactions[1:2])
If you expect your data to have a specific shape and set of attributes, define a class. It's easy in R using the S3 system, since you only need to change the class attribute.

The best way to let users create S3 objects is through a function. To keep the original "feel" of adapting existing datasets, have users provide a dataset and name the columns to use for the different values. Default argument values can keep your package code succinct and reward standards-respecting users.
    transaction_table <- function(dataset,
                                  cust_id    = "customerid",
                                  trans_id   = "transactionid",
                                  trans_date = "transactiondate") {
      # Names are the standard column names; values are the user's column names
      keep_columns <- c(
        customerid      = cust_id,
        transactionid   = trans_id,
        transactiondate = trans_date
      )
      # Select the user's columns (this makes a copy), then rename to the standard
      out_table <- dataset[, keep_columns, with = FALSE]
      setnames(out_table, names(keep_columns))
      setattr(out_table, "class", c("transaction_table", class(out_table)))
      out_table
    }

    standardized <- transaction_table(
      mytransactions,
      cust_id    = "customer_id",
      trans_id   = "transaction_id",
      trans_date = "transaction_date"
    )
    standardized
    #    customerid transactionid transactiondate
    # 1:          1             1      2017-01-01
    # 2:          2             2      2017-01-15
    # 3:          1             3      2017-05-20
    # 4:          3             4      2017-06-11

As a bonus, you can now take full advantage of the S3 system by defining class-specific methods for generic functions.
    print.transaction_table <- function(x, ...) {
      # Use the object being printed (x), not a global variable
      time_range <- range(x[["transactiondate"]])
      formatted_range <- strftime(time_range)
      cat("transactions from", formatted_range[1], "to", formatted_range[2], "\n")
      # Fall through to the data.table print method for the table itself
      NextMethod()
    }

    print(standardized)
    # transactions from 2017-01-01 to 2017-06-11
    #    customerid transactionid transactiondate
    # 1:          1             1      2017-01-01
    # 2:          2             2      2017-01-15
    # 3:          1             3      2017-05-20
    # 4:          3             4      2017-06-11
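The same idea extends to the `num_customers` function from the question. Here is a minimal sketch, assuming you turn it into an S3 generic with a method for `transaction_table`. The `num_customers.default` fallback and the `demo` object are illustrative assumptions, and `demo` is built on a plain data.frame only to keep the example dependency-free. In the package it would be the output of `transaction_table()`.

```r
# Generic: dispatches on the class of its first argument.
num_customers <- function(x, ...) UseMethod("num_customers")

# Method for standardized tables: the class guarantees the column name.
num_customers.transaction_table <- function(x, ...) {
  length(unique(x[["customerid"]]))
}

# Hypothetical fallback: anything else fails with a pointer to the constructor.
num_customers.default <- function(x, ...) {
  stop("expected a 'transaction_table'; create one with transaction_table()")
}

# Stand-in object for illustration (assumption: standard column names,
# plain data.frame instead of data.table).
demo <- structure(
  data.frame(
    customerid      = c(1L, 2L, 1L, 3L),
    transactionid   = c(1L, 2L, 3L, 4L),
    transactiondate = as.Date(c("2017-01-01", "2017-01-15", "2017-05-20", "2017-06-11"))
  ),
  class = c("transaction_table", "data.frame")
)

num_customers(demo)
# [1] 3
```

With this layout, the analysis functions never need to know what the user's original column names were. Only the constructor deals with the mapping.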