r - Sparklyr handling categorical variables
I come from an R background and am used to categorical variables being handled in the backend (as factors). With sparklyr it is quite confusing to use string_indexer or onehotencoder.
For example, I have a number of variables that are encoded as numerical variables in the original dataset but are actually categorical. I want to use them as categorical variables, but I am not sure whether I am doing it correctly.
    library(sparklyr)
    library(dplyr)
    sessionInfo()

    # spark_version is assumed to hold the Spark version string to connect with
    sc <- spark_connect(master = "local", version = spark_version)
    spark_version(sc)

    set.seed(1)
    exampledf <- data.frame(id = 1:10,
                            resp = sample(c(100:205), 10, replace = TRUE),
                            numb = sample(1:10, 10))

    example <- copy_to(sc, exampledf)

    # Index the response as strings, fit a decision tree, then predict
    pred <- example %>%
      mutate(resp = as.character(resp)) %>%
      sdf_mutate(resp_cat = ft_string_indexer(resp)) %>%
      ml_decision_tree(response = "resp_cat", features = "numb") %>%
      sdf_predict()
    pred
The prediction of the model is not categorical; see below. Does this mean I have to convert the prediction from resp_cat back to resp?
    R version 3.4.0 (2017-04-21)
    Platform: x86_64-redhat-linux-gnu (64-bit)
    Running under: CentOS Linux 7 (Core)

    spark_version(sc)
    [1] ‘2.1.1.2.6.1.0’

    Source:   table<sparklyr_tmp_74e340c5607c> [?? x 6]
    Database: spark_connection
          id  numb  resp resp_cat id74e35c6b2dbb prediction
       <int> <int> <chr>    <dbl>          <dbl>      <dbl>
     1     1    10   150        8              0   8.000000
     2     2     3   191        4              1   4.000000
     3     3     4   146        9              2   9.000000
     4     4     9   125        5              3   5.000000
     5     5     8   107        2              4   2.000000
     6     6     2   110        1              5   1.000000
     7     7     5   133        3              6   5.333333
     8     8     7   154        6              7   5.333333
     9     9     1   170        0              8   0.000000
    10    10     6   143        7              9   5.333333
In general, Spark depends on column metadata when handling categorical data. In your pipeline this is handled by StringIndexer (ft_string_indexer). ML always predicts labels (the encoded indices), not the original strings. Typically you would convert them back with the IndexToString transformer, which sparklyr provides as ft_index_to_string.
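To make the encode/decode round trip concrete, here is a minimal base R analogy that needs no Spark connection. It is only a sketch of the idea, not sparklyr code: the sample values and the names labels, resp_cat and decoded are made up for illustration. It orders labels by descending frequency, which is how StringIndexer assigns indices by default.

```r
# Base R analogy of StringIndexer / IndexToString (illustrative only, no Spark).
resp <- c("150", "191", "150", "146", "150", "191")

# Order the distinct labels by descending frequency; the most frequent
# label gets index 0, mirroring StringIndexer's default ordering.
labels <- names(sort(table(resp), decreasing = TRUE))

# Encode: zero-based position of each string in `labels`.
resp_cat <- match(resp, labels) - 1

# Decode: look each index back up in `labels` (what IndexToString does
# when it is given a list of labels).
decoded <- labels[resp_cat + 1]

print(data.frame(resp, resp_cat, decoded))
```

A model trained on resp_cat only ever sees the numeric indices, which is why its predictions come back as indices and need a lookup like the last step to become strings again.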
In Spark, IndexToString can use either a provided list of labels or Column metadata. Unfortunately, the sparklyr implementation is limited in two ways:
- It can use only metadata, which is not set on the prediction column.
- ft_string_indexer discards the trained model, so it cannot be used to extract labels.
It is possible that I missed something, but it looks like you'll have to map the predictions manually, for example by joining with the transformed data:
    pred %>%
      select(prediction = resp_cat, resp_prediction = resp) %>%
      distinct() %>%
      right_join(pred)

    Joining, by = "prediction"
    # Source:   lazy query [?? x 9]
    # Database: spark_connection
       prediction resp_prediction    id  numb  resp resp_cat id777a79821e1e
            <dbl>           <chr> <int> <int> <chr>    <dbl>          <dbl>
     1          7             171     1     3   171        7              0
     2          0             153     2    10   153        0              1
     3          3             132     3     8   132        3              2
     4          5             122     4     7   122        5              3
     5          6             198     5     4   198        6              4
     6          2             164     6     9   164        2              5
     7          4             137     7     6   137        4              6
     8          1             184     8     5   184        1              7
     9          0             153     9     1   153        0              8
    10          1             184    10     2   184        1              9
    # ... with more rows, and 2 more variables: rawPrediction <list>,
    #   probability <list>
Explanation:

    pred %>% select(prediction = resp_cat, resp_prediction = resp) %>% distinct()

creates a mapping from the prediction (encoded label) to the original label. We rename resp_cat to prediction so it can serve as the join key, and resp to resp_prediction to avoid a conflict with the actual resp.
Finally, we apply a right equijoin, which keeps every row of pred and attaches the matching original label:

    ... %>% right_join(pred)
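The same mapping-and-join pattern can be tried on a local data frame without a Spark connection. The sketch below uses a hypothetical stand-in for pred (the name pred_local and its values are invented for illustration) and spells out the join key explicitly with by = "prediction".

```r
library(dplyr)

# Hypothetical local stand-in for `pred`: resp_cat is the indexed label,
# prediction is what a classifier might return.
pred_local <- data.frame(
  id = 1:4,
  resp = c("150", "191", "146", "150"),
  resp_cat = c(0, 1, 2, 0),
  prediction = c(0, 1, 0, 0),
  stringsAsFactors = FALSE
)

# Lookup table: encoded label -> original label.
mapping <- pred_local %>%
  select(prediction = resp_cat, resp_prediction = resp) %>%
  distinct()

# Right join keeps every row of pred_local and adds resp_prediction.
decoded <- mapping %>%
  right_join(pred_local, by = "prediction") %>%
  arrange(id)

print(decoded)
```

Row 3 shows why the rename matters: its actual resp is "146", while resp_prediction is "150", and both columns survive the join side by side.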
Note:
You should specify the type of tree explicitly; otherwise the numeric resp_cat column can be treated as a regression target:

    ml_decision_tree(response = "resp_cat", features = "numb", type = "classification")
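The fractional predictions in the question's output are consistent with a regression tree averaging the encoded labels within a leaf. The three rows predicted as 5.333333 (id 7, 8 and 10) have resp_cat values 3, 6 and 7. The check below assumes those three rows fell into the same leaf, which the question does not show directly.

```r
# resp_cat of the rows with id 7, 8 and 10 in the question's output.
# Assumption: these rows share one leaf of the fitted tree.
leaf_labels <- c(3, 6, 7)

# A regression leaf predicts the mean of its training labels.
leaf_mean <- mean(leaf_labels)
print(round(leaf_mean, 6))  # 5.333333
```

An average of indices such as 5.333333 does not correspond to any category, which is why a classification tree is the appropriate choice here.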