sparklyr handling categorical variables

I come from an R background, where I am used to categorical variables being handled in the backend (as factors). In sparklyr I find it quite confusing to use string_indexer or onehotencoder.

For example, I have a number of variables that are encoded as numerical variables in the original dataset but are actually categorical. I want to use them as categorical variables, but I am not sure I am doing it correctly.

    library(sparklyr)
    library(dplyr)
    sessionInfo()

    # spark_version is assumed to have been set beforehand
    sc <- spark_connect(master = "local", version = spark_version)
    spark_version(sc)

    set.seed(1)
    exampledf <- data.frame(id = 1:10,
                            resp = sample(c(100:205), 10, replace = TRUE),
                            numb = sample(1:10, 10))

    example <- copy_to(sc, exampledf)

    # Index resp as a string label, fit a decision tree, and predict
    pred <- example %>%
      mutate(resp = as.character(resp)) %>%
      sdf_mutate(resp_cat = ft_string_indexer(resp)) %>%
      ml_decision_tree(response = "resp_cat", features = "numb") %>%
      sdf_predict()
    pred

The prediction from the model is not categorical; see below. Does this mean I have to convert the prediction from resp_cat back to resp?

    R version 3.4.0 (2017-04-21)
    Platform: x86_64-redhat-linux-gnu (64-bit)
    Running under: CentOS Linux 7 (Core)

    spark_version(sc)
    [1] ‘’

    Source:   table<sparklyr_tmp_74e340c5607c> [?? x 6]
    Database: spark_connection
          id  numb  resp resp_cat id74e35c6b2dbb prediction
       <int> <int> <chr>    <dbl>          <dbl>      <dbl>
     1     1    10   150        8              0   8.000000
     2     2     3   191        4              1   4.000000
     3     3     4   146        9              2   9.000000
     4     4     9   125        5              3   5.000000
     5     5     8   107        2              4   2.000000
     6     6     2   110        1              5   1.000000
     7     7     5   133        3              6   5.333333
     8     8     7   154        6              7   5.333333
     9     9     1   170        0              8   0.000000
    10    10     6   143        7              9   5.333333

In general, Spark depends on column metadata when handling categorical data. In your pipeline this is handled by StringIndexer (ft_string_indexer). ML models always predict the label (the numeric index), not the original string. Typically you would use the IndexToString transformer, which is provided by ft_index_to_string.

In Spark, IndexToString can use either a provided list of labels or the column metadata. Unfortunately, the sparklyr implementation is limited in two ways:

  • It can use only metadata, which is not set on the prediction column.
  • ft_string_indexer discards the trained model, so it cannot be used to extract the labels.
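
To make the label mapping concrete, here is a minimal base-R sketch (no Spark required) of what StringIndexer and IndexToString do conceptually. It assumes Spark's default ordering, in which the most frequent label receives index 0. The helpers fit_labels, to_index and to_label are hypothetical names, not sparklyr functions, and ties between equally frequent labels are broken alphabetically here, which may not match every Spark version.

```r
# Conceptual sketch only: emulates StringIndexer / IndexToString in base R.
# fit_labels(), to_index() and to_label() are hypothetical helpers.

# "Fit": order the distinct labels by descending frequency (ties alphabetically).
fit_labels <- function(x) {
  counts <- table(x)
  labs <- names(counts)
  labs[order(-as.integer(counts), labs)]
}

# "Transform": replace each string by its 0-based position in the label list.
to_index <- function(x, labels) match(x, labels) - 1

# "Inverse transform": look the 0-based index up in the label list.
to_label <- function(idx, labels) labels[idx + 1]

resp <- c("150", "191", "150", "146", "191", "150")
labels <- fit_labels(resp)          # label list kept from the "fit" step
idx <- to_index(resp, labels)       # numeric indices a model would be trained on
restored <- to_label(idx, labels)   # decoding needs the label list
```

The point of the sketch is that decoding requires the label list produced at fit time, which is exactly the piece that is not available in the pipeline above.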

It is possible I missed something, but it looks like you'll have to map the predictions manually, for example by joining with the transformed data:

    pred %>%
      select(prediction = resp_cat, resp_prediction = resp) %>%
      distinct() %>%
      right_join(pred)
    Joining, by = "prediction"
    # Source:   lazy query [?? x 9]
    # Database: spark_connection
       prediction resp_prediction    id  numb  resp resp_cat id777a79821e1e
            <dbl>           <chr> <int> <int> <chr>    <dbl>          <dbl>
     1          7             171     1     3   171        7              0
     2          0             153     2    10   153        0              1
     3          3             132     3     8   132        3              2
     4          5             122     4     7   122        5              3
     5          6             198     5     4   198        6              4
     6          2             164     6     9   164        2              5
     7          4             137     7     6   137        4              6
     8          1             184     8     5   184        1              7
     9          0             153     9     1   153        0              8
    10          1             184    10     2   184        1              9
    # ... with more rows, and 2 more variables: rawPrediction <list>,
    #   probability <list>


    pred %>%
      select(prediction = resp_cat, resp_prediction = resp) %>%
      distinct()

This creates a mapping from the prediction (the encoded label) to the original label. We rename resp_cat to prediction so that it can serve as the join key, and resp to resp_prediction to avoid a conflict with the actual resp.

Finally, we apply a right equijoin:

    ... %>%
      right_join(pred)
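
The same mapping-and-join logic can be tried on a small local data frame without a Spark connection. The sketch below uses made-up values and spells out the join key explicitly; pred_local is a hypothetical stand-in for pred, not output from the model above.

```r
library(dplyr)

# Hypothetical stand-in for `pred`: resp_cat is the indexed label,
# prediction is the index returned by the model (values are made up).
pred_local <- data.frame(
  id         = 1:4,
  resp       = c("171", "153", "132", "153"),
  resp_cat   = c(2, 0, 1, 0),
  prediction = c(2, 0, 0, 1),
  stringsAsFactors = FALSE
)

# Lookup table: encoded label -> original label.
mapping <- pred_local %>%
  select(prediction = resp_cat, resp_prediction = resp) %>%
  distinct()

# Right join keeps every row of pred_local and attaches the decoded prediction.
decoded <- mapping %>%
  right_join(pred_local, by = "prediction") %>%
  arrange(id)
```

Passing by = "prediction" explicitly avoids relying on the natural-join message shown above and makes the join key obvious to a reader.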


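Note that you should also specify the type of tree. Without it, the numeric resp_cat column is treated as a regression target, which is why the first output contains fractional predictions (5.333333 is the mean of the indices 3, 6 and 7):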

    ml_decision_tree(
      response = "resp_cat", features = "numb", type = "classification")


