pyspark - Spark ML Normalizer loses metadata


I'm using a dataset with categorical features in PySpark, which are indexed and one-hot encoded. After fitting the pipeline I extract the encoded feature names using the metadata of the features column. When I include a Normalizer in the pipeline I lose the metadata of the categorical features. See the example below:

train.show()
+-----+---+----+----+
|admit|gre| gpa|rank|
+-----+---+----+----+
|  0.0|380|3.61|   3|
|  1.0|660|3.67|   3|
|  1.0|800| 4.0|   1|
|  1.0|640|3.19|   4|
|  0.0|520|2.93|   4|
+-----+---+----+----+

from pyspark.ml.feature import StringIndexer, OneHotEncoder, VectorAssembler, Normalizer

# index the categorical feature
rank_indexer = StringIndexer(inputCol='rank', outputCol='rank_ind', handleInvalid="skip")

# one-hot encode the categorical feature
rank_encoder = OneHotEncoder(inputCol='rank_ind', outputCol='rank_enc')

# assemble numeric and encoded features
assembler = VectorAssembler(inputCols=['gre', 'gpa', 'rank_enc'], outputCol="featuresVect")

# create the normalizer
normalizer = Normalizer(inputCol="featuresVect", outputCol="features", p=1.0)

stages = [rank_indexer] + [rank_encoder] + [assembler] + [normalizer]

from pyspark.ml import Pipeline
final_pipeline = Pipeline(
    stages = stages
)

pipelineModel = final_pipeline.fit(train)
data = pipelineModel.transform(train)

data.schema['features'].metadata
{}
## empty dictionary

## excluding the Normalizer results in metadata:
{u'ml_attr': {u'attrs': {u'binary': [{u'idx': 2, u'name': u'rank_enc_2'},
    {u'idx': 3, u'name': u'rank_enc_3'},
    {u'idx': 4, u'name': u'rank_enc_4'}],
   u'numeric': [{u'idx': 0, u'name': u'gre'}, {u'idx': 1, u'name': u'gpa'}]},
  u'num_attrs': 5}}
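
For reference, this is roughly how I read the feature names back out of the metadata. It's a minimal sketch and assumes the pipeline was fitted without the Normalizer, so the ml_attr dictionary is populated:

from itertools import chain

# collect (index, name) pairs from the 'numeric' and 'binary' attribute groups
meta = data.schema['features'].metadata
attrs = meta.get('ml_attr', {}).get('attrs', {})

features_by_index = sorted(
    (attr['idx'], attr['name']) for attr in chain(*attrs.values())
)
print(features_by_index)
# e.g. [(0, 'gre'), (1, 'gpa'), (2, 'rank_enc_2'), (3, 'rank_enc_3'), (4, 'rank_enc_4')]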

Is this normal behavior? How can I include the Normalizer without losing the metadata?

In my opinion it doesn't make sense to use a Normalizer on one-hot encoded data in the first place. In Spark, OHE is useful for two types of models:

  • Multinomial Naive Bayes.
  • Linear models.

In the first case normalization will render the features useless (a multinomial model can fully utilize only binary features). In the second case it will make interpretation of the model close to impossible.

Even if you ignore the above, normalized data cannot be interpreted as binary features anymore, so discarding the metadata seems like valid behavior.
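
If you nevertheless want some form of scaling and still need the one-hot metadata, one possible workaround is to transform only the numeric columns and assemble the encoded vector last, so the final VectorAssembler can carry the binary attributes through. A rough sketch under that assumption (I use StandardScaler instead of the row-wise Normalizer, and the intermediate column names numeric_vect / numeric_scaled are mine):

from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, OneHotEncoder, VectorAssembler, StandardScaler

rank_indexer = StringIndexer(inputCol='rank', outputCol='rank_ind', handleInvalid="skip")
rank_encoder = OneHotEncoder(inputCol='rank_ind', outputCol='rank_enc')

# scale only the numeric part; the one-hot vector stays untouched
num_assembler = VectorAssembler(inputCols=['gre', 'gpa'], outputCol='numeric_vect')
scaler = StandardScaler(inputCol='numeric_vect', outputCol='numeric_scaled')

# assemble last, so the binary attributes of rank_enc end up in the
# metadata of the final features column
final_assembler = VectorAssembler(
    inputCols=['numeric_scaled', 'rank_enc'], outputCol='features')

pipeline = Pipeline(stages=[rank_indexer, rank_encoder,
                            num_assembler, scaler, final_assembler])
model = pipeline.fit(train)
data = model.transform(train)

The numeric names may come out generic (e.g. numeric_scaled_0), since StandardScaler itself does not attach metadata, but the rank_enc_* binary attributes should survive.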

Related: Why does StandardScaler not attach metadata to the output column?
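
Regarding the linked question: if you really need the metadata on the normalized column, a possible hack (not something the API guarantees to stay meaningful) is to copy it over yourself from the assembler output. Column.alias accepts a metadata argument in Spark 2.2+; a sketch, assuming the assembler column is named featuresVect as above:

from pyspark.sql.functions import col

# copy the ml_attr metadata from the assembler output onto the normalized column
meta = data.schema['featuresVect'].metadata
data = data.withColumn('features', col('features').alias('features', metadata=meta))

data.schema['features'].metadata  # now contains the copied ml_attr dict

Whether that interpretation still makes sense after normalization is, of course, a different matter, as argued above.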

