PySpark - Spark ML Normalizer loses metadata
I'm using a dataset with categorical features in PySpark, which are indexed and one-hot encoded. After fitting the pipeline, I extract the encoded features using the metadata of the features column. When I include a Normalizer in my pipeline, I lose the metadata of my categorical features. See the example below:
train.show()
+-----+---+----+----+
|admit|gre| gpa|rank|
+-----+---+----+----+
|  0.0|380|3.61|   3|
|  1.0|660|3.67|   3|
|  1.0|800| 4.0|   1|
|  1.0|640|3.19|   4|
|  0.0|520|2.93|   4|
+-----+---+----+----+

from pyspark.ml.feature import StringIndexer, OneHotEncoder, VectorAssembler, Normalizer

# indexer for the categorical features
rank_indexer = StringIndexer(inputCol='rank', outputCol='rank_ind', handleInvalid="skip")
# encoder for the categorical features
rank_encoder = OneHotEncoder(inputCol='rank_ind', outputCol='rank_enc')
# assembler
assembler = VectorAssembler(inputCols=['gre', 'gpa', 'rank_enc'], outputCol="featuresvect")
# create the normalizer
normalizer = Normalizer(inputCol="featuresvect", outputCol="features", p=1.0)

stages = [rank_indexer] + [rank_encoder] + [assembler] + [normalizer]

from pyspark.ml import Pipeline
final_pipeline = Pipeline(stages=stages)
pipelinemodel = final_pipeline.fit(train)
data = pipelinemodel.transform(train)

data.schema['features'].metadata
{}  ## empty dictionary

## excluding the normalizer results in this metadata:
{u'ml_attr': {u'attrs': {u'binary': [{u'idx': 2, u'name': u'rank_enc_2'},
                                     {u'idx': 3, u'name': u'rank_enc_3'},
                                     {u'idx': 4, u'name': u'rank_enc_4'}],
                         u'numeric': [{u'idx': 0, u'name': u'gre'},
                                      {u'idx': 1, u'name': u'gpa'}]},
              u'num_attrs': 5}}
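For readers unfamiliar with this metadata, here is a minimal plain-Python sketch of how the encoded feature names might be read out of it. The dictionary is copied from the output above; `feature_names` is a hypothetical helper name, not part of Spark.

```python
# Metadata copied from the output above (the run without the Normalizer).
metadata = {
    "ml_attr": {
        "attrs": {
            "binary": [
                {"idx": 2, "name": "rank_enc_2"},
                {"idx": 3, "name": "rank_enc_3"},
                {"idx": 4, "name": "rank_enc_4"},
            ],
            "numeric": [
                {"idx": 0, "name": "gre"},
                {"idx": 1, "name": "gpa"},
            ],
        },
        "num_attrs": 5,
    }
}


def feature_names(meta):
    """Hypothetical helper: list feature names in vector-index order."""
    attrs = meta.get("ml_attr", {}).get("attrs", {})
    # Flatten every attribute group (binary, numeric, ...) into (idx, name) pairs.
    indexed = [(a["idx"], a["name"]) for group in attrs.values() for a in group]
    return [name for _, name in sorted(indexed)]


names = feature_names(metadata)
print(names)

# With the empty metadata produced after the Normalizer, nothing can be recovered.
print(feature_names({}))
```

With the empty dictionary that the Normalizer output carries, the helper has nothing to work with, which is exactly the problem described here.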
Is this normal behavior? How can I include the Normalizer without losing the metadata?
In my opinion it doesn't make sense to use a Normalizer on one-hot encoded data in the first place. In Spark, OHE is useful for two types of models:
- Multinomial Naive Bayes.
- Linear models.
In the first case, normalization will render the features useless (a multinomial model can only make use of binary features). In the second case, it will make interpretation of the model close to impossible.
Even if you ignore the above, normalized data cannot be interpreted as binary features anymore, so discarding the metadata seems to be valid behavior.
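To illustrate the last point, here is a small plain-Python sketch of the arithmetic behind an L1 normalization (the `p=1.0` setting from the question). It is not Spark's implementation, `l1_normalize` is a hypothetical name, and the position of the 1.0 among the one-hot slots is assumed for illustration, since it depends on the StringIndexer ordering.

```python
def l1_normalize(vector):
    """Divide each entry by the sum of absolute values (the p=1.0 norm)."""
    norm = sum(abs(x) for x in vector)
    # Leave an all-zero vector untouched to avoid dividing by zero.
    return [x / norm for x in vector] if norm else list(vector)


# First row of the example: gre=380, gpa=3.61, followed by three one-hot slots.
# Which slot holds the 1.0 is an assumption made for this illustration.
row = [380.0, 3.61, 1.0, 0.0, 0.0]

normalized = l1_normalize(row)
print(normalized)

# The formerly binary entry is now a small fraction that varies from row to row.
print(normalized[2])
```

The slot that used to hold exactly 1.0 now holds a value that depends on `gre` and `gpa` in the same row, so a "binary" attribute label would no longer describe it.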
Related: Why does StandardScaler not attach metadata to the output column?