tensorflow - Principle of setting 'hash_bucket_size' parameter?
Question 1:
In wide_n_deep_tutorial.py, there is a hyper-parameter named hash_bucket_size for both the tf.feature_column.categorical_column_with_hash_bucket and tf.feature_column.crossed_column methods, and the value used is hash_bucket_size=1000.
But why 1000? How should this parameter be set?
Question 2:
The second question is about crossed_columns, that is,

    crossed_columns = [
        tf.feature_column.crossed_column(
            ["education", "occupation"], hash_bucket_size=1000),
        tf.feature_column.crossed_column(
            [age_buckets, "education", "occupation"], hash_bucket_size=1000),
        tf.feature_column.crossed_column(
            ["native_country", "occupation"], hash_bucket_size=1000)
    ]

in wide_n_deep_tutorial.py. Why choose ["education", "occupation"], [age_buckets, "education", "occupation"], and ["native_country", "occupation"] as crossed_columns? Is there a rule of thumb?
For hash_bucket_size
The general idea is that, ideally, the hash function should not produce collisions (otherwise you, or rather the algorithm, would not be able to distinguish between two cases). Hence the 1000 in this case is 'just' a value. If you look at the number of unique entries for occupation and country (16 and 43), you'll see that the number is high enough:
    edb@lapelidb:/tmp$ cat adult.data | cut -d , -f 7 | sort | uniq -c | wc -l
    16
    edb@lapelidb:/tmp$ cat adult.data | cut -d , -f 14 | sort | uniq -c | wc -l
    43
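To put a rough number on "high enough", here is a minimal sketch that estimates how many values would be expected to share a bucket. It assumes an ideal hash that spreads values uniformly over the buckets, which is an idealization and not a statement about TensorFlow's actual hash function. The helper name expected_collisions is made up for this illustration.

```python
def expected_collisions(num_values: int, num_buckets: int) -> float:
    """Expected number of values that land in an already occupied bucket.

    Assumes each value is hashed independently and uniformly at random
    (an idealization, not TensorFlow's real hashing behaviour).
    """
    # Expected number of buckets holding at least one value.
    occupied = num_buckets * (1.0 - (1.0 - 1.0 / num_buckets) ** num_values)
    # Every value beyond the first in a bucket counts as a collision.
    return num_values - occupied


cases = [
    ("occupation", 16),
    ("native_country", 43),
    # Upper bound: every occupation/country combination actually occurs.
    ("occupation x native_country", 16 * 43),
]
for name, num_values in cases:
    estimate = expected_collisions(num_values, 1000)
    print(f"{name}: {num_values} values, ~{estimate:.1f} expected collisions")
```

Under this assumption the single columns should see almost no collisions with 1000 buckets, while a cross that really used all 16 * 43 combinations would see a noticeable number, so a larger hash_bucket_size may be worth trying for crossed columns with many combinations.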
Feature crossing
I think the rule of thumb there is that crossing makes sense if the combination of the features has meaning. In this example, education and occupation are linked. For the second one, it makes sense to define people as 'junior engineer with a Ph.D.' vs 'senior cleaning staff without a degree'. Another typical example you see quite often is the crossing of longitude and latitude, since they have more meaning together than individually.
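As a rough illustration of what a crossed column does conceptually, the sketch below joins the individual feature values into one combined key and hashes that key into a fixed number of buckets. This is only an assumption-laden stand-in: cross_bucket is a hypothetical helper, MD5 is used purely to get a deterministic hash, and TensorFlow uses its own hashing internally, so the bucket ids will not match those of tf.feature_column.crossed_column.

```python
import hashlib


def cross_bucket(values, hash_bucket_size):
    """Map a combination of feature values to a single bucket id.

    Hypothetical helper for illustration; not TensorFlow's implementation.
    """
    # Join the values so that the combination, not each value, is hashed.
    key = "_X_".join(str(v) for v in values)
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % hash_bucket_size


# Illustrative (education, occupation) pairs.
phd_engineer = cross_bucket(["Doctorate", "Prof-specialty"], 1000)
cleaner_no_degree = cross_bucket(["HS-grad", "Handlers-cleaners"], 1000)
print(phd_engineer, cleaner_no_degree)
```

Each combination gets its own id (up to collisions), so a linear model can learn a separate weight for 'Doctorate and Prof-specialty' rather than only one weight per education level and one per occupation.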