tensorflow - Principle of setting 'hash_bucket_size' parameter?


question 1:

In wide_n_deep_tutorial.py, there is a hyper-parameter named hash_bucket_size for both the tf.feature_column.categorical_column_with_hash_bucket and tf.feature_column.crossed_column methods, and its value is hash_bucket_size=1000.

But why 1000? How should this parameter be set?

question 2: My second question is about crossed_columns, that is,

    crossed_columns = [
        tf.feature_column.crossed_column(
            ["education", "occupation"], hash_bucket_size=1000),
        tf.feature_column.crossed_column(
            [age_buckets, "education", "occupation"], hash_bucket_size=1000),
        tf.feature_column.crossed_column(
            ["native_country", "occupation"], hash_bucket_size=1000)
    ]

in wide_n_deep_tutorial.py.

Why choose ["education", "occupation"], [age_buckets, "education", "occupation"] and ["native_country", "occupation"] as crossed_columns? Is there a rule of thumb?

for hash_bucket

The general idea is that, ideally, the hash function should not produce collisions (otherwise you/the algorithm would not be able to distinguish between two cases). Hence the 1000 is, in this case, 'just' a value. If you look at the number of unique entries for occupation and country (16 and 43), you'll see that this number is high enough:

    edb@lapelidb:/tmp$ cat adult.data | cut -d , -f 7 | sort | uniq -c | wc -l
    16
    edb@lapelidb:/tmp$ cat adult.data | cut -d , -f 14 | sort | uniq -c | wc -l
    43
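To put a rough number on "high enough", here is a minimal sketch that estimates how likely collisions are for a given bucket count. It assumes an idealized, uniformly distributed hash, which is an assumption and not a statement about TensorFlow's actual hash function. The helper names prob_no_collision and expected_collisions are hypothetical; only the counts 16 and 43 and the bucket size 1000 come from the text above.

```python
def prob_no_collision(n_values, n_buckets):
    """Probability that n_values distinct keys all land in different buckets,
    assuming an idealized uniform hash (the birthday-problem product)."""
    if n_values > n_buckets:
        return 0.0  # pigeonhole: at least two keys must share a bucket
    p = 1.0
    for i in range(n_values):
        # The (i+1)-th key must avoid the i buckets that are already taken.
        p *= 1.0 - i / n_buckets
    return p


def expected_collisions(n_values, n_buckets):
    """Expected number of keys that land in a bucket another key already uses."""
    # Expected number of non-empty buckets after hashing n_values keys.
    occupied = n_buckets * (1.0 - (1.0 - 1.0 / n_buckets) ** n_values)
    return n_values - occupied


if __name__ == "__main__":
    # Unique-value counts taken from the shell output above.
    for name, n in [("occupation", 16), ("native_country", 43)]:
        print(name, n,
              round(prob_no_collision(n, 1000), 3),
              round(expected_collisions(n, 1000), 3))
```

Under this uniform-hash assumption, fewer than one value is expected to collide for either column with 1000 buckets. For a crossed column the number of possible combinations is the product of the individual counts, so the same check can be run with that larger number.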

feature crossing

I think the rule of thumb there is that crossing makes sense if the combination of features has meaning. In the example, education and occupation are linked. For the second one, it probably makes sense to define people as 'junior engineer with a Ph.D.' vs 'senior cleaning staff without a degree'. Another typical example you see quite often is the crossing of longitude and latitude, since they have more meaning together than individually.
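As an illustration of what a cross does conceptually, here is a minimal sketch that maps a combination of feature values to a single bucket index. This is not TensorFlow's implementation: the function name cross_bucket, the "_X_" separator and the use of MD5 are assumptions chosen only to make the example deterministic, and the sample values are illustrative.

```python
import hashlib


def cross_bucket(values, hash_bucket_size):
    """Map a combination of feature values to one bucket index.

    Hypothetical helper: joins the values into one key, hashes it with MD5
    (a stand-in hash, not what TensorFlow uses) and reduces the result
    modulo hash_bucket_size.
    """
    key = "_X_".join(str(v) for v in values)
    digest = hashlib.md5(key.encode("utf-8")).digest()
    # Use the first 8 bytes of the digest as an unsigned integer.
    return int.from_bytes(digest[:8], "big") % hash_bucket_size


if __name__ == "__main__":
    # Each (education, occupation) pair gets its own bucket index,
    # so a linear model can learn one weight per combination.
    for pair in [("Doctorate", "Engineer"), ("No-degree", "Cleaner")]:
        print(pair, cross_bucket(pair, 1000))
```

The point of the sketch is that the model sees the pair as one categorical feature, which is why a cross only helps when the combination carries meaning beyond the individual features.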

