java - How does Spark's Word2Vec function with regard to the RDD's rows when creating a model?


After experimenting with Spark's Word2Vec code and the text8 data provided in the Word2Vec example, I ended up getting Java heap errors, as discussed in "Spark Word2Vec example using text8 file".

After reading that thread, I came to the conclusion that I need to break the data into chunks divided by new lines, so that they are loaded into different RDD rows and do not cause Java heap errors. As njustice said in that thread: Word2Vec works best with 1 sentence per row of the RDD.
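As a rough illustration of that chunking step, here is a minimal sketch in plain Java (no Spark dependency). The class name `TextChunker`, the method `chunk`, and the `wordsPerRow` parameter are hypothetical names chosen for this example. The sketch holds the whole text in memory, and a suitable chunk size is an assumption you would have to tune yourself.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class TextChunker {

    // Hypothetical helper: splits whitespace-separated text into rows of at most
    // wordsPerRow words. Each inner list stands for one prospective RDD row.
    static List<List<String>> chunk(String text, int wordsPerRow) {
        List<List<String>> rows = new ArrayList<>();
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return rows; // nothing to split
        }
        String[] words = trimmed.split("\\s+");
        for (int start = 0; start < words.length; start += wordsPerRow) {
            int end = Math.min(words.length, start + wordsPerRow);
            // Copy the slice so each row is an independent list.
            rows.add(new ArrayList<>(Arrays.asList(words).subList(start, end)));
        }
        return rows;
    }

    public static void main(String[] args) {
        // Sample input only; a real run would read the corpus from a file.
        List<List<String>> rows = chunk("anarchism originated as a term of abuse", 3);
        for (List<String> row : rows) {
            System.out.println(String.join(" ", row));
        }
    }
}
```

Writing each row on its own line of a text file would then give one chunk per line, which is the layout described above.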

Is that so?

To my knowledge, the Word2Vec skip-gram model (used by Spark) uses a window to create training samples. For example, given the text "the black car" and a window size of 2, the input word "the" generates 2 training samples: (the, black), (the, car).
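To make the window idea concrete, here is a minimal sketch in plain Java that lists the (input, context) pairs for a single sentence. It assumes a fixed window on both sides of the input word and is not Spark's actual implementation. `SkipGramPairs` and `pairs` are hypothetical names used only for this illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class SkipGramPairs {

    // Hypothetical helper: for every position i, pair the word at i with each
    // neighbour at most `window` positions away, never leaving the sentence.
    static List<String[]> pairs(List<String> sentence, int window) {
        List<String[]> out = new ArrayList<>();
        for (int i = 0; i < sentence.size(); i++) {
            int from = Math.max(0, i - window);
            int to = Math.min(sentence.size() - 1, i + window);
            for (int j = from; j <= to; j++) {
                if (j != i) {
                    out.add(new String[] {sentence.get(i), sentence.get(j)});
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        for (String[] p : pairs(Arrays.asList("the", "black", "car"), 2)) {
            System.out.println("(" + p[0] + ", " + p[1] + ")");
        }
    }
}
```

For the input word "the", this yields the two samples from the example, (the, black) and (the, car); the other two words contribute their own pairs in the same way.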

But when the data is divided into RDD rows, how does the window behave? Is it able to create training samples by combining words from 2 rows? Or does it run the window on each row individually and combine the results at the end somehow? I doubt it works across more than 1 RDD row, because I created 2 models using the same data with the words distributed differently over the RDD rows, and the cosine similarity of 2 words is different in each model.
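The two possibilities can be compared with a small sketch in plain Java. It counts the training pairs produced when a fixed window is applied to each row separately and when it is applied to all rows joined together. This only illustrates the two hypotheses in the question and does not establish which one Spark follows. `RowWindowComparison` and its methods are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class RowWindowComparison {

    // Number of (input, context) pairs a fixed window produces for one word sequence.
    static int countPairs(List<String> words, int window) {
        int count = 0;
        for (int i = 0; i < words.size(); i++) {
            int from = Math.max(0, i - window);
            int to = Math.min(words.size() - 1, i + window);
            count += to - from; // positions in [from, to] minus the input word itself
        }
        return count;
    }

    // Hypothesis A: the window never crosses a row boundary.
    static int countPairsPerRow(List<List<String>> rows, int window) {
        int total = 0;
        for (List<String> row : rows) {
            total += countPairs(row, window);
        }
        return total;
    }

    // Hypothesis B: rows are joined first, so the window can span two rows.
    static int countPairsAcrossRows(List<List<String>> rows, int window) {
        List<String> all = new ArrayList<>();
        for (List<String> row : rows) {
            all.addAll(row);
        }
        return countPairs(all, window);
    }

    public static void main(String[] args) {
        List<List<String>> rows = Arrays.asList(
                Arrays.asList("the", "black", "car"),
                Arrays.asList("drives", "fast"));
        System.out.println("per row:     " + countPairsPerRow(rows, 2));
        System.out.println("across rows: " + countPairsAcrossRows(rows, 2));
    }
}
```

If the window is confined to each row, changing how the words are distributed over rows changes the set of training pairs, which would be consistent with the two models giving different cosine similarities.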

