Applying Machine Learning to Array<String> Columns in Spark (Java)
I have a Parquet data file that contains multiple array-of-string columns. These string arrays vary in length. The Apache Spark DataFrame schema is shown below:
root
 |-- dl: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- browsing_history: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- bhistid: string (nullable = true)
 |-- browser: string (nullable = true)
 |-- device: string (nullable = true)
 |-- geoloc: string (nullable = true)
 |-- os: string (nullable = true)
 |-- platform: string (nullable = true)
 |-- vendor: string (nullable = true)
 |-- dlist: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- rl: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- cliid: string (nullable = true)
 |-- date: string (nullable = true)
 |-- p_id: string (nullable = true)
 |-- userid: string (nullable = true)
The columns dl and browsing_history vary in length, depending on the browsing activity. For example, browsing_history contains rows such as:
|[ua:freeuser, login:google, gender:male, ua:preflang:french, ua:preflang:english, ua:preflang: spanish, int:rtlang:english, 2316:76, age : 46, ua: play : hip hop music] |
|[crm:functionalarea:itsoftware-business/systemsanalysis, crm:languagesknown:english|crm:proficiency:intermediate|read:y|write:y|speak:, crm:languagesknown:hindi|crm:proficiency:intermediate|read:y|write:y|speak:y, crm:languagesknown:portugese|crm:proficiency:expert|read:y|write:y|speak:y, crm:education-degree-ug:b.com.(commerce), gender:male, crm:noticeperiod:1days, crm:employmentstatus:fulltimeemployed, age:37, crm:annualincome:3, crm:experience:9, login:email, 2320:14, 1320:34]
I exploded browsing_history into separate variables such as gender, age, login, etc. What is left is an array of strings containing the browsing history (ua, crm, play, etc.), and its length varies for each ID. How can I apply a machine learning algorithm to this kind of string array as input in Spark with Java? Word2Vec does not seem to work together with ML algorithms such as LR, SVM, etc.
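One possible direction, offered as a sketch and not a definitive answer: classifiers such as LR and SVM need a fixed-length numeric vector per row, so a variable-length token array first has to be mapped to one. A common way to do that is the hashing trick, where each token is hashed into one of a fixed number of buckets and the bucket counts form the feature vector. Spark ML's HashingTF and CountVectorizer transformers accept an array-of-strings column and produce such a vector column. The plain-Java code below only illustrates the idea without Spark. The class name TokenHashing, the method hashTokens, and the bucket count of 16 are hypothetical, and it uses String.hashCode instead of the hash function Spark uses internally.

```java
import java.util.Arrays;
import java.util.List;

public class TokenHashing {

    // Map a variable-length list of tokens to a fixed-length count vector.
    // Each token is hashed into one of numFeatures buckets; tokens that
    // collide simply share a bucket (the usual hashing-trick trade-off).
    static double[] hashTokens(List<String> tokens, int numFeatures) {
        double[] vec = new double[numFeatures];
        for (String token : tokens) {
            if (token == null) {
                continue; // the schema allows null elements (containsNull = true)
            }
            // floorMod keeps the index non-negative even for negative hash codes.
            int idx = Math.floorMod(token.trim().hashCode(), numFeatures);
            vec[idx] += 1.0;
        }
        return vec;
    }

    public static void main(String[] args) {
        // Two rows of different length, taken from the sample data in the question.
        List<String> row1 = Arrays.asList("ua:freeuser", "ua:preflang:french", "ua:preflang:english");
        List<String> row2 = Arrays.asList("crm:noticeperiod:1days", "crm:experience:9");

        // Both rows end up as vectors of the same length (16 is an arbitrary choice).
        System.out.println(Arrays.toString(hashTokens(row1, 16)));
        System.out.println(Arrays.toString(hashTokens(row2, 16)));
    }
}
```

Whatever the input length, every row yields a vector of the same size, which is the shape a downstream classifier expects. A larger bucket count reduces collisions at the cost of wider, sparser vectors.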