apache spark - Adding the results of the PySpark k-means algorithm to a DataFrame?


I have a Spark DataFrame containing geo-information.

my_df.show(2)
## +----+----+-----------+----------+
## | x0 | x1 | longitude | latitude |
## +----+----+-----------+----------+
## | ...| ...| 51.043    | 13.6847  |
## | ...| ...| 42.6753   | 23.3218  |

I took the longitude and latitude out of the DataFrame and calculated the center points with the k-means library in PySpark.

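from pyspark.mllib.clustering import KMeans  # legacy RDD-based API

# Train a k-means model; dataset is the RDD of (longitude, latitude) points
k = 120
model = KMeans.train(dataset, k)
print("final centers: " + str(model.clusterCenters))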

The output:

final centers: [array([ 51.04307692,  13.68474126]), array([-33.434     , -70.58366667]), array([ 42.67533333,  23.32185981]), array([ 45.876, -61.492]), array([ 53.07465714,   8.4655    ]), array([   4.594,  114.262]), array([ 48.15665306,  11.54269728]), array([ 51.51729851,   7.49838806]), array([ 48.76316125,   9.15357859]), .... 

Does anyone have an idea how to add the matching centers to the DataFrame?

## +----+----+-----------+----------+-----------+----------+
## | x0 | x1 | longitude | latitude | mean_long | mean_lat |
## +----+----+-----------+----------+-----------+----------+
## | ...| ...| 51.043    | 13.6847  | 50.000    | 15.000   |
## | ...| ...| 42.6753   | 23.3218  | 50.000    | 15.000   |

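If you have decided to use DataFrames, you should use the new pyspark.ml API, not the legacy pyspark.mllib. It provides a number of clustering methods, including k-means, and the fitted model's transform method (the DataFrame counterpart of predict) attaches a prediction column to the DataFrame.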

Please check the ML documentation for details (API and required input types).

