apache spark - Adding the results of pyspark kmeans algorithm to dataframe?

August 15, 2014

I have a Spark DataFrame containing geo-information.

my_df.show(2)
## +----+----+-----------+----------+
## | x0 | x1 | longitude | latitude |
## +----+----+-----------+----------+
## | ...| ...| 51.043    | 13.6847  |
## | ...| ...| 42.6753   | 23.3218  |

I took longitude and latitude out of the DataFrame and calculated the center points with the KMeans library in PySpark.

from pyspark.mllib.clustering import KMeans

# Train the k-means model on the (longitude, latitude) points
k = 120
model = KMeans.train(dataset, k)
print("final centers: " + str(model.clusterCenters))

The output:

final centers: [array([ 51.04307692,  13.68474126]), array([-33.434     , -70.58366667]), array([ 42.67533333,  23.32185981]), array([ 45.876, -61.492]), array([ 53.07465714,   8.4655    ]), array([   4.594,  114.262]), array([ 48.15665306,  11.54269728]), array([ 51.51729851,   7.49838806]), array([ 48.76316125,   9.15357859]), ....

Does anyone have an idea how to add the matching centers to the DataFrame?

## +----+----+-----------+----------+-----------+----------+
## | x0 | x1 | longitude | latitude | mean_long | mean_lat |
## +----+----+-----------+----------+-----------+----------+
## | ...| ...| 51.043    | 13.6847  | 50.000    | 15.000   |
## | ...| ...| 42.6753   | 23.3218  | 50.000    | 15.000   |

If you have decided to use DataFrames, you should use the new pyspark.ml API rather than the legacy pyspark.mllib. It provides a number of clustering methods, including k-means, and the fitted model's transform method attaches a prediction column to the DataFrame.

Please check the ML documentation for details (API and required input types):

https://spark.apache.org/docs/latest/ml-clustering.html#k-means
