apache spark - Adding the results of pyspark kmeans algorithm to dataframe?
I have a Spark DataFrame containing geo-information.
    my_df.show(2)
    ## +----+----+-----------+----------+
    ## | x0 | x1 | longitude | latitude |
    ## +----+----+-----------+----------+
    ## | ...| ...| 51.043    | 13.6847  |
    ## | ...| ...| 42.6753   | 23.3218  |

I took longitude and latitude out of the DataFrame and calculated the center points with the KMeans library from PySpark.
    from pyspark.mllib.clustering import KMeans

    # Train a k-means model on the extracted (longitude, latitude) data
    k = 120
    model = KMeans.train(dataset, k)
    print("Final centers: " + str(model.clusterCenters))

The output:
    Final centers: [array([ 51.04307692,  13.68474126]), array([-33.434     , -70.58366667]),
    array([ 42.67533333,  23.32185981]), array([ 45.876, -61.492]), array([ 53.07465714,   8.4655    ]),
    array([   4.594,  114.262]), array([ 48.15665306,  11.54269728]), array([ 51.51729851,   7.49838806]),
    array([ 48.76316125,   9.15357859]), ....

Does anyone have an idea how to add the matching centers to the DataFrame?
    ## +----+----+-----------+----------+-----------+----------+
    ## | x0 | x1 | longitude | latitude | mean_long | mean_lat |
    ## +----+----+-----------+----------+-----------+----------+
    ## | ...| ...| 51.043    | 13.6847  | 50.000    | 15.000   |
    ## | ...| ...| 42.6753   | 23.3218  | 50.000    | 15.000   |
If you have decided to use DataFrames, you should use the new pyspark.ml API, not the legacy pyspark.mllib. It provides a number of clustering methods, including k-means, and the fitted model's transform method can be used to attach a prediction column to a DataFrame.
Please check the ML documentation for details (API and required input types).