Apache Spark - Adding the results of the PySpark KMeans algorithm to a DataFrame?
I have a Spark DataFrame containing geo-information.
my_df.show(2)
## +----+----+-----------+----------+
## | x0 | x1 | longitude | latitude |
## +----+----+-----------+----------+
## | ...| ...| 51.043    | 13.6847  |
## | ...| ...| 42.6753   | 23.3218  |
I took longitude and latitude out of the DataFrame and calculated the center points with the KMeans library of PySpark.
from pyspark.mllib.clustering import KMeans

# Train a k-means model
k = 120
model = KMeans.train(dataset, k)
print("final centers: " + str(model.clusterCenters))

The output:
final centers: [array([ 51.04307692, 13.68474126]), array([-33.434 , -70.58366667]), array([ 42.67533333, 23.32185981]), array([ 45.876, -61.492]), array([ 53.07465714, 8.4655 ]), array([ 4.594, 114.262]), array([ 48.15665306, 11.54269728]), array([ 51.51729851, 7.49838806]), array([ 48.76316125, 9.15357859]), ....
Does anyone have an idea how to add the matching centers to the DataFrame?
## +----+----+-----------+----------+-----------+----------+
## | x0 | x1 | longitude | latitude | mean_long | mean_lat |
## +----+----+-----------+----------+-----------+----------+
## | ...| ...| 51.043    | 13.6847  | 50.000    | 15.000   |
## | ...| ...| 42.6753   | 23.3218  | 50.000    | 15.000   |
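If you have decided to use DataFrames, you should use the new pyspark.ml API, not the legacy pyspark.mllib. It provides a number of clustering methods, including K-means, and a fitted model attaches a prediction column to the DataFrame (in pyspark.ml this is done by the model's transform method).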
Please check the ML documentation for details (API and required input types).