javascript - Text clustering: choosing the k in k-means
After eliminating stop words and applying a stemming process to a set of documents, I applied bisecting k-means in JavaScript in order to cluster a set of documents received from web pages, finding the similarity between them.
What method should I use to find how many clusters should be created when working with text-based clusters? I saw methods such as the elbow, silhouette, or information criterion approaches, but assuming I don't have any information on the clusters to create, those methods seem a better fit for numeric clustering, not for text-based clusters.
Can the entropy measure help me find the right number of clusters after applying bisecting k-means in text clustering? Or the F-measure? I mean, should I stop dividing a cluster after a certain value is reached? Would that work for large sets of data?
Short answer:
You can use term frequency-inverse document frequency (tf-idf). It emphasises rare words that are used in a single document, and penalizes words when they are found in many documents. If you applied PCA with tf-idf on your dataset, you might use a "scree plot" (~ elbow method) to find a suitable number of clusters.
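A scree plot is read by eye, but the "elbow" can also be approximated in code. The sketch below is only one common heuristic, not something from the dissertation mentioned further down: it assumes you already have the per-component variances (eigenvalues) from some PCA routine, sorted in decreasing order, and it picks the point farthest from the straight line joining the first and last points of the curve. The function names `explainedVarianceRatios` and `findElbow` are made up for this illustration.

```javascript
// Hypothetical helper: turn per-component variances into the
// proportion of total variance each component explains.
function explainedVarianceRatios(variances) {
  const total = variances.reduce((sum, v) => sum + v, 0);
  return variances.map((v) => v / total);
}

// Heuristic (an assumption, not a standard API): the elbow is the point
// of the scree curve farthest from the line joining its first and last points.
// Returns the 1-based position of that point.
function findElbow(variances) {
  const n = variances.length;
  if (n < 3) return n; // too few points to have an elbow
  const ratios = explainedVarianceRatios(variances);
  const x1 = 0;
  const y1 = ratios[0];
  const x2 = n - 1;
  const y2 = ratios[n - 1];
  const norm = Math.hypot(x2 - x1, y2 - y1);
  let bestIndex = 0;
  let bestDistance = -Infinity;
  for (let i = 0; i < n; i++) {
    // Perpendicular distance from the point (i, ratios[i]) to the line.
    const distance =
      Math.abs((y2 - y1) * i - (x2 - x1) * ratios[i] + x2 * y1 - y2 * x1) / norm;
    if (distance > bestDistance) {
      bestDistance = distance;
      bestIndex = i;
    }
  }
  return bestIndex + 1;
}
```

Treat the result as a starting point to check against the plot itself; the heuristic can be fooled by a curve that decays smoothly with no clear bend.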
Long example:
The following is an example that does not use k-means, uses only a few long documents, and starts from the decision that there are 2 "clusters" (using principal components and tf-idf, actually), but it uses real data in a creative way:
In his PhD dissertation documenting the "text mining" package tm, developed for the R software, the author of tm, Ingo Feinerer, gives an example (chapter 10) of how to use stylometry to cluster/identify 5 books of the "Wizard of Oz" series. For one of these books the authorship is disputed (there are 2 authors in the series, Thompson and Baum, and their contributions to one of the books are unknown).
Feinerer chops the documents into 500-line chunks to build a TermDocumentMatrix, then performs variants of principal component analysis (PCA), one with tf-idf, on the matrix, and shows by visual inspection of the PCA plots that the disputed book tends to have been authored by Thompson. But parts might have been written by Baum.
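The chunking and matrix-building step can be sketched in JavaScript as follows. This is a rough illustration and not Feinerer's R code: the tokenizer is deliberately naive (lowercase ASCII letters only, no stop-word removal or stemming), and the names `chunkByLines`, `tokenize` and `termDocumentMatrix` are hypothetical.

```javascript
// Split raw text into chunks of `linesPerChunk` lines (500 in the example above).
function chunkByLines(text, linesPerChunk = 500) {
  const lines = text.split(/\r?\n/);
  const chunks = [];
  for (let i = 0; i < lines.length; i += linesPerChunk) {
    chunks.push(lines.slice(i, i + linesPerChunk).join("\n"));
  }
  return chunks;
}

// Naive tokenizer (an assumption): lowercase, keep runs of ASCII letters only.
function tokenize(text) {
  return text.toLowerCase().match(/[a-z]+/g) || [];
}

// Build a term-document count matrix: one row per term, one column per chunk.
function termDocumentMatrix(chunks) {
  // Count term occurrences separately for each chunk.
  const counts = chunks.map((chunk) => {
    const freq = new Map();
    for (const token of tokenize(chunk)) {
      freq.set(token, (freq.get(token) || 0) + 1);
    }
    return freq;
  });
  // The vocabulary is the sorted set of all terms seen in any chunk.
  const terms = [...new Set(counts.flatMap((freq) => [...freq.keys()]))].sort();
  const matrix = terms.map((term) => counts.map((freq) => freq.get(term) || 0));
  return { terms, matrix };
}
```

Each column of `matrix` is then one chunk's term-count vector, which is the kind of input a PCA or clustering step would work on.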
In the plot, this is indicated by the points inside the pinkish wiggly oval (drawn by me). The green dots are chunks from a book of known authorship (T.) and the yellow dots are from the book of unknown / disputed authorship. (The points fall close together in the plot. That's the evidence here; it's qualitative, but it's just one example of many in the pdf.)
The tf-idf PCA plot on page 95 looks similar.
I have not given the R code because I don't know if you use R, and this post is getting long, and you can read it in the pdf.
(And I don't know of any implementations of tf-idf in JavaScript.)
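That said, the weighting itself is small enough to sketch by hand. The version below is a minimal sketch that assumes each document is already an array of tokens (stop words removed, stemmed), and it uses one common variant: tf as the relative frequency within the document and idf as the natural log of (number of documents / number of documents containing the term). Other variants (smoothing, different log bases) exist, and the function name `tfIdf` is just for illustration.

```javascript
// docs: array of token arrays, e.g. [["cat", "dog"], ["cat", "fish"]].
// Returns one Map (term -> tf-idf weight) per document.
function tfIdf(docs) {
  const docCount = docs.length;

  // Document frequency: in how many documents each term appears.
  const documentFrequency = new Map();
  for (const tokens of docs) {
    for (const term of new Set(tokens)) {
      documentFrequency.set(term, (documentFrequency.get(term) || 0) + 1);
    }
  }

  return docs.map((tokens) => {
    // Raw term counts for this document.
    const termCounts = new Map();
    for (const term of tokens) {
      termCounts.set(term, (termCounts.get(term) || 0) + 1);
    }
    const weights = new Map();
    for (const [term, count] of termCounts) {
      const tf = count / tokens.length; // relative frequency in this document
      const idf = Math.log(docCount / documentFrequency.get(term));
      weights.set(term, tf * idf); // 0 when the term occurs in every document
    }
    return weights;
  });
}
```

With this variant a term that occurs in every document gets weight 0, which is the "penalty" for common words, while a term confined to one document gets the largest idf.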