scikit-learn - Vectorizing combinations of words in Python


I have a dataset of medical text data. I apply a TF-IDF vectorizer to it and calculate the TF-IDF score of each word like this:

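import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer as tf

# df is expected to be an iterable of text documents (for example, a pandas Series of strings)
vect = tf(min_df=60, stop_words='english')
dtm = vect.fit_transform(df)
# get_feature_names_out() is the current name; older scikit-learn releases use get_feature_names()
l = vect.get_feature_names_out()
x = pd.DataFrame(dtm.toarray(), columns=vect.get_feature_names_out())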

So my question is the following: when I apply TfidfVectorizer, it splits the text into distinct words, for example "pain", "headache", "nausea", and so on. How can I get word combinations in the output of TfidfVectorizer, for example "severe pain", "cluster headache", "nausea vomiting"? Thanks.

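Use the ngram_range parameter, which sets the minimum and maximum number of words per feature. To get both single words and two-word combinations: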

vect = tf(min_df=60, stop_words='english', ngram_range=(1,2)) 

Or, to keep only two-word combinations (depending on your goals):

vect = tf(min_df=60, stop_words='english', ngram_range=(2,2)) 
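To see what the two settings produce, here is a minimal sketch on a tiny made-up corpus. The three sentences and the feature_names helper are assumptions for illustration only, not part of the original data, and min_df is left at its default because a three-document corpus would not survive min_df=60.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus standing in for the medical text data.
docs = [
    "severe pain and cluster headache",
    "nausea vomiting and severe pain",
    "cluster headache with nausea vomiting",
]


def feature_names(vectorizer):
    """Return the learned vocabulary as a list (hypothetical helper).

    Newer scikit-learn releases expose get_feature_names_out();
    older ones only have get_feature_names().
    """
    if hasattr(vectorizer, "get_feature_names_out"):
        return list(vectorizer.get_feature_names_out())
    return list(vectorizer.get_feature_names())


# Single words and two-word combinations.
uni_bi = TfidfVectorizer(stop_words="english", ngram_range=(1, 2))
uni_bi.fit(docs)
uni_bi_names = feature_names(uni_bi)

# Two-word combinations only.
bi_only = TfidfVectorizer(stop_words="english", ngram_range=(2, 2))
bi_only.fit(docs)
bi_only_names = feature_names(bi_only)

print(uni_bi_names)
print(bi_only_names)
```

Stop words are removed before the n-grams are built, so a combination such as "pain cluster" can appear even though "and" sat between the two words in the text. Also keep in mind that min_df applies to the combinations as well: a two-word phrase usually occurs in fewer documents than either of its words, so a threshold like min_df=60 may filter out many of them.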
