scikit-learn - Vectorizing combinations of words in Python
I have a dataset of medical text data. I apply a TF-IDF vectorizer to it and calculate the TF-IDF scores for the words like this:

    import pandas as pd
    from sklearn.feature_extraction.text import TfidfVectorizer as tf

    vect = tf(min_df=60, stop_words='english')
    dtm = vect.fit_transform(df)
    # On scikit-learn 1.2 and later, use get_feature_names_out() instead
    l = vect.get_feature_names()
    x = pd.DataFrame(dtm.toarray(), columns=l)

So my question is the following: when I apply TfidfVectorizer, it splits the text into distinct words, for example "pain", "headache", "nausea", and so on. How can I get combinations of words in the output of TfidfVectorizer, for example "severe pain", "cluster headache", "nausea vomiting"? Thanks.
Use the ngram_range parameter. To keep single words and add two-word combinations:

    vect = tf(min_df=60, stop_words='english', ngram_range=(1, 2))

or, depending on your goals, to keep only two-word combinations:

    vect = tf(min_df=60, stop_words='english', ngram_range=(2, 2))
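To see what the two settings produce, here is a minimal sketch on a toy corpus. The two sentences in docs are made up for illustration and are not from the question's dataset, and min_df is left out because a two-document corpus is far too small for min_df=60. The vocabulary is read from the fitted vocabulary_ attribute so the snippet does not depend on which feature-name method your scikit-learn version provides.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus, only for illustration
docs = [
    "severe pain and cluster headache",
    "nausea and vomiting with severe pain",
]

# ngram_range=(1, 2): single words plus two-word combinations
unigrams_and_bigrams = TfidfVectorizer(stop_words="english", ngram_range=(1, 2))
unigrams_and_bigrams.fit(docs)

# ngram_range=(2, 2): two-word combinations only
bigrams_only = TfidfVectorizer(stop_words="english", ngram_range=(2, 2))
bigrams_only.fit(docs)

# vocabulary_ maps each term to its column index; sort the terms for display
mixed_terms = sorted(unigrams_and_bigrams.vocabulary_)
bigram_terms = sorted(bigrams_only.vocabulary_)

print("unigrams + bigrams:", mixed_terms)
print("bigrams only:", bigram_terms)
```

Note that stop words are removed before the n-grams are built, so "nausea and vomiting" contributes the bigram "nausea vomiting". It also means words that were not adjacent in the original text can end up paired, which may or may not be what you want.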