python - Gensim.models.phrases.Phrases does not work when input comes from NLTK stemmed tokens


I have this code:

    from gensim.models import phrases
    from nltk.stem.snowball import SnowballStemmer

    # `nlp` is assumed to be an already loaded spaCy pipeline and
    # `corpus` a raw text string; neither is defined in this snippet.

    def build_trigram_model(corpus):
        corpus = lemmer(nlp(corpus))
        bigram = phrases.Phrases(corpus, min_count=2, threshold=40)
        bigram_phraser = phrases.Phraser(bigram)
        trigram = phrases.Phrases(bigram_phraser[corpus], min_count=2, threshold=50)
        trigram_phraser = phrases.Phraser(trigram)
        return bigram_phraser, trigram_phraser

    def punct_space(token):
        """
        Helper function to eliminate punctuation, spaces, and numbers.
        """
        return token.is_punct or token.is_space or token.like_num

    def lemmer(tokens):
        word_space = []
        stemmer = SnowballStemmer("english")
        for token in tokens:
            if not punct_space(token):
                word_space.append(stemmer.stem(str(token).lower()))
        return word_space

    bigram_phraser, trigram_phraser = build_trigram_model(corpus)

This returns something like:

    bigram_phraser.phrasegrams

    {(b'\xef\xa3\xaf', b'\xef\xa3\xb0'): (189, 49.569954185342624),
     (b'\xef\xa3\xba', b'\xef\xa3\xbb'): (189, 49.162979913552),
     (b'\xce\xbf', b'\xcf\x82'): (11, 52.7203947368421),
     (b'\xef\xa3\xb1', b'\xef\xa3\xb2'): (13, 78.54622410336378),
     (b'\xef\xa3\xb2', b'\xef\xa3\xb3'): (12, 74.75279850746269),
     (b'\xef\xa3\xbc', b'\xef\xa3\xbd'): (8, 73.94232987312573),
     (b'\xef\xa3\xbd', b'\xef\xa3\xbe'): (7, 65.46977124183007),
     (b'\xc9\xa1', b'\xcc\x8a'): (29, 64.73618071658315),
     (b'\xef\x9d\xb4', b'\xef\x9d\xb2'): (51, 105.61094674556212),
     (b'\xef\x9d\xa3', b'\xef\x9d\xac'): (26, 53.88736340711684)}
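Decoding a few of the keys from the output above suggests that each one is a single character rather than a stemmed word. A minimal check, using only byte strings copied from that output (the variable names are illustrative):

```python
# Three key pairs copied verbatim from the phrasegrams output.
keys = [
    (b'\xef\xa3\xaf', b'\xef\xa3\xb0'),
    (b'\xce\xbf', b'\xcf\x82'),
    (b'\xc9\xa1', b'\xcc\x8a'),
]

# Decode each byte string as UTF-8 and measure its length in characters.
decoded = [(a.decode("utf-8"), b.decode("utf-8")) for a, b in keys]
lengths = [(len(a), len(b)) for a, b in decoded]
print(lengths)
```

Every decoded key here is one code point long, so the model appears to have learned pairs of adjacent characters, not pairs of adjacent words.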

When running new text through the model, no collocations are found.

However, when I use the following stemmer:

    def lemmer(tokens):
        """Lemmatize words."""
        word_space = []
        for sent in tokens.sents:
            sentence = []
            for token in sent:
                if not punct_space(token):
                    if token.lemma_ == '-PRON-':
                        sentence.append(token.lower_)
                    else:
                        sentence.append(token.lemma_)
            word_space.append(sentence)
        return word_space

Everything works as it is supposed to. The dtype returned by both stemmers is a list of strings, so that cannot be the problem. Any ideas why this happens? Thanks!

  • platform: linux-4.4.0-1030-aws-x86_64-with-debian-stretch-sid
  • python version: 3.6.1
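
One possible explanation, offered as a hypothesis and not a confirmed diagnosis: gensim's Phrases expects an iterable of sentences, each of which is a list of tokens. The NLTK-based lemmer returns a flat list of strings, while the spaCy-based lemmer returns a list of lists, so the two outputs do not actually have the same shape. The sketch below does not use gensim; `count_bigrams` is a hypothetical, simplified stand-in that iterates sentences and then tokens, to show what happens when each "sentence" is really a single string.

```python
from collections import Counter

def count_bigrams(sentences):
    """Count adjacent token pairs, iterating over sentences and then over
    the tokens of each sentence (a simplified stand-in, not gensim code)."""
    counts = Counter()
    for sentence in sentences:
        tokens = list(sentence)  # a str "sentence" splits into characters
        counts.update(zip(tokens, tokens[1:]))
    return counts

# Shape produced by the stemming lemmer: one flat list of strings.
flat = ["new", "york", "new", "york"]
# Shape produced by the spaCy lemmer: one list of tokens per sentence.
nested = [["new", "york"], ["new", "york"]]

flat_counts = count_bigrams(flat)
nested_counts = count_bigrams(nested)

print(flat_counts.most_common(2))
print(nested_counts.most_common(2))
```

With the flat list, only character pairs such as ('n', 'e') are counted and the word pair ('new', 'york') never appears. If this reading is right, having the stemming lemmer append one list of stemmed tokens per sentence, as the second lemmer does, should give Phrases the input shape it expects.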

