python - gensim.models.phrases.Phrases does not work when input comes from NLTK stemmed tokens
I have this code:
from gensim.models import phrases
from nltk.stem.snowball import SnowballStemmer

# `nlp` is assumed to be an already-loaded spaCy pipeline.

def build_trigram_model(corpus):
    corpus = lemmer(nlp(corpus))
    bigram = phrases.Phrases(corpus, min_count=2, threshold=40)
    bigram_phraser = phrases.Phraser(bigram)
    trigram = phrases.Phrases(bigram_phraser[corpus], min_count=2, threshold=50)
    trigram_phraser = phrases.Phraser(trigram)
    return bigram_phraser, trigram_phraser

def punct_space(token):
    """Helper function to eliminate punctuation, spaces, and numbers."""
    return token.is_punct or token.is_space or token.like_num

def lemmer(tokens):
    word_space = []
    stemmer = SnowballStemmer("english")
    for token in tokens:
        if not punct_space(token):
            # Stem the lowercased token text and add it to the output list.
            word_space.append(stemmer.stem(str(token).lower()))
    return word_space

bigram_phraser, trigram_phraser = build_trigram_model(corpus)
This returns something like:
bigram_phraser.phrasegrams
{(b'\xef\xa3\xaf', b'\xef\xa3\xb0'): (189, 49.569954185342624),
 (b'\xef\xa3\xba', b'\xef\xa3\xbb'): (189, 49.162979913552),
 (b'\xce\xbf', b'\xcf\x82'): (11, 52.7203947368421),
 (b'\xef\xa3\xb1', b'\xef\xa3\xb2'): (13, 78.54622410336378),
 (b'\xef\xa3\xb2', b'\xef\xa3\xb3'): (12, 74.75279850746269),
 (b'\xef\xa3\xbc', b'\xef\xa3\xbd'): (8, 73.94232987312573),
 (b'\xef\xa3\xbd', b'\xef\xa3\xbe'): (7, 65.46977124183007),
 (b'\xc9\xa1', b'\xcc\x8a'): (29, 64.73618071658315),
 (b'\xef\x9d\xb4', b'\xef\x9d\xb2'): (51, 105.61094674556212),
 (b'\xef\x9d\xa3', b'\xef\x9d\xac'): (26, 53.88736340711684)}
When I run new text through the model, no collocations are found.
However, when I use the following lemmatizer instead:
def lemmer(tokens):
    """Lemmatize words."""
    word_space = []
    for sent in tokens.sents:
        sentence = []
        for token in sent:
            if not punct_space(token):
                # spaCy marks pronouns with a placeholder lemma; keep their lowercased text.
                if token.lemma_ == '-PRON-':
                    sentence.append(token.lower_)
                else:
                    sentence.append(token.lemma_)
        word_space.append(sentence)
    return word_space
everything works as it is supposed to. As far as I can tell, the data type returned by both versions is a list of strings, so that cannot be the problem. Any ideas why this happens? Thanks!
- platform: linux-4.4.0-1030-aws-x86_64-with-debian-stretch-sid
- python version: 3.6.1
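
One thing that may be worth checking is the shape of what each `lemmer` returns. Phrases expects an iterable of sentences, each of which is a list of tokens. The stemmer-based `lemmer` appends every stem to one flat list, while the second `lemmer` builds one list per sentence. If a flat list of strings is passed in, each string is iterated as if it were a sentence and each character as if it were a token, which would be consistent with the single-character pairs in the phrasegrams output. The sketch below illustrates the difference in plain Python. `count_adjacent_pairs` is a hypothetical helper written for this illustration, a simplified stand-in for pair counting and not gensim's actual implementation, and the sample tokens are made up.

```python
from collections import Counter


def count_adjacent_pairs(sentences):
    """Count adjacent token pairs, treating each item of `sentences` as one sentence.

    Hypothetical helper for illustration only; it is not gensim's code.
    """
    pairs = Counter()
    for sentence in sentences:
        tokens = list(sentence)  # a str is iterated character by character
        pairs.update(zip(tokens, tokens[1:]))
    return pairs


# Shape produced by the stemmer-based lemmer: one flat list of strings.
flat = ["new", "york", "citi", "new", "york"]

# Shape produced by the sentence-based lemmer: a list of token lists.
nested = [["new", "york", "citi"], ["new", "york"]]

flat_pairs = count_adjacent_pairs(flat)
nested_pairs = count_adjacent_pairs(nested)

print(flat_pairs.most_common(3))    # pairs of single characters
print(nested_pairs.most_common(3))  # pairs of whole words
```

If this is indeed the cause, grouping the stems per sentence in the stemmer-based `lemmer`, the way the second version does with `tokens.sents`, should give Phrases the input it expects.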