syntax - Python and nGrams
Aster user here, trying to move to Python for basic text analytics. I am trying to replicate the output of Aster's nGram function in Python using NLTK or another module. I need to be able to produce n-grams of length 1 through 4 and write the output to CSV.
Data:
unique_id, text_narrative
Output needed:
unique_id, ngram(token), ngram(frequency)
Example output:
- 023345 "i" 1
- 023345 "love" 1
- 023345 "python" 1
I wrote a simple version using only Python's standard library, for educational reasons. Production code should use spaCy and pandas.
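    import collections
    from operator import itemgetter as at

    # Read "unique_id,text_narrative" rows (assumes no header row).
    # Split on the first comma only, so commas in the narrative stay in the text.
    with open("input.csv", 'r') as f:
        data = [l.rstrip('\n').split(',', 1) for l in f if ',' in l]

    # Join a window of (id, word) pairs into one n-gram, but only when
    # every word in the window belongs to the same unique_id.
    def spaced(t):
        if all(x[0] == t[0][0] for x in t):
            return (t[0][0], ' '.join(map(at(1), t)))
        return ()

    unigrams = [(i, w) for i, d in data for w in d.split()]
    bigrams = filter(any, map(spaced, zip(unigrams, unigrams[1:])))
    trigrams = filter(any, map(spaced, zip(unigrams, unigrams[1:], unigrams[2:])))

    # Count each (unique_id, ngram) pair and write one CSV row per pair.
    with open("output.csv", 'w') as f:
        for ngram in [unigrams, bigrams, trigrams]:
            counts = collections.Counter(ngram)
            for t, count in counts.items():
                f.write("{i},{w},{c}\n".format(c=count, i=t[0], w=t[1]))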
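The question asks for n-grams of length 1 through 4, while the code above stops at trigrams. A minimal sketch of one way to generalise it, still using only the standard library, might look like this. The function names `ngram_counts` and `write_ngram_csv` are hypothetical, and lowercasing plus whitespace splitting is an assumed tokenisation, not necessarily what Aster does.

```python
import csv
from collections import Counter


def ngram_counts(rows, max_n=4):
    """Count n-grams of length 1..max_n, keyed by (unique_id, ngram).

    rows is an iterable of (unique_id, text) pairs. Tokenisation is an
    assumption made for this sketch: lowercase, then split on whitespace.
    """
    counts = Counter()
    for unique_id, text in rows:
        tokens = text.lower().split()
        for n in range(1, max_n + 1):
            # Slide a window of width n over this row's tokens only,
            # so an n-gram never spans two different narratives.
            for start in range(len(tokens) - n + 1):
                gram = " ".join(tokens[start:start + n])
                counts[(unique_id, gram)] += 1
    return counts


def write_ngram_csv(in_path, out_path, max_n=4):
    """Read unique_id,text_narrative rows; write unique_id,ngram,frequency."""
    with open(in_path, newline="") as src:
        # csv.reader handles quoted narratives that contain commas.
        rows = [(r[0], r[1]) for r in csv.reader(src) if len(r) >= 2]
    counts = ngram_counts(rows, max_n)
    with open(out_path, "w", newline="") as dst:
        writer = csv.writer(dst)
        for (unique_id, gram), freq in counts.items():
            writer.writerow([unique_id, gram, freq])


if __name__ == "__main__":
    sample = [("023345", "I love Python")]
    for (uid, gram), freq in ngram_counts(sample).items():
        print(uid, gram, freq)
```

Calling `write_ngram_csv("input.csv", "output.csv")` would then produce the three-column file described in the question. A header row in the input would be counted as data here, so it would need to be skipped first.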