syntax - Python and nGrams


I'm an Aster user trying to move to Python for basic text analytics. I am trying to replicate the output of Aster's nGram function in Python, using NLTK or some other module. I need to be able to produce n-grams of 1 through 4 words and write the output to CSV.

data:

unique_id, text_narrative 

output needed:

unique_id, ngram(token), ngram(frequency) 

example output:

  • 023345 "i" 1
  • 023345 "love" 1
  • 023345 "python" 1

I wrote a simple version using only Python's standard library, for educational reasons.

Production code should use spaCy and pandas.

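    import collections
    from operator import itemgetter as at

    # Each input line is "unique_id,text_narrative"; split on the first comma only,
    # so commas inside the narrative do not break the two-column unpacking below.
    with open("input.csv", 'r') as f:
        data = [l.split(',', 1) for l in f.readlines()]

    def spaced(t):
        # Keep an n-gram only if all of its tokens belong to the same unique_id.
        if all(i == t[0][0] for i, _ in t):
            return (t[0][0], ' '.join(map(at(1), t)))
        return []

    # (unique_id, word) pairs in document order.
    unigrams = [(i, w) for i, d in data for w in d.split()]
    bigrams = filter(any, map(spaced, zip(unigrams, unigrams[1:])))
    trigrams = filter(any, map(spaced, zip(unigrams, unigrams[1:], unigrams[2:])))

    with open("output.csv", 'w') as f:
        for ngram in [unigrams, bigrams, trigrams]:
            counts = collections.Counter(ngram)
            for t, count in counts.items():
                f.write("{i},{w},{c}\n".format(c=count, i=t[0], w=t[1]))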
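The question asks for n-grams of 1 through 4, while the code above stops at trigrams. Here is a rough sketch of one way to generalize it, still using only the standard library. The names `ngrams`, `count_ngrams`, `write_counts` and `max_n` are made up for illustration. The whitespace tokenizer and the two-column input layout are assumptions taken from the question. Unlike the code above, it builds n-grams one row at a time, so an n-gram never spans two input rows.

```python
import csv
import io
from collections import Counter


def ngrams(tokens, n):
    """Return the n-grams of a token list as space-joined strings."""
    # Zipping n staggered slices yields every window of n consecutive tokens.
    return [" ".join(window) for window in zip(*(tokens[k:] for k in range(n)))]


def count_ngrams(rows, max_n=4):
    """Count n-grams of size 1..max_n per unique_id.

    `rows` is an iterable of (unique_id, text_narrative) pairs.
    """
    counts = Counter()
    for unique_id, text in rows:
        tokens = text.split()  # naive whitespace tokenizer (an assumption)
        for n in range(1, max_n + 1):
            for gram in ngrams(tokens, n):
                counts[(unique_id, gram)] += 1
    return counts


def write_counts(counts, out_file):
    """Write unique_id, ngram, frequency rows; csv.writer handles quoting."""
    writer = csv.writer(out_file)
    writer.writerow(["unique_id", "ngram", "frequency"])
    for (unique_id, gram), freq in counts.items():
        writer.writerow([unique_id, gram, freq])


if __name__ == "__main__":
    # In-memory stand-in for input.csv so the sketch runs on its own.
    sample = io.StringIO("023345,i love python\n")
    counts = count_ngrams(csv.reader(sample))
    out = io.StringIO()
    write_counts(counts, out)
    print(out.getvalue())
```

To run it against real files, pass `csv.reader(open("input.csv", newline=""))` to `count_ngrams` and an output file opened with `newline=""` to `write_counts`. If the input has a header row, skip it first with `next(reader)`.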
