python 3.x - Adding documents to gensim model
I have a class wrapping the various objects required for calculating LSI similarity:

    from gensim import corpora, models, similarities

    class SimilarityFiles:
        def __init__(self, file_name, tokenized_corpus, stoplist=None):
            if stoplist is None:
                self.filtered_corpus = tokenized_corpus
            else:
                self.filtered_corpus = []
                for convo in tokenized_corpus:
                    self.filtered_corpus.append([token for token in convo if token not in stoplist])
            self.dictionary = corpora.Dictionary(self.filtered_corpus)
            self.corpus = [self.dictionary.doc2bow(text) for text in self.filtered_corpus]
            self.lsi = models.LsiModel(self.corpus, id2word=self.dictionary, num_topics=100)
            self.index = similarities.MatrixSimilarity(self.lsi[self.corpus])

I want to add a function to the class that allows adding documents to the corpus and updating the model accordingly. I've found dictionary.add_documents and model.add_documents, but there are two things that aren't clear to me:
- When I create the LSI model, one of the parameters the function receives is id2word=dictionary. When updating the model, how do I tell it to use the updated dictionary? Is that unnecessary, or will it make a difference?
- How do I update the index? It looks from the documentation that if I use the Similarity class, and not the MatrixSimilarity class, I can add documents to the index, but I don't see such functionality for MatrixSimilarity. If I understood correctly, MatrixSimilarity is better if my input corpus contains dense vectors (which it does, because I'm using the LSI model). Do I have to change it to Similarity just so I can update the index? Or, conversely, what's the complexity of creating the index? If it's insignificant, should I just create a new index with the updated corpus, as follows:
    self.dictionary.add_documents(new_docs)  # new_docs is already after filtering stop words
    new_corpus = [self.dictionary.doc2bow(text) for text in new_docs]
    self.lsi.add_documents(new_corpus)
    self.index = similarities.MatrixSimilarity(self.lsi[self.corpus])

Thanks. :)