numpy - Python - Data structure of csr_matrix
I am studying TF-IDF. I have used tfidf_vectorizer.fit_transform, which returns a csr_matrix, and I cannot understand the structure of the result.
- Data input:
documents = ( "The sky is blue", "The sun is bright", "The sun in the sky is bright", "We can see the shining sun, the bright sun" )
- Statement:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vectorizer = TfidfVectorizer()
tfidf_matrix = tfidf_vectorizer.fit_transform(documents)
print(tfidf_matrix)
- The result:
(0, 9) 0.34399327143
(0, 7) 0.519713848879
(0, 4) 0.420753151645
(0, 0) 0.659191117868
(1, 9) 0.426858009784
(1, 4) 0.522108621994
(1, 8) 0.522108621994
(1, 1) 0.522108621994
(2, 9) 0.526261040111
(2, 7) 0.397544332095
(2, 4) 0.32184639876
(2, 8) 0.32184639876
(2, 1) 0.32184639876
(2, 3) 0.504234576856
(3, 9) 0.390963088213
(3, 8) 0.47820398015
(3, 1) 0.239101990075
(3, 10) 0.374599471224
(3, 2) 0.374599471224
(3, 5) 0.374599471224
(3, 6) 0.374599471224
tfidf_matrix is a csr_matrix. I searched on this, but the scipy.sparse.csr_matrix documentation shows no structure that matches this result.
What is the structure of a value such as (0, 9) 0.34399327143?
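To show what I mean, even a tiny hand-made csr_matrix (my own minimal example) prints in this same way, one stored nonzero per line as (row, column) value:

```python
import numpy as np
from scipy import sparse

# a 2x2 matrix with two nonzero entries
m = sparse.csr_matrix(np.array([[0.0, 0.5],
                                [0.7, 0.0]]))
print(m)   # one line per stored nonzero: "(row, column)  value"
```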
Without the vectorizer I can recreate your matrix, more or less, with this sequence of operations (punctuation is dropped from the last sentence so that a plain split() is enough):
In [703]: documents = ( "The sky is blue", "The sun is bright", "The sun in the sky is bright", "We can see the shining sun the bright sun" )
Get a list of lists of words (all lower case):
In [704]: alist = [l.lower().split() for l in documents]
Get a sorted list of the unique words:
In [705]: aset = set()
In [706]: [aset.update(l) for l in alist]
Out[706]: [None, None, None, None]
In [707]: unq = sorted(list(aset))
In [708]: unq
Out[708]: ['blue', 'bright', 'can', 'in', 'is', 'see', 'shining', 'sky', 'sun', 'the', 'we']
Go through alist and collect word counts; rows holds the sentence number, cols the unique-word index:
In [709]: rows, cols, data = [], [], []
In [710]: for i, row in enumerate(alist):
     ...:     for c in row:
     ...:         rows.append(i)
     ...:         cols.append(unq.index(c))
     ...:         data.append(1)
     ...:
Make the sparse matrix from that data (this uses from scipy import sparse):
In [711]: m = sparse.csr_matrix((data, (rows, cols)))
In [712]: m
Out[712]:
<4x11 sparse matrix of type '<class 'numpy.int32'>'
        with 21 stored elements in Compressed Sparse Row format>
In [713]: print(m)
(0, 0)  1
(0, 4)  1
(0, 7)  1
(0, 9)  1
(1, 1)  1
....
(3, 9)  2
(3, 10) 1
In [714]: m.A   # viewed as a 2d array
Out[714]:
array([[1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0],
       [0, 1, 0, 0, 1, 0, 0, 0, 1, 1, 0],
       [0, 1, 0, 1, 1, 0, 0, 1, 1, 2, 0],
       [0, 1, 1, 0, 0, 1, 1, 0, 2, 2, 1]], dtype=int32)
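As a cross-check (my addition; the answer above builds the counts by hand), sklearn's CountVectorizer produces the same 4x11 count matrix with the same column ordering, since its default tokenizer lowercases and keeps words of two or more letters:

```python
from sklearn.feature_extraction.text import CountVectorizer

documents = ("The sky is blue", "The sun is bright",
             "The sun in the sky is bright",
             "We can see the shining sun, the bright sun")

cv = CountVectorizer()
counts = cv.fit_transform(documents)   # sparse count matrix
print(counts.toarray())                # same rows as m.A above
```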
Since you are using sklearn, I can reproduce your matrix with:
In [717]: from sklearn import feature_extraction
In [718]: tf = feature_extraction.text.TfidfVectorizer()
In [719]: tfm = tf.fit_transform(documents)
In [720]: tfm
Out[720]:
<4x11 sparse matrix of type '<class 'numpy.float64'>'
        with 21 stored elements in Compressed Sparse Row format>
In [721]: print(tfm)
(0, 9)  0.34399327143
(0, 7)  0.519713848879
(0, 4)  0.420753151645
....
(3, 5)  0.374599471224
(3, 6)  0.374599471224
In [722]: tfm.A
Out[722]:
array([[ 0.65919112,  0.        ,  0.        ,  0.        ,  0.42075315,
         0.        ,  0.        ,  0.51971385,  0.        ,  0.34399327,
         0.        ],
       ....
       [ 0.        ,  0.23910199,  0.37459947,  0.        ,  0.        ,
         0.37459947,  0.37459947,  0.        ,  0.47820398,  0.39096309,
         0.37459947]])
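The values themselves come from sklearn's default TF-IDF weighting. A minimal sketch (my reconstruction, assuming the defaults smooth_idf=True and norm='l2') that reproduces the printed numbers from the count matrix above:

```python
import numpy as np

# the 4x11 count matrix (same as m.A above)
counts = np.array([[1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0],
                   [0, 1, 0, 0, 1, 0, 0, 0, 1, 1, 0],
                   [0, 1, 0, 1, 1, 0, 0, 1, 1, 2, 0],
                   [0, 1, 1, 0, 0, 1, 1, 0, 2, 2, 1]], dtype=float)

n_docs = counts.shape[0]
df = (counts > 0).sum(axis=0)                # document frequency per term
idf = np.log((1 + n_docs) / (1 + df)) + 1    # smoothed idf (sklearn default)
tfidf = counts * idf                         # weight the raw counts
tfidf /= np.linalg.norm(tfidf, axis=1, keepdims=True)  # l2-normalize each row

print(tfidf[0, 9])   # reproduces the (0, 9) 0.34399327143 entry above
```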
The actual data is stored in 3 attribute arrays:
In [723]: tfm.indices
Out[723]:
array([ 9,  7,  4,  0,  9,  4,  8,  1,  9,  7,  4,  8,  1,  3,  9,  8,  1,
       10,  2,  5,  6], dtype=int32)
In [724]: tfm.data
Out[724]:
array([ 0.34399327,  0.51971385,  0.42075315,  0.65919112,  0.42685801,
        ...
        0.37459947])
In [725]: tfm.indptr
Out[725]: array([ 0,  4,  8, 14, 21], dtype=int32)
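How the three arrays fit together: for row i, the stored column indices are indices[indptr[i]:indptr[i+1]] and the corresponding values are data[indptr[i]:indptr[i+1]]. A small self-contained sketch:

```python
import numpy as np
from scipy import sparse

# a toy CSR matrix; it has the same three attribute arrays as tfm above
m = sparse.csr_matrix(np.array([[0, 2, 0],
                                [3, 0, 4]]))

i = 1                                     # pick a row
start, stop = m.indptr[i], m.indptr[i + 1]
print(m.indices[start:stop])              # columns of row 1's nonzeros: [0 2]
print(m.data[start:stop])                 # the stored values:           [3 4]
```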
The indices values of the individual rows tell us which words occur in each sentence:
In [726]: np.array(unq)[m[0,].indices]
Out[726]: array(['blue', 'is', 'sky', 'the'], dtype='<U7')
In [727]: np.array(unq)[m[3,].indices]
Out[727]: array(['bright', 'can', 'see', 'shining', 'sun', 'the', 'we'], dtype='<U7')
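To map columns back to words without building unq by hand, the fitted vectorizer's vocabulary_ dict (word -> column index) can be inverted; get_feature_names_out() does the same in recent sklearn versions. A sketch:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

documents = ("The sky is blue", "The sun is bright",
             "The sun in the sky is bright",
             "We can see the shining sun, the bright sun")

tf = TfidfVectorizer()
tfm = tf.fit_transform(documents)

# invert vocabulary_: order the words by their column index
words = np.array(sorted(tf.vocabulary_, key=tf.vocabulary_.get))
print(words[tfm[0].indices])   # words present in sentence 0 (storage order)
```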