numpy - Python - Data structure of csr_matrix -


I am studying TF-IDF. I have used tfidf_vectorizer.fit_transform. It returns a csr_matrix, but I cannot understand the structure of the result.

  • data input:

documents = (
    "the sky is blue",
    "the sun is bright",
    "the sun in the sky is bright",
    "we can see the shining sun, the bright sun"
)

  • statement:

from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vectorizer = TfidfVectorizer()
tfidf_matrix = tfidf_vectorizer.fit_transform(documents)
print(tfidf_matrix)

  • the result:

(0, 9) 0.34399327143
(0, 7) 0.519713848879
(0, 4) 0.420753151645
(0, 0) 0.659191117868
(1, 9) 0.426858009784
(1, 4) 0.522108621994
(1, 8) 0.522108621994
(1, 1) 0.522108621994
(2, 9) 0.526261040111
(2, 7) 0.397544332095
(2, 4) 0.32184639876
(2, 8) 0.32184639876
(2, 1) 0.32184639876
(2, 3) 0.504234576856
(3, 9) 0.390963088213
(3, 8) 0.47820398015
(3, 1) 0.239101990075
(3, 10) 0.374599471224
(3, 2) 0.374599471224
(3, 5) 0.374599471224
(3, 6) 0.374599471224

tfidf_matrix is a csr_matrix. I looked this up, but I found no structure that looks like this result: scipy.sparse.csr_matrix

What is the structure of a value like (0, 9) 0.34399327143?

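Without the vectorizer I can recreate your matrix, more or less, with the following sequence of operations (this assumes numpy is imported as np and sparse is imported from scipy):

In [703]: documents = (
     ...:     "the sky is blue",
     ...:     "the sun is bright",
     ...:     "the sun in the sky is bright",
     ...:     "we can see the shining sun the bright sun"
     ...: )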

Get a list of lists of words (all lower case):

In [704]: alist = [l.lower().split() for l in documents]

Get a sorted list of the unique words:

In [705]: aset = set()
In [706]: [aset.update(l) for l in alist]
Out[706]: [None, None, None, None]
In [707]: unq = sorted(list(aset))
In [708]: unq
Out[708]:
['blue',
 'bright',
 'can',
 'in',
 'is',
 'see',
 'shining',
 'sky',
 'sun',
 'the',
 'we']

Go through alist and collect word counts. rows holds the sentence number, cols the index of the word in the unique word list:

In [709]: rows, cols, data = [],[],[]
In [710]: for i,row in enumerate(alist):
     ...:     for c in row:
     ...:         rows.append(i)
     ...:         cols.append(unq.index(c))
     ...:         data.append(1)
     ...:

Make a sparse matrix from this data:

In [711]: m = sparse.csr_matrix((data,(rows,cols)))
In [712]: m
Out[712]:
<4x11 sparse matrix of type '<class 'numpy.int32'>'
    with 21 stored elements in Compressed Sparse Row format>
In [713]: print(m)
  (0, 0)    1
  (0, 4)    1
  (0, 7)    1
  (0, 9)    1
  (1, 1)    1
  ....
  (3, 9)    2
  (3, 10)   1
In [714]: m.A        # viewed as a 2d array
Out[714]:
array([[1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0],
       [0, 1, 0, 0, 1, 0, 0, 0, 1, 1, 0],
       [0, 1, 0, 1, 1, 0, 0, 1, 1, 2, 0],
       [0, 1, 1, 0, 0, 1, 1, 0, 2, 2, 1]], dtype=int32)
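The counts of 2 in the last two rows come from how the matrix is built: each word occurrence contributes one (row, col, 1) triple, and csr_matrix sums triples that share the same coordinates. A minimal sketch of the steps so far as a single script, assuming scipy is installed; the names tokens, vocab and col_of are illustrative and not part of the session above:

```python
from scipy import sparse

documents = (
    "the sky is blue",
    "the sun is bright",
    "the sun in the sky is bright",
    "we can see the shining sun the bright sun",
)

# Tokenize on whitespace and build a sorted vocabulary.
tokens = [doc.lower().split() for doc in documents]
vocab = sorted({word for row in tokens for word in row})
col_of = {word: j for j, word in enumerate(vocab)}

# One (row, col, 1) triple per word occurrence; a repeated word
# produces several triples with the same coordinates.
rows, cols, data = [], [], []
for i, row in enumerate(tokens):
    for word in row:
        rows.append(i)
        cols.append(col_of[word])
        data.append(1)

# Building a csr_matrix from triples sums duplicate (row, col) entries,
# which is what turns occurrences into counts.
counts = sparse.csr_matrix(
    (data, (rows, cols)), shape=(len(documents), len(vocab))
)

print(len(data), "triples ->", counts.nnz, "stored elements")
print(counts.toarray())
```

The dictionary lookup col_of[word] replaces unq.index(c), which rescans the list on every call; the resulting matrix is the same.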

Since you are using sklearn, I can reproduce your matrix with:

In [717]: from sklearn import feature_extraction
In [718]: tf = feature_extraction.text.TfidfVectorizer()
In [719]: tfm = tf.fit_transform(documents)
In [720]: tfm
Out[720]:
<4x11 sparse matrix of type '<class 'numpy.float64'>'
    with 21 stored elements in Compressed Sparse Row format>
In [721]: print(tfm)
  (0, 9)    0.34399327143
  (0, 7)    0.519713848879
  (0, 4)    0.420753151645
  ....
  (3, 5)    0.374599471224
  (3, 6)    0.374599471224
In [722]: tfm.A
Out[722]:
array([[ 0.65919112,  0.        ,  0.        ,  0.        ,  0.42075315,
         0.        ,  0.        ,  0.51971385,  0.        ,  0.34399327,
         0.        ],
       ....
       [ 0.        ,  0.23910199,  0.37459947,  0.        ,  0.        ,
         0.37459947,  0.37459947,  0.        ,  0.47820398,  0.39096309,
         0.37459947]])
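The float values can be traced back to the word counts. The sketch below uses only numpy and hard-codes the 4x11 count matrix shown earlier. It assumes the vectorizer's default settings (smooth_idf=True, norm='l2', no sublinear tf scaling), under which idf = ln((1 + n) / (1 + df)) + 1 and each row is then scaled to unit Euclidean length. It is an illustration of that formula, not the library's own code:

```python
import numpy as np

# Word counts per sentence (rows) and vocabulary word (columns),
# copied from the dense view of the count matrix.
counts = np.array([
    [1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0],
    [0, 1, 0, 0, 1, 0, 0, 0, 1, 1, 0],
    [0, 1, 0, 1, 1, 0, 0, 1, 1, 2, 0],
    [0, 1, 1, 0, 0, 1, 1, 0, 2, 2, 1],
], dtype=float)

n_docs = counts.shape[0]

# Document frequency: in how many sentences each word appears.
df = (counts > 0).sum(axis=0)

# Smoothed inverse document frequency (assumed default setting).
idf = np.log((1 + n_docs) / (1 + df)) + 1

# Weight the counts, then normalize each row to unit L2 norm.
weighted = counts * idf
tfidf = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

# Entry (0, 9) is the weight of 'the' in the first sentence.
print(tfidf[0, 9])
```

Under these assumptions, 'the' appears in all four sentences, so its idf is 1; its weight in the first sentence is simply 1 divided by the length of that row's weighted vector.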

The actual data is stored in 3 attribute arrays:

In [723]: tfm.indices
Out[723]:
array([ 9,  7,  4,  0,  9,  4,  8,  1,  9,  7,  4,  8,  1,  3,  9,  8,  1,
       10,  2,  5,  6], dtype=int32)
In [724]: tfm.data
Out[724]:
array([ 0.34399327,  0.51971385,  0.42075315,  0.65919112,  0.42685801,
       ...
        0.37459947])
In [725]: tfm.indptr
Out[725]: array([ 0,  4,  8, 14, 21], dtype=int32)
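These three arrays are enough to rebuild the (row, col) value lines that print shows. Row i owns the slice indptr[i]:indptr[i+1] of both indices (the column numbers) and data (the values). A minimal pure-Python sketch, with the arrays copied from the output above and the full values taken from the printed matrix in the question; csr_triples is a hypothetical helper name, not a scipy function:

```python
# indptr and indices as shown for tfm; data values as printed in the question.
indptr = [0, 4, 8, 14, 21]
indices = [9, 7, 4, 0, 9, 4, 8, 1, 9, 7, 4, 8, 1, 3, 9, 8, 1, 10, 2, 5, 6]
data = [
    0.34399327143, 0.519713848879, 0.420753151645, 0.659191117868,
    0.426858009784, 0.522108621994, 0.522108621994, 0.522108621994,
    0.526261040111, 0.397544332095, 0.32184639876, 0.32184639876,
    0.32184639876, 0.504234576856,
    0.390963088213, 0.47820398015, 0.239101990075, 0.374599471224,
    0.374599471224, 0.374599471224, 0.374599471224,
]


def csr_triples(indptr, indices, data):
    """Expand CSR arrays into a list of (row, col, value) triples."""
    triples = []
    for row in range(len(indptr) - 1):
        # Row `row` owns positions start..stop-1 of indices and data.
        start, stop = indptr[row], indptr[row + 1]
        for col, value in zip(indices[start:stop], data[start:stop]):
            triples.append((row, col, value))
    return triples


triples = csr_triples(indptr, indices, data)
for row, col, value in triples[:4]:
    print((row, col), value)
```

So (0, 9) 0.34399327143 reads as: row 0 (the first sentence), column 9 (the tenth vocabulary word, 'the'), value 0.34399327143.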

The indices values for individual rows tell us which words occur in that sentence:

In [726]: np.array(unq)[m[0,].indices]
Out[726]:
array(['blue', 'is', 'sky', 'the'],
      dtype='<U7')
In [727]: np.array(unq)[m[3,].indices]
Out[727]:
array(['bright', 'can', 'see', 'shining', 'sun', 'the', 'we'],
      dtype='<U7')
