python - PDF parsing: using pdfminer and pandas

February 15, 2010

I am trying to parse a PDF file into CSV format. The PDF contains a table without a frame, so the method suggested here does not work. My idea is to use pdfminer to analyze the layout of the PDF, locate the text lines, and match the bbox location of each text line to reconstruct the table.

So far I have sorted the text lines into "left" and "right" columns by comparing the x0 coordinate of each text line object, and I am going to match the left and right lines based on their y0 coordinates. When trying to put the content of each line into a pandas DataFrame, I got "TypeError: cannot concatenate a non-NDFrame object". Please help.

My code is as follows:

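import pdfminer.layout
import pandas as pd

testfile = 'file location'
# extract_layout_by_page is my own helper (not shown); it returns one layout per page
page_layouts = extract_layout_by_page(testfile)
l_lines = []
r_lines = []
for elem in page_layouts[0]:
    if isinstance(elem, pdfminer.layout.LTTextBoxHorizontal):
        for l in elem:
            (x0, y0, x1, y1) = l.bbox
            # sort each text line into the left or right column by its x0
            if x0 <= 65.35 and x0 >= 65.33:
                l_lines.append(l)
            elif x0 <= 280.1 and x0 >= 279.9:
                r_lines.append(l)

csv = pd.DataFrame()
csv['l'] = 0
csv['r'] = 0

for i in r_lines:
    x = i.get_text().encode('ascii', 'ignore')
    csv['r'].append(x)  # the TypeError is raised here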

Thank you in advance.
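
The error comes from calling `append` on a DataFrame column: `csv['r']` is a pandas Series, and `Series.append` concatenated Series objects and returned a new Series rather than adding a single value in place (the method was removed in pandas 2.0). A minimal sketch of one way to finish the y0 matching and avoid the error is to collect the rows in a plain list and build the DataFrame once at the end. The `Line` namedtuple, the `pair_lines` function, the `tol` tolerance, and the sample coordinates are all hypothetical and exist only for illustration; with real pdfminer text lines, `get_text()` would take the place of the `text` field.

```python
from collections import namedtuple

import pandas as pd

# Hypothetical stand-in for a pdfminer text line: only bbox and text are used here.
Line = namedtuple("Line", ["bbox", "text"])


def pair_lines(l_lines, r_lines, tol=2.0):
    """Pair each left line with the right line whose y0 is closest, within tol points."""
    rows = []
    unused = list(r_lines)
    # PDF y coordinates grow upward, so sort descending to read top to bottom.
    for left in sorted(l_lines, key=lambda ln: ln.bbox[1], reverse=True):
        match = None
        if unused:
            candidate = min(unused, key=lambda ln: abs(ln.bbox[1] - left.bbox[1]))
            if abs(candidate.bbox[1] - left.bbox[1]) <= tol:
                match = candidate
                unused.remove(candidate)
        # A left line with no partner gets an empty right cell.
        rows.append({"l": left.text.strip(), "r": match.text.strip() if match else ""})
    return rows


# Made-up sample data: two left-column lines and two right-column lines.
l_lines = [
    Line((65.34, 700.0, 200.0, 712.0), "Name\n"),
    Line((65.34, 680.0, 200.0, 692.0), "Date\n"),
]
r_lines = [
    Line((280.0, 680.5, 400.0, 692.5), "2010-02-15\n"),
    Line((280.0, 700.2, 400.0, 712.2), "Example\n"),
]

rows = pair_lines(l_lines, r_lines)
# Build the DataFrame in one step from the list of row dicts.
table = pd.DataFrame(rows, columns=["l", "r"])
csv_text = table.to_csv(index=False)
```

From there, `table.to_csv('output.csv', index=False)` would write the result to a file (the file name is only an example). The tolerance is a guess and would likely need tuning against the actual line spacing in the PDF.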
