python - PDF parsing: using pdfminer and pandas
I am trying to parse a PDF file into CSV format. In the PDF there is a table without a frame, so the method suggested here does not work. My idea is to use pdfminer to analyze the layout of the PDF, locate the text lines, and match the bbox location of each text line to reconstruct the table.
So far I have sorted the text lines into "left" and "right" columns by comparing the x0 coordinate of each text line object, and I am going to match the left and right lines based on their y0 coordinates. When I try to put the content of each line into a pandas DataFrame, I get "TypeError: cannot concatenate a non-NDFrame object". Please help.
My code is as follows:
    import pandas as pd
    import pdfminer.layout

    testfile = 'file location'
    # extract_layout_by_page is my own helper (not shown here)
    page_layouts = extract_layout_by_page(testfile)

    l_lines = []
    r_lines = []
    for elem in page_layouts[0]:
        if isinstance(elem, pdfminer.layout.LTTextBoxHorizontal):
            for l in elem:
                (x0, y0, x1, y1) = l.bbox
                # sort each text line into a column by its left edge
                if x0 <= 65.35 and x0 >= 65.33:
                    l_lines.append(l)
                elif x0 <= 280.1 and x0 >= 279.9:
                    r_lines.append(l)

    csv = pd.DataFrame()
    csv['l'] = 0
    csv['r'] = 0
    for i in r_lines:
        x = i.get_text().encode('ascii', 'ignore')
        csv['r'].append(x)  # this line raises the TypeError
Thanks in advance.
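One likely cause of the error: `csv['r']` is a pandas Series, and `Series.append` expected another Series rather than a plain string (the method was removed entirely in pandas 2.0). A common way around it is to collect the rows in an ordinary Python list first and build the table once at the end. The sketch below shows that idea together with the y0 matching described above. The names `pair_by_y0`, `rows_to_csv`, `rows_to_dataframe` and the tolerance `tol` are hypothetical and not from the post; the input is assumed to be `(y0, text)` tuples, one list per column.

```python
import csv
import io


def pair_by_y0(left, right, tol=2.0):
    """Pair (y0, text) items from two columns whose y0 differ by at most tol.

    Rows come back top to bottom. PDF y coordinates grow upward,
    so a larger y0 means higher on the page.
    """
    unused = sorted(right, key=lambda item: -item[0])
    rows = []
    for y_left, text_left in sorted(left, key=lambda item: -item[0]):
        # Take the first unused right-hand line within the tolerance.
        match = None
        for candidate in unused:
            if abs(candidate[0] - y_left) <= tol:
                match = candidate
                break
        if match is not None:
            unused.remove(match)
        rows.append((y_left, text_left, match[1] if match else ""))
    # Right-hand lines with no partner on the left keep a row of their own.
    rows.extend((y, "", text) for y, text in unused)
    rows.sort(key=lambda row: -row[0])
    return [(l_text, r_text) for _, l_text, r_text in rows]


def rows_to_csv(rows):
    """Write the paired rows as CSV text with the headers 'l' and 'r'."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["l", "r"])
    writer.writerows(rows)
    return buf.getvalue()


def rows_to_dataframe(rows):
    """Build the DataFrame in one step instead of appending to a column."""
    import pandas as pd  # imported lazily so the pairing logic runs without pandas
    return pd.DataFrame(rows, columns=["l", "r"])
```

With the lists from the question, the inputs could be built as `[(l.bbox[1], l.get_text().strip()) for l in l_lines]` and likewise for `r_lines`, since `bbox[1]` is y0. A suitable `tol` depends on the font size and line spacing of the PDF, so the default of 2.0 is only a guess.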