matlab - Python, text detection OCR
I am trying to extract data from a scanned form. The form has a standard format similar to the one shown in the image below:
I have tried using pytesseract (Tesseract OCR) to detect the image's text, and it has done a decent job of finding the text and converting the image to text. However, it just gives me the detected text without keeping the format of the data.
I would like to be able to do something like the below:
Find a particular piece of text and then find the associated data below or beside it. This is similar to this question using OpenCV: Detect text region in image using OpenCV
Is there a way I can do one of the following:
- Either find all the text boxes on the form, perform OCR on each box and see which one is the closest match to the "witnesess:" text, then find the sections below it and perform separate OCR on those.
- Or, if the form is standard and I know the approximate location of the "witness" text section, can I specify its general location in OpenCV, then extract the text below it and perform OCR on it?
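The first option can be sketched roughly as follows. This is only a minimal sketch, not a tested solution for this form: it assumes the word-level boxes come from `pytesseract.image_to_data` with `Output.DICT`, and the helper names (`find_label`, `region_below`, `ocr_below_label`), the `min_ratio` threshold and the `band_height` value are all made up for illustration. Fuzzy matching is used because OCR often misreads a character or two in the label.

```python
from difflib import SequenceMatcher


def find_label(data, label, min_ratio=0.8):
    """Return (left, top, width, height) of the OCR word most similar to
    `label`, or None if no word reaches `min_ratio` similarity.

    `data` is a dict of parallel lists with the keys "text", "left", "top",
    "width" and "height", the layout pytesseract.image_to_data produces
    with output_type=Output.DICT.
    """
    best, best_ratio = None, min_ratio
    for i, word in enumerate(data["text"]):
        word = word.strip().lower()
        if not word:
            continue  # skip the empty entries emitted for blocks and lines
        ratio = SequenceMatcher(None, word, label.lower()).ratio()
        if ratio >= best_ratio:
            best_ratio = ratio
            best = (data["left"][i], data["top"][i],
                    data["width"][i], data["height"][i])
    return best


def region_below(box, image_width, image_height, band_height):
    """Return (x0, y0, x1, y1) of a full-width band directly under `box`."""
    left, top, width, height = box
    y0 = top + height
    y1 = min(y0 + band_height, image_height)
    return (0, y0, image_width, y1)


def ocr_below_label(image_path, label="witnesses:", band_height=200):
    """OCR the band under the word matching `label` (hypothetical helper)."""
    # Imported here so the pure helpers above work without these packages.
    import cv2
    import pytesseract

    image = cv2.imread(image_path)
    if image is None:
        raise FileNotFoundError(image_path)
    rgb = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
    data = pytesseract.image_to_data(rgb, output_type=pytesseract.Output.DICT)
    box = find_label(data, label)
    if box is None:
        return None
    height, width = rgb.shape[:2]
    x0, y0, x1, y1 = region_below(box, width, height, band_height)
    return pytesseract.image_to_string(rgb[y0:y1, x0:x1])
```

Calling `ocr_below_label('t2.jpg')` would then return the text found in the band under the label, or `None` if nothing on the page resembles it. The band height would need tuning to the actual form.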
EDIT: I have tried the code below to try to detect specific regions of text. However, it is not identifying the text regions.
    import cv2

    img = cv2.imread('t2.jpg')
    mser = cv2.MSER_create()

    # Resize the image so that MSER can work better
    img = cv2.resize(img, (img.shape[1]*2, img.shape[0]*2))
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    vis = img.copy()

    # Detect regions and draw the convex hull of each one
    regions = mser.detectRegions(gray)
    hulls = [cv2.convexHull(p.reshape(-1, 1, 2)) for p in regions[0]]
    cv2.polylines(vis, hulls, 1, (0, 255, 0))

    cv2.imshow('img', vis)
    cv2.waitKey(0)  # keep the window open until a key is pressed
Here is the result:
I think you already have the answer in your own post. I did something similar, and this is how I did it:
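    import pytesseract
    from PIL import Image

    # id_image is loaded with cv2.imread
    temp_image = id_image[start_y:end_y, start_x:end_x]
    img = Image.fromarray(temp_image)
    # "-psm 7" is the Tesseract 3.x spelling; Tesseract 4 and later expect "--psm 7"
    text = pytesseract.image_to_string(img, config="-psm 7")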
So basically, if the format is predefined, you just need to know the location of the fields you want the text of (which you already know), crop them, and then apply OCR (Tesseract) extraction.
In this case you need to import pytesseract, PIL, cv2, and numpy.
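As a rough illustration of the crop-then-OCR idea for several fields at once, here is a minimal sketch. Everything in it is assumed for the example: the field names, the pixel coordinates, the reference page size and the helpers `scale_box` and `read_fields` are all hypothetical. Scaling the boxes is only needed if the scans do not all come in at the same resolution.

```python
# Hypothetical layout: boxes are (start_x, start_y, end_x, end_y) in pixels,
# measured once on a reference scan of the assumed size below.
REFERENCE_SIZE = (1700, 2200)  # (width, height)
FIELDS = {
    "witness_1": (150, 1800, 800, 1860),
    "witness_2": (150, 1880, 800, 1940),
}


def scale_box(box, reference_size, actual_size):
    """Rescale a box measured on the reference scan to the actual scan size."""
    scale_x = actual_size[0] / reference_size[0]
    scale_y = actual_size[1] / reference_size[1]
    start_x, start_y, end_x, end_y = box
    return (round(start_x * scale_x), round(start_y * scale_y),
            round(end_x * scale_x), round(end_y * scale_y))


def read_fields(image_path, fields=FIELDS, reference_size=REFERENCE_SIZE):
    """Crop every known field from the scan and OCR each crop separately."""
    # Imported here so scale_box can be used without these packages.
    import cv2
    import pytesseract

    image = cv2.imread(image_path)
    if image is None:
        raise FileNotFoundError(image_path)
    height, width = image.shape[:2]
    results = {}
    for name, box in fields.items():
        x0, y0, x1, y1 = scale_box(box, reference_size, (width, height))
        crop = cv2.cvtColor(image[y0:y1, x0:x1], cv2.COLOR_BGR2RGB)
        # "--psm 7" treats the crop as a single line of text (Tesseract 4+)
        results[name] = pytesseract.image_to_string(crop, config="--psm 7").strip()
    return results
```

With real coordinates filled in, `read_fields('t2.jpg')` would return a dict mapping each field name to its recognized text.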