pycognaize.document.page.Page

class Page(page_number, document_id, path, image_height=None, image_width=None)

Bases: object

Represents a page of a document in pycognaize

Parameters:

page_number (int)
document_id (str)
path (str)
image_height (int)
image_width (int)

Methods

`draw`
`draw_ocr_boxes`	Draw boxes where text was detected and return the modified numpy array image
`draw_ocr_text`
`draw_rectangle`
`extract_area_words`	Finds the words on the page that fall within the area defined by the given coordinates
`extract_words_in_tag_area`
`free_form_text`	Return a text string from the ocr dictionary
`get_image`	Converts the page image to bytes
`get_ocr`	OCR of the page
`get_ocr_formatted`	Dict of words, paragraphs each containing their tag data
`get_page_data`	Data of the page
`search_text`	Detects the coordinates of the text in ocr of the page
`word_to_extraction_tag`	Construct ExtractionTag object from word

Attributes

`REGEX_NO_ALPHANUM_CHARS`
`doc_id`	Document id of the page
`image_arr`	Numpy array of the page image
`image_bytes`	Image of page in bytes
`image_height`	Height of the page image
`image_width`	Width of the page image
`line_tags`	Makes extraction tags from the words in the lines of the page. Returns a list of lists of tags, where each list represents a line and each tag is a word on that line, represented as an ExtractionTag with its coordinates in the document
`lines`	Detects lines of the page
`ocr`	Formatted ocr of page
`ocr_raw`
`ocr_tags`	Makes extraction tags from the words in the OCR data of the page. Returns a dict of lists of tags, where each list represents the formatted OCR of a page and each tag is the OCR data represented as an ExtractionTag with its coordinates in the document
`page_number`	Page number of page
`path`	Path of the source document

property doc_id: Document id of the page

draw_ocr_boxes(img=None)

Draw boxes where text was detected and return the modified numpy array image

Parameters: img (Optional[ndarray]) – Input image as numpy array. If not provided, a copy of the instance image is used
Return type: ndarray
Returns: numpy array of the image with word boxes

extract_area_words(left, right, top, bottom, threshold=0.5, return_tags=False, line_by_line=False)

Finds the words on the page that fall within the area defined by the given coordinates.

Parameters:

threshold (float) – Threshold value as a fraction (value between 0 and 1), default value is 0.5
left (int or float) – left coordinate
right (int or float) – right coordinate
top (int or float) – top coordinate
bottom (int or float) – bottom coordinate
return_tags (bool) – if True, returns tags of the words embedded in given area
line_by_line (bool) – if True, returns a list of lists, where each nested list is a line

Return type:

Optional[list]

Returns:

list of words, where each element is a dictionary holding the word's coordinates, ocr_text, and word_id_number
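
The role of `threshold` can be sketched without the library. The function below is a hypothetical stand-in, not pycognaize's implementation: it assumes each word is a dict with `left`, `right`, `top`, `bottom` and `ocr_text` keys, and that `threshold` is the minimum fraction of a word's box area that must fall inside the requested region. The name `words_in_area` and the sample data are invented for illustration.

```python
def words_in_area(words, left, right, top, bottom, threshold=0.5):
    """Hypothetical stand-in: keep words whose box overlaps the region enough.

    Assumption: threshold is the fraction of the word's own area that lies
    inside the region. The real method may define it differently.
    """
    selected = []
    for word in words:
        area = (word["right"] - word["left"]) * (word["bottom"] - word["top"])
        if area <= 0:
            continue  # skip degenerate boxes
        # Width and height of the intersection rectangle
        overlap_w = min(word["right"], right) - max(word["left"], left)
        overlap_h = min(word["bottom"], bottom) - max(word["top"], top)
        if overlap_w <= 0 or overlap_h <= 0:
            continue  # no intersection at all
        if (overlap_w * overlap_h) / area >= threshold:
            selected.append(word)
    return selected


# Invented sample data, coordinates in pixels
sample_words = [
    {"left": 10, "right": 30, "top": 10, "bottom": 20, "ocr_text": "Total"},
    {"left": 40, "right": 60, "top": 10, "bottom": 20, "ocr_text": "Revenue"},
    {"left": 90, "right": 110, "top": 10, "bottom": 20, "ocr_text": "2021"},
]

inside = words_in_area(sample_words, left=0, right=50, top=0, bottom=30)
inside_texts = [w["ocr_text"] for w in inside]
```

Under these assumptions "Revenue" is half inside the region, so it is kept at the default threshold of 0.5 and dropped at any higher value.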

free_form_text()

Return a text string from the ocr dictionary

Return type: str

get_image()

Converts the page image to bytes

Return type: bytes

get_ocr()

OCR of the page

Return type: Optional[dict]

get_ocr_formatted(stick_coords=False, return_tags=False)

Dict of words, paragraphs each containing their tag data

Parameters:

stick_coords (bool)
return_tags (bool)

Return type:

Union[dict, List[ExtractionTag]]

get_page_data()

Data of the page

Return type: None

property image_arr: ndarray: Numpy array of the page image

property image_bytes: bytes: Image of page in bytes

property image_height: int: Height of the page image

property image_width: int: Width of the page image

property line_tags: list

Makes extraction tags from the words in the lines of the page

Returns: list of lists of tags, where each list represents a line, and each tag in that list is a word on that line represented as an ExtractionTag, with its coordinates in the document

property lines: List[List[dict]]

Detects lines of the page

Returns: list of lists of dicts, where each list represents a line, and each dict in that list is a word on that line, with its coordinates, ocr_text and word_id_number
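
As an illustration of the shape described above, the sketch below uses a hand-written stand-in for the value of `lines` and joins each line's `ocr_text` values into a string. The sample words, the coordinate key names (`left`, `right`, `top`, `bottom`) and the helper `lines_to_text` are assumptions made for the example, not part of pycognaize.

```python
# Hand-written stand-in for the list-of-lists-of-dicts structure.
# Each inner list is one line; each dict is one word on that line.
sample_lines = [
    [
        {"left": 10, "right": 40, "top": 5, "bottom": 15,
         "ocr_text": "Balance", "word_id_number": 0},
        {"left": 45, "right": 70, "top": 5, "bottom": 15,
         "ocr_text": "Sheet", "word_id_number": 1},
    ],
    [
        {"left": 10, "right": 30, "top": 25, "bottom": 35,
         "ocr_text": "2021", "word_id_number": 2},
    ],
]


def lines_to_text(lines):
    """Join the words of each line with spaces and the lines with newlines."""
    return "\n".join(
        " ".join(word["ocr_text"] for word in line) for line in lines
    )


text = lines_to_text(sample_lines)
```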

property ocr: dict: Formatted ocr of page

property ocr_tags: dict

Makes extraction tags from the words in the OCR data of the page

Returns: dict of lists of tags, where each list represents the formatted OCR of a page, and each tag in that list is the OCR data represented as an ExtractionTag, with its coordinates in the document

property page_number: Page number of page

property path: Path of the source document

search_text(text, case_sensitive=False, sort=False, clean=True, area=None, cleanup_regex=re.compile('[^a-zA-Z\\d)\\[\\](-.,]'), return_tags=False)

Detects the coordinates of the text in the OCR of the page. If the text is not found on the page, returns None.

Parameters:

text (str)
case_sensitive (bool) – If True, the search will be case-sensitive
sort (bool) – If True, ocr_data will be ordered by word_id_number key before searching
clean (bool) – If True, disregard all non-alphanumeric characters in the search
area (dict) – If a dict with coordinates (pixels) is given only search for text in specified area
cleanup_regex (re._pattern_type) – Optional. Provide the regex for cleanup to be used (has effect only if clean=True)
return_tags (bool) – if True, the words in found text are converted into tags.

Returns:

List of dictionaries with word coordinates (keys: left, right, top, bottom, matched_words). matched_words includes the original word coordinate data for the matched words.

Return type:

list
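
To make the effect of `clean` concrete, the sketch below applies the default cleanup pattern with Python's `re` module. It assumes that the default shown in the signature corresponds to the pattern `[^a-zA-Z\d)\[\](-.,]` and that cleanup deletes every matching character. The helper `clean_text` and its lowercasing step for case-insensitive comparison are illustrative and not taken from pycognaize.

```python
import re

# Assumed default cleanup pattern, written as a raw string.
# Note: `(-.` inside the character class is a range, so the characters
# ( ) * + , - . all survive cleanup, along with letters, digits, [ and ].
CLEANUP_REGEX = re.compile(r"[^a-zA-Z\d)\[\](-.,]")


def clean_text(text, case_sensitive=False):
    """Hypothetical helper: strip characters matched by the cleanup regex."""
    cleaned = CLEANUP_REGEX.sub("", text)
    # Assumption: a case-insensitive search compares lowercased strings
    return cleaned if case_sensitive else cleaned.lower()


cleaned = clean_text("Total: $1,234.50 (USD)")
cleaned_cs = clean_text("Total: $1,234.50 (USD)", case_sensitive=True)
```

Under this reading, whitespace, the colon and the currency sign are dropped, so a query and the OCR text can match even when spacing or punctuation differs between them.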

word_to_extraction_tag(word)

Construct ExtractionTag object from word

Parameters: word (dict)
Return type: ExtractionTag