pycognaize.document.page.Page

class Page(page_number, document_id, path, image_height=None, image_width=None)[source]

Bases: object

Represents a page of a document in pycognaize.

Parameters:
  • page_number (int)

  • document_id (str)

  • path (str)

  • image_height (int)

  • image_width (int)

Methods

draw

draw_ocr_boxes

Draw boxes where text was detected and return the modified numpy array image

draw_ocr_text

draw_rectangle

extract_area_words

Finds the words on the page that fall within the area defined by the given coordinates

extract_words_in_tag_area

free_form_text

Return a text string from the ocr dictionary

get_image

Converts the image of the page to bytes

get_ocr

OCR of the page

get_ocr_formatted

Dict of words, paragraphs each containing their tag data

get_page_data

Data of the page

search_text

Detects the coordinates of the text in the OCR of the page

word_to_extraction_tag

Construct ExtractionTag object from word

Attributes

REGEX_NO_ALPHANUM_CHARS

doc_id

Document id of the page

image_arr

Numpy array of the page image

image_bytes

Image of page in bytes

image_height

Height of the page image

image_width

Width of the page image

line_tags

Builds extraction tags from the words in the lines of the page. Returns a list of lists of tags, where each list represents a line and each tag in that list is a word on that line represented as an ExtractionTag, with its coordinates in the document.

lines

Detects lines of the page

ocr

Formatted ocr of page

ocr_raw

ocr_tags

Builds extraction tags from the words in the OCR data of the page. Returns a dict of lists of tags, where each list represents the formatted OCR of a page and each tag in that list is the OCR data represented as an ExtractionTag, with its coordinates in the document.

page_number

Page number of page

path

Path of the source document

property doc_id

Document id of the page

draw_ocr_boxes(img=None)[source]
Draw boxes where text was detected and return the modified numpy array image.

Parameters:

img (Optional[ndarray]) – Input image as numpy array. If not provided, use a copy of the instance image

Return type:

ndarray

Returns:

numpy array of the image with word boxes

extract_area_words(left, right, top, bottom, threshold=0.5, return_tags=False, line_by_line=False)[source]
Finds the words on the page that fall within the area defined by the given coordinates.

Parameters:
  • threshold (float) – Threshold value as a fraction (value between 0 and 1), default value is 0.5

  • left (Union[int, float]) – left coordinate

  • right (Union[int, float]) – right coordinate

  • top (Union[int, float]) – top coordinate

  • bottom (Union[int, float]) – bottom coordinate

  • return_tags (bool) – if True, returns tags of the words embedded in given area

  • line_by_line (bool) – if True, returns a list of lists, where each nested list is a line

Return type:

Optional[list]

Returns:

list of words, where each element is a dictionary holding the word's coordinates, ocr_text, and word_id_number
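
The role of the threshold parameter can be illustrated with a small standalone sketch. It does not call pycognaize and is not the library's implementation: it assumes that the threshold is compared against the fraction of a word's box that lies inside the area, that the comparison is inclusive, and that word dictionaries carry left, right, top, and bottom keys. The helper names overlap_fraction and words_in_area are hypothetical.

```python
from typing import Dict, List, Union

Number = Union[int, float]


def overlap_fraction(word: Dict, left: Number, right: Number,
                     top: Number, bottom: Number) -> float:
    """Fraction of the word's box area that lies inside the given area."""
    inter_w = min(word["right"], right) - max(word["left"], left)
    inter_h = min(word["bottom"], bottom) - max(word["top"], top)
    if inter_w <= 0 or inter_h <= 0:
        return 0.0  # the boxes do not intersect
    word_area = (word["right"] - word["left"]) * (word["bottom"] - word["top"])
    return (inter_w * inter_h) / word_area if word_area > 0 else 0.0


def words_in_area(words: List[Dict], left: Number, right: Number,
                  top: Number, bottom: Number,
                  threshold: float = 0.5) -> List[Dict]:
    """Keep the words whose overlap fraction reaches the threshold (assumed rule)."""
    return [w for w in words
            if overlap_fraction(w, left, right, top, bottom) >= threshold]


# Hypothetical word dictionaries; the key names are assumptions.
sample_words = [
    {"left": 10, "right": 50, "top": 10, "bottom": 20, "ocr_text": "Total"},
    {"left": 45, "right": 85, "top": 10, "bottom": 20, "ocr_text": "Revenue"},
    {"left": 200, "right": 240, "top": 10, "bottom": 20, "ocr_text": "2021"},
]

selected = words_in_area(sample_words, left=0, right=60, top=0, bottom=30)
print([w["ocr_text"] for w in selected])
```

Under these assumptions, a word that is fully inside the area is kept, while one that only pokes into the area by less than half of its box is dropped at the default threshold of 0.5.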

free_form_text()[source]

Return a text string from the ocr dictionary

Return type:

str

get_image()[source]

Converts the image of the page to bytes

Return type:

bytes

get_ocr()[source]

OCR of the page

Return type:

Optional[dict]

get_ocr_formatted(stick_coords=False, return_tags=False)[source]

Dict of words, paragraphs each containing their tag data

Parameters:
  • stick_coords (bool)

  • return_tags (bool)

Return type:

Union[dict, List[ExtractionTag]]

get_page_data()[source]

Data of the page

Return type:

None

property image_arr: ndarray

Numpy array of the page image

property image_bytes: bytes

Image of page in bytes

property image_height: int

Height of the page image

property image_width: int

Width of the page image

property line_tags: list

Builds extraction tags from the words in the lines of the page.

Returns a list of lists of tags, where each list represents a line and each tag in that list is a word on that line represented as an ExtractionTag, with its coordinates in the document.

property lines: List[List[dict]]

Detects lines of the page

Returns:

list of lists of dicts, where each list represents a line, and each dict in that list is a word on that line, with its coordinates, ocr_text and word_id_number
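
The nested structure described above can be consumed with ordinary list iteration. The following sketch joins such a structure into plain text. It works on hand-written sample data, not on a real Page object, and it assumes that ocr_text and word_id_number are the literal dictionary keys. The helper name lines_to_text is hypothetical.

```python
from typing import Dict, List


def lines_to_text(lines: List[List[Dict]]) -> str:
    """Join the words of each line with spaces and the lines with newlines."""
    return "\n".join(
        " ".join(str(word["ocr_text"]) for word in line)
        for line in lines
    )


# Hand-written stand-in for the value of the lines property; key names are assumptions.
sample_lines = [
    [
        {"ocr_text": "Balance", "word_id_number": 0},
        {"ocr_text": "Sheet", "word_id_number": 1},
    ],
    [
        {"ocr_text": "Assets", "word_id_number": 2},
    ],
]

text = lines_to_text(sample_lines)
print(text)
```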

property ocr: dict

Formatted ocr of page

property ocr_tags: dict

Builds extraction tags from the words in the OCR data of the page.

Returns a dict of lists of tags, where each list represents the formatted OCR of a page and each tag in that list is the OCR data represented as an ExtractionTag, with its coordinates in the document.

property page_number

Page number of page

property path

Path of the source document

search_text(text, case_sensitive=False, sort=False, clean=True, area=None, cleanup_regex=re.compile('[^a-zA-Z\\d)\\[\\](-.,]'), return_tags=False)[source]
Detects the coordinates of the text in the OCR of the page.

If the text is not found on the page, returns None.

Parameters:
  • text (str)

  • case_sensitive (bool) – If True, the search will be case-sensitive

  • sort (bool) – If True, ocr_data will be ordered by word_id_number key before searching

  • clean (bool) – If True, disregard all non-alphanumeric characters in the search

  • area (dict) – If a dict with coordinates (pixels) is given, only search for text in the specified area

  • cleanup_regex (re._pattern_type) – Optional. Provide the regex for cleanup to be used (has effect only if clean=True)

  • return_tags (bool) – if True, the words in found text are converted into tags.

Returns:

List of dictionaries with word coordinates (keys: left, right, top, bottom, matched_words). matched_words includes the original word coordinate data for the matched words.

Return type:

list
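
The effect of the clean, cleanup_regex, and case_sensitive parameters can be sketched without the library. The snippet below is an illustration of the normalization idea only, not pycognaize's matching code. It assumes that the default pattern is the one shown in the signature, read with single escaping, and that cleaning means deleting every character the pattern matches. The helper name normalize is hypothetical.

```python
import re

# Assumed reading of the default cleanup pattern from the signature above.
# It keeps letters, digits, square brackets, and the ASCII range "(" to "."
# (that is: ( ) * + , - .) and matches everything else, including spaces.
CLEANUP_REGEX = re.compile(r"[^a-zA-Z\d)\[\](-.,]")


def normalize(text: str, case_sensitive: bool = False, clean: bool = True,
              cleanup_regex: "re.Pattern[str]" = CLEANUP_REGEX) -> str:
    """Prepare a string for comparison the way the parameters suggest."""
    if clean:
        text = cleanup_regex.sub("", text)  # drop characters matched by the pattern
    if not case_sensitive:
        text = text.lower()
    return text


print(normalize("Net  Income: $1,200"))
print(normalize("Net Income", clean=False))
```

With this reading, a query and the OCR text would compare equal even when they differ in spacing, currency symbols, or letter case, which is the behavior the parameter descriptions point to.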

word_to_extraction_tag(word)[source]

Construct ExtractionTag object from word

Parameters:

word (dict)

Return type:

ExtractionTag