pycognaize.document.page.Page
- class Page(page_number, document_id, path, image_height=None, image_width=None)
Bases: object
Represents a page of a document in pycognaize.
- Parameters:
page_number (int)
document_id (str)
path (str)
image_height (int)
image_width (int)
Methods
draw
Draw boxes where text was detected and return the modified numpy array image
draw_ocr_text
draw_rectangle
Finds the words on the page that fall within the area defined by the given coordinates
extract_words_in_tag_area
Return a text string from the ocr dictionary
Converts image of page in bytes
OCR of the page
Dict of words, paragraphs each containing their tag data
Data of the page
Detects the coordinates of the text in ocr of the page
Construct ExtractionTag object from word
Attributes
REGEX_NO_ALPHANUM_CHARS
doc_id – Document id of the page
image_arr – Numpy array of the page image
image_bytes – Image of page in bytes
image_height – Height of the page image
image_width – Width of the page image
line_tags – Word extraction tags grouped by line: a list of lists of tags, where each list represents a line and each tag is a word on that line, represented as an ExtractionTag with its coordinates in the document
lines – Detects lines of the page
ocr – Formatted ocr of page
ocr_raw
ocr_tags – Extraction tags built from the OCR data: a dict of lists of tags, where each list represents the formatted OCR of a page and each tag is the OCR data represented as an ExtractionTag with its coordinates in the document
page_number – Page number of page
path – Path of the source document
- property doc_id
Document id of the page
- draw_ocr_boxes(img=None)
Draw boxes where text was detected and return the modified numpy array image.
- Parameters:
img (Optional[ndarray]) – Input image as numpy array. If not provided, a copy of the instance image is used.
- Return type:
ndarray
- Returns:
numpy array of the image with word boxes
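The idea behind draw_ocr_boxes can be illustrated without the library. The sketch below is not pycognaize code: it uses a nested list as a stand-in for the numpy image, and it assumes each word is a dict with pixel keys left, right, top and bottom (the sample data and the function name draw_word_boxes are hypothetical). Like the documented method, it draws on a copy rather than on the input.

```python
import copy


def draw_word_boxes(image, words, value=1):
    """Return a copy of `image` with a one-pixel outline around each word box.

    `image` is a list of rows (a stand-in for a 2-D numpy array); `words` are
    dicts with pixel keys left/right/top/bottom (an assumed layout).
    """
    out = copy.deepcopy(image)  # never modify the caller's image
    height, width = len(out), len(out[0])
    for word in words:
        # Clamp the box to the image bounds.
        left = max(0, int(word["left"]))
        right = min(width - 1, int(word["right"]))
        top = max(0, int(word["top"]))
        bottom = min(height - 1, int(word["bottom"]))
        if left > right or top > bottom:
            continue  # the box lies entirely outside the image
        for x in range(left, right + 1):  # top and bottom edges
            out[top][x] = value
            out[bottom][x] = value
        for y in range(top, bottom + 1):  # left and right edges
            out[y][left] = value
            out[y][right] = value
    return out


blank = [[0] * 6 for _ in range(5)]
boxed = draw_word_boxes(blank, [{"left": 1, "right": 4, "top": 1, "bottom": 3}])
```

With a real Page, the equivalent call would presumably be page.draw_ocr_boxes(), which returns the annotated ndarray.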
- extract_area_words(left, right, top, bottom, threshold=0.5, return_tags=False, line_by_line=False)
Finds the words on the page that fall within the area defined by the given coordinates.
- Parameters:
left (int or float) – left coordinate
right (int or float) – right coordinate
top (int or float) – top coordinate
bottom (int or float) – bottom coordinate
threshold (float) – Threshold value as a fraction (between 0 and 1); the default is 0.5
return_tags (bool) – if True, returns tags of the words contained in the given area
line_by_line (bool) – if True, returns a list of lists, where each nested list is a line
- Return type:
Optional[list]
- Returns:
list of words; each element is a dictionary holding the word's coordinates, ocr_text and word_id_number
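A threshold-based area filter of this kind can be sketched in plain Python. This is not the library's implementation: the reference does not say how threshold is applied, so the sketch assumes it is the minimum fraction of a word's own box that must lie inside the area. The helper names and sample words are hypothetical; the dict keys follow the ones named in this reference.

```python
def overlap_fraction(word, left, right, top, bottom):
    """Fraction of the word's box area that lies inside the given rectangle."""
    inter_w = min(word["right"], right) - max(word["left"], left)
    inter_h = min(word["bottom"], bottom) - max(word["top"], top)
    if inter_w <= 0 or inter_h <= 0:
        return 0.0  # no overlap at all
    word_area = (word["right"] - word["left"]) * (word["bottom"] - word["top"])
    return (inter_w * inter_h) / word_area if word_area > 0 else 0.0


def words_in_area(words, left, right, top, bottom, threshold=0.5):
    """Keep words whose overlap with the area reaches the threshold (assumed rule)."""
    return [
        w for w in words
        if overlap_fraction(w, left, right, top, bottom) >= threshold
    ]


sample_words = [
    {"word_id_number": 0, "ocr_text": "Invoice", "left": 10, "right": 50, "top": 10, "bottom": 20},
    {"word_id_number": 1, "ocr_text": "Total", "left": 40, "right": 80, "top": 10, "bottom": 20},
    {"word_id_number": 2, "ocr_text": "Due", "left": 200, "right": 230, "top": 10, "bottom": 20},
]

# "Total" is 75% inside the area, so it passes at 0.5 but not at 0.9.
selected = words_in_area(sample_words, left=0, right=70, top=0, bottom=30)
strict = words_in_area(sample_words, left=0, right=70, top=0, bottom=30, threshold=0.9)
```

Raising the threshold makes the filter stricter about words that straddle the edge of the area.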
- get_ocr_formatted(stick_coords=False, return_tags=False)
Dict of words, paragraphs each containing their tag data
- Parameters:
stick_coords (bool)
return_tags (bool)
- Return type:
Union[dict, List[ExtractionTag]]
- property image_arr: ndarray
Numpy array of the page image
- property image_bytes: bytes
Image of page in bytes
- property image_height: int
Height of the page image
- property image_width: int
Width of the page image
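The simple properties (doc_id, page_number, image_width, image_height) can be read like ordinary attributes. The helper below is a hypothetical sketch that accepts any object exposing those names; it is exercised with a stub rather than a real Page, and the stub values are made up.

```python
from types import SimpleNamespace


def describe_page(page):
    """Collect basic metadata from any object exposing the documented properties."""
    return {
        "doc_id": page.doc_id,
        "page_number": page.page_number,
        "size": (page.image_width, page.image_height),
        "landscape": page.image_width > page.image_height,
    }


# Stub standing in for a real Page instance.
stub_page = SimpleNamespace(
    doc_id="doc-123", page_number=1, image_width=1700, image_height=2200
)
info = describe_page(stub_page)
```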
- property line_tags: list
Builds extraction tags for the words in each line of the page.
- Returns:
list of lists of tags, where each list represents a line and each tag is a word on that line, represented as an ExtractionTag with its coordinates in the document
- property lines: List[List[dict]]
Detects lines of the page
- Returns:
list of lists of dicts, where each list represents a line, and each dict in that list is a word on that line, with its coordinates, ocr_text and word_id_number
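Because lines is a list of lists of word dicts, plain text can be rebuilt with two nested joins. A minimal sketch, assuming each word dict carries an ocr_text key as described above; the sample data and the helper name lines_to_text are hypothetical.

```python
def lines_to_text(lines):
    """Join the ocr_text of each word, producing one text line per detected line."""
    return "\n".join(
        " ".join(word["ocr_text"] for word in line) for line in lines
    )


# Sample data shaped like the documented return value (coordinates omitted).
sample_lines = [
    [
        {"word_id_number": 0, "ocr_text": "Invoice"},
        {"word_id_number": 1, "ocr_text": "No."},
        {"word_id_number": 2, "ocr_text": "42"},
    ],
    [
        {"word_id_number": 3, "ocr_text": "Total"},
        {"word_id_number": 4, "ocr_text": "due"},
    ],
]

page_text = lines_to_text(sample_lines)
```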
- property ocr: dict
Formatted ocr of page
- property ocr_tags: dict
Builds extraction tags from the OCR data of the page.
- Returns:
dict of lists of tags, where each list represents the formatted OCR of a page and each tag is the OCR data represented as an ExtractionTag with its coordinates in the document
- property page_number
Page number of page
- property path
Path of the source document
- search_text(text, case_sensitive=False, sort=False, clean=True, area=None, cleanup_regex=re.compile('[^a-zA-Z\\d)\\[\\](-.,]'), return_tags=False)
Detects the coordinates of the text in the OCR of the page. Returns None if the text is not found on the page.
- Parameters:
text (str)
case_sensitive – If True, the search is case-sensitive
sort (bool) – If True, ocr_data is ordered by the word_id_number key before searching
clean (bool) – If True, disregard all non-alphanumeric characters in the search
area (dict) – If a dict with coordinates (in pixels) is given, search for text only in the specified area
cleanup_regex (re._pattern_type) – Optional. The regex used for cleanup (has an effect only if clean=True)
return_tags (bool) – If True, the words in the found text are converted into tags
- Returns:
List of dictionaries with word coordinates (keys: left, right, top, bottom, matched_words). matched_words includes the original word coordinate data for the matched words
- Return type:
list
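The search behaviour described above can be approximated in a few lines. This is a simplified sketch, not the pycognaize implementation: it assumes the text is matched word by word against consecutive OCR words, that cleanup is applied per word, and that the default pattern is equivalent to the raw string used here. The function name find_text and the sample words are hypothetical, and the sort, area and return_tags options are left out.

```python
import re

# Assumed equivalent of the documented default cleanup pattern.
CLEANUP = re.compile(r"[^a-zA-Z\d)\[\](-.,]")


def find_text(words, text, case_sensitive=False, clean=True, cleanup_regex=CLEANUP):
    """Return bounding boxes of consecutive words matching `text`, or None."""

    def norm(s):
        if clean:
            s = cleanup_regex.sub("", s)  # drop characters outside the allowed set
        return s if case_sensitive else s.lower()

    target = [t for t in (norm(tok) for tok in text.split()) if t]
    if not target:
        return None
    tokens = [norm(w["ocr_text"]) for w in words]
    n = len(target)
    results = []
    for i in range(len(tokens) - n + 1):
        if tokens[i:i + n] == target:
            matched = words[i:i + n]
            results.append({
                "left": min(w["left"] for w in matched),
                "right": max(w["right"] for w in matched),
                "top": min(w["top"] for w in matched),
                "bottom": max(w["bottom"] for w in matched),
                "matched_words": matched,
            })
    return results or None


sample = [
    {"ocr_text": "Total", "left": 10, "right": 50, "top": 10, "bottom": 20},
    {"ocr_text": "Due:", "left": 55, "right": 90, "top": 10, "bottom": 20},
    {"ocr_text": "$1,000", "left": 95, "right": 140, "top": 10, "bottom": 20},
    {"ocr_text": "total", "left": 10, "right": 45, "top": 40, "bottom": 50},
    {"ocr_text": "due", "left": 50, "right": 75, "top": 40, "bottom": 50},
]

hits = find_text(sample, "total due")
exact = find_text(sample, "Total Due", case_sensitive=True)
missing = find_text(sample, "refund")
```

Each hit mirrors the documented return shape: one bounding box per match plus the original word dicts under matched_words.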