pycognaize.document.document.Document

class Document(input_fields, output_fields, pages, classification_labels, html_info, metadata, data_path=None)[source]

Bases: object

Definition of input and output for a single document, depending on a given model

Parameters:

input_fields (FieldCollection[str, List[Field]])
output_fields (FieldCollection[str, List[Field]])
pages (Dict[int, Page])
classification_labels (Dict[str, ClassificationLabels])
html_info (HTML)
metadata (Dict[str, Any])
data_path (Optional[str])

Methods

`fetch_document`	Get the document object, given a document id and recipe id
`from_dict`	Create a Document object from a dictionary
`get_df_with_tied_field_values`	Return the dataframe of the TableTag, where each cell value is replaced with the values of the tied fields
`get_first_tied_field`	Return the first field that is in the same location as the given tag
`get_first_tied_field_value`	Return the value of the first field that is in the same location as the given tag
`get_first_tied_tag`	Return the first tag that is in the same location as the given tag
`get_first_tied_tag_value`	Return the value of the first tag that is in the same location as the given tag
`get_layout_fields`
`get_layout_text`	Return the document text as a list of strings, one per page (see the sample usage in the method description)
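`get_matching_table_cells_for_tag`	Create a list which includes the original extraction tag, the corresponding table tag and Cell objects and the IOU of the intersection
`get_table_cell_overlap`	Create a list which includes the original extraction tag, the corresponding table tag and Cell objects and the IOU of the intersection
`get_tied_fields`	Given an ExtractionTag, return all the fields that contain tags in the same physical location
`get_tied_tags`	Given a single tag, return all other tags in the document that are in the same physical location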
`load_ocr`
`load_page_images`	Get all images of the pages in the document (Using multiprocessing)
`load_page_ocr`	Get all OCR of the pages in the document (Using multiprocessing)
`to_dict`	Converts Document object to dict
`to_pdf`	Adds tags of input_fields and output_fields to the bytes object representing the pdf file of the document.

Attributes

`data_path`	Returns the path to the document data
`document_src`	Returns the source of the document
`html`	Returns HTML object
`id`	Returns the pycognaize id of the document
`is_xbrl`	Returns True if document is XBRL, otherwise False
`metadata`	Returns document metadata
`pages`	Returns a dictionary, where each key is the page number and values are Page objects
`x`	Returns a dictionary, where keys are input field names and values are list of Field objects
`y`	Returns a dictionary, where keys are output field names and values are list of Field objects

property data_path: str | None: Returns the path to the document data

property document_src: Returns the source of the document

classmethod fetch_document(recipe_id, doc_id, api_host=None, x_auth=None)[source]

Get the document object, given a document id and recipe id

Parameters:

recipe_id – ID of the document AI (the second ID in the url)
doc_id – ID of the document (the ID in the annotation view URL of the document)
api_host (Optional[str]) – https://<ENVIRONMENT NAME>-api.cognaize.com. If not provided, defaults to the environment variable "API_HOST"
x_auth (Optional[str]) – X-Authorization token. If not provided, defaults to the environment variable "X_AUTH_TOKEN"
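
The environment-variable fallback described for api_host and x_auth can be sketched as follows. This is a minimal illustration and not pycognaize's own code: `resolve_connection` is a hypothetical helper, and only the variable names `API_HOST` and `X_AUTH_TOKEN` come from the documentation above.

```python
import os
from typing import Optional, Tuple


def resolve_connection(api_host: Optional[str] = None,
                       x_auth: Optional[str] = None) -> Tuple[str, str]:
    """Hypothetical helper: explicit arguments win, otherwise read the environment."""
    host = api_host if api_host is not None else os.environ.get("API_HOST")
    token = x_auth if x_auth is not None else os.environ.get("X_AUTH_TOKEN")
    if not host or not token:
        # Assumption: failing early is preferable to sending a request without credentials.
        raise ValueError("Set API_HOST and X_AUTH_TOKEN, or pass api_host and x_auth")
    return host, token
```

With both variables exported in the shell, `Document.fetch_document(recipe_id=..., doc_id=...)` should then need no explicit host or token.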

classmethod from_dict(raw, data_path)[source]

Create a Document object from a dictionary

Parameters:

raw (dict) – Document dictionary
data_path (str) – Path to the document's OCR and page images

Return type:

Document

get_df_with_tied_field_values(table_tag, pn_filter=<function Document.<lambda>>)[source]

Return the dataframe of the TableTag, where each cell value is replaced with the values of the tied fields (i.e. fields that are in the same physical location as the cell)

Parameters:

table_tag (TableTag) – Input TableTag
pn_filter (Callable) – If provided, only fields with names passing the filter will be considered

Return type:

DataFrame

Returns:

Dataframe of the TableTag
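
The pn_filter argument, which recurs in the methods below, is a predicate over field names. A minimal sketch of that filtering idea, assuming the callable receives a single field name and that the default accepts every name; `filter_by_name` and the sample data are hypothetical stand-ins, not part of pycognaize:

```python
from typing import Callable, Dict, List


def filter_by_name(fields: Dict[str, List[str]],
                   pn_filter: Callable[[str], bool] = lambda pname: True
                   ) -> Dict[str, List[str]]:
    """Hypothetical helper: keep only entries whose field name passes pn_filter."""
    return {pname: values for pname, values in fields.items() if pn_filter(pname)}


# Stand-in data: field names mapped to plain strings instead of Field objects.
fields = {"total_assets": ["1,200"], "table": ["..."], "total_debt": ["300"]}
only_totals = filter_by_name(fields, pn_filter=lambda pname: pname.startswith("total_"))
```

In real use the same kind of lambda would be passed directly, for example `pn_filter=lambda pname: pname.startswith("total_")`.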

get_first_tied_field(tag, pn_filter=<function Document.<lambda>>)[source]

Return the first field that is in the same location as the given tag

Parameters:

tag (ExtractionTag) – Input ExtractionTag
pn_filter (Callable) – If provided, only fields with names passing the filter will be considered

Return type:

Tuple[str, Field]

Returns:

If match found, return Tuple of the matching pname and Field, otherwise return None

get_first_tied_field_value(tag, pn_filter=<function Document.<lambda>>)[source]

Return the value of the first field that is in the same location as the given tag

Parameters:

tag (ExtractionTag) – Input ExtractionTag
pn_filter (Callable) – If provided, only tags that are in fields with names passing the filter will be considered

Returns:

get_first_tied_tag(tag, pn_filter=<function Document.<lambda>>)[source]

Return the first tag that is in the same location as the given tag

Parameters:

tag (ExtractionTag) – Input ExtractionTag
pn_filter (Callable) – If provided, only tags that are in fields with names passing the filter will be considered

Return type:

Tuple[str, ExtractionTag]

Returns:

If match found, return Tuple of the matching pname and ExtractionTag, otherwise return None

get_first_tied_tag_value(tag, pn_filter=<function Document.<lambda>>)[source]

Return the value of the first tag that is in the same location as the given tag

Parameters:

tag (ExtractionTag) – Input ExtractionTag
pn_filter (Callable) – If provided, only tags that are in fields with names passing the filter will be considered

Returns:

get_layout_text(field_type, field_filter=<function Document.<lambda>>, sorting_function=None, table_parser=<bound method TableField.parse_table of <class 'pycognaize.document.field.table_field.TableField'>>)[source]

Sample usage:

```
from pycognaize.document.document import Document
from pycognaize.document.field.table_field import TableField

doc = Document.fetch_document(recipe_id="649a7c0180d898001055a354",
                              doc_id="65db38f7dc54d400119ae1f3")


def parse_table(table_field: TableField) -> str:
    # Use the first row of the first table tag as the header
    df = table_field.tags[0].df
    new_header = df.iloc[0]
    df = df[1:]
    df.columns = new_header
    # Render the table as Markdown text, without the index column
    df_text = df.to_markdown(index=False)
    return df_text


doc_text = doc.get_layout_text(
    field_type="both",
    field_filter=lambda pname, field: pname != 'table',
    sorting_function=lambda x: (x.tags[0].top, x.tags[0].left),
    table_parser=parse_table)

for page_number, page_text in enumerate(doc_text, start=1):
    print(f"---------PAGE {page_number}---------------\n")
    print(page_text)
```

Parameters:

field_type (Literal['input', 'output', 'both'])
field_filter (Callable)
sorting_function (Optional[Callable])
table_parser (Callable)

Return type:

list[str]
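
The sorting_function in the sample usage orders fields by the top and then the left coordinate of their first tag, which approximates reading order. The sketch below shows that ordering on stand-in objects, assuming the callable is applied as a sort key; `FakeTag` and `FakeField` are hypothetical and only mimic the `tags[0].top` and `tags[0].left` attributes used in the sample.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class FakeTag:
    top: float
    left: float


@dataclass
class FakeField:
    name: str
    tags: List[FakeTag]


def reading_order(field: FakeField) -> Tuple[float, float]:
    """Sort key: top edge first, then left edge of the field's first tag."""
    return (field.tags[0].top, field.tags[0].left)


fields = [
    FakeField("footer", [FakeTag(top=90.0, left=10.0)]),
    FakeField("right_column", [FakeTag(top=20.0, left=60.0)]),
    FakeField("left_column", [FakeTag(top=20.0, left=5.0)]),
]
ordered = [f.name for f in sorted(fields, key=reading_order)]
```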

static get_matching_table_cells_for_tag(tag, table_tags, one_to_one)[source]

Create a list which includes the original extraction tag, the corresponding table tag and Cell objects and the IOU of the intersection

Parameters:

tag (BoxTag) – The tag for which matching table and cells should be found
table_tags (List[TableTag]) – List of `table_tag`s
one_to_one (bool) – If true, for each tag only one corresponding cell will be returned

Return type:

List[Tuple[BoxTag, TableTag, Cell, float]]

Returns:

List of tuples, which include the original extraction tag, the corresponding table tag and Cell objects and the IOU of the intersection

get_table_cell_overlap(source_field, one_to_one)[source]

Create a list which includes the original extraction tag, the corresponding table tag and Cell objects and the IOU of the intersection

Parameters:

source_field (str) – Name of the field, for which to return the corresponding table cells
one_to_one (bool) – If true, for each tag only one corresponding cell will be returned

Return type:

List[Tuple[BoxTag, TableTag, Cell, float]]

Returns:

List of tuples, which include the original extraction tag, the corresponding table tag and Cell objects and the IOU of the intersection
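
The documentation does not say which cell is kept when one_to_one is true. The sketch below assumes the highest-IoU cell wins and works on plain tuples shaped like the return type; `best_match_per_tag` is a hypothetical helper and strings stand in for the BoxTag, TableTag and Cell objects.

```python
from typing import Dict, List, Tuple

# (tag, table, cell, iou), with strings as stand-ins for the real objects
Match = Tuple[str, str, str, float]


def best_match_per_tag(matches: List[Match]) -> List[Match]:
    """Keep, for each tag, the single match with the highest IoU."""
    best: Dict[str, Match] = {}
    for match in matches:
        tag, _table, _cell, iou = match
        if tag not in best or iou > best[tag][3]:
            best[tag] = match
    return list(best.values())


matches = [
    ("tag_a", "table_1", "cell_0_0", 0.35),
    ("tag_a", "table_1", "cell_0_1", 0.80),
    ("tag_b", "table_1", "cell_2_3", 0.60),
]
one_to_one = best_match_per_tag(matches)
```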

get_tied_fields(tag, field_type='both', threshold=0.5, pn_filter=<function Document.<lambda>>)[source]

Given an ExtractionTag, return all the fields that contain tags in the same physical location.

Parameters:

tag (ExtractionTag) – Input ExtractionTag
field_type (str) – Types of fields to consider {input/output/both}
threshold (float) – The IoU threshold to consider the tags in the same location
pn_filter (Callable) – If provided, only fields with names passing the filter will be considered

Return type:

Dict[str, List[Field]]

Returns:

Dictionary where key is pname and value is List of Field objects

get_tied_tags(tag, field_type='both', threshold=0.9, pn_filter=<function Document.<lambda>>)[source]

Given a single tag, return all other tags in the document that are in the same physical location in the document

Parameters:

tag (ExtractionTag) – Input ExtractionTag
field_type (str) – Types of fields to consider {input/output/both}
threshold (float) – The IoU threshold to consider the tags in the same location
pn_filter (Callable) – If provided, only tags that are in fields with names passing the filter will be considered

Return type:

Dict[str, List[ExtractionTag]]

Returns:

Dictionary where key is pname and value is List of ExtractionTag objects
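
The threshold arguments of get_tied_fields (default 0.5) and get_tied_tags (default 0.9) are IoU (intersection over union) cut-offs. The sketch below uses the standard IoU definition for axis-aligned rectangles. It is an illustration only: the box layout `(left, top, right, bottom)` and the `>=` comparison are assumptions, and pycognaize's own computation may differ.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (left, top, right, bottom)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned rectangles."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    intersection = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - intersection
    return intersection / union if union > 0 else 0.0


box_a = (0.0, 0.0, 10.0, 10.0)
box_b = (0.0, 0.0, 10.0, 6.0)  # lies inside box_a and covers 60% of it
overlap = iou(box_a, box_b)

tied_at_fields_default = overlap >= 0.5  # default threshold of get_tied_fields
tied_at_tags_default = overlap >= 0.9    # default threshold of get_tied_tags
```

Under these assumptions the same pair of boxes counts as tied at the looser default and not at the stricter one, so the threshold is worth setting explicitly when annotations are drawn loosely.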

property html: Returns HTML object

property id: str: Returns the pycognaize id of the document

property is_xbrl: bool: Returns True if document is XBRL, otherwise False

load_page_images(page_filter=<function Document.<lambda>>)[source]

Get all images of the pages in the document (Using multiprocessing)

Parameters:

page_filter (Callable)

Return type:

None

load_page_ocr(page_filter=<function Document.<lambda>>, stick_coords=False)[source]

Get all OCR of the pages in the document (Using multiprocessing)

Parameters:

page_filter (Callable)
stick_coords (bool)

Return type:

None

property metadata: Dict[str, Any]: Returns document metadata

property pages: Dict[int, Page]: Returns a dictionary, where each key is the page number and values are Page objects

to_dict()[source]

Converts Document object to dict

Return type:

dict

to_pdf(input_fields=None, output_fields=None, input_color='deeppink1', output_color='deepskyblue3', input_opacity=0.2, output_opacity=0.3)[source]

Adds tags of input_fields and output_fields to the bytes object representing the pdf file of the document.

Parameters:

input_fields (Optional[List[str]]) – Input fields
output_fields (Optional[List[str]]) – Output fields
input_color (str) – The color of the annotation rectangle of the input field
output_color (str) – The color of the annotation rectangle of the output field
input_opacity (float) – The opacity of the annotation rectangle of the input field
output_opacity (float) – The opacity of the annotation rectangle of the output field

Return type:

bytes

Returns:

bytes object of the pdf
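
Because to_pdf returns raw bytes, saving the annotated file is left to the caller. A minimal sketch using only the standard library; `save_pdf_bytes` is a hypothetical helper, and the placeholder bytes stand in for the result of a real `doc.to_pdf(...)` call.

```python
import tempfile
from pathlib import Path


def save_pdf_bytes(pdf_bytes: bytes, target: Path) -> Path:
    """Write PDF bytes to disk, creating parent directories if needed."""
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_bytes(pdf_bytes)
    return target


# Placeholder content; in real use this would be the return value of doc.to_pdf(...).
pdf_bytes = b"%PDF-1.4\n% placeholder\n"
out_path = save_pdf_bytes(pdf_bytes, Path(tempfile.mkdtemp()) / "annotated" / "document.pdf")
```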

property x: FieldCollection[str, List[Field]]: Returns a dictionary, where keys are input field names and values are list of Field objects

property y: FieldCollection[str, List[Field]]: Returns a dictionary, where keys are output field names and values are list of Field objects