pycognaize.document.document.Document

class Document(input_fields, output_fields, pages, classification_labels, html_info, metadata, data_path=None)[source]

Bases: object

Definition of input and output for a single document, depending on a given model

Parameters:

Methods

fetch_document

Get the document object, given a document id and recipe id

from_dict

Create a Document object from a document dictionary and the path to the document's OCR and page images

get_df_with_tied_field_values

Return the dataframe of the TableTag, where each cell value is replaced with the values of the fields tied to it

get_first_tied_field

Return the first field that is in the same location as the given tag

get_first_tied_field_value

Return the value of the first field that is in the same location as the given tag

get_first_tied_tag

Return the first tag that is in the same location as the given tag

get_first_tied_tag_value

Return the value of the first tag that is in the same location as the given tag

get_layout_fields

get_layout_text

Sample usage:

get_matching_table_cells_for_tag

Create a list which includes the original extraction tag, the corresponding table tag and Cell objects and the IOU of the intersection

get_table_cell_overlap

Create a list which includes the original extraction tag, the corresponding table tag and Cell objects and the IOU of the intersection

get_tied_fields

Given an ExtractionTag, return all the fields that contain tags in the same physical location

get_tied_tags

Given a single tag, return all other tags in the document that are in the same physical location

load_ocr

load_page_images

Get all images of the pages in the document (Using multiprocessing)

load_page_ocr

Get all OCR of the pages in the document (Using multiprocessing)

to_dict

Converts Document object to dict

to_pdf

Adds tags of input_fields and output_fields to the bytes object representing the pdf file of the document.

Attributes

data_path

Returns the path to the document data

document_src

Returns the source of the document

html

Returns HTML object

id

Returns the pycognaize id of the document

is_xbrl

Returns True if document is XBRL, otherwise False

metadata

Returns document metadata

pages

Returns a dictionary, where each key is the page number and values are Page objects

x

Returns a dictionary, where keys are input field names and values are list of Field objects

y

Returns a dictionary, where keys are output field names and values are list of Field objects

property data_path: str | None

Returns the path to the document data

property document_src

Returns the source of the document

classmethod fetch_document(recipe_id, doc_id, api_host=None, x_auth=None)[source]

Get the document object, given a document id and recipe id

Parameters:
  • recipe_id – ID of the document AI (the second ID in the URL)

  • doc_id – ID of the document (the ID in the annotation view URL of the document)

  • api_host (Optional[str]) – https://<ENVIRONMENT NAME>-api.cognaize.com. If not provided, defaults to the environment variable "API_HOST"

  • x_auth (Optional[str]) – X-Authorization token. If not provided, defaults to the environment variable "X_AUTH_TOKEN"
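The fallback described for api_host and x_auth can be pictured with the minimal sketch below. It is not the library's implementation: `resolve_connection` is a hypothetical helper, and the error raised when a value is missing is an assumption made for illustration. Only the two environment variable names come from the parameter descriptions above.

```python
import os
from typing import Optional, Tuple


def resolve_connection(api_host: Optional[str] = None,
                       x_auth: Optional[str] = None) -> Tuple[str, str]:
    """Hypothetical helper: explicit arguments win, otherwise read the environment."""
    host = api_host or os.environ.get("API_HOST")
    token = x_auth or os.environ.get("X_AUTH_TOKEN")
    if not host or not token:
        # Assumed behaviour for this sketch only; the library may react differently.
        raise ValueError("Set API_HOST and X_AUTH_TOKEN, or pass both arguments")
    return host, token
```

In practice this means that exporting API_HOST and X_AUTH_TOKEN once lets later calls pass only recipe_id and doc_id.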

classmethod from_dict(raw, data_path)[source]

Create a Document object from a dictionary.

Return type:

Document

Parameters:
  • raw (dict) – document dictionary

  • data_path (str) – path to the document's OCR and page images

get_df_with_tied_field_values(table_tag, pn_filter=<function Document.<lambda>>)[source]

Return the dataframe of the TableTag, where each cell value is replaced with the values of the fields tied to it (i.e. fields that are in the same physical location as the cell)

Parameters:
  • table_tag (TableTag) – Input TableTag

  • pn_filter (Callable) – If provided, only fields with names passing the filter will be considered

Return type:

DataFrame

Returns:

Dataframe of the TableTag
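The pn_filter argument, used here and in the tied-field and tied-tag methods that follow, selects fields by name. The sketch below shows the idea on plain dictionaries. It assumes pn_filter is a callable that receives a field name and returns a bool, which the parameter description suggests but does not state; `filter_by_pname` and the sample field names are hypothetical stand-ins, not part of pycognaize.

```python
from typing import Callable, Dict, List


def filter_by_pname(fields: Dict[str, List[str]],
                    pn_filter: Callable[[str], bool] = lambda pname: True
                    ) -> Dict[str, List[str]]:
    """Keep only the entries whose field name passes pn_filter."""
    return {pname: values for pname, values in fields.items()
            if pn_filter(pname)}


# Stand-in data: field names mapped to placeholder strings, not real Field objects.
fields = {
    "total_assets": ["1,200"],
    "table": ["<table>"],
    "total_liabilities": ["800"],
}

# Only field names starting with "total_" are considered.
kept = filter_by_pname(fields, pn_filter=lambda pname: pname.startswith("total_"))
```

With the default filter every field name passes, so nothing is excluded.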

get_first_tied_field(tag, pn_filter=<function Document.<lambda>>)[source]

Return the first field that is in the same location as the given tag

Parameters:
  • tag (ExtractionTag) – Input ExtractionTag

  • pn_filter (Callable) – If provided, only fields with names passing the filter will be considered

Return type:

Tuple[str, Field]

Returns:

If match found, return Tuple of the matching pname and Field, otherwise return None

get_first_tied_field_value(tag, pn_filter=<function Document.<lambda>>)[source]

Return the value of the first field that is in the same location as the given tag

Parameters:
  • tag (ExtractionTag) – Input ExtractionTag

  • pn_filter (Callable) – If provided, only tags that are in fields with names passing the filter will be considered

Returns:

get_first_tied_tag(tag, pn_filter=<function Document.<lambda>>)[source]

Return the first tag that is in the same location as the given tag

Parameters:
  • tag (ExtractionTag) – Input ExtractionTag

  • pn_filter (Callable) – If provided, only tags that are in fields with names passing the filter will be considered

Return type:

Tuple[str, ExtractionTag]

Returns:

If match found, return Tuple of the matching pname and ExtractionTag, otherwise return None

get_first_tied_tag_value(tag, pn_filter=<function Document.<lambda>>)[source]

Return the value of the first tag that is in the same location as the given tag

Parameters:
  • tag (ExtractionTag) – Input ExtractionTag

  • pn_filter (Callable) – If provided, only tags that are in fields with names passing the filter will be considered

Returns:

get_layout_text(field_type, field_filter=<function Document.<lambda>>, sorting_function=None, table_parser=<bound method TableField.parse_table of <class 'pycognaize.document.field.table_field.TableField'>>)[source]

Sample usage:

```
from pycognaize.document.document import Document
from pycognaize.document.field.table_field import TableField

doc = Document.fetch_document(recipe_id="649a7c0180d898001055a354",
                              doc_id="65db38f7dc54d400119ae1f3")


def parse_table(table_field: TableField) -> str:
    # Promote the first row of the table to the header, then render as markdown
    df = table_field.tags[0].df
    new_header = df.iloc[0]
    df = df[1:]
    df.columns = new_header
    df_text = df.to_markdown(index=False)
    return df_text


doc_text = doc.get_layout_text(
    field_type="both",
    field_filter=lambda pname, field: pname != 'table',
    sorting_function=lambda x: (x.tags[0].top, x.tags[0].left),
    table_parser=parse_table)

# Print the text of each page in turn
for page_number, page_text in enumerate(doc_text, start=1):
    print(f"---------PAGE {page_number}---------------\n")
    print(page_text)
```

Parameters:
  • field_type (Literal['input', 'output', 'both'])

  • field_filter (Callable)

  • sorting_function (Optional[Callable])

  • table_parser (Callable)

Return type:

list[str]

static get_matching_table_cells_for_tag(tag, table_tags, one_to_one)[source]

Create a list which includes the original extraction tag, the corresponding table tag and Cell objects and the IOU of the intersection

Parameters:
  • tag (BoxTag) – The tag for which matching table and cells should be found

  • table_tags (List[TableTag]) – List of `table_tag`s

  • one_to_one (bool) – If true, for each tag only one corresponding cell will be returned

Return type:

List[Tuple[BoxTag, TableTag, Cell, float]]

Returns:

List of tuples, which include the original extraction tag, the corresponding table tag and Cell objects and the IOU of the intersection

get_table_cell_overlap(source_field, one_to_one)[source]

Create a list which includes the original extraction tag, the corresponding table tag and Cell objects and the IOU of the intersection

Parameters:
  • source_field (str) – Name of the field, for which to return the corresponding table cells

  • one_to_one (bool) – If true, for each tag only one corresponding cell will be returned

Return type:

List[Tuple[BoxTag, TableTag, Cell, float]]

Returns:

List of tuples, which include the original extraction tag, the corresponding table tag and Cell objects and the IOU of the intersection
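The effect of one_to_one on the tuples returned by the two methods above can be illustrated as follows. The documentation only says that one cell is returned per tag; which cell is chosen is not stated, so keeping the highest-IoU match is an assumption of this sketch. `best_match_per_tag` is a hypothetical helper, and plain strings stand in for the BoxTag, TableTag and Cell objects.

```python
from typing import Dict, List, Tuple

# (tag, table, cell, IoU) with plain labels standing in for the real objects
Match = Tuple[str, str, str, float]


def best_match_per_tag(matches: List[Match]) -> List[Match]:
    """Keep a single match per tag, assuming the highest IoU wins."""
    best: Dict[str, Match] = {}
    for match in matches:
        tag = match[0]
        if tag not in best or match[3] > best[tag][3]:
            best[tag] = match
    return list(best.values())


matches = [
    ("tag_a", "table_1", "cell_r1c1", 0.15),
    ("tag_a", "table_1", "cell_r1c2", 0.70),
    ("tag_b", "table_1", "cell_r2c1", 0.55),
]

# tag_a overlaps two cells; only its stronger match is kept.
reduced = best_match_per_tag(matches)
```

When one_to_one is false, the full list (here, all three tuples) would be the analogous result.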

get_tied_fields(tag, field_type='both', threshold=0.5, pn_filter=<function Document.<lambda>>)[source]

Given an ExtractionTag, return all the fields that contain tags in the same physical location.

Parameters:
  • tag (ExtractionTag) – Input ExtractionTag

  • field_type (str) – Types of fields to consider {input/output/both}

  • threshold (float) – The IoU threshold to consider the tags in the same location

  • pn_filter (Callable) – If provided, only fields with names passing the filter will be considered

Return type:

Dict[str, List[Field]]

Returns:

Dictionary where key is pname and value is List of Field objects

get_tied_tags(tag, field_type='both', threshold=0.9, pn_filter=<function Document.<lambda>>)[source]

Given a single tag, return all other tags in the document that are in the same physical location in the document

Parameters:
  • tag (ExtractionTag) – Input ExtractionTag

  • field_type (str) – Types of fields to consider {input/output/both}

  • threshold (float) – The IoU threshold to consider the tags in the same location

  • pn_filter (Callable) – If provided, only tags that are in fields with names passing the filter will be considered

Return type:

Dict[str, List[ExtractionTag]]

Returns:

Dictionary where key is pname and value is List of ExtractionTag objects
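The threshold parameter of get_tied_fields and get_tied_tags is an IoU (intersection over union) value. The sketch below uses the standard IoU definition for axis-aligned boxes to show why the two defaults (0.5 and 0.9) behave differently. How pycognaize computes IoU internally is not shown in this reference, so the `iou` function and the `(left, top, right, bottom)` box layout are assumptions for illustration.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (left, top, right, bottom), assumed layout


def iou(a: Box, b: Box) -> float:
    """Standard intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


tag_box = (0.0, 0.0, 10.0, 10.0)
other_box = (2.0, 0.0, 12.0, 10.0)  # same size, shifted right by 2 units

# Intersection 80, union 120, so the score is 2/3.
score = iou(tag_box, other_box)
```

A score of 2/3 clears the 0.5 default of get_tied_fields but falls short of the 0.9 default of get_tied_tags, so the stricter method requires the boxes to coincide almost exactly.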

property html

Returns HTML object

property id: str

Returns the pycognaize id of the document

property is_xbrl: bool

Returns True if document is XBRL, otherwise False

load_page_images(page_filter=<function Document.<lambda>>)[source]

Get all images of the pages in the document (Using multiprocessing)

Parameters:
  • page_filter (Callable)

Return type:

None

load_page_ocr(page_filter=<function Document.<lambda>>, stick_coords=False)[source]

Get all OCR of the pages in the document (Using multiprocessing)

Parameters:
  • page_filter (Callable)

  • stick_coords (bool)

Return type:

None

property metadata: Dict[str, Any]

Returns document metadata

property pages: Dict[int, Page]

Returns a dictionary, where each key is the page number and values are Page objects

to_dict()[source]

Converts Document object to dict

Return type:

dict

to_pdf(input_fields=None, output_fields=None, input_color='deeppink1', output_color='deepskyblue3', input_opacity=0.2, output_opacity=0.3)[source]

Adds tags of input_fields and output_fields to the bytes object representing the pdf file of the document.

Parameters:
  • input_fields (Optional[List[str]]) – Input fields

  • output_fields (Optional[List[str]]) – Output fields

  • input_color (str) – The color of the annotation rectangle of the input field

  • output_color (str) – The color of the annotation rectangle of the output field

  • input_opacity (float) – The opacity of the annotation rectangle of the input field

  • output_opacity (float) – The opacity of the annotation rectangle of the output field

Return type:

bytes

Returns:

bytes object of the pdf
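Because to_pdf returns raw bytes, the result usually has to be written to disk before it can be opened in a viewer. A minimal sketch of that step, using only the standard library; `save_pdf` is a hypothetical helper and not part of pycognaize.

```python
from pathlib import Path


def save_pdf(pdf_bytes: bytes, path: str) -> Path:
    """Write the bytes of an annotated PDF to disk and return the target path."""
    target = Path(path)
    target.write_bytes(pdf_bytes)
    return target
```

With a loaded document this could be called as `save_pdf(doc.to_pdf(), "annotated.pdf")`, where the file name is only an example.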

property x: FieldCollection[str, List[Field]]

Returns a dictionary, where keys are input field names and values are list of Field objects

property y: FieldCollection[str, List[Field]]

Returns a dictionary, where keys are output field names and values are list of Field objects