pycognaize.document.document.Document
- class Document(input_fields, output_fields, pages, classification_labels, html_info, metadata, data_path=None)[source]
Bases: object
Definition of input and output for a single document, depending on a given model
- Parameters:
input_fields (FieldCollection[str, List[Field]])
output_fields (FieldCollection[str, List[Field]])
pages (Dict[int, Page])
classification_labels (Dict[str, ClassificationLabels])
html_info (HTML)
metadata (Dict[str, Any])
data_path (Optional[str])
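Methods
fetch_document – Get the document object, given a document id and recipe id
from_dict – Document object created from data of dict
get_df_with_tied_field_values – Return the dataframe of the TableTag, where each cell value is replaced with the values in the fields of tied values
get_first_tied_field – Return the first field that is in the same location as the given tag
get_first_tied_field_value – Return the value of the first field that is in the same location as the given tag
get_first_tied_tag – Return the first tag that is in the same location as the given tag
get_first_tied_tag_value – Return the value of the first tag that is in the same location as the given tag
get_layout_fields
get_layout_text – Sample usage is given in the method entry below
get_matching_table_cells_for_tag – Create a list which includes the original extraction tag, the corresponding table tag and Cell objects and the IOU of the intersection
get_table_cell_overlap – Create a list which includes the original extraction tag, the corresponding table tag and Cell objects and the IOU of the intersection
get_tied_fields – Given an ExtractionTag, return all the fields that contain tags in the same physical location
get_tied_tags – Given a single tag, return all other tags in the document that are in the same physical location in the document
load_ocr
load_page_images – Get all images of the pages in the document (Using multiprocessing)
load_page_ocr – Get all OCR of the pages in the document (Using multiprocessing)
Converts Document object to dict
to_pdf – Adds tags of input_fields and output_fields to the bytes object representing the pdf file of the document.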
Attributes
data_path – Returns the path to the document data
document_src – Returns the source of the document
html – Returns HTML object
id – Returns the pycognaize id of the document
is_xbrl – Returns True if document is XBRL, otherwise False
metadata – Returns document metadata
pages – Returns a dictionary, where each key is the page number and values are Page objects
x – Returns a dictionary, where keys are input field names and values are list of Field objects
y – Returns a dictionary, where keys are output field names and values are list of Field objects
- property data_path: str | None
Returns the path to the document data
- property document_src
Returns the source of the document
- classmethod fetch_document(recipe_id, doc_id, api_host=None, x_auth=None)[source]
Get the document object, given a document id and recipe id
- Parameters:
recipe_id – ID of the document AI (the second ID in the url)
doc_id – ID of the document (the ID in the annotation view URL of the document)
api_host (Optional[str]) – https://<ENVIRONMENT NAME>-api.cognaize.com. If not provided, will default to the environment variable "API_HOST"
x_auth (Optional[str]) – X-Authorization token. If not provided, will default to the environment variable "X_AUTH_TOKEN"
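The fallback described for api_host and x_auth can be illustrated without contacting the service. The following is a minimal sketch, not pycognaize's own code: the helper name `resolve_fetch_settings` is hypothetical, and only the environment variable names API_HOST and X_AUTH_TOKEN come from the parameter descriptions.

```python
import os
from typing import Dict, Optional


def resolve_fetch_settings(api_host: Optional[str] = None,
                           x_auth: Optional[str] = None) -> Dict[str, str]:
    """Hypothetical helper: prefer explicit values, otherwise fall back to
    the API_HOST and X_AUTH_TOKEN environment variables."""
    host = api_host if api_host is not None else os.environ.get("API_HOST")
    token = x_auth if x_auth is not None else os.environ.get("X_AUTH_TOKEN")
    if host is None or token is None:
        raise ValueError("Set API_HOST and X_AUTH_TOKEN or pass both arguments")
    return {"api_host": host, "x_auth": token}
```

The returned dictionary could then be unpacked into a call such as `Document.fetch_document(recipe_id=..., doc_id=..., **resolve_fetch_settings())`.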
- classmethod from_dict(raw, data_path)[source]
Document object created from data of dict
- Parameters:
raw (dict) – document dictionary
data_path (str) – path to the document's OCR and page images
- Return type:
Document
- get_df_with_tied_field_values(table_tag, pn_filter=<function Document.<lambda>>)[source]
Return the dataframe of the TableTag, where each cell value is replaced with the values in the fields of tied values (i.e. values that are in the same physical location as the cell)
- Parameters:
table_tag (TableTag) – Input TableTag
pn_filter (Callable) – If provided, only fields with names passing the filter will be considered
- Return type:
DataFrame
- Returns:
Dataframe of the TableTag
- get_first_tied_field(tag, pn_filter=<function Document.<lambda>>)[source]
Return the first field that is in the same location as the given tag
- Parameters:
tag (ExtractionTag) – Input ExtractionTag
pn_filter (Callable) – If provided, only fields with names passing the filter will be considered
- Return type:
Tuple[str, Field]
- Returns:
If match found, return Tuple of the matching pname and Field, otherwise return None
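To make the role of pn_filter concrete, here is a small stand-alone sketch. It assumes that pn_filter receives a field name (pname) as a single string and returns a boolean; that signature, the helper `first_passing`, and the use of plain strings in place of Field objects are all assumptions for illustration, not the library's implementation.

```python
from typing import Callable, Dict, List, Optional, Tuple


def first_passing(candidates: Dict[str, List[str]],
                  pn_filter: Callable[[str], bool] = lambda pname: True
                  ) -> Optional[Tuple[str, str]]:
    """Return (pname, first item) for the first name accepted by pn_filter,
    or None when no name passes. Stand-in for the documented behaviour."""
    for pname, items in candidates.items():
        if items and pn_filter(pname):
            return pname, items[0]
    return None


# Plain strings stand in for Field objects tied to the same location.
tied = {"table": ["cell 3,2"], "total_assets": ["1,250"], "notes": []}
match = first_passing(tied, pn_filter=lambda pname: pname != "table")
missing = first_passing(tied, pn_filter=lambda pname: pname == "revenue")
```

Here `match` is the first non-table entry, and `missing` is None because no candidate name passes the filter.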
- get_first_tied_field_value(tag, pn_filter=<function Document.<lambda>>)[source]
Return the value of the first field that is in the same location as the given tag
- Parameters:
tag (ExtractionTag) – Input ExtractionTag
pn_filter (Callable) – If provided, only tags that are in fields with names passing the filter will be considered
- Returns:
- get_first_tied_tag(tag, pn_filter=<function Document.<lambda>>)[source]
Return the first tag that is in the same location as the given tag
- Parameters:
tag (ExtractionTag) – Input ExtractionTag
pn_filter (Callable) – If provided, only tags that are in fields with names passing the filter will be considered
- Return type:
Tuple[str, ExtractionTag]
- Returns:
If match found, return Tuple of the matching pname and ExtractionTag, otherwise return None
- get_first_tied_tag_value(tag, pn_filter=<function Document.<lambda>>)[source]
Return the value of the first tag that is in the same location as the given tag
- Parameters:
tag (ExtractionTag) – Input ExtractionTag
pn_filter (Callable) – If provided, only tags that are in fields with names passing the filter will be considered
- Returns:
- get_layout_text(field_type, field_filter=<function Document.<lambda>>, sorting_function=None, table_parser=<bound method TableField.parse_table of <class 'pycognaize.document.field.table_field.TableField'>>)[source]
Sample usage:
    from pycognaize.document.document import Document
    from pycognaize.document.field.table_field import TableField

    doc = Document.fetch_document(recipe_id="649a7c0180d898001055a354",
                                  doc_id="65db38f7dc54d400119ae1f3")

    def parse_table(table_field: TableField) -> str:
        # Promote the first row of the table to the header, then render as markdown
        df = table_field.tags[0].df
        new_header = df.iloc[0]
        df = df[1:]
        df.columns = new_header
        df_text = df.to_markdown(index=False)
        return df_text

    doc_text = doc.get_layout_text(
        field_type="both",
        field_filter=lambda pname, field: pname != 'table',
        sorting_function=lambda x: (x.tags[0].top, x.tags[0].left),
        table_parser=parse_table)

    # get_layout_text returns one string per page
    for page_number, page_text in enumerate(doc_text, start=1):
        print(f"---------PAGE {page_number}---------------")
        print(page_text)
- Parameters:
field_type (Literal['input', 'output', 'both'])
field_filter (Callable)
sorting_function (Optional[Callable])
table_parser (Callable)
- Return type:
list[str]
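The sample usage passes a sorting_function that returns a (top, left) tuple. The sketch below shows what such a key does on its own, using a hypothetical `Box` dataclass in place of real tags and assuming that a smaller top value means higher on the page.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Box:
    """Stand-in for a tag: only the top and left coordinates are modelled."""
    name: str
    top: float
    left: float


def reading_order(boxes: List[Box]) -> List[Box]:
    """Sort top-to-bottom, then left-to-right, using a (top, left) tuple key."""
    return sorted(boxes, key=lambda b: (b.top, b.left))


boxes = [Box("footer", 90.0, 10.0), Box("right column", 20.0, 55.0),
         Box("title", 5.0, 10.0), Box("left column", 20.0, 10.0)]
ordered_names = [b.name for b in reading_order(boxes)]
```

Because tuples compare element by element, boxes on the same line (equal top) are ordered by their left coordinate.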
- static get_matching_table_cells_for_tag(tag, table_tags, one_to_one)[source]
Create a list which includes the original extraction tag, the corresponding table tag and Cell objects and the IOU of the intersection
- Parameters:
- Return type:
- Returns:
List of tuples, which include the original extraction tag, the corresponding table tag and Cell objects and the IOU of the intersection
- get_table_cell_overlap(source_field, one_to_one)[source]
Create a list which includes the original extraction tag, the corresponding table tag and Cell objects and the IOU of the intersection
- Parameters:
source_field (str) – Name of the field, for which to return the corresponding table cells
one_to_one (bool) – If true, for each tag only one corresponding cell will be returned
- Return type:
- Returns:
List of tuples, which include the original extraction tag, the corresponding table tag and Cell objects and the IOU of the intersection
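The effect of one_to_one can be sketched with plain tuples. This is an illustration only: the helper `select_cells` is hypothetical, and it assumes that the single cell kept per tag is the one with the highest IoU, which the documentation does not state.

```python
from typing import List, Tuple

Match = Tuple[str, float]  # (cell label, IoU with the tag)


def select_cells(matches: List[Match], one_to_one: bool) -> List[Match]:
    """Keep every overlapping cell, or only the best one when one_to_one is set.

    Assumption: "best" means highest IoU; the library may choose differently.
    """
    overlapping = [m for m in matches if m[1] > 0.0]
    if one_to_one and overlapping:
        return [max(overlapping, key=lambda m: m[1])]
    return overlapping


candidates = [("row 2, col 1", 0.15), ("row 2, col 2", 0.80), ("row 3, col 2", 0.0)]
all_cells = select_cells(candidates, one_to_one=False)
best_cell = select_cells(candidates, one_to_one=True)
```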
- get_tied_fields(tag, field_type='both', threshold=0.5, pn_filter=<function Document.<lambda>>)[source]
Given an ExtractionTag, return all the fields that contain tags in the same physical location.
- Parameters:
tag (ExtractionTag) – Input ExtractionTag
field_type (str) – Types of fields to consider {input/output/both}
threshold (float) – The IoU threshold to consider the tags in the same location
pn_filter (Callable) – If provided, only fields with names passing the filter will be considered
- Return type:
Dict[str, List[Field]]
- Returns:
Dictionary where key is pname and value is List of Field objects
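The threshold parameter is an intersection-over-union (IoU) cutoff. A minimal sketch of that measure follows, assuming axis-aligned boxes given as (left, top, right, bottom) and an inclusive comparison; the function names are hypothetical and the library's own computation may differ.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (left, top, right, bottom)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned rectangles."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def same_location(a: Box, b: Box, threshold: float = 0.5) -> bool:
    """Treat two boxes as tied when their IoU reaches the threshold.

    Assumption: the comparison is inclusive (>=); the library may differ.
    """
    return iou(a, b) >= threshold


tag_box = (10.0, 10.0, 30.0, 20.0)
shifted = (15.0, 10.0, 35.0, 20.0)   # overlaps three quarters of tag_box
score = iou(tag_box, shifted)
```

For these two boxes the intersection is 150 and the union is 250, so the score is 0.6: the pair counts as tied at a threshold of 0.5 but not at 0.9.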
- get_tied_tags(tag, field_type='both', threshold=0.9, pn_filter=<function Document.<lambda>>)[source]
Given a single tag, return all other tags in the document that are in the same physical location in the document
- Parameters:
tag (ExtractionTag) – Input ExtractionTag
field_type (str) – Types of fields to consider {input/output/both}
threshold (float) – The IoU threshold to consider the tags in the same location
pn_filter (Callable) – If provided, only tags that are in fields with names passing the filter will be considered
- Return type:
Dict[str, List[ExtractionTag]]
- Returns:
Dictionary where key is pname and value is List of ExtractionTag objects
- property html
Returns HTML object
- property id: str
Returns the pycognaize id of the document
- property is_xbrl: bool
Returns True if document is XBRL, otherwise False
- load_page_images(page_filter=<function Document.<lambda>>)[source]
Get all images of the pages in the document (Using multiprocessing)
- Parameters:
page_filter (Callable)
- Return type:
None
- load_page_ocr(page_filter=<function Document.<lambda>>, stick_coords=False)[source]
Get all OCR of the pages in the document (Using multiprocessing)
- Parameters:
page_filter (Callable)
stick_coords (bool)
- Return type:
None
- property metadata: Dict[str, Any]
Returns document metadata
- property pages: Dict[int, Page]
Returns a dictionary, where each key is the page number and values are Page objects
- to_pdf(input_fields=None, output_fields=None, input_color='deeppink1', output_color='deepskyblue3', input_opacity=0.2, output_opacity=0.3)[source]
Adds tags of input_fields and output_fields to the bytes object representing the pdf file of the document.
- Parameters:
input_fields (Optional[List[str]]) – Input fields
output_fields (Optional[List[str]]) – Output fields
input_color (str) – The color of the annotation rectangle of the input field
output_color (str) – The color of the annotation rectangle of the output field
input_opacity (float) – The opacity of the annotation rectangle of the input field
output_opacity (float) – The opacity of the annotation rectangle of the output field
- Return type:
bytes
- Returns:
bytes object of the pdf
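Since to_pdf returns a bytes object rather than writing a file, the caller decides where to store it. The sketch below shows one way to do that with the standard library; the helper `save_pdf` is hypothetical, and placeholder bytes stand in for an actual to_pdf() result.

```python
import tempfile
from pathlib import Path


def save_pdf(pdf_bytes: bytes, path: Path) -> int:
    """Write the bytes to disk and return the number of bytes written."""
    return path.write_bytes(pdf_bytes)


# Placeholder bytes stand in for the value returned by to_pdf().
fake_pdf = b"%PDF-1.4\n% placeholder content\n"
out_path = Path(tempfile.mkdtemp()) / "annotated.pdf"
written = save_pdf(fake_pdf, out_path)
```

With a real document, the first argument would be the result of a call such as `doc.to_pdf()`.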
- property x: FieldCollection[str, List[Field]]
Returns a dictionary, where keys are input field names and values are list of Field objects
- property y: FieldCollection[str, List[Field]]
Returns a dictionary, where keys are output field names and values are list of Field objects