Quick Tutorial
Our main objective in this tutorial is to retrieve the balance sheet table from a financial report and create an Excel output from it.
To start working with pycognaize, we first need to retrieve the document from a stored snapshot.
from pycognaize import Snapshot
Note
Snapshot uses the environment variables SNAPSHOT_PATH and SNAPSHOT_ID to load the snapshot. Before creating a Snapshot object using Snapshot.get(), be sure to set the path and ID in the corresponding environment variables.
import os

os.environ['SNAPSHOT_PATH'] = "PATH_TO_RESOURCE_FOLDER"
os.environ['SNAPSHOT_ID'] = "SNAPSHOT_ID"
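Because Snapshot.get() depends on both variables, it can help to fail early when one of them is missing. The following is a minimal sketch using only the standard library; `require_env` is a hypothetical helper that is not part of pycognaize, and the values are the placeholders used above.

```python
import os


def require_env(*names):
    """Return the values of the given environment variables.

    Hypothetical helper (not part of pycognaize): raises a KeyError that
    names every variable that is unset or empty.
    """
    missing = [name for name in names if not os.environ.get(name)]
    if missing:
        raise KeyError("Missing environment variables: " + ", ".join(missing))
    return {name: os.environ[name] for name in names}


# Placeholder values, as in the tutorial; replace them with real ones.
os.environ["SNAPSHOT_PATH"] = "PATH_TO_RESOURCE_FOLDER"
os.environ["SNAPSHOT_ID"] = "SNAPSHOT_ID"

settings = require_env("SNAPSHOT_PATH", "SNAPSHOT_ID")
print(settings["SNAPSHOT_ID"])
```

Calling such a check before Snapshot.get() turns a missing variable into an explicit error message instead of a failure later in the loading code.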
Now we can create the Snapshot object:
>>> snapshot = Snapshot.get()
>>> snapshot
<pycognaize.document.snapshot.Snapshot object at 0x...>
Alternatively, Snapshot can be initialized with the corresponding parameters.
snapshot = Snapshot(data_path=..., doc_path=...)
The Cognaize SDK supports logging in with a Cognaize account to access snapshots from the cloud. To use this feature, you need to log in with your Cognaize account.
First, the API_HOST environment variable should be set to the Cognaize API host.
os.environ['API_HOST'] = "https://api.cognaize.com"
from pycognaize.login import Login
login_instance = Login()
login_instance.login(email=..., password=...)
Now Snapshot can access snapshots by ID using get_by_id(), or download snapshot files from cloud storage using download().
snapshot = Snapshot.get_by_id(snapshot_id=...)
snapshot = Snapshot.download(snapshot_id=..., destination_path=...)
Note
The login function gives you access to the snapshots assigned to your account. You only need to log in once; the login information remains available until the end of the runtime.
A Snapshot is a collection of multiple Document objects:
>>> snapshot.documents._ids[:4]
['5eb8ee1c6623f200192a0651', '60215310dbf28200120e6afa', '60b76b3d6f3f980019105dac', '60f5260c7883ab0013d9c184']
Our Snapshot consists of 5 documents (the slice above shows the first four IDs). Let’s choose one of them and have a look at the document's structure.
>>> document = snapshot.documents['60b76b3d6f3f980019105dac']
>>> document
<pycognaize.document.document.Document object at 0x...>
Additionally, you can retrieve a Document object from the cloud using the recipe_id and document_id. This method gives you access to the OCR data and images associated with the specified document after you log in to your Cognaize account.
Note
Before creating a document object using Document.fetch_document(recipe_id, document_id), make sure to set the corresponding environment variables:
- X_AUTH_TOKEN: authentication token for API access
- API_HOST: URL of the Cognaize API
os.environ['API_HOST'] = "https://api.cognaize.com"
os.environ['X_AUTH_TOKEN'] = "token"
>>> document = Document.fetch_document(recipe_id=..., doc_id=...)
>>> document
<pycognaize.document.document.Document object at 0x...>
Documents are separated into pages, which we can access through the document.pages attribute. Afterwards, we can select the page we want to work with. We will choose page 4, as it contains the table that we need.
>>> document.pages
OrderedDict([(1, <Page 1>), (2, <Page 2>), (3, <Page 3>), (4, <Page 4>), (5, <Page 5>), (6, <Page 6>)])
>>> page_4 = document.pages[4]
>>> page_4
<Page 4>
A Page object contains the OCR data and the image of the page. It also has a lot of useful functionality, which is described in the Page documentation.
The page OCR and image data can be loaded using multiprocessing in order to speed up the process. The methods document.load_page_images() and document.load_page_ocr() are available for this.
For example, to load all page images in parallel, we can do
>>> document.load_page_images()
...
Optionally, you can pass a filter function to these methods. The filter takes a page object, and only the pages for which it returns a truthy value are downloaded. For example, to download the OCR only for the odd pages, you can call
>>> document.load_page_ocr(lambda page: page.page_number%2)
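To see which pages such a filter selects without downloading anything, the predicate can be tried on stand-in objects first. This is only a sketch: `FakePage` is a hypothetical class that models nothing but the `page_number` attribute used above, and the code does not call pycognaize.

```python
from dataclasses import dataclass


@dataclass
class FakePage:
    """Stand-in for a page; only the page_number attribute is modelled."""
    page_number: int


def is_odd(page):
    # Same test as `lambda page: page.page_number % 2`, but returns a bool.
    return page.page_number % 2 == 1


# Six stand-in pages, numbered 1 to 6 like the example document.
pages = [FakePage(n) for n in range(1, 7)]
selected = [page.page_number for page in pages if is_odd(page)]
print(selected)  # [1, 3, 5]
```

The lambda in the tutorial returns 1 or 0 rather than True or False; both forms work as a filter as long as the method only checks truthiness, which is how the text above describes it.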
Moreover, you can search for text in a page object and get its coordinates.
>>> page_4.search_text('Month')
[{'top': 500, 'bottom': 529, 'left': 1254, 'right': 1361, 'matched_words': [{'left': 1254, 'right': 1361, 'top': 500, 'bottom': 529, 'ocr_text': 'Month', 'word_id_number': 60}]}]
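Each match is a plain dictionary, so its geometry can be derived with ordinary arithmetic. The sketch below works on a copy of the output shown above; `box_size` and `box_center` are hypothetical helpers, and the results are in whatever units the returned coordinates use.

```python
# One match, copied from the search_text('Month') output above.
match = {
    "top": 500, "bottom": 529, "left": 1254, "right": 1361,
    "matched_words": [
        {"left": 1254, "right": 1361, "top": 500, "bottom": 529,
         "ocr_text": "Month", "word_id_number": 60},
    ],
}


def box_size(box):
    """Return (width, height) of a box with left/right/top/bottom keys."""
    return box["right"] - box["left"], box["bottom"] - box["top"]


def box_center(box):
    """Return the (x, y) midpoint of the box."""
    return (box["left"] + box["right"]) / 2, (box["top"] + box["bottom"]) / 2


width, height = box_size(match)
center = box_center(match)
words = [word["ocr_text"] for word in match["matched_words"]]
print(width, height, center, words)  # 107 29 (1307.5, 514.5) ['Month']
```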
You can also generate an image with the annotations.
image = page_4.draw_ocr_boxes()
In order to access the table data, we need to access the fields, which are stored in the document object. The document object contains all tagged fields. Input fields are accessed using document.x and output fields are accessed using document.y.
The output of document.x is an ordered dictionary that has names (defined in the AI Interface) as keys and lists of Field objects as values. We can select all table fields using document.x['table'].
>>> fields = document.x
>>> fields
FieldCollection([('table', [<TableField: table>])])
>>> table_field = document.x['table']
>>> table_field
[<TableField: table>]
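The access pattern above (a name mapped to a list of fields) can be illustrated with plain Python containers. This is a sketch under stated assumptions: `FakeField` and the `fields` dictionary are hypothetical stand-ins with the same shape as document.x, not pycognaize objects.

```python
from collections import OrderedDict


class FakeField:
    """Stand-in for a field object; only a name is modelled."""

    def __init__(self, name):
        self.name = name


# Hypothetical collection shaped like document.x: name -> list of fields.
fields = OrderedDict([("table", [FakeField("table")])])

# Count how many fields were tagged under each name.
counts = {name: len(field_list) for name, field_list in fields.items()}

# Each value is a list, so a single field is reached by indexing into it.
first_table = fields["table"][0]
print(counts, first_table.name)  # {'table': 1} table
```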
Note
There are 5 types of field objects:
- Numeric Field
- Text Field
- Date Field
- Table Field
- Area Field
Now that we have accessed the table fields, we can select the only table on this page and get its tags. Tags provide a lot more functionality, which is covered in the Tag documentation.
>>> table_1 = table_field[0]
>>> table_1
<TableField: table>
Let’s select the table on page 4 and convert it to a pandas DataFrame. A TableTag can output a pandas.DataFrame through its df attribute (table_tag.df).
>>> table_1_tags = table_1.tags[0]
>>> table_1_tags
<TableTag: left: 8.6, right: 92.69999999999999, top: 12.0, bottom: 64.0998>
>>> table_1_tags.df
    0                                       1               2
0                                           March 31, 2021  December 31, 2020
1                                           (unaudited)
2   Assets
3   Current assets:
4   Cash                                    $ 10.9          $ —
5   Accounts receivable                     11.6            8.6
6   Accounts receivable - related party     5.0             5.7
7   Prepaid expenses                        0.3             0.4
8   Total current assets                    27.8            14.7
9   Property, plant and equipment           377.6           371.8
10  Less: accumulated depreciation          51.1            46.5
11  Property, plant and equipment, net      326.5           325.3
12  Investment in unconsolidated affiliate  80.2            80.3
13  Other assets                            0.5             0.6
14  Total assets                            $ 435.0         $ 420.9
15  Liabilities and members' equity
16  Current liabilities:
17  Accounts payable                        $ 29.2          $ 6.8
18  Accounts payable - related party        6.4             2.1
19  Accrued expenses and other liabilities  2.4             4.4
20  Accrued expenses - related party        0.1             0.3
21  Total current liabilities               38.1            13.6
22  Long-term debt                          100.0           109.3
23  Deferred revenues                       1.6             1.2
24  Other long-term liabilities             2.5             2.5
25  Members' equity                         292.8           294.3
26  Total liabilities and members' equity   $ 435.0         $ 420.9
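The cells in this DataFrame are strings such as "$ 10.9", so they usually need to be converted to numbers before any calculation. The sketch below is not part of pycognaize: `parse_amount` is a hypothetical helper, and treating a dash as zero is an assumption. It re-adds the current assets for March 31, 2021 and compares the sum with the reported total.

```python
EM_DASH = "\u2014"  # the dash the table uses for an empty amount


def parse_amount(cell):
    """Convert a cell such as '$ 10.9' or '11.6' to a float.

    Assumption: an empty cell or a lone dash means zero.
    """
    text = cell.replace("$", "").replace(",", "").strip()
    if text in ("", "-", EM_DASH):
        return 0.0
    return float(text)


# (label, March 31, 2021 cell) pairs copied from rows 4 to 7 of the table.
current_assets = [
    ("Cash", "$ 10.9"),
    ("Accounts receivable", "11.6"),
    ("Accounts receivable - related party", "5.0"),
    ("Prepaid expenses", "0.3"),
]
reported_total = parse_amount("27.8")  # row 8, "Total current assets"

computed_total = sum(parse_amount(cell) for _, cell in current_assets)
# Compare with a tolerance, since binary floats cannot represent tenths exactly.
matches = abs(computed_total - reported_total) < 0.05
print(round(computed_total, 1), matches)  # 27.8 True
```

For the Excel output itself, pandas provides `DataFrame.to_excel`, which requires an Excel engine such as openpyxl to be installed.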