Quick Tutorial
==============

.. highlight:: python3

.. testsetup::

    from pycognaize import Snapshot
    from pycognaize.tests.resources import RESOURCE_FOLDER
    import os
    os.environ['SNAPSHOT_PATH'] = RESOURCE_FOLDER
    os.environ['SNAPSHOT_ID'] = 'snapshots'

Our main objective in this tutorial is to retrieve the balance sheet table from a financial report and create an Excel output from it.

To start working with ``pycognaize``, we first need to retrieve the document from a stored :term:`snapshot`.

.. code-block:: python

    from pycognaize import Snapshot

.. note::

    Snapshot uses the environment variables ``SNAPSHOT_PATH`` and ``SNAPSHOT_ID`` to load the snapshot.
    Before creating a Snapshot object using ``Snapshot.get()``, be sure to set the *path* and *id* in the corresponding environment variables.

.. code:: python

    import os

    os.environ['SNAPSHOT_PATH'] = "PATH_TO_RESOURCE_FOLDER"
    os.environ['SNAPSHOT_ID'] = "SNAPSHOT_ID"

Now we can create the Snapshot object.

.. doctest::

    >>> snapshot = Snapshot.get()
    >>> snapshot # doctest: +ELLIPSIS

Alternatively, ``Snapshot`` can be initialized with the corresponding parameters.

.. code:: python

    snapshot = Snapshot(data_path=..., doc_path=...)

The Cognaize SDK supports logging in with a Cognaize account to access snapshots from the cloud.
To use this feature, you need to log in using your Cognaize account.
First, the ``API_HOST`` environment variable should be set to the Cognaize API host.

.. code:: python

    os.environ['API_HOST'] = "https://api.cognaize.com"

.. code:: python

    from pycognaize.login import Login

    login_instance = Login()
    login_instance.login(email=..., password=...)

Now Snapshot can access snapshots by `id` using ``get_by_id()``, or download snapshot files from cloud storage using ``download()``.

.. code:: python

    snapshot = Snapshot.get_by_id(snapshot_id=...)
    snapshot = Snapshot.download(snapshot_id=..., destination_path=...)

.. note::

    The ``login`` function gives you access to the snapshots assigned to your account.
    You only need to log in once, and the login information remains available until the end of the runtime.

:doc:`Snapshot <../API/_autosummary/pycognaize.document.snapshot.Snapshot>` is a collection of multiple Document objects.

.. doctest::

    >>> snapshot.documents._ids[:4]
    ['5eb8ee1c6623f200192a0651', '60215310dbf28200120e6afa', '60b76b3d6f3f980019105dac', '60f5260c7883ab0013d9c184']

Our Snapshot consists of 5 documents; the first four IDs are shown above. Let's choose one of them and have a look at the document's structure.

.. doctest::

    >>> document = snapshot.documents['60b76b3d6f3f980019105dac']
    >>> document # doctest: +ELLIPSIS

Additionally, you can retrieve the Document object from the cloud using the `recipe_id` and `document_id`.
This method allows you to access :term:`OCR` data and images associated with the specified document after logging into your Cognaize account.

.. note::

    Before creating a document object using ``Document.fetch_document(recipe_id, document_id)``, make sure to set the corresponding environment variables:

    * ``X_AUTH_TOKEN``: authentication token for API access
    * ``API_HOST``: URL for the Cognaize API

.. code:: python

    os.environ['API_HOST'] = "https://api.cognaize.com"
    os.environ['X_AUTH_TOKEN'] = "token"

.. code:: python

    >>> document = Document.fetch_document(recipe_id=..., doc_id=...)
    >>> document # doctest: +ELLIPSIS

Documents are separated into pages, which we can access through the ``document.pages`` attribute.
Afterwards, we can select the page we want to work with. We will choose page 4, as it contains the table that we need.

.. doctest::

    >>> document.pages
    OrderedDict([(1, ), (2, ), (3, ), (4, ), (5, ), (6, )])
    >>> page_4 = document.pages[4]
    >>> page_4

The Page object contains :term:`OCR` data and the image of the page.
It also has a lot of useful functionality that you can learn more about :doc:`here <../API/_autosummary/pycognaize.document.page.Page>`.

The page OCR and image data can be loaded using multiprocessing in order to speed up the process.
To use this functionality, the functions ``document.load_page_images()`` and ``document.load_page_ocr()`` are available.
For example, to load all page images in parallel, we can do

.. code-block::

    >>> document.load_page_images()
    ...

Optionally, you can pass a filter function to these methods. It takes a page object, and only the pages for which it returns true are downloaded.
For example, to download the OCR only for the odd pages, you can call

.. code-block::

    >>> document.load_page_ocr(lambda page: page.page_number % 2)

Moreover, you can search for text in a ``page`` object and get its coordinates.

.. doctest::

    >>> page_4.search_text('Month')
    [{'top': 500, 'bottom': 529, 'left': 1254, 'right': 1361, 'matched_words': [{'left': 1254, 'right': 1361, 'top': 500, 'bottom': 529, 'ocr_text': 'Month', 'word_id_number': 60}]}]

You can also generate an image with the annotations.

.. code-block::

    image = page_4.draw_ocr_boxes()

In order to access the table data, we need to access the fields in the document object.
The document object contains all tagged fields. Input ``Fields`` are accessed using ``document.x`` and output ``Fields`` are accessed using ``document.y``.
The output of ``document.x`` is an ordered dictionary that has *names* (defined in the :term:`AI Interface`) as keys and a ``list`` of :doc:`Field <../API/_autosummary/pycognaize.document.field>` objects as values.
We can select all table fields using ``document.x['table']``.

.. doctest::

    >>> fields = document.x
    >>> fields
    FieldCollection([('table', [])])
    >>> table_field = document.x['table']
    >>> table_field
    []

.. note::

    There are 5 types of ``field`` objects:

    * Numeric Field
    * Text Field
    * Date Field
    * Table Field
    * Area Field

Now that we have accessed the tables, we can select the only table on this page and get its tags.
Tags provide a lot more functionality, which is covered in :doc:`Tag <../API/_autosummary/pycognaize.document.tag>`.

.. doctest::

    >>> table_1 = table_field[0]
    >>> table_1

Let's select the table on page 4 and convert it to a pandas DataFrame.
``TableTag`` can output a ``pandas.DataFrame`` through its ``table_tag.df`` attribute.

.. doctest::

    >>> table_1_tags = table_1.tags[0]
    >>> table_1_tags

.. code-block:: python

    >>> table_1_tags.df
                                                 0               1                  2
    0                                           March 31, 2021  December 31, 2020
    1                                              (unaudited)
    2                                   Assets
    3                          Current assets:
    4                                     Cash          $ 10.9                $ —
    5                      Accounts receivable            11.6                8.6
    6      Accounts receivable - related party             5.0                5.7
    7                         Prepaid expenses             0.3                0.4
    8                     Total current assets            27.8               14.7
    9            Property, plant and equipment           377.6              371.8
    10          Less: accumulated depreciation            51.1               46.5
    11      Property, plant and equipment, net           326.5              325.3
    12  Investment in unconsolidated affiliate            80.2               80.3
    13                            Other assets             0.5                0.6
    14                            Total assets         $ 435.0            $ 420.9
    15         Liabilities and members' equity
    16                    Current liabilities:
    17                        Accounts payable          $ 29.2              $ 6.8
    18        Accounts payable - related party             6.4                2.1
    19  Accrued expenses and other liabilities             2.4                4.4
    20        Accrued expenses - related party             0.1                0.3
    21               Total current liabilities            38.1               13.6
    22                          Long-term debt           100.0              109.3
    23                       Deferred revenues             1.6                1.2
    24             Other long-term liabilities             2.5                2.5
    25                         Members' equity           292.8              294.3
    26   Total liabilities and members' equity         $ 435.0            $ 420.9
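
The amounts in the table above are strings such as ``$ 10.9`` or ``$ —``, so it can help to convert them to numbers before producing the Excel output. The following is a minimal sketch using only the standard library. It assumes that the currency sign and the number share one cell, as the printed table suggests. The helper names ``parse_amount`` and ``clean_rows`` are hypothetical and not part of ``pycognaize``, and the sample rows are copied by hand from the output above.

```python
from typing import List, Optional

# Placeholder strings that the balance sheet uses for "no amount".
EMPTY_MARKERS = {"", "—", "-"}


def parse_amount(cell: str) -> Optional[float]:
    """Convert a cell such as '$ 10.9' or '1,234.5' to a float.

    Returns None for blank cells, dash placeholders such as '$ —',
    and any text that is not a number (for example a date header).
    """
    text = cell.replace("$", "").replace(",", "").strip()
    if text in EMPTY_MARKERS:
        return None
    try:
        return float(text)
    except ValueError:
        return None


def clean_rows(rows: List[List[str]]) -> List[list]:
    """Keep the label column and parse the two amount columns."""
    return [[label, parse_amount(a), parse_amount(b)] for label, a, b in rows]


# Rows copied by hand from the table above (assumed cell layout).
sample = [
    ["Cash", "$ 10.9", "$ —"],
    ["Accounts receivable", "11.6", "8.6"],
    ["Total assets", "$ 435.0", "$ 420.9"],
]
cleaned = clean_rows(sample)
print(cleaned)
```

With a real table, the same function could be applied to the amount columns of the DataFrame (for example with ``Series.map``) before calling ``DataFrame.to_excel``, which needs an Excel writer engine such as ``openpyxl`` to be installed.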