2.80:Infer Grid (Table Extract Method): Difference between revisions
Revision as of 14:43, 27 January 2020

Infer Grid is one of three methods to extract data from tables on documents. It uses the positions of row and column headers to infer where a tabular grid would lie around each value in a table, then extracts the value from each cell of that inferred grid.
This method extracts information by inferring a grid from the header positions. This is done by assigning an "X Axis Extractor" to match the column headers and a "Y Axis Extractor" to match the row headers. A grid is created from the header positions returned by the two extractors. If table line positions can be obtained from a Line Detection or Line Removal IP Command, only one Axis Extractor is needed. In these cases, the X Axis Extractor can be used to find the column header labels, and the grid will be created using the table lines in the document's layout data. The raw text data obtained from the Recognize activity populates each cell of the grid according to where it falls on the page.
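The idea of inferring a grid from header positions can be sketched in a few lines of Python. This is only an illustration of the concept, not Grooper's implementation: the Word class, the infer_grid function, and the choice to place cell boundaries halfway between neighbouring header centers are all assumptions made for the example.

```python
from bisect import bisect_right
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    x: float  # horizontal center of the word on the page
    y: float  # vertical center of the word on the page


def boundaries(centers):
    """Cut points halfway between neighbouring header centers (an assumption)."""
    ordered = sorted(centers)
    return [(a + b) / 2 for a, b in zip(ordered, ordered[1:])]


def infer_grid(column_header_xs, row_header_ys, words):
    """Place each word in the (row, column) cell its center falls in."""
    col_cuts = boundaries(column_header_xs)
    row_cuts = boundaries(row_header_ys)
    cells = {}
    for w in words:
        row = bisect_right(row_cuts, w.y)
        col = bisect_right(col_cuts, w.x)
        cells.setdefault((row, col), []).append(w.text)
    # Join the words that landed in the same cell, in input order.
    return {key: " ".join(parts) for key, parts in cells.items()}


# Hypothetical positions: two column headers (x) and two row headers (y).
words = [
    Word("Acme", 100, 50), Word("Bank", 140, 50),
    Word("Smith", 300, 52), Word("555-0100", 105, 90),
]
grid = infer_grid([120, 310], [50, 90], words)
```

Here the column header positions play the role of the X Axis Extractor results and the row header positions play the role of the Y Axis Extractor results. The words stand in for the raw text returned by the Recognize activity.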
Use Cases
Non-Standard Tables
The Infer Grid method excels at many cases where the table structure is not easily understood by the Row Match or Header-Value methods. This is especially true for tables with table lines present. Examine the table below.

Row Match might work, but it would be a heavy lift. Each row follows a different pattern: names on one, addresses on another, phone numbers on another. It would take some creative configuration. You could try to make a row out of the columns, but that would take a series of extractors and would be effort-intensive and complicated to set up.
Header-Value would also have problems. The column header labels ("Lender", "Mortgage Broker", etc.) would be straightforward, but the value extractors would be tricky. It's possible a generic text segment extractor could get you close, but at least the "Address" row presents problems because it is a two-line value instead of a single line. Again, it could be doable, but it would take some effort.
Infer Grid can do this job with a single extractor. All you would need to do is write an extractor to find the "X Axis", that is, all the column header labels in a row.

Since table lines are present, the text falling inside each cell (obtained via the Recognize activity) can be extracted to the corresponding cell in the column.
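When table lines are available, assigning text to cells becomes a lookup between line positions. The sketch below shows that lookup. It is not Grooper's code: the function names, the (text, x, y) word tuples, and the line coordinates are hypothetical, and the line positions are assumed to have already been detected.

```python
from bisect import bisect_right


def cell_index(line_positions, coordinate):
    """Index of the gap between two lines that contains the coordinate.

    Returns None when the coordinate lies outside the outermost lines.
    """
    i = bisect_right(line_positions, coordinate)
    if i == 0 or i == len(line_positions):
        return None
    return i - 1


def fill_cells(vertical_lines, horizontal_lines, words):
    """Group words by the (row, column) cell their center point falls in."""
    vertical_lines = sorted(vertical_lines)
    horizontal_lines = sorted(horizontal_lines)
    cells = {}
    for text, x, y in words:
        col = cell_index(vertical_lines, x)
        row = cell_index(horizontal_lines, y)
        if row is None or col is None:
            continue  # text outside the table is ignored
        cells.setdefault((row, col), []).append(text)
    return {key: " ".join(parts) for key, parts in cells.items()}


# Hypothetical line positions and word centers.
lines_x = [0, 200, 400]
lines_y = [0, 40, 100]
words = [
    ("Lender", 100, 20),
    ("123 Main St", 100, 60),
    ("Suite 4", 100, 80),
    ("Footer", 100, 150),
]
table = fill_cells(lines_x, lines_y, words)
```

Because a cell is defined by its bounding lines rather than by a text pattern, a two-line value such as an address lands in a single cell without any extra configuration.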

Furthermore, if table lines are not present, Infer Grid can use both the row and column header labels by using both the "Y Axis Extractor" and "X Axis Extractor" properties. We can use two extractors, one to return all the Y Axis labels and one to return the X Axis labels, and use their positions to infer the table's structure.


OMR Checkboxes
The Infer Grid method is the easiest way to read checkbox states inside a table. Once the table's structure is found using the axis extractors, you can choose which columns contain checkboxes. Grooper will use layout data obtained from a Box Detection or Box Removal IP Command to determine if the box is filled in or left blank. Refer to the tutorial below for more information on how to configure this use.
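The source does not describe how the filled or blank decision is made from the box layout data. As a rough illustration of the general OMR idea only, the sketch below assumes a box counts as checked when the share of dark pixels inside it reaches a threshold. The function name, the 0/1 pixel grid, the box coordinates, and the 0.2 threshold are all hypothetical.

```python
def checkbox_is_checked(pixels, box, threshold=0.2):
    """Decide whether a checkbox region is filled in.

    pixels: 2D list of 0/1 values, where 1 is a dark pixel.
    box: (left, top, right, bottom) of the box interior; right and bottom
         are exclusive.
    threshold: assumed minimum share of dark pixels for a "checked" box.
    """
    left, top, right, bottom = box
    region = [row[left:right] for row in pixels[top:bottom]]
    total = sum(len(row) for row in region)
    if total == 0:
        return False
    dark = sum(sum(row) for row in region)
    return dark / total >= threshold


# A tiny hypothetical page: an "X" drawn in the left box, nothing in the right.
page = [
    [0, 0, 0, 0, 0, 0, 0, 0],
    [0, 1, 0, 1, 0, 0, 0, 0],
    [0, 0, 1, 0, 0, 0, 0, 0],
    [0, 1, 0, 1, 0, 0, 0, 0],
]
yes_box = (1, 1, 4, 4)
no_box = (5, 1, 8, 4)
yes_checked = checkbox_is_checked(page, yes_box)
no_checked = checkbox_is_checked(page, no_box)
```

In practice the box locations would come from the detected layout data rather than from hand-written coordinates.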
Re-OCRing Tricky Cells

The Infer Grid method also allows you to choose a column and apply a secondary OCR profile to the cells within that column. This is useful for tables that have specialized fonts for values filled inside the cells.
For example, the OCR-A font is not easily read by most modern OCR engines. However, Google's Tesseract OCR engine has some specialized functionality for the font. A document using a column like the one to the left could process most of the document, using an OCR profile that reads conventional fonts, including the column headers such as "Date". Then, the cells inside the grid, containing dates in the OCR-A font, could be reprocessed using another OCR profile that uses the Tesseract engine.
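The flow described above amounts to reading the whole page once, then re-reading only the cells of one column with a second reader. A minimal sketch of that routing follows. Everything in it is hypothetical: the cell dictionaries, the reocr_column function, and the stand-in reader, which simply returns canned text rather than calling a real OCR engine.

```python
def reocr_column(cells, column_name, secondary_ocr):
    """Return a copy of cells with one column re-read by secondary_ocr.

    cells: list of dicts with "column", "region" and "text" keys.
    secondary_ocr: callable that takes a cell region and returns text.
    """
    result = []
    for cell in cells:
        if cell["column"] == column_name:
            # Replace only the text; the original cell dict is left untouched.
            cell = {**cell, "text": secondary_ocr(cell["region"])}
        result.append(cell)
    return result


def stand_in_reader(region):
    """Placeholder for a secondary OCR profile; returns canned text."""
    canned = {"r1": "01/02/2020"}
    return canned.get(region, "")


# The first pass misread the digits in the specialized font.
cells = [
    {"column": "Date", "region": "r1", "text": "O1/O2/2O2O"},
    {"column": "Name", "region": "r2", "text": "Smith"},
]
fixed = reocr_column(cells, "Date", stand_in_reader)
```

Only the cells in the chosen column are re-read, so the rest of the table keeps the text from the primary OCR pass.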
How To
Creating a Data Table in Grooper
Before you begin
A Data Table is a Data Element used to model and extract a table's information on a document. Just like other Data Elements, such as Data Fields and Data Sections, Data Tables are created as children of a Data Model. This guide assumes you have created a Content Model with a Data Model.
We will use the table below as our example for creating a Data Table.

Navigate to a Data Model
Using the Node Tree on the left side of Grooper Design Studio, navigate to the Data Model you wish to add the Data Table to. Data Tables can be created as children of any Data Model at any hierarchy in a Content Model.
Add a Data Table
Right-click the Data Model object, mouse over "Add" and select "Data Table".
The following window will appear. Name the table whatever you would like and press "OK" when finished.

This creates a new Data Table object in the Node Tree underneath the Data Model.

Add Data Columns
Right-click the Data Table object, mouse over "Add" and select "Data Column".
This brings up the following window to name the Data Column. When finished, press "OK" to create the object.

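This creates a new Data Column object in the Node Tree underneath the Data Table.
Repeat Until Finished
Add as many columns as necessary to complete the table. For our example, we have a single Data Table with five Data Columns, each one named for the corresponding column on the document.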
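The parent and child relationships built in these steps can be pictured as a small tree. The sketch below is only a mental model written in Python. It is not a Grooper API, and the class names, method names, and the table and column names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class DataColumn:
    name: str


@dataclass
class DataTable:
    name: str
    columns: list = field(default_factory=list)

    def add_column(self, name):
        """Create a Data Column as a child of this Data Table."""
        column = DataColumn(name)
        self.columns.append(column)
        return column


@dataclass
class DataModel:
    name: str
    tables: list = field(default_factory=list)

    def add_table(self, name):
        """Create a Data Table as a child of this Data Model."""
        table = DataTable(name)
        self.tables.append(table)
        return table


# Hypothetical names, mirroring the steps: add a table, then its columns.
model = DataModel("Example Data Model")
table = model.add_table("Example Table")
for column_name in ["Column A", "Column B", "Column C"]:
    table.add_column(column_name)
```

The point of the model is the nesting: a Data Column belongs to a Data Table, and a Data Table belongs to a Data Model.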
Configure Infer Grid for OMR Checkboxes
A Data Table is a Data Element used to model and extract a table's information on a document. Just like other Data Elements, such as Data Fields and Data Sections, Data Tables are created as children of a Data Model. This guide assumes you have created a Content Model with a Data Model.
We will use the table below as our example. This is a mockup of a government form using OMR checkboxes to check off whether or not certain criteria listed in the "Description" column are met.
Add a Data Table
Create a Data Table with five Data Columns. The five columns for our example are "Operator Name", "Well Name", "Lease Number", "PC", and "Runs". Refer to the Creating a Data Table section above for more information on adding a Data Table to a Data Model.
Set the Extract Method
First, set the "Extract Method" property to "Infer Grid". (1) Select the Data Table object in the Node Tree, and (2) select the "Extract Method" property.
Using the dropdown list, select "Infer Grid".
Configure the Axis Extractor
The first step when configuring Infer Grid for any table is to configure the Axis Extractors. These
