2023:Grid Layout (Table Extract Method): Difference between revisions

From Grooper Wiki
{{AutoVersion}}
[[image:2023_Grid-Layout_01_About_01.png|frame]]


<onlyinclude>
<blockquote>{{#lst:Glossary|Grid Layout}}</blockquote>
<blockquote style="font-size:14pt">
''Infer Grid'' is one of many [[Table Extraction]] methods to extract data from tables on documents.  It uses the positional location of row and column headers to interpret where a tabular grid would be around each value in a table and extract values from each cell in the interpreted grid.
</blockquote>


{| class="wikitable" style="margin:left"
! Previous Versions
|-
|
[[Infer Grid (Table Extract Method) - 2.80]]
<br>
|}


This method extracts information by inferring a grid from the row and column header positions.  This is done by assigning an '''''X Axis Extractor''''' to match the column headers and a '''''Y Axis Extractor''''' to match the row headers.  A grid is created from the header positions returned by the two extractors.


Furthermore, if table line positions can be obtained from a Line Detection or Line Removal '''IP Command''', only the '''''X Axis Extractor''''' is needed.  In these cases, the '''''X Axis Extractor''''' can be used to find the column header labels, and the grid will be created using the table lines in the document's [[Layout Data]].  The raw text data obtained from the '''[[Recognize]]''' activity will populate each cell of the grid according to where it is on the page.
</onlyinclude>


{|cellpadding="10" cellspacing="5"  
{| class="fyi-box"
|-
|
|style="font-size:14pt; color:#f89420; border: 2px solid #f89420; width:40px"|[[File:Asset 22@4x.png]]
'''FYI'''
|style="border: 2px solid #f89420"|
|
In version 2021, '''''Grid Layout''''' replaced the '''''Infer Grid''''' table extract method.  Their logic and function are largely the same.  If you're looking for information on the now-deprecated '''''Infer Grid''''' method, [[2.80:Infer Grid (Table Extract Method)|visit this article]].
|}
 
{|class="download-box"
|
[[File:Asset 22@4x.png]]
|
You may download and import the file(s) below into your own Grooper environment (version 2023).  There is a '''Batch''' with the example document(s) discussed in this tutorial, as well as a '''Project''' configured according to its instructions.
<br>
<br>
Please upload the '''Project''' to your '''Grooper''' environment before uploading the '''Batch'''. This will allow the documents within the '''Batch''' to maintain their classification status.
* [[Media:Grid Layout - Project (v2023).zip]]
* [[Media:2023_Wiki_Grid-Layout_Project.zip]]
* [[Media:Grid Layout - Batch (v2023).zip]]
* [[Media:2023_Wiki_Grid-Layout_Batch.zip]]
|}
|}


== Use Cases ==
=== Non-Standard Tables ===
The Grid Layout method excels at many cases where the table structure is not easily understood by the Row Match or Header-Value methods.  This is especially true for tables with table lines present.  Examine the table below.


[[file:infer grid contact.png|800px]]
 


Row Match might work, but it would be a heavy lift.  Every row has a different pattern: names on one, addresses on another, phone numbers on another.  You could try to make a row out of the columns, but that would take some creative configuration and a series of extractors, and it would be effort-intensive and complicated to set up.
Header-Value would also have problems.  The column header labels ("Lender", "Mortgage Broker", etc.) would be straightforward, but the value extractors would be tricky.  It's possible a generic text segment extractor could get you close, but at least the "Address" row presents problems because it is a two-line value instead of a single line.  Again, it could be doable, but it would take some effort.


Grid Layout can do this job with a single extractor.  All you need to do is write an extractor to find the "X Axis", that is, all the column header labels in a row.


[[file:infer grid contact 2.png|800px]]


Since table lines are present, the text falling inside each cell (obtained via the [[Recognize]] activity) can be extracted to the corresponding cell in the column.
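
To illustrate the idea of text "falling inside" a cell, here is a minimal Python sketch.  It is not Grooper's implementation: the <code>Rect</code> and <code>Word</code> types, the <code>fill_cells</code> function, the coordinates, and the sample values are all hypothetical.  Only the "Lender" and "Mortgage Broker" labels come from the example table.  The sketch assigns each recognized word to the cell that contains the word's center point.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rect:
    """Axis-aligned rectangle in page coordinates (hypothetical units)."""
    left: float
    top: float
    right: float
    bottom: float

    def contains(self, x: float, y: float) -> bool:
        return self.left <= x < self.right and self.top <= y < self.bottom


@dataclass(frozen=True)
class Word:
    """A recognized word and its bounding box."""
    text: str
    box: Rect


def fill_cells(cells, words):
    """Map each (row, column) key to the text whose center lies in that cell."""
    collected = {key: [] for key in cells}
    for word in words:
        center_x = (word.box.left + word.box.right) / 2
        center_y = (word.box.top + word.box.bottom) / 2
        for key, rect in cells.items():
            if rect.contains(center_x, center_y):
                collected[key].append(word.text)
                break  # a word belongs to at most one cell
    return {key: " ".join(texts) for key, texts in collected.items()}


# Two cells from one row of a lined table (made-up coordinates).
cells = {
    (0, "Lender"): Rect(0, 0, 100, 20),
    (0, "Mortgage Broker"): Rect(100, 0, 200, 20),
}
# Made-up recognized words, listed in reading order.
words = [
    Word("Acme", Rect(5, 4, 40, 16)),
    Word("Bank", Rect(45, 4, 80, 16)),
    Word("Jane", Rect(105, 4, 140, 16)),
]
result = fill_cells(cells, words)
```

Using the center point rather than the full bounding box is a design choice of this sketch: it tolerates words that slightly overlap a table line.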


[[file:infer grid contact 3.png]]


Furthermore, if table lines are not present, Grid Layout can use both the row and column header labels through the "Y Axis Extractor" and "X Axis Extractor" properties.  We can use two extractors, one to return all the Y Axis labels and one to return the X Axis labels, and use their positions to infer the table's structure.


[[file:infer grid contact 4.png]]




[[file:infer grid contact 5.png]]


=== OMR Checkboxes ===
OMR stands for "Optical Mark Recognition".  It is a way to determine whether a checkbox on a document is marked.  If you think back to your grade school days and remember taking tests and filling in bubbles on an answer sheet, you already have experience with OMR!  Those answer sheets are fed through a machine that reads the "checkbox state" of the boxes, either filled in (checked) or not.  Many current documents use checkboxes to record a boolean response ("true or false" or "yes or no"), a multiple choice response, or other information.  Grooper uses OMR to read those checkbox states.


The Grid Layout method is the easiest way to read checkbox states inside a table.  Once the table's structure is found using the axis extractors, you can choose which columns contain checkboxes.  Grooper will use [[Layout Data]] obtained from a Box Detection or Box Removal IP Command to determine whether each box is filled in or left blank.  Refer to the [[#Configure Grid Layout for OMR Check Boxes|tutorial below]] for more information on how to configure this use.


{|style="margin:auto"
|Marking the "Farm" and "Simulator" columns as OMR Columns in the Grid Layout Property Panel will return a value of "True" if the box is checked and "False" if it is blank.
|-
|
[[file:infer grid omr.png]]
|}




[[file:infer grid ocr.png|left|150px]]
The Grid Layout method also allows you to choose a column and apply a secondary OCR profile to the cells within that column.  This is useful for tables that use specialized fonts for the values filled in the cells.


For example, the OCR-A font is not easily read by most modern OCR engines.  However, Google's Tesseract OCR engine has some specialized functionality for the font.  Most of a document with a column like the one to the left, including column headers such as "Date", could be processed with an OCR profile that reads conventional fonts.  Then the cells inside the grid, which contain dates in the OCR-A font, could be reprocessed with a second OCR profile that uses the Tesseract engine.


=== Configuring Grid Layout for Tables with Lines ===
<tabs style="margin:20px">
<tab name="Prereqs" style="margin:20px">
==== Before you begin ====
A Data Table is a Data Element used to model and extract a table's information on a document.  Just like other [[Data Element]]s, such as [[Data Field]]s and [[Data Section]]s, Data Tables are created as children of a [[Data Model]].  This guide assumes you have created a [[Content Model]] with a [[Data Model]].

We will use the table below as our example for creating a Data Table.


[[File:Simpletable.png]]


</tab>
<tab name="Step 1" style="margin:20px">
==== Navigate to a Data Model ====
Using the [[Node Tree]] on the left side of Grooper Design Studio, navigate to the [[Data Model]] you wish to add the Data Table to.  Data Tables can be created as children of any Data Model at any hierarchy in a Content Model.
[[file:2023_Grid-Layout_02_How-To_01_Table-with-Lines_01.png]]
==== Add a Data Table ====
Right click the [[Data Model]] object, mouse over "Add" and select "Data Table"
[[file:2023_Grid-Layout_02_How-To_01_Table-with-Lines_02.png]]
The following window will appear.  Name the table whatever you would like and press "OK" when finished.


[[file:2023_Grid-Layout_02_How-To_01_Table-with-Lines_03.png]]
 
 
This creates a new Data Table object in the [[Node Tree]] underneath the [[Data Model]].
 
[[file:2023_Grid-Layout_02_How-To_01_Table-with-Lines_04.png]]
 
==== Add Data Columns ====
Right click the Data Table object, mouse over "Add" and select "Data Column"
 
[[File:2023_Grid-Layout_02_How-To_01_Table-with-Lines_05.png]]




[[file:2023_Grid-Layout_02_How-To_01_Table-with-Lines_01.png|900px]]
This brings up the following window to name the Data Column.  When finished, press "OK" to create the object.


</tab>
[[file:2023_Grid-Layout_02_How-To_01_Table-with-Lines_06.png]]
<tab name="Step 2" style="margin:20px">


This creates a new Data Column object in the [[Node Tree]] underneath the Data Table.


[[file:2023_Grid-Layout_02_How-To_01_Table-with-Lines_02.png|900px]]
[[file:2023_Grid-Layout_02_How-To_01_Table-with-Lines_07.png]]


==== Repeat Until Finished ====
Add as many columns as necessary to complete the table.  For our example, we have a single Data Table with five Data Columns, each one named for the corresponding column on the document.


[[file:2023_Grid-Layout_02_How-To_01_Table-with-Lines_08.png]]


==== Configure Extract Method ====
With the '''Data Table''' and its child '''Data Columns''' created, it is now time to configure extraction for this table.  We will begin by configuring the '''''Extract Method''''' for this '''Data Table'''.


[[file:2023_Grid-Layout_02_How-To_01_Table-with-Lines_03.png|center]]


# With the appropriate Data Columns added, select the parent Data Table.
# Click the drop-down for the '''''Extract Method''''' property.
# Select the ''Grid Layout'' option.


[[file:2023_Grid-Layout_02_How-To_01_Table-with-Lines_09.png]]




[[file:2023_Grid-Layout_02_How-To_01_Table-with-Lines_04.png|center|900px]]
# Click the arrow to expand the sub-properties of the '''''Extract Method''''' property.
# Click the drop-down for the '''''X Axis Extractor''''' property.
# Select the Pattern Match option in the drop-down menu.


<span style="font-size:125%">'''[[#Configuring Grid Layout for Tables with Lines|Back to top to continue to next tab]]'''</span>
[[file:2023_Grid-Layout_02_How-To_01_Table-with-Lines_10.png]]
</tab>
<tab name="Step 3" style="margin:20px">


* Click the ellipsis button to bring up the pattern editor dialog box.


[[File:2023_Grid-Layout_02_How-To_01_Table-with-Lines_05.png|900px]]
[[file:2023_Grid-Layout_02_How-To_01_Table-with-Lines_11.png]]


==== Configure Pattern for X-Axis Extractor ====
We now need to write a pattern that defines the horizontal dimension of our table's "grid".  The easiest way to do this is to return a single result that contains all the column headers of our table.  Furthermore, that result needs to be subdivided into sub-instances whose names exactly match the names of their respective '''Data Columns'''.




* Set the pattern to match the headers seen on the document, being careful to put a space after each entry before returning the line.


[[file:2023_Grid-Layout_02_How-To_01_Table-with-Lines_06.png|center]]
[[file:2023_Grid-Layout_02_How-To_01_Table-with-Lines_12.png]]




# Highlight "Order Date", being careful to not also select the white space. Right-click the highlighted text ...
# ... select the "Create Group" option in the sub-menu. You can also use the hotkey "Ctrl+G"


[[file:2023_Grid-Layout_02_How-To_01_Table-with-Lines_13.png]]


[[file:2023_Grid-Layout_02_How-To_01_Table-with-Lines_07.png|900px]]


</tab>
* This will wrap the highlighted text in parentheses that begin with <code>?<></code>.  It will also place the cursor inside the <code><></code> so you can name the group.  Be sure to replace spaces with <code>_</code>.  Apart from that substitution, the name of the group should match the name of the desired '''Data Column''' exactly.
<tab name="Step 4" style="margin:20px">
[[file:2023_Grid-Layout_02_How-To_01_Table-with-Lines_14.png]]




[[file:2023_Grid-Layout_02_How-To_01_Table-with-Lines_08.png|900px]]
# Repeat the group naming process for the remaining parts of the pattern.
# Click "OK" when done.
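
The effect of the named groups can be sketched outside of Grooper.  The Python example below is only an illustration under stated assumptions: Python writes named groups as <code>(?P<Name>...)</code> rather than the <code>(?<Name>...)</code> form produced by the pattern editor, the <code>header_instances</code> helper is hypothetical, and "Item Number" and "Quantity" are made-up column names (only "Order Date" comes from this tutorial).  It shows one match being split into one positioned sub-instance per column header, with underscores in group names mapped back to spaces.

```python
import re

# One pattern matching the whole header row, one named group per column.
# Group names cannot contain spaces, so "_" stands in for " ".
HEADER_PATTERN = re.compile(
    r"(?P<Order_Date>Order Date) "
    r"(?P<Item_Number>Item Number) "   # hypothetical column
    r"(?P<Quantity>Quantity)"          # hypothetical column
)


def header_instances(line):
    """Return {column name: (start, end)} for each named group in the match."""
    match = HEADER_PATTERN.search(line)
    if match is None:
        return {}
    return {
        name.replace("_", " "): match.span(name)
        for name in match.groupdict()
    }


spans = header_instances("Order Date Item Number Quantity")
```

Each entry in <code>spans</code> pairs a column name with the character range of its header, which is the kind of per-column position information the grid needs.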


</tab>
[[file:2023_Grid-Layout_02_How-To_01_Table-with-Lines_15.png]]


<tab name="Step 5" style="margin:20px">
==== Test Table Extraction Results ====
The '''Data Table''' and its child '''Data Columns''' have been created, and an X-Axis (horizontal) extractor has been configured.  Provided the target document has been '''Recognized''' and its layout data contains line information, the combination of the lines on the document and the horizontal extractor should be enough to return results.  We will now test this.


Follow the instructions in the screenshots below.


<br>
# Click the "Save" button to save changes made.
# Click on the "Tester" tab.


[[file:2023_Grid-Layout_02_How-To_01_Table-with-Lines_09.png|900px]]
[[file:2023_Grid-Layout_02_How-To_01_Table-with-Lines_16.png]]


<br>


[[file:2023_Grid-Layout_02_How-To_01_Table-with-Lines_10.png|900px]]
# Click the "Test" button.
# Because this document has layout data, the grid will be created from the lines on the page.


<br>
[[file:2023_Grid-Layout_02_How-To_01_Table-with-Lines_17.png]]


[[file:2023_Grid-Layout_02_How-To_01_Table-with-Lines_11.png|900px]]
=== Configuring Grid Layout for Tables without Lines ===
</tab>


</tabs>
==== Before you begin ====
We will now take a look at a table very similar to the one we previously worked with.  The key difference is that this table has no lines defining its structure.  We as humans can easily intuit its structure, but without the lines present '''Grooper''' will need more guidance to understand it.


If a table has lines, the X-Axis, or "horizontal", extractor of our ''Grid Layout'' approach essentially defines the upper boundary of the table, and line information is then used to draw the table's grid.  Without the lines, however, an X-Axis extractor is not enough.  How will '''Grooper''' understand the structure of the table without the lines?


<tabs style="margin:20px">
We will configure what is referred to as a "Y-Axis", or "vertical", extractor.  This allows '''Grooper''' to "infer the grid" of the table by taking the bounds of the results returned by the two extractors and drawing the grid based on where they intersect.
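
The intersection idea can be sketched in a few lines of Python.  This is a simplified illustration, not Grooper's algorithm: the <code>edges</code> and <code>infer_grid</code> functions, the coordinates, and the "Description" column are hypothetical, and the sketch assumes each column runs from its header's left edge to the next header's left edge and each row runs from one Y-Axis hit to the next.

```python
def edges(starts, end):
    """Pair each start coordinate with the next one; the last span runs to `end`."""
    return list(zip(starts, starts[1:] + [end]))


def infer_grid(column_headers, row_tops, table_right, table_bottom):
    """Build {(row index, column name): (left, top, right, bottom)}.

    column_headers: (name, left edge) pairs sorted left to right (X axis hits).
    row_tops: top edges of the Y axis hits, sorted top to bottom.
    """
    names = [name for name, _ in column_headers]
    column_spans = edges([left for _, left in column_headers], table_right)
    row_spans = edges(row_tops, table_bottom)
    grid = {}
    for row_index, (top, bottom) in enumerate(row_spans):
        for name, (left, right) in zip(names, column_spans):
            grid[(row_index, name)] = (left, top, right, bottom)
    return grid


grid = infer_grid(
    column_headers=[("Order Date", 10), ("Description", 120)],  # made-up positions
    row_tops=[50, 80],
    table_right=300,
    table_bottom=110,
)
```

Every cell rectangle is the cross product of one column span and one row span, which is why both extractors are needed when no table lines are available.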
<tab name="Prereqs" style="margin:20px">
{|cellpadding="10" cellspacing="5"
|-style="background-color:#f89420; color:white"
|style="font-size:14pt"|'''!'''||Some of the tabs in this tutorial are longer than the others.  Please scroll to the bottom of each step's tab before going to the step.
|}


[[file:2023_Grid-Layout_02_How-To_02_Table-without-Lines_00.png]]


==== Configuring Table for Grid Layout ====
</tab>
First things first we need a '''Data Table''' with appropriate child '''Data Columns'''. We then need to set the '''''Extract Method''''' and set our '''''X-Axis Extractor'''''.
<tab name="Step 1" style="margin:20px">
# The "Table without Lines" setup is very similar to the setup of the "Table with Lines".
# The '''''Extract Method''''' is set to ''Grid Layout''.
# The '''''X Axis Extractor''''' property is set to ''Pattern Match''.


[[file:2023_Grid-Layout_02_How-To_02_Table-without-Lines_01.png]]


==== Configure Pattern for X-Axis Extractor ====
Next we will edit the pattern of our X-Axis Extractor to find the column headers of our table and return the results in named sub-groups that match our '''Data Columns'''.


* The pattern targeting the table's headers is the same as before, with named groups that match the '''Data Columns'''.


[[file:2023_Grid-Layout_02_How-To_02_Table-without-Lines_02.png]]


[[file:box remv 1.png|900px]]
==== Create Extractor for Y-Axis Extractor ====
We will now configure our '''''Y-Axis Extractor'''''.  This requires a little more logic than a local pattern on the property allows, so we will leverage a '''Data Type'''.




[[file:box remv 2.png|900px]]
# To "define the grid" of a table without lines, we create an extractor that helps define the table's structure.  If the '''''X Axis Extractor''''' sets the horizontal dimensions, an extractor on the Y axis defines the vertical.
# A regular expression pattern is used to find a value that occurs on every row of the table.  In this case the pattern is: <code>[0-9]{1,2}/[0-9]{1,2}/[0-9]{4}</code>
# The '''''Collation''''' is set to ''Array'' because the '''''Y Axis Extractor''''' needs to return one result.  This one result will be divided into sub-instances that define the vertical structure of the table.
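
As a rough illustration of "one result divided into sub-instances", the Python sketch below applies the same date pattern to a few made-up lines of text.  It does not reproduce Grooper's ''Array'' collation; the <code>collate_as_array</code> function, the result layout, and the sample rows are assumptions made for this example.

```python
import re

# The date pattern from the step above.
DATE_PATTERN = re.compile(r"[0-9]{1,2}/[0-9]{1,2}/[0-9]{4}")


def collate_as_array(page_lines):
    """Gather every date hit into a single result with one sub-instance per hit.

    Each sub-instance is (line index, matched text); the line index stands in
    for the vertical position of a table row.
    """
    instances = []
    for line_index, line in enumerate(page_lines):
        for match in DATE_PATTERN.finditer(line):
            instances.append((line_index, match.group()))
    return {
        "value": " ".join(text for _, text in instances),
        "instances": instances,
    }


# Made-up page text: a header line followed by two table rows.
page = [
    "Order Date  Item",
    "1/5/2023    Widget",
    "12/14/2023  Gadget",
]
result = collate_as_array(page)
```

Because a date occurs exactly once per row, the sub-instances line up one-to-one with the rows of the table.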


[[file:2023_Grid-Layout_02_How-To_02_Table-without-Lines_03.png]]


[[file:box remv 3.png|900px]]


* Upon inspecting the result, we can observe the desired output. A single result with an array of sub-elements.


[[file:box remv 4.png|900px]]
[[file:2023_Grid-Layout_02_How-To_02_Table-without-Lines_04.png]]


==== Complete Table Configuration and Test Results ====
With the Y-Axis extractor made, we now need to plug it into our '''Data Table''' and test the results.




# To finish setting up our Grid Layout for a table without lines, select the '''Data Table'''...
# ... set the '''''Y Axis Extractor''''' to ''Reference'' ...
# point the '''''Extractor''''' property to the "Array" '''Data Type''' ...
# ... set the '''''Header Column''''' property to the "Order Date" column. The values of this column are being leveraged to define the structure of the grid, so this property is set to allow the values to be included in extraction.


[[file:box remv 5.png|900px]]
[[file:2023_Grid-Layout_02_How-To_02_Table-without-Lines_05.png]]




</tab>
# Click on the "Tester" tab ...
<tab name="Step 2" style="margin:20px">
# ... click the "Test" button to test extraction ...
# ... and return results.
# However, because there are no lines in the layout data file, the grid is created from the bounds of the results of the X and Y axis extractors.  This demonstrates a shortcoming of this extraction technique.


[[file:2023_Grid-Layout_02_How-To_02_Table-without-Lines_06.png]]


=== Configure Grid Layout for OMR Check Boxes ===
A Data Table is a Data Element used to model and extract a table's information on a document.  Just like other [[Data Element]]s, such as [[Data Field]]s and [[Data Section]]s, Data Tables are created as children of a [[Data Model]].  This guide assumes you have created a [[Content Model]] with a [[Data Model]].


[[file:infer grid omr ex 1.png|900px]]
We will use the table below as our example.  This is a mockup of a government form that uses OMR checkboxes to indicate whether the criteria listed in the "Description" column are met.


</tab>
[[file:infer grid omr.png]]
<tab name="Step 3" style="margin:20px">
==== Obtain the Document's Layout Data ====
This method heavily relies on [[Layout Data]] in order to work. Before we can use Grid Layout to extract this table's information, we need to know the table's line positions and checkbox states.


That means we will need to do some image processing using the following IP Commands


[[file:infer grid omr ex 2.png|900px]]
# Line Removal or Line Detection
# Box Removal or Box Detection


</tab>
You can learn more about image processing for tables by [[Header-Value (Table Extract Method)#Image Processing for Tables|visiting this article]].  However, it does not discuss Box Detection as it relates specifically to this use case.
<tab name="Step 4" style="margin:20px">
 
The Box Detection and Box Removal commands use Optical Mark Recognition (OMR) to determine whether a box is checked.  They function similarly to Line Detection and Line Removal in that they also look for lines.  After all, a box is made of lines.  The Box Detection command is configured to only look at boxes of a certain size, in order to avoid "seeing" larger boxes as checkboxes.  If something is "seen" inside the box (through Grooper's "blob detection"), its checkbox state is "True"; if not, it is "False".  Both the box's location on the page and its checkbox state are stored in the "LayoutData.json" file (along with lines detected by the Line Detection or Line Removal command).
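
To make the "something is seen inside the box" idea concrete, here is a deliberately simple Python sketch.  It is not Grooper's blob detection: the <code>checkbox_state</code> function, the 0/1 pixel representation, and the 10% threshold are all assumptions chosen for illustration.  It treats a box as checked when enough of its interior pixels are dark.

```python
def checkbox_state(interior, threshold=0.1):
    """Return True when the share of dark pixels inside the box meets the threshold.

    interior: rows of 0/1 values for the area inside the box outline (1 = dark).
    threshold: hypothetical minimum dark-pixel ratio that counts as a mark.
    """
    total = sum(len(row) for row in interior)
    if total == 0:
        return False  # nothing to inspect, treat as unchecked
    dark = sum(sum(row) for row in interior)
    return dark / total >= threshold


# A blank box interior and one marked with an "X" (made-up 4x4 samples).
blank = [
    [0, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]
marked = [
    [1, 0, 0, 1],
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [1, 0, 0, 1],
]
blank_state = checkbox_state(blank)
marked_state = checkbox_state(marked)
```

A ratio threshold, rather than "any dark pixel", keeps a stray speck of scanner noise from flipping an empty box to checked.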


Below, see the Box Removal IP Command in an IP Profile.


[[file:2023_Grid-Layout_02_How-To_03_OMR_01.png]]




[[file:infer grid omr ex 3.png|900px]]
[[file:2023_Grid-Layout_02_How-To_03_OMR_02.png]]




These headers can be found with a simple Internal pattern (Although if your documents are more complicated, you can use a Reference to an extractor created in the Node Tree).  Expand the "X Axis Extractor" property.  Select the "Type" property and choose "Internal" from the dropdown list.
[[file:2023_Grid-Layout_02_How-To_03_OMR_03.png]]




[[file:infer grid omr ex 4.png|900px]]
[[file:2023_Grid-Layout_02_How-To_03_OMR_04.png]]




Select the "Pattern" property and press the ellipsis button at the end to bring up the Pattern Editor.
In this case, we are using ''both'' Line Removal and Box Removal.  Be careful about the order in which your IP Commands run.  Boxes are made of lines.  If Line Removal runs ''before'' Box Removal, you run the risk of removing all or part of those box lines.  Box Removal should always run before Line Removal in an IP Profile.


[[file:2023_Grid-Layout_02_How-To_03_OMR_05.png]]


[[file:infer grid omr ex 5.png|900px]]
==== Add a Data Table ====
Create a Data Table with three Data Columns.  The three columns for our example are "Description", "Farm", and "Simulator".  Refer to the [[#Configuring Grid Layout for Tables with Lines|Configuring Grid Layout for Tables with Lines]] section above for more information on adding a Data Table to a Data Model.


[[file:2023_Grid-Layout_02_How-To_03_OMR_06.png]]


We will write a single pattern to match each column header.  The following pattern will work just fine.
==== Set the Extract Method ====
First, set the "Extract Method" property to "Grid Layout".  (1) Select the Data Table object in the [[Node Tree]].  (2) Select the "Extract Method" property.  (3) Using the dropdown list, select "Grid Layout".


description\t
[[file:2023_Grid-Layout_02_How-To_03_OMR_07.png]]
farm\t
simulator


{|cellpadding="10" cellspacing="5"
==== Configure the Axis Extractor ====
|-style="background-color:#f89420; color:white"
The first step when configuring Grid Layout for any table is to configure the Axis Extractors.  These are extractors written to locate the column and row header label locations.  Once these locations are known, Grid Layout can interpret a grid structure where it expects to find each cell in the table.
|style="font-size:14pt"|'''!'''||This pattern uses tabs as anchors between column header label.  Don't forget to turn on "Tab Marking" in the "Properties" tab. It is found by expanding the "Preprocessing Options" property.
|}


For our document, our table uses lines to divide its rows and columns into bounded cells.  Because of this, we can get away with only using a single axis.  We will use the "X Axis Extractor" property to locate the headers "Description", "Farm", and "Simulator".


[[file:infer grid omr ex 6.png|900px]]
[[file:2023_Grid-Layout_02_How-To_03_OMR_08.png]]




We've returned all our header labels smashed together as a single result.  But Infer Grid needs individual ''instances'' of each header.  Just like we did in the [[Row Match (Table Extract Method)#Using Row Match with Named Groups|Using Row Match with Named Groups]] tutorial, we will use named groups to create instances.  This time, we will create instances for each column label, from which Infer Grid will use their positions on the document to create a grid.
These headers can be found with a simple Internal pattern (although if your documents are more complicated, you can use a Reference to an extractor created in the Node Tree).  Expand the "X Axis Extractor" property.  Select the "Type" property and choose "Internal" from the dropdown list.


Select each label (without the tab character) in the Value Editor and make a named group out of each one.  You can either right click the selection and choose "Create Group" option or use the <code>Ctrl + G</code> hotkey on your keyboard.
[[file:2023_Grid-Layout_02_How-To_03_OMR_09.png]]


{|cellpadding="10" cellspacing="5"
|-style="background-color:#f89420; color:white"
|style="font-size:14pt"|'''!'''||Remember to name the groups the same as their corresponding Data Column.  That way the instances results will populate the correct Data Column in the Data Table.
|}


That will make the full regex the pattern below.
Select the "X-Axis Extractor" property and press the ellipsis button at the end to bring up the Pattern Editor.


(?<Description>description)\t
[[file:2023_Grid-Layout_02_How-To_03_OMR_10.png]]
(?<Farm>farm)\t
(?<Simulator>simulator)




[[file:infer grid omr ex 7.png|900px]]
We will write a single pattern to match each column header.  The following pattern will work just fine.


description
farm
simulator


However, notice this table is actually split into two tables side by side.  Our pattern only matches the headers on the left side and not the right.
{|class="attn-box"
|
&#9888;
|
Take note of the spaces after "description" and "farm".
|}




[[file:infer grid omr ex 8.png|900px]]
[[file:2023_Grid-Layout_02_How-To_03_OMR_11.png]]




If we switch over to the "Text" tab, we can easily identify the problem.
We've returned all our header labels smashed together as a single result.  But Grid Layout needs individual ''instances'' of each header.  Just like we did in the [[Row Match (Table Extract Method)#Using Row Match with Named Groups|Using Row Match with Named Groups]] tutorial, we will use named groups to create instances.  This time, we will create instances for each column label, from which Grid Layout will use their positions on the document to create a grid.


Select each label (without the trailing space) in the Value Editor and make a named group out of each one.  You can either right click the selection and choose the "Create Group" option or use the <code>Ctrl + G</code> hotkey on your keyboard.


[[file:infer grid omr ex 9.png|900px]]
{|class="attn-box"
|
&#9888;
|
Remember to name the groups the same as their corresponding Data Column.  That way the instances results will populate the correct Data Column in the Data Table. Be sure to NOT include the white spaces in the named groups.
|}


That will make the full regex the pattern below.


We can easily resolve this issue by using fuzzy matching. Switch to the "Properties" tab and change the "Mode" property from "RegEx" to "FuzzyRegEx".
  (?<Description>description)
(?<Farm>farm)
(?<Simulator>simulator)




[[file:infer grid omr ex 10.png|900px]]
[[file:2023_Grid-Layout_02_How-To_03_OMR_12.png]]




Press the "OK" button to exit the Pattern Editor.


At this point, if we test extraction, we can see part of Infer Grid in action.  All of the text inside the Description column's cells on the page is extracted, populating the cells of the Description Data Column.  With the X Axis Extractor we created, Infer Grid is able to establish where the column headers are on the page.  Then, it uses the line positions obtained from a Line Detection or Line Removal IP Command to establish the table's structure, mapping out each cell according to its line boundaries.
At this point, if we test extraction, we can see part of Grid Layout in action.  All of the text inside the Description column's cells on the page is extracted, populating the cells of the Description Data Column.  With the X Axis Extractor we created, Grid Layout is able to establish where the column headers are on the page.  Then, it uses the line positions obtained from a Line Detection or Line Removal IP Command to establish the table's structure, mapping out each cell according to its line boundaries.


Also notice that even though this table starts on one half of the page and then continues on the second half, the data is extracted as if it were a single table.


 
[[file:2023_Grid-Layout_02_How-To_03_OMR_13.png]]
[[file:infer grid omr ex 11.png|900px]]




However, we don't have any information for the "Farm" and "Simulator" columns.  Those are blank or possibly picking up some errant OCR data.


</tab>
<tab name="Step 5" style="margin:20px">
==== Set the OMR Columns ====
In order to read the checkbox states, all we need to do is tell Grid Layout they are present in those columns.  This is done using the "OMR Columns" property. 


In order to read the checkbox states, all we need to do is tell Infer Grid they are present in those columns.  This is done using the "OMR Columns" property. 


Select the "OMR Columns" property, and expand the dropdown menu.  This will pop up a list of all the Data Columns in the Data Table.  Simply check the boxes by the columns that contain checkboxes, in this case "Farm" and "Simulator".


[[file:2023_Grid-Layout_02_How-To_03_OMR_14.png]]


[[file:infer grid omr ex 12.png|900px]]


For each cell in the selected columns, Grid Layout will look at the [[Layout Data]] obtained by a Box Detection or Box Removal IP Command to see if a mark was detected for the box.  If a mark was detected, that means the box is checked and it is assigned the value "True".  If not, that cell is assigned the value "False".


For each cell in the selected columns, Infer Grid will look at the [[Layout Data]] obtained by a Box Detection or Box Removal IP Command to see if a mark was detected for the box.  If a mark was detected, that means the box is checked and it is assigned the value "True".  If not, that cell is assigned the value "False".


Press the "Test Extraction" button to see our results.


 
[[file:2023_Grid-Layout_02_How-To_03_OMR_15.png]]
[[file:infer grid omr ex 13.png|900px]]
 
</tab>
</tabs>
 
[[Category:Articles]]
[[Category:Version 2023]]

Latest revision as of 14:40, 21 November 2024

This article is about an older version of Grooper.

Information may be out of date and UI elements may have changed.


The Grid Layout Table Extract Method uses the positional location of row and column headers to interpret where a tabular grid would be around each value in a table and extract values from each cell in the interpreted grid.


This method extracts information by inferring a grid from the row and column header positions. This is done by assigning an X Axis Extractor to match the column headers and a Y Axis Extractor to match the row headers. A grid is created from the header positions returned by the two extractors.

Furthermore, if table line positions can be obtained from a Line Detection or Line Removal IP Command, only the X Axis Extractor is needed. In these cases, the X Axis Extractor is used to find the column header labels, and the grid is created using the table lines in the document's Layout Data. The raw text data obtained from the Recognize activity will populate each cell of the grid according to where it is on the page.
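
The idea of inferring a grid from header positions can be illustrated outside of Grooper with a short Python sketch. This is not Grooper's implementation. The <code>Box</code> class, the function names, and the sample coordinates are all hypothetical, and the sketch simplifies by giving each cell exactly the width of its column header and the height of its row header.

```python
from dataclasses import dataclass


@dataclass
class Box:
    """Axis-aligned rectangle in page coordinates (hypothetical units)."""
    left: float
    top: float
    right: float
    bottom: float

    @property
    def center(self):
        return ((self.left + self.right) / 2, (self.top + self.bottom) / 2)


def infer_grid(column_headers, row_headers):
    """Build one cell per (row, column) pair.

    Each cell takes its horizontal extent from a column header
    and its vertical extent from a row header.
    """
    grid = {}
    for row_name, row_box in row_headers.items():
        for col_name, col_box in column_headers.items():
            grid[(row_name, col_name)] = Box(
                col_box.left, row_box.top, col_box.right, row_box.bottom
            )
    return grid


def fill_cells(grid, words):
    """Assign each (text, Box) word to the cell that contains its center."""
    cells = {key: [] for key in grid}
    for text, box in words:
        x, y = box.center
        for key, cell in grid.items():
            if cell.left <= x <= cell.right and cell.top <= y <= cell.bottom:
                cells[key].append(text)
                break
    return {key: " ".join(parts) for key, parts in cells.items()}


# Made-up header and word positions for a two-column, two-row table.
column_headers = {"Name": Box(0, 0, 100, 20), "Phone": Box(100, 0, 200, 20)}
row_headers = {"r1": Box(0, 20, 40, 40), "r2": Box(0, 40, 40, 60)}
words = [
    ("Alice", Box(10, 22, 50, 38)),
    ("555-0100", Box(110, 22, 180, 38)),
    ("Bob", Box(10, 42, 40, 58)),
]
table = fill_cells(infer_grid(column_headers, row_headers), words)
```

When table lines are available, the cell rectangles would come from the detected lines instead of the row headers, but the step of dropping recognized text into the cell that contains it stays the same.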

FYI

In version 2021, Grid Layout replaced the Infer Grid table extract method. Their logic and function are largely the same. If you're looking for information on the now deprecated Infer Grid method, visit this article.

You may download and import the file(s) below into your own Grooper environment (version 2023). There is a Batch with the example document(s) discussed in this tutorial, as well as a Project configured according to its instructions.
Please upload the Project to your Grooper environment before uploading the Batch. This will allow the documents within the Batch to maintain their classification status.

Use Cases

Non-Standard Tables

The Grid Layout method excels at many cases where the table structure is not easily understood by the Row Match or Header-Value methods. This is especially true for tables with table lines present. Examine the table below.

Row Match might work, but it would be a heavy lift. Each row's pattern is different: there are names on one, addresses on another, and phone numbers on another. You could try to make a row out of the columns, but that would take a series of extractors and would be effort-intensive and complicated to set up.

Header-Value would also have problems. The column header labels ("Lender", "Mortgage Broker", etc), would be straightforward. But the value extractors would be tricky. It's possible a generic text segment extractor could get you close, but at least the "Address" row presents problems because it is a two line value instead of a single line. Again, it could be doable, but it would take some effort.

Grid Layout can do this job with a single extractor. All you would need to do is write an extractor to find the "X Axis", that is, all the column header labels in a row.

Since table lines are present, the text falling inside each cell (obtained via the Recognize activity) could be extracted to the corresponding cell in the column.

Furthermore, if table lines are not present, Grid Layout can use both the row and column header labels by using both the "Y Axis Extractor" and "X Axis Extractor" properties. We can use two extractors, one to return all the Y Axis labels and one to return the X Axis labels, and use their positions to infer the table's structure.


OMR Checkboxes

OMR stands for "Optical Mark Recognition". It is a way to determine if a checkbox is marked or not on a document. If you think back to your grade school days and remember taking tests and filling in bubbles on an answer sheet, you already have experience with OMR! Those answer sheets are fed through a machine that reads the "checkbox state" of the boxes, either filled in (checked) or not. There are many examples of current documents where checkboxes are used to record a boolean response ("true or false" or "yes or no"), a multiple choice response, or other information. Grooper uses OMR to read those checkbox states.

The Grid Layout method is the easiest way to read checkbox states inside a table. Once the table's structure is found using the axis extractors, you can choose which columns contain checkboxes. Grooper will use Layout Data obtained from a Box Detection or Box Removal IP Command to determine if the box is filled in or left blank. Refer to the tutorial below for more information on how to configure this use.

Marking the "Farm" and "Simulator" columns as OMR Columns in the Grid Layout Property Panel will return a value of "True" if the box is checked and "False" if it is blank.

Re-OCRing Tricky Cells

The Grid Layout method also allows you to choose a column and apply a secondary OCR profile to the cells within that column. This is useful for tables that have specialized fonts for values filled inside the cells.

For example, the OCR-A font is not easily read by most modern OCR engines. However, Google's Tesseract OCR engine has some specialized functionality for the font. A document using a column like the one to the left could process most of the document, using an OCR profile that reads conventional fonts, including the column headers such as "Date". Then, the cells inside the grid, containing dates in the OCR-A font, could be reprocessed using another OCR profile that uses the Tesseract engine.


How To

Configuring Grid Layout for Tables with Lines

Before you begin

A Data Table is a Data Element used to model and extract a table's information on a document. Just like other Data Elements, such as Data Fields and Data Sections, Data Tables are created as children of a Data Model. This guide assumes you have created a Content Model with a Data Model.

We will use the table below as our example for creating a Data Table.

Navigate to a Data Model

Using the Node Tree on the left side of Grooper Design Studio, navigate to the Data Model you wish to add the Data Table to. Data Tables can be created as children of any Data Model at any hierarchy in a Content Model.

Add a Data Table

Right click the Data Model object, mouse over "Add" and select "Data Table"


The following window will appear. Name the table whatever you would like and press "OK" when finished.


This creates a new Data Table object in the Node Tree underneath the Data Model.

Add Data Columns

Right click the Data Table object, mouse over "Add" and select "Data Column"


This brings up the following window to name the Data Column. When finished, press "OK" to create the object.


This creates a new Data Column object in the Node Tree underneath the Data Table.

Repeat Until Finished

Add as many columns as necessary to complete the table. For our example, we have a single Data Table with five Data Columns, each one named for the corresponding column on the document.

Configure Extract Method

With the Data Table and its child Data Columns created, it is now time to configure extraction for this table. We will begin by configuring the Extract Method for this Data Table.


  1. With the appropriate Data Columns added, select the parent Data Table.
  2. Click the drop-down for the Extract Method property.
  3. Select the Grid Layout option.


  1. Click the arrow to expand the sub-properties of the Extract Method property.
  2. Click the drop-down for the X Axis Extractor property.
  3. Select the Pattern Match option in the drop-down menu.


  • Click the ellipsis button to bring up the pattern editor dialog box.

Configure Pattern for X-Axis Extractor

We now need to write a pattern that will act to define the horizontal definition of our table's "grid". The easiest way to do this is to return a result that contains all the column headers of our table. Furthermore, the result returned needs to be subdivided into sub-instances that will match, exactly, the names of their respective Data Columns.


  • Set the pattern to match the headers seen on the document, being careful to put a space after each entry before returning the line.


  1. Highlight "Order Date", being careful to not also select the white space. Right-click the highlighted text ...
  2. ... select the "Create Group" option in the sub-menu. You can also use the hotkey "Ctrl+G"


  • This will put the highlighted text inside parentheses with ?<>. It will also place the cursor inside the <> and allow you to name the group. Be sure to replace spaces with _. The name of the group should match the name of the desired Data Column exactly.


  1. Repeat the group naming process for the remaining parts of the pattern.
  2. Click "OK" when done.

Test Table Extraction Results

The Data Table and its child Data Columns have been created. Given that the target document has been Recognized and has Layout Data containing line information, an X-Axis (horizontal) extractor has been created. The combination of the lines on the document and the appropriately configured horizontal extractor should be enough to return results. This will now be tested.


  1. Click the "Save" button to save changes made.
  2. Click on the "Tester" tab.


  1. Click the "Test" button.
  2. Because this document has Layout Data, the grid will be created from the lines on the page.

Configuring Grid Layout for Tables without Lines

Before you begin

We will now take a look at a very similar table as what we previously worked with. The key difference with this table is that there are no lines defining the structure of the table. We as humans can easily intuit its structure, but without the lines present Grooper will need more guidance to understand the table's structure.

If a table has lines, the X-Axis, or "horizontal", extractor of our Grid Layout approach will essentially define the upper boundary of our table and use line information to then draw the grid of our table. Without the lines, however, an X-Axis extractor will not be enough. How will Grooper understand the structure of the table without the lines?

We will configure what's referred to as a "Y-Axis", or "vertical", extractor to allow Grooper to "infer the grid" of the table by taking the bounds of the returned results of these two extractors and drawing said grid based on the intersection of the two.


Configuring Table for Grid Layout

First things first we need a Data Table with appropriate child Data Columns. We then need to set the Extract Method and set our X-Axis Extractor.


  1. The "Table without Lines" setup is very similar to the setup of the "Table with Lines".
  2. The Extract Method is set to Grid Layout.
  3. The X Axis Extractor property is set to Pattern Match.

Configure Pattern for X-Axis Extractor

Next we will edit the pattern of our X-Axis Extractor to find the column headers of our table and return the results in named sub-groups that match our Data Columns.


  • The pattern targeting the headers of the table is the same as before, with named groups to match the Data Columns.

Create Extractor for Y-Axis Extractor

We will now configure our Y-Axis Extractor. This will require a little more logic than a local pattern on the property will allow, so we will leverage a Data Type.


  1. In order to "define the grid" of the table without lines, an extractor that helps define the structure of the table is made. If an X Axis Extractor sets the horizontal dimensions, an extractor on the Y axis will define the vertical.
  2. A regex pattern is used to find a value that will occur on every row of the table. In this case the pattern is: [0-9]{1,2}/[0-9]{1,2}/[0-9]{4}
  3. The Collation is set to Array because the Y Axis Extractor needs to return one result. This one result will be divided into sub-instances that will define the vertical structure of the table.


  • Upon inspecting the result, we can observe the desired output. A single result with an array of sub-elements.
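
To see what the Y-Axis pattern is doing, the same regular expression can be tried in plain Python. This is only a sketch of the idea, not how Grooper's Array collation is implemented. The sample text and the function name are made up, and the "array" here is simply a Python list holding every match with its character offset.

```python
import re

# The date pattern from the tutorial: one match is expected on every table row.
ROW_ANCHOR = re.compile(r"[0-9]{1,2}/[0-9]{1,2}/[0-9]{4}")


def collect_row_anchors(text):
    """Return every date match with its character offset, gathered into one list."""
    return [(m.group(), m.start()) for m in ROW_ANCHOR.finditer(text)]


# Hypothetical recognized text for a small table without lines.
sample = "Order Date Item\n1/5/2021 Widget\n12/14/2021 Gadget\n"
anchors = collect_row_anchors(sample)
```

Each entry in <code>anchors</code> plays the role of one sub-instance: one date per row, with a position that can be used to mark where that row sits.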

Complete Table Configuration and Test Results

With the Y-Axis extractor made, we now need to plug it into our Data Table and test the results.


  1. To finish setting up our Grid Layout for a table without lines, select the Data Table...
  2. ... set the Y Axis Extractor to Reference ...
  3. point the Extractor property to the "Array" Data Type ...
  4. ... set the Header Column property to the "Order Date" column. The values of this column are being leveraged to define the structure of the grid, so this property is set to allow the values to be included in extraction.


  1. Click on the "Tester" tab ...
  2. ... click the "Test" button to test extraction ...
  3. ... and return results.
  4. However, because there are no lines in the Layout Data file, the grid is created based on the bounds of the results from the X and Y axis extractors. This demonstrates a shortcoming of this extraction technique.

Configure Grid Layout for OMR Check Boxes

A Data Table is a Data Element used to model and extract a table's information on a document. Just like other Data Elements, such as Data Fields and Data Sections, Data Tables are created as children of a Data Model. This guide assumes you have created a Content Model with a Data Model.

We will use the table below as our example. This is a mockup of a government form using OMR checkboxes to check off whether or not certain criteria listed in the "Description" column are met.

Obtain the Document's Layout Data

This method heavily relies on Layout Data in order to work. Before we can use Grid Layout to extract this table's information, we need to know the table's line positions and checkbox states.

That means we will need to do some image processing using the following IP Commands

  1. Line Removal or Line Detection
  2. Box Removal or Box Detection

You can learn more about image processing for tables by visiting this article. However, it does not discuss Box Detection as it relates specifically to this use case.

The Box Detection and Box Removal commands use Optical Mark Recognition (OMR) to determine if a box is checked or not. They function similarly to Line Detection and Line Removal in that they also look for lines. After all, a box is made of lines. The Box Detection command is configured to only look at boxes of a certain size, in order to avoid "seeing" larger boxes as checkboxes. If something is "seen" inside the box (through Grooper's "blob detection"), its checkbox state is "True"; if not, it is "False". Both the box's location on the page and its checkbox state are stored in the "LayoutData.json" file (along with lines detected from the Line Detection or Line Removal command).

Below, see the Box Removal IP Command in an IP Profile.





In this case, we are using both Line Removal and Box Removal. Be careful about the order your IP Commands are operating. Boxes are made of lines. If Line Removal runs before Box Removal, you run the risk of removing all or part of those box lines. Box Removal should always run before Line Removal in an IP Profile.

Add a Data Table

Create a Data Table with three Data Columns. The three columns for our example are "Description", "Farm", and "Simulator". Refer to the Configuring Grid Layout for Tables with Lines section above for more information on adding a Data Table to a Data Model.

Set the Extract Method

First, set the "Extract Method" property to "Grid Layout". (1) Select the Data Table object in the Node Tree (2) Select the "Extract Method" property. (3) Using the dropdown list, select "Grid Layout"

Configure the Axis Extractor

The first step when configuring Grid Layout for any table is to configure the Axis Extractors. These are extractors written to locate the column and row header label locations. Once these locations are known, Grid Layout can interpret a grid structure where it expects to find each cell in the table.

For our document, our table uses lines to divide its rows and columns into bounded cells. Because of this, we can get away with only using a single axis. We will use the "X Axis Extractor" property to locate the headers "Description", "Farm", and "Simulator".


These headers can be found with a simple Internal pattern (Although if your documents are more complicated, you can use a Reference to an extractor created in the Node Tree). Expand the "X Axis Extractor" property. Select the "Type" property and choose "Internal" from the dropdown list.


Select the "X-Axis Extractor" property and press the ellipsis button at the end to bring up the Pattern Editor.


We will write a single pattern to match each column header. The following pattern will work just fine.

description 
farm 
simulator

Take note of the spaces after "description" and "farm".



We've returned all our header labels smashed together as a single result. But Grid Layout needs individual instances of each header. Just like we did in the Using Row Match with Named Groups tutorial, we will use named groups to create instances. This time, we will create instances for each column label, from which Grid Layout will use their positions on the document to create a grid.

Select each label (without the trailing space) in the Value Editor and make a named group out of each one. You can either right click the selection and choose the "Create Group" option or use the Ctrl + G hotkey on your keyboard.

Remember to name the groups the same as their corresponding Data Column. That way the instance results will populate the correct Data Column in the Data Table. Be sure to NOT include the white spaces in the named groups.

That will make the full regex the pattern below.

(?<Description>description) 
(?<Farm>farm) 
(?<Simulator>simulator)
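
If you want to experiment with the named groups outside of Grooper, the sketch below does the same thing in Python. A few assumptions are baked in: Python writes named groups as <code>(?P<Name>...)</code> rather than <code>(?<Name>...)</code>, the three pattern lines are treated as one continuous pattern, and matching is made case-insensitive. The function name and the sample header line are hypothetical.

```python
import re

# Same three labels, each wrapped in a named group; the spaces stay outside the groups.
HEADER_ROW = re.compile(
    r"(?P<Description>description) (?P<Farm>farm) (?P<Simulator>simulator)",
    re.IGNORECASE,
)


def header_spans(line):
    """Map each group name to its (start, end) character span, or return None if no match."""
    match = HEADER_ROW.search(line)
    if match is None:
        return None
    return {name: match.span(name) for name in HEADER_ROW.groupindex}


spans = header_spans("Description Farm Simulator")
```

Each named group yields its own span, which is the text-level counterpart of the per-header positions Grid Layout needs in order to place the columns.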



Press the "OK" button to exit the Pattern Editor.

At this point, if we test extraction, we can see part of Grid Layout in action. All of the text inside the Description column's cells on the page is extracted, populating the cells of the Description Data Column. With the X Axis Extractor we created, Grid Layout is able to establish where the column headers are on the page. Then, it uses the line positions obtained from a Line Detection or Line Removal IP Command to establish the table's structure, mapping out each cell according to its line boundaries.

Also notice that even though this table starts on one half of the page and then continues on the second half, the data is extracted as if it were a single table.


However, we don't have any information for the "Farm" and "Simulator" columns. Those are blank or possibly picking up some errant OCR data.

Set the OMR Columns

In order to read the checkbox states, all we need to do is tell Grid Layout they are present in those columns. This is done using the "OMR Columns" property.


Select the "OMR Columns" property, and expand the dropdown menu. This will pop up a list of all the Data Columns in the Data Table. Simply check the boxes by the columns that contain checkboxes, in this case "Farm" and "Simulator".


For each cell in the selected columns, Grid Layout will look at the Layout Data obtained by a Box Detection or Box Removal IP Command to see if a mark was detected for the box. If a mark was detected, that means the box is checked and it is assigned the value "True". If not, that cell is assigned the value "False".
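
The True/False assignment described above can be sketched in a few lines of Python. This is an illustration only. The dictionary keys <code>x</code>, <code>y</code>, and <code>checked</code> are invented for the example and are not the actual structure of the "LayoutData.json" file, and the cell rectangles are made-up coordinates.

```python
def read_omr_column(cells, boxes):
    """Return "True" or "False" for each cell, based on detected boxes.

    cells: list of (left, top, right, bottom) cell rectangles.
    boxes: list of dicts with hypothetical keys "x", "y" (box center) and "checked".
    """
    values = []
    for left, top, right, bottom in cells:
        # A cell is "True" only if a checked box has its center inside the cell.
        checked = any(
            box["checked"]
            and left <= box["x"] <= right
            and top <= box["y"] <= bottom
            for box in boxes
        )
        values.append("True" if checked else "False")
    return values


# Two cells in a hypothetical "Farm" column, with one checked and one empty box.
farm_cells = [(100, 20, 150, 40), (100, 40, 150, 60)]
detected_boxes = [
    {"x": 125, "y": 30, "checked": True},
    {"x": 125, "y": 50, "checked": False},
]
farm_values = read_omr_column(farm_cells, detected_boxes)
```

In this sketch a cell with no box at all also comes back as "False", which matches the rule that a cell is "False" whenever no mark was detected.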


Press the "Test Extraction" button to see our results.