2.80:Infer Grid (Table Extract Method)

From Grooper Wiki

Infer Grid uses the positions of row and column headers to infer where a tabular grid lies around each value in a table, then extracts the value from each cell of the inferred grid.

Infer Grid is one of three methods for extracting data from tables on documents. This method extracts information from tables that have both row and column headers, inferring a grid from the header positions. This is done by assigning an "X Axis Extractor" to match the column headers and a "Y Axis Extractor" to match the row headers. A grid is created from the header positions returned by the "X Axis Extractor" and the "Y Axis Extractor". OCR data then populates each cell of the grid according to where it falls on the page. If everything is set up correctly, the inferred grid will match the boundaries of the table on the document.

How To:  Configure Infer Grid

Consider the following table of monthly profits:

           Revenue   Expenses   Profit
January    10,000    11,000     12,000
February   6,000     6,000      6,000
March      4,000     5,000      6,000

In the Data Model of a Content Model, create a "Data Table" and add as many "Data Columns" as necessary.  We will have four for this example: Month, Revenue, Expenses, and Profit.



Select the "Data Table" object you created ("Monthly Profits" for our example).  Using the property panel, select "Infer Grid" from the "Extract Method" dropdown list.



Expand "Extract Method" by double-clicking it or pressing the caret to the left of it to show the configurable properties for "Infer Grid".



Next, set your "X Axis Extractor" and "Y Axis Extractor".  You will create an "X Axis Extractor" to return the values of your column headers.  Whatever is returned by the "Y Axis Extractor" will be the row headers.

  • The X Axis Extractor should match the entire header row at the top of the table and return sub instances for each individual column.  This can be done using regular expression named groups or Collation Providers such as Ordered Array.
  • The Y Axis Extractor should match the entire header column on the left side of the table and return sub instances for each individual row.  This can also be done using regular expression named groups or Collation Providers such as Ordered Array.
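
To make the "one match, one sub instance per column" idea concrete, here is a minimal sketch using Python's standard `re` module rather than Grooper itself. The header text and the `find_column_headers` helper are hypothetical; the only point illustrated is that a single pattern can match the whole header row while named groups expose each column header separately.

```python
import re

# Hypothetical OCR text for the header row of the example table.
header_line = "Revenue Expenses Profit"

# One named group per column header. The group names mirror the
# Data Column names used in this example.
header_pattern = re.compile(
    r"(?P<Revenue>Revenue)\s+(?P<Expenses>Expenses)\s+(?P<Profit>Profit)"
)


def find_column_headers(text):
    """Return {column name: (start, end)} character spans, or {} if no match."""
    match = header_pattern.search(text)
    if match is None:
        return {}
    return {name: match.span(name) for name in header_pattern.groupindex}


spans = find_column_headers(header_line)
for name, (start, end) in spans.items():
    print(name, start, end)
```

In this sketch the spans are character offsets in a string. In a real extraction the analogous information would be each header's location on the page.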



For this example, the extractor named "X Axis (Money Headers)" is a Data Type with three Data Formats that find "Revenue", "Expenses", and "Profit" respectively.  On the Data Type level, the collation method was changed to an Ordered Array looking horizontally. The names of the child extractors locating the table's headers (in this case the three Data Formats) should match the names of the Data Table's Data Columns. Upon extraction, the cells underneath the header location returned by each Data Format will fill the correspondingly named Data Column. For example, the values for each month underneath "Profit", whose header was found by the Data Format named "Profit", will populate the Data Column named "Profit".



"Y Axis (Month Headers)" was an extractor looking for month names, collating the returned months as an Array. Once Grooper knows where the row and column headers are on a document, Infer Grid can use their positions on the document to figure out where the rest of the cells in the table are.

! Grooper can detect bounded boxes in a table and extract all the information that falls inside each box. "Auto Snap" to lines creates an extraction zone within the boundaries of the lines in the table. In other words, each box in the table will be extracted. Auto Snap can be further configured using the "Snap Limits" and "Snap Margin" properties. However, Grooper first needs to know where those lines are. If you wish to take advantage of Auto Snap, first obtain the document's layout data using a "Line Detection" image processing command.
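
Setting Auto Snap aside, the basic idea of inferring cells from header positions can be sketched in a few lines of Python. This is not Grooper's algorithm, only an illustration under simple assumptions: every piece of text has a bounding box, column boundaries fall halfway between adjacent column headers, and row boundaries fall halfway between adjacent row headers. The `Box` class, the `infer_grid` function, and all coordinates are hypothetical.

```python
from bisect import bisect_right
from dataclasses import dataclass


@dataclass
class Box:
    """A piece of recognized text and its bounding box in page coordinates."""
    text: str
    left: float
    top: float
    right: float
    bottom: float

    @property
    def cx(self):
        return (self.left + self.right) / 2

    @property
    def cy(self):
        return (self.top + self.bottom) / 2


def midpoints(centers):
    """Boundaries halfway between each pair of adjacent centers."""
    return [(a + b) / 2 for a, b in zip(centers, centers[1:])]


def infer_grid(col_headers, row_headers, words):
    """Assign each word to the (row, column) cell implied by the headers."""
    cols = sorted(col_headers, key=lambda b: b.cx)
    rows = sorted(row_headers, key=lambda b: b.cy)
    x_splits = midpoints([b.cx for b in cols])
    y_splits = midpoints([b.cy for b in rows])
    # The table body lies to the right of the row headers and below the column headers.
    body_left = max(b.right for b in rows)
    body_top = max(b.bottom for b in cols)

    grid = {r.text: {c.text: "" for c in cols} for r in rows}
    for w in words:
        if w.cx <= body_left or w.cy <= body_top:
            continue  # header text or text outside the table body
        col = cols[bisect_right(x_splits, w.cx)].text
        row = rows[bisect_right(y_splits, w.cy)].text
        grid[row][col] = (grid[row][col] + " " + w.text).strip()
    return grid


# Hypothetical positions for the monthly profits table.
col_headers = [Box("Revenue", 100, 0, 160, 10),
               Box("Expenses", 200, 0, 270, 10),
               Box("Profit", 300, 0, 350, 10)]
row_headers = [Box("January", 0, 20, 60, 30),
               Box("February", 0, 40, 60, 50),
               Box("March", 0, 60, 60, 70)]
values = [["10,000", "11,000", "12,000"],
          ["6,000", "6,000", "6,000"],
          ["4,000", "5,000", "6,000"]]
words = [Box(text, c.left, r.top, c.right, r.bottom)
         for r, line in zip(row_headers, values)
         for c, text in zip(col_headers, line)]

grid = infer_grid(col_headers, row_headers, words)
print(grid["January"]["Profit"])
```

Note that no pattern for the numeric values appears anywhere in the sketch: the cell contents are picked up purely by position. A limitation of this simplified version is that the outermost rows and columns are open-ended, so stray text below the last row would be assigned to that row.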


Press the "Test Extraction" button to see our results so far.



Notice the Month column isn't filled in.



You can change the "Header Column" property to the "Data Column" you want to receive row header values.  This will populate this column with the values returned from your Y Axis Extractor.  For our example, we will set it to the Month column.
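
The effect of this setting can be pictured with a small Python sketch. It is only an illustration, not Grooper code: the `to_rows` function and its `header_column` parameter are hypothetical stand-ins for the "Header Column" property, and the nested dictionary stands in for the inferred grid.

```python
def to_rows(grid, columns, header_column=None):
    """Flatten {row header: {column: value}} into one dict per table row.

    If header_column is given, the row header text is written into that column.
    """
    rows = []
    for row_header, cells in grid.items():
        row = {name: cells.get(name, "") for name in columns}
        if header_column is not None:
            row[header_column] = row_header
        rows.append(row)
    return rows


columns = ["Month", "Revenue", "Expenses", "Profit"]
grid = {
    "January": {"Revenue": "10,000", "Expenses": "11,000", "Profit": "12,000"},
    "February": {"Revenue": "6,000", "Expenses": "6,000", "Profit": "6,000"},
    "March": {"Revenue": "4,000", "Expenses": "5,000", "Profit": "6,000"},
}

without_header = to_rows(grid, columns)                      # Month stays empty
with_header = to_rows(grid, columns, header_column="Month")  # Month is filled in
print(with_header[0])
```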



See the results below. The months our "Y Axis Extractor" returned have populated the Month column of the Data Table and the remaining columns have also been populated.  Notice we never wrote any extractors to find numerical values during this example.  All the numerical values in the table were extracted from the OCR data, using the grid our Data Table inferred from the header values extracted from our X and Y Axis Extractors.



Tabular OMR: Checkboxes and Table Extraction

As of version 2.72, the "Infer Grid" method supports columns containing OMR data.  This makes it much easier to read checkbox information from tables.  The setup is simple: mark one or more columns as OMR columns.

Let's take the following table as an example.  We will make a Data Table named "Significant Cow Manipulations" that has three columns: one for reading the "Plant" check box, one for the "Simulator" check box, and one for the "Description" column.



! OMR checkbox states (whether they are or are not marked) are obtained in Grooper during Image Processing. You must get this information from a Box Detection command during an Image Processing activity or Recognize activity in order for this extraction technique to work.

Select your Data Table in your Data Model. Under the "Extract Method" settings, select "OMR Columns".



Select the columns you want to read OMR data from.  These will be used as OMR Columns.



Press "Test Extraction" to see the result.  The rows where the box was checked now show "True", while the rows with blank boxes show "False".



That's pretty much it! As you can see, for the rows on the left side of the table as well as the rows on the right, every filled-in box has been marked "True".
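
As a rough mental model of what an OMR column does, the following Python sketch maps detected checkboxes onto grid cells. It is not Grooper's implementation. It assumes the cell rectangles are already known and that box detection has produced a list of checkbox centers with a checked flag. The `Cell` and `Checkbox` types, the `read_omr_columns` function, the row names, and all coordinates are hypothetical.

```python
from typing import NamedTuple


class Checkbox(NamedTuple):
    x: float        # center of the detected box
    y: float
    checked: bool   # state found during image processing


class Cell(NamedTuple):
    row: str
    column: str
    left: float
    top: float
    right: float
    bottom: float


def read_omr_columns(cells, checkboxes, omr_columns):
    """Return {(row, column): "True" or "False"} for cells in OMR columns only."""
    results = {}
    for cell in cells:
        if cell.column not in omr_columns:
            continue  # ordinary text column, not read as OMR
        marked = any(
            box.checked
            and cell.left <= box.x < cell.right
            and cell.top <= box.y < cell.bottom
            for box in checkboxes
        )
        results[(cell.row, cell.column)] = "True" if marked else "False"
    return results


# Hypothetical layout: two rows, two checkbox columns, one text column.
cells = [
    Cell("Row 1", "Plant", 0, 0, 50, 20),
    Cell("Row 1", "Simulator", 50, 0, 100, 20),
    Cell("Row 1", "Description", 100, 0, 300, 20),
    Cell("Row 2", "Plant", 0, 20, 50, 40),
    Cell("Row 2", "Simulator", 50, 20, 100, 40),
    Cell("Row 2", "Description", 100, 20, 300, 40),
]
checkboxes = [
    Checkbox(25, 10, True), Checkbox(75, 10, False),
    Checkbox(25, 30, False), Checkbox(75, 30, True),
]

omr_values = read_omr_columns(cells, checkboxes, {"Plant", "Simulator"})
print(omr_values[("Row 1", "Plant")])
```

The sketch also shows why the checkbox states must exist before extraction: without the `checkboxes` list, every OMR cell would simply read "False".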