Asset Management

From Grooper Wiki

Good asset management makes working in Grooper considerably easier. A standard naming and foldering convention for your extractors and other assets will reduce the time you spend configuring and troubleshooting extraction. It also makes extractor references easier to notice and understand without navigating into the references list directly.

Data Type Naming Conventions

A standard naming convention for Value Reader and Data Type extractors is particularly helpful. Value Readers have multiple extractor type options (Pattern Match, List Match, Labeled OMR, etc.) and Data Types have multiple collation configurations (Key-Value Pair, Ordered Array, etc.) that change the way data is returned. Furthermore, extractors are used all over Grooper: not only to extract values to populate a Data Model, but also to exclude values from extraction, to limit the scope of a document where extraction is performed, to classify documents, and more.

If you start naming your extractors "Value 1", "Value 2", "Value 3" and so on, these names are vague both in terms of what value they are extracting and how they are getting the value. A simple coded prefix to the extractor's name can give users an idea of how that extractor works at a glance. We prescribe the following prefix/suffix naming convention:

USAGE [TYPE/COLLATION] - Content

  • Extractors have jobs. They do not return values for nothing! The returned values are used for data model extraction, document classification, page separation and more. The "Usage" prefix provides Grooper users a quick look at how the extractor is being used.
  • The "Content" suffix provides users information about what data the extractor is targeting. The content is dependent on whatever it is you're trying to get out of the document.
  • The "Type/Collation" prefix is optional. It details a Value Reader's configured Extractor Type or a Data Type's configured Collation Provider.
    • This can be helpful or distracting depending on your familiarity with Grooper and the complexity of the extractor. It is often helpful for a Grooper designer to have a quick look at what type of extractor they're looking at in the node tree (e.g. Pattern Match, List Match, Labeled OMR, etc) or how a Data Type is collating results (e.g. Ordered Array, Key-Value List, Split, etc). Include this prefix if it helps your extractor asset identification. Discard it if you find it distracting.
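
The convention above is regular enough to check mechanically. The following is a minimal sketch, assuming extractor names are available as plain strings (for example, copied from the node tree). The helper names `parse_name` and `format_name` are hypothetical and are not part of Grooper.

```python
import re
from typing import NamedTuple, Optional


class ExtractorName(NamedTuple):
    usage: str                # usage prefix, e.g. "SEC"
    collation: Optional[str]  # optional type/collation prefix, e.g. "SPLT-BTW"
    content: str              # content suffix, e.g. "Payment Info"


# USAGE [TYPE/COLLATION] - Content, where the bracketed part is optional.
NAME_PATTERN = re.compile(
    r"^(?P<usage>[A-Z]+(?:-[A-Z]+)*)"     # usage prefix such as FV or TBL-RM
    r"(?:\s+\[(?P<collation>[^\]]+)\])?"  # optional [TYPE/COLLATION]
    r"\s+-\s+(?P<content>.+)$"            # " - " followed by the content suffix
)


def parse_name(name: str) -> Optional[ExtractorName]:
    """Split a name into its parts, or return None if it does not fit the convention."""
    match = NAME_PATTERN.match(name.strip())
    if match is None:
        return None
    return ExtractorName(match["usage"], match["collation"], match["content"])


def format_name(usage: str, content: str, collation: Optional[str] = None) -> str:
    """Build a name from its parts, omitting the bracketed prefix when not given."""
    middle = f" [{collation}]" if collation else ""
    return f"{usage}{middle} - {content}"
```

With this sketch, `parse_name("SEC [SPLT-BTW] - Payment Info")` yields the three parts separately, while a vague name such as "Value 1" returns `None` because it has no usage prefix or " - " separator.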

Usage Prefixes

The extractor should at least be named according to how it's being used in Grooper. What is its job? Is it collecting values for a Data Field in a Data Model? Is it defining section instances for a Data Section? Is it an exclusion extractor or an input filter?

The Usage Prefix should provide a quick look at how the Value Reader or Data Type is being used.

Usage | Usage Prefix | Example Name
Generic Extractor (referenced by other extractors), Option 1 | VAL | VAL - Text Segment
Generic Extractor (referenced by other extractors), Option 2 | No prefix | Text Segment
Field Value Extractor (referenced by a Data Field) | FV or VE | FV - Invoice Number
Exclusion Extractor | EXCL | EXCL - Page Header
Subtraction Extractor | SUB | SUB - Page Footer
Input Filter Extractor | INPT or IF | INPT - Payment Info
Data Section Extractor | SEC | SEC - Payment Info
Table Extractors
Row extractor for Row Match | TBL-RM | TBL-RM - Payment Info
X-Axis extractor for Grid Layout | TBL-X | TBL-X - Payment Info
Y-Axis extractor for Grid Layout | TBL-Y | TBL-Y - Payment Info
Header extractor for Header-Value (deprecated) or Tabular Layout | TBL-HE or TBL-COL | TBL-HE - Payment Info [Payment Date]
Note: For a Header Extractor, it's often helpful to have the Data Table's name and Data Column's name in the Content Suffix. Place the Data Column's name in square brackets or parentheses after the Data Table's name.
Footer extractor for Header-Value (deprecated) or Tabular Layout | TBL-FOOT | TBL-FOOT - Payment Info (Total Line)
Note: For a Footer Extractor, you could place what is used as a footer in square brackets or parentheses, which tells the user more than just that it is a footer extractor for the "Payment Info" table.
Column Value Extractor (referenced by a Data Column) | TBL-CV | TBL-CV - Invoice Number
Classification Extractors
Positive Classification Extractor | CLAS | CLAS - Invoice
Note: Typically, the Content Suffix will be the name of the Document Type being positively classified.
Negative Classification Extractor | CLAS-NEG | CLAS-NEG - Invoice
Feature Extractor | CLAS-FEAT | CLAS-FEAT - Invoice Bigrams
Note: Typically, the Content Suffix will be the kinds of features being collected.
Separation Extractors
Change in Value Separation Extractor | SEP-CIV | SEP-CIV - Invoice Number
Pattern-Based Separation Extractor | SEP-PB | SEP-PB - Invoice Number
EPI Separation/Page Number Extractor | SEP-EPI | SEP-EPI - Page # of ##
Note: "SEP - EPI Extractor" and "SEP - Page Number Extractor" are accepted alternatives.

Note that the "TBL" and "CLAS" usage prefixes may seem like overkill. You may think "FEAT - Invoice Bigrams" is just as informative as "CLAS-FEAT - Invoice Bigrams", and for a seasoned Grooper user that is most likely true. For newer Grooper users, however, the additional "TBL" and "CLAS" prefixes call out the extractor's "job" more explicitly in its name. They are not strictly necessary, so adopt the naming convention that works best for your team.
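
As an illustration of how the usage prefixes could support a quick audit, here is a minimal sketch that looks up the first token of each name. It assumes names are plain strings exported or typed by hand. `USAGE_PREFIXES`, `usage_of` and `unknown_usage` are hypothetical names, not Grooper features. Unprefixed generic extractors and the accepted "SEP - ..." alternatives would be reported as unknown by this simple check.

```python
from typing import Iterable, List, Optional

# Usage prefixes from the table above, mapped to a short description.
USAGE_PREFIXES = {
    "VAL": "Generic extractor",
    "FV": "Field value extractor",
    "VE": "Field value extractor",
    "EXCL": "Exclusion extractor",
    "SUB": "Subtraction extractor",
    "INPT": "Input filter extractor",
    "IF": "Input filter extractor",
    "SEC": "Data Section extractor",
    "TBL-RM": "Row extractor for Row Match",
    "TBL-X": "X-Axis extractor for Grid Layout",
    "TBL-Y": "Y-Axis extractor for Grid Layout",
    "TBL-HE": "Table header extractor",
    "TBL-COL": "Table header extractor",
    "TBL-FOOT": "Table footer extractor",
    "TBL-CV": "Column value extractor",
    "CLAS": "Positive classification extractor",
    "CLAS-NEG": "Negative classification extractor",
    "CLAS-FEAT": "Feature extractor",
    "SEP-CIV": "Change in Value separation extractor",
    "SEP-PB": "Pattern-Based separation extractor",
    "SEP-EPI": "EPI separation/page number extractor",
}


def usage_of(name: str) -> Optional[str]:
    """Return the usage description for a name, or None if its prefix is unknown."""
    prefix = name.strip().split(" ", 1)[0]
    return USAGE_PREFIXES.get(prefix)


def unknown_usage(names: Iterable[str]) -> List[str]:
    """List the names whose first token is not a known usage prefix."""
    return [name for name in names if usage_of(name) is None]
```

A team that drops the "TBL" and "CLAS" prefixes, or adds its own, would simply edit the dictionary to match.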

Type and Collation Prefixes

The Type/Collation Prefix can provide useful at-a-glance information about how an extractor is configured. We consider this prefix "optional". Depending on your familiarity with Grooper, you may find it more or less helpful. However, because a Value Reader's Extractor Type and a Data Type's Collation Provider so strongly affect how results are returned, it is often helpful to know them before you even select the extractor in the Node Tree.

For example, a section extractor using Split collation with the Between position locating payment information sections on a document would be named "SEC [SPLT-BTW] - Payment Info". This lets you know not only is this a section extractor (given the "SEC" prefix) but also some information about how the extractor is configured, letting you know more about the extraction logic before you start to edit the extractor's configuration.

See the table below for collation prefix naming.

Collation Provider | Collation Prefix | Example Name
Individual Collation | none | FV - Invoice Number
Combine Collation
Combine using the Individual method | CMB | FV [CMB] - Invoice Number
Combine using the Flow method | CMB-F | FV [CMB-F] - Invoice Number
Combine using the Geometric method | CMB-G | FV [CMB-G] - Invoice Number
Combine using the Sum method | CMB-S | FV [CMB-S] - Invoice Number
Array Collation
Array with a Horizontal layout | ARY-H | FV [ARY-H] - Invoice Number
Array with a Vertical layout | ARY-V | FV [ARY-V] - Invoice Number
Array with a Flow layout | ARY-F | FV [ARY-F] - Invoice Number
Ordered Array Collation
Ordered Array with a Horizontal layout | OA-H | FV [OA-H] - Invoice Number
Ordered Array with a Vertical layout | OA-V | FV [OA-V] - Invoice Number
Ordered Array with a Flow layout | OA-F | FV [OA-F] - Invoice Number
Key-Value Pair Collation
Key-Value Pair with a Horizontal layout | KVP-H | FV [KVP-H] - Invoice Number
Key-Value Pair with a Vertical layout | KVP-V | FV [KVP-V] - Invoice Number
Key-Value Pair with a Flow layout | KVP-F | FV [KVP-F] - Invoice Number
Key-Value List Collation
Key-Value List with a Horizontal layout | KVL-H | FV [KVL-H] - Invoice Number
Key-Value List with a Vertical layout | KVL-V | FV [KVL-V] - Invoice Number
Key-Value List with a Flow layout | KVL-F | FV [KVL-F] - Invoice Number
Split Collation
Split using Begin position | SPLT-BEG | SEC [SPLT-BEG] - Invoice Header Details
Split using End position | SPLT-END | SEC [SPLT-END] - Invoice Header Details
Split using Between position | SPLT-BTW | SEC [SPLT-BTW] - Invoice Header Details
Split using Around position | SPLT-ARD | SEC [SPLT-ARD] - Invoice Header Details
Other Collation Providers
Pattern-Based | PB | SEC [PB] - Invoice Header Details
AND | AND | CLAS [AND] - Generic Letters
Multi-Column | MC | SEC [MC] - Report Columns
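
Most collation prefixes in the table follow one rule: a provider code, optionally followed by a hyphen and a layout, method or position code. The sketch below captures that rule for a subset of the providers listed. `collation_prefix` and both lookup tables are hypothetical helpers for illustration only, and the function raises `KeyError` for any provider or option it does not know.

```python
from typing import Optional

# Provider codes taken from the table above.
PROVIDER_CODES = {
    "Combine": "CMB",
    "Array": "ARY",
    "Ordered Array": "OA",
    "Key-Value Pair": "KVP",
    "Key-Value List": "KVL",
    "Split": "SPLT",
    "Pattern-Based": "PB",
    "Multi-Column": "MC",
}

# Layout, method and position codes taken from the table above.
OPTION_CODES = {
    "Horizontal": "H",
    "Vertical": "V",
    "Flow": "F",
    "Geometric": "G",
    "Sum": "S",
    "Begin": "BEG",
    "End": "END",
    "Between": "BTW",
    "Around": "ARD",
}


def collation_prefix(provider: str, option: Optional[str] = None) -> Optional[str]:
    """Return the bracket contents for a provider, or None for Individual collation."""
    if provider == "Individual":
        return None  # Individual collation takes no prefix
    code = PROVIDER_CODES[provider]
    # Combine using the Individual method is written as plain "CMB".
    if option is None or (provider == "Combine" and option == "Individual"):
        return code
    return f"{code}-{OPTION_CODES[option]}"
```

For example, `collation_prefix("Split", "Between")` gives the "SPLT-BTW" used in "SEC [SPLT-BTW] - Payment Info".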

Naming Child Extractors of Data Types

There are some naming prescriptions to avoid generic, unhelpful child extractor naming, such as "Format 1", "Format 2", "Format 3" and so on.

Child extractors should be named to identify what is being extracted. When possible, the name can closely mirror the regex pattern used. For example, a "Date" Data Type extractor can target and return a variety of date formats:

  • 06/12/1985
  • 12 June 1985
  • 1985-06-12

These are all formats for the same date. You should name the child extractors in a way that is descriptive to the data's pattern when possible. The corresponding extractors returning the date formats above could be named the following:

  • ##/##/#### (or mm/dd/yyyy)
  • ## Month #### (or dd Month yyyy)
  • ####-##-## (or yyyy-mm-dd)

That being said, Data Formats don't always target data as structured as a date. In these cases, a short phrase may be more useful. For example, a Value Reader returning the names of months could simply be named "Month Names".
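
For structured values such as the dates above, a pattern-style name can be derived from a sample value. The following is a minimal sketch that replaces digits with "#" and full English month names with "Month". The function `pattern_name` is a hypothetical helper, and it does not handle abbreviated or non-English month names.

```python
import re

MONTHS = (
    "January", "February", "March", "April", "May", "June", "July",
    "August", "September", "October", "November", "December",
)

# Matches any full month name as a whole word, ignoring case.
_MONTH_RE = re.compile(r"\b(?:" + "|".join(MONTHS) + r")\b", re.IGNORECASE)


def pattern_name(sample: str) -> str:
    """Turn a sample value into a pattern-style child extractor name."""
    masked = _MONTH_RE.sub("Month", sample)  # month names become "Month"
    return re.sub(r"\d", "#", masked)        # every digit becomes "#"
```

Applied to the three sample dates, this produces "##/##/####", "## Month ####" and "####-##-##".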

Collation Specific Children Conventions

Key-Value Pair and Key-Value List extractors

  • The Key-Value collation types require two extractors: a "key" extractor and a "value" extractor.
    • Referenced Key extractors should be labeled as "KEY - Parent Name"
    • Local children only used by the parent can simply be named "KEY" and "VALUE"

Ordered Array Advice

  • When an Ordered Array's children are used to populate Data Columns in a Data Table, the child extractors can use the same name as the Data Column to auto populate the table.
    • In these cases do not follow these naming conventions and simply name the child after the Data Column. The names must match exactly.
  • You may find it helpful to include the child names in the Content Suffix of an Ordered Array (e.g. "[OA-H] Extractor Name [Child 1, Child 2, Child 3, Child 4]").
    • Ordered Arrays are often used to create the row structure for various table extraction methods. Sometimes, multiple Ordered Arrays need to be created to account for multiple row formats, usually resulting from optional columns.
      • For example, you may have a four column table where the third column is optional. The table extractor may need two children, one Ordered Array for rows with all four columns filled and one for all but the third column filled.
        • In this case, you should list the elements present in the array. You would have two children:
          • [OA-H] Table Name [Column 1, Column 2, Column 3, Column 4]
          • [OA-H] Table Name [Column 1, Column 2, Column 4]
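
The optional-column case above can be sketched as a small name generator. This assumes a table is described by an ordered list of column names plus the subset that is optional. `ordered_array_names` is a hypothetical helper, and it produces one name per combination of optional columns being present or absent.

```python
from itertools import combinations
from typing import List, Sequence


def ordered_array_names(
    table: str,
    columns: Sequence[str],
    optional: Sequence[str],
    layout: str = "H",
) -> List[str]:
    """Name one Ordered Array child per row format, listing the columns present."""
    names = []
    # Drop 0, then 1, then 2... of the optional columns, keeping column order.
    for count in range(len(optional) + 1):
        for dropped in combinations(optional, count):
            present = [column for column in columns if column not in dropped]
            names.append(f"[OA-{layout}] {table} [{', '.join(present)}]")
    return names
```

Calling it with four columns where "Column 3" is optional returns the two names listed above, with the full row format first.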

Exception to the Rule: Named Instances

There is an exception to our asset naming convention: Named instances.

There are a few spots in Grooper where using a "named instance" can be an extraction shortcut. For example, using the Row Match method, the Data Table will populate Data Columns if the extractor has a named instance that matches the name of the Data Column (no need to configure the Data Column's Value Extractor!). A child extractor can supply the named instance. In this case, you need to ignore all of our naming convention advice and ensure the child extractor's name matches the Data Column's name exactly.

Extractors taking advantage of named instance shortcuts should always be named such that the named instance matches the destination object (such as a Data Column's name).
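
Because the match must be exact, a near miss in case or spacing is easy to overlook. The sketch below compares child extractor names against Data Column names given as plain lists of strings. `check_named_instances` is a hypothetical helper, and the case- and whitespace-insensitive comparison is only an assumption about which near misses are worth flagging.

```python
from typing import Dict, Iterable, Optional


def check_named_instances(
    child_names: Iterable[str],
    column_names: Iterable[str],
) -> Dict[str, Optional[str]]:
    """Map each child name without an exact column match to its likely column.

    The value is the column that matches when case and surrounding whitespace
    are ignored, or None if no column comes close.
    """
    columns = list(column_names)
    exact = set(columns)
    loose = {column.strip().casefold(): column for column in columns}
    problems: Dict[str, Optional[str]] = {}
    for child in child_names:
        if child in exact:
            continue  # exact match: the named instance shortcut can apply
        problems[child] = loose.get(child.strip().casefold())
    return problems
```

An empty result means every child name matches a Data Column name exactly.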

Benefits

We strongly advise you to follow these naming conventions. However, Grooper does not force you to use them, and there are always exceptions to the rule. So why do it? Advantages include:

  • Quick identification of collation methods used
    • Is the Data Type a Key-Value Pair? Is it an Ordered Array?
  • Quick identification of usage in Grooper
    • Is it for a Section Extractor? Is it for classification?
    • This is especially helpful when referencing the extractor (e.g. when setting an Exclusion Extractor property, you will know to look for the "EXCL" prefix).
  • Standardization means different users will easily be able to identify assets, even when they did not create them.
    • Imagine your Grooper architect leaves the company and you have to replace them with a new employee. Once that employee learns this naming convention, they will be better equipped to navigate the pre-existing Content Model.
  • Standardization means you will easily be able to identify assets down the line.
    • Six months from now are you going to remember how you built something? These Extractor Prefixes will assist you in getting back up to speed on older projects.
  • Standardization means our Help Desk will get up to speed on your Content Model quickly if you call in for support.
    • This will ultimately save you time by resolving your issue more quickly.