Asset Management

From Grooper Wiki

Good asset management makes working in Grooper considerably easier. A standard naming and foldering convention for your extractors and other assets will reduce the time you spend configuring and troubleshooting extraction. It also makes extractor references easier to notice and understand without navigating into the references list directly.

Data Type Naming Conventions

A standard naming convention for Data Type extractors is particularly helpful. This single object has multiple collation configurations (Key-Value Pair, Ordered Array, etc.) that change the way data is returned. Furthermore, Data Types are used all over Grooper, not only to extract values, but to exclude values from extraction, to limit the scope of a document where extraction is performed, to classify documents, and more.

If you start naming your extractors "Value 1", "Value 2", "Value 3" and so on, these names are vague both in terms of what value they are extracting and how they are getting the value. A simple coded prefix in the extractor's name can tell users how that extractor works at a glance. We recommend the following prefix/suffix naming convention: EXTRACTION - Content

The "Extraction" prefix, in all caps, provides Design Studio users information about how the extractor is configured. The "Content" suffix provides users information about what data the extractor is targeting. The content is dependent on whatever it is you're trying to get out of the document. However, the "Extraction" prefix can be particularly helpful to keep your extractors organized.

Collation Prefixes

The Extraction Prefix should at least provide users information about the collation provider used. See the table below for collation prefix naming.

Collation Provider | Extraction Prefix | Example Name
Individual Collation | none | Invoice Number
Combine Collation
Combine using the Individual method | CMB | CMB - Invoice Number
Combine using the Flow method | CMB-F | CMB-F - Invoice Number
Combine using the Geometric method | CMB-G | CMB-G - Invoice Number
Combine using the Sum method | CMB-S | CMB-S - Invoice Number
Array Collation
Array with a Horizontal layout | ARY-H | ARY-H - Invoice Number
Array with a Vertical layout | ARY-V | ARY-V - Invoice Number
Array with a Flow layout | ARY-F | ARY-F - Invoice Number
Ordered Array Collation
Ordered Array with a Horizontal layout | OA-H | OA-H - Invoice Number
Ordered Array with a Vertical layout | OA-V | OA-V - Invoice Number
Ordered Array with a Flow layout | OA-F | OA-F - Invoice Number
Key-Value Pair Collation
Key-Value Pair with a Horizontal layout | KVP-H | KVP-H - Invoice Number
Key-Value Pair with a Vertical layout | KVP-V | KVP-V - Invoice Number
Key-Value Pair with a Flow layout | KVP-F | KVP-F - Invoice Number
Key-Value List Collation
Key-Value List with a Horizontal layout | KVL-H | KVL-H - Invoice Number
Key-Value List with a Vertical layout | KVL-V | KVL-V - Invoice Number
Key-Value List with a Flow layout | KVL-F | KVL-F - Invoice Number
Split Collation
Split using Begin position | SPLT-BEG | SPLT-BEG - Invoice Number
Split using End position | SPLT-END | SPLT-END - Invoice Number
Split using Between position | SPLT-BTW | SPLT-BTW - Invoice Number
Split using Around position | SPLT-ARD | SPLT-ARD - Invoice Number
Other Collation Providers
Pattern-Based | PB | PB - Invoice Number
Multi-Column | MC | MC - Invoice Number
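
The collation table above can also be read as a lookup: a base prefix per collation provider plus a short suffix for the method, layout, or position. The following Python sketch is only an illustration of that reading. The function names `collation_prefix` and `data_type_name` are hypothetical, nothing here is part of Grooper, and the sketch does not check whether a provider actually supports a given variant.

```python
# Base prefixes and variant suffixes transcribed from the collation table above.
BASE_PREFIX = {
    "Combine": "CMB",
    "Array": "ARY",
    "Ordered Array": "OA",
    "Key-Value Pair": "KVP",
    "Key-Value List": "KVL",
    "Split": "SPLT",
    "Pattern-Based": "PB",
    "Multi-Column": "MC",
}
VARIANT_SUFFIX = {
    "Individual": "",  # Combine using the Individual method is plain "CMB"
    "Flow": "F", "Geometric": "G", "Sum": "S",
    "Horizontal": "H", "Vertical": "V",
    "Begin": "BEG", "End": "END", "Between": "BTW", "Around": "ARD",
}


def collation_prefix(provider, variant=None):
    """Return the Extraction Prefix for a provider; Individual Collation has none."""
    if provider == "Individual":
        return ""
    base = BASE_PREFIX[provider]
    suffix = VARIANT_SUFFIX[variant] if variant else ""
    return f"{base}-{suffix}" if suffix else base


def data_type_name(provider, variant, content):
    """Combine the prefix and the content into 'PREFIX - Content'."""
    prefix = collation_prefix(provider, variant)
    return f"{prefix} - {content}" if prefix else content


print(data_type_name("Key-Value Pair", "Horizontal", "Invoice Number"))
print(data_type_name("Individual", None, "Invoice Number"))
```

Keeping the prefixes in one table like this is a convenient way to share the convention with a team, even if the names are still typed by hand in Design Studio.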

Result Post Processor Prefixes

Post Processor | Extraction Prefix | Example Name
OMR Reader | OMR | OMR - Application Type
OCR Reader | OCR | OCR - Application Number

Usage Prefixes

If the Data Type is used for a purpose other than returning a value (either to a Data Field or to return values referenced by other extractors), it should also be noted in the Extraction Prefix. The Extraction Prefix should provide information about the extractor from general to specific.

For example, a section extractor that uses Split collation with the Between position to locate payment information sections would be named "SEC SPLT-BTW - Payment Info".

Usage | Extraction Prefix | Example Name
Generic Extractor | VAL | VAL - Text Segment
Value Extractor (supplied to a Data Field) | VE | VE - Invoice Number
Note: The extractor's collation method should always be indicated in the Extraction Prefix. If the example above used Key-Value Pair collation in the Horizontal layout, it should be named "VE KVP-H - Invoice Number" to provide an at-a-glance awareness of how this extractor is configured.
Exclusion Extractor | EXCL | EXCL - Page Header
Subtraction Extractor | SUB | SUB - Page Footer
Input Filter Extractor | IF | IF - Payment Info
Data Section Extractor | SEC | SEC SPLT-BEG - Payment Info
Note: Again, the extractor's collation method should always be indicated in the Extraction Prefix. The example above would be a Split collated Data Type using the "Begin" position to return an instance for a Data Section. If this extractor is just named "SEC - Payment Info", you lose that at-a-glance knowledge of which collation method it's using to create the Data Section instances.
Table Extractors
Row extractor for Row Match | TBL-RM | TBL-RM OA-H - Payment Info
X-Axis extractor for Infer Grid | TBL-IGX | TBL-IGX OA-H - Payment Info
Y-Axis extractor for Infer Grid | TBL-IGY | TBL-IGY OA-V - Payment Info
Header extractor for Header-Value | TBL-HE or TBL-COL | TBL-HE - Payment Info [Payment Date]
Note: The Content Suffix of a Data Type's name can provide extra information for the user. For example, placing "Payment Date" in square brackets indicates the Header Extractor is locating the "Payment Date" header for the "Payment Info" table.
Footer extractor for Header-Value | TBL-FOOT | TBL-FOOT - Payment Info [Total Line]
Note: For a Footer Extractor, you could place what is used as a footer in square brackets. This tells the user more than the fact that it is a footer extractor for the "Payment Info" table.
Classification Extractors
Positive Classification Extractor | CLAS-POS | CLAS-POS - Invoice
Negative Classification Extractor | CLAS-NEG | CLAS-NEG - Invoice
Feature Extractor | CLAS-FEAT | CLAS-FEAT - Invoice
Note: The usage prefixes "TBL" and "CLAS" may seem like overkill. You may think "POS - Invoice" is just as informative as "CLAS-POS - Invoice". If you're a seasoned Grooper user, this is most likely true. The additional "TBL" and "CLAS" prefixes are not strictly necessary, but they can help newer Grooper users by calling out the extractor's "job" more explicitly in its name. You should ultimately adopt the naming convention that works best for your team.
Separation Extractors
Change in Value Separation Extractor | SEP-CIV | SEP-CIV KVP-H - Invoice Number
Pattern-Based Separation Extractor | SEP-PB | SEP-PB - Invoice Number
EPI Separation/Page Number Extractor | EPI | EPI - Page # of ##
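
The "general to specific" ordering described in this section (usage prefix, then collation prefix, then content) can be sketched in a few lines of Python. This is an illustration only: `extractor_name` and `split_name` are hypothetical helpers, not Grooper features. The parser assumes prefix tokens are made of capital letters and hyphens, which matches the prefixes in the tables above.

```python
import re

# A prefix token is capital letters joined by hyphens, e.g. "SEC" or "SPLT-BTW".
PREFIX_TOKEN = re.compile(r"[A-Z]+(?:-[A-Z]+)*")


def extractor_name(content, usage="", collation=""):
    """Join usage prefix, collation prefix, and content, general to specific."""
    prefix = " ".join(part for part in (usage, collation) if part)
    return f"{prefix} - {content}" if prefix else content


def split_name(name):
    """Split a name into (prefix tokens, content); unprefixed names give ([], name)."""
    head, sep, tail = name.partition(" - ")
    tokens = head.split()
    if sep and tokens and all(PREFIX_TOKEN.fullmatch(t) for t in tokens):
        return tokens, tail
    return [], name


print(extractor_name("Payment Info", usage="SEC", collation="SPLT-BTW"))
print(split_name("VE KVP-H - Invoice Number"))
```

A parser like `split_name` could be used to audit an exported list of extractor names for ones that do not follow the convention, assuming such a list is available.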

Naming Children of Data Types

Data Formats

Data Formats are simple extractors that are always created as children of Data Types. One of the main differences is they do not have collation properties. So, there is no need for a collation prefix. However, there are still some naming prescriptions to avoid generic, unhelpful naming, such as "Format 1", "Format 2", "Format 3" and so on.

Data Formats should be named in a way to identify what is being extracted. When possible, this can be very similar to the regex pattern used. For example, a "Date" extractor can have a variety of date formats:

  • 06/12/1985
  • 12 June 1985
  • 1985-06-12

These are all formats for the same date. When possible, name each Data Format so that it describes the data's pattern. The corresponding Data Formats that extract the date formats above could be named the following:

  • ##/##/####
  • ## Month ####
  • ####-##-##
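
As a rough illustration of how these pattern-style names relate to the sample values, the Python sketch below replaces each digit with "#" and each English month name with "Month". The helper `format_name` is hypothetical and is not something Grooper provides. It only handles full English month names.

```python
import re

# Full English month names; abbreviations are deliberately not handled.
MONTHS = ("January|February|March|April|May|June|July|"
          "August|September|October|November|December")


def format_name(sample):
    """Describe a sample value's shape: month names become 'Month', digits become '#'."""
    shaped = re.sub(rf"\b(?:{MONTHS})\b", "Month", sample)
    return re.sub(r"\d", "#", shaped)


for sample in ("06/12/1985", "12 June 1985", "1985-06-12"):
    print(sample, "->", format_name(sample))
```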

That being said, Data Formats don't always target data as structured as a date. In these cases, a short phrase may be more useful. For example, a Data Format returning the names of the months could just be named "Month Names".

Data Types as Children

Data Types created as children of other Data Types should still use the Extraction Prefix naming convention to call attention to their collation method. Otherwise, they can follow the naming conventions for Data Formats.

Collation Specific Children Conventions

Key-Value Pair and Key-Value List extractors

  • The Key-Value collation types require two extractors: a "key" extractor and a "value" extractor.
    • Referenced Key extractors should be labeled as "KEY - Parent Name"
    • Local children only used by the parent can simply be named "KEY" and "VALUE"

Split extractors

  • The "Between" position requires two extractors, typically one for the beginning of the instance and one for the end.
    • Referenced extractors indicating the beginning should be labeled "BEGIN - Parent Name"
    • Referenced extractors indicating the end should be labeled "END - Parent Name"
    • Local children only used by the parent Data Type can simply be named "BEGIN" and "END"

Ordered Array extractors

  • Ordered Arrays can present some special cases and even exceptions to our rules set out here.
  • When an Ordered Array's children are used to populate Data Columns in a Data Table, the child extractors can use the same name as the Data Column to auto populate the table.
    • In these cases do not follow these naming conventions and simply name the child after the Data Column. The names must match exactly.
  • You may find it helpful to include the child names in the Content Suffix of an Ordered Array (e.g. "OA-H - Extractor Name [Child 1, Child 2, Child 3, Child 4]")
    • Ordered Arrays are often used to create the row structure for various table extraction methods. Sometimes, multiple Ordered Arrays need to be created to account for multiple row formats, usually resulting from optional columns.
      • For example, you may have a four column table where the third column is optional. The table extractor may need two children, one Ordered Array for rows with all four columns filled and one for all but the third column filled.
        • In this case, you should list the elements present in the array. You would have two children:
          • OA-H - Table Name [Column 1, Column 2, Column 3, Column 4]
          • OA-H - Table Name [Column 1, Column 2, Column 4]
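
The optional-column case above can be sketched as a small Python helper that lists one Ordered Array name per row format. This is a hedged illustration: `row_format_names` is a hypothetical function, and it simply enumerates every combination of missing optional columns, which may be more row formats than a real table needs.

```python
from itertools import combinations


def row_format_names(table, columns, optional=(), prefix="OA-H"):
    """Name one Ordered Array per row format, listing the columns present in each."""
    names = []
    # Leave out every subset of the optional columns, smallest subset first.
    for k in range(len(optional) + 1):
        for dropped in combinations(optional, k):
            present = [c for c in columns if c not in dropped]
            names.append(f"{prefix} - {table} [{', '.join(present)}]")
    return names


columns = ["Column 1", "Column 2", "Column 3", "Column 4"]
for name in row_format_names("Table Name", columns, optional=["Column 3"]):
    print(name)
```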

Benefits

We strongly advise you to follow these naming conventions. However, Grooper does not force you to use them, and there are always exceptions to the rule. So why do it? Advantages include:

  • Quick identification of collation methods used
    • Is the Data Type a Key-Value Pair? Is it an Ordered Array?
  • Quick identification of usage in Grooper
    • Is it for a Section Extractor? Is it for classification?
    • This is especially helpful when referencing the extractor. (For example, when setting an Exclusion Extractor property, you will know to look for the "EXCL" prefix.)
  • Standardization means different users will easily be able to identify assets, even when they did not create them.
    • Imagine your Grooper architect leaves the company and you have to replace them with a new employee. Once that employee learns this naming convention, they will be better equipped to navigate the pre-existing Content Model.
  • Standardization means you will easily be able to identify assets down the line.
    • Six months from now, are you going to remember how you built something? These Extraction Prefixes will help you get back up to speed on older projects.
  • Standardization means our Help Desk will get up to speed on your Content Model quickly if you call in for support.
    • This will ultimately save you time by resolving your issue more quickly.

Foldering Conventions

A well-organized folder structure can ease the process of locating assets in your Content Model. Even with a good naming convention, it can become a chore to dig through a list of dozens of extractors to find the one you're looking for.

The standard recommended folders are listed below:

  • Classification Extractors
  • Exclusion Extractors
  • Input Filter Extractors
  • Key Extractors
  • Section Extractors
  • Separation Extractors
  • Subtraction Extractors
  • Table Extractors
  • Value Extractors
[Image: Asset management 1.png]

For simpler Content Models, this may create some unnecessary foldering. This foldering recommendation is "standard" on the assumption that you will create extractors for each folder.

Don't use Input Filter extractors? You don't need an Input Filter Extractors folder. There aren't any tables to extract on your document set? You don't need a Table Extractors folder. Not creating any Data Sections in your Data Model? You don't need a Section Extractors folder. And so on.
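
The "only create the folders you need" advice can be sketched in Python by grouping extractor names under the standard folders according to their usage prefix. This is an illustration, not Grooper functionality: `plan_folders` is a hypothetical helper, and the pairing of prefixes to folders in `FOLDER_BY_USAGE` is an assumption made by matching the usage prefixes to the folder names listed above.

```python
# Assumed pairing of usage prefixes to the standard folder names listed above.
FOLDER_BY_USAGE = {
    "CLAS": "Classification Extractors",
    "EXCL": "Exclusion Extractors",
    "IF": "Input Filter Extractors",
    "KEY": "Key Extractors",
    "SEC": "Section Extractors",
    "SEP": "Separation Extractors",
    "EPI": "Separation Extractors",
    "SUB": "Subtraction Extractors",
    "TBL": "Table Extractors",
    "VE": "Value Extractors",
}


def plan_folders(names):
    """Group names by folder; folders with no extractors are never created."""
    folders, unfiled = {}, []
    for name in names:
        # "TBL-RM OA-H - Payment Info" -> first token "TBL-RM" -> usage "TBL"
        usage = name.split()[0].split("-")[0]
        folder = FOLDER_BY_USAGE.get(usage)
        if folder:
            folders.setdefault(folder, []).append(name)
        else:
            unfiled.append(name)
    return dict(sorted(folders.items())), unfiled


folders, unfiled = plan_folders([
    "VE KVP-H - Invoice Number",
    "SEC SPLT-BEG - Payment Info",
    "TBL-RM OA-H - Payment Info",
    "EXCL - Page Header",
])
for folder, members in folders.items():
    print(folder, members)
```

With this input there is no "Input Filter Extractors" or "Classification Extractors" entry in the result, because no extractor needs one.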

Subfoldering

Subfoldering can get as complex as your use case deems necessary.

[Image: Asset management 2.png]

For example, Key-Value Pairs generally use a "forward direction", meaning the value is to the right of the key (for horizontal layout) or below it (for vertical layout). However, you may need to use the "reverse direction". This flips the orientation between the key and the value, meaning the value is to the left of the key or above it. You may then find it helpful to have a "Key-Value Pairs" folder with both a "Forward" and "Reverse" subfolder.

[Image: Asset management 3.png]

However, if you only use the "forward direction", like most users do, you may simplify your folder structure by only having a "Key-Value Pairs" folder.

[Image: Asset management 4.png]

Certain folders, such as "Table Extractors", can fill up quickly with extractors. You may need to create a series of vertical Ordered Arrays to find column headers for multiple table formats. This is where a foldering structure can really help keep objects in place.

Again, keep in mind a "general to specific" flow when making your folders. With "Table Extractors" as your base folder, "Ordered Arrays" is the next level of specificity. Ordered Arrays can be created to find the table headers, establish the row's pattern for Row Match's row extractor, and more. "Table A" and "Table B" then organize all those various Ordered Arrays into their content-specific locations for the document. Finally, individual headers are specific elements of each table format, organized into a "Headers" folder.

Using Foldering In Your Extractor Naming

For complicated Content Models that are heavily reliant on referenced extractors, your foldering can inform how you name your extractors, making it easy to find them when setting their reference.

[Image: Asset management 5.png]

If we fill out our "Table Extractors" folder from the previous example with extractors, we can see this in action. At the most granular level, the Header Extractors are named "OA-V - Table A - Headers - Column 1". Just looking at the name, you can get an idea of where they are in the folder structure. We know they are Ordered Arrays from their Extraction Prefix. The rest of the name then follows the folder path in the Node Tree: Ordered Arrays > Table A > Headers corresponds to "OA-V - Table A - Headers" in the Data Type's name.
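
The relationship between a folder path and an extractor name can be sketched in Python as a pair of small helpers. This is an illustration under assumptions: `name_from_folders` and `folder_path` are hypothetical functions, the `COLLATION_FOLDER` mapping covers only the "Ordered Arrays" folder from this example, and the code assumes every level of the name is separated by " - ".

```python
# Hypothetical mapping from a collation prefix to its folder name in this example.
COLLATION_FOLDER = {"OA": "Ordered Arrays"}


def name_from_folders(prefix, folders, content):
    """Build a name whose middle segments mirror the folders below the collation folder."""
    return " - ".join([prefix, *folders, content])


def folder_path(name, base="Table Extractors"):
    """Recover the folder path from a name built by name_from_folders."""
    prefix, *middle, _content = name.split(" - ")
    # "OA-V" -> "OA"; with a usage prefix present, the collation token comes last.
    collation = prefix.split()[-1].split("-")[0]
    return [base, COLLATION_FOLDER[collation], *middle]


name = name_from_folders("OA-V", ["Table A", "Headers"], "Column 1")
print(name)
print(" > ".join(folder_path(name)))
```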


When you end up referencing these extractors (as results for another Data Type, in your Data Model, or elsewhere in Grooper), you can use your folder names to find your extractors and use your extractor names to find the folders they live in.

[Image: Asset management 6.png]