2.72:Asset Management

From Grooper Wiki
Revision as of 10:49, 29 December 2022 by Dgreenwood (talk | contribs)

Good asset management makes a Grooper project easier to build and maintain. A standard naming and foldering convention for your extractors and other assets will reduce the time you spend configuring and troubleshooting extraction. It also makes extractor references easier to spot and understand without navigating into the references list directly.

Data Type Naming Conventions

A standard naming convention for Data Type extractors is particularly helpful. This single object has multiple collation configurations (Key-Value Pair, Ordered Array, etc.) that change the way data is returned. Furthermore, Data Types are used all over Grooper, not only to extract values, but also to exclude values from extraction, to limit the portion of a document where extraction is performed, to classify documents, and more.

If you name your extractors "Value 1", "Value 2", "Value 3" and so on, the names are vague about both what value they are extracting and how they are getting it. A simple coded prefix on the extractor's name tells users how that extractor works at a glance. We recommend the following prefix/suffix naming convention:

USAGE [TYPE/COLLATION] - Content

  • Extractors have jobs. They do not return values for nothing! The returned values are used for data model extraction, document classification, page separation and more. The "Usage" prefix, in all caps, provides Design Studio users a quick look at how the extractor is being used.
  • The "Content" suffix provides users information about what data the extractor is targeting. The content is dependent on whatever it is you're trying to get out of the document.
  • The "Type/Collation" prefix is optional. It details a Value Reader's configured Extractor Type or a Data Type's configured Collation Provider.
    • This can be helpful or distracting depending on your familiarity with Grooper and the complexity of the extractor. It is often helpful for a Grooper designer to see at a glance what type of extractor they're looking at in the node tree (e.g. Pattern Match, List Match, Labeled OMR, etc.) or how a Data Type is collating results (e.g. Ordered Array, Key-Value List, Split, etc.). Include this prefix if it helps you identify your extractor assets. Leave it out if you find it distracting.
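
The convention above can be sketched as a pair of helper functions that build and parse names. This is only an illustration, not part of Grooper: the names `format_name`, `parse_name` and `ExtractorName` are hypothetical, and the sketch assumes the optional Type/Collation part is written inside literal square brackets between the Usage prefix and the " - " separator.

```python
import re
from typing import NamedTuple, Optional


class ExtractorName(NamedTuple):
    """The three parts of a name (hypothetical helper type, not a Grooper object)."""
    usage: str            # e.g. "FV", "TBL-RM"
    kind: Optional[str]   # optional Type/Collation, e.g. "Pattern Match"
    content: str          # what the extractor targets, e.g. "Invoice Number"


# Assumed layout: USAGE, optional " [TYPE/COLLATION]", then " - Content".
NAME_PATTERN = re.compile(
    r"^(?P<usage>[A-Z]+(?:-[A-Z]+)*)"   # all-caps usage prefix, hyphens allowed
    r"(?:\s+\[(?P<kind>[^\]]+)\])?"     # optional bracketed type or collation
    r"\s+-\s+(?P<content>.+)$"          # separator and content suffix
)


def format_name(usage: str, content: str, kind: Optional[str] = None) -> str:
    """Build a name following the USAGE [TYPE/COLLATION] - Content pattern."""
    prefix = usage.upper()
    if kind:
        prefix = f"{prefix} [{kind}]"
    return f"{prefix} - {content}"


def parse_name(name: str) -> Optional[ExtractorName]:
    """Split a name into its parts, or return None if it does not fit the pattern."""
    match = NAME_PATTERN.match(name.strip())
    if match is None:
        return None
    return ExtractorName(match["usage"], match["kind"], match["content"])
```

For example, `format_name("FV", "Invoice Number")` gives "FV - Invoice Number", while `parse_name("Value 1")` returns `None` because the name carries no usage prefix.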

Usage Prefixes

The extractor should at least be named according to how it's being used in Grooper. What is its job? Is it collecting values for a Data Field in a Data Model? Is it defining section instances for a Data Section? Is it an exclusion extractor or an input filter?

The Usage Prefix should provide a quick look at how the Value Reader or Data Type is being used.

Usage | Usage Prefix | Example Name
Generic Extractor (referenced by other extractors) | VAL | VAL - Text Segment
Field Value Extractor (referenced by a Data Field) | FV or VE | FV - Invoice Number
Exclusion Extractor | EXCL | EXCL - Page Header
Subtraction Extractor | SUB | SUB - Page Footer
Input Filter Extractor | INPT or IF | INPT - Payment Info
Data Section Extractor | SEC | SEC - Payment Info

Table Extractors
Row extractor for Row Match | TBL-RM | TBL-RM - Payment Info
X-Axis extractor for Grid Layout | TBL-X | TBL-X - Payment Info
Y-Axis extractor for Grid Layout | TBL-Y | TBL-Y - Payment Info
Header extractor for Header-Value (deprecated) or Tabular Layout | TBL-HE or TBL-COL | TBL-HE - Payment Info [Payment Date]
Note: For a Header Extractor, it's often helpful to have the Data Table's name and Data Column's name in the Content Suffix. Place the Data Column's name in square brackets or parentheses after the Data Table's name.
Footer extractor for Header-Value (deprecated) or Tabular Layout | TBL-FOOT | TBL-FOOT - Payment Info (Total Line)
Note: For a Footer Extractor, you could place what is used as a footer in square brackets or parentheses, telling the user more than that it's a footer extractor for the "Payment Info" table.
Column Value Extractor (referenced by a Data Column) | TBL-CV | TBL-CV - Invoice Number

Classification Extractors
Positive Classification Extractor | CLAS | CLAS - Invoice
Note: Typically, the Content Suffix will be the name of the Document Type being positively classified.
Negative Classification Extractor | CLAS-NEG | CLAS-NEG - Invoice
Feature Extractor | CLAS-FEAT | CLAS-FEAT - Invoice Bigrams
Note: Typically, the Content Suffix will be the kinds of features being collected.

Separation Extractors
Change in Value Separation Extractor | SEP-CIV | SEP-CIV - Invoice Number
Pattern-Based Separation Extractor | SEP-PB | SEP-PB - Invoice Number
EPI Separation/Page Number Extractor | SEP-EPI | SEP-EPI - Page # of ##
Note: "SEP - EPI Extractor" and "SEP - Page Number Extractor" are accepted alternatives.
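
If your team wants to check names against the prefixes listed above, the lookup can be sketched in a few lines. This is a hypothetical helper written for illustration, not a Grooper feature: `USAGE_PREFIXES` and `usage_of` are made-up names, and the sketch assumes the usage prefix is everything before the first " - " (ignoring an optional bracketed Type/Collation part).

```python
# Usage prefixes from the table above, mapped to the usage they signal.
USAGE_PREFIXES = {
    "VAL": "Generic Extractor",
    "FV": "Field Value Extractor",
    "VE": "Field Value Extractor",
    "EXCL": "Exclusion Extractor",
    "SUB": "Subtraction Extractor",
    "INPT": "Input Filter Extractor",
    "IF": "Input Filter Extractor",
    "SEC": "Data Section Extractor",
    "TBL-RM": "Row extractor for Row Match",
    "TBL-X": "X-Axis extractor for Grid Layout",
    "TBL-Y": "Y-Axis extractor for Grid Layout",
    "TBL-HE": "Header extractor",
    "TBL-COL": "Header extractor",
    "TBL-FOOT": "Footer extractor",
    "TBL-CV": "Column Value Extractor",
    "CLAS": "Positive Classification Extractor",
    "CLAS-NEG": "Negative Classification Extractor",
    "CLAS-FEAT": "Feature Extractor",
    "SEP-CIV": "Change in Value Separation Extractor",
    "SEP-PB": "Pattern-Based Separation Extractor",
    "SEP-EPI": "EPI Separation/Page Number Extractor",
}


def usage_of(name):
    """Return the usage a name's prefix signals, or None if it is not recognized."""
    head, separator, content = name.partition(" - ")
    if not separator or not content.strip():
        return None  # no " - Content" suffix, so the name does not follow the pattern
    prefix = head.split("[", 1)[0].strip()  # drop an optional [TYPE/COLLATION] part
    return USAGE_PREFIXES.get(prefix)
```

Here `usage_of("FV - Invoice Number")` returns "Field Value Extractor", and a vague name such as "Value 1" returns `None`. A team that adopts shorter prefixes would simply edit the dictionary to match its own convention.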

Please note, the usage prefixes "TBL" and "CLAS" may seem like overkill. You may think "FEAT - Invoice Bigrams" is just as informative as "CLAS-FEAT - Invoice Bigrams". If you're a seasoned Grooper user, this is most likely true. For newer Grooper users, though, the additional "TBL" and "CLAS" prefixes call out the extractor's "job" more explicitly in its name. They are not strictly necessary, and you should ultimately adopt the naming convention that works best for your team.