2.72:Asset Management: Difference between revisions

From Grooper Wiki
No edit summary
Line 7: Line 7:
A standard naming convention for '''Data Type''' extractors is particularly helpful.  This single object has multiple collation configurations (''Key-Value Pair'', ''Ordered Array'', etc) that change the way data is returned.  Furthermore, '''Data Types''' are used all over Grooper, not only to extract values, but to exclude values from extraction, to limit the scope of a document where extraction is performed, to classify documents, and more.
A standard naming convention for '''Data Type''' extractors is particularly helpful.  This single object has multiple collation configurations (''Key-Value Pair'', ''Ordered Array'', etc) that change the way data is returned.  Furthermore, '''Data Types''' are used all over Grooper, not only to extract values, but to exclude values from extraction, to limit the scope of a document where extraction is performed, to classify documents, and more.


If you start naming your extractors "Value 1", "Value 2", "Value 3" and so on, these names are vague both in terms of what value they are extracting and ''how'' they are getting the value.  A simple coded prefix in the extractor's name gives users an idea of how that extractor works at a glance.  We recommend the following prefix/suffix naming convention:  '''EXTRACTION - Content'''
If you start naming your extractors "Value 1", "Value 2", "Value 3" and so on, these names are vague both in terms of what value they are extracting and ''how'' they are getting the value.  A simple coded prefix in the extractor's name gives users an idea of how that extractor works at a glance.  We recommend the following prefix/suffix naming convention:   


The "Extraction" prefix, in all caps, provides Design Studio users information about how the extractor is configured.  The "Content" suffix provides users information about what data the extractor is targeting.  The content is dependent on whatever it is you're trying to get out of the document.  However, the "Extraction" prefix can be particularly helpful to keep your extractors organized.
'''USAGE [TYPE/COLLATION] - Content'''


=== Collation Prefixes ===
* Extractors have jobs.  They do not return values for nothing!  The returned values are used for data model extraction, document classification, page separation and more. The "Usage" prefix, in all caps, provides Design Studio users a quick look at how the extractor is being used. 
* The "Content" suffix provides users information about what data the extractor is targeting.  The content is dependent on whatever it is you're trying to get out of the document.
* The "Type/Collation" prefix is optional.  It details a Value Reader's configured Extractor Type or a Data Type's configured Collation Provider.
** This can be helpful or distracting depending on your familiarity with Grooper and the complexity of the extractor.  It is often helpful for a Grooper designer to have a quick look at what type of extractor they're looking at in the node tree (e.g. Pattern Match, List Match, Labeled OMR, etc) or how a Data Type is collating results (e.g. Ordered Array, Key-Value List, Split, etc).  Include this prefix if it helps your extractor asset identification.  Discard it if you find it distracting.


The Extraction Prefix should at least provide users information about the collation provider used.  See the table below for collation prefix naming.
=== Usage Prefixes ===
 
The extractor should at least be named according to how it's being used in Grooper.  What is its job?  Is it collecting values for a Data Field in a Data Model?  Is it defining section instances for a Data Section?  Is it an exclusion extractor or an input filter?
 
The Usage Prefix should provide a quick look at how the Value Reader or Data Type is being used.
 
{|cellpadding=10 cellspacing=5 style="margin:auto; width:700px"
|-style="background-color:#36B0A7; color:white"
|'''Usage'''||'''Usage Prefix'''||'''Example Name'''
|-style="background-color:#ddf5f5"
|Generic Extractor (referenced by other extractors)||VAL||VAL - Text Segment
|-style="background-color:#ddf5f5"
|Field Value Extractor (referenced by a '''Data Field''')||FV ''or'' VE||FV - Invoice Number
|-style="background-color:#ddf5f5"
|Exclusion Extractor||EXCL||EXCL - Page Header
|-style="background-color:#ddf5f5"
|Subtraction Extractor||SUB||SUB - Page Footer
|-style="background-color:#ddf5f5"
|Input Filter Extractor||INPT ''or'' IF||INPT - Payment Info
|-style="background-color:#ddf5f5"
|Data Section Extractor||SEC||SEC - Payment Info
|-style="background-color:#36B0A7; color:white"
|colspan=3|Table Extractors
|-style="background-color:#ddf5f5"
|Row extractor for Row Match||TBL-RM||TBL-RM - Payment Info
|-style="background-color:#ddf5f5"
|X-Axis extractor for Grid Layout||TBL-X||TBL-X - Payment Info
|-style="background-color:#ddf5f5"
|Y-Axis extractor for Grid Layout||TBL-Y||TBL-Y - Payment Info
|-style="background-color:#ddf5f5"
|Header extractor for Header-Value (deprecated) or Tabular Layout||TBL-HE ''or'' TBL-COL||TBL-HE - Payment Info [Payment Date]
|-
|colspan=3|Note: For a Header Extractor, it's often helpful to have the Data Table's name and Data Column's name in the Content Suffix.  Place the Data Column's name in square brackets or parentheses after the Data Table's name.
|-style="background-color:#ddf5f5"
|Footer extractor for Header-Value (deprecated) or Tabular Layout||TBL-FOOT||TBL-FOOT - Payment Info (Total Line)
|-
|colspan=3|Note: For a Footer Extractor, you could place what is used as a footer in square brackets or parentheses, giving the user more information than just that it's a footer extractor for the "Payment Info" table.
|-style="background-color:#ddf5f5"
|Column Value Extractor (referenced by a '''Data Column''')||TBL-CV||CV - Invoice Number
|-style="background-color:#36B0A7; color:white"
|colspan=3|Classification Extractors
|-style="background-color:#ddf5f5"
|Positive Classification Extractor||CLAS||CLAS - Invoice
|-
|colspan=3|
Note: Typically, the Content Suffix will be the name of the Document Type being positively classified.
|-style="background-color:#ddf5f5"
|Negative Classification Extractor||CLAS-NEG||CLAS-NEG - Invoice
|-style="background-color:#ddf5f5"
|Feature Extractor||CLAS-FEAT||CLAS-FEAT - Invoice Bigrams
|-
|colspan=3|
Note: Typically, the Content Suffix will be the kinds of features being collected.
|-style="background-color:#36B0A7; color:white"
|colspan=3|Separation Extractors
|-style="background-color:#ddf5f5"
|Change in Value Separation Extractor||SEP-CIV||SEP-CIV - Invoice Number
|-style="background-color:#ddf5f5"
|Pattern-Based Separation Extractor||SEP-PB||SEP-PB - Invoice Number
|-style="background-color:#ddf5f5"
|EPI Separation/Page Number Extractor||SEP-EPI||SEP-EPI - Page # of ##
|-
|colspan=3|Note: "SEP - EPI Extractor" and "SEP - Page Number Extractor" are accepted alternatives.
|}
 
Please note, the usage prefixes "TBL" and "CLAS" may seem like overkill.  You may think "FEAT - Invoice Bigrams" is just as informative as "CLAS-FEAT - Invoice Bigrams".  If you're a seasoned Grooper user, this is most likely true.  However, the additional "TBL" and "CLAS" prefixes can help newer Grooper users by calling out the extractor's "job" more explicitly in its name.  They are not strictly necessary, and you should ultimately adopt the naming convention that works best for your team.
 
<!---TEMPORARILY REDACTED
 
=== Type and Collation Prefixes ===
 
The Extraction Prefix should at least provide users information about the collation provider used.   
 
For example, a section extractor using split collation with the "Between" position locating payment information sections would be named "SEC [SPLT-BTW] - Payment Info"
 
See the table below for collation prefix naming.


{|cellpadding=10 cellspacing=5 style="margin:auto; width:700px"
{|cellpadding=10 cellspacing=5 style="margin:auto; width:700px"
Line 91: Line 169:
|}
|}


=== Usage Prefixes ===
=== Naming Child Extractors of Data Types ===


If the Data Type is used for a purpose other than returning a value (either to a Data Field or to return values referenced by other extractors), it should also be noted in the Extraction Prefix.  The Extraction Prefix should provide information about the extractor from general to specific.
There are some naming prescriptions to avoid generic, unhelpful child extractor naming, such as "Format 1", "Format 2", "Format 3" and so on.


For example, a section extractor using split collation with the Between position locating payment information sections would be named "SEC SPLT-BTW - Payment Info"
Child extractors should be named in a way that identifies what is being extracted.  When possible, the name can be very similar to the regex pattern used.  For example, a "Date" Data Type extractor can target and return a variety of date formats:
 
{|cellpadding=10 cellspacing=5 style="margin:auto; width:700px"
|-style="background-color:#36B0A7; color:white"
|'''Usage'''||'''Extraction Prefix'''||'''Example Name'''
|-style="background-color:#ddf5f5
|Generic Extractor||VAL||VAL - Text Segment
|-style="background-color:#ddf5f5
|Value Extractor (supplied to a '''Data Field''')||VE||VE - Invoice Number
|-
|colspan=3|Note: The extractor's collation method should always be indicated in the Extraction Prefix.  If the example above used Key Value Pair collation in the Horizontal layout, it should be named "VE KVP-H - Invoice Number" to provide an at-a-glance awareness of how this extractor is configured.
|-style="background-color:#ddf5f5
|Exclusion Extractor||EXCL||EXCL - Page Header
|-style="background-color:#ddf5f5
|Subtraction Extractor||SUB||SUB - Page Footer
|-style="background-color:#ddf5f5
|Input Filter Extractor||IF||IF - Payment Info
|-style="background-color:#ddf5f5
|Data Section Extractor||SEC||SEC SPLT-BEG - Payment Info
|-
|colspan=3|Note: Again, the extractor's collation method should always be indicated in the Extraction Prefix.  The example above would be a Split collated '''Data Type''' using the "begin" position to return an instance for a '''Data Section'''.  If this extractor is just named "SEC - Payment Info" you lose that at-a-glance knowledge of which collation method it's using to create the '''Data Section''' instances.
|-style="background-color:#36B0A7; color:white"
|colspan=3|Table Extractors
|-style="background-color:#ddf5f5
|Row extractor for Row Match||TBL-RM||TBL-RM OA-H - Payment Info
|-style="background-color:#ddf5f5
|X-Axis extractor for Infer Grid||TBL-IGX||TBL-IGX OA-H - Payment Info
|-style="background-color:#ddf5f5
|Y-Axis extractor for Infer Grid||TBL-IGY||TBL-IGY OA-V - Payment Info
|-style="background-color:#ddf5f5
|Header extractor for Header-Value||TBL-HE ''or'' TBL-COL||TBL-HE - Payment Info [Payment Date]
|-
|colspan=3|Note: The Content Suffix of a Data Type's name can provide extra information for the user.  For example, placing "Payment Date" in square brackets indicates the Header Extractor is locating the "Payment Date" header for the "Payment Info" table.
|-style="background-color:#ddf5f5
|Footer extractor for Header-Value||TBL-FOOT||TBL-FOOT - Payment Info [Total Line]
|-
|colspan=3|Note: For a Footer Extractor, you could place what is used as a footer in square brackets, giving the user more information than just that it's a footer extractor for the "Payment Info" table.
|-style="background-color:#36B0A7; color:white"
|colspan=3|Classification Extractors
|-style="background-color:#ddf5f5
|Positive Classification Extractor||CLAS-POS||CLAS-POS - Invoice
|-style="background-color:#ddf5f5
|Negative Classification Extractor||CLAS-NEG||CLAS-NEG - Invoice
|-style="background-color:#ddf5f5
|Feature Extractor||CLAS-FEAT||CLAS-FEAT - Invoice
|-
|colspan=3|Note: The usage prefixes "TBL" and "CLAS" may seem like overkill.  You may think "POS - Invoice" is just as informative as "CLAS-POS - Invoice".  If you're a seasoned Grooper user, this is most likely true.  However, the additional "TBL" and "CLAS" prefixes can help newer Grooper users by calling out the extractor's "job" more explicitly in its name.  They are not strictly necessary, and you should ultimately adopt the naming convention that works best for your team.
|-style="background-color:#36B0A7; color:white"
|colspan=3|Separation Extractors
|-style="background-color:#ddf5f5
|Change in Value Separation Extractor||SEP-CIV||SEP-CIV KVP-H - Invoice Number
|-style="background-color:#ddf5f5
|Pattern-Based Separation Extractor||SEP-PB||SEP-PB - Invoice Number
|-style="background-color:#ddf5f5
|EPI Separation/Page Number Extractor||EPI||EPI - Page # of ##
|}
 
=== Naming Children of Data Types ===
 
==== Data Formats ====
 
Data Formats are simple extractors that are always created as children of Data Types.  One of the main differences is they do not have collation properties.  So, there is no need for a collation prefix.  However, there are still some naming prescriptions to avoid generic, unhelpful naming, such as "Format 1", "Format 2", "Format 3" and so on.
 
Data Formats should be named in a way to identify what is being extracted.  When possible, this can be very similar to the regex pattern used.  For example, a "Date" extractor can have a variety of date formats:


* 06/12/1985
* 06/12/1985
Line 164: Line 179:
* 1985-06-12
* 1985-06-12


These are all formats for the same date.  You should name the Data Formats in a way that is descriptive of the data's pattern when possible.  The corresponding Data Formats that extract the date formats above could be named the following:
These are all formats for the same date.  You should name the child extractors in a way that is descriptive of the data's pattern when possible.  The corresponding extractors returning the date formats above could be named the following:


* ##/##/####
* ##/##/#### (or mm/dd/yyyy)
* ## Month ####
* ## Month #### (or dd Month yyyy)
* ####-##-##
* ####-##-## (or yyyy-mm-dd)
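
As a minimal sketch of the pattern-style names above, the Python below masks digits with "#" and replaces an English month name with the word "Month".  The function name <code>pattern_name</code> and the sample value "12 June 1985" are hypothetical, chosen only for illustration; nothing here is part of Grooper.

```python
import re

# Full English month names; abbreviations are deliberately not handled.
MONTHS = (
    "January", "February", "March", "April", "May", "June", "July",
    "August", "September", "October", "November", "December",
)
MONTH_RE = re.compile(r"\b(?:" + "|".join(MONTHS) + r")\b", re.IGNORECASE)


def pattern_name(sample):
    """Derive a descriptive child-extractor name from one sample value.

    Hypothetical helper: month names become "Month", digits become "#".
    """
    masked = MONTH_RE.sub("Month", sample)
    return re.sub(r"\d", "#", masked)


for sample in ("06/12/1985", "12 June 1985", "1985-06-12"):
    print(sample, "->", pattern_name(sample))
```

A name produced this way only describes the shape of the data.  Add the "(or mm/dd/yyyy)" style hint by hand when the order of day and month matters.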


That being said, Data Formats don't always target data as structured as a date format.  In these cases, a short phrase may be more useful.  For example, a Data Format returning the names of the months could just be named "Month Names"
That being said, Data Formats don't always target data as structured as a date format.  In these cases, a short phrase may be more useful.  For example, a Value Reader returning the names of the months could just be named "Month Names"
 
==== Data Types as Children ====
 
Data Types created as children of other Data Types should still use the Extraction Prefix naming convention to call attention to their collation method.  Otherwise, they can follow the naming conventions for Data Formats.


==== Collation Specific Children Conventions ====
==== Collation Specific Children Conventions ====
Line 184: Line 195:
** Local children '''only''' used by the parent can simply be named "KEY" and "VALUE"
** Local children '''only''' used by the parent can simply be named "KEY" and "VALUE"


Split extractors
NAMED INSTANCE WARNING!!!


* The "Between" position requires two extractors.  Often, one for the beginning of the instance and one for the end.
** Referenced extractors indicating the beginning should be labeled "BEGIN - Parent Name"
** Referenced extractors indicating the end should be labeled "END - Parent Name"
** Local children '''only''' used by the parent Data Type can simply be named "BEGIN" and "END"


Ordered Array extractors


* Ordered Arrays can present some special cases and even exceptions to our rules set out here.
* When an Ordered Array's children are used to populate Data Columns in a Data Table, the child extractors can use the same name as the Data Column to auto populate the table.
* When an Ordered Array's children are used to populate Data Columns in a Data Table, the child extractors can use the same name as the Data Column to auto populate the table.
** In these cases '''do not''' follow these naming conventions and simply name the child after the Data Column.  The names must match exactly.
** In these cases '''do not''' follow these naming conventions and simply name the child after the Data Column.  The names must match exactly.
* You may find it helpful to include the child names in the Content Suffix of an Ordered Array (i.e. "OA-H - Extractor Name [Child 1, Child 2, Child 3, Child 4]")
* You may find it helpful to include the child names in the Content Suffix of an Ordered Array (e.g. "[OA-H] Extractor Name [Child 1, Child 2, Child 3, Child 4]")
** Ordered Arrays are often used to create the row structure for various table extraction methods.  Sometimes, multiple Ordered Arrays need to be created to account for multiple row formats, usually resulting from optional columns.
** Ordered Arrays are often used to create the row structure for various table extraction methods.  Sometimes, multiple Ordered Arrays need to be created to account for multiple row formats, usually resulting from optional columns.
*** For example, you may have a four column table where the third column is optional.  The table extractor may need two children, one Ordered Array for rows with all four columns filled and one for all but the third column filled.
*** For example, you may have a four column table where the third column is optional.  The table extractor may need two children, one Ordered Array for rows with all four columns filled and one for all but the third column filled.
**** In this case, you should list the elements present in the array.  You would have two children:
**** In this case, you should list the elements present in the array.  You would have two children:
***** OA-H - Table Name [Column 1, Column 2, Column 3, Column 4]
***** [OA-H] Table Name [Column 1, Column 2, Column 3, Column 4]
***** OA-H - Table Name [Column 1, Column 2, Column 4]
***** [OA-H] Table Name [Column 1, Column 2, Column 4]
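
The optional-column case above can be sketched in Python as follows.  The helper <code>row_format_names</code> is hypothetical and only builds name strings.  It assumes each optional column may be independently present or absent, so it emits one name per combination.

```python
from itertools import combinations


def row_format_names(table_name, columns, optional=(), collation="OA-H"):
    """List one Ordered Array name per row format.

    Hypothetical helper: every subset of the optional columns is dropped
    in turn, and the columns that remain are listed in the Content Suffix.
    """
    names = []
    for count in range(len(optional) + 1):
        for dropped in combinations(optional, count):
            kept = [col for col in columns if col not in dropped]
            names.append(f"[{collation}] {table_name} [{', '.join(kept)}]")
    return names


columns = ["Column 1", "Column 2", "Column 3", "Column 4"]
for name in row_format_names("Table Name", columns, optional=["Column 3"]):
    print(name)
# [OA-H] Table Name [Column 1, Column 2, Column 3, Column 4]
# [OA-H] Table Name [Column 1, Column 2, Column 4]
```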


=== Benefits ===
=== Benefits ===
Line 222: Line 227:
* Standardization means our Help Desk will get up to speed on your Content Model quickly if you call in for support.
* Standardization means our Help Desk will get up to speed on your Content Model quickly if you call in for support.
** This will ultimately save you time by resolving your issue more quickly.
** This will ultimately save you time by resolving your issue more quickly.
<!---EDITOR'S NOTE
This section needs review.  Some of the advice is good, but some of it is outdated (for example, we don't really recommend using a "Key Extractors" folder anymore).  The general guidance should likely be to limit folders to major usage types, e.g. Classification Extractors, Exclusion Extractors, Field Value Extractors, etc.


== Foldering Conventions ==
== Foldering Conventions ==


A well organized folder structure can ease the process of locating assets in your Content Model.  Even following a good naming convention, it can become a chore to dig through a list of dozens of extractors to find the one you're looking for.
A well-organized folder structure can ease the process of locating assets in your Content Model.  Even following a good naming convention, it can become a chore to dig through a list of dozens of extractors to find the one you're looking for.


The standard recommended folders are listed below:
The standard recommended folders are listed below:

Revision as of 10:49, 29 December 2022

Asset management greatly improves your quality of life. A standard naming and foldering convention for your extractors and other assets will reduce the time you spend configuring and troubleshooting extraction. It also allows extractor references to be more easily noticed and understood without navigating into the references list directly.

Data Type Naming Conventions

A standard naming convention for Data Type extractors is particularly helpful. This single object has multiple collation configurations (Key-Value Pair, Ordered Array, etc) that change the way data is returned. Furthermore, Data Types are used all over Grooper, not only to extract values, but to exclude values from extraction, to limit the scope of a document where extraction is performed, to classify documents, and more.

If you start naming your extractors "Value 1", "Value 2", "Value 3" and so on, these names are vague both in terms of what value they are extracting and how they are getting the value. A simple coded prefix in the extractor's name gives users an idea of how that extractor works at a glance. We recommend the following prefix/suffix naming convention:

USAGE [TYPE/COLLATION] - Content

  • Extractors have jobs. They do not return values for nothing! The returned values are used for data model extraction, document classification, page separation and more. The "Usage" prefix, in all caps, provides Design Studio users a quick look at how the extractor is being used.
  • The "Content" suffix provides users information about what data the extractor is targeting. The content is dependent on whatever it is you're trying to get out of the document.
  • The "Type/Collation" prefix is optional. It details a Value Reader's configured Extractor Type or a Data Type's configured Collation Provider.
    • This can be helpful or distracting depending on your familiarity with Grooper and the complexity of the extractor. It is often helpful for a Grooper designer to have a quick look at what type of extractor they're looking at in the node tree (e.g. Pattern Match, List Match, Labeled OMR, etc) or how a Data Type is collating results (e.g. Ordered Array, Key-Value List, Split, etc). Include this prefix if it helps your extractor asset identification. Discard it if you find it distracting.
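
As a minimal sketch, the convention can be expressed as a small Python helper that assembles a name from its parts. The function name extractor_name and its parameters are hypothetical and are not part of Grooper; the helper only builds the string you would then type into Design Studio.

```python
def extractor_name(usage, content, type_or_collation=None):
    """Build a name of the form 'USAGE [TYPE/COLLATION] - Content'.

    Hypothetical helper: the usage prefix is upper-cased, and the
    optional type/collation part is included only when supplied.
    """
    usage = usage.strip().upper()
    content = content.strip()
    if not usage or not content:
        raise ValueError("usage and content are both required")
    if type_or_collation:
        middle = type_or_collation.strip().upper()
        return f"{usage} [{middle}] - {content}"
    return f"{usage} - {content}"


print(extractor_name("fv", "Invoice Number"))
# FV - Invoice Number
print(extractor_name("sec", "Payment Info", "splt-btw"))
# SEC [SPLT-BTW] - Payment Info
```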

Usage Prefixes

The extractor should at least be named according to how it's being used in Grooper. What is its job? Is it collecting values for a Data Field in a Data Model? Is it defining section instances for a Data Section? Is it an exclusion extractor or an input filter?

The Usage Prefix should provide a quick look at how the Value Reader or Data Type is being used.

Usage | Usage Prefix | Example Name
Generic Extractor (referenced by other extractors) | VAL | VAL - Text Segment
Field Value Extractor (referenced by a Data Field) | FV or VE | FV - Invoice Number
Exclusion Extractor | EXCL | EXCL - Page Header
Subtraction Extractor | SUB | SUB - Page Footer
Input Filter Extractor | INPT or IF | INPT - Payment Info
Data Section Extractor | SEC | SEC - Payment Info
Table Extractors
Row extractor for Row Match | TBL-RM | TBL-RM - Payment Info
X-Axis extractor for Grid Layout | TBL-X | TBL-X - Payment Info
Y-Axis extractor for Grid Layout | TBL-Y | TBL-Y - Payment Info
Header extractor for Header-Value (deprecated) or Tabular Layout | TBL-HE or TBL-COL | TBL-HE - Payment Info [Payment Date]
Note: For a Header Extractor, it's often helpful to have the Data Table's name and Data Column's name in the Content Suffix. Place the Data Column's name in square brackets or parentheses after the Data Table's name.
Footer extractor for Header-Value (deprecated) or Tabular Layout | TBL-FOOT | TBL-FOOT - Payment Info (Total Line)
Note: For a Footer Extractor, you could place what is used as a footer in square brackets or parentheses, giving the user more information than just that it's a footer extractor for the "Payment Info" table.
Column Value Extractor (referenced by a Data Column) | TBL-CV | CV - Invoice Number
Classification Extractors
Positive Classification Extractor | CLAS | CLAS - Invoice
Note: Typically, the Content Suffix will be the name of the Document Type being positively classified.
Negative Classification Extractor | CLAS-NEG | CLAS-NEG - Invoice
Feature Extractor | CLAS-FEAT | CLAS-FEAT - Invoice Bigrams
Note: Typically, the Content Suffix will be the kinds of features being collected.
Separation Extractors
Change in Value Separation Extractor | SEP-CIV | SEP-CIV - Invoice Number
Pattern-Based Separation Extractor | SEP-PB | SEP-PB - Invoice Number
EPI Separation/Page Number Extractor | SEP-EPI | SEP-EPI - Page # of ##
Note: "SEP - EPI Extractor" and "SEP - Page Number Extractor" are accepted alternatives.

Please note, the usage prefixes "TBL" and "CLAS" may seem like overkill. You may think "FEAT - Invoice Bigrams" is just as informative as "CLAS-FEAT - Invoice Bigrams". If you're a seasoned Grooper user, this is most likely true. However, the additional "TBL" and "CLAS" prefixes can help newer Grooper users by calling out the extractor's "job" more explicitly in its name. They are not strictly necessary, and you should ultimately adopt the naming convention that works best for your team.
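
If your team wants to check names against the prefix table, one possible approach is a small parser like the Python sketch below. The set USAGE_PREFIXES is copied from the table above (with plain "SEP" added for the accepted alternatives), while the regular expression and the function parse_extractor_name are assumptions made for illustration, not a Grooper feature. Extend the set if your team adopts shortened prefixes such as "CV" or "FEAT".

```python
import re

# Usage prefixes taken from the table above.
USAGE_PREFIXES = {
    "VAL", "FV", "VE", "EXCL", "SUB", "INPT", "IF", "SEC",
    "TBL-RM", "TBL-X", "TBL-Y", "TBL-HE", "TBL-COL", "TBL-FOOT", "TBL-CV",
    "CLAS", "CLAS-NEG", "CLAS-FEAT",
    "SEP", "SEP-CIV", "SEP-PB", "SEP-EPI",
}

# USAGE, an optional [TYPE/COLLATION] part, then " - Content".
NAME_PATTERN = re.compile(
    r"^(?P<usage>[A-Z]+(?:-[A-Z]+)*)"
    r"(?:\s+\[(?P<collation>[^\]]+)\])?"
    r"\s*-\s+(?P<content>.+)$"
)


def parse_extractor_name(name):
    """Return (usage, collation, content), or None if the name does not
    follow the convention or uses a prefix missing from USAGE_PREFIXES."""
    match = NAME_PATTERN.match(name.strip())
    if match is None or match.group("usage") not in USAGE_PREFIXES:
        return None
    return match.group("usage"), match.group("collation"), match.group("content")


print(parse_extractor_name("FV - Invoice Number"))
print(parse_extractor_name("SEC [SPLT-BTW] - Payment Info"))
print(parse_extractor_name("Value 1"))
```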