2023.1:Classification (Concept): Difference between revisions

Revision as of 07:57, 16 October 2020

Classification, in Grooper, is the process of assigning a Content Type (specifically a Document Type of a Content Model) to a Batch Folder. As far as Grooper is concerned, a document is a Batch Folder object with Batch Page objects as its children. Before classification, the document (Batch Folder) is an unclassified or "blank" document: Grooper doesn't know what kind of document it is yet. Documents are classified by:

Most often, the Classify activity, using training data or rules set on a Content Model
In some cases, the Separate activity, which assigns a Document Type to each new folder it creates
Occasionally, a user manually assigning a Document Type with the "Apply Document Type" command on a Batch Folder

During the Classify activity, Grooper will use information from the document and its pages (generally text) and configurations from a Content Model (such as the Classification Method used) to assign the document a Document Type from a Content Model.
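The idea in the paragraph above can be illustrated with a minimal sketch. It is not Grooper's implementation or API: the `BatchFolder` and `BatchPage` classes, the `RULES` table, and the keyword-counting `classify` function are all hypothetical stand-ins for a Batch Folder, its pages, and a simple rules-based configuration on a Content Model.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class BatchPage:
    text: str  # text recognized on the page


@dataclass
class BatchFolder:
    pages: List[BatchPage] = field(default_factory=list)
    document_type: Optional[str] = None  # None means unclassified ("blank")


# Hypothetical rules: Document Type name -> keywords that suggest it.
RULES: Dict[str, List[str]] = {
    "Invoice": ["invoice", "amount due"],
    "Purchase Order": ["purchase order", "ship to"],
}


def classify(folder: BatchFolder, rules: Dict[str, List[str]]) -> Optional[str]:
    """Assign the Document Type whose keywords appear most often in the page text."""
    text = " ".join(page.text for page in folder.pages).lower()
    best_type, best_score = None, 0
    for doc_type, keywords in rules.items():
        score = sum(1 for keyword in keywords if keyword in text)
        if score > best_score:
            best_type, best_score = doc_type, score
    folder.document_type = best_type  # stays None if nothing matched
    return best_type
```

A folder whose pages match no keywords keeps `document_type = None`, which mirrors a document that remains unclassified after the activity runs.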

Classification and Data Extraction

Classification is performed before data extraction. Classification is actually a critical part of data extraction. Data extraction executes using configured Data Elements in a Data Model. A Data Model is part of a Content Model's hierarchy. Therefore, a document must be assigned a Content Type (specifically a Document Type of a Content Model) in order for the Extract activity to see the Data Model's specifications for data extraction.

Until a document is classified, it has no Content Type assigned to it. Grooper doesn't know which Content Model and corresponding Document Types and Data Models you're using to extract data. Without this information, Grooper will not understand which Data Elements to look for on which Document Types. Nor will it know the extractors used to return values to the Data Elements in a Data Model.

In other words, the document must be classified (having a Document Type assigned to it) before performing the Extract activity.
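The dependency described in this section can be sketched as follows. This is an illustration under stated assumptions, not Grooper's API: `CONTENT_MODEL`, its regular-expression "extractors", and the `extract` function are hypothetical. The point is only that the Data Model is looked up through the Document Type, so an unclassified document gives extraction nothing to work with.

```python
import re
from typing import Dict, Optional

# Hypothetical Content Model: Document Type -> Data Model,
# where each Data Element name maps to a regex acting as its extractor.
CONTENT_MODEL: Dict[str, Dict[str, str]] = {
    "Invoice": {
        "invoice_number": r"Invoice\s*#\s*(\d+)",
        "total": r"Total:\s*\$([\d.]+)",
    },
}


def extract(
    text: str,
    document_type: Optional[str],
    content_model: Dict[str, Dict[str, str]],
) -> Dict[str, Optional[str]]:
    """Run the Data Model's extractors for the document's Document Type."""
    if document_type is None:
        # No Document Type means no Data Model to extract with.
        raise ValueError("document is unclassified; classify it before extracting")
    data_model = content_model[document_type]
    values: Dict[str, Optional[str]] = {}
    for element_name, pattern in data_model.items():
        match = re.search(pattern, text)
        values[element_name] = match.group(1) if match else None
    return values
```

Calling `extract` with `document_type=None` raises an error in this sketch; that is a design choice made here to make the ordering requirement explicit.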

@@ Line 7: / Line 7: @@
 During the '''Classify''' activity, Grooper will use information from the document and its pages (generally text) and configurations from a '''Content Model''' (such as the '''''[[Classification Method]]''''' used) to assign the document a '''Document Type''' from a '''Content Model'''.
-Classification is performed before [[Extract|data extraction]].  Until a document is classified, it has no '''Content Type''' assigned to it.  It doesn't know which '''Content Model''' and corresponding '''Document Types''' and '''Data Models''' you're using to extract data.  Without this information, Grooper will not understand which '''Data Elements''' to look for or the instructions to use to identify their data within the document.
+=== Classification and Data Extraction ===
+Classification is performed before [[Extract|data extraction]].  Classification is actually a critical part of data extraction.  Data extraction executes using configured '''Data Elements''' in a '''Data Model'''.  A '''Data Model''' is part of a '''Content Model's''' hierarchy.  Therefore, a document must be assigned a '''Content Type''' (specifically a '''Document Type''' of a '''Content Model''') in order for the '''Extract''' activity to see the '''Data Model's''' specifications for data extraction.
+Until a document is classified, it has no '''Content Type''' assigned to it.  It doesn't know which '''Content Model''' and corresponding '''Document Types''' and '''Data Models''' you're using to extract data.  Without this information, Grooper will not understand which '''Data Elements''' to look for on which '''Document Types'''.  Nor will it know the extractors used to return values to the '''Data Elements''' in a '''Data Model'''.
 In other words, the document ''must'' be classified (having a '''Document Type''' assigned to it) before performing the '''Extract''' activity.