Category:Document Modeling

From Grooper Wiki

"Document modeling" is the process of designing structured representations of documents to better understand them, manage them, and/or extract information from them. Many different Grooper components are used to represent a document in different ways. These include:

  • Batch Objects - These are the components used to represent a document's structure/format and store raw data.
    • For example, Batch Pages represent a document's individual pages and store the page's image, text data obtained from the Recognize activity, and more.
  • Content Types - These are the components used to represent how documents fit into a classification schema.
    • Content Types are used to form a classification "taxonomy". Content Models are at the top of the taxonomy. They are composed of Document Types, which represent different kinds of documents that all fit within the Content Model.
    • For example, Document Types represent different kinds of documents in a larger Content Model. Content Types are key to determining which Data Model is used to extract data from a document and which Behaviors control processing logic for several different Activities in Grooper.
  • Data Elements - These are the components used to represent document data (fields, sections and tables written on the document).
    • The "Data Model" is the core Data Element. A Data Model will have one or more child Data Elements representing fields, sections and tables. Data Models collect document data during the Extract activity. Which Data Model a document uses is determined by its Content Type.
    • For example, Data Fields represent field-level data, such as a Social Security Number on a personnel form.
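
The relationships described above can be sketched as a small data structure. This is an illustrative Python sketch only, not Grooper's API: every class and attribute name here (BatchPage, DocumentType, ContentModel, DataModel, DataField, Document) is a hypothetical stand-in chosen to mirror the terms on this page. It shows a Content Model containing Document Types, and a document's Content Type determining which Data Model applies to it.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class DataField:
    """Hypothetical stand-in for a field-level Data Element."""
    name: str
    value: Optional[str] = None


@dataclass
class DataModel:
    """Hypothetical stand-in for a Data Model holding child Data Elements."""
    fields: List[DataField] = field(default_factory=list)


@dataclass
class DocumentType:
    """Hypothetical stand-in for one kind of document in a Content Model."""
    name: str
    data_model: DataModel


@dataclass
class ContentModel:
    """Hypothetical stand-in for the top of the classification taxonomy."""
    name: str
    document_types: Dict[str, DocumentType] = field(default_factory=dict)

    def add(self, doc_type: DocumentType) -> None:
        self.document_types[doc_type.name] = doc_type


@dataclass
class BatchPage:
    """Hypothetical stand-in for a page: its image plus recognized text."""
    image_path: str
    text: str = ""


@dataclass
class Document:
    """A document made of pages, optionally assigned a Content Type."""
    pages: List[BatchPage]
    content_type: Optional[DocumentType] = None

    def data_model(self) -> Optional[DataModel]:
        # The Data Model is looked up through the assigned Content Type.
        return self.content_type.data_model if self.content_type else None


# Build a tiny taxonomy: one Content Model with one Document Type.
hr = ContentModel("Human Resources")
hr.add(DocumentType("Personnel Form",
                    DataModel([DataField("Social Security Number")])))

# An unclassified document has no Data Model until a Content Type is assigned.
doc = Document(pages=[BatchPage("page_001.png", text="(recognized text)")])
unclassified_model = doc.data_model()

doc.content_type = hr.document_types["Personnel Form"]
field_names = [f.name for f in doc.data_model().fields]
print(field_names)
```

In this sketch the Data Model hangs off the Document Type rather than the document itself, which reflects the statement above that a document's Data Model is determined by its Content Type.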