Expressions Cookbook (Concept): Difference between revisions
Revision as of 07:51, 10 May 2024
The "Expressions Cookbook" is a reference list for commonly used Code Expressions in Grooper.
Expressions are snippets of .NET code, allowing Grooper to do various things outside its "normal" parameters. This includes calculating or validating extracted Data Field values in a Data Model, applying conditional execution of a Batch Process or IP Profile, and more! This article collects examples of common (and maybe not so common) uses of expressions in Grooper.
Glossary
Batch Folder: The Batch Folder is an organizational unit within a Batch, allowing for a structured approach to managing and processing a collection of documents. Batch Folder nodes serve two purposes in a Batch. (1) Primarily, they represent "documents" in Grooper. (2) They can also serve more generally as folders, holding other Batch Folders and/or Batch Page nodes as children.
- Batch Folders are frequently referred to simply as "documents" or "folders" depending on how they are used in the Batch.
Batch Process: Batch Process nodes are crucial components in Grooper's architecture. A Batch Process is the step-by-step processing instructions given to a Batch. Each step consists of a "Code Activity" or a Review activity. Code Activities are automated by Activity Processing services. Review activities are executed by human operators in the Grooper user interface.
- Batch Processes by themselves do nothing. Instead, they execute Batch Process Steps, which are added as child nodes.
- A Batch Process is often referred to as simply a "process".
Batch: Batch nodes are fundamental in Grooper's architecture. They are containers of documents that are moved through workflow mechanisms called Batch Processes. Documents and their pages are represented in Batches by a hierarchy of Batch Folders and Batch Pages.
Box: Box is a connection option for cloud CMIS Connections. It connects Grooper to the Box content management system for import and export operations.
Classify: Classify is an Activity that "classifies" Batch Folders in a Batch by assigning them a Document Type.
- Classification is key to Grooper's document processing. It affects how data is extracted from a document (during the Extract activity) and how Behaviors are applied.
- Classification logic is controlled by a Content Model's "Classify Method". These methods include using text patterns, previously trained document examples, and Label Sets to identify documents.
Content Category: A Content Category is a container for other Content Category or Document Type nodes in a Content Model. Content Categories are often used simply as organizational buckets for Content Models with large numbers of Document Types. However, Content Categories are also necessary to create branches in a Content Model's classification taxonomy, allowing for more complex Data Element inheritance and Behavior inheritance.
Data Field: Data Fields represent a single value targeted for data extraction on a document. Data Fields are created as child nodes of a Data Model and/or Data Sections.
- Data Fields are frequently referred to simply as "fields".
Data Model: Data Models are leveraged during the Extract activity to collect data from documents (Batch Folders). Data Models are the root of a Data Element hierarchy. The Data Model and its child Data Elements define a schema for data present on a document. The Data Model's configuration (and its child Data Elements' configuration) define data extraction logic and settings for how data is reviewed in a Data Viewer.
Document Type: Document Type nodes represent a distinct type of document, such as an invoice or a contract. Document Types are created as child nodes of a Content Model or a Content Category. They serve three primary purposes:
- They are used to classify documents. Documents are considered "classified" when the Batch Folder is assigned a Content Type (most typically, a Document Type).
- The Document Type's Data Model defines the Data Elements extracted by the Extract activity (including any Data Elements inherited from parent Content Types).
- The Document Type defines all "Behaviors" that apply (whether from the Document Type's Behavior settings or those inherited from a parent Content Type).
Execute: Execute is an Activity that runs one or more specified object commands. This gives access to a variety of Grooper commands in a Batch Process for which there is no Activity, such as the "Sort Children" command for Batch Folders or the "Expand Attachments" command for email attachments.
Export: Export is an Activity that transfers documents and extracted information to external file systems and content management systems, completing the data processing workflow.
Expressions Cookbook: The "Expressions Cookbook" is a reference list for commonly used Code Expressions in Grooper.
Expressions: Expressions (not to be confused with regular expressions) are snippets of VB.NET code that expand Grooper's core functionality.
IP Command: IP Commands specify an image processing (IP) operation (such as image cleanup, format conversion or feature detection) and are used to construct IP Steps in an IP Profile. IP Commands are configured using an IP Step's Command property.
IP Profile: IP Profiles are a step-by-step list of image processing operations (IP Commands). They are used for several image processing related operations, but primarily for:
- Permanently enhancing an image during the Image Processing activity (usually to get rid of defects in a scanned image, such as skewing or borders).
- Cleaning up an image in-memory during the Recognize activity without altering the image to improve OCR accuracy.
- Computer vision operations that collect layout data (table line locations, OMR checkboxes, barcode values, and more) utilized in data extraction.
LINQ to Grooper Objects: LINQ is a Microsoft .NET component that provides data querying capabilities to the .NET framework. In Grooper, you can use the LINQ syntax in Code Expressions to "LINQ to Grooper Objects". This allows expressions to access information from collections of data, such as from multi-instance Data Sections or Data Tables.
Machine: Machine nodes represent servers that have connected to the Grooper Repository. They are essential for distributing task processing loads across multiple servers. Grooper creates Machine nodes automatically whenever a server makes a new connection to a Grooper Repository's database. Once added, Machine nodes can be used to view server information and to manage Grooper Service instances.
OCR: OCR stands for Optical Character Recognition. It allows text on paper documents to be digitized, in order to be searched or edited by other software applications. OCR converts typed or printed text from digital images of physical documents into machine readable, encoded text.
Recognize: Recognize is an Activity that obtains machine-readable text from Batch Pages and Batch Folders. When properly configured with an OCR Profile, Recognize will selectively perform OCR for images and native-text extraction for digital text in PDFs. Recognize can also reference an IP Profile to collect "layout data" like lines, checkboxes, and barcodes. Other Activities then use this machine-readable text and layout data for document analysis and data extraction.
Render: Render is an Activity that converts files of various formats to PDF. It does this by digitally printing the file to PDF using the Grooper Render Printer. This normalizes electronic document content from file formats Grooper cannot read natively to PDF (which it can read natively), allowing Grooper to extract the text via the Recognize Activity.
Service: Grooper Services are various executable programs that run as a Windows Service to facilitate Grooper processing. Service instances are installed, configured, started and stopped using Grooper Command Console (or in older Grooper versions, Grooper Config).
😎 Special thanks to BIS team member Brian Godwin (and others on the Professional Services team) for contributing this article!
Data Model Expressions
Default Value Expressions
Current date
Now
New Guid (Globally Unique IDentifier)
Guid.NewGuid
Inspecting portions of link path for original file (path, filename, metadata) attached to a Batch Folder
- These examples extract information (full path & filename, filename, path, extension) from a Batch Folder's content link
Link.FullPath
- Returns the attached file's path, including the file's name.
Link.Path
- Returns the attached file's path, not including the file's name.
Link.ObjectName
- Returns the attached file's name.
Link.LinkName
- Returns the name of the link (e.g. "Import" or "Export").
IO.Path.GetFileNameWithoutExtension(Link.ObjectName)
- Returns the attached file's name without the extension.
IO.Path.GetExtension(Link.ObjectName)
- Returns the attached file's extension.
- These examples extract specific path segments (drive letter, first folder name) from a batch folder's content link
Link.PathSegments(0)
- Returns the first segment of the file path (e.g. "C:" in "C:\Documents\Howdy.pdf").
Link.PathSegments(1)
- Returns the second segment of the file path (e.g. "Documents" in "C:\Documents\Howdy.pdf").
Note: If you have removed the document link with the Execute > Remove Link command, you can still access the attached file's name using the expression Folder.AttachmentFileName
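The file name helpers above are standard .NET calls, so their behavior can be checked outside Grooper. The following is a minimal sketch that uses plain strings as stand-ins for Link.ObjectName and Link.FullPath. The module and function names are hypothetical, and treating path segments as the backslash-separated parts of the path is an assumption based on the examples above, not a statement about how Grooper builds PathSegments.

```vbnet
Imports System
Imports System.IO

Module LinkPathSketch
    ' Hypothetical stand-in for IO.Path.GetFileNameWithoutExtension(Link.ObjectName).
    Function NameWithoutExtension(objectName As String) As String
        Return Path.GetFileNameWithoutExtension(objectName)
    End Function

    ' Hypothetical stand-in for IO.Path.GetExtension(Link.ObjectName).
    ' Note that GetExtension keeps the leading period.
    Function ExtensionOf(objectName As String) As String
        Return Path.GetExtension(objectName)
    End Function

    ' Assumption: segments are the backslash-separated parts of the full path.
    Function SegmentsOf(fullPath As String) As String()
        Return fullPath.Split("\"c)
    End Function

    Sub Main()
        Console.WriteLine(NameWithoutExtension("Howdy.pdf"))        ' Howdy
        Console.WriteLine(ExtensionOf("Howdy.pdf"))                 ' .pdf
        Console.WriteLine(SegmentsOf("C:\Documents\Howdy.pdf")(0))  ' C:
        Console.WriteLine(SegmentsOf("C:\Documents\Howdy.pdf")(1))  ' Documents
    End Sub
End Module
```

If the extension is needed without the period, something like ExtensionOf(name).TrimStart("."c) would do it.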
Populating fields with specific values (i.e. strings, numbers, dates)
"Hello world!"
123.45
DateAdd("d", 30, Now)
Now.ToString("yyyyMMddhhmmss")
Populating fields with the name of the document's classified Document Type
ContentTypeName
- Note: This is a shortcut for the expression Folder.ContentType.DisplayName
Populating fields with the name of the document's classified Document Type's Content Category.
Folder.ContentType.ParentNode.DisplayName
- Note: When using Content Categories, Document Types are added as child nodes to the Content Category object in the node tree. So, this expression first uses Folder to select the document folder in the Batch. Then, ContentType inspects the document's classified Document Type. The Document Type is a child of the Content Category in this example. Since the Content Category is the Document Type's parent object, ParentNode jumps up the node tree hierarchy to inspect the parent Content Category. Last, DisplayName returns the Content Category's name.
Calculate Expressions
Addition of multiple fields
IntegerField1 + IntegerField2
DecimalField1 + DecimalField2 + DecimalField3
Concatenation of multiple fields
String.Concat(StringField1, StringField2)
String.Concat(StringField1, StringField2, StringField3)
String.Concat(StringField1, StringField2, StringField3, StringField4)
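One detail that matters when some fields are empty: String.Concat treats a missing (Nothing) string as an empty string, so it never fails on a blank value, but it also never adds separators. The sketch below shows this with plain strings. JoinNonEmpty is a hypothetical helper, not a Grooper function, included only to show one way to add a separator while skipping blanks.

```vbnet
Imports System
Imports System.Linq

Module ConcatSketch
    ' Hypothetical helper: joins only the non-empty parts with a separator.
    Function JoinNonEmpty(separator As String, ParamArray parts As String()) As String
        Return String.Join(separator, parts.Where(Function(p) Not String.IsNullOrEmpty(p)).ToArray())
    End Function

    Sub Main()
        Dim missing As String = Nothing
        ' Concat treats Nothing as an empty string.
        Console.WriteLine(String.Concat("A", missing, "C"))       ' AC
        ' The helper skips the blank part and separates the rest.
        Console.WriteLine(JoinNonEmpty(", ", "A", missing, "C"))  ' A, C
    End Sub
End Module
```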
Rounding
- This example rounds a decimal value to a precision of 4 digits (e.g. 2.34567891 to 2.3457)
Math.Round(DecimalField1, 4)
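Math.Round without a rounding mode uses "round half to even" (banker's rounding), which can surprise anyone expecting halves to always round up. The sketch below compares the default with MidpointRounding.AwayFromZero on Decimal values. The module and function names are hypothetical, and it assumes the field value behaves like a .NET Decimal.

```vbnet
Imports System
Imports System.Globalization

Module RoundingSketch
    ' Default behavior: midpoints round to the nearest even digit.
    Function RoundDefault(value As Decimal, digits As Integer) As Decimal
        Return Math.Round(value, digits)
    End Function

    ' Alternative: midpoints always round away from zero.
    Function RoundAwayFromZero(value As Decimal, digits As Integer) As Decimal
        Return Math.Round(value, digits, MidpointRounding.AwayFromZero)
    End Function

    Sub Main()
        Dim inv As CultureInfo = CultureInfo.InvariantCulture
        Console.WriteLine(RoundDefault(2.34567891D, 4).ToString(inv))    ' 2.3457
        Console.WriteLine(RoundDefault(0.12345D, 4).ToString(inv))       ' 0.1234
        Console.WriteLine(RoundAwayFromZero(0.12345D, 4).ToString(inv))  ' 0.1235
    End Sub
End Module
```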
Non-integer addition (e.g. of date values)
- These examples increment a date by 30 days ("d"), 1 year ("yyyy"), and the last decrements the date by 3 months ("m")
DateAdd("d", 30, DateField1)
DateAdd("yyyy", 1, DateField1)
DateAdd("m", -3, DateField1)
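Month and year arithmetic clamps to the last valid day of the target month, which is worth knowing before relying on it for due dates. The following sketch runs the same three DateAdd intervals against fixed dates instead of a Data Field. ShiftDate is a hypothetical wrapper, and the results assume a culture that uses the Gregorian calendar.

```vbnet
Imports System
Imports Microsoft.VisualBasic

Module DateMathSketch
    ' Hypothetical wrapper around DateAdd so the results can be compared.
    Function ShiftDate(interval As String, amount As Double, start As DateTime) As DateTime
        Return DateAdd(interval, amount, start)
    End Function

    Sub Main()
        ' 30 days after May 10, 2024 is June 9, 2024.
        Console.WriteLine(ShiftDate("d", 30, #5/10/2024#) = #6/9/2024#)     ' True
        ' 3 months before May 31 clamps to the last day of February.
        Console.WriteLine(ShiftDate("m", -3, #5/31/2024#) = #2/29/2024#)    ' True
        ' 1 year after a leap day clamps to February 28.
        Console.WriteLine(ShiftDate("yyyy", 1, #2/29/2024#) = #2/28/2025#)  ' True
    End Sub
End Module
```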
Reformatting / Normalization of values
- This example replaces any backslashes with underscores
StringField1.Replace("\", "_")
- This example removes any backslashes
StringField1.Replace("\", "")
Substring calculation
- These examples extract information contained within a string "ABC123456XXXX654321YYY" by designating the 0-based starting index and desired number of characters
- ABC (first 3 characters): StringField1.Substring(0, 3)
- 123456 (6 characters within the string): StringField1.Substring(3, 6)
- XXXX (4 characters within the string): StringField1.Substring(9, 4)
- YYY (last 3 characters): StringField1.Substring(StringField1.Length - 3)
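Substring throws an ArgumentOutOfRangeException when the requested range runs past the end of the string, which can happen when an extracted value is shorter than expected. The sketch below reproduces the four examples against the sample string and adds a guarded variant. SafeSubstring is a hypothetical helper, not part of Grooper.

```vbnet
Imports System

Module SubstringSketch
    ' Hypothetical helper: returns the requested slice,
    ' or an empty string when the value is too short.
    Function SafeSubstring(value As String, start As Integer, length As Integer) As String
        If value Is Nothing OrElse start < 0 OrElse length < 0 OrElse start + length > value.Length Then
            Return ""
        End If
        Return value.Substring(start, length)
    End Function

    Sub Main()
        Dim sample As String = "ABC123456XXXX654321YYY"
        Console.WriteLine(SafeSubstring(sample, 0, 3))           ' ABC
        Console.WriteLine(SafeSubstring(sample, 3, 6))           ' 123456
        Console.WriteLine(SafeSubstring(sample, 9, 4))           ' XXXX
        Console.WriteLine(sample.Substring(sample.Length - 3))   ' YYY
        ' A value that is too short yields "" instead of an exception.
        Console.WriteLine(SafeSubstring("AB", 0, 3) = "")        ' True
    End Sub
End Module
```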
Getting the location coordinates of a field on the document
- This could be used to determine the coordinates and size of an extracted value on a document.
GetFieldInstance("Field Name").Location.ToString
- Note, this returns logical location in inches, not pixels. So additional work would need to be performed to convert this to pixels if needed.
Validate Expressions
Date in past / future
- This example ensures the date value is a past date
DateField1 < Now
- This example ensures the date value is at least 30 days in the future
DateField1 >= DateAdd("d", 30, Now)
Equality / inequality of two fields (multiple options)
StringField1 = StringField2
IntegerField1.Equals(IntegerField2)
IntegerField1 <> DecimalField1
Not DecimalField1.Equals(DecimalField2)
Summing fields and comparing to another field
IntegerField1 + IntegerField2 = IntegerField3
DecimalField1 + DecimalField2 = DecimalField3
DecimalField1 = SumFieldInstance("Table1\AmountColumn")
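Exact equality on sums is only dependable when the values are stored as .NET Decimal. With Double values, the same comparison can fail because of binary floating-point rounding. The sketch below shows the difference with plain variables. It assumes Decimal fields surface as System.Decimal in expressions, which this article does not state, and the function names are hypothetical.

```vbnet
Imports System

Module SumCompareSketch
    ' Decimal arithmetic is exact for values like 0.1 and 0.2.
    Function SumsMatchDecimal(a As Decimal, b As Decimal, total As Decimal) As Boolean
        Return a + b = total
    End Function

    ' Double arithmetic is binary floating point and may be off by a tiny amount.
    Function SumsMatchDouble(a As Double, b As Double, total As Double) As Boolean
        Return a + b = total
    End Function

    Sub Main()
        Console.WriteLine(SumsMatchDecimal(0.1D, 0.2D, 0.3D))  ' True
        Console.WriteLine(SumsMatchDouble(0.1, 0.2, 0.3))      ' False
    End Sub
End Module
```

If a value does arrive as a Double, comparing Math.Abs(a + b - total) against a small tolerance is a common alternative to strict equality.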
Running regular expression against field
Text.RegularExpressions.Regex.IsMatch(StringField1, "[0-9]{6}")
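Regex.IsMatch returns True when the pattern matches anywhere in the input, so the pattern above passes any value that contains six consecutive digits, including longer numbers. If the intent is "exactly six digits and nothing else", the pattern needs anchors. The sketch below contrasts the two using sample strings. The function names are hypothetical.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module RegexSketch
    ' Unanchored: True when six consecutive digits appear anywhere.
    Function ContainsSixDigits(value As String) As Boolean
        Return Regex.IsMatch(value, "[0-9]{6}")
    End Function

    ' Anchored: True only when the whole value is six digits.
    Function IsExactlySixDigits(value As String) As Boolean
        Return Regex.IsMatch(value, "^[0-9]{6}$")
    End Function

    Sub Main()
        Console.WriteLine(ContainsSixDigits("INV-1234567"))   ' True
        Console.WriteLine(IsExactlySixDigits("INV-1234567"))  ' False
        Console.WriteLine(IsExactlySixDigits("123456"))       ' True
    End Sub
End Module
```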
Inspecting field-level confidence scores
Instance.Confidence > 0.8
Batch Processing Expressions
Should Submit Expression
Inspecting flagged status
- These examples would submit the task when the object (i.e. folder, page) is flagged or not flagged (2nd example)
Item.Flagged
Not Item.Flagged
- This example would submit the task when the object (folder) contains one or more flagged pages
DirectCast(Item, BatchFolder).FlaggedPages.Any()
Inspecting flagged message
Item.FlagReason = "Needs classification"
Item.FlagReason <> "Bypass review"
Inspecting presence of local copy in Grooper
DirectCast(Item, BatchFolder).HasLocalCopy
Inspecting existence of native version
DirectCast(Item, BatchFolder).HasAttachment
Inspecting MIME type
- The first example would submit the task when the object (folder) is a native PDF; the second would submit it when the attachment's MIME type is PDF
DirectCast(Item, BatchFolder).IsNativePDF
DirectCast(Item, BatchFolder).AttachmentMimeType = "application/pdf"
Inspecting content type / parent content category
DirectCast(Item, BatchFolder).ContentTypeName = "MyContentType"
DirectCast(DirectCast(Item, BatchFolder).ContentType.ParentNode, ContentCategory).Name = "MyContentCategory"
Inspecting if a field is blank / populated
DirectCast(Item, BatchFolder).IndexData.Fields("StringField1").Value <> ""
Not String.IsNullOrEmpty(DirectCast(Item, BatchFolder).IndexData.Fields("StringField1").Value)
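For string values, the two checks above behave the same way in VB.NET, because string comparison operators treat Nothing as an empty string. Neither check rejects a value made up only of spaces. The sketch below demonstrates both points with plain strings, assuming the field's Value behaves like a String. The function names are hypothetical.

```vbnet
Imports System

Module BlankCheckSketch
    ' Mirrors the second expression: populated means not Nothing and not "".
    Function IsPopulated(value As String) As Boolean
        Return Not String.IsNullOrEmpty(value)
    End Function

    ' Stricter variant: whitespace-only values also count as blank.
    Function HasVisibleText(value As String) As Boolean
        Return Not String.IsNullOrWhiteSpace(value)
    End Function

    Sub Main()
        Dim missing As String = Nothing
        ' VB string comparison treats Nothing as "".
        Console.WriteLine(missing <> "")          ' False
        Console.WriteLine(IsPopulated(missing))   ' False
        Console.WriteLine(IsPopulated("   "))     ' True
        Console.WriteLine(HasVisibleText("   "))  ' False
    End Sub
End Module
```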
Inspecting image properties (resolution, color mode, aspect ratio, size (in bytes), pixel count, etc.)
DirectCast(Item, BatchPage).PrimaryImage.ResolutionX < 240
DirectCast(Item, BatchPage).PrimaryImage.IsBinary
DirectCast(Item, BatchPage).PrimaryImage.IsColor
DirectCast(Item, BatchPage).PrimaryImage.IsLandscape
DirectCast(Item, BatchPage).PrimaryImage.AspectRatio > 1.25
DirectCast(Item, BatchPage).PrimaryImage.Size > 40960
DirectCast(Item, BatchPage).PrimaryImage.PixelCount > 3500000
Inspecting presence of layout data (of a certain type: lines, OMR boxes, etc.)
DirectCast(Item, BatchFolder).HasLayoutData
Does page / document have OCR text?
DirectCast(Item, BatchFolder).HasRuntimeOCR
DirectCast(Item, BatchPage).HasRuntimeOCR
Inspecting classification candidates and classification scores, incl. alternate candidate scores
DirectCast(Item, BatchFolder).ContentTypeName = "Document Type Name"
Functions and Should Submits
Grooper can now use lambda functions in expressions (and not just Should Submits, all expressions!). This gives you some really advanced capabilities if you have more advanced .NET programming skills.
This example determines whether a page-scoped task, like Recognize or Execute > Rasterize, should be submitted depending on how many text segments are present on a PDF page. If the PDF page has fewer than 15 text segments, the task submits; otherwise the PDF page is not processed.
- This is useful when dealing with poorly formed PDFs that must be forced to be treated like an image when Grooper otherwise thinks they are a native text document.
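Function() As Boolean
    ' Only PDF pages are inspected; anything else is not submitted.
    If DirectCast(Item, BatchPage).IsPDF Then
        Dim doc As Grooper.PDF.PdfDoc = New Grooper.PDF.PdfDoc(DirectCast(Item, BatchPage).GetImageVersion, True)
        ' Get the page info for the first page (index 0) of the PDF.
        Dim info As Grooper.PDF.PdfPageInfo = doc.Sharp.GetPageInfo(0)
        ' Submit when the page has fewer than 15 text segments.
        Return (info.DrawTextOps.Count < 15)
    End If
    Return False
End Function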
You could change what property values determine if the task is submitted by changing the Return statement in the function. Here are some examples:
Return info.PageType = PDF.PdfPageInfo.PageTypes.Mixed
- Tasks would submit if the PDF's page type is "Mixed".
Return info.RenderResolution = "Color @ 300 DPI"
- Tasks would submit if the PDF's render format is Color @ 300 DPI.
Return info.PageSize = "8.50"" x 11.00"""
- Tasks would submit if the PDF's page size is 8.5 x 11.
Return info.Images.Count = 4
- Tasks would submit if the PDF has exactly 4 images embedded in it.
Return info.PathSegments.Count > 257
- Tasks would submit if the PDF has more than 257 vector drawing paths.
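For readers new to multi-line lambdas, the shape of the function can be tried outside Grooper. The sketch below is a simplified stand-in: it takes the "is this a PDF" flag and the text segment count as parameters instead of reading them from Item, and ShouldSubmit is a hypothetical name. It shows only the control flow, not how Grooper evaluates the expression.

```vbnet
Imports System

Module LambdaSketch
    ' Hypothetical threshold check mirroring the "fewer than 15 text segments" rule.
    Public ReadOnly ShouldSubmit As Func(Of Boolean, Integer, Boolean) =
        Function(isPdf As Boolean, textSegmentCount As Integer) As Boolean
            If isPdf Then
                Return textSegmentCount < 15
            End If
            ' Non-PDF pages are never submitted.
            Return False
        End Function

    Sub Main()
        Console.WriteLine(ShouldSubmit(True, 3))    ' True
        Console.WriteLine(ShouldSubmit(True, 40))   ' False
        Console.WriteLine(ShouldSubmit(False, 3))   ' False
    End Sub
End Module
```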
Next Step Expressions
Inspecting batch creator
If(Batch.CreatedBy.ToLower() = "domain\jusername", TrueStepName, FalseStepName)
If(Batch.CreatedByDisplayName = "Joe Username", TrueStepName, FalseStepName)
Inspecting creation time (range, day of week)
If(DatePart(DateInterval.Month, Batch.Created) = 6, TrueStepName, FalseStepName)
If(DatePart(DateInterval.Day, Batch.Created) > 15, TrueStepName, FalseStepName)
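Note that DateInterval.Day returns the day of the month, not the day of the week. For a day-of-week test, the DayOfWeek property of a date is one option. The sketch below shows both against fixed dates rather than Batch.Created. The step names are hypothetical strings used only for illustration, and returning a string from a Next Step expression is an assumption, not something this article confirms.

```vbnet
Imports System
Imports Microsoft.VisualBasic

Module NextStepSketch
    ' "Weekend Step" and "Weekday Step" are hypothetical step names.
    Function StepForDay(created As DateTime) As String
        Dim isWeekend As Boolean = created.DayOfWeek = DayOfWeek.Saturday OrElse
                                   created.DayOfWeek = DayOfWeek.Sunday
        Return If(isWeekend, "Weekend Step", "Weekday Step")
    End Function

    ' Mirrors the day-of-month check from the second example above.
    Function IsSecondHalfOfMonth(created As DateTime) As Boolean
        Return DatePart(DateInterval.Day, created) > 15
    End Function

    Sub Main()
        Console.WriteLine(StepForDay(#5/10/2024#))           ' Weekday Step (a Friday)
        Console.WriteLine(StepForDay(#5/11/2024#))           ' Weekend Step (a Saturday)
        Console.WriteLine(IsSecondHalfOfMonth(#5/10/2024#))  ' False
        Console.WriteLine(IsSecondHalfOfMonth(#5/16/2024#))  ' True
    End Sub
End Module
```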
IP Profile Expressions
IP Command Should Execute Expressions
Inspecting image properties (resolution, color mode, aspect ratio, size, pixel count, etc.)
Image.ResolutionX < 240
Image.IsBinary
Image.IsColor
Image.IsLandscape
Image.AspectRatio > 1.25
Image.Size > 40960
Image.PixelCount > 3500000
Inspecting presence of layout data (of a certain type: lines, OMR boxes, etc.)
Results.Line_Detection.HorizontalLines.Any()
Results.Line_Detection.VerticalLines.Any()
Results.Box_Detection.Boxes.Any()
Results.Patch_Code_Detection.PatchCodes.Any()
Decisioning based on image classification (Results.ClassifyImage.whatever)
Results.Classify_Image.ClassName = "Sample 1"
Accessing and inspecting results log of prior IP commands
Results.Measure_Entropy.Entropy > 0.85
Inspecting whether prior commands modified image(s)
ResultList.IsImageSourceImage
Mapping Expressions
Import Mapping Expressions
Value concatenation
String.Concat(field1, field2)
String.Concat(field1, " ", field2)
Value padding (adding or removing)
- These examples show how to left-pad a value with zeroes for 20 characters, right-pad a value with spaces for 40 characters, and finally trim a padded value of spaces.
field1.PadLeft(20, "0"c)
field2.PadRight(40)
field3.Trim()
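PadLeft and PadRight only ever add characters. A value that is already longer than the target width is returned unchanged, not truncated. The sketch below shows this with shorter widths than the examples above so the output is easy to read. ZeroPad is a hypothetical helper name.

```vbnet
Imports System

Module PaddingSketch
    ' Left-pads a value with zeroes up to the given total width.
    Function ZeroPad(value As String, width As Integer) As String
        Return value.PadLeft(width, "0"c)
    End Function

    Sub Main()
        Console.WriteLine(ZeroPad("42", 5))               ' 00042
        ' Longer values are returned as-is, never truncated.
        Console.WriteLine(ZeroPad("1234567", 5))          ' 1234567
        ' Brackets make the padding spaces visible.
        Console.WriteLine("[" & "ab".PadRight(4) & "]")   ' [ab  ]
        Console.WriteLine("[" & "  ab  ".Trim() & "]")    ' [ab]
    End Sub
End Module
```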
Adding environment variables (date, user, etc.)
Now
Environment.MachineName
Environment.UserName
Environment.UserDomainName
Environment.OSVersion
Environment.ProcessorCount
Export Mapping Expressions
Addition of multiple fields
IntegerField1 + IntegerField2
DecimalField1 + DecimalField2 + DecimalField3
Concatenation of multiple fields
String.Concat(StringField1, StringField2)
String.Concat(StringField2, ", ", StringField1, ": ", StringField3)
How to access Grooper attributes (content type name, GUID, index data, etc.)
CurrentDocument.ContentTypeName
CurrentDocument.Id
CurrentDocument.IndexData.Sections("Section1").Fields("Field1").Value
CurrentDocument.IndexData.Sections("Section1").Sections("SectionA").Fields("Field1A").Value
CurrentDocument.IndexData.Tables("Table1").Rows.First().Cells("Column1").Value
Naming based on original file name
IO.Path.GetFileNameWithoutExtension(CurrentDocument.ContentLink.Name)
Converting a date field to a string in a "year-month-day" format
DateField.ToString("yyyy-MM-dd")
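Custom date format strings are case-sensitive: "MM" is the month while "mm" is minutes, and "HH" is the 24-hour clock while "hh" is the 12-hour clock. This matters for timestamp-style formats such as "yyyyMMddhhmmss", where an afternoon time formatted with "hh" looks identical to a morning one. The sketch below formats a fixed date with the invariant culture so the results are reproducible. Stamp is a hypothetical helper name.

```vbnet
Imports System
Imports System.Globalization

Module TimestampSketch
    ' Formats a date with a custom format string, independent of machine culture.
    Function Stamp(value As DateTime, format As String) As String
        Return value.ToString(format, CultureInfo.InvariantCulture)
    End Function

    Sub Main()
        ' May 10, 2024 at 7:51 PM.
        Dim sample As New DateTime(2024, 5, 10, 19, 51, 0)
        Console.WriteLine(Stamp(sample, "yyyy-MM-dd"))      ' 2024-05-10
        Console.WriteLine(Stamp(sample, "yyyyMMddhhmmss"))  ' 20240510075100
        Console.WriteLine(Stamp(sample, "yyyyMMddHHmmss"))  ' 20240510195100
    End Sub
End Module
```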
Misc Expression Snippets
These expressions may or may not be useful by themselves. It's most likely they are used as part of a larger expression. They are documented here to keep track of previously requested solutions.
Count the number of children at a certain level. This would count the number of Batch Folders that are direct children of a Batch Folder being processed.
ChildrenAtLevel(1).Count
Count the number of children at a certain level of a parent folder. This would count the number of Batch Folders that are direct children of the parent Batch Folder relative to the Batch Folder being processed.
ParentFolder.ChildrenAtLevel(1).Count
General
WIP: This section is a work-in-progress. It needs to be expanded for completeness.
Understanding how to traverse hierarchy of, e.g. batch or content model
Understanding how to parse tables by row & column
Identifying Sections by instance number
How to inspect properties of node
Dynamic referencing vs. GUID referencing
Conditional expressions with IIF / IF
Using LINQ in Expressions
Direct Casting: when to (Cast)