Regular Expression (Concept)

Regular expression (or regex) is a standard syntax for matching strings of text, which makes it a way of finding information in text. It is the primary method by which Grooper extracts and returns data from documents.

Using this syntax, you write a sequence of characters to match a string of characters in the text. This sequence is called a "pattern". A pattern is not limited to a single value: it returns every string of text that matches it, so one pattern can return multiple strings.

This syntax can be used to match very specific strings of characters or written more generally to match several permutations of the pattern. For example, one can write a regular expression pattern to match a specific date or any date in a text block.
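
As a rough illustration of specific versus general patterns, the sketch below uses Python's built-in re module. Python is an assumption made only for this example, and the sample text and the MM/DD/YYYY date format are invented; Grooper's own regex engine may differ in details.

```python
import re

# Hypothetical sample text; the dates and their format are illustrative only.
text = "Invoice dated 03/14/2021, due 04/13/2021."

# A specific pattern: matches only this one exact date.
specific = re.findall(r"03/14/2021", text)

# A more general pattern: two digits, slash, two digits, slash, four digits.
general = re.findall(r"\d{2}/\d{2}/\d{4}", text)

print(specific)  # ['03/14/2021']
print(general)   # ['03/14/2021', '04/13/2021']
```

The general pattern returns both dates because each one fits the same shape, while the specific pattern returns only the date that was spelled out.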

Literal String Matching

The most basic regex patterns are used to literally match a string of characters, such as a word or phrase.

If you want to find the word "cat" in a block of text, the regex pattern cat would do that.

Notice that what is matched is the string "cat", not the word. Even if that string exists in the middle of a word, like "concatenate", the pattern still matches.

Text Data      RegEx Pattern    Result (matches in brackets)
dog            cat              dog
cat                             [cat]
cactus                          cactus
concatenate                     con[cat]enate
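
The literal match above can be reproduced with a minimal sketch in Python's re module (Python is assumed here for illustration only). The highlight helper is a hypothetical function that marks each match with square brackets in place of highlighting.

```python
import re

def highlight(pattern: str, text: str) -> str:
    """Wrap every match of pattern in square brackets (hypothetical helper)."""
    return re.sub(pattern, lambda m: f"[{m.group(0)}]", text)

words = ["dog", "cat", "cactus", "concatenate"]

# The literal pattern "cat" matches the string wherever it occurs,
# including in the middle of a longer word.
results = [highlight("cat", word) for word in words]

print(results)  # ['dog', '[cat]', 'cactus', 'con[cat]enate']
```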

Regex patterns execute sequentially from left to right (just like the left-right read order of the English language). If you break the pattern down character by character, it becomes a little clearer what's happening.

First, the c in the pattern looks for a single "c" character in the text data.

Text Data      RegEx Pattern    Result (matches in brackets)
dog            c                dog
cat                             [c]at
cactus                          [c]a[c]tus
concatenate                     [c]on[c]atenate

But the pattern continues.

The ca in the pattern looks for those two characters strung together, "c" followed immediately by "a".

Notice that as the pattern gets more specific, the number of matches decreases. The single letter "c" is more general, producing five matches, whereas even just the two letters "ca" are more specific, producing three.

Text Data      RegEx Pattern    Result (matches in brackets)
dog            ca               dog
cat                             [ca]t
cactus                          [ca]ctus
concatenate                     con[ca]tenate
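
To see how the match count shrinks as the pattern grows, here is a small sketch using Python's re module (again an assumption for illustration). It counts the matches of c, ca, and cat across the same four words.

```python
import re

# The four sample words from the tables, joined into one block of text.
text = "dog cat cactus concatenate"

# Count the matches for each progressively longer pattern.
counts = {pattern: len(re.findall(pattern, text)) for pattern in ["c", "ca", "cat"]}

print(counts)  # {'c': 5, 'ca': 3, 'cat': 2}
```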

By the time you get to the full pattern, cat, you've given even more specificity to what text you're trying to match.

If you want to return just the word "cat" and not the segment "cat" in "concatenate", you'd need to adjust your pattern to be even more specific (methods for doing this are discussed later).

Text Data      RegEx Pattern    Result (matches in brackets)
dog            cat              dog
cat                             [cat]
cactus                          cactus
concatenate                     con[cat]enate
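
As a preview of what a more specific pattern could look like, the sketch below uses the word boundary anchor \b in Python's re module. This is one possible approach chosen for illustration, not necessarily the method this article goes on to recommend, and the variable names are hypothetical.

```python
import re

words = ["dog", "cat", "cactus", "concatenate"]

# \b asserts a word boundary, so "cat" must stand alone as a whole word.
whole_word = re.compile(r"\bcat\b")

matches = [word for word in words if whole_word.search(word)]

print(matches)  # ['cat']
```

Here "concatenate" no longer matches, because the "cat" inside it has letters on both sides rather than word boundaries.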