Regular Expression (Concept): Difference between revisions
Dgreenwood (talk | contribs) No edit summary |
Dgreenwood (talk | contribs) No edit summary |
||
| Line 1: | Line 1: | ||
<onlyinclude> | <onlyinclude> | ||
<blockquote style="font-size:14pt"> | <blockquote style="font-size:14pt"> | ||
Regular expression (or "regex") is a way of finding information in a block of text. It is the primary method Grooper extracts and returns data from documents. | Regular expression (or "regex") is a standard syntax designed to parse text strings. This is a way of finding information in a block of text. It is the primary method Grooper extracts and returns data from documents. | ||
</blockquote> | </blockquote> | ||
| Line 8: | Line 8: | ||
This syntax can be used to match very specific strings of characters or written more generally to match several permutations of the pattern. For example, one can write a regular expression pattern to match a specific date or any date in a text block. | This syntax can be used to match very specific strings of characters or written more generally to match several permutations of the pattern. For example, one can write a regular expression pattern to match a specific date or any date in a text block. | ||
</onlyinclude> | </onlyinclude> | ||
== Literal String Matching == | |||
{|cellpadding=10 cellspacing=5 | |||
| | |||
The most basic regex patterns are used to literally match a string of characters, such as a word or phrase. | |||
If you want to find the word "cat" in a block of text, the regex pattern <code>cat</code> would do that. | |||
Notice as well, what is matched is the string "cat". Even if that string exists in the middle of a word, like "concatenate", the pattern still matches. | |||
| | |||
{|cellpadding=5 cellspacing=5 | |||
|-style="background-color:#36b0a7; color:white" | |||
|'''Text Data'''||'''RegEx Pattern'''||'''Result (highlighted)''' | |||
|- | |||
| | |||
dog | |||
cat | |||
cactus | |||
concatenate | |||
|valign=top style="text-align:center"| | |||
<code>cat</code> | |||
| | |||
dog | |||
<span style="background-color:#ffea61">cat</span> | |||
cactus | |||
con<span style="background-color:#ffea61">cat</span>enate | |||
|} | |||
|- | |||
| | |||
Regex patterns execute sequentially from left to right. If you break the pattern down character by character, it becomes a little clearer what's happening. | |||
First, the <code>c</code> in the pattern looks for every single "c" character in the text data. | |||
| | |||
{|cellpadding=5 cellspacing=5 | |||
|-style="background-color:#36b0a7; color:white" | |||
|'''Text Data'''||'''RegEx Pattern'''||'''Result (highlighted)''' | |||
|- | |||
| | |||
dog | |||
cat | |||
cactus | |||
concatenate | |||
|valign=top style="text-align:center"| | |||
<code>c</code> | |||
| | |||
dog | |||
<span style="background-color:#ffea61">c</span>at | |||
<span style="background-color:#ffea61">c</span>a<span style="background-color:#ffea61">c</span>tus | |||
<span style="background-color:#ffea61">c</span>on<span style="background-color:#ffea61">c</span>atenate | |||
|} | |||
|- | |||
| | |||
But the pattern continues. | |||
The <code>ca</code> in the pattern would look for those two characters strung together, "c" followed immediately by "a". | |||
Notice, as the pattern gets more specific, the number of matches decreases. The single letter "c" is more general, producing five results, where even just the two letters "ca" is more specific, producing three. | |||
| | |||
{|cellpadding=5 cellspacing=5 | |||
|-style="background-color:#36b0a7; color:white" | |||
|'''Text Data'''||'''RegEx Pattern'''||'''Result (highlighted)''' | |||
|- | |||
| | |||
dog | |||
cat | |||
cactus | |||
concatenate | |||
|valign=top style="text-align:center"| | |||
<code>ca</code> | |||
| | |||
dog | |||
<span style="background-color:#ffea61">ca</span>t | |||
<span style="background-color:#ffea61">ca</span>ctus | |||
con<span style="background-color:#ffea61">ca</span>tenate | |||
|} | |||
|- | |||
| | |||
By the time you get to the full pattern, <code>cat</code>, you've given even more specificity to what text you're trying to match. | |||
If you want to return just the word "cat" and not the segment "cat" in "concatenate", you'd need to adjust your pattern to be even ''more'' specific (We will discuss methods to do this later). | |||
| | |||
{|cellpadding=5 cellspacing=5 | |||
|-style="background-color:#36b0a7; color:white" | |||
|'''Text Data'''||'''RegEx Pattern'''||'''Result (highlighted)''' | |||
|- | |||
| | |||
dog | |||
cat | |||
cactus | |||
concatenate | |||
|valign=top style="text-align:center"| | |||
<code>cat</code> | |||
| | |||
dog | |||
<span style="background-color:#ffea61">cat</span> | |||
cactus | |||
con<span style="background-color:#ffea61">cat</span>enate | |||
|} | |||
|} | |||
Revision as of 14:58, 19 March 2021
Regular expression (or "regex") is a standard syntax designed to parse text strings. It is a way of finding information in a block of text, and it is the primary method by which Grooper extracts and returns data from documents.
Using a standard syntax, a sequential line of characters is written to match a string of characters in the text. This line of characters written to match text is called a "pattern" and can potentially return multiple strings, not just one value. It will return any string of text matching the pattern.
This syntax can be used to match very specific strings of characters or written more generally to match several permutations of the pattern. For example, one can write a regular expression pattern to match a specific date or any date in a text block.
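The difference between a specific pattern and a general one can be sketched in a few lines. This is only an illustration using Python's re module, not Grooper itself, and Grooper's regex engine may differ in details. The sample text, the MM/DD/YYYY date layout, and the variable names are all assumptions made for the example.

```python
import re

# Hypothetical sample text; the dates and MM/DD/YYYY layout are assumptions.
text = "Invoice dated 03/19/2021, due 04/18/2021. Shipped 03/22/2021."

# A very specific pattern: matches only this one literal date.
specific = re.findall(r"03/19/2021", text)

# A more general pattern: two digits, slash, two digits, slash, four digits.
general = re.findall(r"\d{2}/\d{2}/\d{4}", text)

print(specific)  # ['03/19/2021']
print(general)   # ['03/19/2021', '04/18/2021', '03/22/2021']
```

The general pattern returns every string that fits its shape, which is why a single pattern can return multiple results rather than one value.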
Literal String Matching
The most basic regex patterns are used to literally match a string of characters, such as a word or phrase. If you want to find the word "cat" in a block of text, the regex pattern <code>cat</code> would do that. Notice as well, what is matched is the string "cat". Even if that string exists in the middle of a word, like "concatenate", the pattern still matches.
Regex patterns execute sequentially from left to right. If you break the pattern down character by character, it becomes a little clearer what's happening. First, the <code>c</code> in the pattern looks for every single "c" character in the text data.
But the pattern continues. The <code>ca</code> in the pattern would look for those two characters strung together, "c" followed immediately by "a". Notice, as the pattern gets more specific, the number of matches decreases. The single letter "c" is more general, producing five results, where even just the two letters "ca" is more specific, producing three.
By the time you get to the full pattern, <code>cat</code>, you've given even more specificity to what text you're trying to match. If you want to return just the word "cat" and not the segment "cat" in "concatenate", you'd need to adjust your pattern to be even more specific (we will discuss methods to do this later).
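The character-by-character narrowing described in this section can be checked with a short script. This is a minimal sketch using Python's re module purely for illustration; Grooper's own regex engine may behave differently in details. The variable names are hypothetical, and the text is the four sample words used in this section (dog, cat, cactus, concatenate).

```python
import re

# The four sample words from this section, one per line.
text = "dog\ncat\ncactus\nconcatenate"

# Count the matches for each progressively longer literal pattern.
counts = {}
for pattern in ("c", "ca", "cat"):
    matches = re.findall(pattern, text)
    counts[pattern] = len(matches)
    print(pattern, len(matches))

# Prints:
# c 5
# ca 3
# cat 2
```

Each added character narrows the result set: the two remaining matches for the full pattern are the word "cat" and the segment "cat" inside "concatenate".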