TF-IDF

From Grooper Wiki

TF-IDF (Term Frequency-Inverse Document Frequency) is a numerical statistic intended to reflect how important a word is to a document within a collection (or document set or “corpus”). It is how Grooper uses machine learning for training-based document classification (via the Lexical method) and data extraction (via the Field Class extractor).

TF-IDF is designed to reflect how important a word is to a document or a set of documents. The importance value (or weighting) of a word increases proportionally to the number of times it appears in the document (Term Frequency). The weighting is offset by the number of documents in the set containing the word (Inverse Document Frequency). Some words appear more generally across documents and hence are less unique identifiers. So, their weighting is lessened.


About

TF-IDF stands for “term frequency-inverse document frequency”. It is a numerical statistic intended to reflect how important a word is to a document within a collection (or document set or “corpus”). This “importance” is assigned a numerical weighting value. The higher the word’s weighting value, the more relevant it is to that document in the corpus. The TF-IDF algorithm is one of the most common ways to identify particularly important or relevant words and phrases for a document. It’s so popular you’ve probably used it already without even realizing it. If you’ve used a search engine, it’s likely some variant of TF-IDF was used to rank your search results.

It is popular because it is both highly effective and relatively simple to understand. Take the two components “Term Frequency” and “Inverse Document Frequency”.

Term Frequency

Some words on a document are going to be more common. Some are going to be less common. If you see the same word over and over again on a document, there's a good chance that word is important to that document in one way or another.

Imagine looking at an Invoice. You expect to see a lot of terms related to Invoice documents, such as “invoice”, “invoice number” or “invoice date”. If you see a document with the word “invoice” over and over again, there’s a good chance you’re looking at an Invoice. The term “invoice” should carry more weight for Invoices. It is a feature of what makes an Invoice an Invoice. You wouldn't necessarily expect to see the term "invoice" on a Revenue Statement. Those are just two very different types of documents. Revenue Statements will have their own terms that appear frequently, like "income", "profit" or "expense".

For TF-IDF, the fact that you see a term more frequently is an important feature of that document. The more times that term (also known as a feature) repeats on a document, the higher that term's (or feature's) weighting value.

Inverse Document Frequency

But what about Purchase Orders? You’ll see terms like “invoice number” on those kinds of documents too. How does the algorithm distinguish between terms that are common to multiple document types in the set? That is where the second part of the equation, “Inverse Document Frequency”, comes into play.

Inverse Document Frequency is a statistical interpretation of term-specificity. Take an extreme example like the word “the”. That word is so common that if we just used term frequency, we’d end up giving the word “the” much more importance than it should have. In fact, the word “the” should have little or no weight. It’s not specific to one document or another and is therefore not helpful when it comes to identifying one over the other. Inverse Document Frequency reduces the weight of features that are common to multiple documents in the set.

If you can find terms that appear frequently on one type of document that are also specific to that document type in the set, there’s a good chance if you see those terms on a different or unknown document, you’re looking at the same kind of document.

This is essentially the logic behind using TF-IDF to classify documents. The algorithm will score frequent terms (or features) in a document with a higher weighting value, but it will then decrease the term’s weighting value if that term appears frequently across multiple documents in the set. So yes, Purchase Orders might have “invoice-y” terms on them, but the “purchase order-y” terms specific to Purchase Orders are going to carry much more weight when it comes time to classify the document.

The Math

When it comes down to the math, the algorithm looks something like this.

A Term’s Weighting Value = TF x IDF

where…

TF = (# times the term appears in a document) / (total # of terms in the document)

and...

IDF = log(total # of document types / # of document types with the term in it)
Alternatively: IDF = -log(# of document types with the term in it / total # of document types)
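
The two formulas above can be sketched in a few lines of Python. This is only an illustration, not Grooper's implementation. The function names are made up for this example, each document is assumed to be a simple list of lowercase terms, and the logarithm is taken in base 10 because that is the base that reproduces the values in the worked examples in this article (log(3/1) = 0.4771).

```python
import math

def term_frequency(term, tokens):
    """TF = (# times the term appears in a document) / (total # of terms in the document)."""
    return tokens.count(term) / len(tokens)

def inverse_document_frequency(term, documents):
    """IDF = log10(total # of documents / # of documents with the term in it)."""
    containing = sum(1 for tokens in documents if term in tokens)
    if containing == 0:
        # Assumption for this sketch: a term found on no document gets no weight.
        return 0.0
    return math.log10(len(documents) / containing)

def tf_idf(term, tokens, documents):
    """A term's weighting value = TF x IDF."""
    return term_frequency(term, tokens) * inverse_document_frequency(term, documents)
```

Here each entry in `documents` stands in for the single trained example of one document type, as in the walkthrough in this section.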


Take the three documents below. Imagine each one is an example we want to train of one of three document types: A, B, or C

Document Type A Document Type B Document Type C
DocA.png DocC.png DocB.png

The word “this” appears once on each document. Document Type A has 5 total terms. Document Type B has 7 terms. Document Type C has 5 terms.

For Document Type A, the Term Frequency of the word "this" is "0.2". There are five words, out of which the word "this" appears once. One divided by five is "0.2". Same for Document Type C. For Document Type B, the math works out to "0.1429".

TF(“this”, Document Type A) = 1/5 = 0.2
TF(“this”, Document Type B) = 1/7 = 0.1429
TF(“this”, Document Type C) = 1/5 = 0.2

Inverse document frequency is calculated against the whole document set. So, this value will be the same for all three document types. The IDF value for the word "this" works out to "0" because all three documents contain the word "this". Three divided by three is 1, and the logarithm of 1 is always 0.

IDF(“this”, Doc Set) = log(3/3) = 0

Last, to find the TF-IDF weighting value for each document type, you just multiply those numbers together.

TF-IDF(“this”, Document A) = 0.2 x 0 = 0
TF-IDF(“this”, Document B) = 0.1429 x 0 = 0
TF-IDF(“this”, Document C) = 0.2 x 0 = 0

Again, the term “this” is on all docs. For basic TF-IDF, its IDF value scales its weighting all the way to zero. The term is not specific to one document or the other. So, regardless of the term's frequency value, the TF-IDF weighting score equates to zero. It’s not unique enough to be a good identifier of one document type or another.

The term “example” is found a lot on Document Type B but not on Document Type A or on Document Type C. It should have a much higher weighting value for Document Type B than the other two.

TF(“example”, Document A) = 0/5 = 0
TF(“example”, Document B) = 3/7 = 0.4286
TF(“example”, Document C) = 0/5 = 0


IDF(“example”, Doc Set) = log(3/1) = 0.4771


TF-IDF(“example”, Document A) = 0 x 0.4771 = 0
TF-IDF(“example”, Document B) = 0.4286 x 0.4771 = 0.2045
TF-IDF(“example”, Document C) = 0 x 0.4771 = 0

Sure enough, 0.2045 is greater than 0.

The term “a” should be more interesting. It’s a weird one. It appears in two of the three documents, and it appears on Document Type A more than Document Type C. So, it’s not totally unique but might be more important to one document vs another. But it is still not a standout like the term “example” for Document Type B.

TF(“a”, Document A) = 2/5 = 0.4
TF(“a”, Document B) = 0/7 = 0
TF(“a”, Document C) = 1/5 = 0.2


IDF(“a”, Doc Set) = log(3/2) = 0.1761


TF-IDF(“a”, Document A) = 0.4 x 0.1761 = 0.0704
TF-IDF(“a”, Document B) = 0 x 0.1761 = 0
TF-IDF(“a”, Document C) = 0.2 x 0.1761 = 0.0352

The weightings bear out our ideas for this term. It shouldn’t be considered a feature of Document Type B at all, and its weighting is zero. It has a higher weighting value for Document Type A than Document Type C (0.0704 > 0.0352) because the term appears more frequently on Document Type A. But still, it isn’t as clear an identifier as the word “example” for Document Type B. So the weighting value is smaller (0.2045 > 0.0704).
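
The three worked examples can be reproduced with a short script. The text of the three pictured documents is not reproduced in this article, so the token lists below are hypothetical stand-ins chosen only so that the term counts match the figures quoted above (5, 7 and 5 total terms; “this” once on each; “example” three times on B; “a” twice on A and once on C). The base-10 logarithm is an assumption that matches the article's numbers.

```python
import math

# Hypothetical stand-ins for the three pictured documents. Only the term
# counts are taken from the article; the wording itself is invented.
documents = {
    "A": "this is a a sample".split(),
    "B": "this is another another example example example".split(),
    "C": "this is a short note".split(),
}

def tf(term, tokens):
    # Share of the document's terms that are this term.
    return tokens.count(term) / len(tokens)

def idf(term, docs):
    # Base-10 log of (total documents / documents containing the term).
    containing = sum(1 for tokens in docs.values() if term in tokens)
    return math.log10(len(docs) / containing)

# TF-IDF weighting of each term for each document type.
weights = {
    term: {name: tf(term, tokens) * idf(term, documents)
           for name, tokens in documents.items()}
    for term in ("this", "example", "a")
}

for term, by_doc in weights.items():
    print(term, {name: round(w, 4) for name, w in by_doc.items()})
```

Rounded to four places, the script gives 0 everywhere for “this”, 0.2045 for “example” on Document Type B, and 0.0704 and 0.0352 for “a” on Document Types A and C.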

The Good

TF-IDF is a "training-based" approach to document classification. In Grooper, we contrast this "training-based" approach to "rules-based". Rules-based approaches to document classification work by extracting an identifying value for a Document Type, such as a document's title. One common way you as a human know what kind of document you're looking at is just by reading the document's title.

However, we may not be able to use titles to classify every kind of document. There may be too many variations of a title to make that a realistic option. A document might not have a title at all. We may not be able to pattern match a title due to bad OCR. Or a document’s title might change from year to year, or unexpectedly in the middle of production. With a rules-based approach, these are all problems you would need to solve as they came up, one by one. A training-based approach, taking advantage of TF-IDF weighting values, is often better suited to handle these issues.

The beauty of TF-IDF is that it can help account for variability in documents. Rather than creating an extractor that targets a single feature of a document, such as its title, text features of the entire document will be trained and assigned a TF-IDF weighting value. Even if an unclassified document doesn't have a title or has a variation you haven't encountered yet, as long as it shares enough features in common with the trained examples, it will still get classified.
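
To picture how trained weightings can classify a document that has no recognizable title, here is a deliberately simplified sketch. This article does not describe how Grooper actually compares a document against its trained features, so the scoring rule here (add up the trained weights of every term found on the unknown document) is an assumption made for illustration. The `train` and `classify` functions and the sample token lists are hypothetical.

```python
import math
from collections import Counter

def train(examples):
    """examples maps a type name to one token list. Returns {type: {term: TF-IDF weight}}."""
    n_types = len(examples)
    # Number of document types on which each term appears at least once.
    doc_freq = Counter(term for tokens in examples.values() for term in set(tokens))
    model = {}
    for name, tokens in examples.items():
        counts = Counter(tokens)
        model[name] = {
            term: (count / len(tokens)) * math.log10(n_types / doc_freq[term])
            for term, count in counts.items()
        }
    return model

def classify(tokens, model):
    """Assumed scoring rule: sum the trained weights of the terms present on the document."""
    present = set(tokens)
    scores = {
        name: sum(w for term, w in weights.items() if term in present)
        for name, weights in model.items()
    }
    best = max(scores, key=scores.get)
    # No shared distinguishing features means the document stays unclassified.
    return (best if scores[best] > 0 else None), scores

model = train({
    "Invoice": "invoice number invoice date amount due".split(),
    "Purchase Order": "purchase order number ship date quantity".split(),
})

# An unknown document with no title line and somewhat different wording.
label, scores = classify("remit amount due for invoice".split(), model)
print(label, scores)
```

In this toy model the shared terms "number" and "date" carry zero weight, so the unknown document is matched on "invoice", "amount" and "due" alone. A document sharing no weighted terms with any trained example comes back as `None`, which mirrors the caution in the next section.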

The Not-So Good: Words of Caution

That being said, a computer is never as good at understanding a document as a human being. Consider four main points when training a TF-IDF model.

  1. Your training weightings are only as good as the examples you provide them. If you don’t understand your documents, neither will the training model.
Trained examples need to be representative of the documents you’re dealing with. If you give the algorithm bad features, you’re going to get bad results. This is why TF-IDF is called a user-assisted machine learning algorithm. Grooper is not going to magically know what an invoice looks like until you provide it with examples of an invoice. If you give it examples of a purchase order and call it an invoice, Grooper is going to think it’s an invoice.
  2. If an unclassified document’s features are wildly different from any trained examples seen so far, TF-IDF will likely fail to classify it.
The greater the difference between your documents and the trained examples (difference in terms or features, or difference in image quality resulting in different OCR results), the less likely you are to get 100% accurate classification.
Where a human being might be able to see similarities in the features, the TF-IDF values only have the context of the specific trained features. If they aren’t there on the document, there’s going to be nothing to compare between the document and the trained TF-IDF features.
The reverse can be true as well. If you train an example that is wildly different (using wildly different terms) from most documents in your set, you can get bad classification results. Sometimes, it’s best to let outliers be outliers and use a different classification strategy, such as using a rules-based method for the outlier, creating a new Document Type for the outlier, or having a human review the document.
  3. More training isn't always the right solution.
It is possible to “overtrain” a TF-IDF model. Once you start tossing in more and more features for more and more document types, the line between useful distinguishable features can start to blur. At that point you can end up with worse classification. With more and more trained examples, you may get to a point where every document starts to look the same as far as the TF-IDF weightings go. The weighting values can become "flat", where every term has about the same weighting value or there is otherwise not enough difference in the weightings to make a strong decision.
  4. Don’t discount the power of human review.
As powerful as TF-IDF can be, it does have its limitations. While it is designed to match documents you have not encountered before against the features of trained document examples, there is still potential for documents coming through unclassified or misclassified. This can be the case particularly with complex document sets with lots of variation among documents in a Document Type. That said, a good TF-IDF weighting model can still do the “heavy lift” of classification decision making for even the most complicated document sets.
When Grooper can’t make the right decision, a human reviewer is a perfectly reasonable option. A computer simply cannot beat the human mind when it comes to seeing, understanding, and evaluating patterns, let alone intuitive leaps in judgement. That’s basically all classification is: looking for common patterns in the features of a document and using those patterns to put documents in one group or another. If you’re looking for 100% accuracy (or as close to it as possible), a human review of Grooper’s TF-IDF decision making is an exceptional safety net for making the right choice.

TF-IDF Variants

There are a few variants on the standard TF-IDF algorithm that Grooper utilizes.

Class Frequency

Class Frequency alters the standard TF-IDF weightings by adding another variable to the equation. It modifies the weighting mechanism from TF-IDF to TF-IDF-CF. Class Frequency (CF) refers to the proportion of documents within a class (a Document Type, from Grooper's perspective) on which a term (or feature) appears.

CF = (# of documents within a Document Type on which the term appears) / (total # of documents within the Document Type)

If you're training several examples of a document, and the same term appears on every trained example, there's more than a good chance that feature is more important than a feature that only appears on one trained example.

For example, Assignments of Oil and Gas Lease contracts transfer rights to an oil and gas lease from an "assignor" granting the rights to an "assignee" receiving the rights. Even for a human being skimming through a collection of contracts pertaining to oil and gas leases, there's a good chance if you see the words "assignor" or "assignee" you will think "Hey, this is an Assignment". Those terms are so common to those types of contracts that they should appropriately carry a large weighting value.

Less common, you might see the terms "grantor" or "grantee". Imagine you trained five documents for an "Assignments" Document Type. Four of them use the term "assignor", and one of them uses the term "grantor".

CF("assignor", Assignment) = 4/5 = 0.8
CF("grantor", Assignment) = 1/5 = 0.2

The TF-IDF weightings for the term "assignor" would be multiplied by 0.8 and by 0.2 for the term "grantor". This is a statistical way of dampening the value of less common features of a collection of trained documents, while also improving the score of features common to trained examples. For example, you will see the terms "grantor" and "grantee" more commonly in Property Deeds. Class Frequency could help the training model assign higher value to the terms "assignor" and "assignee" when encountered in Assignments and "grantor" and "grantee" when encountered in Deeds.
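
The Class Frequency calculation for the "assignor" and "grantor" example can be sketched as follows. The five token lists are invented stand-ins for the five trained Assignment examples, and the starting TF-IDF weight of 0.1 is an arbitrary placeholder chosen only to show the multiplication. This illustrates the formula, not Grooper's internal code.

```python
def class_frequency(term, examples):
    """CF = (# of trained examples containing the term) / (total # of trained examples)."""
    return sum(1 for tokens in examples if term in tokens) / len(examples)

# Hypothetical trained examples for an "Assignments" Document Type:
# four use "assignor", one uses "grantor".
assignment_examples = [
    ["assignor", "assigns", "lease"],
    ["assignor", "transfers", "lease"],
    ["assignor", "conveys", "lease"],
    ["assignor", "assigns", "interest"],
    ["grantor", "assigns", "lease"],
]

cf_assignor = class_frequency("assignor", assignment_examples)
cf_grantor = class_frequency("grantor", assignment_examples)

# Placeholder TF-IDF weight, scaled by CF to give a TF-IDF-CF weight.
placeholder_tf_idf = 0.1
print("assignor", cf_assignor, placeholder_tf_idf * cf_assignor)
print("grantor", cf_grantor, placeholder_tf_idf * cf_grantor)
```

With equal starting weights, the term found on four of five examples ends up with four times the weight of the term found on only one.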

Class Frequency can be a particularly effective way to mitigate outlier terms that only appear on a single trained example. In many cases, this provides a vast improvement to traditional TF-IDF document classification.

This variant is enabled by default when using the Lexical Classification Method. You can disable this variant by selecting the Content Model and setting the Use Class Frequency property to False.

Tf-idf-class-frequency.png

Logarithmic Term Frequency

Normal (or Raw) Term Frequency assigns a higher weight according to how many times the term appears on the document. If Document A contains twenty occurrences of the term, and Document B contains only one, the term will carry 20 times the weight in Document A as it does in Document B. Sometimes, this may be an important feature for document classification. However, sometimes it can be a hindrance.

Often it is more important that you encountered the term on a document and not necessarily how many times it appears on an individual document. For example, a Memorandum of Oil and Gas Lease is a short form version of an Oil and Gas Lease. It serves as an official notice recorded with a county clerk of the Oil and Gas Lease. Memorandums of Lease will contain language similar to the Lease itself, including the word "lease". The term may appear frequently on the Memorandum as well as the Lease. However, the term "memorandum" may only appear once or twice. Even though it appears less often than a term like "lease", it will be a very good indicator that you're looking at a Memorandum and not a Lease. However, with Normal Term Frequency, this term (being a less frequent term) may get a fairly low TF score, causing a Memorandum to share a high degree of similarity with a Lease.

Logarithmic Term Frequency instead scales the Term Frequency weightings logarithmically. This normalizes Term Frequency somewhat. Frequent words will still score a higher TF weighting, but the gap between the most frequent terms and less frequent ones will be narrowed. This can provide more accurate classification results when term frequency is not as important (when it's not as important how many times a term is encountered, just that it was).
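
The narrowing effect can be seen with a small comparison. This article does not give the exact formula Grooper uses, so the sketch below assumes one widely used form of sub-linear scaling, 1 + log10(count), and works on raw occurrence counts for simplicity. The function names and the counts are illustrative only.

```python
import math

def raw_tf(count):
    """Normal (raw) term frequency: the weight grows in step with the count."""
    return float(count)

def log_tf(count):
    """Assumed sub-linear scaling: 1 + log10(count), or 0 when the term is absent."""
    return 1.0 + math.log10(count) if count > 0 else 0.0

# Hypothetical counts on a Memorandum: "lease" twenty times, "memorandum" once.
lease_count, memorandum_count = 20, 1

raw_ratio = raw_tf(lease_count) / raw_tf(memorandum_count)
log_ratio = log_tf(lease_count) / log_tf(memorandum_count)
print(raw_ratio, round(log_ratio, 2))
```

Under this assumed formula the frequent term still scores higher, but the gap shrinks from 20 times to roughly 2.3 times.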

When using the Lexical Classification Method, you can enable this variant by using the Term Frequency Mode property. Select the Content Model and change that property from Normal to Logarithmic.

Note: This is referred to as Sub-Linear TF Scaling in Grooper Version 2.80 and older. The property can be either True to use the logarithmic term frequency variant, or False to use raw term frequency.

Tf-idf-logarithmic-term-frequency.png

Augmented Term Frequency

Normal Term Frequency can improperly bias longer documents as opposed to shorter ones. Longer documents simply have more terms than shorter ones. This can cause problems when you have document sets of varying lengths.

Augmented Term Frequency divides raw term frequency by the raw term frequency of the most frequently occurring term in the document. Similar to Logarithmic Term Frequency, this helps normalize the terms within the document. Furthermore, Grooper allows you to provide a "dampening factor" to control how extreme this normalization is.
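
A minimal sketch of this idea follows. How Grooper applies its dampening factor is not spelled out in this article, so the code assumes the common textbook form K + (1 - K) x (count / count of the most frequent term), with K as the dampening factor. The function name, the default of 0.5, and the sample tokens are assumptions for illustration.

```python
from collections import Counter

def augmented_tf(term, tokens, damping=0.5):
    """Assumed form: damping + (1 - damping) * count / (count of the most frequent term)."""
    counts = Counter(tokens)
    if counts[term] == 0:
        # Assumption for this sketch: a term absent from the document gets no weight.
        return 0.0
    max_count = max(counts.values())
    return damping + (1 - damping) * counts[term] / max_count

# Hypothetical document: "lease" appears four times, "memorandum" once.
tokens = ["lease"] * 4 + ["memorandum"]

print(augmented_tf("lease", tokens))                    # most frequent term
print(augmented_tf("memorandum", tokens))               # damped toward the top score
print(augmented_tf("memorandum", tokens, damping=0.0))  # plain count / max count
```

Because every score is measured against the document's own most frequent term, a long document and a short one end up on the same scale. In this assumed form, a larger dampening factor pulls the less frequent terms closer to the top score.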

If you have a mix of longer documents and shorter ones, and the longer documents get better classification results than the shorter ones, try using Augmented Term Frequency to improve your classification results.

When using the Lexical Classification Method, you can enable this variant by using the Term Frequency Mode property. Select the Content Model and change that property from Normal to Augmented.

Note: This variant was not available prior to Grooper Version 2.90.

Tf-idf-augmented-term-frequency.png

Smooth Inverse Document Frequency

As seen in the example in the math section of this article, if a term is encountered on every trained Document Type its IDF value will equate to zero, effectively nullifying the term as far as weighting goes. Whether it's found on 2 out of 2 Document Types or 200 out of 200 Document Types, log(1) always equates to zero.

However, sometimes, you will encounter terms that can be useful for classifying a document, even if they appear on multiple document types. It may be the case that the term appears very frequently on one Document Type, but only once or twice on every other Document Type. Or, it may be a frequent term for 5 out of 10 Document Types and its IDF value is mitigating the term frequency too much.

This is what Smooth Inverse Document Frequency attempts to address. It changes the traditional IDF equation by "smoothing" the logarithmic curve so the IDF value never reaches zero. It will still assign a lower IDF score to terms appearing across multiple (or all) Document Types. However, the impact will not be as drastic if it appears on all or most Document Types.
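
The effect can be compared numerically. The exact smoothing formula Grooper uses is not given in this article, so the sketch below assumes one common variant, log10(1 + total / count), purely to show how a smoothed curve stays above zero. The function names are hypothetical.

```python
import math

def normal_idf(total_types, types_with_term):
    """Standard IDF: log10(total # of document types / # of types with the term)."""
    return math.log10(total_types / types_with_term)

def smooth_idf(total_types, types_with_term):
    """Assumed smoothing: add 1 inside the logarithm so the result stays above zero."""
    return math.log10(1 + total_types / types_with_term)

total = 10
# (types containing the term, normal IDF, smooth IDF)
comparison = [(n, normal_idf(total, n), smooth_idf(total, n)) for n in (1, 5, 10)]

for n, normal, smooth in comparison:
    print(n, round(normal, 4), round(smooth, 4))
```

For a term found on all 10 of 10 Document Types, the standard IDF is 0 while the assumed smooth version is about 0.301, so the term keeps some weight. Terms found on fewer Document Types still score higher in both versions.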

The difference in the two curves looks something like the image below.

Smooth idf.png


When using the Lexical Classification Method, you can enable this variant using the Document Frequency Mode property. Select the Content Model and change that property from Normal to Smooth.

Note: In Grooper Version 2.80 and older, this variant was enabled by the Smooth IDF property. This property can either be True to use smooth inverse document frequency or False to use normal IDF.

Tf-idf-smooth-inverse-document-frequency.png

Grooper Help Documentation