TF-IDF (Concept)

TF-IDF (Term Frequency-Inverse Document Frequency) is how Grooper uses machine learning for document classification and data extraction. TF-IDF is designed to reflect how important a word is to a document or a set of documents. The importance value (or weighting) of a word increases proportionally to the number of times it appears in the document (Term Frequency). The weighting is offset by the number of documents in the set containing the word (Inverse Document Frequency). Words that appear generally across many documents are weaker identifiers, so their weighting is lessened.

About

TF-IDF stands for “term frequency-inverse document frequency”. It is a numerical statistic intended to reflect how important a word is to a document within a collection (also called a document set or “corpus”). This “importance” is assigned a numerical weighting value. The higher the word’s weighting value, the more relevant it is to that document in the corpus. The TF-IDF algorithm is one of the most common ways to identify particularly important or relevant words and phrases for a document. It’s so popular you’ve probably used it already without even realizing it. If you’ve used a search engine, it’s likely some variant of TF-IDF was used to rank your search results.

It is popular because it is both highly effective and relatively simple to understand. Take the two components “Term Frequency” and “Inverse Document Frequency”.

Term Frequency

Some words on a document are going to be more common. Some are going to be less common. If you see the same word over and over again on a document, there's a good chance that word is important to that document in one way or another.

Imagine looking at an Invoice. You expect to see a lot of terms related to Invoice documents: the word “invoice” or “invoice number” or “invoice date”. There’s a good chance that if you see a document with the word “invoice” over and over again, you’re looking at an Invoice. The term “invoice” should carry more weight for Invoices. It is a feature of what makes an Invoice an Invoice. You wouldn't necessarily expect to see the term "invoice" on a Revenue Statement. Those are just two very different types of documents. Revenue Statements will have their own frequently appearing terms, like "income", "profit", or "expense".

For TF-IDF, the fact that you see a term more frequently is an important feature of that document. The more times that term (also known as a feature) repeats on a document, the higher that term's (or feature's) weighting value.
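In code, counting term frequency is straightforward. Here is a minimal plain-Python sketch (for illustration only, not Grooper's implementation):

  from collections import Counter

  def term_frequency(document_terms):
      # Each term's frequency: (# occurrences) / (total # of terms in the document)
      total = len(document_terms)
      return {term: count / total for term, count in Counter(document_terms).items()}

  print(term_frequency(["invoice", "number", "invoice", "date", "invoice"]))
  # {'invoice': 0.6, 'number': 0.2, 'date': 0.2}

The repeated term "invoice" gets the highest frequency, exactly the intuition described above.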

Inverse Document Frequency

But what about Purchase Orders? You’ll see terms like “invoice number” on those kinds of documents too. How does the algorithm distinguish between terms that are common to multiple document types in the set? That is where the second part of the equation, “Inverse Document Frequency”, comes into play.

Inverse Document Frequency is a statistical interpretation of term specificity. Take an extreme example like the word “the”. That word is so common that if we used term frequency alone, we’d end up giving the word “the” much more importance than it should have. In fact, the word “the” should have little or no weight. It’s not specific to one document or another and is therefore not helpful when it comes to identifying one over the other. Inverse Document Frequency mitigates features that are common to multiple documents in the set.

If you can find terms that appear frequently on one type of document that are also specific to that document type in the set, there’s a good chance if you see those terms on a different or unknown document, you’re looking at the same kind of document.

This is essentially the logic behind using TF-IDF to classify documents. The algorithm will score frequent terms (or features) in a document with a higher weighting value, but it will then decrease a term’s weighting value if that term appears across multiple documents in the set. So yes, Purchase Orders might have “invoice-y” terms on them, but the “purchase order-y” terms specific to Purchase Orders are going to be much more important when it comes time to classify the document.

The Math

When it comes down to the math, the algorithm looks something like this.

A Term’s Weighting Value = TF x IDF

where…

TF = (# times the term appears in a document) / (total # of terms in the document)
IDF = log(total # of document types / # of document types with the term in it)
Alternatively: IDF = -log(# of document types with the term in it / total # of document types)
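Translated into a plain-Python sketch (illustrative only, not Grooper's implementation; math.log10 matches the base-10 logarithm used in the worked example below):

  import math

  def tf(term, document_terms):
      # (# times the term appears in the document) / (total # of terms in the document)
      return document_terms.count(term) / len(document_terms)

  def idf(term, corpus):
      # log(total # of document types / # of document types with the term in it)
      doc_types_with_term = sum(1 for doc in corpus if term in doc)
      # Assumes the term appears in at least one document in the corpus.
      return math.log10(len(corpus) / doc_types_with_term)

  def tf_idf(term, document_terms, corpus):
      return tf(term, document_terms) * idf(term, corpus)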


Imagine you had three documents in a document set, each of a different document type…

Document Type 1    Document Type 2    Document Type 3
---------------    ---------------    ---------------
This               This               This
Is                 Is                 Is
A                  Another            A
A                  Another            Third
Sample             Example            Document
                   Example
                   Example
The number of times the word “This” appears on any document is 1. Document Type 1 has 5 total terms. Document Type 2 has 7 terms. Document Type 3 has 5 terms.

For Document Type 1, the Term Frequency of the word "this" is "0.2". There are five words, out of which the word "this" appears once. One divided by five is "0.2". Same for Document Type 3. For Document Type 2, the math works out to "0.1429".

TF(“this”, Document Type 1) = 1/5 = 0.2
TF(“this”, Document Type 2) = 1/7 = 0.1429
TF(“this”, Document Type 3) = 1/5 = 0.2

Inverse document frequency is calculated against the whole document set, so this value will be the same for all three document types. The IDF value for the word "this" works out to "0" because all three documents contain the word "this". The logarithm of 1 is always 0.

IDF(“this”, Doc Set) = log(3/3) = 0

Last, to find the TF-IDF weighting value for each document type, you just multiply those numbers together.

TF-IDF(“this”, Document Type 1) = 0.2 x 0 = 0
TF-IDF(“this”, Document Type 2) = 0.1429 x 0 = 0
TF-IDF(“this”, Document Type 3) = 0.2 x 0 = 0

Again, the term “this” is on all three documents. For basic TF-IDF, its IDF value scales its weighting all the way to zero. The term is not specific to any one document. So, regardless of the term's frequency value, the TF-IDF weighting score equates to zero. It’s not unique enough to be a good identifier of one document type or another.

The term “example” is found a lot on Document Type 2 but not on Document Type 1 or Document Type 3. It should have a much higher weighting value for Document Type 2 than for the other two.

TF(“example”, Document Type 1) = 0/5 = 0
TF(“example”, Document Type 2) = 3/7 = 0.4286
TF(“example”, Document Type 3) = 0/5 = 0


IDF(“example”, Doc Set) = log(3/1) = 0.4771


TF-IDF(“example”, Document Type 1) = 0 x 0.4771 = 0
TF-IDF(“example”, Document Type 2) = 0.4286 x 0.4771 = 0.2045
TF-IDF(“example”, Document Type 3) = 0 x 0.4771 = 0

Sure enough, 0.2045 is greater than 0.

The term “a” should be more interesting. It’s a weird one. It appears in two of the three documents, and it appears on Document Type 1 more than Document Type 3. So it’s not totally unique, but it might be more important to one document than another. Still, it’s not a standout like the term “example” is for Document Type 2.

TF(“a”, Document Type 1) = 2/5 = 0.4
TF(“a”, Document Type 2) = 0/7 = 0
TF(“a”, Document Type 3) = 1/5 = 0.2


IDF(“a”, Doc Set) = log(3/2) = 0.1761


TF-IDF(“a”, Document Type 1) = 0.4 x 0.1761 = 0.0704
TF-IDF(“a”, Document Type 2) = 0 x 0.1761 = 0
TF-IDF(“a”, Document Type 3) = 0.2 x 0.1761 = 0.0352

The weightings bear out our intuition for this term. It shouldn’t be considered a feature of Document Type 2 at all, and its weighting there is zero. It has a higher weighting value for Document Type 1 than Document Type 3 (0.0704 > 0.0352) because the term appears more frequently on Document Type 1. But it still isn’t as clear an identifier as the word “example” is for Document Type 2, so its weighting value is smaller (0.2045 > 0.0704).
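For reference, the sketch functions from "The Math" section reproduce every weighting in this example:

  doc_type_1 = ["this", "is", "a", "a", "sample"]
  doc_type_2 = ["this", "is", "another", "another", "example", "example", "example"]
  doc_type_3 = ["this", "is", "a", "third", "document"]
  corpus = [doc_type_1, doc_type_2, doc_type_3]

  for term in ("this", "example", "a"):
      for name, doc in (("Document Type 1", doc_type_1),
                        ("Document Type 2", doc_type_2),
                        ("Document Type 3", doc_type_3)):
          print(f'TF-IDF("{term}", {name}) = {tf_idf(term, doc, corpus):.4f}')
  # "this"    -> 0.0000 for all three document types (IDF is 0)
  # "example" -> 0.2045 for Document Type 2, 0.0000 for the others
  # "a"       -> 0.0704, 0.0000, 0.0352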

The Good

TF-IDF is a "training-based" approach to document classification. In Grooper, we contrast this "training-based" approach with "rules-based" approaches. Rules-based approaches to document classification work by extracting an identifying value for a Document Type, such as a document's title. One common way you as a human know what kind of document you're looking at is just by reading the document's title.

However, we may not be able to use titles to classify every kind of document. There may be too many variations of a title to make that a realistic option. Or documents might not use titles at all. Or we may not be able to pattern match a title due to bad OCR. Or a document’s title might change from year to year, or unexpectedly in the middle of production. Using a rules-based approach, these are all problems you would need to solve as they came up, one by one. A training-based approach, taking advantage of TF-IDF weighting values, is often better suited to handle these issues.

The beauty of TF-IDF is that it can help account for variability in documents. Rather than creating an extractor that targets a single feature of a document, such as its title, text features of the entire document will be trained and assigned a TF-IDF weighting value. Even if an unclassified document doesn't have a title or has a variation you haven't encountered yet, as long as it shares enough features in common with the trained examples, it will still get classified.
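To make that concrete, here is one generic sketch of how trained weightings could drive a classification decision: total up the weightings that an unknown document shares with each Document Type and pick the highest score. The weightings and scoring rule below are hypothetical illustrations, not Grooper's actual model (real systems often compare full TF-IDF vectors using cosine similarity):

  def classify(unknown_terms, trained_weightings):
      # Sum the trained weighting of every term the unknown document
      # shares with each Document Type, then pick the highest total.
      scores = {
          doc_type: sum(weights.get(term, 0.0) for term in set(unknown_terms))
          for doc_type, weights in trained_weightings.items()
      }
      return max(scores, key=scores.get), scores

  # Hypothetical weightings, as if produced by training:
  trained = {
      "Invoice":        {"invoice": 0.21, "remit": 0.12, "due": 0.08},
      "Purchase Order": {"purchase": 0.19, "order": 0.15, "ship": 0.07},
  }
  best, scores = classify(["please", "remit", "invoice", "due"], trained)
  print(best)  # Invoice (score of roughly 0.41 vs 0.0)

Even though this unknown document has no title at all, the "invoice-y" terms it shares with the trained weightings are enough to pick the right Document Type.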

The Not-So Good: Words of Caution

That being said, a computer is never as good at understanding a document as a human being. Consider four main points when training a TF-IDF model.

  1. Your training weightings are only as good as the examples you provide. If you don’t understand your documents, neither will the training model.
Trained examples need to be representative of the documents you’re dealing with. If you give the algorithm bad features, you’re going to get bad results. This is why TF-IDF is called a user-assisted machine learning algorithm. It’s not as if Grooper is going to magically know what an invoice looks like until you provide it with examples of an invoice. If you give it examples of a purchase order and call it an invoice, Grooper is going to think it’s an invoice.
  2. If an unclassified document’s features are wildly different from any trained examples seen so far, TF-IDF will likely fail to classify it.
The greater the difference between your documents and the trained examples (different terms or features, or differences in image quality that change the OCR results), the less likely you are to get 100% accurate classification.
Where a human being might be able to see similarities in the features, the TF-IDF values only have the context of the specific trained features. If those features aren’t on the document, there is nothing to compare between the document and the trained TF-IDF features.
The reverse can be true as well: if you train an example that is wildly different (using wildly different terms) from most documents in your set, you can get bad classification results. Sometimes it’s best to let outliers be outliers and use a different classification strategy, such as a rules-based method for the outlier, a new Document Type for the outlier, or human review of the document.
  3. More training isn't always the right solution.
The right solution isn’t always “feed the model more training examples”. There is a concept of “overtraining” a TF-IDF model. Once you start tossing in more and more features for more and more Document Types, the line between useful, distinguishing features can start to blur. At that point you can end up with worse classification. With more and more trained examples, you may reach a point where every document starts to look the same as far as the TF-IDF weightings go. The weighting values can become "flat", where every term has about the same weighting value, or there is otherwise not enough difference in the weightings to make a strong decision.
  4. Don’t discount the power of human review.
As powerful as TF-IDF can be, it has its limitations. While it is designed to match the features of documents you have not encountered before against the features of trained document examples, there is still potential for documents to come through unclassified or misclassified. This is particularly the case with complex document sets with lots of variation among documents in a Document Type. That said, a good TF-IDF weighting model can still do the “heavy lift” of classification decision making for even the most complicated document sets.
When Grooper can’t make the right decision, a human reviewer is a perfectly reasonable option. A computer simply cannot beat the human mind when it comes to seeing, understanding, and evaluating patterns, let alone intuitive leaps in judgement. That’s basically all classification is: looking for common patterns in the features of a document and using those patterns to put documents in one group or another. If you’re looking for 100% accuracy (or as close to it as possible), a human review of Grooper’s TF-IDF decision making is an exceptional safety net for making the right choice.

TF-IDF Variants