TF-IDF (Concept)
TF-IDF (Term Frequency-Inverse Document Frequency) is how Grooper uses machine learning for document classification and data extraction. TF-IDF is designed to reflect how important a word is to a document or a set of documents. The importance value (or weighting) of a word increases proportionally to the number of times it appears in the document (Term Frequency). The weighting is offset by the number of documents in the set containing the word (Inverse Document Frequency). Words that appear broadly across documents are weaker identifiers, so their weighting is reduced.
About
TF-IDF stands for “term frequency-inverse document frequency”. It is a numerical statistic intended to reflect how important a word is to a document within a collection (or document set or “corpus”). This “importance” is assigned a numerical weighting value. The higher the word’s weighting value, the more relevant it is to that document in the corpus. The TF-IDF algorithm is one of the most common ways to identify particularly important or relevant words and phrases for a document. It’s so popular you’ve probably used it already without even realizing it. If you’ve used a search engine, it’s likely some variant of TF-IDF was used to rank your search results.
It is popular because it is both highly effective and relatively simple to understand. Take the two components “Term Frequency” and “Inverse Document Frequency”.
Term Frequency
Some words on a document are going to be more common. Some are going to be less common. If you see the same word over and over again on a document, there's a good chance that word is important to that document in one way or another.
Imagine looking at an invoice. You expect to see a lot of terms related to invoice-type documents: the word “invoice” or “invoice number” or “invoice date”. There’s a good chance if you see a document with the word “invoice” over and over again, you’re looking at an invoice. The term “invoice” should carry more weight for invoices. You wouldn't necessarily expect to see the term "invoice" on a revenue statement. Those are just two very different types of documents. Revenue statements will have their own terms that appear frequently, like "profit" or "expense".
For TF-IDF, the fact that you see a term more frequently is an important feature of that document. The more times that term (also known as a feature) repeats on a document, the higher that term's (or feature's) weighting value.
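As a minimal sketch of that idea, the snippet below counts raw term frequencies in a small piece of invoice-like text. The text and the term choices are invented purely for illustration:

```python
from collections import Counter

# A made-up snippet of invoice-like text (illustrative only).
text = """invoice number 1001 invoice date 2024-01-15
remit payment for this invoice to the address below"""

words = text.lower().split()
counts = Counter(words)

# Raw term frequency: occurrences of a term divided by total terms.
for term in ("invoice", "payment"):
    print(term, counts[term] / len(words))  # invoice 0.2, payment ~0.067
```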
Inverse Document Frequency
But what about Purchase Orders? You’ll see terms like “invoice number” on those kinds of documents too. How does the algorithm distinguish between terms that are common to multiple document types in the set? That is where the second part of the equation, “Inverse Document Frequency”, comes into play.
Inverse Document Frequency is a statistical interpretation of term-specificity. Take an extreme example like the word “the”. That word is so common that if we just used term frequency, we’d end up giving the word “the” much more importance than it should have. In fact, the word “the” should have little or no weight. It’s not specific to one document or another and is therefore not helpful when it comes to identifying one over the other. Inverse Document Frequency mitigates features that are common to multiple documents in the set.
If you can find terms that appear frequently on one type of document that are also specific to that document type in the set, there’s a good chance if you see those terms on a different or unknown document, you’re looking at the same kind of document.
This is essentially the logic behind using TF-IDF to classify documents. The algorithm will score frequent terms for a document with a higher weighting value, but it will then decrease the term’s weighting value if that term appears frequently across multiple documents in the set. So yes, Purchase Orders might have “invoice-y” terms on them, but the “purchase order-y” terms specific to Purchase Orders are going to be much more important when it comes time to classify the document.
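To make that classification logic concrete, here is a minimal sketch: build a TF-IDF weighting vector for each Document Type from a trained example, then assign an unknown document to the type with the most similar vector. This is a generic illustration, not Grooper's internal implementation; the toy documents, the cosine-similarity comparison, and the function names are all assumptions made for the example.

```python
import math
from collections import Counter

# Toy trained examples, one per Document Type (terms are made up).
trained = {
    "Invoice": "invoice number invoice date amount due remit".split(),
    "Purchase Order": "purchase order number ship to vendor terms".split(),
}
corpus = list(trained.values())

def tfidf_vector(doc, corpus):
    """Map each term in doc to its TF-IDF weight within the corpus."""
    counts = Counter(doc)
    vec = {}
    for term, count in counts.items():
        tf = count / len(doc)                     # term frequency
        df = sum(1 for d in corpus if term in d)  # document frequency
        vec[term] = tf * math.log10(len(corpus) / df) if df else 0.0
    return vec

def cosine(a, b):
    """Cosine similarity between two sparse term -> weight vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Build a weighting model for each Document Type from its trained example.
models = {name: tfidf_vector(doc, corpus) for name, doc in trained.items()}

# An unknown document: "number" is shared by both types, so it carries no
# weight; "invoice" is specific to one type, so it decides the outcome.
unknown = tfidf_vector("invoice number invoice amount due".split(), corpus)
best = max(models, key=lambda name: cosine(unknown, models[name]))
print(best)  # -> Invoice
```

Note how the shared term “number” ends up with zero weight in every vector, so only the type-specific terms influence the decision.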
The Math
When it comes down to the math, the algorithm looks something like this.
A Specific Term’s Weighting Value = TF x IDF
where…
TF = (# times a specific term appears in a document) / (total # of terms in the document)
IDF = log(total # of document types / # of document types with the specific term in it)
- Alternatively: IDF = -log(# of document types with the specific term in it / total # of document types)
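As a quick sketch, those two formulas translate almost directly into code. The worked example below uses base-10 logarithms, so this sketch does too; representing each document as a list of lowercased terms is an assumption made for illustration.

```python
import math

def tf(term, document):
    """(# times the term appears in the document) / (total # of terms)."""
    return document.count(term) / len(document)

def idf(term, corpus):
    """log(total # of documents / # of documents containing the term)."""
    containing = sum(1 for document in corpus if term in document)
    return math.log10(len(corpus) / containing)  # assumes containing > 0

def tfidf(term, document, corpus):
    return tf(term, document) * idf(term, corpus)
```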
Imagine you had three documents in a document set, each of a different document type…
| Document Type 1 | Document Type 2 | Document Type 3 |
| ------------ | ------------ | ------------ |
| This | This | This |
| Is | Is | Is |
| A | Another | A |
| A | Another | Third |
| Sample | Example | Document |
|  | Example |  |
|  | Example |  |
The number of times the word “This” appears in each document is 1. Document Type 1 has 5 total terms. Document Type 2 has 7 terms. Document Type 3 has 5 terms.
For Document Type 1, the Term Frequency of the word "this" is 0.2. There are five words, out of which the word "this" appears once, and one divided by five is 0.2. The same is true for Document Type 3. For Document Type 2, the math works out to 1/7, or roughly 0.1429.
TF(“this”, Document Type 1) = 1/5 = 0.2
TF(“this”, Document Type 2) = 1/7 = 0.1429
TF(“this”, Document Type 3) = 1/5 = 0.2
Inverse document frequency is calculated against the whole document set, so this value will be the same for all three document types. The IDF value for the word "this" works out to 0 because all three documents contain the word "this", and the logarithm of 1 is always 0.
IDF(“this”, Doc Set) = log(3/3) = 0
Last, to find the TF-IDF weighting value for each document type, you just multiply those numbers together.
TF-IDF(“this”, Document 1) = 0.2 x 0 = 0
TF-IDF(“this”, Document 2) = 0.1429 x 0 = 0
TF-IDF(“this”, Document 3) = 0.2 x 0 = 0
Again, the term “this” is on all docs. For basic TF-IDF, its IDF value scales its weighting all the way to zero because it’s not a good identifier. The term is not specific to one document or the other. So, regardless of the term's frequency value, the TF-IDF weighting score equates to zero.
The term “example” is found a lot on Document 2 but not on Document 1 or on Document 3. It should have a much higher weighting value for Document 2 than the other two.
TF(“example”, Document 1) = 0/5 = 0
TF(“example”, Document 2) = 3/7 = 0.4286
TF(“example”, Document 3) = 0/5 = 0
IDF(“example”, Doc Set) = log(3/1) = 0.4771
TF-IDF(“example”, Document 1) = 0 x 0.4771 = 0
TF-IDF(“example”, Document 2) = 0.4286 x 0.4771 = 0.2045
TF-IDF(“example”, Document 3) = 0 x 0.4771 = 0
Sure enough, 0.2045 is greater than 0.
The term “A” should be more interesting. It’s a weird one. It appears in two of the three documents, and it appears on Document 1 more than Document 3. So, it’s not totally unique but might be more important to one document vs. another. Still, it's not a standout like the term “example” is for Document 2.
TF(“a”, Document 1) = 2/5 = 0.4
TF(“a”, Document 2) = 0/7 = 0
TF(“a”, Document 3) = 1/5 = 0.2
IDF(“a”, Doc Set) = log(3/2) = 0.1761
TF-IDF(“a”, Document 1) = 0.4 x 0.1761 = 0.0704
TF-IDF(“a”, Document 2) = 0 x 0.1761 = 0
TF-IDF(“a”, Document 3) = 0.2 x 0.1761 = 0.0352
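For convenience, here is a small script that reproduces all of the worked numbers above (again assuming base-10 logarithms and documents represented as lists of lowercased terms):

```python
import math

# The three sample documents from the table above, lowercased.
doc1 = ["this", "is", "a", "a", "sample"]
doc2 = ["this", "is", "another", "another", "example", "example", "example"]
doc3 = ["this", "is", "a", "third", "document"]
corpus = [doc1, doc2, doc3]

def tfidf(term, document, corpus):
    tf = document.count(term) / len(document)
    containing = sum(1 for doc in corpus if term in doc)
    idf = math.log10(len(corpus) / containing)
    return tf * idf

for term in ("this", "example", "a"):
    print(term, [round(tfidf(term, doc, corpus), 4) for doc in corpus])
# this    [0.0, 0.0, 0.0]
# example [0.0, 0.2045, 0.0]
# a       [0.0704, 0.0, 0.0352]
```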
The weightings bear out our ideas for this term. It shouldn’t be considered a feature of Document 2 at all, and its weighting there is zero. It has a higher weighting value for Document 1 than Document 3 (0.0704 > 0.0352) because the term appears more frequently on Document 1. But it still isn’t as clear an identifier as the word “example” is for Document 2, so its weighting value is smaller (0.2045 > 0.0704).

In all honesty, this is similar to what we were doing intuitively when we were matching document titles to create rules for classification. We found a single text feature of the Document Type, the title of that document, that was common to all documents of that type. The Term Frequency here would be 1, since we only used a single feature, the title, for each Document Type (1 term divided by 1 total term = 1). And the title is specific to only documents of that Document Type. So, the feature has a low Document Frequency (the term appears in 1 of 4 total Doc Types: 1/4 = 0.25) and therefore a high Inverse Document Frequency (-log(0.25) = 0.60206). If we were to multiply those two values together, we’d get a relatively high TF-IDF weighting score for the feature of the Document Type’s title (1 x 0.60206 = 0.60206).

• Note: A “high” weighting value will always be relative to the number of terms/documents/document types you train. There is no magic number where everything above 0.5 (for example) is a “high” weighting and everything below is a “low” weighting. It all depends on how a term’s score ranks against the document being analyzed and that term’s score across the various document types.

The Good

The beauty of TF-IDF is that it can help account for variability in documents. We may not be able to title match every kind of document. There may just be too many to make that a realistic option. Or a document might not use titles at all. Or we may not be able to pattern match a title due to bad OCR. Or a document’s title might change from year to year, or unexpectedly in the middle of production. Using a Rule-Based approach, these are all problems you would need to solve as they came up, one by one. A training-based approach, taking advantage of TF-IDF weighting values, is often better suited to handle these issues.

The Not-So-Good: Words of Caution

That being said, a computer is never as good at understanding a document as a human being. This caution is threefold.

1) Your training weightings are only as good as the examples you provide. If you don’t understand your documents, neither will the training model.
Trained examples need to be representative of the documents you’re dealing with. If you give the algorithm bad features, you’re going to get bad results. This is why TF-IDF is called a user-assisted machine learning algorithm. It’s not as if Grooper is going to magically know what an invoice looks like until you provide it with examples of an invoice. But if you give it examples of a purchase order and call it an invoice, well, Grooper is going to think it’s an invoice.

2) If an unclassified document’s features are wildly different from any trained examples seen so far, TF-IDF will likely fail to classify it.

The more your documents differ from the trained examples (the more variety there is within a single Document Type, or the more the OCR results vary on a scanned image), the less likely you are to get 100% accurate classification.
Where a human being might be able to see similarities in the features, the TF-IDF values only have the context of the specific trained features. If those features aren’t on the document, there’s nothing to compare between the document and the trained TF-IDF features.
The reverse can be true as well: if you train an example that is wildly different (using wildly different terms) from most documents in your set, you can get bad classification results. Sometimes it’s best to let outliers be outliers and use a different classification strategy, such as a Rule-Based method for the outlier, a new Document Type for the outlier, or human review of the document.
Furthermore, the right solution isn’t always “feed the model more training examples”. There is a concept of “overtraining” a TF-IDF model. Once you start adding more and more features for more and more document types, the lines between useful, distinguishing features can start to blur. At that point you can end up with worse classification. With enough trained examples, you may reach a point where every document starts to look the same as far as the TF-IDF weightings go.
3) Don’t discount the power of human review.
As powerful as TF-IDF can be, it does have its limitations. While it is designed to match the features of documents you have not encountered before against the features of trained document examples, there is still potential for documents to come through unclassified or misclassified. This can be the case particularly with complex document sets with lots of variation among documents in a Document Type. That said, a good TF-IDF weighting model can still do the “heavy lifting” of classification decision making for even the most complicated document sets.
When Grooper can’t make the right decision, a human reviewer is a perfectly reasonable option. A computer simply cannot beat the human mind when it comes to seeing, understanding, and evaluating patterns, let alone making intuitive leaps in judgement. That’s basically all classification is: looking for common patterns in the features of a document and using those patterns to put documents in one group or another. If you’re looking for 100% accuracy (or as close to it as possible), a human review of Grooper’s TF-IDF decision making is an exceptional safety net for making the right choice.