TF-IDF (Concept): Difference between revisions
Revision as of 16:37, 27 December 2019
TF-IDF (Term Frequency-Inverse Document Frequency) is how Grooper uses machine learning for document classification and data extraction. TF-IDF is designed to reflect how important a word is to a document within a set of documents. The importance value (or weighting) of a word increases proportionally to the number of times it appears in the document (Term Frequency). The weighting is offset by the number of documents in the set containing the word (Inverse Document Frequency). Words that appear broadly across the set are less distinctive identifiers, so their weighting is reduced.
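The weighting described above can be sketched in a few lines of Python. This is a minimal illustration of the general TF-IDF formula, not Grooper's internal implementation; the tiny corpus and the plain tf × log(N/df) variant are assumptions for the example (real libraries often apply smoothing or normalization).

```python
import math
from collections import Counter

# A tiny stand-in corpus (illustrative only).
docs = [
    "invoice invoice number total due",
    "purchase order number total",
    "invoice past due total",
]

tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: how many documents contain each term.
df = Counter()
for tokens in tokenized:
    df.update(set(tokens))

def tf_idf(term, tokens):
    # Term Frequency: the term's share of this document's words.
    tf = tokens.count(term) / len(tokens)
    # Inverse Document Frequency: terms spread across more
    # documents score lower; a term in every document gets
    # idf = log(1) = 0, i.e. no weight at all.
    idf = math.log(n_docs / df[term])
    return tf * idf

# "total" appears in all three documents, so its weight is zero;
# "invoice" is repeated in the first document and appears in only
# two of three documents, so it weighs more there than "number".
print(tf_idf("total", tokenized[0]))    # 0.0
print(tf_idf("invoice", tokenized[0]))
print(tf_idf("number", tokenized[0]))
```

Note how the two factors pull in opposite directions: repetition inside a document raises a term's score, while ubiquity across the set drives it toward zero, which is exactly the "offset" the paragraph describes.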