Content Type Filter
Filtering information allows faster iteration and easier testing.
About
Content Type Filter is a property on the Classify, Extract, and Data Review activities. Its addition was born from the notion that testing and iterating should be as easy as possible in Grooper.
Concerning General Data Filtering
Data filtering is the procedure of choosing a more specific part of your data set and using that subset for viewing or analysis. Filtering is usually (but not always) temporary – the complete data set is retained, but only a portion of it (the filtered set) is used for the calculation or review.
Uses for data filtering:
- Observe results for a particular period of time.
- Calculate results for particular groups of interest.
- Exclude unwanted observations from an analysis.
- Train and validate statistical models.
Filtering requires establishing rules or logic to identify the data points you want to include in your analysis. Data filtering is also known as “subsetting” data, or a data “drill-down”. This article first illustrates a filtered data set, then discusses how to configure and use filtering in Grooper.
Example of Filtering
The table below shows a selection of data from a survey about people’s preferred cat. The survey data contains demographic information about the respondents as well as each person’s preferred cat and that person’s rating (out of 5) for each of six varieties of cat.
Filtering this data involves:
- Coming up with a rule for the observations needed.
- Selecting the observations that fit the rule.
- Conducting the analysis using only the information contained in those selected observations.
For example, the table below shows the data filtered for Males only. The darker colored rows are kept in the analysis while the remaining rows are excluded. Results for Males are then calculated from the highlighted rows only (IDs 2, 9, 11, 12, 13, 14). If we want to know the average rating for Persian Cat among males, we would compute that as (5 + 5 + 4 + 5 + 5 + 3) / 6 = 4.5.
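The three filtering steps can be sketched in a few lines of Python. This is only an illustration of the arithmetic above, not anything Grooper runs. The Male rows use the IDs and Persian ratings quoted in the text (pairing them in the order listed is an assumption), and the Female rows are made-up placeholders because the full survey table is not reproduced here.

```python
from statistics import mean

# Each row: (respondent ID, gender, rating for Persian out of 5).
# Male rows: IDs and ratings quoted in the article (paired in listed order,
# which is an assumption). Female rows: hypothetical placeholders.
survey = [
    (1, "Female", 4),
    (2, "Male", 5),
    (3, "Female", 2),
    (9, "Male", 5),
    (11, "Male", 4),
    (12, "Male", 5),
    (13, "Male", 5),
    (14, "Male", 3),
]


def is_male(row):
    """Step 1: the rule that decides which observations are needed."""
    return row[1] == "Male"


# Step 2: select the observations that fit the rule.
males = [row for row in survey if is_male(row)]
male_ids = [row[0] for row in males]

# Step 3: run the analysis on the selected observations only.
average_persian = mean(row[2] for row in males)

print(male_ids)         # [2, 9, 11, 12, 13, 14]
print(average_persian)  # 4.5
```

Note that the original list is left untouched: the filter produces a view of the data for this one calculation, which is what makes this kind of filtering temporary.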
Results for Different Groups
A common need in research is to gather results for different groups in the data. One may want to ask about the prevalence of poverty within a demographic segment of the overall population, understand sales figures for a particular quarter, or view survey results collected from customers who gave your software a positive review on Google. In each case, a logical rule defines whether each case in the sample is excluded or included.
From the example above, we may wish to compute the average rating for each cat among the Males in the sample. Such filtering transforms the results like this:
Sometimes filtering is carried out implicitly. For example, in survey research, the columns of a crosstab correspond to a special case of filtering, where filtered results are computed separately for each column, and the results are displayed side-by-side.
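To make the crosstab idea concrete, here is a minimal Python sketch in which each column is simply the same average computed on a differently filtered subset. All rows are hypothetical, and "Siamese" is a stand-in variety name; only "Persian" appears in the survey described above.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical rows: (gender, cat variety, rating out of 5).
ratings = [
    ("Male", "Persian", 5),
    ("Male", "Persian", 3),
    ("Male", "Siamese", 2),
    ("Female", "Persian", 4),
    ("Female", "Siamese", 5),
    ("Female", "Siamese", 3),
]

# Collect the ratings that fall into each (variety, gender) cell.
cells = defaultdict(list)
for gender, variety, rating in ratings:
    cells[(variety, gender)].append(rating)

# Every cell is the same analysis applied to one filtered subset.
crosstab = {cell: mean(values) for cell, values in cells.items()}

# Display the filtered results side by side, one column per gender.
for variety in ("Persian", "Siamese"):
    print(variety, crosstab[(variety, "Male")], crosstab[(variety, "Female")])
```

Nobody wrote an explicit "Males only" or "Females only" filter here, yet each column is exactly that, which is why this counts as implicit filtering.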
Data Cleansing
One reason for filtering data is to remove observations that may contain errors or are undesirable for analysis. For example, you may want to remove respondents who did not complete the survey, respondents who raced through the survey and selected answers without paying attention to what they were answering (“speeders”), or cases where data entered manually has been entered with mistakes. In other areas of research, a multivariate technique may only be applicable to cases where there is complete information for all the variables that were measured, and so a filter may be constructed to remove cases where some observations are missing.
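A cleansing filter of this kind might look like the following Python sketch. The field names, the sample responses, and the 60-second "speeder" threshold are all assumptions chosen for illustration.

```python
# Hypothetical survey responses; None marks an unanswered question.
responses = [
    {"id": 1, "seconds_taken": 240, "ratings": [5, 4, 3]},
    {"id": 2, "seconds_taken": 35, "ratings": [3, 3, 3]},      # speeder
    {"id": 3, "seconds_taken": 310, "ratings": [4, None, 2]},  # incomplete
    {"id": 4, "seconds_taken": 190, "ratings": [2, 5, 5]},
]

# Assumed cutoff: anything faster is treated as not paying attention.
MIN_SECONDS = 60


def is_clean(response):
    """Keep a response only if it is complete and was not rushed."""
    complete = all(rating is not None for rating in response["ratings"])
    attentive = response["seconds_taken"] >= MIN_SECONDS
    return complete and attentive


clean = [r for r in responses if is_clean(r)]
removed_ids = [r["id"] for r in responses if not is_clean(r)]

print([r["id"] for r in clean])  # [1, 4]
print(removed_ids)               # [2, 3]
```

Keeping the list of removed IDs is a deliberate choice: it lets you audit what the filter excluded instead of silently discarding it.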
Checking Results
Filtering can be used to evaluate the performance of statistical algorithms and models. The basic idea is to split the sample into two or more groups, apply the analysis independently to each group, and compare the results. This kind of filtering selects cases at random rather than by a rule based on the data, which keeps the comparison valid. The resulting groups are often referred to as training, testing, and validation sets.
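A random split can be sketched with Python's standard library as below. The function name, the 25% test fraction, and the fixed seed are assumptions made for the example; the seed only makes the split repeatable.

```python
import random


def random_split(cases, test_fraction=0.25, seed=0):
    """Split cases at random into (train, test) without using their content."""
    shuffled = list(cases)                 # copy so the input is not modified
    random.Random(seed).shuffle(shuffled)  # seeded for a repeatable split
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


case_ids = list(range(1, 21))  # 20 hypothetical case IDs
train, test = random_split(case_ids)

print(len(train), len(test))  # 15 5
```

Every case lands in exactly one group, so the two filtered sets never overlap and together they cover the full sample.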
How To
As stated above, the Content Type Filter property is available on three different Activities. Let's dive into setting this up for each type.
Content Type Filter for Classify
This functionality is useful when you need to reclassify specific document types without disturbing the classification of other documents. Again, the emphasis here is ease of testing and iteration, especially for models that grow larger over time. The filter can be used for testing in the Classify Tester and Unattended Activity Tester tabs of the individual step, or to filter the step itself when it runs in the actual Batch Process.
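Conceptually, the filter restricts a step to documents of the chosen types and leaves everything else alone. The Python sketch below illustrates only that idea: it is not Grooper's API, and the Document class, the apply_step function, and the content type names are all hypothetical. It also assumes exact name matching, which may differ from how Grooper matches Content Types.

```python
from dataclasses import dataclass


@dataclass
class Document:
    # Hypothetical stand-in for a document in a Batch.
    name: str
    content_type: str
    times_processed: int = 0


def apply_step(documents, step, content_type_filter=None):
    """Run `step` only on documents whose type is in the filter.

    With no filter, every document is processed. Returns the names of the
    documents that were processed; all others are left untouched.
    """
    processed = []
    for doc in documents:
        if content_type_filter is None or doc.content_type in content_type_filter:
            step(doc)
            processed.append(doc.name)
    return processed


def dummy_step(doc):
    # Stand-in for the real work of an activity.
    doc.times_processed += 1


batch = [
    Document("doc-1", "Invoice"),
    Document("doc-2", "Purchase Order"),
    Document("doc-3", "Invoice"),
    Document("doc-4", "Remittance"),
]

processed = apply_step(batch, dummy_step, content_type_filter={"Invoice"})
print(processed)  # ['doc-1', 'doc-3']
```

The documents outside the filter keep their original state, which is the property that makes filtered retesting safe on a large Batch.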
Content Type Filter for Extract
This functionality is useful when you need to test Extraction after making adjustments to your model. As you add more and more Data Elements to your model, and your Batches grow, filtering to test specific elements against specific documents can be an enormous time saver. The filter can be used for testing in the Unattended Activity Tester tab of the individual step, or to filter the step itself when it runs in the actual Batch Process.
Content Type Filter for Data Review
This functionality is less about testing and more about streamlining human activity. Data Review can be a very involved process, requiring a lot of interaction from many individuals, so filtering can be an excellent way to reduce errors and increase efficiency. Filtering a Data Review activity limits the Content Types available to it when someone is performing the review.
Version Differences
The Content Type Filter property did not exist prior to Grooper 2.9.