Content Type Filter

From Grooper Wiki

Filtering information can allow faster iterating and easier testing.

About

Content Type Filter is a property on the Classify, Extract, and Data Review activities. It was added to make testing and iterating as easy as possible in Grooper.

Concerning General Data Filtering

Data filtering is the procedure of choosing a more specific part of your data set and using that subset for viewing or analysis. Filtering is usually (but not always) temporary – the complete data set is retained, but only a portion of it (the filtered set) is used for the calculation or review.

Uses for data filtering:

  • Observe results for a particular period of time.
  • Calculate results for particular groups of interest.
  • Exclude unwanted observations from an analysis.
  • Train and validate statistical models.


Filtering requires establishing rules or logic to recognize the points you want to include in your analysis. Data filtering can also be known as “subsetting” data, or a data “drill-down”. This article will illustrate a filtered data set and, later, discuss how you configure and use filtering in Grooper.

Example of Filtering

The table below shows a selection of data from a survey about people's preferred cat. The survey data contains demographic information about the respondents as well as each person's preferred cat and that person's rating (out of 5) for each of six varieties of cat.

ID  Age         Gender  Preferred cat      Persian Cat  Bengal Cat  Maine Coon  Siamese Cat  British Shorthair  Munchkin Cat
1   25 to 29    Female  Munchkin Cat       2            5           2           3            1                  4
2   45 to 49    Male    Munchkin Cat       5            1           5           5            3                  4
3   25 to 29    Female  Bengal Cat         5            4           2           3            1                  1
4   25 to 29    Female  Persian Cat        4            2           2           2            2                  2
5   55 to 64    Female  Bengal Cat         3            4           3           3            4                  2
6   55 to 64    Female  British Shorthair  3            3           3           3            4                  4
7   50 to 54    Female  Maine Coon         2            3           5           2            2                  2
8   35 to 39    Female  Persian Cat        4            2           5           3            2                  5
9   65 or more  Male    British Shorthair  5            5           3           5            5                  3
10  45 to 49    Female  Maine Coon         4            4           4           5            5                  3
11  45 to 49    Male    Persian Cat        4            1           1           4            1                  1
12  55 to 64    Male    Persian Cat        5            2           2           5            2                  2
13  55 to 64    Male    Persian Cat        5            2           2           3            2                  2
14  30 to 34    Male    Munchkin Cat       3            2           5           3            3                  5
15  65 or more  Female  British Shorthair  2            4           2           5            4                  2


Filtering this data involves:

  1. Coming up with a rule for the observations needed.
  2. Selecting the observations that fit the rule.
  3. Conducting the analysis using only the information contained in those selected observations.

For example, the table below shows the data filtered for Males only. The rows marked with an asterisk in the Kept column are kept in the analysis while the remaining rows are excluded. Results for Males are then calculated from the marked rows (IDs 2, 9, 11, 12, 13, 14). If we want to know the average rating for Persian Cat among males, we would compute that as (5 + 5 + 4 + 5 + 5 + 3) / 6 = 4.5.

Kept  ID  Age         Gender  Preferred cat      Persian Cat  Bengal Cat  Maine Coon  Siamese Cat  British Shorthair  Munchkin Cat
      1   25 to 29    Female  Munchkin Cat       2            5           2           3            1                  4
*     2   45 to 49    Male    Munchkin Cat       5            1           5           5            3                  4
      3   25 to 29    Female  Bengal Cat         5            4           2           3            1                  1
      4   25 to 29    Female  Persian Cat        4            2           2           2            2                  2
      5   55 to 64    Female  Bengal Cat         3            4           3           3            4                  2
      6   55 to 64    Female  British Shorthair  3            3           3           3            4                  4
      7   50 to 54    Female  Maine Coon         2            3           5           2            2                  2
      8   35 to 39    Female  Persian Cat        4            2           5           3            2                  5
*     9   65 or more  Male    British Shorthair  5            5           3           5            5                  3
      10  45 to 49    Female  Maine Coon         4            4           4           5            5                  3
*     11  45 to 49    Male    Persian Cat        4            1           1           4            1                  1
*     12  55 to 64    Male    Persian Cat        5            2           2           5            2                  2
*     13  55 to 64    Male    Persian Cat        5            2           2           3            2                  2
*     14  30 to 34    Male    Munchkin Cat       3            2           5           3            3                  5
      15  65 or more  Female  British Shorthair  2            4           2           5            4                  2
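
The three filtering steps can be sketched in a few lines of Python. This is only an illustration of the idea, not anything Grooper runs; the tuple layout and the names `rows`, `is_male`, and `persian_mean_males` are assumptions made for the example, and only the Persian Cat rating column is carried over from the table.

```python
from statistics import mean

# Survey rows from the table above: (ID, age band, gender, preferred cat, Persian Cat rating).
# The other five rating columns are omitted to keep the sketch short.
rows = [
    (1, "25 to 29", "Female", "Munchkin Cat", 2),
    (2, "45 to 49", "Male", "Munchkin Cat", 5),
    (3, "25 to 29", "Female", "Bengal Cat", 5),
    (4, "25 to 29", "Female", "Persian Cat", 4),
    (5, "55 to 64", "Female", "Bengal Cat", 3),
    (6, "55 to 64", "Female", "British Shorthair", 3),
    (7, "50 to 54", "Female", "Maine Coon", 2),
    (8, "35 to 39", "Female", "Persian Cat", 4),
    (9, "65 or more", "Male", "British Shorthair", 5),
    (10, "45 to 49", "Female", "Maine Coon", 4),
    (11, "45 to 49", "Male", "Persian Cat", 4),
    (12, "55 to 64", "Male", "Persian Cat", 5),
    (13, "55 to 64", "Male", "Persian Cat", 5),
    (14, "30 to 34", "Male", "Munchkin Cat", 3),
    (15, "65 or more", "Female", "British Shorthair", 2),
]


def is_male(row):
    """Step 1: the rule that decides which observations are kept."""
    return row[2] == "Male"


# Step 2: select the observations that fit the rule.
males = [row for row in rows if is_male(row)]

# Step 3: run the analysis on the selected observations only.
male_ids = [row[0] for row in males]
persian_mean_males = mean(row[4] for row in males)

print(male_ids)            # [2, 9, 11, 12, 13, 14]
print(persian_mean_males)  # 4.5
```

The full list `rows` is never modified; the filter only produces a temporary subset, which matches the point made earlier that the complete data set is retained.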

Results for Different Groups

A common need in research is to obtain results for different groups in the data. One may want to ask about the prevalence of poverty within a demographic segment of the overall population, understand sales figures for a particular quarter, or view survey results collected from customers who gave your software a positive review on Google. In each case, a logical rule defines whether each case in the sample is included or excluded.

From the example above, we may wish to compute the average rating for each cat for the Males in the sample. Such filtering transforms the results like this:

                   Unfiltered            Filtered to Males
Cat                Average  Sample Size  Average  Sample Size
Persian Cat        3.73     15           4.50     6
Bengal Cat         2.93     15           2.17     6
Maine Coon         3.07     15           3.00     6
Siamese Cat        3.60     15           4.17     6
British Shorthair  2.73     15           2.67     6
Munchkin Cat       2.80     15           2.83     6
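
The unfiltered and filtered averages can be reproduced with a short Python sketch. The data is the survey table from the earlier example; the names `survey`, `CATS`, and `average_ratings` are assumptions introduced for illustration only.

```python
from statistics import mean

CATS = ["Persian Cat", "Bengal Cat", "Maine Coon",
        "Siamese Cat", "British Shorthair", "Munchkin Cat"]

# (gender, ratings in CATS order) for respondents 1 through 15.
survey = [
    ("Female", (2, 5, 2, 3, 1, 4)),
    ("Male",   (5, 1, 5, 5, 3, 4)),
    ("Female", (5, 4, 2, 3, 1, 1)),
    ("Female", (4, 2, 2, 2, 2, 2)),
    ("Female", (3, 4, 3, 3, 4, 2)),
    ("Female", (3, 3, 3, 3, 4, 4)),
    ("Female", (2, 3, 5, 2, 2, 2)),
    ("Female", (4, 2, 5, 3, 2, 5)),
    ("Male",   (5, 5, 3, 5, 5, 3)),
    ("Female", (4, 4, 4, 5, 5, 3)),
    ("Male",   (4, 1, 1, 4, 1, 1)),
    ("Male",   (5, 2, 2, 5, 2, 2)),
    ("Male",   (5, 2, 2, 3, 2, 2)),
    ("Male",   (3, 2, 5, 3, 3, 5)),
    ("Female", (2, 4, 2, 5, 4, 2)),
]


def average_ratings(rows):
    """Return {cat: mean rating rounded to two decimals} for the given rows."""
    return {
        cat: round(mean(ratings[i] for _, ratings in rows), 2)
        for i, cat in enumerate(CATS)
    }


# The filter rule: keep only rows whose gender is "Male".
male_rows = [row for row in survey if row[0] == "Male"]

unfiltered = average_ratings(survey)
males_only = average_ratings(male_rows)

for cat in CATS:
    print(f"{cat:<18} {unfiltered[cat]:.2f} {len(survey):>3}   "
          f"{males_only[cat]:.2f} {len(male_rows):>2}")
```

The same `average_ratings` function is applied to both the full sample and the subset; only the rows passed in change, which is what distinguishes the two halves of the table.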


Sometimes filtering is carried out implicitly. For example, in survey research, the columns of a crosstab correspond to a special case of filtering, where filtered results are computed separately for each column, and the results are displayed side-by-side.

Data Cleansing

One reason for filtering data is to remove observations that may contain errors or are undesirable for analysis. For example, you may want to remove respondents who did not complete the survey, respondents who raced through the survey and selected answers without paying attention to what they were answering (“speeders”), or cases where data entered manually has been entered with mistakes. In other areas of research, a multivariate technique may only be applicable to cases where there is complete information for all the variables that were measured, and so a filter may be constructed to remove cases where some observations are missing.
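
As a rough sketch of a cleansing filter, the Python below drops incomplete responses and speeders. The records, the field names `ratings` and `seconds_taken`, and the 30-second cutoff are all invented for illustration; a real survey would need its own definition of each.

```python
# Hypothetical responses: None marks an unanswered question, and
# seconds_taken is how long the respondent spent on the survey.
responses = [
    {"id": 1, "ratings": [4, 2, 5], "seconds_taken": 210},
    {"id": 2, "ratings": [3, None, 4], "seconds_taken": 180},  # incomplete
    {"id": 3, "ratings": [5, 5, 5], "seconds_taken": 12},      # speeder
    {"id": 4, "ratings": [2, 3, 1], "seconds_taken": 95},
]

MIN_SECONDS = 30  # assumed cutoff below which a respondent counts as a speeder


def is_complete(response):
    """True when every question has an answer."""
    return all(rating is not None for rating in response["ratings"])


def is_speeder(response):
    """True when the survey was finished faster than the assumed cutoff."""
    return response["seconds_taken"] < MIN_SECONDS


# Keep only responses that pass both cleansing rules.
clean = [r for r in responses if is_complete(r) and not is_speeder(r)]
clean_ids = [r["id"] for r in clean]

print(clean_ids)  # [1, 4]
```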

Checking Results

Filtering can be used to evaluate the performance of statistical algorithms and models. The basic idea is to split up the sample into two or more groups, and to then apply the analysis independently to each group and compare the results. This kind of filtering would select cases from the data at random, rather than using some rule which is based on the data. This ensures a valid comparison and is often referred to as training, testing, and validating.
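
A random split of this kind might look like the following Python sketch. The function name `split_sample`, the 60/40 proportion, and the fixed seed are assumptions chosen for the example; the point is only that membership in each group is decided by chance rather than by a rule on the data.

```python
import random


def split_sample(ids, train_fraction=0.6, seed=0):
    """Randomly split ids into two disjoint groups (training, testing)."""
    rng = random.Random(seed)       # fixed seed so the split is repeatable
    shuffled = list(ids)
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Split the 15 survey respondents (IDs 1 through 15) into two groups.
train_ids, test_ids = split_sample(range(1, 16))

print(len(train_ids), len(test_ids))  # 9 6
```

Running the same analysis on `train_ids` and `test_ids` separately and comparing the results is the comparison described above.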

How To

As stated above, the Content Type Filter property is available on three different Activities. Let's dive into setting this up for each one.

Content Type Filter for Classify

This functionality is useful when you need to reclassify specific document types without disturbing the classification of other documents. Again, the emphasis is on ease of testing and iteration, especially as models grow larger over time. The filter can be used for testing in the Classify Tester and Unattended Activity Tester tabs of the individual step, or to filter the step itself when the actual Batch Process runs.

  1. Navigate to Batch Processing > Processes > Working and use the Add > Batch... command.
  2. Click the Add Step... button.
  3. Set the Activity Type property to Classify.
  4. Select the Content Model Scope property and click the drop-down arrow. Select the desired scope.
  5. Set the Apply To property to Classified.
    • Setting this value exposes the Content Type Filter property.
  6. Select the Content Type Filter property and click the drop-down button.
  7. The Content Types available in this drop-down menu are limited to the scope selected previously. Select as few or as many Content Types as you wish to filter this step to.
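
Conceptually, the filtered Classify step behaves like the Python sketch below. This is not Grooper's API or implementation: the `Document` class, the `reclassify_filtered` function, the Content Type names, and the stand-in classifier are all hypothetical, and the sketch assumes that only documents whose current Content Type is in the filter are reclassified.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Document:
    """Hypothetical stand-in for a document in a Batch."""
    name: str
    content_type: Optional[str]  # None means not yet classified


def reclassify_filtered(docs, content_type_filter, classify):
    """Re-run classify only on documents whose current type is in the filter."""
    touched = []
    for doc in docs:
        if doc.content_type in content_type_filter:
            doc.content_type = classify(doc)
            touched.append(doc.name)
    return touched


batch = [
    Document("doc-1", "Invoice"),
    Document("doc-2", "Purchase Order"),
    Document("doc-3", "Invoice"),
    Document("doc-4", None),
]

# Stand-in classifier for the example; a real one would inspect the document.
touched = reclassify_filtered(batch, {"Invoice"}, lambda doc: "Credit Memo")
final_types = [doc.content_type for doc in batch]

print(touched)      # ['doc-1', 'doc-3']
print(final_types)  # ['Credit Memo', 'Purchase Order', 'Credit Memo', None]
```

Documents outside the filter keep whatever classification they already had, which is the "without disturbing the classification of other documents" behavior described above.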

Content Type Filter for Extract

This functionality is useful when you need to test Extraction after making adjustments to your model. As you add more and more Data Elements to your model, and your Batches grow, filtering to test specific elements against specific documents can be an enormous time saver. The filter can be used for testing in the Unattended Activity Tester tab of the individual step, or to filter the step itself when the actual Batch Process runs.

  1. Navigate to Batch Processing > Processes > Working and use the Add > Batch... command.
  2. Click the Add Step... button.
  3. Set the Activity Type property to Extract.
  4. Select the Content Type Filter property and click the drop-down arrow. Select as many Content Types as you deem necessary.
  5. Select the Data Element Filter property and click the drop-down arrow.
    • The Content Types selected in the previous step are listed in the drop-down list. Expanding their hierarchies exposes the Data Elements available from your Data Model.
  6. Select the Data Elements for each Content Type.
    • This is a very granular approach and gives you an enormous amount of control for filtering purposes.
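
The combined effect of the two filters can be pictured with the Python sketch below. It is a conceptual model only, not Grooper's API: the function `plan_extraction`, the document tuples, and the Content Type and Data Element names are hypothetical.

```python
def plan_extraction(docs, content_type_filter, data_element_filter):
    """List the (document, Data Element) pairs a filtered Extract step would cover.

    docs is a list of (name, content_type) tuples; data_element_filter maps
    each Content Type to the Data Elements selected for it.
    """
    plan = []
    for name, content_type in docs:
        if content_type not in content_type_filter:
            continue  # document is outside the Content Type Filter
        for element in data_element_filter.get(content_type, []):
            plan.append((name, element))
    return plan


docs = [
    ("doc-1", "Invoice"),
    ("doc-2", "Purchase Order"),
    ("doc-3", "Invoice"),
    ("doc-4", "Remittance"),
]
content_type_filter = {"Invoice", "Purchase Order"}
data_element_filter = {
    "Invoice": ["Invoice Number", "Total"],
    "Purchase Order": ["PO Number"],
}

plan = plan_extraction(docs, content_type_filter, data_element_filter)
for name, element in plan:
    print(name, "->", element)
```

Here `doc-4` is skipped entirely because its Content Type is not in the filter, and each remaining document is limited to the Data Elements chosen for its type.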

Content Type Filter for Data Review

This functionality is less about testing and more about streamlining human activity. Data Review can be a very involved process, requiring a lot of interaction from many individuals, so filtering can be an excellent way to reduce error and increase efficiency. Filtering a Data Review activity limits the Content Types available to it when someone is performing the review.

  1. Navigate to Batch Processing > Processes > Working and use the Add > Batch... command.
  2. Click the Add Step... button.
  3. Set the Activity Type property to Data Review.
  4. Select the Index Navigator Settings property and click the ellipsis button.
  5. In the Index Navigator Settings window, select the Content Type Filter property and click the drop-down arrow.
  6. Select as few or as many Content Types as you wish to filter this step to.

Version Differences

The Content Type Filter property did not exist prior to Grooper 2.9.