HTTP Import (Import Provider)


HTTP Import is an Import Provider used to import web-based content (web pages and files hosted on an HTTP server). HTTP Import can be used to ingest individual web pages, defined portions of a website or entire websites into Grooper.

You may download the ZIP(s) below and upload them into your own Grooper environment (version 2025). They contain one or more Projects with resources used in examples throughout this article.

About

HTTP Import is used to import web content into Grooper Batches. It can be used to import the following from an HTTP server:

  • Individual web pages (HTML documents)
  • Files hosted on web servers, including PDFs hosted on websites.
  • Entire websites

How does it work?

HTTP Import will bring in one or more web pages based on how the provider is configured. This configuration determines how Grooper navigates pages on the website. One Grooper document is created for each distinct URL. Each web page is imported as a Batch Folder with an HTML file as its primary attachment. For URLs that resolve to files (such as PDFs), a Batch Folder is created with the downloaded file as its primary attachment.
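For example, importing three URLs might produce a Batch shaped roughly like the sketch below. The folder and file names here are placeholders for illustration; actual naming depends on your configuration.

  Batch
    Folder 001
      page-001.html       (primary attachment; one folder per imported URL)
    Folder 002
      page-002.html
    Folder 003
      document.pdf        (a URL that resolved to a PDF file)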

How is it configured?

The HTTP Import configuration involves setting a "Source". This source can be:

  • A single web page.
  • Or, the root of a "web app"
    • Commonly, the root of a website is also the root of a web app. However, a single website may also host multiple web apps as subsites.

When a web app's root is defined, one or more relative URLs are added to the "Relative Page URLs" list to specify which pages to include in the import. Furthermore, HTTP Import will traverse links on a web page and import the linked pages when a "Link Selector" is configured.
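For instance, a hypothetical HTTP Resource for a small site might be configured roughly like this (example.com and the paths shown are placeholders, not settings used elsewhere in this article):

  URL:                 https://www.example.com
  Relative Page URLs:  about/contact.html
                       products/index.html
  Link Selectors:      Selector = a, Recursive = False
  Description:         Example import of pages from example.com

With this sketch, the two listed pages would be imported, and any pages linked from them (matched by the a selector) would be imported as well.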

How To

These how-to instructions will demonstrate the basics of HTTP content ingestion. We will show you how to use HTTP Import for the following scenarios:

  • How to import a single web page with HTTP Import.
  • How to import multiple site pages with HTTP Import in two ways:
    • Using "Relative Page URLs" only - This is useful if you have a known list of pages you want to import from a single domain.
    • Using "Link Selectors" - This is useful if you want to traverse links from a starting webpage (or webpages) and import each linked page.

We will import pages related to the Oklahoma state constitution to demonstrate each of these scenarios.

Import a single web page with HTTP Import

The easiest way to configure HTTP Import is to import a single page. This is often not very practical, but it demonstrates the basics of how HTTP Import works.

  • This tutorial shows you how to use HTTP Import for user-directed imports from the Imports page. However, just like any other import provider, you could perform scheduled imports with HTTP Import using an Import Watcher (if your use case demands it).

To import a single web page with HTTP Import do the following:

  1. Go to the Imports page.
  2. Press the "New Import Job" button to configure a new Import Job.
  3. This brings up the "Submit Import Job" editor. In the "Description" property, enter a brief description for the Import Job.
  4. Select the "Provider" property to choose the Import Provider. Using the dropdown list, choose "HTTP Import".
  5. Expand the "Provider" property.
  6. Open the "Sources" editor (press the "..." button) to add one or more HTTP Resources.
  7. Press the "Add" button to add a new HTTP Resource.
  8. To import a single web page, enter that web page's address in the "URL" property.
    • For example: Enter https://en.wikipedia.org/wiki/Constitution_of_Oklahoma to import the Wikipedia entry for the Oklahoma State Constitution.
  9. In the "Description" field, you must enter a brief description for the HTTP Resource. This is a required field.
  10. Press "OK" in the Sources editor when finished configuring the HTTP Resource.
  11. Configure the rest of the Provider settings as needed. These include the settings common to all Import Providers, including the "Batch Creation" settings where users select which Batch Process to use.
    • The "Wait Time" property controls how much time to wait before each page is imported. This property should not be necessary when importing a single web page.
  12. Press the "Submit" button to submit the Import Job.
  13. Assuming you have an Import Watcher service running, the webpage will be imported accordingly.
    • HTML documents are just HTML text. That text gets rendered on screen in a web browser. This includes markup that links to images and CSS style sheets. If the webpage uses relative links to reference these resources (most do), it will look "weird" in the Grooper Document Viewer. The "Condition HTML" command can help here. See the section on using Condition HTML below for more information.

Import multiple site pages with HTTP Import

If you're using HTTP Import, it is more likely you want to import several web pages: either a known list of pages in a website's domain, or several pages linked throughout a domain (possibly the entire domain).

  • To import a known list of pages in a domain, you will need to configure the "Relative Page URLs" property, adding one Relative Page URL for each page you want to import.
  • To import pages linked throughout a domain, you will add one or more "Link Selectors" to determine which links to follow (and which links not to follow) when importing pages.

We will show you both scenarios using the same starting point: The Oklahoma State Constitution on ballotpedia.org.

Using Relative Page URLs only

If you know exactly what pages you want to import, you can configure HTTP Import to import that list of pages by adding them to the list of "Relative Page URLs". A "relative page URL" is simply the path relative to the website domain.

For example: If https://www.example.com/about/contact.html is the full web address:
  • https://www.example.com is the domain.
  • about/contact.html is the page's relative path.

To import from a known list of pages, using "Relative Page URLs" do the following:

  1. Go to the Imports page.
  2. Press the "New Import Job" button to configure a new Import Job.
  3. This brings up the "Submit Import Job" editor. In the "Description" property, enter a brief description for the Import Job.
  4. Select the "Provider" property to choose the Import Provider. Using the dropdown list, choose "HTTP Import".
  5. Expand the "Provider" property.
  6. Open the "Sources" editor (press the "..." button) to add one or more HTTP Resources.
  7. Press the "Add" button to add a new HTTP Resource.
  8. In the "URL" property, enter the website's domain.
    • Example: We want to import the Oklahoma State Constitution hosted on webpages at ballotpedia.org. We would enter https://ballotpedia.org.
  9. Open the "Relative Page URLs" editor (Press the "..." button).
  10. Enter the relative paths for each page you wish to import (relative to the domain entered in the "URL").
    • Example: We want to import the Oklahoma State Constitution hosted on ballotpedia.org. We would enter the following list of relative URLs:
Oklahoma_Constitution
Preamble,_Oklahoma_Constitution
Article_I,_Oklahoma_Constitution
Article_II,_Oklahoma_Constitution
Article_III,_Oklahoma_Constitution
Article_IV,_Oklahoma_Constitution
Article_V,_Oklahoma_Constitution
Article_VI,_Oklahoma_Constitution
Article_VII,_Oklahoma_Constitution
Article_VIIA,_Oklahoma_Constitution
Article_VIIB,_Oklahoma_Constitution
Article_VIII,_Oklahoma_Constitution
Article_IX,_Oklahoma_Constitution
Article_X,_Oklahoma_Constitution
Article_XI,_Oklahoma_Constitution
Article_XII,_Oklahoma_Constitution
Article_XIIA,_Oklahoma_Constitution
Article_XIII,_Oklahoma_Constitution
Article_XIIIA,_Oklahoma_Constitution
Article_XIIIB,_Oklahoma_Constitution
Article_XIV,_Oklahoma_Constitution
Article_XV,_Oklahoma_Constitution
Article_XVI,_Oklahoma_Constitution
Article_XVII,_Oklahoma_Constitution
Article_XVIII,_Oklahoma_Constitution
Article_XIX,_Oklahoma_Constitution
Article_XX,_Oklahoma_Constitution
Article_XXI,_Oklahoma_Constitution
Article_XXII,_Oklahoma_Constitution
Article_XXIII,_Oklahoma_Constitution
Article_XXIV,_Oklahoma_Constitution
Article_XXV,_Oklahoma_Constitution
Article_XXV-A,_Oklahoma_Constitution
Article_XXVI,_Oklahoma_Constitution
Article_XXVIIIA,_Oklahoma_Constitution
Article_XXVIII,_Oklahoma_Constitution
Article_XXIX,_Oklahoma_Constitution
Article_XXX,_Oklahoma_Constitution
Schedule,_Oklahoma_Constitution
  1. Press "OK" in the Relative Page URLs editor when finished.
  2. In the "Description" field, you must enter a brief description for the HTTP Resource. This is a required field.
  3. Press "OK" in the Sources editor when finished configuring the HTTP Resource.
  4. Configure the rest of the Provider settings as needed. These include the settings common to all Import Providers, including the "Batch Creation" settings where users select which Batch Process to use.
    • The "Wait Time" property controls how much time to wait before each page is imported. This property should not be necessary when importing a single web page.
  5. Press the "Submit" button to submit the Import Job.
  6. Assuming you have an Import Watcher service running, one webpage for each relative URL in the "Relative Page URLs" list be imported accordingly.
    • HTML documents are just HTML text. That text gets rendered on screen in a web browser. This includes text that links to images and CSS style sheets. If the webpage uses relative links to link to these resources (most do), it will look "weird" in the Grooper Document Viewer. The "Condition HTML" command can help here. See the section on using Condition HTML below for more information.

Using Link Selectors

Link Selectors allow you to start at one (or more) web pages and import the pages linked from them. This is a great way to import a large number of pages from a website.

  • Link Selectors use CSS selectors to select the links on a page you want to follow. Grooper will open each link matched by the selector and import the page.
    • The most common selector is simply a.
    • "Exclusion Selectors" may optionally be configured to exclude pages whose links you don't want Grooper to follow.
  • Grooper will crawl an entire hierarchy of pages when the Link Selector's "Recursive" property is set to "True".
  • You can also further filter pages you want to include/exclude from import using the "Included URL Pattern" and "Excluded URL Pattern" properties.
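To illustrate, consider the hypothetical page fragment below (the element ids and paths are made up). A Link Selector of a would match both links; a more specific selector like div#content a would match only the link inside the content <div>:

  <div id="content">
    <a href="/Article_I,_Oklahoma_Constitution">Article I</a>
  </div>
  <div id="footer">
    <a href="/About_The_Site">About</a>
  </div>

An Included URL Pattern such as Oklahoma_Constitution$ would then keep the first link and skip the second, even if the selector matched both.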


To import using Link Selectors, do the following:

  1. Go to the Imports page.
  2. Press the "New Import Job" button to configure a new Import Job.
  3. This brings up the "Submit Import Job" editor. In the "Description" property, enter a brief description for the Import Job.
  4. Select the "Provider" property to choose the Import Provider. Using the dropdown list, choose "HTTP Import".
  5. Expand the "Provider" property.
  6. Open the "Sources" editor (press the "..." button) to add one or more HTTP Resources.
  7. Press the "Add" button to add a new HTTP Resource.
  8. In the "URL" property, enter the website's domain.
    • Example: We want to import the Oklahoma State Constitution hosted on webpages at ballotpedia.org. We would enter https://ballotpedia.org.
  9. Open the "Relative Page URLs" editor (Press the "..." button).
  10. You must enter at least one relative page path (relative to the domain entered in the "URL").
    • Think of this as the "starting point" for the import. Starting on this page, you should be able to follow links to each page you want to import.
      • Multiple relative page URLs may be entered here.
      • When the Link Selector's "Recursive" property is set to "True", Grooper will also continue to follow links found on any page linked from these relative URLs as well.
      • Example: We want to import the Oklahoma State Constitution hosted on ballotpedia.org. The webpage https://ballotpedia.org/Oklahoma_Constitution acts as a table of contents for the rest of the pages detailing each article in the constitution. They are all linked from this one page. We would enter Oklahoma_Constitution in the "Relative Page URLs" editor.
  11. Press "OK" in the Relative Page URLs editor when finished.
  12. Open the "Link Selectors" editor (Press the "..." button).
  13. Press the "+" button to add a new Hyperlink Selector.
  14. In the "Selector" property, enter a CSS selector that matches the links Grooper should follow.
    • a is the most commonly used selector.
  15. Configure the remaining properties as needed. Of note are the following properties:
    • Recursive: If you want Grooper to continue opening links on every subsequent page it opens, enable the "Recursive" property. Grooper will continue opening pages until no more matching links are found.
      • Be aware, this can cause Grooper to import large numbers of web pages. Use with caution.
    • Included URL Pattern: Enabling this allows you to further restrict the webpages Grooper imports. It allows users to enter a regular expression pattern. The page will only be imported if the regex matches the URL.
      • Example: We want to import the pages on ballotpedia.org for each article in the Oklahoma Constitution. The URLs for the pages we want to import all end in "Oklahoma_Constitution". We could enable the Included URL Pattern and use the following pattern: Oklahoma_Constitution$. This would further ensure we only imported the pages we wanted to import.
    • Excluded URL Pattern: Enabling this allows you to further restrict the webpages Grooper imports. It allows users to enter a regular expression pattern. If the URL matches this regex, it will be excluded from import.
  16. In the "Description" field, you must enter a brief description for the HTTP Resource. This is a required field.
  17. Press "OK" in the Sources editor when finished configuring the HTTP Resource.
  18. Configure the rest of the Provider settings as needed. These include the settings common to all Import Providers, including the "Batch Creation" settings where users select which Batch Process to use.
    • The "Wait Time" property controls how much time to wait before each page is imported. This property should not be necessary when importing a single web page.
  19. Press the "Submit" button to submit the Import Job.
  20. Assuming you have an Import Watcher service running, one webpage for each link opened will be imported accordingly.
    • HTML documents are just HTML text. That text gets rendered on screen in a web browser. This includes markup that links to images and CSS style sheets. If the webpage uses relative links to reference these resources (most do), it will look "weird" in the Grooper Document Viewer. The "Condition HTML" command can help here. See the section on using Condition HTML below for more information.
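As a hypothetical illustration of how the two URL patterns work together (the Excluded pattern below is made up and may not reflect real ballotpedia.org URLs):

  Included URL Pattern:  Oklahoma_Constitution$
  Excluded URL Pattern:  Talk:|Special:|action=edit

A linked page is imported only if its URL matches the Included pattern and does not match the Excluded pattern.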

Batch Process considerations

When importing HTML documents into Grooper, there are several commands you may want to execute in a Batch Process to better prepare them for your end goal. These include:

  • HTTP Link - Load Content - Use this command when performing a sparse HTTP Import to load each HTML document in parallel.
    • Sparse imports are enabled by setting the "Sparse Import" property to "True".
    • Imports always run "single threaded" in Grooper. By executing this command in a Batch Process, the load operation (fully copying the document to the Grooper Repository) can be executed by multiple processing threads.
    • To do this, the first step in your Batch Process should be "Execute" with the "HTTP Link - Load Content" command added and configured.
  • HTML Document - Condition HTML - This command has two main purposes: (1) It can make the HTML document more human-readable in the Document Viewer and (2) it can isolate and remove unwanted HTML elements. See more in the section below.
  • HTML Document - Convert to PDF - This command converts the HTML page to a PDF document. Grooper can then process the PDF just like it processes any PDF.
  • HTML Document - Convert to Text - This command converts the HTML page to a TXT document. This is useful for webpages that present as text files (for example, this page from the US Code of Federal Regulations hosted on govinfo.gov). It will get rid of unnecessary HTML elements and leave you with just plain text.
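Putting these together, a Batch Process for a sparse HTTP Import might be outlined roughly as follows. This is only a sketch; the exact steps and their configuration depend on your end goal.

  1. Execute - "HTTP Link - Load Content"        (loads each page's content using multiple processing threads)
  2. Execute - "HTML Document - Condition HTML"  (fixes relative links and removes unwanted elements)
  3. Execute - "HTML Document - Convert to PDF" or "Convert to Text"  (only if a non-HTML format is needed downstream)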

The Condition HTML command

FYI

The Condition HTML command's "Attribute Rules" and "Wrap Rules" have a very niche purpose.

The "Attribute Rules" and "Wrap Rules" assist in styling HTML elements in HTML documents. These properties were developed for a use case that involved converting XML documents to HTML documents using the XML Transform activity.

  • "Attribute Rules" add attributes to existing HTML elements.
  • "Wrap Rules" wrap text in an HTML element. Text is matched with regular expressions, then wrapped in an HTML element of your choosing.

Attribute Rules and Wrap Rules do not pertain to HTTP Import and will not be discussed further in this article.

The Condition HTML command has two main purposes:

1. It can make the HTML document more human-readable in the Document Viewer.

Many (if not most) web pages make use of "relative links" to various resources. The link is relative to the website's domain. This often includes images and CSS style sheets. When Grooper imports an HTML document, these relative paths are effectively broken, making the web page "look weird". For the page to be more human-readable, these relative links need to be replaced with absolute links.

Condition HTML can prepend a domain URL to relative links, making them absolute links. This will resolve those broken links and the web page will look like the user expects it to in the Grooper Document Viewer.

  • This is done by configuring the "Site URL" property.
  • In "Site URL", enter a URL that will be prepended to each relative link in the HTML page.


Example: We imported this page in a previous example: https://en.wikipedia.org/wiki/Constitution_of_Oklahoma

Before "Condition HTML"'

  • No CSS styling.
  • Image links are broken.


"Condition HTML" configuration:

  • Site URL = https://en.wikipedia.org
  • This prepends this URL to any relative links on the page.
  • For example, a relative link image path might be /wiki/graphics/image.png.
    • Its full path would be https://en.wikipedia.org/wiki/graphics/image.png.
    • The Site URL configuration converts /wiki/graphics/image.png to https://en.wikipedia.org/wiki/graphics/image.png in the HTML text.


After "Condition HTML"

  • CSS styling is present. Fonts, font size, element sizes and other page styling affected by CSS style sheets are implemented.
  • Image links are present. Images show up on screen.


2. It can isolate and remove unwanted HTML elements.

One use case for HTTP Import is importing web content that can be added to an AI Assistant's knowledge resources. Depending on the website, there may be all kinds of "junk" you don't want to feed the AI Assistant: navigation boxes, ad banners, headers and footers, etc.

Condition HTML can use CSS selectors to select the elements you want to keep and remove the ones you want to discard.

  • This is accomplished by configuring the "Body Selector" and/or "Removal Selector" properties.
  • Body Selector: Enter a CSS selector that matches the main text content you want to keep.
    • When set, this replaces the <body> HTML element with the content of the selected element.
    • Example: This is an easy way to remove <header> and <footer> elements in scenarios where the HTML <body> element contains <header>, <main>, and <footer> elements.
      • For the "Body Selector" enter "main".
      • This will replace the <body> element with the content of the <main> element. Effectively, this removes the <header> and <footer> elements.
  • Removal Selector: Enter a CSS selector that matches one or more elements you wish to remove.
    • This can be used to remove unnecessary or repetitive content from the HTML.
    • Examples: Navigation elements, HTML elements containing ads, HTML elements containing repetitive text that does not help inform an AI Assistant's knowledge base.
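The sketch below illustrates both properties on a hypothetical page (the markup is made up; real pages will differ). With Body Selector = main and Removal Selector = table.navbox, the <header>, <footer>, and navigation table would all be dropped:

  Before Condition HTML:
    <body>
      <header>Site banner and menus</header>
      <main>
        <p>The article text we want to keep.</p>
        <table class="navbox">Links to every other article</table>
      </main>
      <footer>Copyright and contact links</footer>
    </body>

  After Condition HTML (Body Selector = main, Removal Selector = table.navbox):
    <body>
      <p>The article text we want to keep.</p>
    </body>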


Body Selector Example: In an earlier example we imported several pages from ballotpedia.org.

The main text content we want is in a specific HTML element. The following selector will select what we need and get rid of most of what we don't want: div#content

  • Hint: Use your browser's developer tools to locate the element you want to select. That is how we knew div#content is the selector that works best for this webpage, as seen in the image on the left.


Before "Condition HTML":

There are several unnecessary HTML elements cluttering the page.


"Condition HTML" configuration:

  • Body Selector = div#content
  • This will replace the <body> element in the HTML with the content of the selected element (the <div> element whose id="content").


After "Condition HTML":

Many unnecessary elements are gone (notably the header element up at the top of the page). This is now more streamlined for consumption by an AI Assistant.

However, there are still more HTML elements that can be removed. Continue reading for an example of Condition HTML's "Removal Selector" feature.



Removal Selector Example: Continuing with the example above, where we imported several pages from ballotpedia.org.

Condition HTML's "Removal Selector" can be used on its own or in combination with the Body Selector to further remove unwanted elements. The following selector will remove most of the remaining HTML elements that are not unique text on the web page: table.infobox, table.navbox

  • You can enter multiple selectors by separating them with a comma.
  • Hint: Use your browser's developer tools to locate the elements you want to remove. We knew we did not want the navigation box linking to all the articles, shown in the image on the left. How did we know table.infobox was the right selector? By locating the element with the developer tools, as seen in the image on the left.


Before "Condition HTML":

There are several unnecessary HTML elements cluttering the page.

Removal Selectors are often used to get rid of repetitive content. In this scenario, the "Articles" navigation panel is located on every web page we imported. We don't need this panel; we are not using it to navigate a website. Furthermore, we imported these pages with the end goal of creating an AI Assistant and adding them to its knowledge resources. The text in this panel does not contain any worthwhile information to feed the AI Assistant.


"Condition HTML" configuration:

  • Removal Selector = table.infobox, table.navbox
  • This will remove the HTML elements that match the table.infobox and table.navbox selectors.


After "Condition HTML":

The HTML elements matched by the "Removal Selector" are removed. In the image on the left, the web page's Articles panel has been removed.