HTTP Import (Import Provider)

This article is about the current version of Grooper (2025). Note that some content may still need to be updated.

About

HTTP Import is used to import web content into Grooper Batches. It can be used to import the following from an HTTP server:

  • Individual web pages (HTML documents)
  • Files hosted on web servers, including PDFs hosted on websites.
  • Entire websites

How does it work?

HTTP Import will bring in one or more web pages based on how the provider is configured. This configuration determines how Grooper navigates pages on the website. One Grooper document is created for each distinct URL: each web page is imported as a Batch Folder with an HTML file as its primary attachment. For URLs that resolve to files (such as PDFs), a Batch Folder is created with the file itself as its primary attachment.
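
To make the URL-to-document mapping concrete, here is a minimal sketch of that behavior. It is written in Python with the third-party requests library and is not Grooper's API; the content-type check is an assumption about how a URL might be classified as a page or a file.

```python
# Illustrative sketch only -- not Grooper code. Each distinct URL becomes one
# "document"; the Content-Type header decides whether it acts like an HTML
# page or a hosted file (e.g. a PDF).
import requests

def classify_url(url: str) -> str:
    """Return 'page' for HTML content, 'file' for anything else."""
    response = requests.head(url, allow_redirects=True, timeout=30)
    content_type = response.headers.get("Content-Type", "")
    return "page" if "text/html" in content_type else "file"

for url in ["https://en.wikipedia.org/wiki/Constitution_of_Oklahoma"]:
    kind = classify_url(url)
    # In Grooper terms: a Batch Folder is created either way. For a 'page',
    # the primary attachment is an HTML file; for a 'file', it is the file.
    print(f"{url} -> Batch Folder, imported as a {kind}")
```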

How is it configured?

The HTTP Import configuration involves setting a "Source". This source can be:

  • A single web page.
  • Or, the root of a "web app"
    • Commonly, the root of a website is also the root of a web app. However, a single website may also host multiple web apps as subsites.

When a web app's root is defined, one or more relative URLs are added to the "Relative Page URLs" to specify which pages to include in the import. Furthermore, HTTP Import will traverse links on a web page to import linked pages when a "Link Selector" is configured.
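
As a mental model only, the source configuration described above could be sketched like this (hypothetical Python classes, not Grooper's actual object model):

```python
# Hypothetical data model mirroring the configuration described above.
from dataclasses import dataclass, field

@dataclass
class HttpResource:
    url: str                                # a single page, or a web app root
    relative_page_urls: list[str] = field(default_factory=list)
    link_selector: str | None = None        # CSS selector; enables traversal

# A single page: the URL is the whole configuration.
single_page = HttpResource(url="https://en.wikipedia.org/wiki/Constitution_of_Oklahoma")

# A web app root: relative URLs pick the starting pages, and a link
# selector tells the import which links to follow from there.
web_app = HttpResource(
    url="https://ballotpedia.org",
    relative_page_urls=["Oklahoma_Constitution"],
    link_selector="a",
)
```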

How To

These how-to instructions will demonstrate the basics of HTTP content ingestion. We will show you how to use HTTP Import for the following scenarios:

  • How to import a single web page with HTTP Import.
  • How to import multiple site pages with HTTP Import in two ways:
    • Using "Relative Page URLs" only - This is useful if you have a known list of pages you want to import from a single domain.
    • Using "Link Selectors" - This is useful if you want to traverse links from a starting webpage (or webpages) and import each linked page.

We will import pages related to the Oklahoma state constitution to demonstrate each of these scenarios.

Import a single web page with HTTP Import

The easiest way to configure HTTP Import is to import a single page. Importing one page is rarely practical on its own, but it demonstrates the basics of how HTTP Import works.

  • This tutorial shows you how to use HTTP Import for user directed imports from the Imports page. However, just like any other import provider, you could perform scheduled imports with HTTP Import using an Import Watcher (if your use case demands it).

To import a single web page with HTTP Import do the following:

  1. Go to the Imports page.
  2. Press the "New Import Job" button to configure a new Import Job.
  3. This brings up the "Submit Import Job" editor. In the "Description" property, enter a brief description for the Import Job.
  4. Select the "Provider" property to choose the Import Provider. Using the dropdown list, choose "HTTP Import".
  5. Expand the "Provider" property.
  6. Open the "Sources" editor (press the "..." button) to add one or more HTTP Resources.
  7. Press the "Add" button to add a new HTTP Resource.
  8. To import a single web page, enter that web page's address in the "URL" property.
    • For example: enter https://en.wikipedia.org/wiki/Constitution_of_Oklahoma to import the Wikipedia entry for the Oklahoma State Constitution.
  9. In the "Description" field, you must enter a brief description for the HTTP Resource. This is a required field.
  10. Press "OK" in the Sources editor when finished configuring the HTTP Resource.
  11. Configure the rest of the Provider settings as needed. These include the settings common to all Import Providers, including the "Batch Creation" settings where users select which Batch Process to use.
    • The "Wait Time" property controls how much time to wait before each page is imported. This property should not be necessary when importing a single web page.
  12. Press the "Submit" button to submit the Import Job.
  13. Assuming you have an Import Watcher service running, the webpage will be imported accordingly.
    • HTML documents are just HTML text. That text gets rendered on screen in a web browser. This includes text that links to images and CSS style sheets. If the webpage uses relative links to link to these resources (most do), it will look "weird" in the Grooper Document Viewer. The "Condition HTML" command can help here. See the section on using Condition HTML below for more information.

Import multiple site pages with HTTP Import

If you're using HTTP Import, it is more likely you want to import several web pages: either a known list of pages in a website's domain, or several pages linked throughout the domain (possibly the entire domain).

  • To import a known list of pages in a domain, you will need to configure the "Relative Page URLs" property, adding one Relative Page URL for each page you want to import.
  • To import pages linked throughout a domain, you will add one or more "Link Selectors" to determine which links to follow (and which links not to follow) when importing pages.

We will show you both scenarios using the same starting point: The Oklahoma State Constitution on ballotpedia.org.

Using Relative Page URLs only

If you know exactly what pages you want to import, you can configure HTTP Import to import that list of pages by adding them to the list of "Relative Page URLs". A "relative page URL" is simply the path relative to the website domain.

For example: If https://www.example.com/about/contact.html is the full web address:
  • https://www.example.com is the domain.
  • about/contact.html is the page's relative path.
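
The same split can be demonstrated with Python's standard-library urljoin, purely as an illustration (the example.com addresses are placeholders):

```python
# Joining a domain and a relative path back into a full address.
from urllib.parse import urljoin

domain = "https://www.example.com"
relative_path = "about/contact.html"
print(urljoin(domain + "/", relative_path))
# -> https://www.example.com/about/contact.html
```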

To import from a known list of pages, using "Relative Page URLs" do the following:

  1. Go to the Imports page.
  2. Press the "New Import Job" button to configure a new Import Job.
  3. This brings up the "Submit Import Job" editor. In the "Description" property, enter a brief description for the Import Job.
  4. Select the "Provider" property to choose the Import Provider. Using the dropdown list, choose "HTTP Import".
  5. Expand the "Provider" property.
  6. Open the "Sources" editor (press the "..." button) to add one or more HTTP Resources.
  7. Press the "Add" button to add a new HTTP Resource.
  8. In the "URL" property, enter the website's domain.
    • Example: We want to import the Oklahoma State Constitution hosted on webpages at ballotpedia.org. We would enter https://ballotpedia.org.
  9. Open the "Relative Page URLs" editor (Press the "..." button).
  10. Enter the relative paths for each page you wish to import (relative to the domain entered in the "URL").
    • Example: We want to import the Oklahoma State Constitution hosted on ballotpedia.org. We would enter the following list of relative URLs:
Oklahoma_Constitution
Preamble,_Oklahoma_Constitution
Article_I,_Oklahoma_Constitution
Article_II,_Oklahoma_Constitution
Article_III,_Oklahoma_Constitution
Article_IV,_Oklahoma_Constitution
Article_V,_Oklahoma_Constitution
Article_VI,_Oklahoma_Constitution
Article_VII,_Oklahoma_Constitution
Article_VIIA,_Oklahoma_Constitution
Article_VIIB,_Oklahoma_Constitution
Article_VIII,_Oklahoma_Constitution
Article_IX,_Oklahoma_Constitution
Article_X,_Oklahoma_Constitution
Article_XI,_Oklahoma_Constitution
Article_XII,_Oklahoma_Constitution
Article_XIIA,_Oklahoma_Constitution
Article_XIII,_Oklahoma_Constitution
Article_XIIIA,_Oklahoma_Constitution
Article_XIIIB,_Oklahoma_Constitution
Article_XIV,_Oklahoma_Constitution
Article_XV,_Oklahoma_Constitution
Article_XVI,_Oklahoma_Constitution
Article_XVII,_Oklahoma_Constitution
Article_XVIII,_Oklahoma_Constitution
Article_XIX,_Oklahoma_Constitution
Article_XX,_Oklahoma_Constitution
Article_XXI,_Oklahoma_Constitution
Article_XXII,_Oklahoma_Constitution
Article_XXIII,_Oklahoma_Constitution
Article_XXIV,_Oklahoma_Constitution
Article_XXV,_Oklahoma_Constitution
Article_XXV-A,_Oklahoma_Constitution
Article_XXVI,_Oklahoma_Constitution
Article_XXVIIIA,_Oklahoma_Constitution
Article_XXVIII,_Oklahoma_Constitution
Article_XXIX,_Oklahoma_Constitution
Article_XXX,_Oklahoma_Constitution
Schedule,_Oklahoma_Constitution
  1. Press "OK" in the Relative Page URLs editor when finished.
  2. In the "Description" field, you must enter a brief description for the HTTP Resource. This is a required field.
  3. Press "OK" in the Sources editor when finished configuring the HTTP Resource.
  4. Configure the rest of the Provider settings as needed. These include the settings common to all Import Providers, including the "Batch Creation" settings where users select which Batch Process to use.
    • The "Wait Time" property controls how much time to wait before each page is imported. This property should not be necessary when importing a single web page.
  5. Press the "Submit" button to submit the Import Job.
  6. Assuming you have an Import Watcher service running, one webpage for each relative URL in the "Relative Page URLs" list be imported accordingly.
    • HTML documents are just HTML text. That text gets rendered on screen in a web browser. This includes text that links to images and CSS style sheets. If the webpage uses relative links to link to these resources (most do), it will look "weird" in the Grooper Document Viewer. The "Condition HTML" command can help here. See the section on using Condition HTML below for more information.
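
For illustration, the behavior configured above amounts to roughly the following sketch (Python with the third-party requests library; this is not Grooper's implementation):

```python
# One fetch per relative path, each becoming one imported document.
import requests

DOMAIN = "https://ballotpedia.org"
RELATIVE_PAGE_URLS = [
    "Oklahoma_Constitution",
    "Preamble,_Oklahoma_Constitution",
    "Article_I,_Oklahoma_Constitution",
    # ... one entry per page, as listed in step 10 above
]

for relative in RELATIVE_PAGE_URLS:
    url = f"{DOMAIN}/{relative}"
    html = requests.get(url, timeout=30).text
    print(f"imported {url} ({len(html)} bytes of HTML)")
```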

Using Link Selectors

Link Selectors let you start at one or more web pages and import the pages linked from them. This is a great way to import a large number of pages from a website. It is done by adding one or more "Link Selectors" to the HTTP Resource.

  • Link Selectors use CSS selectors to select the links on a page you want to follow. Grooper will open each link matched by the selector and import the page.
    • The most common selector is simply a.
    • "Exclusion Selectors" may optionally be configured to exclude pages whose links you don't want Grooper to follow.
  • Grooper will crawl an entire hierarchy of pages when the Link Selector's "Recursive" property is set to "True".
  • You can also further filter pages you want to include/exclude from import using the "Included URL Pattern" and "Excluded URL Pattern" properties.
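
As a rough sketch of the crawl behavior these properties describe (Python with the third-party requests and beautifulsoup4 packages; this is not Grooper's implementation, and the regex details may differ from Grooper's pattern engine):

```python
# Follow links matched by a CSS selector, optionally recursing, and filter
# candidate URLs with include/exclude regular expressions.
import re
import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup

def crawl(start_url, selector="a", include=None, exclude=None,
          recursive=True, seen=None):
    seen = seen if seen is not None else set()
    if start_url in seen:
        return seen
    seen.add(start_url)
    page = requests.get(start_url, timeout=30).text
    for tag in BeautifulSoup(page, "html.parser").select(selector):
        href = tag.get("href")
        if not href:
            continue
        url = urljoin(start_url, href)          # resolve relative links
        if include and not re.search(include, url):
            continue                            # Included URL Pattern
        if exclude and re.search(exclude, url):
            continue                            # Excluded URL Pattern
        if recursive:
            crawl(url, selector, include, exclude, recursive, seen)
        else:
            seen.add(url)
    return seen

pages = crawl("https://ballotpedia.org/Oklahoma_Constitution",
              include=r"Oklahoma_Constitution$")
print(f"{len(pages)} pages would be imported")
```

Note how the include pattern keeps a recursive crawl from wandering across the whole site, which mirrors the caution about "Recursive" in the steps below.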


To import using Link Selectors, do the following:

  1. Go to the Imports page.
  2. Press the "New Import Job" button to configure a new Import Job.
  3. This brings up the "Submit Import Job" editor. In the "Description" property, enter a brief description for the Import Job.
  4. Select the "Provider" property to choose the Import Provider. Using the dropdown list, choose "HTTP Import".
  5. Expand the "Provider" property.
  6. Open the "Sources" editor (press the "..." button) to add one or more HTTP Resources.
  7. Press the "Add" button to add a new HTTP Resource.
  8. In the "URL" property, enter the website's domain.
    • Example: We want to import the Oklahoma State Constitution hosted on webpages at ballotpedia.org. We would enter https://ballotpedia.org.
  9. Open the "Relative Page URLs" editor (Press the "..." button).
  10. You must enter at least one relative page path (relative to the domain entered in the "URL").
    • Think of this as the "starting point" for the import. Starting on this page, you should be able to follow links to each page you want to import.
      • Multiple relative page URLs may be entered here.
      • When the Link Selector's "Recursive" property is set to "True", Grooper will also continue to follow links found on any page linked from these relative URLs as well.
      • Example: We want to import the Oklahoma State Constitution hosted on ballotpedia.org. The webpage https://ballotpedia.org/Oklahoma_Constitution acts as a table of contents for the rest of the pages detailing each article in the constitution. They are all linked from this one page. We would enter Oklahoma_Constitution in the "Relative Page URLs" editor.
  11. Press "OK" in the Relative Page URLs editor when finished.
  12. Open the "Link Selectors" editor (Press the "..." button).
  13. Press the "+" button to add a new Hyperlink Selector.
  14. In the "Selector" property, enter a CSS selector that matches the links Grooper should follow.
    • a is the most commonly used selector.
  15. Configure the remaining properties as needed. Of note are the following properties:
    • Recursive: If you want Grooper to continue opening links on every subsequent page it opens, enable the "Recursive" property. Grooper will continue opening pages until no more matching links are found.
      • Be aware, this can cause Grooper to import large numbers of web pages. Use with caution.
    • Included URL Pattern: Enabling this allows you to further restrict the webpages Grooper imports. It allows users to enter a regular expression pattern. The page will only be imported if the regex matches the URL.
      • Example: We want to import the pages on ballotpedia.org for each article in the Oklahoma Constitution. The URLs for the pages we want to import all end in "Oklahoma_Constitution". We could enable the Included URL Pattern and use the following pattern: Oklahoma_Constitution$. This further ensures we only import the pages we want (a quick way to test such patterns is shown after these steps).
    • Excluded URL Pattern: Enabling this allows you to further restrict the webpages Grooper imports. It allows users to enter a regular expression pattern. If the URL matches this regex, it will be excluded from import.
  16. In the "Description" field, you must enter a brief description for the HTTP Resource. This is a required field.
  17. Press "OK" in the Sources editor when finished configuring the HTTP Resource.
  18. Configure the rest of the Provider settings as needed. These include the settings common to all Import Providers, including the "Batch Creation" settings where users select which Batch Process to use.
    • The "Wait Time" property controls how much time to wait before each page is imported. This property should not be necessary when importing a single web page.
  19. Press the "Submit" button to submit the Import Job.
  20. Assuming you have an Import Watcher service running, one webpage for each link opened will be imported accordingly.
    • HTML documents are just HTML text. That text gets rendered on screen in a web browser. This includes text that links to images and CSS style sheets. If the webpage uses relative links to link to these resources (most do), it will look "weird" in the Grooper Document Viewer. The "Condition HTML" command can help here. See the section on using Condition HTML below for more information.
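
If you want to experiment with an Included or Excluded URL Pattern before submitting the job, the pattern from step 15 can be tested with Python's standard re module (note that Grooper's regex engine may differ in minor details):

```python
import re

pattern = r"Oklahoma_Constitution$"
print(bool(re.search(pattern, "https://ballotpedia.org/Article_I,_Oklahoma_Constitution")))  # True
print(bool(re.search(pattern, "https://ballotpedia.org/Main_Page")))                         # False
```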

Batch Process considerations

When importing HTML documents into Grooper, there are several commands you may want to execute in a Batch Process to better prepare them for your end goal. These include:

  • HTTP Link - Load Content - Use this command when performing a sparse HTTP Import to load each HTML document in parallel.
    • Sparse imports are enabled by setting the "Sparse Import" property to "True".
    • Imports always run "single threaded" in Grooper. By executing this command in a Batch Process, the load operation (fully copying the document to the Grooper Repository) can be executed by multiple processing threads.
    • To do this, the first step in your Batch Process should be "Execute" with the "HTTP Link - Load Content" command added and configured (see the sketch after this list).
  • HTML Document - Condition HTML - This command has two purposes: (1) It can make the HTML document more human-readable in the Document Viewer and (2) it can isolate and remove unwanted HTML elements. See more in the section below.
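
The parallelism point is easiest to see in a sketch. Below, a Python thread pool (with the third-party requests library) stands in for the multiple processing threads that would execute the load operation; it is an analogy, not the actual "HTTP Link - Load Content" activity:

```python
# Sparse import records the links quickly; the full downloads then run on a
# pool of worker threads instead of the single import thread.
from concurrent.futures import ThreadPoolExecutor
import requests

urls = [
    "https://ballotpedia.org/Oklahoma_Constitution",
    "https://ballotpedia.org/Preamble,_Oklahoma_Constitution",
]

def load_content(url: str) -> int:
    """Fully download one document (the per-link 'load' step)."""
    return len(requests.get(url, timeout=30).content)

with ThreadPoolExecutor(max_workers=4) as pool:
    for url, size in zip(urls, pool.map(load_content, urls)):
        print(f"loaded {url}: {size} bytes")
```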

The Condition HTML command

The Condition HTML command has two purposes:

1. It can make the HTML document more human-readable in the Document Viewer.

Many (if not most) web pages use "relative links" to reference resources such as images and CSS style sheets. The link is relative to the website's domain. When Grooper imports an HTML document, these relative paths are effectively broken, making the web page "look weird". For the page to be more human-readable, these relative links need to be replaced with absolute links.

Condition HTML can prepend a domain URL to relative links, making them absolute links. This will resolve those broken links and the web page will look like the user expects it to in the Grooper Document Viewer.

  • This is done by configuring the "Site URL" property.
  • In "Site URL", enter a URL that will be prepended to each relative link in the HTML page.


Example: We imported this page in a previous example: https://en.wikipedia.org/wiki/Constitution_of_Oklahoma

Before "Condition HTML"

  • No CSS styling.
  • Image links are broken.


"Condition HTML" configuration:

  • Site URL = https://en.wikipedia.org
  • This URL is prepended to any relative links on the page.
  • For example, a relative image link might be "/wiki/graphics/image.png".
    • Its full path would be "https://en.wikipedia.org/wiki/graphics/image.png".
    • The Site URL configuration converts "/wiki/graphics/image.png" to "https://en.wikipedia.org/wiki/graphics/image.png" in the HTML text.


After "Condition HTML"

  • CSS styling is present.
  • Image links are resolved, and images show up on screen.




2. It can isolate and remove unwanted HTML elements.

One use case for HTTP Import is importing web content that can be added to an AI Assistant's knowledge resources. Depending on the website, there may be all kinds of "junk" you don't want to feed the AI Assistant: navigation boxes, ad banners, headers and footers, etc.

Condition HTML can use CSS selectors to select the elements you want to keep and remove the ones you want to discard.
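
As a sketch of the idea (using the third-party beautifulsoup4 package; the selectors and sample HTML here are hypothetical, not the command's actual options):

```python
# Drop unwanted elements by CSS selector before the HTML is used as a
# knowledge resource; keep everything else.
from bs4 import BeautifulSoup

html = """
<html><body>
  <nav class="navbox">site navigation</nav>
  <div id="content"><p>The article text worth keeping.</p></div>
  <footer>footer links</footer>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")
for junk in soup.select("nav, footer, .navbox"):   # elements to discard
    junk.decompose()

print(soup.get_text(strip=True))
# -> The article text worth keeping.
```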