HTTP Import (Import Provider)

From Grooper Wiki

Revision as of 14:48, 16 June 2025

This article is about the current version of Grooper.

Note that some content may still need to be updated.

2025

Under construction. Check back soon for more info!

About

HTTP Import is used to import web content into Grooper Batches. It can be used to import the following from an HTTP server:

  • Individual web pages (HTML documents)
  • Files hosted on web servers (such as PDFs)
  • Entire websites

How does it work?

HTTP Import brings in one or more web pages, depending on how the provider is configured. The configuration determines how Grooper navigates pages on the website. One Grooper document is created for each distinct URL. Each web page is imported as a Batch Folder with an HTML file as its primary attachment. For URLs that resolve to files (such as PDFs), a Batch Folder is created with that file as its primary attachment.
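To make the "one document per distinct URL" rule concrete, here is a minimal Python sketch. It is not Grooper code. The `PlannedDocument` class and `plan_documents` function are hypothetical names, and classifying a URL by its Content-Type header is an assumption made for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PlannedDocument:
    url: str
    attachment_kind: str  # "html" for web pages, "file" for anything else


def plan_documents(fetched):
    """Return one PlannedDocument per distinct URL, in first-seen order.

    `fetched` is an iterable of (url, content_type) pairs.
    Using the content type to tell pages from files is an assumption.
    """
    seen = set()
    planned = []
    for url, content_type in fetched:
        if url in seen:
            continue  # a URL reached twice still yields a single document
        seen.add(url)
        media_type = content_type.split(";")[0].strip().lower()
        kind = "html" if media_type == "text/html" else "file"
        planned.append(PlannedDocument(url, kind))
    return planned


# Hypothetical URLs, used only to exercise the sketch.
fetched = [
    ("https://example.com/guide", "text/html; charset=utf-8"),
    ("https://example.com/forms/report.pdf", "application/pdf"),
    ("https://example.com/guide", "text/html; charset=utf-8"),
]
for doc in plan_documents(fetched):
    print(doc.url, "->", doc.attachment_kind)
```

The repeated URL produces only one planned document, which mirrors the behavior described above.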

How is it configured?

Configuring HTTP Import involves setting a "Source". The source can be either of the following:

  • A single web page.
  • The root of a "web app".
    • Commonly, the root of a website is also the root of a web app. However, a single website may also host multiple web apps as subsites.

When a web app's root is defined, one or more relative URLs are added to "Relative Page URLs" to specify which pages to include in the import. In addition, when a "Link Selector" is configured, HTTP Import will traverse links on a web page to import the linked pages.
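As a rough illustration of how a web app root and a list of relative page URLs combine into full addresses, the following Python sketch uses the standard library's `urljoin`. The `resolve_page_urls` function and the second page name are hypothetical, and Grooper's own resolution rules may differ.

```python
from urllib.parse import urljoin


def resolve_page_urls(web_app_root, relative_page_urls):
    """Join each relative page URL onto the web app root.

    A trailing slash is added to the root and leading slashes are stripped
    from each relative URL so that pages resolve under the root itself
    rather than under the host. This is an assumption, not Grooper's rule.
    """
    base = web_app_root if web_app_root.endswith("/") else web_app_root + "/"
    return [urljoin(base, rel.lstrip("/")) for rel in relative_page_urls]


pages = resolve_page_urls(
    "https://en.wikipedia.org/wiki",
    ["Constitution_of_Oklahoma", "/Governor_of_Oklahoma"],  # second entry is illustrative
)
for page in pages:
    print(page)
```

Both entries resolve beneath `https://en.wikipedia.org/wiki/`, which is the behavior you would want when the web app is a subsite rather than the whole domain.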

How To

These how-to instructions will demonstrate the basics of HTTP content ingestion. We will show you how to use HTTP Import for the following scenarios:

  • How to import a single web page with HTTP Import.
  • How to import multiple site pages with HTTP Import in two ways:
    • Using "Relative Page URLs" only - This is useful if you have a known list of pages you want to import from a single domain.
    • Using "Link Selectors" - This is useful if you want to traverse links from a starting webpage (or webpages) and import each linked page.

We will import pages related to the Oklahoma state constitution to demonstrate each of these scenarios.
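The idea behind traversing links from a starting page can be sketched in Python with the standard library's HTML parser. This is only a conceptual stand-in: the `linked_pages` function is hypothetical, and its simple URL-prefix filter is an assumption that does not represent how a Grooper "Link Selector" is actually defined.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag in a document."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def linked_pages(page_url, html_text, url_prefix):
    """Return distinct absolute URLs linked from the page that start with url_prefix."""
    collector = LinkCollector()
    collector.feed(html_text)
    pages = []
    for href in collector.hrefs:
        # Resolve relative links and drop any #fragment so that
        # in-page anchors do not count as separate pages.
        absolute, _fragment = urldefrag(urljoin(page_url, href))
        if absolute == page_url or absolute in pages:
            continue
        if absolute.startswith(url_prefix):
            pages.append(absolute)
    return pages


# A small hand-written HTML sample; the linked page names are illustrative.
start = "https://en.wikipedia.org/wiki/Constitution_of_Oklahoma"
sample_html = """
<a href="/wiki/Governor_of_Oklahoma">Governor</a>
<a href="#History">History</a>
<a href="https://example.org/elsewhere">External</a>
<a href="/wiki/Governor_of_Oklahoma#Powers">Powers</a>
"""
found = linked_pages(start, sample_html, "https://en.wikipedia.org/wiki/")
print(found)
```

In-page anchors, external links, and repeated targets are filtered out, leaving one entry per linked page.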

Import a single web page with HTTP Import

The simplest HTTP Import configuration imports a single page. This is rarely practical on its own, but it shows the basics of how HTTP Import works.

  • This tutorial shows you how to use HTTP Import for user-directed imports from the Imports page. However, just like any other import provider, you could perform scheduled imports with HTTP Import using an Import Watcher (if your use case demands it).

To import a single web page with HTTP Import, do the following:

  1. Go to the Imports page.
  2. Press the "New Import Job" button to configure a new Import Job.
  3. This brings up the "Submit Import Job" editor. In the "Description" property, enter a brief description for the Import Job.
  4. Select the "Provider" property to choose the Import Provider. Using the dropdown list, choose "HTTP Import".
  5. Expand the "Provider" property.
  6. Open the "Sources" editor (press the "..." button) to add one or more HTTP Resources.
  7. Press the "Add" button to add a new HTTP Resource.
  8. To import a single web page, enter that web page's address in the "URL" property.
    • For example, enter https://en.wikipedia.org/wiki/Constitution_of_Oklahoma to import the Wikipedia entry for the Oklahoma State Constitution.
  9. In the "Description" field, enter a brief description for the HTTP Resource. This is a required field.
  10. Press the "OK" button when finished configuring the HTTP Resource.
  11. Configure the rest of the Provider settings as needed. These include the settings common to all Import Providers, including the "Batch Creation" settings where users select which Batch Process to use.
    • The "Wait Time" property controls how much time to wait before each page is imported. This property should not be necessary when importing a single web page.
  12. Press the "Submit" button to submit the Import Job.
  13. Assuming you have an Import Watcher service running, the web page will be imported accordingly.
    • HTML documents are just HTML text, which a web browser renders on screen. This includes text that links to images and CSS style sheets. If the web page uses relative links to these resources (most do), it will look "weird" in the Grooper Document Viewer. The "Condition HTML" command can help here. See the section on conditioning HTML documents below for more information.
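To see why relative resource links matter once a page is saved away from its original address, the Python sketch below lists each image, style sheet, and script reference in a page alongside the absolute URL it would resolve to in a browser. It does not describe what the "Condition HTML" command does. The `ResourceLinkFinder` class and the sample page are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ResourceLinkFinder(HTMLParser):
    """Record resource references and the absolute URLs they point to."""

    # Tag name -> attribute that holds the resource address.
    RESOURCE_ATTRS = {"img": "src", "link": "href", "script": "src"}

    def __init__(self, page_url):
        super().__init__()
        self.page_url = page_url
        self.found = []  # list of (as_written, absolute_url) pairs

    def handle_starttag(self, tag, attrs):
        attr_name = self.RESOURCE_ATTRS.get(tag)
        if attr_name is None:
            return
        value = dict(attrs).get(attr_name)
        if value:
            self.found.append((value, urljoin(self.page_url, value)))


# Hypothetical page address and markup, for illustration only.
finder = ResourceLinkFinder("https://example.com/docs/constitution.html")
finder.feed(
    '<link rel="stylesheet" href="/static/site.css">'
    '<img src="images/seal.png">'
)
for as_written, absolute in finder.found:
    print(as_written, "->", absolute)
```

A browser resolves the short forms against the page's address. A stored copy of the HTML no longer has that address as context, so the short forms alone do not say where the image or style sheet lives.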

Import multiple site pages with HTTP Import

Using Relative Page URLs

Using Link Selectors

Batch Process considerations

Sparse import and parallel loading

Conditioning HTML documents