HTTP Import (Import Provider)

From Grooper Wiki

Revision as of 15:18, 16 June 2025

This article is about the current version of Grooper.

Note that some content may still need to be updated.

2025

Under construction. Check back soon for more info!

About

HTTP Import is used to import web content into Grooper Batches. It can be used to import the following from an HTTP server:

  • Individual web pages (HTML documents)
  • Files hosted on web servers, including PDFs hosted on websites.
  • Entire websites

How does it work?

HTTP Import brings in one or more web pages, depending on how the provider is configured. The configuration determines how Grooper navigates pages on the website. One Grooper document is created for each distinct URL. Each web page is imported as a Batch Folder with an HTML file as its primary attachment. For URLs that resolve to files (such as PDFs), a Batch Folder is created with that file as its primary attachment.

How is it configured?

The HTTP Import configuration involves setting a "Source". This source can be:

  • A single web page.
  • Or, the root of a "web app"
    • Commonly, the root of a website is also the root of a web app. However, a single website may also host multiple web apps as subsites.

When a web app's root is defined, one or more relative URLs are added to the "Relative Page URLs" property to specify which pages to include in the import. In addition, when a "Link Selector" is configured, HTTP Import will traverse links on a web page to import the linked pages.
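
The link traversal described above can be pictured with a minimal Python sketch. It is not Grooper's implementation: the `should_follow` filter is a hypothetical stand-in for a Link Selector, and the example.com and example.org addresses are placeholders. It only shows the general idea of collecting a page's links, resolving them to full addresses, and keeping the ones a rule allows.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag in an HTML document."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def linked_pages(page_url, html, should_follow):
    """Return the absolute URLs of links on a page that pass the filter.

    `should_follow` is a hypothetical stand-in for a Link Selector.
    """
    parser = LinkCollector()
    parser.feed(html)
    found = []
    for href in parser.hrefs:
        absolute = urljoin(page_url, href)  # resolve relative links
        if should_follow(absolute) and absolute not in found:
            found.append(absolute)
    return found


root = "https://www.example.com"
sample_html = (
    '<a href="/about/team.html">Team</a> '
    '<a href="https://other.example.org/">Other site</a>'
)

# Follow only links that stay on the same host as the web app root.
same_host = lambda url: urlsplit(url).netloc == urlsplit(root).netloc

links = linked_pages(root + "/index.html", sample_html, same_host)
print(links)
```

Here the link to the other host is filtered out, so only the team page would be queued for import.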

How To

These how-to instructions will demonstrate the basics of HTTP content ingestion. We will show you how to use HTTP Import for the following scenarios:

  • How to import a single web page with HTTP Import.
  • How to import multiple site pages with HTTP Import in two ways:
    • Using "Relative Page URLs" only - This is useful if you have a known list of pages you want to import from a single domain.
    • Using "Link Selectors" - This is useful if you want to traverse links from a starting webpage (or webpages) and import each linked page.

We will import pages related to the Oklahoma state constitution to demonstrate each of these scenarios.

Import a single web page with HTTP Import

The easiest way to configure HTTP Import is to import a single page. This is rarely practical, but it shows the basics of how HTTP Import works.

  • This tutorial shows you how to use HTTP Import for user-directed imports from the Imports page. However, just like any other import provider, you could perform scheduled imports with HTTP Import using an Import Watcher (if your use case demands it).

To import a single web page with HTTP Import, do the following:

  1. Go to the Imports page.
  2. Press the "New Import Job" button to configure a new Import Job.
  3. This brings up the "Submit Import Job" editor. In the "Description" property, enter a brief description for the Import Job.
  4. Select the "Provider" property to choose the Import Provider. Using the dropdown list, choose "HTTP Import".
  5. Expand the "Provider" property.
  6. Open the "Sources" editor (press the "..." button) to add one or more HTTP Resources.
  7. Press the "Add" button to add a new HTTP Resource.
  8. To import a single web page, enter that web page's address in the "URL" property.
    • For example: Enter https://en.wikipedia.org/wiki/Constitution_of_Oklahoma to import the Wikipedia entry for the Oklahoma State Constitution.
  9. In the "Description" field, you must enter a brief description for the HTTP Resource. This is a required field.
  10. Press "OK" in the Sources editor when finished configuring the HTTP Resource.
  11. Configure the rest of the Provider settings as needed. These include the settings common to all Import Providers, including the "Batch Creation" settings where users select which Batch Process to use.
    • The "Wait Time" property controls how much time to wait before each page is imported. This property should not be necessary when importing a single web page.
  12. Press the "Submit" button to submit the Import Job.
  13. Assuming you have an Import Watcher service running, the webpage will be imported accordingly.
    • HTML documents are just HTML text. That text gets rendered on screen in a web browser. This includes text that links to images and CSS style sheets. If the webpage uses relative links to link to these resources (most do), it will look "weird" in the Grooper Document Viewer. The "Condition HTML" command can help here. See the section on using Condition HTML below for more information.
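
To see why relative links matter, the short Python sketch below resolves a few relative references against the page's address. The three references are hypothetical examples, not taken from the actual Wikipedia page, and the sketch does not describe how the "Condition HTML" command works internally. It only shows that a relative link has no meaning without the URL of the page it came from.

```python
from urllib.parse import urljoin

page_url = "https://en.wikipedia.org/wiki/Constitution_of_Oklahoma"

# Hypothetical relative references, as they might appear in a page's HTML.
relative_refs = [
    "/static/site.css",               # relative to the site root
    "images/seal.png",                # relative to the page's folder
    "//upload.example.org/logo.png",  # relative to the page's scheme
]

# Each reference only becomes a usable address once the page URL is known.
resolved = {ref: urljoin(page_url, ref) for ref in relative_refs}

for ref, absolute in resolved.items():
    print(f"{ref} -> {absolute}")
```

A viewer that opens the saved HTML file without knowing `page_url` cannot perform this resolution, which is why images and style sheets can go missing.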

Import multiple site pages with HTTP Import

If you're using HTTP Import, it is more likely you want to import several web pages: either a known list of pages in a website domain, or several pages linked throughout a website domain (possibly the entire domain).

  • To import a known list of pages in a domain, you will need to configure the "Relative Page URLs" property, adding one Relative Page URL for each page you want to import.
  • To import pages linked throughout a domain, you will add one or more "Link Selectors" to determine which links to follow (and which links not to follow) when importing pages.

We will show you both scenarios using the same starting point: The Oklahoma State Constitution on ballotpedia.org.

Using Relative Page URLs

If you know exactly what pages you want to import, you can configure HTTP Import to import that list of pages by adding them to the list of "Relative Page URLs". A "relative page URL" is simply the path relative to the website domain.

For example: If https://www.example.com/about/contact.html is the full web address:
  • https://www.example.com is the domain.
  • about/contact.html is the page's relative path.
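
If you already have a list of full addresses, the same split can be done with a few lines of Python. This is a minimal sketch using the standard library; the helper name `split_page_url` is made up for illustration, and it ignores any query string or fragment in the address.

```python
from urllib.parse import urlsplit


def split_page_url(full_url):
    """Split a full web address into (domain, relative path)."""
    parts = urlsplit(full_url)
    domain = f"{parts.scheme}://{parts.netloc}"
    relative_path = parts.path.lstrip("/")  # drop the leading slash
    return domain, relative_path


domain, relative_path = split_page_url("https://www.example.com/about/contact.html")
print(domain)         # https://www.example.com
print(relative_path)  # about/contact.html
```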

To import from a known list of pages using "Relative Page URLs", do the following:

  1. Go to the Imports page.
  2. Press the "New Import Job" button to configure a new Import Job.
  3. This brings up the "Submit Import Job" editor. In the "Description" property, enter a brief description for the Import Job.
  4. Select the "Provider" property to choose the Import Provider. Using the dropdown list, choose "HTTP Import".
  5. Expand the "Provider" property.
  6. Open the "Sources" editor (press the "..." button) to add one or more HTTP Resources.
  7. Press the "Add" button to add a new HTTP Resource.
  8. In the "URL" property, enter the website's domain.
    • Example: We want to import the Oklahoma State Constitution hosted on webpages at ballotpedia.org. We would enter https://ballotpedia.org.
  9. Open the "Relative Page URLs" editor (Press the "..." button).
  10. Enter the relative paths for each page you wish to import (relative to the domain entered in the "URL").
    • Example: We want to import the Oklahoma State Constitution hosted on ballotpedia.org. We would enter the following list of relative URLs:
Oklahoma_Constitution
Preamble,_Oklahoma_Constitution
Article_I,_Oklahoma_Constitution
Article_II,_Oklahoma_Constitution
Article_III,_Oklahoma_Constitution
Article_IV,_Oklahoma_Constitution
Article_V,_Oklahoma_Constitution
Article_VI,_Oklahoma_Constitution
Article_VII,_Oklahoma_Constitution
Article_VIIA,_Oklahoma_Constitution
Article_VIIB,_Oklahoma_Constitution
Article_VIII,_Oklahoma_Constitution
Article_IX,_Oklahoma_Constitution
Article_X,_Oklahoma_Constitution
Article_XI,_Oklahoma_Constitution
Article_XII,_Oklahoma_Constitution
Article_XIIA,_Oklahoma_Constitution
Article_XIII,_Oklahoma_Constitution
Article_XIIIA,_Oklahoma_Constitution
Article_XIIIB,_Oklahoma_Constitution
Article_XIV,_Oklahoma_Constitution
Article_XV,_Oklahoma_Constitution
Article_XVI,_Oklahoma_Constitution
Article_XVII,_Oklahoma_Constitution
Article_XVIII,_Oklahoma_Constitution
Article_XIX,_Oklahoma_Constitution
Article_XX,_Oklahoma_Constitution
Article_XXI,_Oklahoma_Constitution
Article_XXII,_Oklahoma_Constitution
Article_XXIII,_Oklahoma_Constitution
Article_XXIV,_Oklahoma_Constitution
Article_XXV,_Oklahoma_Constitution
Article_XXV-A,_Oklahoma_Constitution
Article_XXVI,_Oklahoma_Constitution
Article_XXVIIIA,_Oklahoma_Constitution
Article_XXVIII,_Oklahoma_Constitution
Article_XXIX,_Oklahoma_Constitution
Article_XXX,_Oklahoma_Constitution
Schedule,_Oklahoma_Constitution
  11. Press "OK" in the Relative Page URLs editor when finished.
  12. In the "Description" field, you must enter a brief description for the HTTP Resource. This is a required field.
  13. Press "OK" in the "Sources" editor when finished configuring the HTTP Resource.
  14. Configure the rest of the Provider settings as needed. These include the settings common to all Import Providers, including the "Batch Creation" settings where users select which Batch Process to use.
    • The "Wait Time" property controls how much time to wait before each page is imported. This property should not be necessary when importing a single web page.
  15. Press the "Submit" button to submit the Import Job.
  16. Assuming you have an Import Watcher service running, the webpages will be imported accordingly.
    • HTML documents are just HTML text. That text gets rendered on screen in a web browser. This includes text that links to images and CSS style sheets. If the webpage uses relative links to link to these resources (most do), it will look "weird" in the Grooper Document Viewer. The "Condition HTML" command can help here. See the section on using Condition HTML below for more information.
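
As a rough mental model of what this configuration asks for, the Python sketch below combines a domain with a list of relative page URLs, pauses before each page, and produces one entry per distinct URL. It is an illustration under stated assumptions, not Grooper's code: `import_pages`, `fetch`, and `wait_seconds` are hypothetical names, and a stand-in fetcher is used so the sketch runs without network access.

```python
import time
from urllib.parse import urljoin


def import_pages(root, relative_page_urls, fetch, wait_seconds=0.0):
    """Fetch each distinct page once, pausing before each request.

    `fetch` is a caller-supplied function (URL -> content); nothing here
    is Grooper's API. Returns a dict mapping each URL to its content.
    """
    base = root.rstrip("/") + "/"
    documents = {}
    for rel in relative_page_urls:
        url = urljoin(base, rel)
        if url in documents:
            continue  # one document per distinct URL
        time.sleep(wait_seconds)  # analogous to a wait before each page
        documents[url] = fetch(url)
    return documents


# Stand-in fetcher so the sketch runs without network access.
fake_fetch = lambda url: f"<html><!-- content of {url} --></html>"

docs = import_pages(
    "https://ballotpedia.org",
    [
        "Oklahoma_Constitution",
        "Preamble,_Oklahoma_Constitution",
        "Oklahoma_Constitution",  # duplicate, skipped
    ],
    fake_fetch,
)
for url in docs:
    print(url)
```

The duplicate entry is skipped, so the sketch yields two documents for three listed paths.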

Batch Process considerations

Sparse import and parallel loading

Conditioning HTML documents