elm.web.file_loader.AsyncFileLoader

class AsyncFileLoader(header_template=None, verify_ssl=True, aget_kwargs=None, pw_launch_kwargs=None, pdf_read_kwargs=None, html_read_kwargs=None, pdf_read_coroutine=None, html_read_coroutine=None, pdf_ocr_read_coroutine=None, file_cache_coroutine=None, browser_semaphore=None)

Bases: object

Async web file (PDF or HTML) loader

Purpose:

Save content from links as files.

Responsibilities:
  1. Retrieve data from a URL.

  2. Determine whether information should be stored as a PDF or HTML document.

Key Relationships:

Returns either PDFDocument or HTMLDocument. Uses aiohttp to access the web.

Parameters:
  • header_template (dict, optional) – Optional GET header template. If not specified, uses the DEFAULT_HEADER_TEMPLATE defined for this class. By default, None.

  • verify_ssl (bool, optional) – Option to use aiohttp’s default SSL check. If False, SSL certificate validation is skipped. By default, True.

  • aget_kwargs (dict, optional) – Other kwargs to pass to aiohttp.ClientSession.get(). By default, None.

  • pw_launch_kwargs (dict, optional) – Keyword-value argument pairs to pass to async_playwright.chromium.launch() (only used when reading HTML). By default, None.

  • pdf_read_kwargs (dict, optional) – Keyword-value argument pairs to pass to the pdf_read_coroutine. By default, None.

  • html_read_kwargs (dict, optional) – Keyword-value argument pairs to pass to the html_read_coroutine. By default, None.

  • pdf_read_coroutine (callable, optional) – PDF file read coroutine. Must be an async function. Should accept PDF bytes as the first argument and kwargs as the rest. Must return an elm.web.document.PDFDocument. If None, a default function that runs in the main thread is used. By default, None.

  • html_read_coroutine (callable, optional) – HTML file read coroutine. Must be an async function. Should accept HTML text as the first argument and kwargs as the rest. Must return an elm.web.document.HTMLDocument. If None, a default function that runs in the main thread is used. By default, None.

  • pdf_ocr_read_coroutine (callable, optional) – PDF OCR file read coroutine. Must be an async function. Should accept PDF bytes as the first argument and kwargs as the rest. Must return an elm.web.document.PDFDocument. If None, PDF OCR parsing is not attempted, and any URLs that point to scanned PDFs will return a blank document. By default, None.

  • file_cache_coroutine (callable, optional) – File caching coroutine. Can be used to cache files downloaded by this class. Must accept a Document instance as the first argument and the file content to be written as the second argument. If this coroutine is not provided, no document caching is performed. By default, None.

  • browser_semaphore (asyncio.Semaphore, optional) – Semaphore instance that can be used to limit the number of playwright browsers open concurrently. If None, no limits are applied. By default, None.
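
The caching and concurrency parameters above can be sketched as follows. This is a minimal illustration, not ELM's own implementation: cache_to_disk, CACHE_DIR, and build_loader are hypothetical names, the mapping from content type to file suffix is an assumption, and the sketch does not rely on whatever the loader does with the coroutine's return value.

```python
import asyncio
import hashlib
import tempfile
from pathlib import Path

# Hypothetical cache location; not something ELM defines.
CACHE_DIR = Path(tempfile.gettempdir()) / "elm_file_cache"


async def cache_to_disk(doc, content):
    """Write downloaded content to CACHE_DIR and return the file path.

    `doc` is accepted to match the documented call order (document
    first, content second) but is not inspected in this sketch.
    """
    is_bytes = isinstance(content, bytes)
    raw = content if is_bytes else content.encode("utf-8")
    # Assumption: bytes content is a PDF and str content is HTML text.
    suffix = ".pdf" if is_bytes else ".html"
    path = CACHE_DIR / (hashlib.sha256(raw).hexdigest()[:16] + suffix)

    def _write():
        CACHE_DIR.mkdir(parents=True, exist_ok=True)
        path.write_bytes(raw)

    # Run the blocking write in a worker thread so the event loop stays free.
    await asyncio.get_running_loop().run_in_executor(None, _write)
    return path


async def build_loader():
    """Create a loader that caches files and caps open browsers at two."""
    # Imported lazily so the caching helper above works without ELM installed.
    from elm.web.file_loader import AsyncFileLoader

    return AsyncFileLoader(
        verify_ssl=True,
        file_cache_coroutine=cache_to_disk,
        browser_semaphore=asyncio.Semaphore(2),
    )
```

Creating the semaphore inside a coroutine keeps it tied to the running event loop, which matters on older Python versions.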

Methods

fetch(url)

Fetch a document for the given URL.

fetch_all(*urls)

Fetch documents for all requested URLs.

Attributes

DEFAULT_HEADER_TEMPLATE

Default header

DEFAULT_HEADER_TEMPLATE = {'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8', 'Accept-Language': 'en-US,en;q=0.5', 'Connection': 'keep-alive', 'DNT': '1', 'Referer': 'https://www.google.com/', 'Upgrade-Insecure-Requests': '1', 'User-Agent': ''}

Default header

async fetch_all(*urls)

Fetch documents for all requested URLs.

Parameters:

*urls – URLs (as strings) to fetch, passed as positional arguments.

Returns:

list – List of documents, one per requested URL.
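
A short sketch of calling fetch_all may help here. Because the signature is fetch_all(*urls), a list of URLs has to be unpacked at the call site. The helper fetch_texts and the placeholder URLs are hypothetical, and the sketch assumes that results come back in request order and that each document exposes a text attribute; neither is stated above.

```python
import asyncio


async def fetch_texts(loader, urls):
    """Map each URL to the text of its fetched document."""
    # fetch_all takes URLs as positional arguments, so unpack the list.
    docs = await loader.fetch_all(*urls)
    # Assumptions: results are in request order and expose `.text`.
    return {url: getattr(doc, "text", "") for url, doc in zip(urls, docs)}


async def main():
    # Imported lazily so fetch_texts can be exercised without ELM installed.
    from elm.web.file_loader import AsyncFileLoader

    # Placeholder URLs for illustration only.
    urls = ["https://example.com/a.pdf", "https://example.com/b.html"]
    texts = await fetch_texts(AsyncFileLoader(), urls)
    for url, text in texts.items():
        print(url, len(text))
```

Run it with asyncio.run(main()) in an environment where ELM and its web dependencies are installed.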

async fetch(url)

Fetch a document for the given URL.

Parameters:

url (str) – URL for the document to pull down.

Returns:

elm.web.document.Document – Document instance containing text, if the fetch was successful.
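
Since the returned document only contains text when the fetch succeeds, callers may want to check for empty content before using it. The sketch below shows one way to do that; fetch_text_or_none is a hypothetical helper, and it assumes the document exposes its content as a text string.

```python
import asyncio


async def fetch_text_or_none(loader, url):
    """Return the fetched text, or None when the document has no text."""
    doc = await loader.fetch(url)
    # Assumption: the document exposes its content as a `.text` string.
    text = getattr(doc, "text", "") or ""
    return text if text.strip() else None
```

With a real loader this would be awaited as `await fetch_text_or_none(AsyncFileLoader(), url)` inside a coroutine.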