compass.web.website_crawl.COMPASSCrawler

class COMPASSCrawler(validator, url_scorer, file_loader_kwargs=None, already_visited=None, num_link_scores_to_check_per_page=4, max_pages=100, browser_semaphore=None)

Bases: object

A simple website crawler to search for ordinance documents

Parameters:
  • validator (callable()) – An async callable that takes a document instance (containing the text from a PDF or a webpage) and returns a boolean indicating whether the text passes the validation check. This is used to determine whether or not to keep (i.e. return) the document.

  • url_scorer (callable()) – An async callable that takes a list of dictionaries containing URL information and assigns each dictionary a score key representing the score for that URL. The input URL dictionaries will each have at least one key: “href”. This key will contain the URL of the link. The dictionary may also have other attributes such as “title”, which contains the link title text.

  • file_loader_kwargs (dict, optional) – Additional keyword-value argument pairs to pass to the AsyncFileLoader class. If this dictionary contains the pw_launch_kwargs key, its value (assumed to be another dictionary) will be used to initialize the playwright instances used for the crawl. By default, None.

  • already_visited (set, optional) – A set of URLs (either strings or Link objects) that have already been visited. This is used to avoid re-visiting links that have already been checked. By default, None.

  • num_link_scores_to_check_per_page (int, default 4) – Number of top unique-scoring links per page to use for recursive crawling. This keeps the crawl focused on the links most likely to contain documents of interest.

  • max_pages (int, default 100) – Maximum number of pages to crawl before terminating, regardless of whether the document was found or not.

  • browser_semaphore (asyncio.Semaphore, optional) – Semaphore instance that can be used to limit the number of playwright browsers open concurrently. If None, no limits are applied. By default, None.
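
The two required callables can be sketched as plain async functions. This is a minimal illustration, not the library's own implementation: the keyword list, the FakeDoc stand-in, and the assumption that a document exposes its text through a text attribute are all hypothetical. The scorer both mutates the dictionaries and returns the list, since the description above does not say which of the two the crawler relies on.

```python
import asyncio
from dataclasses import dataclass

# Hypothetical keyword list, for illustration only
KEYWORDS = ("ordinance", "zoning", "code")


@dataclass
class FakeDoc:
    """Stand-in for a document instance; the ``text`` attribute is an assumption."""

    text: str


async def validator(doc):
    """Keep the document only if its text mentions an ordinance."""
    return "ordinance" in doc.text.lower()


async def url_scorer(urls):
    """Assign each URL dictionary a "score" key based on keyword hits."""
    for info in urls:
        # "href" is always present; "title" may be missing
        haystack = f"{info['href']} {info.get('title', '')}".lower()
        info["score"] = sum(word in haystack for word in KEYWORDS)
    return urls


# Exercise the callables on their own, without a crawler
links = [
    {"href": "https://example.com/zoning-ordinance.pdf", "title": "Zoning Ordinance"},
    {"href": "https://example.com/contact"},
]
scored = asyncio.run(url_scorer(links))
keep = asyncio.run(validator(FakeDoc("Wind Energy Ordinance No. 12")))
```

Following the signature above, these would then be passed as COMPASSCrawler(validator, url_scorer, max_pages=50), optionally with browser_semaphore=asyncio.Semaphore(2) to cap the number of concurrent browsers.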

Methods

run(base_url[, termination_callback, ...])

Run the COMPASS website crawler

async run(base_url, termination_callback=None, on_new_page_visit_hook=None)

Run the COMPASS website crawler

Parameters:
  • base_url (str) – URL of the website to start crawling from.

  • termination_callback (callable(), optional) – An async callable that takes a list of documents and returns a boolean indicating whether to stop crawling. If None, the crawl simply terminates when DOC_THRESHOLD number of documents have been found. By default, None.

  • on_new_page_visit_hook (callable(), optional) – An async callable that is called every time a new page is found during the crawl. The callable should accept a single argument, which is the page Link instance. If None, no additional processing is done on new pages. By default, None.

Returns:

list – List of document instances that passed the validation check. Each document contains the text from a PDF or a webpage and has an attribute source that contains the URL of the document. This list may be empty if no documents are found.
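
The optional callbacks accepted by run can be sketched the same way. This is a hedged example: the names stop_after_two and log_visit, the two-document threshold, and the use of plain strings in place of real document and Link instances are illustrative assumptions.

```python
import asyncio

# Pages reported by the hook, in the order they were discovered
visited = []


async def stop_after_two(docs):
    """Return True once at least two documents have been collected."""
    return len(docs) >= 2


async def log_visit(link):
    """Record each newly found page; ``link`` would be the page's Link instance."""
    visited.append(link)


# Exercise the callbacks on their own, with strings as stand-ins
should_stop_early = asyncio.run(stop_after_two(["doc-a"]))
should_stop = asyncio.run(stop_after_two(["doc-a", "doc-b"]))
asyncio.run(log_visit("https://example.com/planning"))

# With a crawler instance, the call would look like this (inside a coroutine):
#     docs = await crawler.run(
#         "https://example.com",
#         termination_callback=stop_after_two,
#         on_new_page_visit_hook=log_visit,
#     )
```

Because the returned list may be empty, callers should check it before indexing into it.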