compass.web.website_crawl.COMPASSCrawler
- class COMPASSCrawler(validator, url_scorer, file_loader_kwargs=None, already_visited=None, num_link_scores_to_check_per_page=4, max_pages=100, browser_semaphore=None)
Bases: object
A simple website crawler to search for ordinance documents
- Parameters:
validator (callable()) – An async callable that takes a document instance (containing the text from a PDF or a webpage) and returns a boolean indicating whether the text passes the validation check. This is used to determine whether or not to keep (i.e. return) the document.
url_scorer (callable()) – An async callable that takes a list of dictionaries containing URL information and assigns each dictionary a “score” key representing the score for that URL. Each input URL dictionary will have at least one key, “href”, which contains the URL of the link. A dictionary may also have other keys such as “title”, which contains the link title text.
file_loader_kwargs (dict, optional) – Additional keyword-value argument pairs to pass to the AsyncFileLoader class. If this dictionary contains the pw_launch_kwargs key, its value (assumed to be another dictionary) will be used to initialize the playwright instances used for the crawl. By default, None.
already_visited (set, optional) – A set of URLs (either strings or Link objects) that have already been visited. This is used to avoid re-visiting links that have already been checked. By default, None.
num_link_scores_to_check_per_page (int, default 4) – Number of top unique-scoring links per page to use for recursive crawling. This helps the crawl stay focused on the links most likely to contain documents of interest.
max_pages (int, default 100) – Maximum number of pages to crawl before terminating, regardless of whether the document was found or not. By default, 100.
browser_semaphore (asyncio.Semaphore, optional) – Semaphore instance that can be used to limit the number of playwright browsers open concurrently. If None, no limits are applied. By default, None.
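The two required callables can be sketched as below. This is only an illustration under stated assumptions: the keyword list is hypothetical, the document is assumed to expose its contents as a `text` attribute, and returning the list from the scorer (in addition to updating the dictionaries in place) is a guess rather than a documented requirement. The `build_crawler` helper needs the COMPASS package installed and uses only the constructor arguments listed above.

```python
import asyncio
from types import SimpleNamespace

KEYWORDS = ("ordinance", "zoning", "code")  # hypothetical search terms


async def validator(doc):
    """Keep a document only if its text mentions an ordinance.

    Assumption: the document exposes its contents as ``doc.text``.
    """
    return "ordinance" in doc.text.lower()


async def url_scorer(urls):
    """Add a "score" key to each URL dictionary (one point per keyword hit)."""
    for info in urls:
        # "href" is always present; "title" is optional
        haystack = f"{info['href']} {info.get('title', '')}".lower()
        info["score"] = sum(kw in haystack for kw in KEYWORDS)
    # Returning the list is an assumption; the dicts are also updated in place
    return urls


def build_crawler():
    """Construct the crawler (requires the COMPASS package)."""
    from compass.web.website_crawl import COMPASSCrawler

    return COMPASSCrawler(
        validator=validator,
        url_scorer=url_scorer,
        num_link_scores_to_check_per_page=4,
        max_pages=50,
        browser_semaphore=asyncio.Semaphore(2),  # at most two browsers at once
    )


# Stand-in document used only to exercise the validator locally
sample_doc = SimpleNamespace(text="Wind Energy Ordinance, Section 1")
```

Both callables can be exercised without a browser, for example `asyncio.run(url_scorer([{"href": "https://example.com/zoning.pdf"}]))`, which makes it easy to tune the scoring before starting a real crawl.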
Methods
run(base_url[, termination_callback, ...]) – Run the COMPASS website crawler
- async run(base_url, termination_callback=None, on_new_page_visit_hook=None)
Run the COMPASS website crawler
- Parameters:
base_url (str) – URL of the website to start crawling from.
termination_callback (callable(), optional) – An async callable that takes a list of documents and returns a boolean indicating whether to stop crawling. If None, the crawl simply terminates once DOC_THRESHOLD documents have been found. By default, None.
on_new_page_visit_hook (callable(), optional) – An async callable that is called every time a new page is found during the crawl. The callable should accept a single argument, which is the page Link instance. If None, no additional processing is done on new pages. By default, None.
- Returns:
list – List of document instances that passed the validation check. Each document contains the text from a PDF and has an attribute source that contains the URL of the document. This list may be empty if no documents are found.
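A minimal sketch of driving `run` with both optional callbacks follows. The names `MAX_DOCS`, `stop_when_enough`, `record_visit`, and `crawl` are hypothetical, and the crawler passed to `crawl` is assumed to be an already-constructed `COMPASSCrawler`. The hook stores `str(link)` only because the exact attributes of `Link` are not described here.

```python
import asyncio

MAX_DOCS = 3  # hypothetical cap on the number of documents to collect
visited = []  # filled in by the hook below


async def stop_when_enough(docs):
    """Tell the crawler to stop once MAX_DOCS documents have been found."""
    return len(docs) >= MAX_DOCS


async def record_visit(link):
    """Remember every new page the crawler reports."""
    visited.append(str(link))


async def crawl(crawler, base_url):
    """Run an already-constructed crawler and return the validated documents."""
    docs = await crawler.run(
        base_url,
        termination_callback=stop_when_enough,
        on_new_page_visit_hook=record_visit,
    )
    # The returned list may be empty if nothing passed validation
    return docs
```

A call such as `asyncio.run(crawl(crawler, "https://example.com"))` would then return the validated documents while `visited` accumulates the pages seen along the way.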