elm.web.website_crawl.ELMWebsiteCrawler

class ELMWebsiteCrawler(validator, file_loader_kwargs=None, browser_config_kwargs=None, crawl_strategy_kwargs=None, crawler_config_kwargs=None, cte_kwargs=None, extra_url_filters=None, include_external=False, url_scorer=None, max_pages=100, page_limit=None)

Bases: object

Crawl a website for documents of interest

Parameters:
  • validator (callable) – An async callable that takes a document instance (containing the text from a PDF or a webpage) and returns a boolean indicating whether the text passes the validation check. This is used to determine whether or not to keep (i.e. return) the document.

  • file_loader_kwargs (dict, optional) – Additional keyword-value argument pairs to pass to the AsyncFileLoader class. By default, None.

  • browser_config_kwargs (dict, optional) – Additional keyword-value argument pairs to pass to the crawl4ai.async_configs.BrowserConfig class. By default, None.

  • crawl_strategy_kwargs (dict, optional) – Additional keyword-value argument pairs to pass to the ELMWebsiteCrawlingStrategy class. By default, None.

  • crawler_config_kwargs (dict, optional) – Additional keyword-value argument pairs to pass to the crawl4ai.async_configs.CrawlerRunConfig class. By default, None.

  • cte_kwargs (dict, optional) – Additional keyword-value argument pairs to pass to the ContentTypeExcludeFilter class. This filter is used to exclude URLs based on their content type. By default, None.

  • extra_url_filters (list, optional) – Additional URL filters to apply during crawling. Each filter must have a (non-async) apply method that takes a URL and returns a boolean indicating whether the URL should be included in the crawl. By default, None.

  • include_external (bool, optional) – Whether to include external links in the crawl. By default, False.

  • url_scorer (callable, optional) – An async callable that takes a list of dictionaries containing URL information and assigns each dictionary a score key representing the score for that URL. Each input dictionary has at least an “href” key, which holds the URL of the link, and may also have other keys such as “text”, which holds the link title text. If None, the ELMLinkScorer.score() method is used to score the URLs. By default, None.

  • max_pages (int, optional) – Maximum number of successful pages to crawl. By default, 100.

  • page_limit (int, optional) – Maximum number of pages to crawl regardless of success status. If None, a page limit of 2 * max_pages is used. To set no limit (not recommended), use math.inf. By default, None.
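The callable parameters above can be sketched as follows. This is a minimal illustration, not the library's own code: the keyword rule, the SkipLoginPages class, and the scoring values are hypothetical, the document is assumed to expose its content as a text attribute, and the score key is assumed to be named "score". Only build_crawler() needs the elm package; the rest runs offline.

```python
import asyncio
from types import SimpleNamespace


async def keyword_validator(doc):
    """Keep documents whose text mentions a keyword (hypothetical rule)."""
    # Assumption: the document exposes its content as a ``text`` attribute.
    return "ordinance" in doc.text.lower()


class SkipLoginPages:
    """Hypothetical URL filter with the synchronous ``apply`` method."""

    def apply(self, url):
        # Returning True keeps the URL in the crawl.
        return "/login" not in url


async def pdf_first_scorer(urls):
    """Give each URL dictionary a ``score`` key, favoring PDF links."""
    for info in urls:
        is_pdf = info["href"].lower().endswith(".pdf")
        info["score"] = 1.0 if is_pdf else 0.1
    # Returning the list is an assumption; the documented contract only
    # requires that each dictionary receives a score.
    return urls


def build_crawler():
    """Construct the crawler; requires the ``elm`` package to be installed."""
    from elm.web.website_crawl import ELMWebsiteCrawler

    return ELMWebsiteCrawler(
        validator=keyword_validator,
        extra_url_filters=[SkipLoginPages()],
        url_scorer=pdf_first_scorer,
        max_pages=50,  # page_limit stays None, so 2 * 50 = 100 applies
    )


# Offline check of the validator with a stand-in document (no crawling).
sample = SimpleNamespace(text="Draft zoning ordinance")
keep = asyncio.run(keyword_validator(sample))
```

Keeping the filter synchronous and the validator and scorer asynchronous mirrors the parameter descriptions above.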

Methods

run(base_url[, termination_callback, ...])

Crawl a website for documents of interest

async run(base_url, termination_callback=None, on_result_hook=None, return_c4ai_results=False)

Crawl a website for documents of interest

Parameters:
  • base_url (str) – The base URL to start crawling from.

  • termination_callback (callable, optional) – An async callable that takes a list of documents and returns a boolean indicating whether to stop crawling. If None, ELMWebsiteCrawlingStrategy.found_enough_docs() is used, which stops the crawl once a small number of documents has been found. By default, None.

  • on_result_hook (callable, optional) – An async callable that is called every time a result is found during the crawl. This can be used to perform additional processing on each result or to monitor the crawl progress. The callable should accept a single argument, which is the crawl result object. If None, no additional processing is done on the results. By default, None.

  • return_c4ai_results (bool, optional) – Whether to return the raw crawl4ai results along with the documents. If True, returns a tuple of (documents, crawl4ai_results). If False, returns only the documents. By default, False.

Returns:

  • out_docs (list) – List of document instances that passed the validation check. Each document contains the text from a PDF or a webpage and records the URL it came from as its source.

  • results (list, optional) – List of crawl4ai results containing metadata about the crawled pages. This is only returned if return_c4ai_results is True.