elm.web.website_crawl.ELMWebsiteCrawler
- class ELMWebsiteCrawler(validator, file_loader_kwargs=None, browser_config_kwargs=None, crawl_strategy_kwargs=None, crawler_config_kwargs=None, cte_kwargs=None, extra_url_filters=None, include_external=False, url_scorer=None, max_pages=100, page_limit=None)
Bases: object
Crawl a website for documents of interest.
- Parameters:
validator (callable) – An async callable that takes a document instance (containing the text from a PDF or a webpage) and returns a boolean indicating whether the text passes the validation check. This is used to determine whether or not to keep (i.e. return) the document.
file_loader_kwargs (dict, optional) – Additional keyword-value argument pairs to pass to the AsyncFileLoader class. By default, None.
browser_config_kwargs (dict, optional) – Additional keyword-value argument pairs to pass to the crawl4ai.async_configs.BrowserConfig class. By default, None.
crawl_strategy_kwargs (dict, optional) – Additional keyword-value argument pairs to pass to the ELMWebsiteCrawlingStrategy class. By default, None.
crawler_config_kwargs (dict, optional) – Additional keyword-value argument pairs to pass to the crawl4ai.async_configs.CrawlerRunConfig class. By default, None.
cte_kwargs (dict, optional) – Additional keyword-value argument pairs to pass to the ContentTypeExcludeFilter class. This filter is used to exclude URLs based on their content type. By default, None.
extra_url_filters (list, optional) – Additional URL filters to apply during crawling. Each filter must have a (non-async) apply method that takes a URL and returns a boolean indicating whether the URL should be included in the crawl. By default, None.
include_external (bool, optional) – Whether to include external links in the crawl. By default, False.
url_scorer (callable, optional) – An async callable that takes a list of dictionaries containing URL information and assigns each dictionary a score key representing the score for that URL. The input URL dictionaries will each have at least one key: “href”. This key will contain the URL of the link. The dictionary may also have other attributes such as “text”, which contains the link title text. If None, uses the ELMLinkScorer.score() method to score the URLs. By default, None.
max_pages (int, optional) – Maximum number of successful pages to crawl. By default, 100.
page_limit (int, optional) – Maximum number of pages to crawl regardless of success status. If None, a page limit of 2 * max_pages is used. To set no limit (not recommended), use math.inf. By default, None.
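The callable parameters above (validator, extra_url_filters, url_scorer) can be sketched as follows. This is a minimal illustration, not code from the library: the names keyword_validator, NoArchiveFilter, and keyword_scorer, the keyword rules, and the example URLs are all hypothetical. It also assumes that a document exposes its text through a text attribute and that the scorer may modify the link dictionaries in place under a "score" key.

```python
import asyncio
from types import SimpleNamespace


async def keyword_validator(doc):
    """Keep a document only if its text mentions a keyword (hypothetical rule)."""
    # Assumption: the document exposes its text via a ``text`` attribute.
    return "ordinance" in doc.text.lower()


class NoArchiveFilter:
    """Hypothetical URL filter that skips anything under an /archive/ path."""

    def apply(self, url):
        # Synchronous on purpose; True means "include this URL in the crawl".
        return "/archive/" not in url


async def keyword_scorer(links):
    """Give each link dictionary a "score" key (assumed to be set in place)."""
    for link in links:
        # "href" is always present; "text" may be missing.
        haystack = f"{link['href']} {link.get('text', '')}".lower()
        link["score"] = sum(word in haystack for word in ("ordinance", "zoning"))
    return links


async def demo():
    """Exercise the callables on stand-in data, without running a crawl."""
    doc = SimpleNamespace(text="Zoning Ordinance No. 12")
    links = [
        {"href": "https://example.com/zoning", "text": "Zoning ordinance"},
        {"href": "https://example.com/contact"},
    ]
    await keyword_scorer(links)
    return await keyword_validator(doc), links


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

With the elm package installed, these objects would presumably be passed as ELMWebsiteCrawler(keyword_validator, extra_url_filters=[NoArchiveFilter()], url_scorer=keyword_scorer), following the signature documented above.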
Methods
run(base_url[, termination_callback, ...]) – Crawl a website for documents of interest.
- async run(base_url, termination_callback=None, on_result_hook=None, return_c4ai_results=False)
Crawl a website for documents of interest.
- Parameters:
base_url (str) – The base URL to start crawling from.
termination_callback (callable, optional) – An async callable that takes a list of documents and returns a boolean indicating whether to stop crawling. If None, the ELMWebsiteCrawlingStrategy.found_enough_docs() is used, which simply terminates when roughly a handful of documents have been found. By default, None.
on_result_hook (callable, optional) – An async callable that is called every time a result is found during the crawl. This can be used to perform additional processing on each result or to monitor the crawl progress. The callable should accept a single argument, which is the crawl result object. If None, no additional processing is done on the results. By default, None.
return_c4ai_results (bool, optional) – Whether to return the raw crawl4ai results along with the documents. If True, returns a tuple of (documents, crawl4ai_results). If False, returns only the documents. By default, False.
- Returns:
out_docs – List of document instances that passed the validation check. Each document contains the text from a PDF or a webpage, and has an attribute source that contains the URL of the document.
results (list, optional) – List of crawl4ai results containing metadata about the crawled pages. This is only returned if return_c4ai_results is True.
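A call to run with both callbacks and the tuple return could look roughly like the sketch below. The helper names stop_after_three, log_result, and crawl are hypothetical, as is the three-document threshold. The sketch assumes the crawl result object has a url attribute, and it takes an already constructed crawler as an argument so that it does not depend on any particular constructor settings.

```python
import asyncio

seen_urls = []


async def stop_after_three(docs):
    """Hypothetical termination rule: stop once three documents are kept."""
    return len(docs) >= 3


async def log_result(result):
    """Record every crawl result as it arrives."""
    # Assumption: the crawl result object exposes a ``url`` attribute.
    seen_urls.append(getattr(result, "url", None))


async def crawl(crawler, base_url):
    """Run one crawl and report how many documents and raw results came back."""
    # With return_c4ai_results=True the documented return value is a
    # (documents, crawl4ai_results) tuple, so it can be unpacked directly.
    docs, raw_results = await crawler.run(
        base_url,
        termination_callback=stop_after_three,
        on_result_hook=log_result,
        return_c4ai_results=True,
    )
    return len(docs), len(raw_results)
```

Here crawler would be an ELMWebsiteCrawler instance, and the coroutine would be driven with asyncio.run(crawl(crawler, "https://example.com")). With return_c4ai_results left at False, run returns only the list of documents, so the unpacking in crawl would have to be dropped.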