elm.web.website_crawl.ELMWebsiteCrawlingStrategy
- class ELMWebsiteCrawlingStrategy(max_depth: int, filter_chain: FilterChain = FilterChain(), url_scorer: URLScorer | None = None, include_external: bool = False, max_pages: int = inf, logger: logging.Logger | None = None)[source]
Bases: BestFirstCrawlingStrategy
Custom crawling strategy for ELM website searching
Methods
arun(start_url, crawler[, config]): Main entry point for best-first crawling.
can_process_url(url, depth): Validate the URL format and apply filtering.
found_enough_docs(out_docs): Check if enough documents have been found.
link_discovery(result, source_url, ...): Extract links from the crawl result, validate them, and append new URLs (with their parent references) to next_links.
shutdown(): Signal cancellation and clean up resources.
Attributes
- BATCH_SIZE = 10
Number of URLs to process in each batch
- ONE_SCORE_AT_A_TIME = True
Whether to batch process only links with the same score.
This works best if the score is an integer value, since scores are compared directly using the == operator.
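To illustrate the same-score batching described above, here is a minimal standalone sketch; `take_next_batch` is a hypothetical helper (not part of the class) assuming integer scores compared with `==`:

```python
# Sketch of same-score batching (hypothetical helper, not part of the class).
# When ONE_SCORE_AT_A_TIME is True, a batch only contains links whose score
# equals the best (first) score; otherwise up to BATCH_SIZE links are taken.

BATCH_SIZE = 10
ONE_SCORE_AT_A_TIME = True

def take_next_batch(scored_links):
    """Pop the next batch from a list of (score, url) pairs sorted best-first."""
    if not scored_links:
        return []
    if ONE_SCORE_AT_A_TIME:
        top_score = scored_links[0][0]
        batch = [link for link in scored_links if link[0] == top_score][:BATCH_SIZE]
    else:
        batch = scored_links[:BATCH_SIZE]
    del scored_links[:len(batch)]
    return batch
```

With floating-point scores the `==` comparison could split ties unexpectedly, which is why the docstring recommends integer scores.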
- async classmethod found_enough_docs(out_docs)[source]
Check if enough documents have been found.
If ELMWebsiteCrawlingStrategy.ONE_SCORE_AT_A_TIME is True, this function returns True when 5 documents have been found; otherwise, it returns True when 8 documents have been found.
- Parameters:
out_docs (list) – List of documents found during the crawl.
- Returns:
bool – Whether enough documents have been found to stop the crawl.
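The documented threshold behavior amounts to the following check (a standalone synchronous sketch; the real method is an async classmethod on ELMWebsiteCrawlingStrategy):

```python
# Standalone sketch of the documented stopping rule: stop at 5 documents
# when processing one score at a time, otherwise stop at 8.

def found_enough_docs(out_docs, one_score_at_a_time=True):
    """Return True once enough documents have been collected."""
    threshold = 5 if one_score_at_a_time else 8
    return len(out_docs) >= threshold
```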
- async link_discovery(result, source_url, current_depth, visited, next_links)[source]
Extract links from the crawl result, validate them, and append new URLs (with their parent references) to next_links. Also updates the depths dictionary.
Note
Overridden from the original BestFirstCrawlingStrategy to return the full link dictionary instead of just the URLs.
- __call__(start_url: str, crawler: AsyncWebCrawler, config: CrawlerRunConfig)
Call self as a function.
- async arun(start_url: str, crawler: AsyncWebCrawlerType, config: CrawlerRunConfigType | None = None) → RunManyReturn
Main entry point for best-first crawling.
Returns either a list (batch mode) or an async generator (stream mode) of CrawlResults.
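Because arun returns a list in batch mode and an async generator in stream mode, callers may want to normalize both shapes. The sketch below is generic and uses stand-in results rather than the crawl4ai API:

```python
import asyncio
import inspect

# Generic sketch: normalize the two return shapes of a best-first crawl
# (list in batch mode, async generator in stream mode) into one list.

async def collect_results(run_result):
    """Accept either a list or an async generator of crawl results."""
    if inspect.isasyncgen(run_result):
        return [item async for item in run_result]
    return list(run_result)

async def _stream():  # stand-in for stream-mode output
    for url in ("a", "b"):
        yield url

# Both modes end up as a plain list:
batch = asyncio.run(collect_results(["a", "b"]))   # batch mode
stream = asyncio.run(collect_results(_stream()))   # stream mode
```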