elm.web.website_crawl.ELMWebsiteCrawlingStrategy

class ELMWebsiteCrawlingStrategy(max_depth: int, filter_chain: crawl4ai.deep_crawling.filters.FilterChain = <crawl4ai.deep_crawling.filters.FilterChain object>, url_scorer: crawl4ai.deep_crawling.scorers.URLScorer | None = None, include_external: bool = False, max_pages: int = inf, logger: logging.Logger | None = None)[source]

Bases: BestFirstCrawlingStrategy

Custom crawling strategy for ELM website searching
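A minimal construction sketch, assuming crawl4ai's FilterChain and DomainFilter are available as shown (the domain, depth, and page cap below are illustrative, not defaults):

    from crawl4ai.deep_crawling.filters import DomainFilter, FilterChain
    from elm.web.website_crawl import ELMWebsiteCrawlingStrategy

    # Hypothetical setup: restrict the crawl to a single domain and cap
    # both the crawl depth and the total number of pages fetched.
    strategy = ELMWebsiteCrawlingStrategy(
        max_depth=2,
        filter_chain=FilterChain([DomainFilter(allowed_domains=["example.gov"])]),
        max_pages=100,
    )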

Methods

arun(start_url, crawler[, config])

Main entry point for best-first crawling.

can_process_url(url, depth)

Validate the URL format and apply filtering.

found_enough_docs(out_docs)

Check if enough documents have been found.

link_discovery(result, source_url, ...)

Extract links from the crawl result, validate them, and append new URLs (with their parent references) to next_links.

shutdown()

Signal cancellation and clean up resources.

Attributes

BATCH_SIZE

Number of URLs to process in each batch

ONE_SCORE_AT_A_TIME

Whether to batch process only links with the same score.

BATCH_SIZE = 10

Number of URLs to process in each batch

ONE_SCORE_AT_A_TIME = True

Whether to batch process only links with the same score.

This works best if the score is an integer value, since scores are compared directly using the == operator.
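Because these are plain class attributes, a subclass can tune batching behavior without touching the crawl logic. A minimal sketch:

    from elm.web.website_crawl import ELMWebsiteCrawlingStrategy

    class LargerBatchStrategy(ELMWebsiteCrawlingStrategy):
        """Hypothetical variant with bigger, mixed-score batches."""

        BATCH_SIZE = 25              # process 25 URLs per batch instead of 10
        ONE_SCORE_AT_A_TIME = False  # allow different scores within one batch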

async classmethod found_enough_docs(out_docs)[source]

Check if enough documents have been found.

If ELMWebsiteCrawlingStrategy.ONE_SCORE_AT_A_TIME is True, this function returns True once 5 documents have been found; otherwise it returns True once 8 documents have been found.

Parameters:

out_docs (list) – List of documents found during the crawl.

Returns:

bool – Whether enough documents have been found to stop the crawl.
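A sketch of the stopping rule as documented above (the thresholds of 5 and 8 come from the description; the actual source may differ in detail):

    @classmethod
    async def found_enough_docs(cls, out_docs: list) -> bool:
        # Stop earlier when batches are score-homogeneous, later when
        # batches may mix documents of different scores.
        threshold = 5 if cls.ONE_SCORE_AT_A_TIME else 8
        return len(out_docs) >= threshold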

async link_discovery(result, source_url, ...)[source]

Extract links from the crawl result, validate them, and append new URLs (with their parent references) to next_links. Also updates the depths dictionary.

Note

Overridden from the original BestFirstCrawlingStrategy to return the full link dictionary instead of just URLs.
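An illustration of the difference; the dictionary keys follow crawl4ai's link format and are assumptions about the upstream data, not guarantees of this class:

    # The base strategy queues bare URL strings:
    #     "https://example.gov/reports/annual"
    # This override keeps the full crawl4ai link dictionary, e.g.:
    link = {
        "href": "https://example.gov/reports/annual",  # the URL itself
        "text": "Annual Report",                       # anchor text
        "title": "",                                   # title attribute, if any
    }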

__call__(start_url: str, crawler: AsyncWebCrawler, config: CrawlerRunConfig)

Call self as a function.

async arun(start_url: str, crawler: AsyncWebCrawlerType, config: CrawlerRunConfigType | None = None) → RunManyReturn

Main entry point for best-first crawling.

Returns either a list (batch mode) or an async generator (stream mode) of CrawlResults.
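A hedged batch-mode usage sketch; the start URL is illustrative, and crawl4ai's AsyncWebCrawler and CrawlerRunConfig are assumed to be importable from the top-level package:

    import asyncio

    from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
    from elm.web.website_crawl import ELMWebsiteCrawlingStrategy

    async def main():
        strategy = ELMWebsiteCrawlingStrategy(max_depth=2, max_pages=50)
        async with AsyncWebCrawler() as crawler:
            # Batch mode: arun returns a list of CrawlResults once the
            # crawl finishes (or found_enough_docs stops it early).
            results = await strategy.arun(
                "https://example.gov", crawler, config=CrawlerRunConfig()
            )
        for result in results:
            print(result.url)

    asyncio.run(main())

With a streaming run configuration, the same call would instead yield CrawlResults from an async generator.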

async can_process_url(url: str, depth: int) → bool

Validate the URL format and apply filtering. For the starting URL (depth 0), filtering is bypassed.
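For example, a hypothetical pre-check before seeding discovered URLs:

    # Depth 0 (the start URL) bypasses the filter chain entirely;
    # deeper URLs must pass both format validation and all filters.
    ok_start = await strategy.can_process_url("https://example.gov", depth=0)
    ok_child = await strategy.can_process_url("https://example.gov/docs", depth=1)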

async shutdown() → None

Signal cancellation and clean up resources.
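A hedged cleanup sketch, reusing the strategy and crawler from the arun example above:

    try:
        results = await strategy.arun("https://example.gov", crawler)
    finally:
        # Signal cancellation so any in-flight batches stop cleanly.
        await strategy.shutdown()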