elm.web.website_crawl.ELMWebsiteCrawlingStrategy

class ELMWebsiteCrawlingStrategy(max_depth: int, filter_chain: crawl4ai.deep_crawling.filters.FilterChain = <crawl4ai.deep_crawling.filters.FilterChain object>, url_scorer: crawl4ai.deep_crawling.scorers.URLScorer | None = None, include_external: bool = False, max_pages: int = inf, logger: logging.Logger | None = None)[source]

Bases: BestFirstCrawlingStrategy

Custom crawling strategy for ELM website searching
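A minimal construction sketch, assuming crawl4ai's FilterChain and DomainFilter are available as shown (the domain, depth, and page cap below are illustrative, not defaults):

    from crawl4ai.deep_crawling.filters import DomainFilter, FilterChain
    from elm.web.website_crawl import ELMWebsiteCrawlingStrategy

    # Hypothetical setup: restrict the crawl to a single domain and cap
    # both the crawl depth and the total number of pages fetched.
    strategy = ELMWebsiteCrawlingStrategy(
        max_depth=2,
        filter_chain=FilterChain([DomainFilter(allowed_domains=["example.gov"])]),
        max_pages=100,
    )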

Methods

arun(start_url, crawler[, config])

Main entry point for best-first crawling.

can_process_url(url, depth)

Validate the URL format and apply filtering.

found_enough_docs(out_docs)

Check if enough documents have been found.

link_discovery(result, source_url, ...)

Extract links from the crawl result, validate them, and append new URLs (with their parent references) to next_links.

shutdown()

Signal cancellation and clean up resources.

Attributes

BATCH_SIZE

Number of URLs to process in each batch

ONE_SCORE_AT_A_TIME

Whether to batch process only links with the same score.

BATCH_SIZE = 10

Number of URLs to process in each batch

ONE_SCORE_AT_A_TIME = True

Whether to batch process only links with the same score.

This works best if the score is an integer value, since scores are compared directly using the == operator.
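Because these are plain class attributes, a subclass can tune batching behavior without touching the crawl logic. A minimal sketch:

    from elm.web.website_crawl import ELMWebsiteCrawlingStrategy

    class LargerBatchStrategy(ELMWebsiteCrawlingStrategy):
        """Hypothetical variant with bigger, mixed-score batches."""

        BATCH_SIZE = 25              # process 25 URLs per batch instead of 10
        ONE_SCORE_AT_A_TIME = False  # allow different scores within one batch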

async classmethod found_enough_docs(out_docs)[source]

Check if enough documents have been found.

If ELMWebsiteCrawlingStrategy.ONE_SCORE_AT_A_TIME is True, this function returns True once 5 documents have been found; otherwise it returns True once 8 documents have been found.

Parameters:

out_docs (list) – List of documents found during the crawl.

Returns:

bool – Whether enough documents have been found to stop the crawl.
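A sketch of the stopping rule as documented above (the thresholds of 5 and 8 come from the description; the actual source may differ in detail):

    @classmethod
    async def found_enough_docs(cls, out_docs: list) -> bool:
        # Stop earlier when batches are score-homogeneous, later when
        # batches may mix documents of different scores.
        threshold = 5 if cls.ONE_SCORE_AT_A_TIME else 8
        return len(out_docs) >= threshold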

async link_discovery(result, source_url, ...)[source]

Extract links from the crawl result, validate them, and append new URLs (with their parent references) to next_links. Also updates the depths dictionary.

Note

Overridden from the original BestFirstCrawlingStrategy to return the full link dictionary instead of just URLs.
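An illustration of the difference; the dictionary keys follow crawl4ai's link format and are assumptions about the upstream data, not guarantees of this class:

    # The base strategy queues bare URL strings:
    #     "https://example.gov/reports/annual"
    # This override keeps the full crawl4ai link dictionary, e.g.:
    link = {
        "href": "https://example.gov/reports/annual",  # the URL itself
        "text": "Annual Report",                       # anchor text
        "title": "",                                   # title attribute, if any
    }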

__call__(start_url: str, crawler: AsyncWebCrawler, config: CrawlerRunConfig)

Call self as a function.

async arun(start_url: str, crawler: AsyncWebCrawlerType, config: CrawlerRunConfigType | None = None) → RunManyReturn

Main entry point for best-first crawling.

Returns either a list (batch mode) or an async generator (stream mode) of CrawlResults.
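A hedged batch-mode usage sketch; the start URL is illustrative, and crawl4ai's AsyncWebCrawler and CrawlerRunConfig are assumed to be importable from the top-level package:

    import asyncio

    from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
    from elm.web.website_crawl import ELMWebsiteCrawlingStrategy

    async def main():
        strategy = ELMWebsiteCrawlingStrategy(max_depth=2, max_pages=50)
        async with AsyncWebCrawler() as crawler:
            # Batch mode: arun returns a list of CrawlResults once the
            # crawl finishes (or found_enough_docs stops it early).
            results = await strategy.arun(
                "https://example.gov", crawler, config=CrawlerRunConfig()
            )
        for result in results:
            print(result.url)

    asyncio.run(main())

With a streaming run configuration, the same call would instead yield CrawlResults from an async generator.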

async can_process_url(url: str, depth: int) → bool

Validate the URL format and apply filtering. For the starting URL (depth 0), filtering is bypassed.
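For example, a hypothetical pre-check before seeding discovered URLs:

    # Depth 0 (the start URL) bypasses the filter chain entirely;
    # deeper URLs must pass both format validation and all filters.
    ok_start = await strategy.can_process_url("https://example.gov", depth=0)
    ok_child = await strategy.can_process_url("https://example.gov/docs", depth=1)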

async shutdown() → None

Signal cancellation and clean up resources.
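A hedged cleanup sketch, reusing the strategy and crawler from the arun example above:

    try:
        results = await strategy.arun("https://example.gov", crawler)
    finally:
        # Signal cancellation so any in-flight batches stop cleanly.
        await strategy.shutdown()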