compass.scripts.download.download_jurisdiction_ordinances_from_website
- async download_jurisdiction_ordinances_from_website(website, heuristic, keyword_points, file_loader_kwargs=None, browser_config_kwargs=None, crawler_config_kwargs=None, max_urls=100, crawl_semaphore=None, pb_jurisdiction_name=None, return_c4ai_results=False)
Download ordinance documents from a jurisdiction website
- Parameters:
website (str) – URL of the jurisdiction website to search.
keyword_points (dict) – Dictionary of keyword points to use for scoring links. Keys are keywords, values are points to assign to links containing the keyword. If a link contains multiple keywords, the points are summed.
file_loader_kwargs (dict, optional) – Dictionary of keyword-argument pairs used to initialize elm.web.file_loader.AsyncFileLoader. If found, the “pw_launch_kwargs” key in these will also be used to initialize the elm.web.search.google.PlaywrightGoogleLinkSearch used for the Google URL search. By default, None.
browser_config_kwargs (dict, optional) – Dictionary of keyword-argument pairs used to initialize the crawl4ai.async_configs.BrowserConfig class used for the web crawl. By default, None.
crawler_config_kwargs (dict, optional) – Dictionary of keyword-argument pairs used to initialize the crawl4ai.async_configs.CrawlerConfig class used for the web crawl. By default, None.
max_urls (int, optional) – Maximum number of URLs to check from the website before terminating the search. By default, 100.
crawl_semaphore (asyncio.Semaphore, optional) – Semaphore instance that can be used to limit the number of website searches happening concurrently. If None, no limits are applied. By default, None.
pb_jurisdiction_name (str, optional) – Jurisdiction name used to update the progress bar, if one is being used. By default, None.
return_c4ai_results (bool, default False) – If True, the crawl4ai results will be returned as a second return value. This is useful for debugging and examining the crawled URLs. If False, only the documents will be returned.
- Returns:
out_docs (list) – List of BaseDocument instances containing potential ordinance information, or an empty list if no ordinance document was found.
results (list, optional) – List of crawl4ai results containing metadata about the crawled pages. This is only returned if return_c4ai_results is True.
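The summing rule described for keyword_points can be pictured with a small standalone sketch. This is not the library's scoring code: score_link is a hypothetical helper, the keyword weights and example.gov URLs are invented for illustration, and case-insensitive substring matching is an assumption.

```python
def score_link(link_text, keyword_points):
    """Sum the points of every keyword found in the link (hypothetical helper).

    Assumption: matching is a case-insensitive substring test. The real
    matching rules are not stated in this reference page.
    """
    text = link_text.lower()
    return sum(
        points
        for keyword, points in keyword_points.items()
        if keyword.lower() in text
    )


# Illustrative weights only; these values are made up for the example.
keyword_points = {"ordinance": 5, "zoning": 3, "wind": 2}

links = [
    "https://example.gov/zoning-ordinance.pdf",
    "https://example.gov/parks-and-recreation",
    "https://example.gov/wind-energy-ordinance",
]

# A link containing several keywords receives the sum of their points.
scores = {link: score_link(link, keyword_points) for link in links}

# Higher-scoring links would be the more promising ones to visit first.
ranked = sorted(links, key=scores.get, reverse=True)
```

Under these assumptions, a link matching both "zoning" and "ordinance" outranks one matching only a single keyword, which is the behavior the parameter description implies.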
Notes
Requires TempFileCache service to be running.
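The concurrency limit that crawl_semaphore provides can be illustrated without the library itself. In the sketch below, fake_site_search is a hypothetical stand-in for a single website search and only sleeps briefly; how the real function acquires the semaphore internally is not shown on this page, so treat this as a picture of the general asyncio.Semaphore pattern rather than of the implementation.

```python
import asyncio


async def fake_site_search(name, semaphore, active, peak):
    """Hypothetical stand-in for one website search; it only sleeps."""
    async with semaphore:  # wait here until a slot is free
        active[0] += 1
        peak[0] = max(peak[0], active[0])  # record the highest concurrency seen
        await asyncio.sleep(0.01)  # pretend to crawl
        active[0] -= 1
    return name


async def run_searches(names, limit):
    """Run all stand-in searches, allowing at most `limit` at once."""
    semaphore = asyncio.Semaphore(limit)
    active, peak = [0], [0]  # one-element lists act as shared counters
    results = await asyncio.gather(
        *(fake_site_search(n, semaphore, active, peak) for n in names)
    )
    return results, peak[0]


results, peak = asyncio.run(run_searches(["a", "b", "c", "d", "e"], limit=2))
```

Sharing one semaphore across many calls is what bounds the total number of simultaneous searches; passing None, as documented above, leaves them unbounded.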