compass.scripts.process.process_counties_with_openai
- async process_counties_with_openai(out_dir, tech, jurisdiction_fp, model='gpt-4o', num_urls_to_check_per_jurisdiction=5, max_num_concurrent_browsers=10, max_num_concurrent_jurisdictions=None, url_ignore_substrings=None, file_loader_kwargs=None, pytesseract_exe_fp=None, td_kwargs=None, tpe_kwargs=None, ppe_kwargs=None, log_dir=None, clean_dir=None, ordinance_file_dir=None, jurisdiction_dbs_dir=None, llm_costs=None, log_level='INFO')
Download and extract ordinances for a list of counties.
This function scrapes ordinance documents (PDFs or HTML text) for a list of specified counties and processes them using one or more LLM models. Output files, logs, and intermediate artifacts are stored in configurable directories.
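A minimal call sketch, assuming the package is installed and that OpenAI credentials are available through environment variables (the case described for a string `model`). The helper names `write_jurisdiction_csv` and `run_wind_extraction`, the county/state pair, and the concurrency value are illustrative assumptions; only the import path and keyword names come from the signature on this page.

```python
import csv
from pathlib import Path


def write_jurisdiction_csv(fp, jurisdictions):
    """Write (county, state) pairs under the "County" and "State" header.

    Hypothetical helper: it only produces the two-column CSV layout that
    ``jurisdiction_fp`` is documented to require.
    """
    fp = Path(fp)
    with fp.open("w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["County", "State"])
        writer.writerows(jurisdictions)
    return fp


async def run_wind_extraction(out_dir, jurisdiction_fp):
    """Await the documented coroutine with mostly default settings."""
    # Imported lazily so the CSV helper above works without the package.
    from compass.scripts.process import process_counties_with_openai

    return await process_counties_with_openai(
        out_dir=out_dir,
        tech="wind",
        jurisdiction_fp=jurisdiction_fp,
        model="gpt-4o",
        num_urls_to_check_per_jurisdiction=5,
        # Illustrative cap; the documented default is None (no limit).
        max_num_concurrent_jurisdictions=2,
        log_level="INFO",
    )
```

Because the function is a coroutine, synchronous code would typically drive it with something like `asyncio.run(run_wind_extraction("outputs", fp))`, where `fp` is the path returned by `write_jurisdiction_csv`. That call performs web searches and LLM requests, so it is left out of the sketch itself.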
- Parameters:
out_dir (path-like) – Path to the output directory. If it does not exist, it will be created. This directory will contain the structured ordinance CSV file, all downloaded ordinance documents (PDFs and HTML), usage metadata, and default subdirectories for logs and intermediate outputs (unless otherwise specified).
tech ({"wind", "solar"}) – Label indicating which technology type is being processed.
jurisdiction_fp (path-like) – Path to a CSV file specifying the jurisdictions to process. The CSV must contain two columns: “County” and “State”, which specify the county and state names, respectively.
model (str or list of dict, optional) – LLM model(s) to use for scraping and parsing ordinance documents. If a string is provided, it is assumed to be the name of the default model (e.g., “gpt-4o”), and environment variables are used for authentication. If a list is provided, it should contain dictionaries of arguments that can initialize instances of OpenAIConfig. Each dictionary can specify the model name, client type, and initialization arguments. Each dictionary must also include a tasks key, which maps to a string or list of strings indicating the tasks that instance should handle. Exactly one of the instances must include “default” as a task, which will be used when no specific task is matched. For example:
    "model": [
        {
            "model": "gpt-4o-mini",
            "llm_call_kwargs": {
                "temperature": 0,
                "timeout": 300,
            },
            "client_kwargs": {
                "api_key": "<your_api_key>",
                "api_version": "<your_api_version>",
                "azure_endpoint": "<your_azure_endpoint>",
            },
            "tasks": ["default", "date_extraction"],
        },
        {
            "model": "gpt-4o",
            "client_type": "openai",
            "tasks": ["ordinance_text_extraction"],
        }
    ]
By default, "gpt-4o".
num_urls_to_check_per_jurisdiction (int, optional) – Number of unique Google search result URLs to check for each jurisdiction when attempting to locate ordinance documents. By default, 5.
max_num_concurrent_browsers (int, optional) – Maximum number of browser instances to launch concurrently for performing Google searches. Increasing this value can speed up searches, but may lead to timeouts or performance issues on machines with limited resources. By default, 10.
max_num_concurrent_jurisdictions (int, optional) – Maximum number of jurisdictions to process in parallel. Limiting this can help manage memory usage when dealing with a large number of documents. By default, None (no limit).
url_ignore_substrings (list of str, optional) – A list of substrings that, if found in any URL, will cause the URL to be excluded from consideration. This can be used to specify particular websites or entire domains to ignore. For example:
    url_ignore_substrings = [
        "wikipedia",
        "nrel.gov",
        "www.co.delaware.in.us/egov/documents/1649699794_0382.pdf",
    ]
The above configuration would ignore all Wikipedia articles, all websites on the NREL domain, and the specific file located at www.co.delaware.in.us/egov/documents/1649699794_0382.pdf. By default, None.
file_loader_kwargs (dict, optional) – Dictionary of keyword-argument pairs used to initialize elm.web.file_loader.AsyncFileLoader. If present, the “pw_launch_kwargs” key in this dictionary is also used to initialize the elm.web.search.google.PlaywrightGoogleLinkSearch instance used for the Google URL search. By default, None.
pytesseract_exe_fp (path-like, optional) – Path to the pytesseract executable. If specified, OCR will be used to extract text from scanned PDFs using Google’s Tesseract. By default, None.
td_kwargs (dict, optional) – Additional keyword arguments to pass to tempfile.TemporaryDirectory. The temporary directory is used to store documents which have not yet been confirmed to contain relevant information. By default, None.
tpe_kwargs (dict, optional) – Additional keyword arguments to pass to concurrent.futures.ThreadPoolExecutor, used for I/O-bound tasks such as logging. By default, None.
ppe_kwargs (dict, optional) – Additional keyword arguments to pass to concurrent.futures.ProcessPoolExecutor, used for CPU-bound tasks such as PDF loading and parsing. By default, None.
log_dir (path-like, optional) – Path to the directory for storing log files. If not provided, a logs subdirectory will be created inside out_dir. By default, None.
clean_dir (path-like, optional) – Path to the directory for storing cleaned ordinance text output. If not provided, a cleaned_text subdirectory will be created inside out_dir. By default, None.
ordinance_file_dir (path-like, optional) – Path to the directory where downloaded ordinance files (PDFs or HTML) for each jurisdiction are stored. If not provided, an ordinance_files subdirectory will be created inside out_dir. By default, None.
jurisdiction_dbs_dir (path-like, optional) – Path to the directory where parsed ordinance database files are stored for each jurisdiction. If not provided, a jurisdiction_dbs subdirectory will be created inside out_dir. By default, None.
llm_costs (dict, optional) – Dictionary mapping model names to their token costs, used to track the estimated total cost of LLM usage during the run. The structure should be:
    {"model_name": {"prompt": float, "response": float}}
Costs are specified in dollars per million tokens. For example:
    "llm_costs": {"my_gpt": {"prompt": 1.5, "response": 3.7}}
registers a model named “my_gpt” with a cost of $1.5 per million input (prompt) tokens and $3.7 per million output (response) tokens for the current processing run.
Note
The displayed total cost does not track cached tokens, so treat it as an estimate. Your final API costs may vary.
If set to None, no custom model costs are recorded, and cost tracking may be unavailable in the progress bar. By default, None.
log_level (str, optional) – Logging level for ordinance scraping and parsing (e.g., “TRACE”, “DEBUG”, “INFO”, “WARNING”, or “ERROR”). By default, "INFO".
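Three of the conventions described in the parameter list (exactly one “default” task among the model configurations, substring-based URL exclusion, and per-million-token pricing) can be illustrated with plain Python. This is a sketch of the documented rules only, not the library's own implementation: the helper names `find_default_config`, `is_ignored_url`, and `estimate_cost` are hypothetical, and the case-sensitive substring test is an assumption.

```python
def find_default_config(model_configs):
    """Return the single config whose tasks include "default".

    Hypothetical helper mirroring the documented rule that exactly one
    model configuration must list "default" as a task.
    """
    defaults = []
    for config in model_configs:
        tasks = config["tasks"]  # documented as a required key
        if isinstance(tasks, str):
            tasks = [tasks]  # a single task may be given as a string
        if "default" in tasks:
            defaults.append(config)
    if len(defaults) != 1:
        raise ValueError(
            f'expected exactly one "default" config, found {len(defaults)}'
        )
    return defaults[0]


def is_ignored_url(url, url_ignore_substrings=None):
    """True if any ignore substring occurs anywhere in the URL.

    Assumption: a plain, case-sensitive substring test.
    """
    return any(sub in url for sub in (url_ignore_substrings or []))


def estimate_cost(llm_costs, model_name, prompt_tokens, response_tokens):
    """Estimated cost in dollars, with rates in dollars per million tokens."""
    rates = llm_costs[model_name]
    total = prompt_tokens * rates["prompt"] + response_tokens * rates["response"]
    return total / 1_000_000


# Values taken from the examples in the parameter descriptions.
models = [
    {"model": "gpt-4o-mini", "tasks": ["default", "date_extraction"]},
    {"model": "gpt-4o", "client_type": "openai", "tasks": ["ordinance_text_extraction"]},
]
ignore = ["wikipedia", "nrel.gov"]
costs = {"my_gpt": {"prompt": 1.5, "response": 3.7}}

default_model = find_default_config(models)["model"]
skip_wiki = is_ignored_url("https://en.wikipedia.org/wiki/Wind_power", ignore)
run_cost = estimate_cost(costs, "my_gpt", 2_000_000, 1_000_000)
```

With these inputs, `default_model` is `"gpt-4o-mini"`, `skip_wiki` is `True`, and `run_cost` comes to about 6.7 dollars (2 million prompt tokens at $1.5 plus 1 million response tokens at $3.7). As the note on llm_costs says, cached tokens are not tracked, so a figure like this is only an estimate.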
- Returns: