compass.scripts.process.process_jurisdictions_with_openai
- async process_jurisdictions_with_openai(out_dir, tech, jurisdiction_fp, model='gpt-4o-mini', num_urls_to_check_per_jurisdiction=5, max_num_concurrent_browsers=10, max_num_concurrent_website_searches=10, max_num_concurrent_jurisdictions=25, url_ignore_substrings=None, known_local_docs=None, known_doc_urls=None, file_loader_kwargs=None, search_engines=None, pytesseract_exe_fp=None, td_kwargs=None, tpe_kwargs=None, ppe_kwargs=None, log_dir=None, clean_dir=None, ordinance_file_dir=None, jurisdiction_dbs_dir=None, perform_se_search=True, perform_website_search=True, llm_costs=None, log_level='INFO', keep_async_logs=False)
Extract ordinances for one or more jurisdiction(s)
This function scrapes ordinance documents (PDFs or HTML text) for a given set of jurisdictions and processes them using one or more LLM models. Output files, logs, and intermediate artifacts are stored in configurable directories.
The processing has a well-defined order:
1. Process any/all known local documents
2. Process any/all known document URLs
3. Search engine-based search for ordinance documents
4. Jurisdiction website crawl-based search for ordinance documents
Users can disable any of these steps via inputs to this function. If any step returns a document with extractable ordinance information, subsequent steps are skipped for that jurisdiction.
- Parameters:
out_dir (path-like) – Path to the output directory. If it does not exist, it will be created. This directory will contain the structured ordinance CSV file, all downloaded ordinance documents (PDFs and HTML), usage metadata, and default subdirectories for logs and intermediate outputs (unless otherwise specified).
tech (
{"wind", "solar", "small wind"}) – Label indicating which technology type is being processed.
jurisdiction_fp (path-like) – Path to a CSV file specifying the jurisdictions to process. The CSV must contain at least two columns: “County” and “State”, which specify the county and state names, respectively. To process a subdivision within a county, you must also include “Subdivision” and “Jurisdiction Type” columns. “Subdivision” should be the name of the subdivision, and “Jurisdiction Type” should be a string identifying the type of subdivision (e.g., “City”, “Township”, etc.).
model (
str or list of dict, optional) – LLM model(s) to use for scraping and parsing ordinance documents. If a string is provided, it is assumed to be the name of the default model (e.g., “gpt-4o”), and environment variables are used for authentication.
If a list is provided, it should contain dictionaries of arguments that can initialize instances of OpenAIConfig. Each dictionary can specify the model name, client type, and initialization arguments.
Each dictionary must also include a tasks key, which maps to a string or list of strings indicating the tasks that instance should handle. Exactly one of the instances must include “default” as a task, which will be used when no specific task is matched. For example:
"model": [
    {
        "model": "gpt-4o-mini",
        "llm_call_kwargs": {
            "temperature": 0,
            "timeout": 300,
        },
        "client_kwargs": {
            "api_key": "<your_api_key>",
            "api_version": "<your_api_version>",
            "azure_endpoint": "<your_azure_endpoint>",
        },
        "tasks": ["default", "date_extraction"],
    },
    {
        "model": "gpt-4o",
        "client_type": "openai",
        "tasks": ["ordinance_text_extraction"],
    }
]
By default, "gpt-4o-mini".
num_urls_to_check_per_jurisdiction (
int, optional) – Number of unique Google search result URLs to check for each jurisdiction when attempting to locate ordinance documents. By default, 5.
max_num_concurrent_browsers (
int, optional) – Maximum number of browser instances to launch concurrently for retrieving information from the web. Increasing this value too much may lead to timeouts or performance issues on machines with limited resources. By default, 10.
max_num_concurrent_website_searches (
int, optional) – Maximum number of website searches allowed to run simultaneously. Increasing this value can speed up searches, but may lead to timeouts or performance issues on machines with limited resources. By default, 10.
max_num_concurrent_jurisdictions (
int, default 25) – Maximum number of jurisdictions to process concurrently. Limiting this can help manage memory usage when dealing with a large number of documents. By default, 25.
url_ignore_substrings (
list of str, optional) – A list of substrings that, if found in a URL, will cause that URL to be excluded from consideration. This can be used to specify particular websites or entire domains to ignore. For example:
url_ignore_substrings = [
    "wikipedia",
    "nrel.gov",
    "www.co.delaware.in.us/documents/1649699794_0382.pdf",
]
The above configuration would ignore all Wikipedia articles, all websites on the NREL domain, and the specific file located at www.co.delaware.in.us/documents/1649699794_0382.pdf. By default, None.
known_local_docs (
dict or path-like, optional) – A dictionary where keys are the jurisdiction codes (as strings) and values are lists of dictionaries containing information about each document. The latter dictionaries should contain at least the key "source_fp" pointing to the full path of the local document file. All other keys will be added as attributes to the loaded document instance. You can include the key "is_legal_doc" to skip the legal document check for known documents. Similarly, you can provide the "date" key, which is a list of [year, month, day], some or all of which can be null, to skip the date extraction step of the processing pipeline. If this input is provided, local documents will be checked first. See the top-level documentation of this function for the full processing order of the pipeline. This input can also be a path to a JSON file containing the dictionary of code-to-document-info mappings. By default, None.
known_doc_urls (
dict or path-like, optional) – A dictionary where keys are the jurisdiction codes (as strings) and values are lists of dictionaries containing information about each document. The latter dictionaries should contain at least the key "source" representing the known URL to check for that document. All other keys will be added as attributes to the loaded document instance. You can include the key "is_legal_doc" to skip the legal document check for known documents. Similarly, you can provide the "date" key, which is a list of [year, month, day], some or all of which can be null, to skip the date extraction step of the processing pipeline. If this input is provided, the known URLs will be checked before applying the search engine search. See the top-level documentation of this function for the full processing order of the pipeline. This input can also be a path to a JSON file containing the dictionary of code-to-document-info mappings.
Note
The same input can be used for both known_local_docs and known_doc_urls as long as both "source_fp" and "source" keys are provided in each document info dictionary.
By default, None.
file_loader_kwargs (
dict, optional) – Dictionary of keyword-argument pairs used to initialize elm.web.file_loader.AsyncWebFileLoader. If found, the “pw_launch_kwargs” key in this dictionary will also be used to initialize the elm.web.search.google.PlaywrightGoogleLinkSearch used for the Google URL search. By default, None.
search_engines (
list, optional) – A list of dictionaries, where each dictionary contains information about a search engine class that should be used for the document retrieval process. Each dictionary should contain at least the key "se_name", which should correspond to one of the search engine class names from elm.web.search.run.SEARCH_ENGINE_OPTIONS. The remaining keys should be keyword-value pairs used to initialize the search engine class (such as API keys and configuration options; see the ELM documentation for details on search engine class parameters). The list should be ordered by search engine preference: the first entry is used to submit the queries initially, and any subsequent entries are used as fallbacks, in the order they appear. Do not repeat search engines; if you do, only the last config dictionary will be used to initialize that search engine. If None, all default configurations for the search engines (along with the fallback order) are used. By default, None.
pytesseract_exe_fp (path-like, optional) – Path to the pytesseract executable. If specified, OCR will be used to extract text from scanned PDFs using Google’s Tesseract. By default, None.
td_kwargs (
dict, optional) – Additional keyword arguments to pass to tempfile.TemporaryDirectory. The temporary directory is used to store documents which have not yet been confirmed to contain relevant information. By default, None.
tpe_kwargs (
dict, optional) – Additional keyword arguments to pass to concurrent.futures.ThreadPoolExecutor, used for I/O-bound tasks such as logging. By default, None.
ppe_kwargs (
dict, optional) – Additional keyword arguments to pass to concurrent.futures.ProcessPoolExecutor, used for CPU-bound tasks such as PDF loading and parsing. By default, None.
log_dir (path-like, optional) – Path to the directory for storing log files. If not provided, a logs subdirectory will be created inside out_dir. By default, None.
clean_dir (path-like, optional) – Path to the directory for storing cleaned ordinance text output. If not provided, a cleaned_text subdirectory will be created inside out_dir. By default, None.
ordinance_file_dir (path-like, optional) – Path to the directory where downloaded ordinance files (PDFs or HTML) for each jurisdiction are stored. If not provided, an ordinance_files subdirectory will be created inside out_dir. By default, None.
jurisdiction_dbs_dir (path-like, optional) – Path to the directory where parsed ordinance database files are stored for each jurisdiction. If not provided, a jurisdiction_dbs subdirectory will be created inside out_dir. By default, None.
perform_se_search (
bool, default True) – Option to perform a search engine-based search for ordinance documents. This is the standard way to collect ordinance documents, and it is recommended to leave this set to True unless you are re-processing local documents. If True, the search engine approach is used to locate ordinance documents before falling back to a website crawl-based search (if that has been selected). By default, True.
perform_website_search (
bool, default True) – Option to fall back to a jurisdiction website crawl-based search for ordinance documents if the search engine approach fails to recover any relevant documents. By default, True.
llm_costs (
dict, optional) – Dictionary mapping model names to their token costs, used to track the estimated total cost of LLM usage during the run. The structure should be:
{"model_name": {"prompt": float, "response": float}}
Costs are specified in dollars per million tokens. For example:
"llm_costs": {"my_gpt": {"prompt": 1.5, "response": 3.7}}
registers a model named “my_gpt” with a cost of $1.5 per million input (prompt) tokens and $3.7 per million output (response) tokens for the current processing run.
Note
The displayed total cost does not track cached tokens, so treat it like an estimate. Your final API costs may vary.
If set to None, no custom model costs are recorded, and cost tracking may be unavailable in the progress bar. By default, None.
log_level (
str, optional) – Logging level for ordinance scraping and parsing (e.g., “TRACE”, “DEBUG”, “INFO”, “WARNING”, or “ERROR”). By default, "INFO".
keep_async_logs (
bool, default False) – Option to store the full asynchronous log record to a file. This is only useful if you intend to monitor overall processing progress from a file instead of from the terminal. If True, all of the unordered records are written to an “all.log” file in the log_dir directory. By default, False.
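Several of the parameters above take small input files or dictionaries. The sketch below shows one way these inputs might be prepared. The file names, the jurisdiction code "00000", the example URL, the place names, and the helper `estimate_cost` are all placeholders invented for illustration. Leaving the subdivision columns blank for county-level rows and using True as the "is_legal_doc" value are assumptions. The cost arithmetic follows the dollars-per-million-tokens convention described for llm_costs, but it is not the library's own implementation.

```python
import csv
import json
import tempfile
from pathlib import Path

work_dir = Path(tempfile.mkdtemp())  # scratch directory for this example only

# jurisdiction_fp: "County" and "State" are required; the other two columns
# are only needed for rows that name a subdivision (blank cells are an assumption).
fieldnames = ["County", "State", "Subdivision", "Jurisdiction Type"]
rows = [
    {"County": "Example County", "State": "Example State",
     "Subdivision": "", "Jurisdiction Type": ""},
    {"County": "Example County", "State": "Example State",
     "Subdivision": "Example Town", "Jurisdiction Type": "Township"},
]
jurisdiction_fp = work_dir / "jurisdictions.csv"
with jurisdiction_fp.open("w", newline="", encoding="utf-8") as fh:
    writer = csv.DictWriter(fh, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)

# One mapping usable for both known_local_docs and known_doc_urls, since each
# document entry carries both "source_fp" and "source".
known_docs = {
    "00000": [  # placeholder jurisdiction code, given as a string
        {
            "source_fp": str(work_dir / "example_ordinance.pdf"),
            "source": "https://example.com/ordinance.pdf",
            "is_legal_doc": True,        # assumed flag value; skips the legal check
            "date": [2020, None, None],  # [year, month, day]; None becomes null
        }
    ]
}
known_docs_fp = work_dir / "known_docs.json"
known_docs_fp.write_text(json.dumps(known_docs, indent=2), encoding="utf-8")

# llm_costs: dollars per million tokens, as in the example above.
llm_costs = {"my_gpt": {"prompt": 1.5, "response": 3.7}}


def estimate_cost(model_name, prompt_tokens, response_tokens, costs=llm_costs):
    """Hypothetical helper: estimated dollars for the given token counts."""
    rates = costs[model_name]
    total = prompt_tokens * rates["prompt"] + response_tokens * rates["response"]
    return total / 1_000_000


example_cost = estimate_cost("my_gpt", 2_000_000, 500_000)
```

With these rates, two million prompt tokens and half a million response tokens come to about $4.85. As the note on llm_costs says, cached tokens are not tracked, so any such figure is only an estimate.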
- Returns:
str – Message summarizing run results, including total processing time, total cost, output directory, and number of documents found. The message is formatted for easy reading in the terminal and may include color-coded cost information if the terminal supports it.
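Because the function is a coroutine that returns a summary string, a caller would typically await it and print the result. The sketch below assumes the package is installed, that credentials are supplied through environment variables (as described for a string model input), and that the paths shown exist. The paths and the chosen option values are placeholders.

```python
import asyncio

# Keyword arguments drawn from the signature above; values are placeholders.
run_kwargs = {
    "out_dir": "./outputs",
    "tech": "wind",
    "jurisdiction_fp": "./jurisdictions.csv",
    "model": "gpt-4o-mini",
    "max_num_concurrent_jurisdictions": 5,
    "perform_website_search": False,  # skip the crawl-based fallback step
    "log_level": "DEBUG",
}


async def main():
    # Imported lazily so this sketch can be read and loaded without the package.
    from compass.scripts.process import process_jurisdictions_with_openai

    summary = await process_jurisdictions_with_openai(**run_kwargs)
    print(summary)  # human-readable run summary
    return summary


# To launch a real run (needs the package, network access, and API credentials):
# asyncio.run(main())
```

Lowering max_num_concurrent_jurisdictions here is only an illustration of trading speed for memory, as the parameter description suggests.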