elm.ords.process.process_counties_with_openai
- async process_counties_with_openai(out_dir, county_fp=None, model='gpt-4', azure_api_key=None, azure_version=None, azure_endpoint=None, llm_call_kwargs=None, llm_service_rate_limit=4000, text_splitter_chunk_size=3000, text_splitter_chunk_overlap=300, num_urls_to_check_per_county=5, max_num_concurrent_browsers=10, file_loader_kwargs=None, pytesseract_exe_fp=None, td_kwargs=None, tpe_kwargs=None, ppe_kwargs=None, log_dir=None, clean_dir=None, county_ords_dir=None, county_dbs_dir=None, log_level='INFO')
Download and extract ordinances for a list of counties.
- Parameters:
out_dir (path-like) – Path to output directory. This directory will be created if it does not exist. This directory will contain the structured ordinance output CSV as well as all of the scraped ordinance documents (PDFs and HTML text files). Usage information and default options for log/clean directories will also be stored here.
county_fp (path-like, optional) – Path to CSV file containing a list of counties to extract ordinance information for. This CSV should have “County” and “State” columns that contain the county and state names. By default, None, which runs the extraction for all known counties (this is untested and not currently recommended).
model (str, optional) – Name of LLM model to perform scraping. By default, "gpt-4".
azure_api_key (str, optional) – Azure OpenAI API key. By default, None, which pulls the key from the environment variable AZURE_OPENAI_API_KEY instead.
azure_version (str, optional) – Azure OpenAI API version. By default, None, which pulls the version from the environment variable AZURE_OPENAI_VERSION instead.
azure_endpoint (str, optional) – Azure OpenAI API endpoint. By default, None, which pulls the endpoint from the environment variable AZURE_OPENAI_ENDPOINT instead.
llm_call_kwargs (dict, optional) – Keyword-value pairs used to initialize an elm.ords.llm.LLMCaller instance. By default, None.
llm_service_rate_limit (int, optional) – Token rate limit of the LLM service being used (OpenAI). By default, 4000.
text_splitter_chunk_size (int, optional) – Chunk size input to langchain.text_splitter.RecursiveCharacterTextSplitter. By default, 3000.
text_splitter_chunk_overlap (int, optional) – Chunk overlap input to langchain.text_splitter.RecursiveCharacterTextSplitter. By default, 300.
num_urls_to_check_per_county (int, optional) – Number of unique Google search result URLs to check for an ordinance document. By default, 5.
max_num_concurrent_browsers (int, optional) – Number of unique concurrent browser instances to open when performing Google search. Setting this number too high on a machine with limited processing power can lead to increased timeouts and therefore decreased quality of Google search results. By default, 10.
pytesseract_exe_fp (path-like, optional) – Path to pytesseract executable. If this option is specified, OCR parsing for PDF files will be enabled via pytesseract. By default, None.
td_kwargs (dict, optional) – Keyword-value argument pairs to pass to tempfile.TemporaryDirectory. The temporary directory is used to store files downloaded from the web that are still being parsed for ordinance information. By default, None.
tpe_kwargs (dict, optional) – Keyword-value argument pairs to pass to concurrent.futures.ThreadPoolExecutor. The thread pool executor is used to run I/O intensive tasks like writing to a log file. By default, None.
ppe_kwargs (dict, optional) – Keyword-value argument pairs to pass to concurrent.futures.ProcessPoolExecutor. The process pool executor is used to run CPU intensive tasks like loading a PDF file. By default, None.
log_dir (path-like, optional) – Path to directory for log files. This directory will be created if it does not exist. By default, None, which creates a logs folder in the output directory for the county-specific log files.
clean_dir (path-like, optional) – Path to directory for cleaned ordinance text output. This directory will be created if it does not exist. By default, None, which creates a clean folder in the output directory for the cleaned ordinance text files.
county_ords_dir (path-like, optional) – Path to directory for individual county ordinance file outputs. This directory will be created if it does not exist. By default, None, which creates a county_ord_files folder in the output directory.
county_dbs_dir (path-like, optional) – Path to directory for individual county ordinance database outputs. This directory will be created if it does not exist. By default, None, which creates a county_dbs folder in the output directory.
log_level (str, optional) – Log level to set for county retrieval and parsing loggers. By default, "INFO".
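Several parameters above fall back to an environment variable or to a sub-folder of out_dir when left as None. The sketch below mirrors that documented behavior in plain Python so the defaults are easier to reason about. It is an illustration only, not the library's implementation: the helper names resolve_azure_settings and resolve_output_dirs, the env argument, and the create flag are all hypothetical.

```python
import os
from pathlib import Path


def resolve_azure_settings(azure_api_key=None, azure_version=None,
                           azure_endpoint=None, env=None):
    """Hypothetical helper: use environment variables for values left as None."""
    env = os.environ if env is None else env
    return {
        "azure_api_key": azure_api_key or env.get("AZURE_OPENAI_API_KEY"),
        "azure_version": azure_version or env.get("AZURE_OPENAI_VERSION"),
        "azure_endpoint": azure_endpoint or env.get("AZURE_OPENAI_ENDPOINT"),
    }


def resolve_output_dirs(out_dir, log_dir=None, clean_dir=None,
                        county_ords_dir=None, county_dbs_dir=None,
                        create=True):
    """Hypothetical helper: apply the documented default sub-folder names."""
    out_dir = Path(out_dir)
    dirs = {
        "out_dir": out_dir,
        "log_dir": Path(log_dir) if log_dir else out_dir / "logs",
        "clean_dir": Path(clean_dir) if clean_dir else out_dir / "clean",
        "county_ords_dir": (Path(county_ords_dir) if county_ords_dir
                            else out_dir / "county_ord_files"),
        "county_dbs_dir": (Path(county_dbs_dir) if county_dbs_dir
                           else out_dir / "county_dbs"),
    }
    if create:
        # The documentation states that missing directories are created.
        for path in dirs.values():
            path.mkdir(parents=True, exist_ok=True)
    return dirs
```

Passing an explicit value for any directory overrides only that entry; the remaining directories still resolve relative to out_dir.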
- Returns:
pd.DataFrame – DataFrame of parsed ordinance information. The same data is also written to the output directory as “wind_db.csv”.
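Because the function is a coroutine, it has to be awaited or driven by an event loop. The following is a minimal usage sketch, assuming ELM is installed and the Azure OpenAI environment variables are set. The helper names write_county_csv, run_extraction, and main, the ords_out directory, the two example counties, and the reduced browser count are illustrative choices, not values taken from the documentation.

```python
import asyncio
import csv
from pathlib import Path


def write_county_csv(county_fp, counties):
    """Write (county, state) pairs to a CSV with "County" and "State" columns."""
    county_fp = Path(county_fp)
    county_fp.parent.mkdir(parents=True, exist_ok=True)
    with county_fp.open("w", newline="", encoding="utf-8") as fh:
        writer = csv.writer(fh)
        writer.writerow(["County", "State"])
        writer.writerows(counties)
    return county_fp


async def run_extraction(out_dir, county_fp):
    """Await the ordinance extraction and return the resulting DataFrame."""
    # Imported lazily so the CSV helper above works without ELM installed.
    from elm.ords.process import process_counties_with_openai

    return await process_counties_with_openai(
        out_dir,
        county_fp=county_fp,
        model="gpt-4",
        num_urls_to_check_per_county=5,
        # Illustrative: fewer browsers than the default for a small machine.
        max_num_concurrent_browsers=4,
        log_level="INFO",
    )


def main():
    """Illustrative entry point; needs ELM and Azure OpenAI credentials."""
    county_fp = write_county_csv(
        "ords_out/counties.csv",
        [("Decatur", "Indiana"), ("Box Elder", "Utah")],
    )
    ordinance_db = asyncio.run(run_extraction("ords_out", county_fp))
    print(ordinance_db.head())
```

Calling main() starts the event loop and runs the full extraction, so it performs web searches and LLM calls; the CSV helper alone can be used offline to prepare the county_fp input.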