Ordinance GPT
=============

This example folder contains supporting documents, results, and code for the Ordinance GPT experiment.

Prerequisites
-------------

We recommend installing the ``pytesseract`` module to enable text extraction from scanned PDF documents. See the ordinance-specific installation instructions for more details.
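As a quick sanity check before processing any scanned documents, you can verify that both the ``pytesseract`` module and the underlying Tesseract binary are usable. This helper is only an illustrative sketch, not part of the example code:

```python
def ocr_available():
    """Return True if pytesseract and the Tesseract binary are both usable."""
    try:
        import pytesseract

        # Raises TesseractNotFoundError if the Tesseract binary is not on PATH
        pytesseract.get_tesseract_version()
    except Exception:
        return False
    return True


print("OCR available:", ocr_available())
```

If this prints ``False``, scanned PDFs will not be retrievable, but the rest of the example still works on machine-readable PDFs.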

Running from Python
-------------------

This instruction set presents a simplified example that extracts ordinance data from an ordinance document on disk. It corresponds to the ordinance data extraction from PDF results in Buster et al., 2024.

To run this, first download one or more ordinance documents from the Box folder.

After downloading the ordinance document(s), set the relevant path for the ``fp_pdf`` variable, then run the script:

.. code-block:: bash

    $ python parse_pdf.py
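The ``fp_pdf`` edit is just a path assignment near the top of ``parse_pdf.py``; the path below is a placeholder, not a real file:

```python
# In parse_pdf.py: point fp_pdf at the ordinance document you downloaded.
# Use a full path unless you run the script from the folder containing the PDF.
fp_pdf = "/full/path/to/ordinance_document.pdf"
```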

Running from the Command Line Utility
-------------------------------------

This instruction set describes an experimental process that uses LLMs to search the internet for relevant ordinance documents, download those documents, and then extract the relevant ordinance data.

There are a few key things you need to set up in order to run ordinance retrieval and extraction. First, you must specify which counties you want to process. You can do this by setting up a CSV file with a ``County`` and a ``State`` column. Each row in the CSV file then represents a single county to process. See the example CSV file for reference.
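A county CSV in the required shape can be generated with a short script. The county names below are placeholders for illustration, not the contents of the actual example file:

```python
import csv

# Each row is one county to process; the columns must be "County" and "State".
counties = [
    ("Box Elder", "Utah"),  # placeholder entries for illustration
    ("Larimer", "Colorado"),
]

with open("counties.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["County", "State"])
    writer.writerows(counties)
```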

Once you have set up the county CSV, you can fill out the template JSON config. See the documentation for the ``process_counties_with_openai`` function for an explanation of all the allowed inputs to the configuration file. Some notable inputs:

- The ``azure*`` keys should be configured to match your Azure OpenAI API deployment. If your deployment is instead defined in your environment via the ``AZURE_OPENAI_API_KEY``, ``AZURE_OPENAI_VERSION``, and ``AZURE_OPENAI_ENDPOINT`` keys, you can remove these keys completely.
- The ``pytesseract_exe_fp`` key should point to the pytesseract executable path on your local machine (or be removed from the config file if you are opting out of OCR).
- You may also have to adjust the ``llm_service_rate_limit`` to match your deployment's API tokens-per-minute limit.

Be sure to provide full paths to all files/directories unless you are executing the program from your working folder.
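For orientation, a config fragment might look like the sketch below. The exact ``azure*`` key names and every value shown are assumptions for illustration only; verify them against the ``process_counties_with_openai`` documentation before use:

```json
{
    "azure_api_key": "<your-azure-openai-api-key>",
    "azure_version": "<your-api-version>",
    "azure_endpoint": "https://<your-deployment>.openai.azure.com/",
    "pytesseract_exe_fp": "/full/path/to/tesseract",
    "llm_service_rate_limit": 4000
}
```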

Execution
---------

Once you are happy with the configuration parameters, you can kick off the processing with:

.. code-block:: bash

    $ elm ords -c config.json

You may also wish to add the ``-v`` option to print logs to the terminal (however, keep in mind that the code runs asynchronously, so the logs will not print in order).

.. WARNING::

    Running all of the 85 counties given in the sample county CSV file can
    cost $700-$1000 in API calls. We recommend running a smaller subset for
    example purposes.

Debugging
---------

Not sure why things aren't working? No error messages? Make sure you run the CLI call with the ``-v`` flag for verbose logging (e.g., ``$ elm ords -c config.json -v``).

Errors on import statements? Trouble importing ``pdftotext`` with cryptic error messages like ``symbol not found in flat namespace``? Follow the ordinance-specific installation instructions exactly.

Source Ordinance Documents
--------------------------

The ordinance documents retrieved by (an older version of) this example code can be downloaded here.

Extension to Other Technologies
-------------------------------

Extending this functionality to other technologies is possible but requires a deeper understanding of the underlying processes. We recommend the following approach:

1. Start by examining the decision tree queries in ``graphs.py``, as well as how they are applied in ``parse.py``.
2. Once you have a firm understanding of these two modules, look through the document validation routines to get a better sense of how to adjust the web-scraping portion of the code to your technology.
3. When you have set up the validation and parsing for your technology, put it all together by adjusting the ``process_counties_with_openai`` function to call your new routines.