INFRA-COMPASS Execution Basics#
This example walks you through setting up and executing your first INFRA-COMPASS run.
Prerequisites#
We recommend enabling Optical Character Recognition (OCR) for PDF parsing, which allows the program to process scanned documents. To enable OCR, you’ll need to install pytesseract.
If you installed COMPASS via PyPI, you may need to install a few additional dependencies:
pip install pytesseract pdf2image
If you’re using pixi to run the pipeline (recommended), these libraries are included by default.
In either case, you may still need to complete a few additional setup steps if this is your first time installing Google’s tesseract utility. Follow the installation instructions here.
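If the tesseract utility itself is not yet on your machine, it can usually be installed with your platform’s package manager. The package names below are common defaults and may differ on your system; on Windows, use the installer linked from the instructions above:
# Debian/Ubuntu
sudo apt-get install tesseract-ocr
# macOS (Homebrew)
brew install tesseract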
Setting Up the Run Configuration#
The INFRA-COMPASS configuration file, written in either JSON or JSON5 format, defines the parameters for a run. Each key in the config corresponds to an argument for the function process_counties_with_openai.
Refer to the linked documentation for detailed and up-to-date descriptions of each input.
Minimal Config#
At a minimum, the INFRA-COMPASS config file requires three keys: "out_dir", "jurisdiction_fp", and "tech".
out_dir: Path to the output directory. Will be created if it does not exist.
jurisdiction_fp: Path to a CSV file containing County and State columns. Each row defines a jurisdiction to process. See the example CSV (an illustrative sketch is shown below).
tech: A string representing the infrastructure or technology focus for the run.
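For reference, a hypothetical jurisdictions.csv might contain something like the following (the rows are placeholders; see the linked example CSV for the exact formatting expected):
County,State
Decatur,Indiana
Box Elder,Utah
El Paso,Colorado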
In config_bare_minimum.json5, we show a minimal working configuration that includes only the required keys.
// config_bare_minimum.json5
{
"out_dir": "./outputs",
"jurisdiction_fp": "jurisdictions.csv",
"tech": "solar",
// Note that this assumes your LLM API keys and endpoints
// have been configured as environment variables!!
}
This configuration is sufficient for a basic run using default settings and assumes the following:
Environment Configuration
Your LLM credentials and endpoints should be configured as environment variables. For example, when using Azure OpenAI:
AZURE_OPENAI_API_KEY
AZURE_OPENAI_VERSION
AZURE_OPENAI_ENDPOINT
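In a bash-like shell, these can be set with placeholder values such as the following (substitute your own deployment details):
export AZURE_OPENAI_API_KEY="<your API key>"
export AZURE_OPENAI_VERSION="<your API version>"
export AZURE_OPENAI_ENDPOINT="https://<your-resource>.openai.azure.com/"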
LLM Model Defaults
This minimal setup uses the default LLM model for INFRA-COMPASS (gpt-4o as of April 11, 2025).
To override this default, add a model key to your config:
"model": "gpt-4o-mini"
Typical Config#
In most cases, you’ll want more control over the execution parameters, especially those related to the LLM configuration. You can review all available inputs in the process_counties_with_openai documentation. In config_recommended.json5, we demonstrate a typical configuration that balances simplicity with additional control over execution parameters.
// config_recommended.json5
{
"out_dir": "./outputs",
"tech": "solar",
"jurisdiction_fp": "jurisdictions.csv",
"model": [
{
"name": "gpt-4o-mini",
"llm_call_kwargs":{
"temperature": 0,
"timeout": 300
},
"llm_service_rate_limit": 500000,
"text_splitter_chunk_size": 10000,
"text_splitter_chunk_overlap": 500,
"client_kwargs": {
// default client is Azure OpenAI
"azure_api_key": "<ADD AZURE OPENAI API KEY HERE>",
"azure_version": "<ADD AZURE OPENAI VERSION HERE>",
"azure_endpoint": "<ADD AZURE OPENAI ENDPOINT HERE>",
},
},
],
"file_loader_kwargs": {
"verify_ssl": false,
},
"pytesseract_exe_fp": "<ADD tesseract.exe PATH HERE OR REMOVE THIS KEY>",
}
This setup supports most users’ needs while providing flexibility for key configurations:
LLM Configuration
Customize the LLM behavior and performance using:
llm_call_kwargs: Sets LLM-specific query parameters like temperature and timeout.
llm_service_rate_limit: Controls how many tokens can be processed per minute. Set this as high as your deployment will allow to speed up processing.
text_splitter_chunk_size and text_splitter_chunk_overlap: Control how large each text chunk sent to the model is.
Warning
Be cautious when adjusting the "text_splitter_chunk_size". Larger chunk sizes increase token usage, which may result in higher costs per query.
LLM Credentials
You can also specify LLM credentials and endpoint details directly in the config under the client_kwargs key.
Note that while this can be convenient for quick testing, storing credentials in plaintext is not recommended for production environments.
SSL Configuration
Set verify_ssl to false in file_loader_kwargs to bypass certificate verification errors, which is especially useful when running behind the NREL VPN.
If you’re not using the VPN, it’s best to leave this value as the default (true).
OCR Integration
As noted in the Prerequisites section, we recommend enabling OCR using pytesseract. To enable OCR for scanned PDFs, you must provide the path to the tesseract executable using the pytesseract_exe_fp input.
You can locate the executable path by running:
which tesseract
Omit the pytesseract_exe_fp key to disable OCR functionality.
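For instance, on a typical Linux or macOS install the command above might report a path such as /usr/bin/tesseract or /usr/local/bin/tesseract (the exact path depends on your system), in which case the config entry would be:
"pytesseract_exe_fp": "/usr/bin/tesseract"
On Windows, provide the full path to the tesseract.exe executable instead.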
Kitchen Sink Config#
In config_kitchen_sink.json5, we show what a configuration might look like that utilizes all available parameters.
// config_kitchen_sink.json5
{
"out_dir": "./outputs",
"tech": "solar",
"jurisdiction_fp": "jurisdictions.csv",
"model": [
{
"name": "deployment-gpt-4o-mini",
"llm_call_kwargs":{
"temperature": 0,
"seed": 42,
"timeout": 300
},
"llm_service_rate_limit": 500000,
"text_splitter_chunk_size": 10000,
"text_splitter_chunk_overlap": 500,
"client_type": "azure", // this is the default
"client_kwargs": {
"azure_api_key": "<ADD AZURE OPENAI API KEY HERE>",
"azure_version": "<ADD AZURE OPENAI VERSION HERE>",
"azure_endpoint": "<ADD AZURE OPENAI ENDPOINT HERE>",
},
// "default" has to appear as a task exactly once across
// all models, or you will get an error
"tasks": "default",
},
{
"name": "deployment-gpt-4o",
"llm_call_kwargs":{
"temperature": 0,
"seed": 42,
"timeout": 300
},
"llm_service_rate_limit": 500000,
"text_splitter_chunk_size": 20000,
"text_splitter_chunk_overlap": 1000,
"client_type": "azure", // this is the default
"client_kwargs": {
"azure_api_key": "<ADD AZURE OPENAI API KEY HERE>",
"azure_version": "<ADD AZURE OPENAI VERSION HERE>",
"azure_endpoint": "<ADD AZURE OPENAI ENDPOINT HERE>",
},
"tasks": [
"ordinance_text_extraction",
"ordinance_value_extraction",
"permitted_use_text_extraction",
"permitted_use_value_extraction"
]
},
{
"name": "gpt-4o-mini",
"llm_call_kwargs":{
"temperature": 0,
"timeout": 300
},
"llm_service_rate_limit": 30000,
"text_splitter_chunk_size": 10000,
"text_splitter_chunk_overlap": 500,
"client_type": "openai",
"client_kwargs": {
"api_key": "<ADD OPENAI API KEY HERE>",
},
"tasks": [
"date_extraction",
"document_content_validation",
"document_location_validation",
]
},
],
// Number of URLs to check per jurisdiction
// larger number = more docs to search
"num_urls_to_check_per_jurisdiction": 5,
// Try to keep reasonably low, especially on laptops
"max_num_concurrent_browsers": 10,
// Only search 5 jurisdictions concurrently (at a time).
// Most likely you are limited by LLM rate limits, so setting this
// to some value (~5-15) prevents submitting too many concurrent
// futures that would just sit idle, awaiting their turn to query the LLM
"max_num_concurrent_jurisdictions": 5,
"file_loader_kwargs": {
"pw_launch_kwargs": {
// set to "false" to see browser open and watch
// the search queries happen in real time
"headless": true,
// slow-mo delay, in milliseconds
// only applies if headless=false
"slow_mo": 5000,
},
"verify_ssl": false,
},
"pytesseract_exe_fp": "<ADD tesseract.exe PATH HERE OR REMOVE THIS KEY>",
"td_kwargs": {
"dir": ".temp"
},
"tpe_kwargs": {
"max_workers": 4
},
"ppe_kwargs": {
"max_workers": 4
},
"log_dir": "logs",
"clean_dir": "cleaned_text",
"ordinance_file_dir": "ordinance_files",
"jurisdiction_dbs_dir": "jurisdiction_dbs",
"llm_costs": {
// required input to display running cost
// cost values are in $/million tokens
"deployment-gpt-4o-mini": {"prompt": 0.15, "response": 0.6},
"deployment-gpt-4o": {"prompt": 2.5, "response": 10},
"gpt-4o-mini": {"prompt": 2.5, "response": 10},
},
"log_level": "INFO",
}
This setup provides maximum flexibility and is suitable for power users who need fine-grained control over processing behavior, model assignment, cost monitoring, and concurrency. Below are descriptions of the most notable components:
Multiple Model Definitions
You can specify multiple LLMs under the "model" key, each with a unique name and a list of associated tasks.
Every task must be handled by exactly one model, and exactly one of the entries must have "tasks": "default" to catch anything unspecified.
The full list of assignable tasks can be found as attributes of the LLMTasks enum.
LLM Configuration
Each model entry includes:
llm_call_kwargs: Sets LLM-specific query parameters like temperature or timeout.
llm_service_rate_limit: Controls how many tokens can be processed per minute (useful for avoiding rate limit errors from the LLM provider). Set this as high as your deployment will allow to speed up processing.
text_splitter_chunk_size / text_splitter_chunk_overlap: Control how large each text chunk sent to the model is. Larger chunks increase context at the cost of higher token usage.
client_type: Specifies the API provider (e.g., "azure" or "openai").
client_kwargs: Holds credentials and endpoint configuration for the model client, if not specified using environment variables.
Concurrency Settings
The following settings allow tuning for system resource usage and rate limits:
max_num_concurrent_browsers: Limits the number of headless browsers launched for document discovery.
max_num_concurrent_jurisdictions: Controls how many jurisdictions are processed in parallel.
OCR Integration
Set the pytesseract_exe_fp key to enable OCR support for scanned PDFs. Omit this key if OCR is not needed.
Directories
Several directory names are configurable to manage where outputs and intermediate files are stored:
out_dir: Final results and processed outputs.
log_dir: Execution logs.
clean_dir: Cleaned text files.
ordinance_file_dir: Raw ordinance documents.
jurisdiction_dbs_dir: Internal DBs for tracking progress.
Note
Be sure to provide full paths to all files/directories unless you are executing the program from your working folder.
LLM Cost Reporting
The llm_costs block provides per-million-token pricing for each model.
This allows the script to display real-time and final cost estimates based on tracked usage.
Any model not found in the llm_costs block will not contribute to the final cost estimate.
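For example, using the deployment-gpt-4o-mini pricing above ($0.15 per million prompt tokens and $0.60 per million response tokens), a hypothetical run that consumes 10 million prompt tokens and 1 million response tokens would be reported as roughly 10 × 0.15 + 1 × 0.60 = $2.10.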
Execution#
Once you are happy with the configuration parameters, you can kick off the processing using:
compass process -c config.json5
If you’re using pixi, activate the shell first:
pixi shell
compass process -c config.json5
or run with pixi directly:
pixi run compass process -c config.json5
Replace config.json5 with the path to your actual configuration file.
You may also wish to add a -v option to print logs to the terminal (however, keep in mind that the code runs asynchronously, so the logs will not print in order).
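For example:
compass process -c config.json5 -v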
During execution, INFRA-COMPASS will:
Load and validate the jurisdiction CSV.
Attempt to locate and download relevant ordinance documents for each jurisdiction.
Parse and validate the documents.
Extract relevant ordinance text from the documents.
Parse the extracted text to determine the quantitative and qualitative ordinance values within, using decision tree-based LLM queries.
Output structured results to your configured out_dir.
Runtime varies depending on the number of jurisdictions, the number of documents found for each jurisdiction, and the rate limit/output token rate of the LLM(s) used.
Outputs#
After completion, you’ll find several outputs in the out_dir:
Extracted Ordinances: Structured CSV files containing parsed ordinance values.
Ordinance Documents: PDF or text (HTML) documents containing the legal ordinance.
Cleaned Text Files: Text files containing the ordinance-specific text excerpts from the downloaded documents.
Metadata Files: JSON files describing metadata parameters corresponding to your run.
Logs and Debug Files: Helpful for reviewing LLM prompts and tracing any issues.
You can now use these outputs for downstream analysis, visualization, or integration with other NREL tools like reVX setbacks.