compass.extraction.apply.extract_ordinance_text_with_ngram_validation

async extract_ordinance_text_with_ngram_validation(doc, text_splitter, extractor, original_text_key, n=4, num_extraction_attempts=3, ngram_fraction_threshold=0.9, ngram_ocr_fraction_threshold=0.75)

Extract ordinance text for a single document known to contain ordinance information.

This extraction includes an “ngram” check, which estimates whether the cleaned text was actually drawn from the original ordinance text. If the validation score falls below the threshold, the text is re-extracted until the check passes or the maximum number of attempts is reached. If the text still fails validation at that point, there is a good chance that the LLM hallucinated parts of the output text, so treat the result with caution.
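The idea behind the ngram check can be illustrated with a minimal sketch. This is not the library's implementation: the helper names (ngrams, ngram_fraction), the regex-based word tokenization, and the use of sets of unique ngrams are all assumptions made for illustration.

```python
import re


def ngrams(text, n=4):
    """Return the set of lowercase word n-grams in ``text``.

    Tokenizing on ``\\w+`` is an assumption; the real check may
    normalize text differently.
    """
    words = re.findall(r"\w+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def ngram_fraction(original, cleaned, n=4):
    """Fraction of n-grams in ``cleaned`` that also occur in ``original``."""
    cleaned_ngrams = ngrams(cleaned, n)
    if not cleaned_ngrams:
        # Nothing to validate (text shorter than n words)
        return 0.0
    shared = cleaned_ngrams & ngrams(original, n)
    return len(shared) / len(cleaned_ngrams)


original = "Wind turbines must be set back 500 feet from any property line."
faithful = "Wind turbines must be set back 500 feet"
invented = "Wind turbines must be painted bright green at night"

print(ngram_fraction(original, faithful))  # every ngram is in the original
print(ngram_fraction(original, invented))  # only the first ngram matches
```

Text copied from the original scores close to 1, while text containing invented wording scores much lower, which is what makes the fraction usable as a hallucination signal.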

Parameters:

doc (elm.web.document.BaseDocument) – A document known to contain ordinance information. This means it must contain an "ordinance_text" key in the attrs. You can run check_for_ordinance_info() to have this attribute populated automatically for documents that are found to contain ordinance data. Note that if the document’s attrs does not contain the "ordinance_text" key, it will not be processed.
text_splitter (TextSplitter) – LangChain text splitter (or subclass instance), or any object that implements a split_text method. The method should take text as input (str) and return a list of text chunks.
original_text_key (str) – String corresponding to the doc.attrs key containing the original text (before extraction).
n (int, optional) – Number of words to include per ngram for the ngram validation, which helps ensure that the LLM did not hallucinate. By default, 4.
num_extraction_attempts (int, optional) – Maximum number of extraction attempts before returning text that did not pass the ngram check. If every attempt fails the check, there is a good chance that the LLM hallucinated parts of the output text. Must be a positive integer. By default, 3.
ngram_fraction_threshold (float, optional) – Fraction of ngrams in the cleaned text that are also found in the original ordinance text (parsed using poppler) for the extraction to be considered successful. Should be a value between 0 and 1 (inclusive). By default, 0.9.
ngram_ocr_fraction_threshold (float, optional) – Fraction of ngrams in the cleaned text that are also found in the original ordinance text (parsed using OCR) for the extraction to be considered successful. Should be a value between 0 and 1 (inclusive). By default, 0.75.
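How the attempt limit and the two thresholds might interact can be sketched as a retry loop. Everything here beyond the parameter names is hypothetical: extract_with_retries, the extract_once coroutine, the from_ocr flag, and the choice to keep the best-scoring attempt are assumptions for illustration, not the library's actual behavior.

```python
import asyncio
import re


def ngram_fraction(original, cleaned, n=4):
    """Fraction of word n-grams in ``cleaned`` also found in ``original``."""
    def grams(text):
        words = re.findall(r"\w+", text.lower())
        return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

    cleaned_grams = grams(cleaned)
    if not cleaned_grams:
        return 0.0
    return len(cleaned_grams & grams(original)) / len(cleaned_grams)


async def extract_with_retries(original, extract_once, from_ocr=False, n=4,
                               num_extraction_attempts=3,
                               ngram_fraction_threshold=0.9,
                               ngram_ocr_fraction_threshold=0.75):
    """Re-run a hypothetical ``extract_once`` coroutine until validation passes.

    Returns ``(text, score)`` for the best attempt seen (an assumption).
    """
    if num_extraction_attempts < 1:
        raise ValueError("num_extraction_attempts must be a positive integer")

    # Assumed rule: OCR text is noisier, so it gets the looser threshold
    threshold = (ngram_ocr_fraction_threshold if from_ocr
                 else ngram_fraction_threshold)

    best_text, best_score = "", -1.0
    for _ in range(num_extraction_attempts):
        cleaned = await extract_once(original)
        score = ngram_fraction(original, cleaned, n)
        if score > best_score:
            best_text, best_score = cleaned, score
        if score >= threshold:
            break  # validation passed; stop retrying
    return best_text, best_score


async def demo():
    calls = 0

    async def flaky_extractor(text):
        # Stand-in for an LLM call: bad output first, faithful output after
        nonlocal calls
        calls += 1
        return "completely invented words with no overlap" if calls == 1 else text

    source = "Towers shall be set back 1.1 times the tower height."
    text, score = await extract_with_retries(source, flaky_extractor)
    print(calls, score)


if __name__ == "__main__":
    asyncio.run(demo())
```

In this sketch a caller can compare the returned score against the threshold to decide whether the text should be trusted or flagged for review.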

Returns:

elm.web.document.BaseDocument – Document that has been parsed for ordinance text. The results of the extraction are stored in the document’s attrs.