elm.ords.extraction.apply.extract_ordinance_text_with_ngram_validation

async extract_ordinance_text_with_ngram_validation(doc, text_splitter, n=4, num_extraction_attempts=3, ngram_fraction_threshold=0.95, **kwargs)[source]

Extract ordinance text for a single document with known ord info.

This extraction includes an “ngram” check, which attempts to detect whether the cleaned text was actually drawn from the original ordinance text. If the validation score falls below the threshold, the extraction is retried until the maximum number of attempts is reached. If the text still fails validation at that point, there is a good chance that the LLM hallucinated parts of the output text, so treat the result with caution.
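The check described above can be illustrated with a minimal, synchronous sketch. This is not the library's implementation: the helper names (`ngrams`, `ngram_fraction`, `extract_with_validation`), the whitespace tokenization, the lowercasing, and the choice to keep the highest-scoring attempt are all assumptions made for illustration. The `extract` argument stands in for the LLM-backed cleaning step.

```python
def ngrams(text, n=4):
    """Return the set of word n-grams in ``text`` (assumed tokenization:
    lowercase, split on whitespace)."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def ngram_fraction(original, cleaned, n=4):
    """Fraction of n-grams in ``cleaned`` that also occur in ``original``."""
    cleaned_ngrams = ngrams(cleaned, n)
    if not cleaned_ngrams:
        # Text shorter than n words has no n-grams; treat it as failing.
        return 0.0
    return len(cleaned_ngrams & ngrams(original, n)) / len(cleaned_ngrams)


def extract_with_validation(original, extract, n=4, num_attempts=3,
                            threshold=0.95):
    """Call the hypothetical ``extract`` callable until its output passes
    the n-gram check or the attempts run out.

    Returns ``(text, score)`` for the best-scoring attempt (an assumption;
    the source does not say which attempt is returned on failure).
    """
    best_text, best_score = "", -1.0
    for _ in range(max(1, num_attempts)):
        cleaned = extract(original)
        score = ngram_fraction(original, cleaned, n)
        if score > best_score:
            best_text, best_score = cleaned, score
        if score >= threshold:
            break  # validation passed, no further attempts needed
    return best_text, best_score
```

Under these assumptions, a cleaned text that is a verbatim excerpt of the original scores 1.0, while inserted or reworded passages lower the score in proportion to the number of n-grams they touch.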

Parameters:
  • doc (elm.web.document.BaseDocument) – A document known to contain ordinance information. This means it must contain an "ordinance_text" key in the metadata. You can run check_for_ordinance_info() to have this key populated automatically for documents that are found to contain ordinance data. Note that if the document’s metadata does not contain the "ordinance_text" key, the document will not be processed.

  • text_splitter (obj) – Instance of an object that implements a split_text method. The method should take text as input (str) and return a list of text chunks. Langchain’s text splitters should work for this input.

  • n (int, optional) – Number of words to include per ngram for the ngram validation, which helps ensure that the LLM did not hallucinate. By default, 4.

  • num_extraction_attempts (int, optional) – Maximum number of extraction attempts to make before returning text that did not pass the ngram check. If every attempt fails the check, there is a good chance that the LLM hallucinated parts of the output text. Must be a positive integer. By default, 3.

  • ngram_fraction_threshold (float, optional) – Minimum fraction of ngrams in the cleaned text that must also be found in the original ordinance text for the extraction to be considered successful. Should be a value between 0 and 1 (inclusive). By default, 0.95.

  • **kwargs – Keyword-value pairs used to initialize an elm.ords.llm.LLMCaller instance.
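As an illustration of the text_splitter parameter above, any object exposing a split_text method that takes a str and returns a list of str chunks should satisfy the described interface. The class below is a hypothetical, minimal sketch (the name `SimpleChunkSplitter` and its paragraph-packing strategy are assumptions, not part of the library); in practice a LangChain text splitter would typically be passed instead.

```python
class SimpleChunkSplitter:
    """Hypothetical splitter that packs paragraphs into chunks of at most
    ``chunk_size`` characters where possible."""

    def __init__(self, chunk_size=1000):
        if chunk_size <= 0:
            raise ValueError("chunk_size must be positive")
        self.chunk_size = chunk_size

    def split_text(self, text):
        """Split ``text`` on blank lines and return a list of str chunks.

        A single paragraph longer than ``chunk_size`` is kept whole rather
        than being cut mid-sentence.
        """
        chunks, current = [], ""
        for paragraph in text.split("\n\n"):
            paragraph = paragraph.strip()
            if not paragraph:
                continue
            candidate = f"{current}\n\n{paragraph}" if current else paragraph
            if current and len(candidate) > self.chunk_size:
                # Adding this paragraph would overflow: close the chunk.
                chunks.append(current)
                current = paragraph
            else:
                current = candidate
        if current:
            chunks.append(current)
        return chunks
```

An instance such as `SimpleChunkSplitter(chunk_size=3000)` could then be supplied as the text_splitter argument, assuming the function relies only on the split_text method described above.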

Returns:

elm.web.document.BaseDocument – Document that has been parsed for ordinance text. The results of the extraction are stored in the document’s metadata. In particular, the cleaned ordinance text is stored under the "cleaned_ordinance_text" key.