elm.utilities.parse

ELM parsing utilities.

Functions

clean_headers(pages[, char_thresh, ...])

Clean headers/footers that are duplicated across pages of a document.

combine_pages(pages)

Combine pages of GPT-cleaned text into a single string.

format_html_tables(text, **kwargs)

Format tables within HTML text into pretty markdown.

html_to_text(html[, ignore_links])

Convert HTML to text by calling the HTML2Text class with basic arguments.

is_multi_col(text[, separator, threshold_ratio])

Does the text look like it has multiple vertical text columns?

read_pdf(pdf_bytes[, verbose])

Read PDF contents from bytes.

read_pdf_ocr(pdf_bytes[, verbose])

Read PDF contents from bytes using optical character recognition (OCR).

remove_blank_pages(pages)

Remove any blank pages from the iterable.
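As a rough illustration of the idea, the sketch below filters an iterable of page strings. It is not the ELM implementation: the name `drop_blank_pages` is hypothetical, and it assumes that "blank" means a page that is empty or contains only whitespace.

```python
from typing import Iterable, List


def drop_blank_pages(pages: Iterable[str]) -> List[str]:
    """Return the pages that contain at least one non-whitespace character.

    Assumption: a "blank" page is an empty or whitespace-only string.
    """
    return [page for page in pages if page.strip()]


pages = ["Title page", "  \n\t", "", "Body text"]
kept = drop_blank_pages(pages)
```

Returning a list keeps the page order intact, so later steps can still treat the list index as a position within the remaining pages.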

remove_empty_lines_or_page_footers(text)

Replace empty lines (potentially containing only page numbers) with newlines.

replace_common_pdf_conversion_chars(text)

Reformat text to remove common PDF-converter characters.

replace_excessive_newlines(text)

Replace instances of three or more consecutive newlines with \n\n.
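One way such a replacement could be written is with a single regular expression. This is a minimal sketch, not the ELM source: the name `collapse_newlines` is hypothetical, and it assumes the newlines are directly adjacent, with no spaces between them.

```python
import re


def collapse_newlines(text: str) -> str:
    """Replace each run of three or more newlines with exactly two."""
    return re.sub(r"\n{3,}", "\n\n", text)


sample = "First paragraph.\n\n\n\n\nSecond paragraph."
cleaned = collapse_newlines(sample)
```

Runs of one or two newlines are left untouched, so ordinary paragraph breaks survive.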

replace_multi_dot_lines(text)

Replace instances of three or more dots (.....) with just "...".
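Dot leaders of this kind are common in tables of contents extracted from PDFs. The sketch below shows one possible approach with a regular expression. It is illustrative only: `collapse_dot_runs` is a hypothetical name, and it assumes the dots are contiguous rather than separated by spaces.

```python
import re


def collapse_dot_runs(text: str) -> str:
    """Replace each run of three or more literal dots with "..."."""
    return re.sub(r"\.{3,}", "...", text)


toc_line = "1 Introduction ............................ 5"
short_line = collapse_dot_runs(toc_line)
```

Runs of one or two dots, such as a sentence-ending period, are not matched and pass through unchanged.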