elm.utilities.parse

ELM parsing utilities.

Functions

`clean_headers`(pages[, char_thresh, ...])	Clean headers/footers that are duplicated across pages of a document.
`combine_pages`(pages)	Combine pages of GPT cleaned text into a single string.
`format_html_tables`(text, **kwargs)	Format tables within HTML text into pretty markdown.
`html_to_text`(html[, ignore_links])	Call to HTML2Text class with basic args.
`is_multi_col`(text[, separator, threshold_ratio])	Does the text look like it has multiple vertical text columns?
`read_pdf`(pdf_bytes[, verbose])	Read PDF contents from bytes.
`read_pdf_ocr`(pdf_bytes[, verbose])	Read PDF contents from bytes using optical character recognition (OCR).
`remove_blank_pages`(pages)	Remove any blank pages from the iterable.
`remove_empty_lines_or_page_footers`(text)	Replace empty lines (potentially containing only page numbers) with newlines.
`replace_common_pdf_conversion_chars`(text)	Re-format text to remove common pdf-converter chars.
`replace_excessive_newlines`(text)	Replace instances of three or more newlines with `\n\n`.
`replace_multi_dot_lines`(text)	Replace instances of three or more dots (.....) with just `...`.
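
To make the text-cleanup descriptions above concrete, here is a minimal standalone sketch of the behavior described for `remove_blank_pages`, `replace_excessive_newlines`, and `replace_multi_dot_lines`. It uses only the standard library and does not call ELM; the helper names (`drop_blank_pages`, `collapse_newlines`, `collapse_dot_runs`) are hypothetical, and the actual ELM implementations may differ, for example in how they treat whitespace between newlines.

```python
import re


def drop_blank_pages(pages):
    """Keep only pages that contain non-whitespace text (hypothetical helper)."""
    return [page for page in pages if page.strip()]


def collapse_newlines(text):
    """Replace runs of three or more newlines with exactly two (assumed behavior)."""
    return re.sub(r"\n{3,}", "\n\n", text)


def collapse_dot_runs(text):
    """Replace runs of three or more dots with exactly three (assumed behavior)."""
    return re.sub(r"\.{3,}", "...", text)


# Example input: one page with extra newlines, one blank page,
# and one table-of-contents style line with a long run of dots.
pages = ["Title\n\n\n\nIntro", "   \n", "Contents ........ 3"]

cleaned = [
    collapse_dot_runs(collapse_newlines(page))
    for page in drop_blank_pages(pages)
]
print(cleaned)
```

Dropping blank pages first means the regex substitutions only run on pages that carry text, which mirrors the order suggested by the function list but is an assumption rather than a documented pipeline.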