elm.utilities.parse
ELM parsing utilities.
Functions
|
Clean headers/footers that are duplicated across pages of a document. |
|
Combine pages of GPT cleaned text into a single string. |
|
Format tables within HTML text into pretty markdown. |
|
Call to HTML2Text class with basic args. |
|
Does the text look like it has multiple vertical text columns? |
|
Read PDF contents from bytes. |
|
Read PDF contents from bytes using optical character recognition (OCR). |
|
Remove any blank pages from the iterable. |
|
Replace empty lines (potentially with page numbers only) with newlines. |
|
Re-format text to remove common pdf-converter chars. |
|
Replace instances of three or more newlines with "\n\n" |
|
Replace instances of three or more dots (.....) with just "..." |
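
The last two cleanup steps in the table are simple enough to illustrate with regular expressions. The following is a minimal sketch of the idea, not the library's implementation: the function names `collapse_newlines` and `collapse_dots` are hypothetical and chosen only for this example, and the actual functions in `elm.utilities.parse` may handle additional cases.

```python
import re


def collapse_newlines(text):
    """Replace runs of three or more newlines with exactly two (hypothetical helper)."""
    return re.sub(r"\n{3,}", "\n\n", text)


def collapse_dots(text):
    """Replace runs of three or more dots with exactly three (hypothetical helper)."""
    return re.sub(r"\.{3,}", "...", text)


# A table-of-contents style line followed by excess blank lines.
sample = "Contents.......12\n\n\n\nChapter 1"
cleaned = collapse_dots(collapse_newlines(sample))
```

Here `cleaned` keeps a single blank line between the two lines of text and shortens the dot leader to an ellipsis. The order of the two calls does not matter, since the patterns match disjoint characters.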