elm.utilities.parse.clean_headers
- clean_headers(pages, char_thresh=0.6, page_thresh=0.8, split_on='\n', iheaders=(0, 1, -2, -1))
Clean headers/footers that are duplicated across pages of a document.
Note that this function modifies the items of the pages input in place.
- Parameters:
pages (list) – List of pages (as str) from document.
char_thresh (float) – Fraction of characters in a candidate header that must be similar between pages for the header to be considered for removal.
page_thresh (float) – Fraction of pages that must share the header for it to be considered for removal.
split_on (str) – Character(s) on which to split each page into lines.
iheaders (list | tuple) – Integer indices of the lines to check for headers after a page is split into lines on split_on. The indices must be ordered from the start of the page to the end.
- Returns:
out (str) – Cleaned text with all pages joined.
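Example

The following is a minimal, self-contained sketch of how the two thresholds might interact. It is not the elm implementation: the helper name clean_headers_sketch, the use of difflib.SequenceMatcher as the similarity measure, the choice of the first page as the reference, and joining the returned pages with split_on are all assumptions made for illustration.

```python
from difflib import SequenceMatcher


def _line_at(lines, idx):
    """Return lines[idx], or None if the page is too short."""
    try:
        return lines[idx]
    except IndexError:
        return None


def clean_headers_sketch(pages, char_thresh=0.6, page_thresh=0.8,
                         split_on='\n', iheaders=(0, 1, -2, -1)):
    """Hypothetical stand-in for clean_headers (illustration only)."""
    if not pages:
        return ''

    split_pages = [page.split(split_on) for page in pages]

    for ih in iheaders:
        # Assumption: the first page supplies the reference header line.
        ref = _line_at(split_pages[0], ih)
        if not ref:
            continue  # missing, already removed, or empty line

        # A page "shares" the header if its line at the same index is
        # at least char_thresh similar to the reference line.
        hits = []
        for lines in split_pages:
            cand = _line_at(lines, ih)
            similar = (cand is not None and
                       SequenceMatcher(None, ref, cand).ratio() >= char_thresh)
            hits.append(similar)

        # Remove the line only if enough pages share it.
        if sum(hits) / len(split_pages) >= page_thresh:
            for lines, hit in zip(split_pages, hits):
                if hit:
                    lines[ih] = None  # mark as removed; keeps indices stable

    # Update the input list in place, as the note above describes.
    for i, lines in enumerate(split_pages):
        pages[i] = split_on.join(ln for ln in lines if ln is not None)

    return split_on.join(pages)


pages = [
    "ACME Annual Report\nRevenue grew in the first quarter.\nPage 1",
    "ACME Annual Report\nCosts were flat year over year.\nPage 2",
    "ACME Annual Report\nOutlook remains stable.\nPage 3",
]
out = clean_headers_sketch(pages, iheaders=(0, -1))
print(out)
```

In this sketch the footer lines "Page 1", "Page 2" and "Page 3" differ by one character, so they still clear the default char_thresh of 0.6 and are removed along with the repeated title line. The body lines are only kept here because they are never checked: iheaders=(0, -1) inspects just the first and last line of each page.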