elm.web.document.PDFDocument
- class PDFDocument(pages, metadata=None, percent_raw_pages_to_keep=25, max_raw_pages=18, num_end_pages_to_keep=2, clean_header_kwargs=None)[source]
Bases: BaseDocument
ELM web PDF document
- Parameters:
pages (iterable) – Iterable of strings, where each string is a page of a document.
metadata (dict, optional) – Optional dict containing metadata for the document. By default, None.
percent_raw_pages_to_keep (int, optional) – Percent of “raw” pages to keep. Useful for extracting info from headers/footers of a doc, which are normally stripped to form the “clean” text. By default, 25.
max_raw_pages (int, optional) – The maximum number of raw pages to keep. The number of raw pages will never exceed the sum of this value and num_end_pages_to_keep. By default, 18.
num_end_pages_to_keep (int, optional) – Number of additional pages to keep from the end of the document. This can be useful for extracting more meta info. The number of raw pages will never exceed the sum of this value and max_raw_pages. By default, 2.
clean_header_kwargs (dict, optional) – Optional dictionary of keyword-value pair arguments to pass to the clean_headers() function. By default, None.
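The interaction between percent_raw_pages_to_keep, max_raw_pages, and num_end_pages_to_keep can be easier to see with numbers. The following is a minimal sketch of the bound described above, not the library's implementation: the function name sketch_raw_page_indices is hypothetical, and the rounding and "at least one page" behavior are assumptions made for illustration.

```python
def sketch_raw_page_indices(num_pages, percent_raw_pages_to_keep=25,
                            max_raw_pages=18, num_end_pages_to_keep=2):
    """Illustrative only: choose leading and trailing page indices."""
    if num_pages <= 0:
        return []

    # Leading pages: a percentage of the document, at least one page
    # (an assumption), capped at max_raw_pages.
    num_front = int(percent_raw_pages_to_keep / 100 * num_pages)
    num_front = min(max_raw_pages, max(1, num_front), num_pages)

    # Trailing pages: up to num_end_pages_to_keep, without repeating
    # pages already taken from the front.
    num_end = min(num_end_pages_to_keep, num_pages - num_front)

    front = list(range(num_front))
    end = list(range(num_pages - num_end, num_pages))
    return front + end


for n in (2, 8, 100):
    indices = sketch_raw_page_indices(n)
    print(n, len(indices), indices)
```

With the default values, a 100-page document in this sketch keeps 18 leading pages plus 2 trailing pages, so the total never exceeds max_raw_pages + num_end_pages_to_keep.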
Methods
from_file(fp, **init_kwargs) – Initialize a PDFDocument object from a .pdf file on disk.
Attributes
CLEAN_HEADER_KWARGS – Default clean_headers() arguments
FILE_EXTENSION
WRITE_KWARGS
empty – True if the document contains no pages.
num_raw_pages_to_keep – Number of raw pages to keep from PDF document
raw_pages – List of (a limited count of) raw pages
text – Cleaned text from document
- CLEAN_HEADER_KWARGS = {'char_thresh': 0.6, 'iheaders': [0, 1, 3, -3, -2, -1], 'page_thresh': 0.8, 'split_on': '\n'}
Default clean_headers() arguments
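As a rough illustration of how these defaults might be adjusted, the sketch below copies the dictionary shown above and overrides a single key. Two things here are assumptions rather than documented behavior: that user-supplied clean_header_kwargs would be combined with the defaults in this way, and that iheaders are indices into the lines obtained by splitting a page on split_on. The helper names merged_clean_header_kwargs and candidate_header_lines are hypothetical.

```python
# Defaults copied from the CLEAN_HEADER_KWARGS attribute documented above.
CLEAN_HEADER_KWARGS = {
    "char_thresh": 0.6,
    "iheaders": [0, 1, 3, -3, -2, -1],
    "page_thresh": 0.8,
    "split_on": "\n",
}


def merged_clean_header_kwargs(overrides=None):
    """Hypothetical helper: defaults with user overrides applied on top."""
    return {**CLEAN_HEADER_KWARGS, **(overrides or {})}


def candidate_header_lines(page, iheaders, split_on):
    """Assumed reading of iheaders: line indices after splitting a page."""
    lines = page.split(split_on)
    picked = []
    for i in iheaders:
        # Skip indices that fall outside short pages.
        if -len(lines) <= i < len(lines):
            picked.append(lines[i])
    return picked


kwargs = merged_clean_header_kwargs({"char_thresh": 0.7})
page = "Title\nDraft\nIntro\nBody\nMore\nPage 1\nFooter"
print(candidate_header_lines(page, kwargs["iheaders"], kwargs["split_on"]))
```

Under these assumptions, the default iheaders look at the first lines and the last lines of each page, which is where repeated headers and footers usually sit.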
- classmethod from_file(fp, **init_kwargs)[source]
Initialize a PDFDocument object from a .pdf file on disk. This method will try to use pdftotext (a poppler utility) and then OCR with pytesseract.
- Parameters:
fp (str) – File path to the .pdf on disk.
init_kwargs (dict) – Optional keyword arguments for PDFDocument initialization.
- Returns:
out (PDFDocument) – Initialized PDFDocument instance created from the input fp.
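A possible way to call from_file is sketched below. The wrapper load_pdf_document and the extension check are hypothetical additions for illustration; only the import path elm.web.document.PDFDocument and the from_file(fp, **init_kwargs) signature come from this page. Running the final call requires ELM to be installed, along with the PDF tooling it relies on.

```python
from pathlib import Path


def load_pdf_document(fp, **init_kwargs):
    """Hypothetical wrapper around PDFDocument.from_file."""
    fp = Path(fp)
    # Fail early on files that are clearly not PDFs (illustrative check).
    if fp.suffix.lower() != ".pdf":
        raise ValueError(f"Expected a .pdf file, got: {fp.name}")

    # Imported lazily so the check above works without ELM installed.
    from elm.web.document import PDFDocument

    # init_kwargs are forwarded to the PDFDocument initializer,
    # e.g. max_raw_pages=10 or num_end_pages_to_keep=1.
    return PDFDocument.from_file(str(fp), **init_kwargs)
```

For example, load_pdf_document("report.pdf", max_raw_pages=10) would forward max_raw_pages to the PDFDocument initializer; the file name here is a placeholder.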