elm.web.document.PDFDocument

class PDFDocument(pages, metadata=None, percent_raw_pages_to_keep=25, max_raw_pages=18, num_end_pages_to_keep=2, clean_header_kwargs=None)

Bases: BaseDocument

ELM web PDF document

Parameters:

pages (iterable) – Iterable of strings, where each string is a page of a document.
metadata (dict, optional) – Optional dict containing metadata for the document. By default, None.
percent_raw_pages_to_keep (int, optional) – Percent of “raw” pages to keep. Useful for extracting info from headers/footers of a doc, which are normally stripped to form the “clean” text. By default, 25.
max_raw_pages (int, optional) – The max number of raw pages to keep. The number of raw pages will never exceed the total of this value + num_end_pages_to_keep. By default, 18.
num_end_pages_to_keep (int, optional) – Number of additional pages to keep from the end of the document. This can be useful to extract more meta info. The number of raw pages will never exceed the total of this value + max_raw_pages. By default, 2.
clean_header_kwargs (dict, optional) – Optional dictionary of keyword-value pair arguments to pass to the clean_headers() function. By default, None.
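
The interaction of the three page-count parameters can be sketched with plain arithmetic. The helper below, `estimate_raw_page_count`, is hypothetical and not part of ELM. It assumes the percentage is applied to the total page count and capped at `max_raw_pages`, and that end pages are added only from pages not already counted. The library's actual rounding and edge-case handling may differ.

```python
def estimate_raw_page_count(num_pages, percent_raw_pages_to_keep=25,
                            max_raw_pages=18, num_end_pages_to_keep=2):
    """Estimate how many raw pages are kept (illustrative assumption only)."""
    # Leading pages: a percentage of the document, capped at max_raw_pages
    # and at the document length.
    leading = num_pages * percent_raw_pages_to_keep // 100
    leading = min(leading, max_raw_pages, num_pages)
    # Trailing pages: taken only from pages not already counted as leading.
    trailing = min(num_end_pages_to_keep, max(0, num_pages - leading))
    return leading + trailing
```

Under these assumptions a 100-page document keeps 18 leading pages plus 2 end pages, which matches the documented upper bound of max_raw_pages + num_end_pages_to_keep.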

Methods

from_file(fp, **init_kwargs)

Initialize a PDFDocument object from a .pdf file on disk.

Attributes

`CLEAN_HEADER_KWARGS`	Default `clean_headers()` arguments
`FILE_EXTENSION`
`WRITE_KWARGS`
`empty`	`True` if the document contains no pages.
`num_raw_pages_to_keep`	Number of raw pages to keep from PDF document
`raw_pages`	List of (a limited count of) raw pages
`text`	Cleaned text from document

CLEAN_HEADER_KWARGS = {'char_thresh': 0.6, 'iheaders': [0, 1, 3, -3, -2, -1], 'page_thresh': 0.8, 'split_on': '\n'}

Default clean_headers() arguments

property num_raw_pages_to_keep

Number of raw pages to keep from PDF document

Type: int

classmethod from_file(fp, **init_kwargs)

Initialize a PDFDocument object from a .pdf file on disk. This method first tries pdftotext (a Poppler utility) and then OCR with pytesseract.

Parameters:

fp (str) – Filepath to the .pdf file on disk.
init_kwargs (dict) – Optional keyword arguments for PDFDocument initialization.

Returns:

out (PDFDocument) – PDFDocument instance initialized from the input fp.
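
The two-stage reading order described for from_file can be illustrated with a generic fallback loop. This is a sketch, not ELM's implementation: `read_pages_with_fallback` and the extractor callables are hypothetical stand-ins for a pdftotext reader and a pytesseract OCR reader, and it assumes the second reader is tried only when the first fails or returns no usable text.

```python
def read_pages_with_fallback(fp, extractors):
    """Return pages from the first extractor that yields non-blank text.

    `extractors` is an ordered list of callables, each taking a filepath
    and returning an iterable of page strings (hypothetical interface).
    """
    for extract in extractors:
        try:
            pages = list(extract(fp))
        except Exception:
            # This reader could not handle the file; try the next one.
            continue
        if any(page.strip() for page in pages):
            return pages
    # No reader produced usable text.
    return []
```

Returning an empty list when every reader fails lets the caller build a document whose `empty` property can then be checked, rather than handling an exception.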

property empty

True if the document contains no pages.

Type: bool

property raw_pages

List of (a limited count of) raw pages

Type: list

property text

Cleaned text from document

Type: str
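
A minimal usage sketch follows, relying only on the members documented on this page (from_file, empty, raw_pages, text). The wrapper `summarize_pdf` and its `document_cls` parameter are hypothetical, added so the example can be exercised without ELM installed. By default it imports PDFDocument from elm.web.document, which assumes the package and its PDF dependencies are available.

```python
def summarize_pdf(fp, document_cls=None, **init_kwargs):
    """Load a PDF and report basic sizes, or None if it has no pages."""
    if document_cls is None:
        # Deferred import: only needed when no stand-in class is supplied.
        from elm.web.document import PDFDocument as document_cls

    # Extra keyword arguments are forwarded to the document initializer.
    doc = document_cls.from_file(fp, **init_kwargs)
    if doc.empty:
        return None
    return {
        "num_raw_pages": len(doc.raw_pages),  # limited count of raw pages
        "num_chars": len(doc.text),           # length of the cleaned text
    }
```

A call such as `summarize_pdf("report.pdf", max_raw_pages=5)` would pass `max_raw_pages` through to the initializer; the file name here is a placeholder.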