elm.web.document.PDFDocument

class PDFDocument(pages, metadata=None, percent_raw_pages_to_keep=25, max_raw_pages=18, num_end_pages_to_keep=2, clean_header_kwargs=None)[source]

Bases: BaseDocument

ELM web PDF document

Parameters:
  • pages (iterable) – Iterable of strings, where each string is a page of a document.

  • metadata (dict, optional) – Optional dict containing metadata for the document. By default, None.

  • percent_raw_pages_to_keep (int, optional) – Percent of “raw” pages to keep. Useful for extracting info from headers/footers of a doc, which are normally stripped to form the “clean” text. By default, 25.

  • max_raw_pages (int, optional) – The max number of raw pages to keep. The number of raw pages will never exceed the total of this value + num_end_pages_to_keep. By default, 18.

  • num_end_pages_to_keep (int, optional) – Number of additional pages to keep from the end of the document. This can be useful to extract more meta info. The number of raw pages will never exceed the total of this value + max_raw_pages. By default, 2.

  • clean_header_kwargs (dict, optional) – Optional dictionary of keyword-value pair arguments to pass to the clean_headers() function. By default, None.
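The three page-count parameters interact, so a small worked sketch may help. The helper below is hypothetical (it is not part of ELM) and only illustrates the documented bound: the number of raw pages never exceeds max_raw_pages + num_end_pages_to_keep. The truncating rounding, the one-page minimum, and the handling of overlapping pages are assumptions made for illustration, not confirmed library behavior.

```python
def estimate_raw_page_count(total_pages, percent_raw_pages_to_keep=25,
                            max_raw_pages=18, num_end_pages_to_keep=2):
    """Hypothetical helper: estimate how many raw pages would be retained.

    Assumptions (not taken from the ELM source): the percentage is
    truncated to a whole number of pages, at least one page is kept for a
    non-empty document, and end pages never overlap the leading pages.
    """
    if total_pages <= 0:
        return 0

    # Pages kept from the start of the document, capped by max_raw_pages.
    from_start = max(1, int(percent_raw_pages_to_keep / 100 * total_pages))
    from_start = min(from_start, max_raw_pages, total_pages)

    # Extra pages kept from the end, limited to the pages that remain.
    from_end = min(num_end_pages_to_keep, total_pages - from_start)

    return from_start + from_end


# 100-page document: 25% is 25 pages, capped at 18, plus 2 end pages.
print(estimate_raw_page_count(100))
# 8-page document: 25% is 2 pages, plus 2 end pages.
print(estimate_raw_page_count(8))
```

Under these assumptions, the default settings retain at most 20 raw pages regardless of document length.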

Methods

from_file(fp, **init_kwargs)

Initialize a PDFDocument object from a .pdf file on disk.

Attributes

CLEAN_HEADER_KWARGS

Default clean_headers() arguments

FILE_EXTENSION

WRITE_KWARGS

empty

True if the document contains no pages.

num_raw_pages_to_keep

Number of raw pages to keep from PDF document

raw_pages

List of (a limited count of) raw pages

text

Cleaned text from document

CLEAN_HEADER_KWARGS = {'char_thresh': 0.6, 'iheaders': [0, 1, 3, -3, -2, -1], 'page_thresh': 0.8, 'split_on': '\n'}

Default clean_headers() arguments
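As a minimal sketch of how a caller might prepare a clean_header_kwargs value, the snippet below copies the defaults listed above and overrides a single entry. The with_overrides helper is hypothetical, and whether PDFDocument merges user-supplied values with these defaults or replaces them entirely is not stated here, so building the full dictionary explicitly is the cautious choice.

```python
# Copied from the documented CLEAN_HEADER_KWARGS class attribute.
DEFAULT_CLEAN_HEADER_KWARGS = {
    "char_thresh": 0.6,
    "iheaders": [0, 1, 3, -3, -2, -1],
    "page_thresh": 0.8,
    "split_on": "\n",
}


def with_overrides(defaults, **overrides):
    """Hypothetical helper: return a copy of `defaults` with overrides applied."""
    unknown = set(overrides) - set(defaults)
    if unknown:
        # Reject misspelled keys early instead of passing them downstream.
        raise KeyError(f"Unknown clean_headers() arguments: {sorted(unknown)}")
    merged = dict(defaults)  # shallow copy; the defaults stay untouched
    merged.update(overrides)
    return merged


custom_kwargs = with_overrides(DEFAULT_CLEAN_HEADER_KWARGS, page_thresh=0.9)
print(custom_kwargs)
```

The resulting dictionary could then be passed as clean_header_kwargs=custom_kwargs when constructing a PDFDocument.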

property num_raw_pages_to_keep

Number of raw pages to keep from PDF document

Type:

int

classmethod from_file(fp, **init_kwargs)[source]

Initialize a PDFDocument object from a .pdf file on disk. This method first tries to extract text with pdftotext (a poppler utility) and then tries OCR with pytesseract.

Parameters:
  • fp (str) – Path to the .pdf file on disk.

  • init_kwargs (dict) – Optional keyword arguments for PDFDocument initialization.

Returns:

out (PDFDocument) – Initialized PDFDocument instance created from the input fp.
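A minimal usage sketch for from_file follows. The load_pdf wrapper and the file name in the usage comment are hypothetical; the import path is taken from the module name at the top of this page. The wrapper checks that the path exists before handing it to the library, and imports ELM lazily so that the path check does not itself require ELM to be installed.

```python
from pathlib import Path


def load_pdf(fp, **init_kwargs):
    """Hypothetical wrapper around PDFDocument.from_file with a path check."""
    fp = Path(fp)
    if not fp.is_file():
        raise FileNotFoundError(f"No PDF found at {fp}")

    # Lazy import: ELM is only needed once a real file is being loaded.
    from elm.web.document import PDFDocument

    # Extra keyword arguments are forwarded to the PDFDocument initializer.
    return PDFDocument.from_file(str(fp), **init_kwargs)


# Example usage (assumes ELM is installed and "report.pdf" exists):
#
#     doc = load_pdf("report.pdf", max_raw_pages=10)
#     if not doc.empty:
#         print(doc.text[:500])        # cleaned text
#         print(len(doc.raw_pages))    # retained raw pages
```

Checking doc.empty before using doc.text guards against PDFs from which no pages could be extracted.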

property empty

True if the document contains no pages.

Type:

bool

property raw_pages

List of (a limited count of) raw pages

Type:

list

property text

Cleaned text from document

Type:

str