elm.utilities.parse.read_pdf_ocr

read_pdf_ocr(pdf_bytes, image_to_string_kwargs=None, convert_from_bytes_kwargs=None, verbose=True)

Read PDF contents from bytes using Optical Character Recognition (OCR).

This function attempts to read the PDF document using OCR, which is one of the few ways to parse a scanned PDF document. To use this function, you will need to install the pytesseract and pdf2image modules; see each project's installation guide for instructions.

Windows users may also need to apply the fix described in this answer before they can use pytesseract: http://tinyurl.com/v9xr4vrj

Parameters:
  • pdf_bytes (bytes) – Bytes corresponding to a PDF file.

  • image_to_string_kwargs (dictionary, optional) – Optional dictionary of keyword-value pairs to pass as arguments to the pytesseract.image_to_string() function. By default, None.

  • convert_from_bytes_kwargs (dictionary, optional) – Optional dictionary of keyword-value pairs to pass as arguments to the pdf2image.convert_from_bytes() function. By default, None.

  • verbose (bool, optional) – Option to log errors during parsing. By default, True.
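The two keyword dictionaries above suggest a two-stage pipeline: render each PDF page to an image, then run OCR on each image. The following is a minimal sketch of that idea, not the library's actual implementation. The name `ocr_pdf_bytes` and the `convert` and `to_string` parameters are hypothetical, added here only so the stages can be swapped out or stubbed.

```python
def ocr_pdf_bytes(pdf_bytes, image_to_string_kwargs=None,
                  convert_from_bytes_kwargs=None,
                  convert=None, to_string=None):
    """Hypothetical sketch: render PDF pages to images, then OCR each one."""
    # Import lazily so the module still loads without the OCR dependencies.
    if convert is None:
        from pdf2image import convert_from_bytes as convert
    if to_string is None:
        from pytesseract import image_to_string as to_string

    # Stage 1: one image per page; extra options go to pdf2image.
    images = convert(pdf_bytes, **(convert_from_bytes_kwargs or {}))

    # Stage 2: one string per page; extra options go to pytesseract.
    return [to_string(image, **(image_to_string_kwargs or {}))
            for image in images]
```

Treating `None` as an empty dictionary keeps both stages on their library defaults unless the caller overrides them.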

Returns:

iterable – Iterable containing pages of the PDF document. This iterable may be empty if there was an error reading the PDF file.
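
Because the returned iterable may be empty, callers will usually want to check for that case before using the text. The sketch below shows one way to do so. The helper name `ocr_pdf_file` and its `reader` parameter are hypothetical, the `lang` and `dpi` values are illustrative only, and the code assumes each page comes back as a string.

```python
from pathlib import Path


def ocr_pdf_file(path, reader=None):
    """Hypothetical helper: OCR a PDF on disk; return its text or None."""
    if reader is None:
        # Imported lazily; requires ELM plus pytesseract and pdf2image.
        from elm.utilities.parse import read_pdf_ocr as reader

    pages = list(reader(
        Path(path).read_bytes(),
        image_to_string_kwargs={"lang": "eng"},    # passed to pytesseract
        convert_from_bytes_kwargs={"dpi": 300},    # passed to pdf2image
        verbose=True,
    ))

    # An empty result signals that the PDF could not be read.
    if not pages:
        return None
    return "\n".join(pages)
```

Returning `None` rather than an empty string lets the caller distinguish a failed read from a document whose pages contain no recognizable text.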