elm.utilities.parse.read_pdf_ocr

read_pdf_ocr(pdf_bytes, verbose=True)

Read PDF contents from bytes using optical character recognition (OCR).

This function attempts to read the PDF document using OCR, which is one of the few ways to parse a scanned PDF document. To use this function, you will need to install the pytesseract and pdf2image modules; see each package's installation guide for details.

Windows users may also need to apply the fix described in this answer before they can use pytesseract: http://tinyurl.com/v9xr4vrj

Parameters:
  • pdf_bytes (bytes) – Bytes corresponding to a PDF file.

  • verbose (bool, optional) – Option to log errors during parsing. By default, True.

Returns:

iterable – Iterable containing pages of the PDF document. This iterable may be empty if there was an error reading the PDF file.
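
Example:

The following is a minimal usage sketch, not part of the library. It assumes the function is importable as elm.utilities.parse.read_pdf_ocr (per the module path above) and that pytesseract and pdf2image are installed. The helper name ocr_pdf_pages, its reader argument, and the file name scanned.pdf are hypothetical and exist only for illustration. It shows reading a file into bytes, passing them to the function, and handling the empty result that signals a read error.

```python
from pathlib import Path


def ocr_pdf_pages(pdf_path, reader=None, verbose=True):
    """Read a PDF file from disk and return its OCR'd pages as a list.

    `reader` is a hypothetical hook: it defaults to elm's read_pdf_ocr but
    can be swapped out, e.g. to try the helper without OCR dependencies.
    """
    if reader is None:
        # Imported lazily because it needs elm, pytesseract and pdf2image.
        from elm.utilities.parse import read_pdf_ocr as reader

    pdf_bytes = Path(pdf_path).read_bytes()
    # The documented return value is an iterable of pages; materialize it
    # so the caller can check its length and index into it.
    return list(reader(pdf_bytes, verbose=verbose))


# Hypothetical usage (requires the OCR dependencies and a real file):
#
#     pages = ocr_pdf_pages("scanned.pdf")
#     if not pages:
#         print("No pages returned; the PDF may not have been readable.")
#     else:
#         print(f"Parsed {len(pages)} page(s)")
```

Because the returned iterable may be empty when the PDF cannot be read, checking for an empty result is a simple way to detect a failed parse; with verbose=True the errors are also logged.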