elm.utilities.parse.read_pdf_ocr

read_pdf_ocr(pdf_bytes, verbose=True)

Read PDF contents from bytes using optical character recognition (OCR).

This function attempts to read the PDF document using OCR, which is one of the few ways to parse a scanned PDF document. To use this function, you will need to install the pytesseract and pdf2image packages. Installation guides:

pytesseract:

https://github.com/madmaze/pytesseract?tab=readme-ov-file#installation

pdf2image:

https://github.com/Belval/pdf2image?tab=readme-ov-file#how-to-install

Windows users may also need to apply the fix described in this answer before they can use pytesseract: http://tinyurl.com/v9xr4vrj
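For orientation, the general technique described above (render each PDF page to an image, then run OCR on each image) can be sketched as follows. This is a hypothetical illustration, not the actual implementation of read_pdf_ocr: the name ocr_pdf_sketch and the to_images and to_text parameters are assumptions introduced here so that the two steps can be swapped out. The defaults use pdf2image.convert_from_bytes and pytesseract.image_to_string.

```python
import logging

logger = logging.getLogger(__name__)


def ocr_pdf_sketch(pdf_bytes, verbose=True, to_images=None, to_text=None):
    """Hypothetical sketch of OCR-based PDF reading (not the elm source).

    to_images and to_text are illustrative hooks; when left as None they
    fall back to pdf2image and pytesseract, which must be installed.
    """
    try:
        if to_images is None:
            # Renders each PDF page to a PIL image.
            from pdf2image import convert_from_bytes as to_images
        if to_text is None:
            # Runs Tesseract OCR on a single image and returns a string.
            from pytesseract import image_to_string as to_text
        return [to_text(image) for image in to_images(pdf_bytes)]
    except Exception as err:
        # Mirror the documented behavior: optionally log the error and
        # return an empty result instead of raising.
        if verbose:
            logger.error("Failed to read PDF with OCR: %s", err)
        return []
```

Importing the OCR packages inside the function keeps them optional: the module can be imported without them, and a missing package surfaces as a logged error and an empty list rather than an import-time failure.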

Parameters:

pdf_bytes (bytes) – Bytes corresponding to a PDF file.
verbose (bool, optional) – Option to log errors during parsing. By default, True.

Returns:

iterable – Iterable containing pages of the PDF document. This iterable may be empty if there was an error reading the PDF file.
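Because the returned iterable may be empty when reading fails, callers will usually want to check for that case explicitly. The helper below is a minimal sketch of one way to do so; read_scanned_pdf and its reader parameter are hypothetical names introduced for illustration, with reader standing in for read_pdf_ocr.

```python
from pathlib import Path


def read_scanned_pdf(path, reader):
    """Load a PDF from disk and return its pages as a list.

    reader is any callable with the signature of read_pdf_ocr,
    i.e. reader(pdf_bytes, verbose=True) -> iterable of pages.
    """
    pdf_bytes = Path(path).read_bytes()
    pages = list(reader(pdf_bytes, verbose=True))
    if not pages:
        # An empty result signals that the PDF could not be read.
        raise ValueError(f"No pages could be read from {path}")
    return pages
```

Assuming elm and its OCR dependencies are installed, this could be called as read_scanned_pdf("scan.pdf", read_pdf_ocr) after from elm.utilities.parse import read_pdf_ocr, where "scan.pdf" is a placeholder path.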