elm.utilities.parse.read_pdf_ocr
- read_pdf_ocr(pdf_bytes, verbose=True)[source]
Read PDF contents from bytes using Optical Character recognition (OCR).
This method attempt to read the PDF document using OCR. This is one of the only ways to parse a scanned PDF document. To use this function, you will need to install the pytesseract and pdf2image Modules. Installation guides here:
Windows users may also need to apply the fix described in this answer before they can use pytesseract: http://tinyurl.com/v9xr4vrj
- Parameters:
pdf_bytes (bytes) – Bytes corresponding to a PDF file.
verbose (bool, optional) – Option to log errors during parsing. By default,
True
.
- Returns:
iterable – Iterable containing pages of the PDF document. This iterable may be empty if there was an error reading the PDF file.