elm.ords.services.cpu.read_pdf_doc_ocr

async read_pdf_doc_ocr(pdf_bytes, **kwargs)[source]

Read PDF file from bytes using OCR (pytesseract) in a Process Pool.

Note that Pytesseract must be set up properly for this method to work. In particular, the pytesseract.pytesseract.tesseract_cmd attribute must be set to point to the pytesseract exe.

Parameters:
  • pdf_bytes (bytes) – Bytes containing PDF file.

  • **kwargs – Keyword-value arguments to pass to elm.web.document.PDFDocument initializer.

Returns:

elm.web.document.PDFDocument – PDFDocument instances with pages loaded as text.