elm.ords.services.cpu.PDFLoader

class PDFLoader(**kwargs)[source]

Bases: ProcessPoolService

Class to load PDFs in a ProcessPoolExecutor.

Parameters:

**kwargs – Keyword arguments passed through to concurrent.futures.ProcessPoolExecutor. By default, none are passed.
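As a rough illustration of the **kwargs parameter, the sketch below forwards constructor keyword arguments to concurrent.futures.ProcessPoolExecutor when resources are acquired. The class PoolServiceSketch and its attribute names are hypothetical stand-ins, not ELM code, and the temp-directory handling mentioned under acquire_resources() is left out.

```python
from concurrent.futures import ProcessPoolExecutor


class PoolServiceSketch:
    """Hypothetical stand-in for a process-pool service (not ELM code)."""

    def __init__(self, **kwargs):
        # Keep the keyword arguments until the pool is actually opened.
        self._ppe_kwargs = kwargs
        self.pool = None

    def acquire_resources(self):
        # Forward the stored keyword arguments, e.g. max_workers=2.
        self.pool = ProcessPoolExecutor(**self._ppe_kwargs)

    def release_resources(self):
        # Shut the pool down and drop the reference.
        if self.pool is not None:
            self.pool.shutdown(wait=True)
            self.pool = None
```

Under these assumptions, PoolServiceSketch(max_workers=2) would open a pool with two worker processes once acquire_resources() is called.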

Methods

acquire_resources()

Open the process pool and temp directory.

call(*args, **kwargs)

Call the service.

process(fn, pdf_bytes, **kwargs)

Load a PDF asynchronously in the process pool.

process_using_futures(fut, *args, **kwargs)

Process a call to the service.

release_resources()

Shut down the process pool and clean up the temp directory.

Attributes

MAX_CONCURRENT_JOBS

Max number of concurrent job submissions.

can_process

Always True (limiting is handled by asyncio)

name

Service name used to pull the correct queue object.

property can_process

Always True (limiting is handled by asyncio)

Type:

bool

async process(fn, pdf_bytes, **kwargs)[source]

Load a PDF asynchronously in the process pool.

Parameters:
  • fn (callable) – Function that reads the PDF. It is run in the process pool with pdf_bytes as its first argument.

  • pdf_bytes (bytes) – Raw content of the PDF file.

  • **kwargs – Keyword arguments passed to fn.

Returns:

obj – Result returned by fn.
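The signature process(fn, pdf_bytes, **kwargs) suggests that a function is applied to the PDF bytes inside the pool. A minimal sketch of that pattern, assuming the call is wrapped with functools.partial and handed to the event loop's run_in_executor; the helper name run_in_pool is hypothetical and this is not the ELM implementation:

```python
import asyncio
from concurrent.futures import Executor
from functools import partial


async def run_in_pool(pool: Executor, fn, pdf_bytes, **kwargs):
    """Run fn(pdf_bytes, **kwargs) in the pool without blocking the loop."""
    loop = asyncio.get_running_loop()
    # run_in_executor accepts positional arguments only, so the keyword
    # arguments are bound with partial before submission.
    return await loop.run_in_executor(pool, partial(fn, pdf_bytes, **kwargs))
```

When the pool is a ProcessPoolExecutor, fn and its arguments must be picklable, so fn should be a module-level function rather than a lambda or a nested function.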

MAX_CONCURRENT_JOBS = 10000

Max number of concurrent job submissions.

acquire_resources()

Open the process pool and temp directory.

async classmethod call(*args, **kwargs)

Call the service.

Parameters:

*args, **kwargs – Positional and keyword arguments to be passed to the underlying service processing function.

Returns:

obj – A response object from the underlying service.
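To make the relationship between call() and the service name concrete, here is a toy sketch of queue-based dispatch. Everything in it is an assumption for illustration: the module-level _QUEUES registry, the EchoService class, and the serve_one worker are hypothetical and do not reflect how ELM stores or drains its queues.

```python
import asyncio

# Hypothetical registry mapping service names to job queues.
_QUEUES = {}


class EchoService:
    """Toy service that upper-cases its input."""

    @classmethod
    def service_name(cls):
        return cls.__name__

    @classmethod
    async def call(cls, *args, **kwargs):
        # Create a future, enqueue the job under the service name,
        # then wait for a worker to fill in the result.
        fut = asyncio.get_running_loop().create_future()
        await _QUEUES[cls.service_name()].put((fut, args, kwargs))
        return await fut

    async def process(self, text):
        return text.upper()


async def serve_one(service):
    """Pull a single job from the service's queue and answer it."""
    queue = _QUEUES[service.service_name()]
    fut, args, kwargs = await queue.get()
    fut.set_result(await service.process(*args, **kwargs))


async def demo():
    _QUEUES[EchoService.service_name()] = asyncio.Queue()
    worker = asyncio.create_task(serve_one(EchoService()))
    response = await EchoService.call("pdf")
    await worker
    return response
```

In this sketch the caller never touches the service instance; it only needs the class, which is why call() is written as a classmethod.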

property name

Service name used to pull the correct queue object.

Type:

str

async process_using_futures(fut, *args, **kwargs)

Process a call to the service.

Parameters:
  • fut (asyncio.Future) – A future object that should get the result of the processing operation. If the processing function returns answer, this method should call fut.set_result(answer).

  • *args, **kwargs – Positional and keyword arguments to be passed to the underlying processing function.
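The future contract described above can be sketched as follows. The process function here is a hypothetical placeholder, and forwarding exceptions with fut.set_exception is an assumption that the source text does not state.

```python
import asyncio


async def process(value):
    """Hypothetical processing function; doubles a non-negative input."""
    if value < 0:
        raise ValueError("negative input")
    return value * 2


async def process_using_futures(fut, *args, **kwargs):
    try:
        answer = await process(*args, **kwargs)
    except Exception as err:
        # Assumption: forward failures so the awaiting caller sees them.
        fut.set_exception(err)
    else:
        # Documented behavior: hand the answer to the future.
        fut.set_result(answer)


async def submit(value):
    fut = asyncio.get_running_loop().create_future()
    await process_using_futures(fut, value)
    return await fut
```

Awaiting the future is what delivers the answer to the original caller, so the processing method itself does not need to return anything.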

release_resources()

Shut down the process pool and clean up the temp directory.