elm.web.document.BaseDocument

class BaseDocument(pages, metadata=None)

Bases: ABC

Base ELM web document representation.

Purpose:

Track document content and perform minor processing on it.

Responsibilities:
  1. Store “raw” document text.

  2. Compute “cleaned” text, which combines pages, strips HTML, and formats tables.

  3. Track pages and other document metadata.

Key Relationships:

Created by AsyncFileLoader and used throughout the ordinance code.

Parameters:
  • pages (iterable) – Iterable of strings, where each string is a page of a document.

  • metadata (dict, optional) – Dict containing metadata for the document. By default, None.

Attributes

FILE_EXTENSION    Cleaned document file extension.
WRITE_KWARGS      Dict of kwargs to pass to open when writing this doc.
empty             True if the document contains no pages.
raw_pages         List of raw pages (limited in count).
text              Cleaned text from the document.

property empty

True if the document contains no pages.

Type:

bool

property raw_pages

List of raw pages (limited in count).

Type:

list

property text

Cleaned text from the document.

Type:

str

abstract property WRITE_KWARGS

Dict of kwargs to pass to open when writing this doc.

Type:

dict

abstract property FILE_EXTENSION

Cleaned document file extension.

Type:

str
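
Because WRITE_KWARGS and FILE_EXTENSION are abstract, a concrete subclass has to supply both before it can be instantiated. The sketch below illustrates that contract using a hypothetical stand-in base class (SketchBaseDocument) that mirrors only the interface documented on this page; it is not the ELM implementation. The subclass name PlainTextDocument, the helper write_document, the newline-joined text, and the extension value without a leading dot are all assumptions made for illustration.

```python
from abc import ABC, abstractmethod
from pathlib import Path


class SketchBaseDocument(ABC):
    """Hypothetical stand-in that mirrors the documented interface only."""

    def __init__(self, pages, metadata=None):
        self.pages = list(pages)  # accept any iterable of page strings
        self.metadata = metadata or {}

    @property
    def empty(self):
        """True if the document contains no pages."""
        return not self.pages

    @property
    def text(self):
        # Assumption: pages are simply joined with newlines here; the real
        # class also strips HTML and formats tables.
        return "\n".join(self.pages)

    @property
    @abstractmethod
    def WRITE_KWARGS(self):
        """Dict of kwargs to pass to open when writing this doc."""

    @property
    @abstractmethod
    def FILE_EXTENSION(self):
        """Cleaned document file extension."""


class PlainTextDocument(SketchBaseDocument):
    """Hypothetical concrete subclass supplying both abstract properties."""

    WRITE_KWARGS = {"mode": "w", "encoding": "utf-8"}
    FILE_EXTENSION = "txt"  # assumption: stored without a leading dot


def write_document(doc, directory, stem="document"):
    """Write doc.text to <directory>/<stem>.<FILE_EXTENSION>."""
    path = Path(directory) / f"{stem}.{doc.FILE_EXTENSION}"
    # WRITE_KWARGS is forwarded to the built-in open, as the docs describe.
    with open(path, **doc.WRITE_KWARGS) as handle:
        handle.write(doc.text)
    return path
```

With this stand-in, instantiating SketchBaseDocument directly raises TypeError, while PlainTextDocument(["page one", "page two"]) can be constructed and passed to write_document. Overriding the abstract properties with plain class attributes is one simple way to satisfy the ABC; defining them as properties on the subclass would work as well.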