elm.web.document.BaseDocument
- class BaseDocument(pages, metadata=None)
Bases: ABC
Base ELM web document representation
- Purpose:
Track document content and perform minor processing on it.
- Responsibilities:
Store “raw” document text.
Compute “cleaned” text, which combines pages, strips HTML, and formats tables.
Track pages and other document metadata.
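The "cleaned" text responsibility can be illustrated with a rough, standalone sketch. This is not ELM's implementation: the helper name `combine_and_strip` and the regex-based tag removal are assumptions made for illustration only, and table formatting is left out.

```python
import re
from html import unescape

# Hypothetical helper, NOT part of ELM: shows the general idea of
# combining pages and stripping HTML. Table formatting is omitted.
_TAG_RE = re.compile(r"<[^>]+>")


def combine_and_strip(pages):
    """Join pages into one string after removing anything that looks like a tag."""
    cleaned_pages = []
    for page in pages:
        no_tags = _TAG_RE.sub(" ", page)   # replace each tag with a space
        text = unescape(no_tags)           # decode entities such as &lt;
        # Collapse the whitespace runs left behind by removed tags
        cleaned_pages.append(" ".join(text.split()))
    # Skip pages that became empty after cleaning
    return "\n".join(p for p in cleaned_pages if p)


pages = ["<p>Setback: <b>500 ft</b></p>", "<div>Height &lt; 200 m</div>"]
print(combine_and_strip(pages))
```

A regex is enough for a sketch like this one; real HTML usually calls for a proper parser.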
- Key Relationships:
Created by AsyncFileLoader and used throughout the ordinance code.
- Parameters:
pages (iterable) – Iterable of strings, where each string is a page of a document.
metadata (dict, optional) – Optional dict containing metadata for the document. By default, None.
Methods
Attributes
Cleaned document file extension.
Dict of kwargs to pass to open when writing this doc.
True if the document contains no pages.
List of (a limited count of) raw pages
Cleaned text from document
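
To show how the constructor parameters and attributes described above might fit together, here is a minimal stand-in class. It is a sketch under stated assumptions, not ELM's code: the names `SketchDocument`, `PlainTextSketch`, `empty`, `text`, and `_cleaned_text` are all hypothetical, as is the choice to turn `metadata=None` into an empty dict.

```python
from abc import ABC, abstractmethod


class SketchDocument(ABC):
    """Hypothetical stand-in mirroring the documented shape; NOT ELM's class."""

    def __init__(self, pages, metadata=None):
        self.pages = list(pages)         # each item is one page of text
        self.metadata = metadata or {}   # assumption: None becomes an empty dict

    @property
    def empty(self):
        """True if the document contains no pages."""
        return len(self.pages) == 0

    @property
    def text(self):
        """Cleaned text, produced by the subclass-specific hook."""
        return self._cleaned_text()

    @abstractmethod
    def _cleaned_text(self):
        """Subclasses decide how pages are combined and cleaned."""


class PlainTextSketch(SketchDocument):
    """Hypothetical concrete subclass that only trims and joins pages."""

    def _cleaned_text(self):
        return "\n".join(page.strip() for page in self.pages)


doc = PlainTextSketch(["  Page one. ", "Page two."], metadata={"source": "example"})
print(doc.empty)
print(doc.text)
```

Because the base class has an abstract method, instantiating it directly raises `TypeError`; only concrete subclasses can be constructed.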