elm.web.document.BaseDocument
- class BaseDocument(pages, attrs=None)
Bases: ABC
Base ELM web document representation
- Purpose:
Track document content and perform minor processing on it.
- Responsibilities:
Store “raw” document text.
Compute “cleaned” text, which combines pages, strips HTML, and formats tables.
Track pages and other document metadata.
- Key Relationships:
Created by AsyncFileLoader and used throughout the ordinance code.
- Parameters:
pages (iterable) – Iterable of strings, where each string is a page of a document.
attrs (dict, optional) – Optional dict containing metadata for the document. By default, None.
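The constructor arguments described above could be handled roughly as in the sketch below. This is an illustration only, not the ELM source: `SketchDocument` is a hypothetical stand-in class, and copying the pages into a list and defaulting `attrs` to an empty dict are assumptions about reasonable behavior.

```python
from abc import ABC


class SketchDocument(ABC):
    """Hypothetical stand-in; NOT the real elm.web.document.BaseDocument."""

    def __init__(self, pages, attrs=None):
        # Assumption: copy the iterable into a list so that a generator
        # of pages can be read more than once.
        self.pages = list(pages)
        # Assumption: a missing attrs argument becomes a fresh dict, so
        # instances never share one mutable default.
        self.attrs = dict(attrs) if attrs is not None else {}


# Pages may come from any iterable of strings, including an iterator.
doc = SketchDocument(iter(["page one", "page two"]), attrs={"source": "example.pdf"})
no_meta = SketchDocument(["only page"])

print(doc.pages)      # the two page strings, in order
print(no_meta.attrs)  # an empty dict
```

The real class is abstract, so in practice a concrete subclass would be instantiated instead.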
Methods
Attributes
Cleaned document file extension.
Dict of kwargs to pass to open when writing this doc.
True if the document contains no pages.
List of (a limited count of) raw pages
Cleaned text from document
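To make the attribute descriptions above concrete, here is a minimal, self-contained sketch of how such properties might behave. Everything in it is assumed for illustration: the class names `SketchDocument` and `SketchHTMLDocument`, the property names `empty`, `raw_pages`, and `text`, the `MAX_RAW_PAGES` limit, and the regex-based tag stripping. None of these are confirmed by this page as the library's actual API or cleaning logic.

```python
import re
from abc import ABC, abstractmethod


class SketchDocument(ABC):
    """Hypothetical stand-in for illustration; not the ELM implementation."""

    def __init__(self, pages, attrs=None):
        self.pages = list(pages)
        self.attrs = attrs or {}

    @property
    def empty(self):
        """True if the document contains no pages."""
        return not self.pages

    @property
    def raw_pages(self):
        """A limited count of raw pages; the limit is left to subclasses."""
        return self._raw_pages()

    @property
    def text(self):
        """Cleaned text from the document."""
        return self._cleaned_text()

    @abstractmethod
    def _raw_pages(self):
        """Return the raw pages to expose."""

    @abstractmethod
    def _cleaned_text(self):
        """Return the combined, cleaned text."""


class SketchHTMLDocument(SketchDocument):
    MAX_RAW_PAGES = 2  # assumed limit, chosen only for this example

    def _raw_pages(self):
        return self.pages[: self.MAX_RAW_PAGES]

    def _cleaned_text(self):
        # Crude tag stripping for illustration; real HTML cleaning
        # would use a proper parser.
        stripped = [re.sub(r"<[^>]+>", "", page).strip() for page in self.pages]
        return "\n".join(stripped)


doc = SketchHTMLDocument(
    ["<p>Setback: 500 ft</p>", "<p>Height: 200 ft</p>", "<p>Noise</p>"]
)
print(doc.empty)      # False
print(doc.raw_pages)  # only the first two raw pages
print(doc.text)       # three lines of tag-free text
```

Keeping the cleaning step in an abstract hook mirrors the idea that the base class tracks pages and metadata while format-specific subclasses decide how cleaned text is produced.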