elm.web.document.HTMLDocument

class HTMLDocument(pages, metadata=None, html_table_to_markdown_kwargs=None, ignore_html_links=True, text_splitter=None)[source]

Bases: BaseDocument

ELM web HTML document

Parameters:
  • pages (iterable) – Iterable of strings, where each string is a page of a document.

  • metadata (dict, optional) – Optional dict containing metadata for the document. By default, None.

  • html_table_to_markdown_kwargs (dict, optional) – Optional dictionary of keyword-value pair arguments to pass to the format_html_tables() function. By default, None.

  • ignore_html_links (bool, optional) – Option to ignore links in HTML text during parsing. By default, True.

  • text_splitter (obj, optional) – Instance of an object that implements a split_text method. The method should take text (str) as input and return a list of text chunks. The input pages are passed through this splitter to create the raw pages for this document. LangChain’s text splitters should work for this input. By default, None, which means the original pages input becomes the raw pages attribute.
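The text_splitter contract described above (an object with a split_text method that takes a str and returns a list of chunks) can be sketched as follows. ParagraphSplitter and the sample page are hypothetical names made up for illustration, and the list comprehension only demonstrates the interface; it is not a claim about how HTMLDocument applies the splitter internally.

```python
class ParagraphSplitter:
    """Hypothetical splitter that breaks text on blank lines."""

    def split_text(self, text):
        # Take a str and return a list of non-empty, stripped chunks
        return [chunk.strip() for chunk in text.split("\n\n") if chunk.strip()]


# Illustrative input: one HTML page containing two paragraphs
pages = ["<p>First paragraph.</p>\n\n<p>Second paragraph.</p>"]

splitter = ParagraphSplitter()

# Apply the splitter to every page and flatten the resulting chunks
chunks = [chunk for page in pages for chunk in splitter.split_text(page)]
print(chunks)

# With ELM installed, the same objects would be passed to the constructor
# (left commented out so this sketch runs without the package):
# from elm.web.document import HTMLDocument
# doc = HTMLDocument(pages, metadata={"source": "example"}, text_splitter=splitter)
```

Any object exposing a compatible split_text method should be usable in the same position, which is why LangChain-style splitters fit this parameter.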

Methods

Attributes

  • FILE_EXTENSION

  • HTML_TABLE_TO_MARKDOWN_KWARGS – Default format_html_tables() arguments

  • WRITE_KWARGS

  • empty – True if the document contains no pages.

  • raw_pages – List of (a limited count of) raw pages

  • text – Cleaned text from document

HTML_TABLE_TO_MARKDOWN_KWARGS = {'floatfmt': '.5f', 'index': True, 'tablefmt': 'psql'}

Default format_html_tables() arguments
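One plausible way for user-supplied html_table_to_markdown_kwargs to interact with these defaults is an override-on-top-of-defaults merge. The sketch below illustrates that pattern under the assumption that user values take precedence; merge_table_kwargs and DEFAULT_KWARGS are hypothetical names, and the class's actual merging logic is not shown in this reference.

```python
from copy import deepcopy

# Copy of the documented class-level defaults
DEFAULT_KWARGS = {"floatfmt": ".5f", "index": True, "tablefmt": "psql"}


def merge_table_kwargs(user_kwargs=None):
    """Hypothetical helper: overlay user kwargs on a copy of the defaults."""
    merged = deepcopy(DEFAULT_KWARGS)  # never mutate the shared defaults
    merged.update(user_kwargs or {})   # assumed: user values take precedence
    return merged


# Override two defaults and keep the third
merged = merge_table_kwargs({"tablefmt": "github", "index": False})
print(merged)
```

Copying the defaults before updating keeps one document's overrides from leaking into the class-level dictionary shared by other instances.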

property empty

True if the document contains no pages.

Type:

bool

property raw_pages

List of (a limited count of) raw pages

Type:

list

property text

Cleaned text from document

Type:

str