elm.pdf.PDFtoTXT

class PDFtoTXT(fp, page_range=None, model=None)[source]

Bases: ApiBase

Class to parse text from a PDF document.

Parameters:
  • fp (str) – Filepath to .pdf file to extract.

  • page_range (None | list) – Optional 2-entry list/tuple to set starting and ending pages (python indexing)

  • model (None | str) – Optional specification of OpenAI model to use. Default is cls.DEFAULT_MODEL
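
For example, a minimal usage sketch (the filepath and page range below are hypothetical, and a valid OPENAI_API_KEY is assumed to be set in the environment):

>>> from elm.pdf import PDFtoTXT
>>> pdf = PDFtoTXT('./report.pdf', page_range=[0, 5])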

Methods

call_api(url, headers, request_json)

Make an asynchronous OpenAI API call.

call_api_async(url, headers, all_request_jsons)

Make many asynchronous calls to the OpenAI API in parallel with built-in rate limiting.

chat(query[, temperature])

Have a continuous chat with the LLM including context from previous chat() calls stored as attributes in this class.

clean_headers([char_thresh, page_thresh, ...])

Clean headers/footers that are duplicated across pages

clean_poppler([layout])

Clean the pdf using the poppler pdftotext utility

clean_txt()

Use GPT to clean raw pdf text in serial calls to the OpenAI API.

clean_txt_async([ignore_error, rate_limit])

Use GPT to clean raw pdf text in parallel calls to the OpenAI API.

clear()

Clear chat history and reduce messages to just the initial model role message.

count_tokens(text, model)

Return the number of tokens in a string.

generic_async_query(queries[, model_role, ...])

Run a number of generic single queries asynchronously (not conversational)

generic_query(query[, model_role, temperature])

Ask a generic single query without conversation

get_embedding(text)

Get the 1D array (list) embedding of a text string.

is_double_col([separator])

Does the text look like it has multiple vertical text columns?

load_pdf(page_range)

Basic load of pdf to text strings

make_gpt_messages(pdf_raw_text)

Make the chat completion messages list for input to GPT

validate_clean()

Run some basic checks on the GPT cleaned text vs. the raw text

Attributes

DEFAULT_MODEL

Default model to do pdf text cleaning.

EMBEDDING_MODEL

Default model to do text embeddings.

EMBEDDING_URL

OpenAI embedding API URL

HEADERS

OpenAI API Headers

MODEL_INSTRUCTION

Instructions to the model with python format braces for pdf text

MODEL_ROLE

High level model role.

URL

OpenAI API URL to be used with environment variable OPENAI_API_KEY.

all_messages_txt

Get a string printout of the full conversation with the LLM

MODEL_ROLE = 'You clean up poorly formatted text extracted from PDF documents.'

High level model role.

MODEL_INSTRUCTION = 'Text extracted from a PDF: \n"""\n{}\n"""\n\nThe text above was extracted from a PDF document. Can you make it nicely formatted? Please only return the formatted text without comments or added information.'

Instructions to the model with python format braces for pdf text
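
For example, a page of raw text can be substituted into the format braces with str.format (the input string below is hypothetical):

>>> prompt = PDFtoTXT.MODEL_INSTRUCTION.format('raw pdf text...')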

load_pdf(page_range)[source]

Basic load of pdf to text strings

Parameters:

page_range (None | list) – Optional 2-entry list/tuple to set starting and ending pages (python indexing)

Returns:

out (list) – List of strings where each entry is a page. This is the raw PDF text before GPT cleaning
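
A short sketch, assuming pdf is the PDFtoTXT instance from the example above:

>>> raw_pages = pdf.load_pdf(page_range=None)
>>> n_pages = len(raw_pages)  # one string per page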

make_gpt_messages(pdf_raw_text)[source]

Make the chat completion messages list for input to GPT

Parameters:

pdf_raw_text (str) – Raw PDF text to be cleaned

Returns:

messages (list) – Messages for the OpenAI chat completion model. Typically this looks like:

[{"role": "system", "content": "You do this…"},
 {"role": "user", "content": "Please do this: {}"}]
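
For example, assuming raw_pages was produced by load_pdf() above:

>>> messages = pdf.make_gpt_messages(raw_pages[0])
>>> roles = [m['role'] for m in messages]  # typically ['system', 'user']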

clean_txt()[source]

Use GPT to clean raw pdf text in serial calls to the OpenAI API.

Returns:

clean_pages (list) – List of clean text strings where each list entry is a page from the PDF
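
A minimal sketch, assuming pdf is a loaded PDFtoTXT instance with a valid API key configured:

>>> clean_pages = pdf.clean_txt()
>>> full_text = '\n'.join(clean_pages)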

async clean_txt_async(ignore_error=None, rate_limit=40000.0)[source]

Use GPT to clean raw pdf text in parallel calls to the OpenAI API.

NOTE: you need to call this using the await keyword in ipython or jupyter, e.g.: out = await pdf.clean_txt_async()

Parameters:
  • ignore_error (None | callable) – Optional callable to parse API error string. If the callable returns True, the error will be ignored, the API call will not be tried again, and the output will be an empty string.

  • rate_limit (float) – OpenAI API rate limit (tokens / minute). Note that the gpt-3.5-turbo limit is 90k as of 4/2023, but we’re using a large factor of safety (~1/2) because we can only count the tokens on the input side and assume the output is about the same count.

Returns:

clean_pages (list) – List of clean text strings where each list entry is a page from the PDF
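
In a plain python script (i.e., outside of an already-running event loop), the coroutine can be driven with asyncio instead; a sketch:

>>> import asyncio
>>> clean_pages = asyncio.run(pdf.clean_txt_async(rate_limit=40000.0))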

is_double_col(separator='    ')[source]

Does the text look like it has multiple vertical text columns?

Parameters:

separator (str) – Heuristic split string to look for spaces between columns

Returns:

out (bool) – True if more than one vertical text column
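
For example:

>>> has_columns = pdf.is_double_col()  # True if two or more vertical columns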

clean_poppler(layout=True)[source]

Clean the pdf using the poppler pdftotext utility

Requires the pdftotext command line utility from this software:

https://poppler.freedesktop.org/

Parameters:

layout (bool) – Layout flag for the poppler pdftotext utility ("maintain original physical layout"). layout=True works well for single-column text; layout=False collapses double columns into a single column, which works better for downstream chunking and LLM work.

Returns:

out (str) – Joined cleaned pages
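
A sketch, assuming the poppler pdftotext utility is installed and on the system PATH:

>>> text = pdf.clean_poppler(layout=False)  # collapse columns for LLM work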

DEFAULT_MODEL = 'gpt-3.5-turbo'

Default model to do pdf text cleaning.

EMBEDDING_MODEL = 'text-embedding-ada-002'

Default model to do text embeddings.

EMBEDDING_URL = 'https://api.openai.com/v1/embeddings'

OpenAI embedding API URL

HEADERS = {'Authorization': 'Bearer None', 'Content-Type': 'application/json', 'api-key': 'None'}

OpenAI API Headers

URL = 'https://api.openai.com/v1/chat/completions'

OpenAI API URL to be used with environment variable OPENAI_API_KEY. Use an Azure API endpoint to trigger Azure usage along with environment variables AZURE_OPENAI_KEY, AZURE_OPENAI_VERSION, and AZURE_OPENAI_ENDPOINT

property all_messages_txt

Get a string printout of the full conversation with the LLM

Returns:

str

async static call_api(url, headers, request_json)

Make an asynchronous OpenAI API call.

Parameters:
  • url (str) –

    OpenAI API url, typically either:

https://api.openai.com/v1/embeddings
https://api.openai.com/v1/chat/completions

  • headers (dict) –

    OpenAI API headers, typically:
{"Content-Type": "application/json",
 "Authorization": f"Bearer {openai.api_key}"}

  • request_json (dict) –

    API data input, typically looks like this for chat completion:
{"model": "gpt-3.5-turbo",
 "messages": [{"role": "system", "content": "You do this…"},
              {"role": "user", "content": "Do this: {}"}],
 "temperature": 0.0}

Returns:

out (dict) – API response in json format
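
A sketch of a single chat completion request using the URL and HEADERS class attributes documented above (the message content is hypothetical, and a valid API key is assumed so that HEADERS is populated):

>>> import asyncio
>>> request = {'model': 'gpt-3.5-turbo',
...            'messages': [{'role': 'user', 'content': 'Say hello.'}],
...            'temperature': 0.0}
>>> out = asyncio.run(PDFtoTXT.call_api(PDFtoTXT.URL, PDFtoTXT.HEADERS, request))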

async call_api_async(url, headers, all_request_jsons, ignore_error=None, rate_limit=40000.0)

Make many asynchronous calls to the OpenAI API in parallel with built-in rate limiting.

NOTE: you need to call this using the await keyword in ipython or jupyter, e.g.: out = await pdf.call_api_async(url, headers, all_request_jsons)

Parameters:
  • url (str) –

    OpenAI API url, typically either:

https://api.openai.com/v1/embeddings
https://api.openai.com/v1/chat/completions

  • headers (dict) –

    OpenAI API headers, typically:
{"Content-Type": "application/json",
 "Authorization": f"Bearer {openai.api_key}"}

  • all_request_jsons (list) – List of API data input, one entry typically looks like this for chat completion:

{"model": "gpt-3.5-turbo",
 "messages": [{"role": "system", "content": "You do this…"},
              {"role": "user", "content": "Do this: {}"}],
 "temperature": 0.0}

  • ignore_error (None | callable) – Optional callable to parse API error string. If the callable returns True, the error will be ignored, the API call will not be tried again, and the output will be an empty string.

  • rate_limit (float) – OpenAI API rate limit (tokens / minute). Note that the gpt-3.5-turbo limit is 90k as of 4/2023, but we’re using a large factor of safety (~1/2) because we can only count the tokens on the input side and assume the output is about the same count.

Returns:

out (list) – List of API outputs where each list entry is a GPT answer from the corresponding message in the all_request_jsons input.

chat(query, temperature=0)

Have a continuous chat with the LLM including context from previous chat() calls stored as attributes in this class.

Parameters:
  • query (str) – Question to ask ChatGPT

  • temperature (float) – GPT model temperature, a measure of response entropy from 0 to 1. 0 is more reliable and nearly deterministic; 1 gives the model more creative freedom and may produce less factual results.

Returns:

response (str) – Model response
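
For example (the queries below are hypothetical):

>>> r1 = pdf.chat('Summarize this document in one paragraph.')
>>> r2 = pdf.chat('Now condense that to one sentence.')  # uses prior context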

clear()

Clear chat history and reduce messages to just the initial model role message.

static count_tokens(text, model)

Return the number of tokens in a string.

Parameters:
  • text (str) – Text string to get number of tokens for

  • model (str) – specification of OpenAI model to use (e.g., “gpt-3.5-turbo”)

Returns:

n (int) – Number of tokens in text
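
For example:

>>> n = PDFtoTXT.count_tokens('This is a test string.', 'gpt-3.5-turbo')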

async generic_async_query(queries, model_role=None, temperature=0, ignore_error=None, rate_limit=40000.0)

Run a number of generic single queries asynchronously (not conversational)

NOTE: you need to call this using the await keyword in ipython or jupyter, e.g.: out = await pdf.generic_async_query(queries)

Parameters:
  • queries (list) – Questions to ask ChatGPT (list of strings)

  • model_role (str | None) – Role for the model to take, e.g.: “You are a research assistant”. This defaults to self.MODEL_ROLE

  • temperature (float) – GPT model temperature, a measure of response entropy from 0 to 1. 0 is more reliable and nearly deterministic; 1 gives the model more creative freedom and may produce less factual results.

  • ignore_error (None | callable) – Optional callable to parse API error string. If the callable returns True, the error will be ignored, the API call will not be tried again, and the output will be an empty string.

  • rate_limit (float) – OpenAI API rate limit (tokens / minute). Note that the gpt-3.5-turbo limit is 90k as of 4/2023, but we’re using a large factor of safety (~1/2) because we can only count the tokens on the input side and assume the output is about the same count.

Returns:

response (list) – Model responses with the same length as the queries input.
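
A sketch in a plain script (the queries are hypothetical):

>>> import asyncio
>>> queries = ['What is wind energy?', 'What is solar energy?']
>>> responses = asyncio.run(pdf.generic_async_query(queries, temperature=0))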

generic_query(query, model_role=None, temperature=0)

Ask a generic single query without conversation

Parameters:
  • query (str) – Question to ask ChatGPT

  • model_role (str | None) – Role for the model to take, e.g.: “You are a research assistant”. This defaults to self.MODEL_ROLE

  • temperature (float) – GPT model temperature, a measure of response entropy from 0 to 1. 0 is more reliable and nearly deterministic; 1 gives the model more creative freedom and may produce less factual results.

Returns:

response (str) – Model response
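
For example (the query and role are hypothetical):

>>> response = pdf.generic_query('What is this document about?',
...                              model_role='You are a research assistant')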

classmethod get_embedding(text)

Get the 1D array (list) embedding of a text string.

Parameters:

text (str) – Text to embed

Returns:

embedding (list) – List of floats representing the numerical embedding of the text
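
For example:

>>> vec = PDFtoTXT.get_embedding('wind turbine siting')
>>> dim = len(vec)  # e.g., 1536 for text-embedding-ada-002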

validate_clean()[source]

Run some basic checks on the GPT cleaned text vs. the raw text

clean_headers(char_thresh=0.6, page_thresh=0.8, split_on='\n', iheaders=(0, 1, -2, -1))[source]

Clean headers/footers that are duplicated across pages

Parameters:
  • char_thresh (float) – Fraction of characters in a given header that are similar between pages to be considered for removal

  • page_thresh (float) – Fraction of pages that share the header to be considered for removal

  • split_on (str) – Chars to split lines of a page on

  • iheaders (list | tuple) – Integer indices to look for headers after splitting a page into lines based on split_on. These indices should be ordered from the start of the page to the end.

Returns:

out (str) – Clean text with all pages joined
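
For example, keeping the default character threshold but requiring a header to appear on 90% of pages (values are illustrative):

>>> text = pdf.clean_headers(page_thresh=0.9)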