elm.chunk.Chunker
- class Chunker(text, tag=None, tokens_per_chunk=500, overlap=1, split_on='\n\n')[source]
Bases: ApiBase
Class to break text up into overlapping chunks
NOTE: very large paragraphs that exceed the tokens per chunk will not be split up and will still be padded with overlap.
- Parameters:
text (str) – Single body of text to break up. Works well if this is a single document with empty lines between paragraphs.
tag (None | str) – Optional reference tag to include at the beginning of each text chunk
tokens_per_chunk (float) – Nominal token count per text chunk. Overlap paragraphs will exceed this.
overlap (int) – Number of paragraphs to overlap between chunks
split_on (str) – Substring used to split the text into paragraphs.
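The chunking strategy described by these parameters can be illustrated with a small standalone sketch. This re-implements the idea in plain Python with a crude whitespace token count (the real class counts tokens with tiktoken), so the function name and budget here are illustrative, not part of elm:

```python
# Illustrative sketch of Chunker's strategy: split text into paragraphs,
# greedily merge them up to a token budget, then pad each chunk with
# `overlap` neighboring paragraphs. Token counts are approximated by
# whitespace splitting; the real class uses tiktoken.

def sketch_chunks(text, tokens_per_chunk=6, overlap=1, split_on="\n\n"):
    paragraphs = [p.strip() for p in text.split(split_on) if p.strip()]
    n_tokens = [len(p.split()) for p in paragraphs]

    # Greedy merge: paragraph indices are grouped until the budget is hit.
    # An oversized paragraph still gets its own chunk (it is never split).
    chunks, current, count = [], [], 0
    for i, n in enumerate(n_tokens):
        if current and count + n > tokens_per_chunk:
            chunks.append(current)
            current, count = [], 0
        current.append(i)
        count += n
    if current:
        chunks.append(current)

    # Pad with neighboring paragraphs; this intentionally exceeds the budget
    padded = []
    for chunk in chunks:
        start = max(0, chunk[0] - overlap)
        stop = min(len(paragraphs), chunk[-1] + 1 + overlap)
        padded.append(list(range(start, stop)))

    return ["\n\n".join(paragraphs[i] for i in idx) for idx in padded]

text = "one two three\n\nfour five\n\nsix seven eight\n\nnine"
for c in sketch_chunks(text):
    print(repr(c))
```

With a budget of 6 "tokens", the four paragraphs merge into two chunks, and each chunk is then padded with one paragraph from its neighbor, so adjacent chunks share text.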
Methods

add_overlap(chunks_input): Add overlap on either side of a text chunk.
call_api(url, headers, request_json): Make an asynchronous OpenAI API call.
call_api_async(url, headers, all_request_jsons): Use GPT to clean raw pdf text in parallel calls to the OpenAI API.
chat(query[, temperature]): Have a continuous chat with the LLM including context from previous chat() calls stored as attributes in this class.
chunk_text(): Perform the text chunking operation.
clean_paragraphs(text): Clean up double line breaks to make sure paragraphs can be detected in the text.
clear(): Clear chat history and reduce messages to just the initial model role message.
count_tokens(text, model): Return the number of tokens in a string.
generic_async_query(queries[, model_role, ...]): Run a number of generic single queries asynchronously (not conversational).
generic_query(query[, model_role, temperature]): Ask a generic single query without conversation.
get_embedding(text): Get the 1D array (list) embedding of a text string.
is_good_paragraph(paragraph): Basic tests to make sure the paragraph is useful text.
merge_chunks(chunks_input): Merge chunks until they reach the token limit per chunk.
Attributes

DEFAULT_MODEL: Default model to do pdf text cleaning.
EMBEDDING_MODEL: Default model to do text embeddings.
EMBEDDING_URL: OpenAI embedding API URL.
HEADERS: OpenAI API headers.
MODEL_ROLE: High-level model role.
TOKENIZER_ALIASES: Optional mappings for unusual Azure names to tiktoken/openai names.
URL: OpenAI API URL to be used with environment variable OPENAI_API_KEY.
all_messages_txt: Get a string printout of the full conversation with the LLM.
chunk_tokens: Number of tokens per chunk.
chunks: List of overlapping text chunks (strings).
paragraph_tokens: Number of tokens per paragraph.
paragraphs: Get a list of paragraphs in the text, demarcated by an empty line.
- property chunks
List of overlapping text chunks (strings).
- Returns:
list
- property paragraphs
Get a list of paragraphs in the text, demarcated by an empty line.
- Returns:
list
- static clean_paragraphs(text)[source]
Clean up double line breaks to make sure paragraphs can be detected in the text.
- property paragraph_tokens
Number of tokens per paragraph.
- Returns:
list
- property chunk_tokens
Number of tokens per chunk.
- Returns:
list
- merge_chunks(chunks_input)[source]
Merge chunks until they reach the token limit per chunk.
- Parameters:
chunks_input (list) – List of list of integers: [[0, 1], [2], [3, 4]] where nested lists are chunks and the integers are paragraph indices
- Returns:
chunks (list) – List of list of integers: [[0, 1], [2], [3, 4]] where nested lists are chunks and the integers are paragraph indices
- add_overlap(chunks_input)[source]
Add overlap on either side of a text chunk. This ignores token limit.
- Parameters:
chunks_input (list) – List of list of integers: [[0, 1], [2], [3, 4]] where nested lists are chunks and the integers are paragraph indices
- Returns:
chunks (list) – List of list of integers: [[0, 1], [2], [3, 4]] where nested lists are chunks and the integers are paragraph indices
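The effect of overlap padding on these index lists can be sketched in a few lines. This is an assumed reconstruction of the behavior for overlap=1 (edge handling in the real implementation may differ), not elm's code:

```python
def add_overlap_sketch(chunks_input, overlap=1):
    """Pad each chunk with up to `overlap` paragraph indices taken from
    its neighbors on either side, ignoring any token limit. A sketch of
    the assumed behavior, not the library implementation."""
    out = []
    for i, chunk in enumerate(chunks_input):
        before = chunks_input[i - 1][-overlap:] if i > 0 else []
        after = chunks_input[i + 1][:overlap] if i < len(chunks_input) - 1 else []
        out.append(before + chunk + after)
    return out

# The interior chunk [2] gains one paragraph index from each neighbor
print(add_overlap_sketch([[0, 1], [2], [3, 4]]))
```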
- chunk_text()[source]
Perform the text chunking operation
- Returns:
chunks (list) – List of strings where each string is an overlapping chunk of text
- DEFAULT_MODEL = 'gpt-3.5-turbo'
Default model to do pdf text cleaning.
- EMBEDDING_MODEL = 'text-embedding-ada-002'
Default model to do text embeddings.
- EMBEDDING_URL = 'https://api.openai.com/v1/embeddings'
OpenAI embedding API URL
- HEADERS = {'Authorization': 'Bearer None', 'Content-Type': 'application/json', 'api-key': 'None'}
OpenAI API Headers
- MODEL_ROLE = 'You are a research assistant that answers questions.'
High level model role
- TOKENIZER_ALIASES = {'gpt-35-turbo': 'gpt-3.5-turbo', 'gpt-4-32k': 'gpt-4-32k-0314', 'llmev-gpt-4-32k': 'gpt-4-32k-0314'}
Optional mappings for unusual Azure names to tiktoken/openai names.
- URL = 'https://api.openai.com/v1/chat/completions'
OpenAI API URL to be used with environment variable OPENAI_API_KEY. Use an Azure API endpoint to trigger Azure usage along with environment variables AZURE_OPENAI_KEY, AZURE_OPENAI_VERSION, and AZURE_OPENAI_ENDPOINT
- property all_messages_txt
Get a string printout of the full conversation with the LLM
- Returns:
str
- async static call_api(url, headers, request_json)
Make an asynchronous OpenAI API call.
- Parameters:
url (str) – OpenAI API URL, typically either:
https://api.openai.com/v1/embeddings or
https://api.openai.com/v1/chat/completions
headers (dict) – OpenAI API headers, typically:
{"Content-Type": "application/json",
 "Authorization": f"Bearer {openai.api_key}"}
request_json (dict) – API data input, typically looks like this for chat completion:
{"model": "gpt-3.5-turbo",
 "messages": [{"role": "system", "content": "You do this…"},
              {"role": "user", "content": "Do this: {}"}],
 "temperature": 0.0}
- Returns:
out (dict) – API response in json format
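For reference, a chat-completion request body of the shape shown above can be assembled as a plain dict and serialized to JSON. The model name and prompt text here are placeholders, not values required by elm:

```python
import json

# Assemble a chat-completion request body like the one described above.
# Model name and message contents are illustrative placeholders.
request_json = {
    "model": "gpt-3.5-turbo",
    "messages": [
        {"role": "system", "content": "You are a research assistant that answers questions."},
        {"role": "user", "content": "Summarize this paragraph: ..."},
    ],
    "temperature": 0.0,
}

# The body is sent as a JSON string alongside the headers described above
payload = json.dumps(request_json)
print(payload[:40])
```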
- async call_api_async(url, headers, all_request_jsons, ignore_error=None, rate_limit=40000.0)
Use GPT to clean raw pdf text in parallel calls to the OpenAI API.
NOTE: you need to call this using the await command in ipython or jupyter, e.g.: out = await PDFtoTXT.clean_txt_async()
- Parameters:
url (str) – OpenAI API URL, typically either:
https://api.openai.com/v1/embeddings or
https://api.openai.com/v1/chat/completions
headers (dict) – OpenAI API headers, typically:
{"Content-Type": "application/json",
 "Authorization": f"Bearer {openai.api_key}"}
all_request_jsons (list) – List of API data input, one entry typically looks like this for chat completion:
{"model": "gpt-3.5-turbo",
 "messages": [{"role": "system", "content": "You do this…"},
              {"role": "user", "content": "Do this: {}"}],
 "temperature": 0.0}
ignore_error (None | callable) – Optional callable to parse API error string. If the callable returns True, the error will be ignored, the API call will not be tried again, and the output will be an empty string.
rate_limit (float) – OpenAI API rate limit (tokens / minute). Note that the gpt-3.5-turbo limit is 90k as of 4/2023, but we’re using a large factor of safety (~1/2) because we can only count the tokens on the input side and assume the output is about the same count.
- Returns:
out (list) – List of API outputs where each list entry is a GPT answer from the corresponding message in the all_request_jsons input.
- chat(query, temperature=0)
Have a continuous chat with the LLM including context from previous chat() calls stored as attributes in this class.
- Parameters:
query (str) – Question to ask ChatGPT
temperature (float) – GPT model temperature, a measure of response entropy from 0 to 1. 0 is more reliable and nearly deterministic; 1 gives the model more creative freedom and may return less factual results.
- Returns:
response (str) – Model response
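The conversational state that chat() implies can be pictured as an append-only message list: each call adds the user query and the model's reply, so later calls carry the full context. A sketch of that pattern, with a stand-in function in place of the real API call (none of these names are elm's internals):

```python
# Sketch of the conversational pattern: each chat turn appends the user
# query and the assistant reply, so subsequent calls see prior context.
# `fake_llm` is purely illustrative and stands in for the real API call.

def fake_llm(messages):
    return f"(reply to: {messages[-1]['content']})"

messages = [{"role": "system",
             "content": "You are a research assistant that answers questions."}]

def chat_sketch(query):
    messages.append({"role": "user", "content": query})
    response = fake_llm(messages)
    messages.append({"role": "assistant", "content": response})
    return response

chat_sketch("What is text chunking?")
chat_sketch("Why add overlap?")
print(len(messages))  # system message + two user/assistant pairs = 5
```

Clearing the history (as clear() does) amounts to truncating this list back to the initial system-role message.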
- clear()
Clear chat history and reduce messages to just the initial model role message.
- classmethod count_tokens(text, model)
Return the number of tokens in a string.
- Parameters:
text (str) – Text string to get number of tokens for
model (str) – specification of OpenAI model to use (e.g., “gpt-3.5-turbo”)
- Returns:
n (int) – Number of tokens in text
- async generic_async_query(queries, model_role=None, temperature=0, ignore_error=None, rate_limit=40000.0)
Run a number of generic single queries asynchronously (not conversational)
NOTE: you need to call this using the await command in ipython or jupyter, e.g.: out = await Summary.run_async()
- Parameters:
queries (list) – Questions to ask ChatGPT (list of strings)
model_role (str | None) – Role for the model to take, e.g.: “You are a research assistant”. This defaults to self.MODEL_ROLE
temperature (float) – GPT model temperature, a measure of response entropy from 0 to 1. 0 is more reliable and nearly deterministic; 1 gives the model more creative freedom and may return less factual results.
ignore_error (None | callable) – Optional callable to parse API error string. If the callable returns True, the error will be ignored, the API call will not be tried again, and the output will be an empty string.
rate_limit (float) – OpenAI API rate limit (tokens / minute). Note that the gpt-3.5-turbo limit is 90k as of 4/2023, but we’re using a large factor of safety (~1/2) because we can only count the tokens on the input side and assume the output is about the same count.
- Returns:
response (list) – Model responses with same length as query input.
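The asynchronous fan-out that generic_async_query performs can be sketched with asyncio.gather, which runs the calls concurrently and preserves input order. Dummy coroutines stand in for the API calls here, and the rate limiting described above is omitted for brevity:

```python
import asyncio

async def fake_query(q):
    # Stand-in for a single API call; a real implementation would also
    # throttle to the tokens-per-minute rate limit described above.
    await asyncio.sleep(0)
    return f"answer to: {q}"

async def run_all(queries):
    # Fan the queries out concurrently; gather preserves input order
    return await asyncio.gather(*(fake_query(q) for q in queries))

answers = asyncio.run(run_all(["q1", "q2", "q3"]))
print(answers)
```

This is why the return value has the same length and ordering as the query input, even though the calls complete in arbitrary order.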
- generic_query(query, model_role=None, temperature=0)
Ask a generic single query without conversation
- Parameters:
query (str) – Question to ask ChatGPT
model_role (str | None) – Role for the model to take, e.g.: “You are a research assistant”. This defaults to self.MODEL_ROLE
temperature (float) – GPT model temperature, a measure of response entropy from 0 to 1. 0 is more reliable and nearly deterministic; 1 gives the model more creative freedom and may return less factual results.
- Returns:
response (str) – Model response
- classmethod get_embedding(text)
Get the 1D array (list) embedding of a text string.
- Parameters:
text (str) – Text to embed
- Returns:
embedding (list) – List of floats representing the numerical embedding of the text
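Embeddings like this are typically compared with cosine similarity to rank text chunks against a query. A minimal stdlib sketch with toy vectors (real text-embedding-ada-002 vectors are 1536-dimensional):

```python
import math

def cosine_similarity(a, b):
    # Standard cosine similarity between two equal-length float lists
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Toy 3-dimensional "embeddings" for illustration only
print(cosine_similarity([1.0, 0.0, 0.0], [1.0, 0.0, 0.0]))  # identical: 1.0
print(cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0]))  # orthogonal: 0.0
```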