
elm.web.search.run.web_search_links_as_docs

async web_search_links_as_docs(queries, search_engines=('PlaywrightGoogleLinkSearch', 'PlaywrightDuckDuckGoLinkSearch', 'DuxDistributedGlobalSearch'), num_urls=None, ignore_url_parts=None, search_semaphore=None, browser_semaphore=None, task_name=None, use_fallback_per_query=True, on_search_complete_hook=None, **kwargs)

Retrieve top N search results as document instances
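
A minimal call might look like the sketch below. It uses only the parameters documented on this page; the query strings and the semaphore size are illustrative assumptions, and it presumes ELM and Playwright are installed and that network access is available.

```python
import asyncio


async def main():
    # Imported inside the coroutine so this sketch can be loaded
    # even where ELM is not installed.
    from elm.web.search.run import web_search_links_as_docs

    # Illustrative queries (assumed for this example only)
    queries = [
        "county wind energy ordinance",
        "county wind turbine setback requirements",
    ]

    docs = await web_search_links_as_docs(
        queries,
        num_urls=4,  # top 4 unique URLs across both queries
        ignore_url_parts={"wikipedia.org"},  # skip URLs containing this text
        browser_semaphore=asyncio.Semaphore(2),  # also limits searches, since
                                                 # search_semaphore is None
        pw_launch_kwargs={"headless": True},  # shared by Playwright engines
    )
    print(f"Retrieved {len(docs)} documents")
    return docs
```

To run the sketch, call asyncio.run(main()) from a script. If every search engine fails, the function returns an empty list, so the caller should be prepared for that case.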

Parameters:
  • queries (collection of str) – Collection of strings representing Google queries. Documents for the top num_urls Google search results (from all of these queries _combined_) will be returned from this function.

  • search_engines (iterable of str) – Ordered collection of search engine names to attempt for web search. If the first search engine in the list returns a set of URLs, then iteration will end and documents for each URL will be returned. Otherwise, the next engine in this list will be used to run the web search. If this also fails, the next engine is used and so on. If all web searches fail, an empty list is returned. See SEARCH_ENGINE_OPTIONS for supported search engine options. By default, ("PlaywrightGoogleLinkSearch", "PlaywrightDuckDuckGoLinkSearch", "DuxDistributedGlobalSearch").

  • num_urls (int, optional) – Number of unique top Google search results to return as docs. The Google search results from all queries are interleaved and the top num_urls unique URLs are downloaded as docs. If this number is less than len(queries), some of your queries may not contribute to the final output. By default, None, which sets num_urls = 3 * len(queries).

  • ignore_url_parts (iterable of str, optional) – Optional URL components to blacklist. For example, supplying ignore_url_parts={"wikipedia.org"} will ignore all URLs that contain "wikipedia.org". By default, None.

  • search_semaphore (asyncio.Semaphore, optional) – Semaphore instance that can be used to limit the number of Playwright browsers open concurrently to submit search engine queries. For backwards compatibility, if this input is None, the input from browser_semaphore will be used in its place (i.e. the searches and file downloads will be limited using the same semaphore). By default, None.

  • browser_semaphore (asyncio.Semaphore, optional) – Semaphore instance that can be used to limit the number of Playwright browsers open concurrently to download files. If None, no limits are applied. By default, None.

  • task_name (str, optional) – Optional task name to use in asyncio.create_task(). By default, None.

  • use_fallback_per_query (bool, default=True) – Option to use the fallback list of search engines on a per-query basis. This means if a single query fails with one search engine, the fallback search engines will be attempted for that query. If this input is False, the fallback search engines are only used if all search queries fail for a single search engine. By default, True.

  • on_search_complete_hook (callable, optional) – If provided, this async callable will be called after the search engine links have been retrieved. A single argument will be passed to this function containing a list of URLs that were the result of the search queries (this list can be empty if the search failed). By default, None.

  • **kwargs – Keyword-argument pairs to initialize elm.web.file_loader.AsyncFileLoader. This input can also include any or all of the following keywords:

    • ddg_api_kwargs

    • google_cse_api_kwargs

    • google_serper_api_kwargs

    • tavily_api_kwargs

    • ddgs_kwargs

    • cf_google_se_kwargs

    • pw_bing_se_kwargs

    • pw_ddg_se_kwargs

    • pw_google_cse_kwargs

    • pw_google_se_kwargs

    • pw_yahoo_se_kwargs

    • pw_launch_kwargs

    Each of these inputs should be a dictionary with keyword-argument pairs that you can use to initialize the search engines in the search_engines input. If pw_launch_kwargs is detected, it will be added to the kwargs for all of the Playwright-based search engines so that you do not have to repeatedly specify the launch parameters. For example, you may specify pw_launch_kwargs={"headless": False} to have all Playwright-based searches show the browser and _also_ specify google_serper_api_kwargs={"api_key": "..."} to specify the API key for the Google Serper search.
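
The difference between the two fallback modes described for search_engines and use_fallback_per_query can be illustrated with a small stand-alone sketch. This is not the library's implementation: the engines here are hypothetical async callables that return a list of URLs, and an empty list is assumed to mean the search failed.

```python
import asyncio


async def search_with_fallback(queries, engines, use_fallback_per_query=True):
    """Return one list of URLs per query, trying engines in order.

    `engines` is an ordered list of hypothetical async callables that
    take a query string and return a (possibly empty) list of URLs.
    """
    if use_fallback_per_query:
        results = []
        for query in queries:
            urls = []
            for engine in engines:
                urls = await engine(query)
                if urls:  # this engine worked for this query
                    break
            results.append(urls)
        return results

    # Engine-level fallback: move to the next engine only if
    # every query came back empty for the current one.
    for engine in engines:
        results = [await engine(query) for query in queries]
        if any(results):
            return results
    return [[] for _ in queries]


async def flaky_engine(query):
    # Hypothetical engine that only finds results for "wind" queries
    return [f"https://a.example/{query}"] if "wind" in query else []


async def backup_engine(query):
    # Hypothetical engine that always finds one result
    return [f"https://b.example/{query}"]


queries = ["wind", "solar"]
engines = [flaky_engine, backup_engine]
per_query = asyncio.run(search_with_fallback(queries, engines))
per_engine = asyncio.run(
    search_with_fallback(queries, engines, use_fallback_per_query=False)
)
print(per_query)   # "solar" falls back to backup_engine
print(per_engine)  # "solar" stays empty because "wind" succeeded
```

With per-query fallback, the "solar" query is retried on the second engine. With engine-level fallback, the first engine is accepted as soon as any query succeeds, so "solar" contributes no URLs.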

Returns:

list of elm.web.document.BaseDocument – List of documents representing the top num_urls results from the Google searches across all queries.
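
To make the "top num_urls results across all queries" wording concrete, the sketch below shows one way per-query results could be interleaved, de-duplicated and filtered. It is an illustration under stated assumptions, not the library's code: the helper name top_urls is hypothetical, and applying ignore_url_parts before counting toward num_urls is an assumption.

```python
from itertools import zip_longest


def top_urls(results_per_query, num_urls=None, ignore_url_parts=None):
    """Pick the top unique URLs from ranked per-query result lists.

    `results_per_query` holds one ranked list of URLs per query.
    """
    if num_urls is None:
        # Documented default: three URLs per query
        num_urls = 3 * len(results_per_query)
    ignore_url_parts = set(ignore_url_parts or ())

    selected = []
    # Round-robin: first hit of every query, then second hit, and so on
    for rank_group in zip_longest(*results_per_query):
        for url in rank_group:
            if url is None or url in selected:
                continue  # shorter list exhausted, or duplicate URL
            if any(part in url for part in ignore_url_parts):
                continue  # URL contains an ignored component
            selected.append(url)
            if len(selected) >= num_urls:
                return selected[:num_urls]
    return selected[:num_urls]


results = [
    ["https://a.example/1", "https://en.wikipedia.org/wiki/Wind",
     "https://a.example/2"],
    ["https://a.example/1", "https://b.example/1"],
]
picked = top_urls(results, num_urls=3, ignore_url_parts={"wikipedia.org"})
only_one = top_urls(results, num_urls=1)
print(picked)
print(only_one)
```

The second call shows why a num_urls smaller than len(queries) can leave some queries unrepresented: the single slot is filled by the first query's top hit before the second query is considered.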


© Copyright 2023, Alliance for Sustainable Energy, LLC.
