elm.web.search.run.search_with_fallback

async search_with_fallback(queries, search_engines=('PlaywrightGoogleLinkSearch', 'PlaywrightDuckDuckGoLinkSearch', 'DuxDistributedGlobalSearch'), num_urls=None, ignore_url_parts=None, browser_semaphore=None, task_name=None, use_fallback_per_query=True, **kwargs)[source]

Retrieve search query URLs using multiple search engines if needed

Parameters:
  • queries (collection of str) – Collection of strings representing google queries. Documents for the top num_urls google search results (from all of these queries _combined_ will be returned from this function.

  • search_engines (iterable of str) – Ordered collection of search engine names to attempt for web search. If the first search engine in the list returns a set of URLs, then iteration will end and documents for each URL will be returned. Otherwise, the next engine in this list will be used to run the web search. If this also fails, the next engine is used and so on. If all web searches fail, an empty list is returned. See SEARCH_ENGINE_OPTIONS for supported search engine options. By default, ("PlaywrightGoogleLinkSearch", ).

  • num_urls (int, optional) – Number of unique top Google search result to return as docs. The google search results from all queries are interleaved and the top num_urls unique URL’s are downloaded as docs. If this number is less than len(queries), some of your queries may not contribute to the final output. By default, None, which sets num_urls = 3 * len(queries).

  • ignore_url_parts (iterable of str, optional) – Optional URL components to blacklist. For example, supplying ignore_url_parts={“wikipedia.org”} will ignore all URLs that contain “wikipedia.org”. By default, None.

  • browser_semaphore (asyncio.Semaphore, optional) – Semaphore instance that can be used to limit the number of playwright browsers open concurrently. If None, no limits are applied. By default, None.

  • task_name (str, optional) – Optional task name to use in asyncio.create_task(). By default, None.

  • use_fallback_per_query (bool, default=True) – Option to use the fallback list of search engines on a per-query basis. This means if a single query fails with one search engine, the fallback search engines will be attempted for that query. If this input is False, the fallback search engines are only used if all search queries fail for a single search engine. By default, True.

  • **kwargs – Keyword-argument pairs to initialize search engines. This input can include and any/all of the following keywords:

    • ddg_api_kwargs

    • google_cse_api_kwargs

    • google_serper_api_kwargs

    • tavily_api_kwargs

    • ddgs_kwargs

    • cf_google_se_kwargs

    • pw_bing_se_kwargs

    • pw_ddg_se_kwargs

    • pw_google_cse_kwargs

    • pw_google_se_kwargs

    • pw_yahoo_se_kwargs

    • pw_launch_kwargs

    Each of these inputs should be a dictionary with keyword-argument pairs that you can use to initialize the search engines in the search_engines input. If pw_launch_kwargs is detected, it will be added to the kwargs for all of the PLaywright-based search engines so that you do not have to repeatedly specify the launch parameters. For example, you may specify pw_launch_kwargs={"headless": False} to have all Playwright-based searches show the browser and _also_ specify google_serper_api_kwargs={"api_key": "..."} to specify the API key for the Google Serper search.

Returns:

set – Set of URL’s for the queries (empty if search failed).

Raises:

ELMInputError – If search_engines input is empty.