elm.web.search.run.web_search_links_as_docs
- async web_search_links_as_docs(queries, search_engines=('PlaywrightGoogleLinkSearch', 'PlaywrightDuckDuckGoLinkSearch', 'DuxDistributedGlobalSearch'), num_urls=None, ignore_url_parts=None, search_semaphore=None, browser_semaphore=None, task_name=None, use_fallback_per_query=True, on_search_complete_hook=None, **kwargs)
Retrieve the top N search results as document instances.
- Parameters:
queries (collection of str) – Collection of strings representing Google queries. Documents for the top num_urls Google search results (from all of these queries _combined_) will be returned from this function.
search_engines (iterable of str) – Ordered collection of search engine names to attempt for web search. If the first search engine in the list returns a set of URLs, then iteration will end and documents for each URL will be returned. Otherwise, the next engine in this list will be used to run the web search. If this also fails, the next engine is used, and so on. If all web searches fail, an empty list is returned. See SEARCH_ENGINE_OPTIONS for supported search engine options. By default, ("PlaywrightGoogleLinkSearch", "PlaywrightDuckDuckGoLinkSearch", "DuxDistributedGlobalSearch").
num_urls (int, optional) – Number of unique top Google search results to return as docs. The Google search results from all queries are interleaved and the top num_urls unique URLs are downloaded as docs. If this number is less than len(queries), some of your queries may not contribute to the final output. By default, None, which sets num_urls = 3 * len(queries).
ignore_url_parts (iterable of str, optional) – Optional URL components to blacklist. For example, supplying ignore_url_parts={"wikipedia.org"} will ignore all URLs that contain "wikipedia.org". By default, None.
search_semaphore (asyncio.Semaphore, optional) – Semaphore instance that can be used to limit the number of Playwright browsers open concurrently to submit search engine queries. For backwards compatibility, if this input is None, the input from browser_semaphore will be used in its place (i.e., the searches and file downloads will be limited using the same semaphore). By default, None.
browser_semaphore (asyncio.Semaphore, optional) – Semaphore instance that can be used to limit the number of Playwright browsers open concurrently to download files. If None, no limits are applied. By default, None.
task_name (str, optional) – Optional task name to use in asyncio.create_task(). By default, None.
use_fallback_per_query (bool, default=True) – Option to use the fallback list of search engines on a per-query basis. This means if a single query fails with one search engine, the fallback search engines will be attempted for that query. If this input is False, the fallback search engines are only used if all search queries fail for a single search engine. By default, True.
on_search_complete_hook (callable, optional) – If provided, this async callable will be called after the search engine links have been retrieved. A single argument will be passed to this function containing a list of URLs that were the result of the search queries (this list can be empty if the search failed). By default, None.
**kwargs – Keyword-argument pairs to initialize elm.web.file_loader.AsyncFileLoader. This input can also include any or all of the following keywords:
ddg_api_kwargs
google_cse_api_kwargs
google_serper_api_kwargs
tavily_api_kwargs
ddgs_kwargs
cf_google_se_kwargs
pw_bing_se_kwargs
pw_ddg_se_kwargs
pw_google_cse_kwargs
pw_google_se_kwargs
pw_yahoo_se_kwargs
pw_launch_kwargs
Each of these inputs should be a dictionary with keyword-argument pairs that you can use to initialize the search engines in the search_engines input. If pw_launch_kwargs is detected, it will be added to the kwargs for all of the Playwright-based search engines so that you do not have to repeatedly specify the launch parameters. For example, you may specify pw_launch_kwargs={"headless": False} to have all Playwright-based searches show the browser and _also_ specify google_serper_api_kwargs={"api_key": "..."} to set the API key for the Google Serper search.
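The parameters above can be combined as in the following minimal sketch. It is an illustration under stated assumptions, not a recommended configuration: the helper build_search_options, the hook log_urls, the example queries, and the chosen values (num_urls=5, two concurrent browsers) are all hypothetical. The import path is taken from the page title, and running main() requires ELM, Playwright, and network access.

```python
import asyncio


async def log_urls(urls):
    """Hypothetical hook: receives the list of URLs found (may be empty)."""
    print(f"Search returned {len(urls)} URL(s)")


def build_search_options(max_browsers=2):
    """Hypothetical helper that gathers keyword arguments for the call."""
    return {
        "num_urls": 5,  # arbitrary value chosen for illustration
        "ignore_url_parts": {"wikipedia.org"},
        # One semaphore limits searches, the other limits file downloads
        "search_semaphore": asyncio.Semaphore(max_browsers),
        "browser_semaphore": asyncio.Semaphore(max_browsers),
        "on_search_complete_hook": log_urls,
        # Forwarded to every Playwright-based search engine
        "pw_launch_kwargs": {"headless": False},
    }


async def main():
    # Imported lazily so the helper above can be used without ELM installed
    from elm.web.search.run import web_search_links_as_docs

    # Example queries are made up for this sketch
    queries = ["wind turbine setback ordinance", "solar zoning ordinance"]
    docs = await web_search_links_as_docs(queries, **build_search_options())
    print(f"Retrieved {len(docs)} document(s)")


# To run the search (requires ELM, Playwright, and network access):
# asyncio.run(main())
```

Building the options inside main() means the semaphores are created while the event loop is running, which avoids binding them to a different loop on older Python versions.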
- Returns:
list of elm.web.document.BaseDocument – List of documents representing the top num_urls results from the Google searches across all queries.
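How the top num_urls results are picked across queries can be pictured with a small pure-Python sketch. This is not ELM's implementation: the helper name interleave_unique, the round-robin ordering, and the substring matching for ignore_url_parts are assumptions based only on the parameter descriptions above (results from all queries are interleaved, duplicates are dropped, and num_urls defaults to 3 * len(queries)).

```python
from itertools import zip_longest


def interleave_unique(results_per_query, num_urls=None, ignore_url_parts=None):
    """Hypothetical helper: round-robin merge of per-query URL lists."""
    if num_urls is None:
        # Documented default: three URLs per query
        num_urls = 3 * len(results_per_query)
    ignore = set(ignore_url_parts or ())

    selected = []
    # zip_longest yields the 1st hit of every query, then the 2nd, and so on;
    # shorter lists are padded with None
    for rank in zip_longest(*results_per_query):
        for url in rank:
            if url is None or url in selected:
                continue  # padding or duplicate URL
            if any(part in url for part in ignore):
                continue  # URL contains an ignored component
            selected.append(url)
    return selected[:num_urls]


# Made-up search results for two queries
hits = [
    ["https://a.example/1", "https://b.example/2"],
    ["https://a.example/1", "https://en.wikipedia.org/wiki/X", "https://c.example/3"],
]
top = interleave_unique(hits, num_urls=2, ignore_url_parts={"wikipedia.org"})
```

In this sketch, a num_urls smaller than the number of queries would cut the list before later queries contribute anything, which matches the caveat in the num_urls description.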