elm.web.search.run.web_search_links_as_docs
- async web_search_links_as_docs(queries, search_engines=('PlaywrightGoogleLinkSearch', 'PlaywrightDuckDuckGoLinkSearch', 'APIDuckDuckGoSearch'), num_urls=None, ignore_url_parts=None, browser_semaphore=None, task_name=None, **kwargs)
Retrieve top N search results as document instances.
- Parameters:
queries (collection of str) – Collection of strings representing Google queries. Documents for the top num_urls Google search results (from all of these queries _combined_) will be returned from this function.
search_engines (iterable of str) – Ordered collection of search engine names to attempt for web search. If the first search engine in the list returns a set of URLs, iteration ends and documents for each URL are returned. Otherwise, the next engine in the list is used to run the web search, and so on until one succeeds. If all web searches fail, an empty list is returned. See SEARCH_ENGINE_OPTIONS for supported search engine options. By default, ("PlaywrightGoogleLinkSearch", "PlaywrightDuckDuckGoLinkSearch", "APIDuckDuckGoSearch").
num_urls (int, optional) – Number of unique top Google search results to return as docs. The Google search results from all queries are interleaved and the top num_urls unique URLs are downloaded as docs. If this number is less than len(queries), some of your queries may not contribute to the final output. By default, None, which sets num_urls = 3 * len(queries).
ignore_url_parts (iterable of str, optional) – Optional URL components to blacklist. For example, supplying ignore_url_parts={"wikipedia.org"} will ignore all URLs that contain "wikipedia.org". By default, None.
browser_semaphore (asyncio.Semaphore, optional) – Semaphore instance that can be used to limit the number of Playwright browsers open concurrently. If None, no limits are applied. By default, None.
task_name (str, optional) – Optional task name to use in asyncio.create_task(). By default, None.
**kwargs – Keyword-argument pairs used to initialize elm.web.file_loader.AsyncFileLoader and any of the search engines in the search_engines input. For example, you may specify pw_launch_kwargs={"headless": False} to have all Playwright-based searches show the browser and _also_ specify google_serper_api_kwargs={"api_key": "..."} to specify the API key for the Google Serper search.
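The way num_urls and ignore_url_parts interact with multiple queries can be illustrated with a small, self-contained sketch. This is not ELM's implementation: the helper name select_top_urls is hypothetical, and the sketch assumes that "interleaved" means taking the first result of every query, then the second of every query, and so on, and that ignored URLs are filtered out before they count toward num_urls.

```python
from itertools import zip_longest


def select_top_urls(results_per_query, num_urls=None, ignore_url_parts=None):
    """Interleave per-query results and keep the first unique, allowed URLs.

    Hypothetical helper for illustration only; not part of ELM.
    """
    if num_urls is None:
        # Documented default: num_urls = 3 * len(queries)
        num_urls = 3 * len(results_per_query)
    ignore_url_parts = set(ignore_url_parts or ())

    selected = []
    # zip_longest yields the rank-1 result of every query, then rank 2, ...
    # (assumed interleaving order); shorter lists are padded with None
    for same_rank in zip_longest(*results_per_query):
        for url in same_rank:
            if len(selected) >= num_urls:
                return selected
            if url is None or url in selected:
                continue  # padding from a shorter list, or a duplicate URL
            if any(part in url for part in ignore_url_parts):
                continue  # URL contains an ignored component
            selected.append(url)
    return selected


# Made-up search results for two queries
results = [
    ["https://a.org/1", "https://en.wikipedia.org/wiki/X", "https://b.org/2"],
    ["https://a.org/1", "https://c.org/3"],
]
top = select_top_urls(results, num_urls=3, ignore_url_parts={"wikipedia.org"})
print(top)
```

With this ordering, a num_urls smaller than the number of queries would be exhausted by the first-ranked results of the earliest queries, which is one way later queries can end up contributing nothing to the output.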
- Returns:
list of elm.web.document.BaseDocument – List of documents representing the top num_urls results from the Google searches across all queries.
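The ordered fallback described for search_engines, including the empty-list result when every engine fails, can be sketched as follows. This is a minimal illustration under stated assumptions, not the library's code: first_successful_search and the three engine coroutines are hypothetical stand-ins, and the sketch assumes that an engine raising an exception counts as a failed search in the same way as an engine returning no URLs.

```python
import asyncio


async def first_successful_search(queries, engines):
    """Try each engine in order and return the first non-empty URL list.

    Hypothetical helper for illustration only; not part of ELM.
    """
    for name, search in engines:
        try:
            urls = await search(queries)
        except Exception:
            urls = []  # assumption: an engine error counts as a failed search
        if urls:
            return name, urls  # stop at the first engine that finds URLs
    return None, []  # every engine failed, so nothing is returned


# Hypothetical stand-ins for real search engine classes
async def blocked_engine(queries):
    raise RuntimeError("search blocked")


async def empty_engine(queries):
    return []


async def working_engine(queries):
    return [f"https://example.com/{i}" for i, _ in enumerate(queries)]


engines = [
    ("blocked", blocked_engine),
    ("empty", empty_engine),
    ("working", working_engine),
]
used, urls = asyncio.run(first_successful_search(["q1", "q2"], engines))
print(used, urls)
```

Because the first engine that returns URLs ends the iteration, the order of search_engines acts as a priority list: later entries are only consulted when earlier ones produce nothing.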