├── LICENSE
├── README.md
├── open_deep_researcher_gradio.ipynb
└── open_deep_researcher.ipynb
/LICENSE:
--------------------------------------------------------------------------------
1 | MIT License
2 |
3 | Copyright (c) 2025 mshumer
4 |
5 | Permission is hereby granted, free of charge, to any person obtaining a copy
6 | of this software and associated documentation files (the "Software"), to deal
7 | in the Software without restriction, including without limitation the rights
8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 |
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 |
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # OpenDeepResearcher
2 |
3 | This notebook implements an **AI researcher** that continuously searches for information based on a user query until the system is confident that it has gathered all the necessary details. It makes use of several services to do so:
4 |
5 | - **SERPAPI**: To perform Google searches.
6 | - **Jina**: To fetch and extract webpage content.
7 | - **OpenRouter** (default model: `anthropic/claude-3.5-haiku`): To interact with an LLM for generating search queries, evaluating page relevance, and extracting context.
8 |
9 | ## Features
10 |
11 | - **Iterative Research Loop:** The system refines its search queries iteratively until no further queries are required.
12 | - **Asynchronous Processing:** Searches, webpage fetching, evaluation, and context extraction are performed concurrently to improve speed.
13 | - **Duplicate Filtering:** Aggregates and deduplicates links within each round, ensuring that the same link isn’t processed twice.
14 | - **LLM-Powered Decision Making:** Uses the LLM to generate new search queries, decide on page usefulness, extract relevant context, and produce a final comprehensive report.
15 | - **Gradio Interface:** Use the `open_deep_researcher_gradio.ipynb` notebook if you want to run this in a functional UI.
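The concurrent fan-out described above can be sketched with `asyncio.gather`; the `fetch` helper and URLs below are illustrative stand-ins, not code from the notebook:

```python
import asyncio

async def fetch(url):
    # Stand-in for a real HTTP call (e.g., a SERPAPI search or Jina fetch).
    await asyncio.sleep(0)
    return f"content of {url}"

async def main():
    urls = ["https://a.example", "https://b.example"]
    # All coroutines are scheduled at once; gather preserves input order.
    return await asyncio.gather(*(fetch(u) for u in urls))

results = asyncio.run(main())
```

This is the same pattern the notebook applies to its search, fetch, and evaluation calls.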
16 |
17 | ## Requirements
18 |
19 | - API access and keys for:
20 | - **OpenRouter API**
21 | - **SERPAPI API**
22 | - **Jina API**
23 |
24 | ## Setup
25 |
26 | 1. **Clone or Open the Notebook:**
27 | - Download the notebook file or open it directly in [Google Colab](https://colab.research.google.com/github/mshumer/OpenDeepResearcher/blob/main/open_deep_researcher.ipynb).
28 |
29 | 2. **Install `nest_asyncio`:**
30 |
31 | Run the first cell to set up `nest_asyncio`.
32 |
33 | 3. **Configure API Keys:**
34 | - Replace the placeholder values in the notebook for `OPENROUTER_API_KEY`, `SERPAPI_API_KEY`, and `JINA_API_KEY` with your actual API keys.
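As an alternative to pasting keys directly into the notebook, you can read them from environment variables; this sketch assumes you export variables with the same names as the notebook's constants:

```python
import os

# Fall back to an empty string so the notebook still loads; API calls will
# fail with a clear authorization error later if a key is missing.
OPENROUTER_API_KEY = os.environ.get("OPENROUTER_API_KEY", "")
SERPAPI_API_KEY = os.environ.get("SERPAPI_API_KEY", "")
JINA_API_KEY = os.environ.get("JINA_API_KEY", "")
```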
35 |
36 | ## Usage
37 |
38 | 1. **Run the Notebook Cells:**
39 | Execute all cells in order. The notebook will prompt you for:
40 | - A research query/topic.
41 | - An optional maximum number of iterations (default is 10).
42 |
43 | 2. **Follow the Research Process:**
44 | - **Initial Query & Search Generation:** The notebook uses the LLM to generate initial search queries.
45 | - **Asynchronous Searches & Extraction:** It performs SERPAPI searches for all queries concurrently, aggregates unique links, and processes each link in parallel to determine page usefulness and extract relevant context.
46 | - **Iterative Refinement:** After each round, the aggregated context is analyzed by the LLM to determine if further search queries are needed.
47 | - **Final Report:** Once the LLM indicates that no further research is needed (or the iteration limit is reached), a final report is generated based on all gathered context.
48 |
49 | 3. **View the Final Report:**
50 | The final comprehensive report will be printed in the output.
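The link aggregation in step 2 keeps the first query that surfaced each link; a minimal sketch of that first-seen-wins deduplication (`dedupe_links` is an illustrative helper name, not from the notebook):

```python
def dedupe_links(queries, search_results):
    """Map each link to the first query that returned it (first seen wins)."""
    unique_links = {}
    for query, links in zip(queries, search_results):
        for link in links:
            # setdefault only records the link on its first appearance.
            unique_links.setdefault(link, query)
    return unique_links

links = dedupe_links(
    ["q1", "q2"],
    [["https://a.example", "https://b.example"],
     ["https://b.example", "https://c.example"]],
)
```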
51 |
52 | ## How It Works
53 |
54 | 1. **Input & Query Generation:**
55 | The user enters a research topic, and the LLM generates up to four distinct search queries.
56 |
57 | 2. **Concurrent Search & Processing:**
58 | - **SERPAPI:** Each search query is sent to SERPAPI concurrently.
59 | - **Deduplication:** All retrieved links are aggregated and deduplicated within the current iteration.
60 | - **Jina & LLM:** Each unique link is processed concurrently to fetch webpage content via Jina, evaluate its usefulness with the LLM, and extract relevant information if the page is deemed useful.
61 |
62 | 3. **Iterative Refinement:**
63 | The system passes the aggregated context to the LLM to determine if further search queries are needed. New queries are generated if required; otherwise, the loop terminates.
64 |
65 | 4. **Final Report Generation:**
66 |    All gathered context is compiled and sent to the LLM, which produces a final, comprehensive report addressing the original query.
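Steps 1 and 3 both rely on parsing a Python-style list out of the LLM's reply. A sketch of one safe way to do that with `ast.literal_eval`, which accepts only literals and so cannot execute code the way `eval` can (the helper name is illustrative):

```python
import ast

def parse_query_list(response):
    """Parse an LLM reply like "['query1', 'query2']" into a Python list."""
    try:
        queries = ast.literal_eval(response.strip())
    except (ValueError, SyntaxError):
        return []  # reply was not a well-formed Python literal
    return queries if isinstance(queries, list) else []

queries = parse_query_list("['query1', 'query2']")
```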
67 |
68 | ## Troubleshooting
69 |
70 | - **RuntimeError with asyncio:**
71 | If you encounter an error like:
72 | ```
73 | RuntimeError: asyncio.run() cannot be called from a running event loop
74 | ```
75 | Ensure you have applied `nest_asyncio` as shown in the setup section.
76 |
77 | - **API Issues:**
78 | Verify that your API keys are correct and that you are not exceeding any rate limits.
79 |
80 | ---
81 |
82 | Follow me on [X](https://x.com/mattshumer_) for updates on this and other AI things I'm working on.
83 |
84 | Head to [ShumerPrompt](https://shumerprompt.com), my "GitHub for Prompts"!
85 |
86 | OpenDeepResearcher is released under the MIT License. See the LICENSE file for more details.
87 |
--------------------------------------------------------------------------------
/open_deep_researcher_gradio.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "nbformat": 4,
3 | "nbformat_minor": 0,
4 | "metadata": {
5 | "colab": {
6 | "provenance": [],
7 | "authorship_tag": "ABX9TyM6XDBP8oqAaLL0GMT0mBj+",
8 | "include_colab_link": true
9 | },
10 | "kernelspec": {
11 | "name": "python3",
12 | "display_name": "Python 3"
13 | },
14 | "language_info": {
15 | "name": "python"
16 | }
17 | },
18 | "cells": [
19 | {
20 | "cell_type": "markdown",
21 | "metadata": {
22 | "id": "view-in-github",
23 | "colab_type": "text"
24 | },
25 | "source": [
26 |         "<a href=\"https://colab.research.google.com/github/mshumer/OpenDeepResearcher/blob/main/open_deep_researcher_gradio.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
27 | ]
28 | },
29 | {
30 | "cell_type": "code",
31 | "execution_count": null,
32 | "metadata": {
33 | "id": "bSpd75_8O8rr"
34 | },
35 | "outputs": [],
36 | "source": [
37 | "!pip install nest_asyncio gradio aiohttp\n",
38 | "\n",
39 | "import nest_asyncio\n",
40 | "nest_asyncio.apply()\n",
41 | "\n",
42 | "import asyncio\n",
43 | "import aiohttp\n",
44 | "import gradio as gr\n",
45 | "import json\n",
46 | "\n",
47 | "# ---------------------------\n",
48 | "# Configuration Constants\n",
49 | "# ---------------------------\n",
50 | "OPENROUTER_API_KEY = \"REDACTED\" # Replace with your OpenRouter API key\n",
51 | "SERPAPI_API_KEY = \"REDACTED\" # Replace with your SERPAPI API key\n",
52 | "JINA_API_KEY = \"REDACTED\" # Replace with your Jina API key\n",
53 | "\n",
54 | "OPENROUTER_URL = \"https://openrouter.ai/api/v1/chat/completions\"\n",
55 | "SERPAPI_URL = \"https://serpapi.com/search\"\n",
56 | "JINA_BASE_URL = \"https://r.jina.ai/\"\n",
57 | "\n",
58 | "DEFAULT_MODEL = \"anthropic/claude-3.5-haiku\"\n",
59 | "\n",
60 | "# -------------------------------\n",
61 | "# Asynchronous Helper Functions\n",
62 | "# -------------------------------\n",
63 | "\n",
64 | "async def call_openrouter_async(session, messages, model=DEFAULT_MODEL):\n",
65 | " headers = {\n",
66 | " \"Authorization\": f\"Bearer {OPENROUTER_API_KEY}\",\n",
67 | " \"X-Title\": \"OpenDeepResearcher, by Matt Shumer\",\n",
68 | " \"Content-Type\": \"application/json\"\n",
69 | " }\n",
70 | " payload = {\n",
71 | " \"model\": model,\n",
72 | " \"messages\": messages\n",
73 | " }\n",
74 | " try:\n",
75 | " async with session.post(OPENROUTER_URL, headers=headers, json=payload) as resp:\n",
76 | " if resp.status == 200:\n",
77 | " result = await resp.json()\n",
78 | " try:\n",
79 | " return result['choices'][0]['message']['content']\n",
80 | " except (KeyError, IndexError):\n",
81 | " print(\"Unexpected OpenRouter response structure:\", result)\n",
82 | " return None\n",
83 | " else:\n",
84 | " text = await resp.text()\n",
85 | " print(f\"OpenRouter API error: {resp.status} - {text}\")\n",
86 | " return None\n",
87 | " except Exception as e:\n",
88 | " print(\"Error calling OpenRouter:\", e)\n",
89 | " return None\n",
90 | "\n",
91 | "async def generate_search_queries_async(session, user_query):\n",
92 | " prompt = (\n",
93 | " \"You are an expert research assistant. Given the user's query, generate up to four distinct, \"\n",
94 | " \"precise search queries that would help gather complete information on the topic. \"\n",
95 | " \"Return only a Python list of strings, for example: ['query1', 'query2', 'query3'].\"\n",
96 | " )\n",
97 | " messages = [\n",
98 | " {\"role\": \"system\", \"content\": \"You are a helpful and precise research assistant.\"},\n",
99 | " {\"role\": \"user\", \"content\": f\"User Query: {user_query}\\n\\n{prompt}\"}\n",
100 | " ]\n",
101 | " response = await call_openrouter_async(session, messages)\n",
102 | " if response:\n",
103 | " try:\n",
104 |         " import ast; search_queries = ast.literal_eval(response)  # safer than eval on untrusted LLM output\n",
105 | " if isinstance(search_queries, list):\n",
106 | " return search_queries\n",
107 | " else:\n",
108 | " print(\"LLM did not return a list. Response:\", response)\n",
109 | " return []\n",
110 | " except Exception as e:\n",
111 | " print(\"Error parsing search queries:\", e, \"\\nResponse:\", response)\n",
112 | " return []\n",
113 | " return []\n",
114 | "\n",
115 | "async def perform_search_async(session, query):\n",
116 | " params = {\n",
117 | " \"q\": query,\n",
118 | " \"api_key\": SERPAPI_API_KEY,\n",
119 | " \"engine\": \"google\"\n",
120 | " }\n",
121 | " try:\n",
122 | " async with session.get(SERPAPI_URL, params=params) as resp:\n",
123 | " if resp.status == 200:\n",
124 | " results = await resp.json()\n",
125 | " if \"organic_results\" in results:\n",
126 | " links = [item.get(\"link\") for item in results[\"organic_results\"] if \"link\" in item]\n",
127 | " return links\n",
128 | " else:\n",
129 | " print(\"No organic results in SERPAPI response.\")\n",
130 | " return []\n",
131 | " else:\n",
132 | " text = await resp.text()\n",
133 | " print(f\"SERPAPI error: {resp.status} - {text}\")\n",
134 | " return []\n",
135 | " except Exception as e:\n",
136 | " print(\"Error performing SERPAPI search:\", e)\n",
137 | " return []\n",
138 | "\n",
139 | "async def fetch_webpage_text_async(session, url):\n",
140 | " full_url = f\"{JINA_BASE_URL}{url}\"\n",
141 | " headers = {\n",
142 | " \"Authorization\": f\"Bearer {JINA_API_KEY}\"\n",
143 | " }\n",
144 | " try:\n",
145 | " async with session.get(full_url, headers=headers) as resp:\n",
146 | " if resp.status == 200:\n",
147 | " return await resp.text()\n",
148 | " else:\n",
149 | " text = await resp.text()\n",
150 | " print(f\"Jina fetch error for {url}: {resp.status} - {text}\")\n",
151 | " return \"\"\n",
152 | " except Exception as e:\n",
153 | " print(\"Error fetching webpage text with Jina:\", e)\n",
154 | " return \"\"\n",
155 | "\n",
156 | "async def is_page_useful_async(session, user_query, page_text):\n",
157 | " prompt = (\n",
158 | " \"You are a critical research evaluator. Given the user's query and the content of a webpage, \"\n",
159 | " \"determine if the webpage contains information that is useful for addressing the query. \"\n",
160 | " \"Respond with exactly one word: 'Yes' if the page is useful, or 'No' if it is not. Do not include any extra text.\"\n",
161 | " )\n",
162 | " messages = [\n",
163 | " {\"role\": \"system\", \"content\": \"You are a strict and concise evaluator of research relevance.\"},\n",
164 | " {\"role\": \"user\", \"content\": f\"User Query: {user_query}\\n\\nWebpage Content (first 20000 characters):\\n{page_text[:20000]}\\n\\n{prompt}\"}\n",
165 | " ]\n",
166 | " response = await call_openrouter_async(session, messages)\n",
167 | " if response:\n",
168 | " answer = response.strip()\n",
169 | " if answer in [\"Yes\", \"No\"]:\n",
170 | " return answer\n",
171 | " else:\n",
172 | " if \"Yes\" in answer:\n",
173 | " return \"Yes\"\n",
174 | " elif \"No\" in answer:\n",
175 | " return \"No\"\n",
176 | " return \"No\"\n",
177 | "\n",
178 | "async def extract_relevant_context_async(session, user_query, search_query, page_text):\n",
179 | " prompt = (\n",
180 | " \"You are an expert information extractor. Given the user's query, the search query that led to this page, \"\n",
181 | " \"and the webpage content, extract all pieces of information that are useful for answering the user's query. \"\n",
182 | " \"Return only the relevant context as plain text without extra commentary.\"\n",
183 | " )\n",
184 | " messages = [\n",
185 | " {\"role\": \"system\", \"content\": \"You are an expert in extracting and summarizing relevant information.\"},\n",
186 | " {\"role\": \"user\", \"content\": f\"User Query: {user_query}\\nSearch Query: {search_query}\\n\\nWebpage Content (first 20000 characters):\\n{page_text[:20000]}\\n\\n{prompt}\"}\n",
187 | " ]\n",
188 | " response = await call_openrouter_async(session, messages)\n",
189 | " if response:\n",
190 | " return response.strip()\n",
191 | " return \"\"\n",
192 | "\n",
193 | "async def get_new_search_queries_async(session, user_query, previous_search_queries, all_contexts):\n",
194 | " context_combined = \"\\n\".join(all_contexts)\n",
195 | " prompt = (\n",
196 | " \"You are an analytical research assistant. Based on the original query, the search queries performed so far, \"\n",
197 | " \"and the extracted contexts from webpages, decide if further research is needed. \"\n",
198 | " \"If further research is needed, provide up to four new search queries as a Python list (for example, \"\n",
199 |         " \"['new query1', 'new query2']). If you believe no further research is needed, respond with exactly <done>.\"\n",
200 |         " \"\\nOutput only a Python list or the token <done> without any extra text.\"\n",
201 | " )\n",
202 | " messages = [\n",
203 | " {\"role\": \"system\", \"content\": \"You are a systematic research planner.\"},\n",
204 | " {\"role\": \"user\", \"content\": f\"User Query: {user_query}\\nPrevious Search Queries: {previous_search_queries}\\n\\nExtracted Relevant Contexts:\\n{context_combined}\\n\\n{prompt}\"}\n",
205 | " ]\n",
206 | " response = await call_openrouter_async(session, messages)\n",
207 | " if response:\n",
208 | " cleaned = response.strip()\n",
209 |         " if cleaned == \"<done>\":\n",
210 |         " return \"<done>\"\n",
211 | " try:\n",
212 |         " import ast; new_queries = ast.literal_eval(cleaned)  # safer than eval on untrusted LLM output\n",
213 | " if isinstance(new_queries, list):\n",
214 | " return new_queries\n",
215 | " else:\n",
216 | " print(\"LLM did not return a list for new search queries. Response:\", response)\n",
217 | " return []\n",
218 | " except Exception as e:\n",
219 | " print(\"Error parsing new search queries:\", e, \"\\nResponse:\", response)\n",
220 | " return []\n",
221 | " return []\n",
222 | "\n",
223 | "async def generate_final_report_async(session, user_query, all_contexts):\n",
224 | " context_combined = \"\\n\".join(all_contexts)\n",
225 | " prompt = (\n",
226 | " \"You are an expert researcher and report writer. Based on the gathered contexts below and the original query, \"\n",
227 | " \"write a complete, well-structured, and detailed report that addresses the query thoroughly. \"\n",
228 | " \"Include all useful insights and conclusions without extra commentary.\"\n",
229 | " )\n",
230 | " messages = [\n",
231 | " {\"role\": \"system\", \"content\": \"You are a skilled report writer.\"},\n",
232 | " {\"role\": \"user\", \"content\": f\"User Query: {user_query}\\n\\nGathered Relevant Contexts:\\n{context_combined}\\n\\n{prompt}\"}\n",
233 | " ]\n",
234 | " report = await call_openrouter_async(session, messages)\n",
235 | " return report\n",
236 | "\n",
237 | "async def process_link(session, link, user_query, search_query, log):\n",
238 | " log.append(f\"Fetching content from: {link}\")\n",
239 | " page_text = await fetch_webpage_text_async(session, link)\n",
240 | " if not page_text:\n",
241 | " log.append(f\"Failed to fetch content from: {link}\")\n",
242 | " return None\n",
243 | " usefulness = await is_page_useful_async(session, user_query, page_text)\n",
244 | " log.append(f\"Page usefulness for {link}: {usefulness}\")\n",
245 | " if usefulness == \"Yes\":\n",
246 | " context = await extract_relevant_context_async(session, user_query, search_query, page_text)\n",
247 | " if context:\n",
248 | " log.append(f\"Extracted context from {link} (first 200 chars): {context[:200]}\")\n",
249 | " return context\n",
250 | " return None\n",
251 | "\n",
252 | "# -----------------------------\n",
253 | "# Main Asynchronous Routine\n",
254 | "# -----------------------------\n",
255 | "\n",
256 | "async def async_research(user_query, iteration_limit):\n",
257 | " aggregated_contexts = []\n",
258 | " all_search_queries = []\n",
259 | " log_messages = [] # List to store intermediate steps\n",
260 | " iteration = 0\n",
261 | "\n",
262 | " async with aiohttp.ClientSession() as session:\n",
263 | " log_messages.append(\"Generating initial search queries...\")\n",
264 | " new_search_queries = await generate_search_queries_async(session, user_query)\n",
265 | " if not new_search_queries:\n",
266 | " log_messages.append(\"No search queries were generated by the LLM. Exiting.\")\n",
267 | " return \"No search queries were generated by the LLM. Exiting.\", \"\\n\".join(log_messages)\n",
268 | " all_search_queries.extend(new_search_queries)\n",
269 | " log_messages.append(f\"Initial search queries: {new_search_queries}\")\n",
270 | "\n",
271 | " while iteration < iteration_limit:\n",
272 | " log_messages.append(f\"\\n=== Iteration {iteration + 1} ===\")\n",
273 | " iteration_contexts = []\n",
274 | " search_tasks = [perform_search_async(session, query) for query in new_search_queries]\n",
275 | " search_results = await asyncio.gather(*search_tasks)\n",
276 | " unique_links = {}\n",
277 | " for idx, links in enumerate(search_results):\n",
278 | " query_used = new_search_queries[idx]\n",
279 | " for link in links:\n",
280 | " if link not in unique_links:\n",
281 | " unique_links[link] = query_used\n",
282 | "\n",
283 | " log_messages.append(f\"Aggregated {len(unique_links)} unique links from this iteration.\")\n",
284 | " link_tasks = [\n",
285 | " process_link(session, link, user_query, unique_links[link], log_messages)\n",
286 | " for link in unique_links\n",
287 | " ]\n",
288 | " link_results = await asyncio.gather(*link_tasks)\n",
289 | " for res in link_results:\n",
290 | " if res:\n",
291 | " iteration_contexts.append(res)\n",
292 | "\n",
293 | " if iteration_contexts:\n",
294 | " aggregated_contexts.extend(iteration_contexts)\n",
295 | " log_messages.append(f\"Found {len(iteration_contexts)} useful contexts in this iteration.\")\n",
296 | " else:\n",
297 | " log_messages.append(\"No useful contexts were found in this iteration.\")\n",
298 | "\n",
299 | " new_search_queries = await get_new_search_queries_async(session, user_query, all_search_queries, aggregated_contexts)\n",
300 |         " if new_search_queries == \"<done>\":\n",
301 | " log_messages.append(\"LLM indicated that no further research is needed.\")\n",
302 | " break\n",
303 | " elif new_search_queries:\n",
304 | " log_messages.append(f\"LLM provided new search queries: {new_search_queries}\")\n",
305 | " all_search_queries.extend(new_search_queries)\n",
306 | " else:\n",
307 | " log_messages.append(\"LLM did not provide any new search queries. Ending the loop.\")\n",
308 | " break\n",
309 | "\n",
310 | " iteration += 1\n",
311 | "\n",
312 | " log_messages.append(\"\\nGenerating final report...\")\n",
313 | " final_report = await generate_final_report_async(session, user_query, aggregated_contexts)\n",
314 | " return final_report, \"\\n\".join(log_messages)\n",
315 | "\n",
316 | "def run_research(user_query, iteration_limit=10):\n",
317 | " return asyncio.run(async_research(user_query, iteration_limit))\n",
318 | "\n",
319 | "# -----------------------------\n",
320 | "# Gradio UI Setup\n",
321 | "# -----------------------------\n",
322 | "\n",
323 | "def gradio_run(user_query, iteration_limit):\n",
324 | " try:\n",
325 | " final_report, logs = run_research(user_query, int(iteration_limit))\n",
326 | " return final_report, logs\n",
327 | " except Exception as e:\n",
328 | " return f\"An error occurred: {e}\", \"\"\n",
329 | "\n",
330 | "iface = gr.Interface(\n",
331 | " fn=gradio_run,\n",
332 | " inputs=[\n",
333 | " gr.Textbox(lines=2, label=\"Research Query/Topic\"),\n",
334 | " gr.Number(value=10, label=\"Max Iterations\")\n",
335 | " ],\n",
336 | " outputs=[\n",
337 | " gr.Textbox(label=\"Final Report\"),\n",
338 | " gr.Textbox(label=\"Intermediate Steps Log\")\n",
339 | " ],\n",
340 | " title=\"Research Assistant\",\n",
341 | " description=\"Enter your query and a maximum iteration count to generate a report. The log will show the steps taken.\"\n",
342 | ")\n",
343 | "\n",
344 | "iface.launch()"
345 | ]
346 | }
347 | ]
348 | }
--------------------------------------------------------------------------------
/open_deep_researcher.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "nbformat": 4,
3 | "nbformat_minor": 0,
4 | "metadata": {
5 | "colab": {
6 | "provenance": [],
7 | "authorship_tag": "ABX9TyOe5BsaH0aplNCjknkFtnjg",
8 | "include_colab_link": true
9 | },
10 | "kernelspec": {
11 | "name": "python3",
12 | "display_name": "Python 3"
13 | },
14 | "language_info": {
15 | "name": "python"
16 | }
17 | },
18 | "cells": [
19 | {
20 | "cell_type": "markdown",
21 | "metadata": {
22 | "id": "view-in-github",
23 | "colab_type": "text"
24 | },
25 | "source": [
26 |         "<a href=\"https://colab.research.google.com/github/mshumer/OpenDeepResearcher/blob/main/open_deep_researcher.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
27 | ]
28 | },
29 | {
30 | "cell_type": "code",
31 | "source": [
32 | "!pip install nest_asyncio\n",
33 | "import nest_asyncio\n",
34 | "nest_asyncio.apply()"
35 | ],
36 | "metadata": {
37 | "colab": {
38 | "base_uri": "https://localhost:8080/"
39 | },
40 | "id": "y7cTpP9rDZW-",
41 | "outputId": "5a443ad2-7a8d-4fef-f315-12108c28f1a2"
42 | },
43 | "execution_count": null,
44 | "outputs": [
45 | {
46 | "output_type": "stream",
47 | "name": "stdout",
48 | "text": [
49 | "Requirement already satisfied: nest_asyncio in /usr/local/lib/python3.11/dist-packages (1.6.0)\n"
50 | ]
51 | }
52 | ]
53 | },
54 | {
55 | "cell_type": "code",
56 | "execution_count": null,
57 | "metadata": {
58 | "id": "GJTo96a7DGUz"
59 | },
60 | "outputs": [],
61 | "source": [
62 | "import asyncio\n",
63 | "import aiohttp\n",
64 | "import json\n",
65 | "\n",
66 | "# =======================\n",
67 | "# Configuration Constants\n",
68 | "# =======================\n",
69 | "OPENROUTER_API_KEY = \"REDACTED\" # Replace with your OpenRouter API key\n",
70 | "SERPAPI_API_KEY = \"REDACTED\" # Replace with your SERPAPI API key\n",
71 | "JINA_API_KEY = \"REDACTED\" # Replace with your JINA API key\n",
72 | "\n",
73 | "# Endpoints\n",
74 | "OPENROUTER_URL = \"https://openrouter.ai/api/v1/chat/completions\"\n",
75 | "SERPAPI_URL = \"https://serpapi.com/search\"\n",
76 | "JINA_BASE_URL = \"https://r.jina.ai/\"\n",
77 | "\n",
78 | "# Default LLM model (can be changed if desired)\n",
79 | "DEFAULT_MODEL = \"anthropic/claude-3.5-haiku\"\n",
80 | "\n",
81 | "\n",
82 | "# ============================\n",
83 | "# Asynchronous Helper Functions\n",
84 | "# ============================\n",
85 | "\n",
86 | "async def call_openrouter_async(session, messages, model=DEFAULT_MODEL):\n",
87 | " \"\"\"\n",
88 | " Asynchronously call the OpenRouter chat completion API with the provided messages.\n",
89 | " Returns the content of the assistant’s reply.\n",
90 | " \"\"\"\n",
91 | " headers = {\n",
92 | " \"Authorization\": f\"Bearer {OPENROUTER_API_KEY}\",\n",
93 | " \"X-Title\": \"OpenDeepResearcher, by Matt Shumer\",\n",
94 | " \"Content-Type\": \"application/json\"\n",
95 | " }\n",
96 | " payload = {\n",
97 | " \"model\": model,\n",
98 | " \"messages\": messages\n",
99 | " }\n",
100 | " try:\n",
101 | " async with session.post(OPENROUTER_URL, headers=headers, json=payload) as resp:\n",
102 | " if resp.status == 200:\n",
103 | " result = await resp.json()\n",
104 | " try:\n",
105 | " return result['choices'][0]['message']['content']\n",
106 | " except (KeyError, IndexError) as e:\n",
107 | " print(\"Unexpected OpenRouter response structure:\", result)\n",
108 | " return None\n",
109 | " else:\n",
110 | " text = await resp.text()\n",
111 | " print(f\"OpenRouter API error: {resp.status} - {text}\")\n",
112 | " return None\n",
113 | " except Exception as e:\n",
114 | " print(\"Error calling OpenRouter:\", e)\n",
115 | " return None\n",
116 | "\n",
117 | "\n",
118 | "async def generate_search_queries_async(session, user_query):\n",
119 | " \"\"\"\n",
120 | " Ask the LLM to produce up to four precise search queries (in Python list format)\n",
121 | " based on the user’s query.\n",
122 | " \"\"\"\n",
123 | " prompt = (\n",
124 | " \"You are an expert research assistant. Given the user's query, generate up to four distinct, \"\n",
125 | " \"precise search queries that would help gather comprehensive information on the topic. \"\n",
126 | " \"Return only a Python list of strings, for example: ['query1', 'query2', 'query3'].\"\n",
127 | " )\n",
128 | " messages = [\n",
129 | " {\"role\": \"system\", \"content\": \"You are a helpful and precise research assistant.\"},\n",
130 | " {\"role\": \"user\", \"content\": f\"User Query: {user_query}\\n\\n{prompt}\"}\n",
131 | " ]\n",
132 | " response = await call_openrouter_async(session, messages)\n",
133 | " if response:\n",
134 | " try:\n",
135 | " # Expect exactly a Python list (e.g., \"['query1', 'query2']\")\n",
136 |         " import ast; search_queries = ast.literal_eval(response)  # safer than eval on untrusted LLM output\n",
137 | " if isinstance(search_queries, list):\n",
138 | " return search_queries\n",
139 | " else:\n",
140 | " print(\"LLM did not return a list. Response:\", response)\n",
141 | " return []\n",
142 | " except Exception as e:\n",
143 | " print(\"Error parsing search queries:\", e, \"\\nResponse:\", response)\n",
144 | " return []\n",
145 | " return []\n",
146 | "\n",
147 | "\n",
148 | "async def perform_search_async(session, query):\n",
149 | " \"\"\"\n",
150 | " Asynchronously perform a Google search using SERPAPI for the given query.\n",
151 | " Returns a list of result URLs.\n",
152 | " \"\"\"\n",
153 | " params = {\n",
154 | " \"q\": query,\n",
155 | " \"api_key\": SERPAPI_API_KEY,\n",
156 | " \"engine\": \"google\"\n",
157 | " }\n",
158 | " try:\n",
159 | " async with session.get(SERPAPI_URL, params=params) as resp:\n",
160 | " if resp.status == 200:\n",
161 | " results = await resp.json()\n",
162 | " if \"organic_results\" in results:\n",
163 | " links = [item.get(\"link\") for item in results[\"organic_results\"] if \"link\" in item]\n",
164 | " return links\n",
165 | " else:\n",
166 | " print(\"No organic results in SERPAPI response.\")\n",
167 | " return []\n",
168 | " else:\n",
169 | " text = await resp.text()\n",
170 | " print(f\"SERPAPI error: {resp.status} - {text}\")\n",
171 | " return []\n",
172 | " except Exception as e:\n",
173 | " print(\"Error performing SERPAPI search:\", e)\n",
174 | " return []\n",
175 | "\n",
176 | "\n",
177 | "async def fetch_webpage_text_async(session, url):\n",
178 | " \"\"\"\n",
179 | " Asynchronously retrieve the text content of a webpage using Jina.\n",
180 | " The URL is appended to the Jina endpoint.\n",
181 | " \"\"\"\n",
182 | " full_url = f\"{JINA_BASE_URL}{url}\"\n",
183 | " headers = {\n",
184 | " \"Authorization\": f\"Bearer {JINA_API_KEY}\"\n",
185 | " }\n",
186 | " try:\n",
187 | " async with session.get(full_url, headers=headers) as resp:\n",
188 | " if resp.status == 200:\n",
189 | " return await resp.text()\n",
190 | " else:\n",
191 | " text = await resp.text()\n",
192 | " print(f\"Jina fetch error for {url}: {resp.status} - {text}\")\n",
193 | " return \"\"\n",
194 | " except Exception as e:\n",
195 | " print(\"Error fetching webpage text with Jina:\", e)\n",
196 | " return \"\"\n",
197 | "\n",
198 | "\n",
199 | "async def is_page_useful_async(session, user_query, page_text):\n",
200 | " \"\"\"\n",
201 | " Ask the LLM if the provided webpage content is useful for answering the user's query.\n",
202 | " The LLM must reply with exactly \"Yes\" or \"No\".\n",
203 | " \"\"\"\n",
204 | " prompt = (\n",
205 | " \"You are a critical research evaluator. Given the user's query and the content of a webpage, \"\n",
206 | " \"determine if the webpage contains information relevant and useful for addressing the query. \"\n",
207 | " \"Respond with exactly one word: 'Yes' if the page is useful, or 'No' if it is not. Do not include any extra text.\"\n",
208 | " )\n",
209 | " messages = [\n",
210 | " {\"role\": \"system\", \"content\": \"You are a strict and concise evaluator of research relevance.\"},\n",
211 | " {\"role\": \"user\", \"content\": f\"User Query: {user_query}\\n\\nWebpage Content (first 20000 characters):\\n{page_text[:20000]}\\n\\n{prompt}\"}\n",
212 | " ]\n",
213 | " response = await call_openrouter_async(session, messages)\n",
214 | " if response:\n",
215 | " answer = response.strip()\n",
216 | " if answer in [\"Yes\", \"No\"]:\n",
217 | " return answer\n",
218 | " else:\n",
219 | " # Fallback: try to extract Yes/No from the response.\n",
220 | " if \"Yes\" in answer:\n",
221 | " return \"Yes\"\n",
222 | " elif \"No\" in answer:\n",
223 | " return \"No\"\n",
224 | " return \"No\"\n",
225 | "\n",
226 | "\n",
227 | "async def extract_relevant_context_async(session, user_query, search_query, page_text):\n",
228 | " \"\"\"\n",
229 | " Given the original query, the search query used, and the page content,\n",
230 | " have the LLM extract all information relevant for answering the query.\n",
231 | " \"\"\"\n",
232 | " prompt = (\n",
233 | " \"You are an expert information extractor. Given the user's query, the search query that led to this page, \"\n",
234 | " \"and the webpage content, extract all pieces of information that are relevant to answering the user's query. \"\n",
235 | " \"Return only the relevant context as plain text without commentary.\"\n",
236 | " )\n",
237 | " messages = [\n",
238 | " {\"role\": \"system\", \"content\": \"You are an expert in extracting and summarizing relevant information.\"},\n",
239 | " {\"role\": \"user\", \"content\": f\"User Query: {user_query}\\nSearch Query: {search_query}\\n\\nWebpage Content (first 20000 characters):\\n{page_text[:20000]}\\n\\n{prompt}\"}\n",
240 | " ]\n",
241 | " response = await call_openrouter_async(session, messages)\n",
242 | " if response:\n",
243 | " return response.strip()\n",
244 | " return \"\"\n",
245 | "\n",
246 | "\n",
247 | "async def get_new_search_queries_async(session, user_query, previous_search_queries, all_contexts):\n",
248 | " \"\"\"\n",
249 | " Based on the original query, the previously used search queries, and all the extracted contexts,\n",
250 | " ask the LLM whether additional search queries are needed. If yes, return a Python list of up to four queries;\n",
251 |         " if the LLM thinks research is complete, it should return \"<done>\".\n",
252 | " \"\"\"\n",
253 | " context_combined = \"\\n\".join(all_contexts)\n",
254 | " prompt = (\n",
255 | " \"You are an analytical research assistant. Based on the original query, the search queries performed so far, \"\n",
256 | " \"and the extracted contexts from webpages, determine if further research is needed. \"\n",
257 |     "        \"If further research is needed, provide up to four new search queries as a Python list (for example, \"\n",
258 |     "        \"['new query1', 'new query2']). If you believe no further research is needed, respond with an empty string. \"\n",
259 |     "        \"\\nOutput only a Python list or an empty string, without any additional text.\"\n",
260 | " )\n",
261 | " messages = [\n",
262 | " {\"role\": \"system\", \"content\": \"You are a systematic research planner.\"},\n",
263 | " {\"role\": \"user\", \"content\": f\"User Query: {user_query}\\nPrevious Search Queries: {previous_search_queries}\\n\\nExtracted Relevant Contexts:\\n{context_combined}\\n\\n{prompt}\"}\n",
264 | " ]\n",
265 | " response = await call_openrouter_async(session, messages)\n",
266 | " if response:\n",
267 | " cleaned = response.strip()\n",
268 | " if cleaned == \"\":\n",
269 | " return \"\"\n",
270 |     "            try:\n",
271 |     "                import ast  # local import; ast.literal_eval is safer than eval for parsing LLM output\n",
272 |     "                new_queries = ast.literal_eval(cleaned)\n",
272 | " if isinstance(new_queries, list):\n",
273 | " return new_queries\n",
274 | " else:\n",
275 | " print(\"LLM did not return a list for new search queries. Response:\", response)\n",
276 | " return []\n",
277 | " except Exception as e:\n",
278 | " print(\"Error parsing new search queries:\", e, \"\\nResponse:\", response)\n",
279 | " return []\n",
280 | " return []\n",
281 | "\n",
282 | "\n",
283 | "async def generate_final_report_async(session, user_query, all_contexts):\n",
284 | " \"\"\"\n",
285 | " Generate the final comprehensive report using all gathered contexts.\n",
286 | " \"\"\"\n",
287 | " context_combined = \"\\n\".join(all_contexts)\n",
288 | " prompt = (\n",
289 | " \"You are an expert researcher and report writer. Based on the gathered contexts below and the original query, \"\n",
290 | " \"write a comprehensive, well-structured, and detailed report that addresses the query thoroughly. \"\n",
291 | " \"Include all relevant insights and conclusions without extraneous commentary.\"\n",
292 | " )\n",
293 | " messages = [\n",
294 | " {\"role\": \"system\", \"content\": \"You are a skilled report writer.\"},\n",
295 | " {\"role\": \"user\", \"content\": f\"User Query: {user_query}\\n\\nGathered Relevant Contexts:\\n{context_combined}\\n\\n{prompt}\"}\n",
296 | " ]\n",
297 | " report = await call_openrouter_async(session, messages)\n",
298 | " return report\n",
299 | "\n",
300 | "\n",
301 | "async def process_link(session, link, user_query, search_query):\n",
302 | " \"\"\"\n",
303 | " Process a single link: fetch its content, judge its usefulness, and if useful, extract the relevant context.\n",
304 | " \"\"\"\n",
305 | " print(f\"Fetching content from: {link}\")\n",
306 | " page_text = await fetch_webpage_text_async(session, link)\n",
307 | " if not page_text:\n",
308 | " return None\n",
309 | " usefulness = await is_page_useful_async(session, user_query, page_text)\n",
310 | " print(f\"Page usefulness for {link}: {usefulness}\")\n",
311 | " if usefulness == \"Yes\":\n",
312 | " context = await extract_relevant_context_async(session, user_query, search_query, page_text)\n",
313 | " if context:\n",
314 | " print(f\"Extracted context from {link} (first 200 chars): {context[:200]}\")\n",
315 | " return context\n",
316 | " return None\n",
317 | "\n",
318 | "\n",
319 | "# =========================\n",
320 | "# Main Asynchronous Routine\n",
321 | "# =========================\n",
322 | "\n",
323 | "async def async_main():\n",
324 | " user_query = input(\"Enter your research query/topic: \").strip()\n",
325 | " iter_limit_input = input(\"Enter maximum number of iterations (default 10): \").strip()\n",
326 | " iteration_limit = int(iter_limit_input) if iter_limit_input.isdigit() else 10\n",
327 | "\n",
328 | " aggregated_contexts = [] # All useful contexts from every iteration\n",
329 | " all_search_queries = [] # Every search query used across iterations\n",
330 | " iteration = 0\n",
331 | "\n",
332 | " async with aiohttp.ClientSession() as session:\n",
333 | " # ----- INITIAL SEARCH QUERIES -----\n",
334 | " new_search_queries = await generate_search_queries_async(session, user_query)\n",
335 | " if not new_search_queries:\n",
336 | " print(\"No search queries were generated by the LLM. Exiting.\")\n",
337 | " return\n",
338 | " all_search_queries.extend(new_search_queries)\n",
339 | "\n",
340 | " # ----- ITERATIVE RESEARCH LOOP -----\n",
341 | " while iteration < iteration_limit:\n",
342 | " print(f\"\\n=== Iteration {iteration + 1} ===\")\n",
343 | " iteration_contexts = []\n",
344 | "\n",
345 | " # For each search query, perform SERPAPI searches concurrently.\n",
346 | " search_tasks = [perform_search_async(session, query) for query in new_search_queries]\n",
347 | " search_results = await asyncio.gather(*search_tasks)\n",
348 | "\n",
349 | " # Aggregate all unique links from all search queries of this iteration.\n",
350 | " # Map each unique link to the search query that produced it.\n",
351 | " unique_links = {}\n",
352 | " for idx, links in enumerate(search_results):\n",
353 | " query = new_search_queries[idx]\n",
354 | " for link in links:\n",
355 | " if link not in unique_links:\n",
356 | " unique_links[link] = query\n",
357 | "\n",
358 | " print(f\"Aggregated {len(unique_links)} unique links from this iteration.\")\n",
359 | "\n",
360 | " # Process each link concurrently: fetch, judge, and extract context.\n",
361 | " link_tasks = [\n",
362 | " process_link(session, link, user_query, unique_links[link])\n",
363 | " for link in unique_links\n",
364 | " ]\n",
365 | " link_results = await asyncio.gather(*link_tasks)\n",
366 | "\n",
367 | " # Collect non-None contexts.\n",
368 | " for res in link_results:\n",
369 | " if res:\n",
370 | " iteration_contexts.append(res)\n",
371 | "\n",
372 | " if iteration_contexts:\n",
373 | " aggregated_contexts.extend(iteration_contexts)\n",
374 | " else:\n",
375 | " print(\"No useful contexts were found in this iteration.\")\n",
376 | "\n",
377 | " # ----- ASK THE LLM IF MORE SEARCHES ARE NEEDED -----\n",
378 | " new_search_queries = await get_new_search_queries_async(session, user_query, all_search_queries, aggregated_contexts)\n",
379 | " if new_search_queries == \"\":\n",
380 | " print(\"LLM indicated that no further research is needed.\")\n",
381 | " break\n",
382 | " elif new_search_queries:\n",
383 | " print(\"LLM provided new search queries:\", new_search_queries)\n",
384 | " all_search_queries.extend(new_search_queries)\n",
385 | " else:\n",
386 | " print(\"LLM did not provide any new search queries. Ending the loop.\")\n",
387 | " break\n",
388 | "\n",
389 | " iteration += 1\n",
390 | "\n",
391 | " # ----- FINAL REPORT -----\n",
392 | " print(\"\\nGenerating final report...\")\n",
393 | " final_report = await generate_final_report_async(session, user_query, aggregated_contexts)\n",
394 | " print(\"\\n==== FINAL REPORT ====\\n\")\n",
395 | " print(final_report)\n",
396 | "\n",
397 | "\n",
398 | "def main():\n",
399 | " asyncio.run(async_main())\n",
400 | "\n",
401 | "\n",
402 | "if __name__ == \"__main__\":\n",
403 | " main()\n"
404 | ]
405 | },
406 | {
407 | "cell_type": "code",
408 | "source": [],
409 | "metadata": {
410 | "id": "46Q5XpapDJZT"
411 | },
412 | "execution_count": null,
413 | "outputs": []
414 | }
415 | ]
416 | }
--------------------------------------------------------------------------------