├── LICENSE ├── README.md ├── open_deep_researcher_gradio.ipynb └── open_deep_researcher.ipynb /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2025 mshumer 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # OpenDeepResearcher 2 | 3 | This notebook implements an **AI researcher** that continuously searches for information based on a user query until the system is confident that it has gathered all the necessary details. It makes use of several services to do so: 4 | 5 | - **SERPAPI**: To perform Google searches. 6 | - **Jina**: To fetch and extract webpage content. 
7 | - **OpenRouter** (default model: `anthropic/claude-3.5-haiku`): To interact with an LLM for generating search queries, evaluating page relevance, and extracting context. 8 | 9 | ## Features 10 | 11 | - **Iterative Research Loop:** The system refines its search queries iteratively until no further queries are required. 12 | - **Asynchronous Processing:** Searches, webpage fetching, evaluation, and context extraction are performed concurrently to improve speed. 13 | - **Duplicate Filtering:** Aggregates and deduplicates links within each round, ensuring that the same link isn’t processed twice. 14 | - **LLM-Powered Decision Making:** Uses the LLM to generate new search queries, decide on page usefulness, extract relevant context, and produce a final comprehensive report. 15 | - **Gradio Interface:** Use the `open_deep_researcher_gradio` notebook if you want a functional UI. 16 | 17 | ## Requirements 18 | 19 | - API access and keys for: 20 | - **OpenRouter API** 21 | - **SERPAPI API** 22 | - **Jina API** 23 | 24 | ## Setup 25 | 26 | 1. **Clone or Open the Notebook:** 27 | - Download the notebook file or open it directly in [Google Colab](https://colab.research.google.com/github/mshumer/OpenDeepResearcher/blob/main/open_deep_researcher.ipynb). 28 | 29 | 2. **Install `nest_asyncio`:** 30 | 31 | Run the first cell to set up `nest_asyncio`. 32 | 33 | 3. **Configure API Keys:** 34 | - Replace the placeholder values in the notebook for `OPENROUTER_API_KEY`, `SERPAPI_API_KEY`, and `JINA_API_KEY` with your actual API keys. 35 | 36 | ## Usage 37 | 38 | 1. **Run the Notebook Cells:** 39 | Execute all cells in order. The notebook will prompt you for: 40 | - A research query/topic. 41 | - An optional maximum number of iterations (default is 10). 42 | 43 | 2. **Follow the Research Process:** 44 | - **Initial Query & Search Generation:** The notebook uses the LLM to generate initial search queries. 
45 | - **Asynchronous Searches & Extraction:** It performs SERPAPI searches for all queries concurrently, aggregates unique links, and processes each link in parallel to determine page usefulness and extract relevant context. 46 | - **Iterative Refinement:** After each round, the aggregated context is analyzed by the LLM to determine if further search queries are needed. 47 | - **Final Report:** Once the LLM indicates that no further research is needed (or the iteration limit is reached), a final report is generated based on all gathered context. 48 | 49 | 3. **View the Final Report:** 50 | The final comprehensive report will be printed in the output. 51 | 52 | ## How It Works 53 | 54 | 1. **Input & Query Generation:** 55 | The user enters a research topic, and the LLM generates up to four distinct search queries. 56 | 57 | 2. **Concurrent Search & Processing:** 58 | - **SERPAPI:** Each search query is sent to SERPAPI concurrently. 59 | - **Deduplication:** All retrieved links are aggregated and deduplicated within the current iteration. 60 | - **Jina & LLM:** Each unique link is processed concurrently to fetch webpage content via Jina, evaluate its usefulness with the LLM, and extract relevant information if the page is deemed useful. 61 | 62 | 3. **Iterative Refinement:** 63 | The system passes the aggregated context to the LLM to determine if further search queries are needed. New queries are generated if required; otherwise, the loop terminates. 64 | 65 | 4. **Final Report Generation:** 66 | All gathered context is compiled and sent to the LLM to produce a final, comprehensive report addressing the original query. 67 | 68 | ## Troubleshooting 69 | 70 | - **RuntimeError with asyncio:** 71 | If you encounter an error like: 72 | ``` 73 | RuntimeError: asyncio.run() cannot be called from a running event loop 74 | ``` 75 | Ensure you have applied `nest_asyncio` as shown in the setup section. 
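The concurrent search-and-dedup step described under "How It Works" can be sketched with plain `asyncio`. This is a minimal illustration, not the notebook's code: `fake_search` is a stand-in stub for the real SERPAPI call, and the dict mirrors how each unique link is mapped to the first query that produced it.

```python
import asyncio


async def fake_search(query: str) -> list[str]:
    # Stand-in for the real SERPAPI call: returns overlapping links.
    await asyncio.sleep(0)  # yield control, as a real network call would
    return [f"https://example.com/{query.split()[0]}", "https://example.com/shared"]


async def gather_unique_links(queries: list[str]) -> dict[str, str]:
    # Fan out all searches concurrently, then map each unique link to
    # the first query that produced it (duplicates are processed once).
    results = await asyncio.gather(*(fake_search(q) for q in queries))
    unique_links: dict[str, str] = {}
    for query, links in zip(queries, results):
        for link in links:
            unique_links.setdefault(link, query)
    return unique_links


links = asyncio.run(gather_unique_links(["alpha topic", "beta topic"]))
print(links)
```

The same `unique_links` dict then drives the per-link fetch/evaluate/extract tasks, which are likewise run concurrently with `asyncio.gather`.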
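The query-generation step above asks the LLM to return a Python list literal, which the notebooks parse with `eval`. A safer sketch of that parsing step uses the standard library's `ast.literal_eval`, which accepts only literals and cannot execute code; the helper name here is illustrative, not from the notebooks.

```python
import ast


def parse_query_list(response: str) -> list[str]:
    # Parse an LLM reply that should be a Python list literal,
    # e.g. "['query1', 'query2']". literal_eval only evaluates
    # literals, so arbitrary code in the reply cannot run.
    try:
        parsed = ast.literal_eval(response.strip())
    except (ValueError, SyntaxError):
        return []
    if isinstance(parsed, list) and all(isinstance(q, str) for q in parsed):
        return parsed
    return []


print(parse_query_list("['solar panel efficiency 2024', 'perovskite cell cost']"))
print(parse_query_list("__import__('os')"))  # rejected: not a literal
```

Malformed or non-list replies simply yield an empty list, matching the notebooks' fallback behavior.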
76 | 77 | - **API Issues:** 78 | Verify that your API keys are correct and that you are not exceeding any rate limits. 79 | 80 | --- 81 | 82 | Follow me on [X](https://x.com/mattshumer_) for updates on this and other AI things I'm working on. 83 | 84 | Head to [ShumerPrompt](https://shumerprompt.com), my "Github for Prompts"! 85 | 86 | OpenDeepResearcher is released under the MIT License. See the LICENSE file for more details. 87 | -------------------------------------------------------------------------------- /open_deep_researcher_gradio.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "nbformat": 4, 3 | "nbformat_minor": 0, 4 | "metadata": { 5 | "colab": { 6 | "provenance": [], 7 | "authorship_tag": "ABX9TyM6XDBP8oqAaLL0GMT0mBj+", 8 | "include_colab_link": true 9 | }, 10 | "kernelspec": { 11 | "name": "python3", 12 | "display_name": "Python 3" 13 | }, 14 | "language_info": { 15 | "name": "python" 16 | } 17 | }, 18 | "cells": [ 19 | { 20 | "cell_type": "markdown", 21 | "metadata": { 22 | "id": "view-in-github", 23 | "colab_type": "text" 24 | }, 25 | "source": [ 26 | "\"Open" 27 | ] 28 | }, 29 | { 30 | "cell_type": "code", 31 | "execution_count": null, 32 | "metadata": { 33 | "id": "bSpd75_8O8rr" 34 | }, 35 | "outputs": [], 36 | "source": [ 37 | "!pip install nest_asyncio gradio aiohttp\n", 38 | "\n", 39 | "import nest_asyncio\n", 40 | "nest_asyncio.apply()\n", 41 | "\n", 42 | "import asyncio\n", 43 | "import aiohttp\n", 44 | "import gradio as gr\n", 45 | "import json\n", 46 | "\n", 47 | "# ---------------------------\n", 48 | "# Configuration Constants\n", 49 | "# ---------------------------\n", 50 | "OPENROUTER_API_KEY = \"REDACTED\" # Replace with your OpenRouter API key\n", 51 | "SERPAPI_API_KEY = \"REDACTED\" # Replace with your SERPAPI API key\n", 52 | "JINA_API_KEY = \"REDACTED\" # Replace with your Jina API key\n", 53 | "\n", 54 | "OPENROUTER_URL = \"https://openrouter.ai/api/v1/chat/completions\"\n", 55 | 
"SERPAPI_URL = \"https://serpapi.com/search\"\n", 56 | "JINA_BASE_URL = \"https://r.jina.ai/\"\n", 57 | "\n", 58 | "DEFAULT_MODEL = \"anthropic/claude-3.5-haiku\"\n", 59 | "\n", 60 | "# -------------------------------\n", 61 | "# Asynchronous Helper Functions\n", 62 | "# -------------------------------\n", 63 | "\n", 64 | "async def call_openrouter_async(session, messages, model=DEFAULT_MODEL):\n", 65 | " headers = {\n", 66 | " \"Authorization\": f\"Bearer {OPENROUTER_API_KEY}\",\n", 67 | " \"X-Title\": \"OpenDeepResearcher, by Matt Shumer\",\n", 68 | " \"Content-Type\": \"application/json\"\n", 69 | " }\n", 70 | " payload = {\n", 71 | " \"model\": model,\n", 72 | " \"messages\": messages\n", 73 | " }\n", 74 | " try:\n", 75 | " async with session.post(OPENROUTER_URL, headers=headers, json=payload) as resp:\n", 76 | " if resp.status == 200:\n", 77 | " result = await resp.json()\n", 78 | " try:\n", 79 | " return result['choices'][0]['message']['content']\n", 80 | " except (KeyError, IndexError):\n", 81 | " print(\"Unexpected OpenRouter response structure:\", result)\n", 82 | " return None\n", 83 | " else:\n", 84 | " text = await resp.text()\n", 85 | " print(f\"OpenRouter API error: {resp.status} - {text}\")\n", 86 | " return None\n", 87 | " except Exception as e:\n", 88 | " print(\"Error calling OpenRouter:\", e)\n", 89 | " return None\n", 90 | "\n", 91 | "async def generate_search_queries_async(session, user_query):\n", 92 | " prompt = (\n", 93 | " \"You are an expert research assistant. Given the user's query, generate up to four distinct, \"\n", 94 | " \"precise search queries that would help gather complete information on the topic. 
\"\n", 95 | " \"Return only a Python list of strings, for example: ['query1', 'query2', 'query3'].\"\n", 96 | " )\n", 97 | " messages = [\n", 98 | " {\"role\": \"system\", \"content\": \"You are a helpful and precise research assistant.\"},\n", 99 | " {\"role\": \"user\", \"content\": f\"User Query: {user_query}\\n\\n{prompt}\"}\n", 100 | " ]\n", 101 | " response = await call_openrouter_async(session, messages)\n", 102 | " if response:\n", 103 | " try:\n", 104 | " search_queries = eval(response)\n", 105 | " if isinstance(search_queries, list):\n", 106 | " return search_queries\n", 107 | " else:\n", 108 | " print(\"LLM did not return a list. Response:\", response)\n", 109 | " return []\n", 110 | " except Exception as e:\n", 111 | " print(\"Error parsing search queries:\", e, \"\\nResponse:\", response)\n", 112 | " return []\n", 113 | " return []\n", 114 | "\n", 115 | "async def perform_search_async(session, query):\n", 116 | " params = {\n", 117 | " \"q\": query,\n", 118 | " \"api_key\": SERPAPI_API_KEY,\n", 119 | " \"engine\": \"google\"\n", 120 | " }\n", 121 | " try:\n", 122 | " async with session.get(SERPAPI_URL, params=params) as resp:\n", 123 | " if resp.status == 200:\n", 124 | " results = await resp.json()\n", 125 | " if \"organic_results\" in results:\n", 126 | " links = [item.get(\"link\") for item in results[\"organic_results\"] if \"link\" in item]\n", 127 | " return links\n", 128 | " else:\n", 129 | " print(\"No organic results in SERPAPI response.\")\n", 130 | " return []\n", 131 | " else:\n", 132 | " text = await resp.text()\n", 133 | " print(f\"SERPAPI error: {resp.status} - {text}\")\n", 134 | " return []\n", 135 | " except Exception as e:\n", 136 | " print(\"Error performing SERPAPI search:\", e)\n", 137 | " return []\n", 138 | "\n", 139 | "async def fetch_webpage_text_async(session, url):\n", 140 | " full_url = f\"{JINA_BASE_URL}{url}\"\n", 141 | " headers = {\n", 142 | " \"Authorization\": f\"Bearer {JINA_API_KEY}\"\n", 143 | " }\n", 144 | " 
try:\n", 145 | " async with session.get(full_url, headers=headers) as resp:\n", 146 | " if resp.status == 200:\n", 147 | " return await resp.text()\n", 148 | " else:\n", 149 | " text = await resp.text()\n", 150 | " print(f\"Jina fetch error for {url}: {resp.status} - {text}\")\n", 151 | " return \"\"\n", 152 | " except Exception as e:\n", 153 | " print(\"Error fetching webpage text with Jina:\", e)\n", 154 | " return \"\"\n", 155 | "\n", 156 | "async def is_page_useful_async(session, user_query, page_text):\n", 157 | " prompt = (\n", 158 | " \"You are a critical research evaluator. Given the user's query and the content of a webpage, \"\n", 159 | " \"determine if the webpage contains information that is useful for addressing the query. \"\n", 160 | " \"Respond with exactly one word: 'Yes' if the page is useful, or 'No' if it is not. Do not include any extra text.\"\n", 161 | " )\n", 162 | " messages = [\n", 163 | " {\"role\": \"system\", \"content\": \"You are a strict and concise evaluator of research relevance.\"},\n", 164 | " {\"role\": \"user\", \"content\": f\"User Query: {user_query}\\n\\nWebpage Content (first 20000 characters):\\n{page_text[:20000]}\\n\\n{prompt}\"}\n", 165 | " ]\n", 166 | " response = await call_openrouter_async(session, messages)\n", 167 | " if response:\n", 168 | " answer = response.strip()\n", 169 | " if answer in [\"Yes\", \"No\"]:\n", 170 | " return answer\n", 171 | " else:\n", 172 | " if \"Yes\" in answer:\n", 173 | " return \"Yes\"\n", 174 | " elif \"No\" in answer:\n", 175 | " return \"No\"\n", 176 | " return \"No\"\n", 177 | "\n", 178 | "async def extract_relevant_context_async(session, user_query, search_query, page_text):\n", 179 | " prompt = (\n", 180 | " \"You are an expert information extractor. Given the user's query, the search query that led to this page, \"\n", 181 | " \"and the webpage content, extract all pieces of information that are useful for answering the user's query. 
\"\n", 182 | " \"Return only the relevant context as plain text without extra commentary.\"\n", 183 | " )\n", 184 | " messages = [\n", 185 | " {\"role\": \"system\", \"content\": \"You are an expert in extracting and summarizing relevant information.\"},\n", 186 | " {\"role\": \"user\", \"content\": f\"User Query: {user_query}\\nSearch Query: {search_query}\\n\\nWebpage Content (first 20000 characters):\\n{page_text[:20000]}\\n\\n{prompt}\"}\n", 187 | " ]\n", 188 | " response = await call_openrouter_async(session, messages)\n", 189 | " if response:\n", 190 | " return response.strip()\n", 191 | " return \"\"\n", 192 | "\n", 193 | "async def get_new_search_queries_async(session, user_query, previous_search_queries, all_contexts):\n", 194 | " context_combined = \"\\n\".join(all_contexts)\n", 195 | " prompt = (\n", 196 | " \"You are an analytical research assistant. Based on the original query, the search queries performed so far, \"\n", 197 | " \"and the extracted contexts from webpages, decide if further research is needed. \"\n", 198 | " \"If further research is needed, provide up to four new search queries as a Python list (for example, \"\n", 199 | " \"['new query1', 'new query2']). 
If you believe no further research is needed, respond with exactly <done>.\"\n", 200 | "\\nOutput only a Python list or the token <done> without any extra text.\"\n", 201 | " )\n", 202 | " messages = [\n", 203 | " {\"role\": \"system\", \"content\": \"You are a systematic research planner.\"},\n", 204 | " {\"role\": \"user\", \"content\": f\"User Query: {user_query}\\nPrevious Search Queries: {previous_search_queries}\\n\\nExtracted Relevant Contexts:\\n{context_combined}\\n\\n{prompt}\"}\n", 205 | " ]\n", 206 | " response = await call_openrouter_async(session, messages)\n", 207 | " if response:\n", 208 | " cleaned = response.strip()\n", 209 | " if cleaned == \"<done>\":\n", 210 | " return \"\"\n", 211 | " try:\n", 212 | " new_queries = eval(cleaned)\n", 213 | " if isinstance(new_queries, list):\n", 214 | " return new_queries\n", 215 | " else:\n", 216 | " print(\"LLM did not return a list for new search queries. Response:\", response)\n", 217 | " return []\n", 218 | " except Exception as e:\n", 219 | " print(\"Error parsing new search queries:\", e, \"\\nResponse:\", response)\n", 220 | " return []\n", 221 | " return []\n", 222 | "\n", 223 | "async def generate_final_report_async(session, user_query, all_contexts):\n", 224 | " context_combined = \"\\n\".join(all_contexts)\n", 225 | " prompt = (\n", 226 | " \"You are an expert researcher and report writer. Based on the gathered contexts below and the original query, \"\n", 227 | " \"write a complete, well-structured, and detailed report that addresses the query thoroughly. 
\"\n", 228 | " \"Include all useful insights and conclusions without extra commentary.\"\n", 229 | " )\n", 230 | " messages = [\n", 231 | " {\"role\": \"system\", \"content\": \"You are a skilled report writer.\"},\n", 232 | " {\"role\": \"user\", \"content\": f\"User Query: {user_query}\\n\\nGathered Relevant Contexts:\\n{context_combined}\\n\\n{prompt}\"}\n", 233 | " ]\n", 234 | " report = await call_openrouter_async(session, messages)\n", 235 | " return report\n", 236 | "\n", 237 | "async def process_link(session, link, user_query, search_query, log):\n", 238 | " log.append(f\"Fetching content from: {link}\")\n", 239 | " page_text = await fetch_webpage_text_async(session, link)\n", 240 | " if not page_text:\n", 241 | " log.append(f\"Failed to fetch content from: {link}\")\n", 242 | " return None\n", 243 | " usefulness = await is_page_useful_async(session, user_query, page_text)\n", 244 | " log.append(f\"Page usefulness for {link}: {usefulness}\")\n", 245 | " if usefulness == \"Yes\":\n", 246 | " context = await extract_relevant_context_async(session, user_query, search_query, page_text)\n", 247 | " if context:\n", 248 | " log.append(f\"Extracted context from {link} (first 200 chars): {context[:200]}\")\n", 249 | " return context\n", 250 | " return None\n", 251 | "\n", 252 | "# -----------------------------\n", 253 | "# Main Asynchronous Routine\n", 254 | "# -----------------------------\n", 255 | "\n", 256 | "async def async_research(user_query, iteration_limit):\n", 257 | " aggregated_contexts = []\n", 258 | " all_search_queries = []\n", 259 | " log_messages = [] # List to store intermediate steps\n", 260 | " iteration = 0\n", 261 | "\n", 262 | " async with aiohttp.ClientSession() as session:\n", 263 | " log_messages.append(\"Generating initial search queries...\")\n", 264 | " new_search_queries = await generate_search_queries_async(session, user_query)\n", 265 | " if not new_search_queries:\n", 266 | " log_messages.append(\"No search queries were generated by 
the LLM. Exiting.\")\n", 267 | " return \"No search queries were generated by the LLM. Exiting.\", \"\\n\".join(log_messages)\n", 268 | " all_search_queries.extend(new_search_queries)\n", 269 | " log_messages.append(f\"Initial search queries: {new_search_queries}\")\n", 270 | "\n", 271 | " while iteration < iteration_limit:\n", 272 | " log_messages.append(f\"\\n=== Iteration {iteration + 1} ===\")\n", 273 | " iteration_contexts = []\n", 274 | " search_tasks = [perform_search_async(session, query) for query in new_search_queries]\n", 275 | " search_results = await asyncio.gather(*search_tasks)\n", 276 | " unique_links = {}\n", 277 | " for idx, links in enumerate(search_results):\n", 278 | " query_used = new_search_queries[idx]\n", 279 | " for link in links:\n", 280 | " if link not in unique_links:\n", 281 | " unique_links[link] = query_used\n", 282 | "\n", 283 | " log_messages.append(f\"Aggregated {len(unique_links)} unique links from this iteration.\")\n", 284 | " link_tasks = [\n", 285 | " process_link(session, link, user_query, unique_links[link], log_messages)\n", 286 | " for link in unique_links\n", 287 | " ]\n", 288 | " link_results = await asyncio.gather(*link_tasks)\n", 289 | " for res in link_results:\n", 290 | " if res:\n", 291 | " iteration_contexts.append(res)\n", 292 | "\n", 293 | " if iteration_contexts:\n", 294 | " aggregated_contexts.extend(iteration_contexts)\n", 295 | " log_messages.append(f\"Found {len(iteration_contexts)} useful contexts in this iteration.\")\n", 296 | " else:\n", 297 | " log_messages.append(\"No useful contexts were found in this iteration.\")\n", 298 | "\n", 299 | " new_search_queries = await get_new_search_queries_async(session, user_query, all_search_queries, aggregated_contexts)\n", 300 | " if new_search_queries == \"\":\n", 301 | " log_messages.append(\"LLM indicated that no further research is needed.\")\n", 302 | " break\n", 303 | " elif new_search_queries:\n", 304 | " log_messages.append(f\"LLM provided new search 
queries: {new_search_queries}\")\n", 305 | " all_search_queries.extend(new_search_queries)\n", 306 | " else:\n", 307 | " log_messages.append(\"LLM did not provide any new search queries. Ending the loop.\")\n", 308 | " break\n", 309 | "\n", 310 | " iteration += 1\n", 311 | "\n", 312 | " log_messages.append(\"\\nGenerating final report...\")\n", 313 | " final_report = await generate_final_report_async(session, user_query, aggregated_contexts)\n", 314 | " return final_report, \"\\n\".join(log_messages)\n", 315 | "\n", 316 | "def run_research(user_query, iteration_limit=10):\n", 317 | " return asyncio.run(async_research(user_query, iteration_limit))\n", 318 | "\n", 319 | "# -----------------------------\n", 320 | "# Gradio UI Setup\n", 321 | "# -----------------------------\n", 322 | "\n", 323 | "def gradio_run(user_query, iteration_limit):\n", 324 | " try:\n", 325 | " final_report, logs = run_research(user_query, int(iteration_limit))\n", 326 | " return final_report, logs\n", 327 | " except Exception as e:\n", 328 | " return f\"An error occurred: {e}\", \"\"\n", 329 | "\n", 330 | "iface = gr.Interface(\n", 331 | " fn=gradio_run,\n", 332 | " inputs=[\n", 333 | " gr.Textbox(lines=2, label=\"Research Query/Topic\"),\n", 334 | " gr.Number(value=10, label=\"Max Iterations\")\n", 335 | " ],\n", 336 | " outputs=[\n", 337 | " gr.Textbox(label=\"Final Report\"),\n", 338 | " gr.Textbox(label=\"Intermediate Steps Log\")\n", 339 | " ],\n", 340 | " title=\"Research Assistant\",\n", 341 | " description=\"Enter your query and a maximum iteration count to generate a report. 
The log will show the steps taken.\"\n", 342 | ")\n", 343 | "\n", 344 | "iface.launch()" 345 | ] 346 | } 347 | ] 348 | } -------------------------------------------------------------------------------- /open_deep_researcher.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "nbformat": 4, 3 | "nbformat_minor": 0, 4 | "metadata": { 5 | "colab": { 6 | "provenance": [], 7 | "authorship_tag": "ABX9TyOe5BsaH0aplNCjknkFtnjg", 8 | "include_colab_link": true 9 | }, 10 | "kernelspec": { 11 | "name": "python3", 12 | "display_name": "Python 3" 13 | }, 14 | "language_info": { 15 | "name": "python" 16 | } 17 | }, 18 | "cells": [ 19 | { 20 | "cell_type": "markdown", 21 | "metadata": { 22 | "id": "view-in-github", 23 | "colab_type": "text" 24 | }, 25 | "source": [ 26 | "\"Open" 27 | ] 28 | }, 29 | { 30 | "cell_type": "code", 31 | "source": [ 32 | "!pip install nest_asyncio\n", 33 | "import nest_asyncio\n", 34 | "nest_asyncio.apply()" 35 | ], 36 | "metadata": { 37 | "colab": { 38 | "base_uri": "https://localhost:8080/" 39 | }, 40 | "id": "y7cTpP9rDZW-", 41 | "outputId": "5a443ad2-7a8d-4fef-f315-12108c28f1a2" 42 | }, 43 | "execution_count": null, 44 | "outputs": [ 45 | { 46 | "output_type": "stream", 47 | "name": "stdout", 48 | "text": [ 49 | "Requirement already satisfied: nest_asyncio in /usr/local/lib/python3.11/dist-packages (1.6.0)\n" 50 | ] 51 | } 52 | ] 53 | }, 54 | { 55 | "cell_type": "code", 56 | "execution_count": null, 57 | "metadata": { 58 | "id": "GJTo96a7DGUz" 59 | }, 60 | "outputs": [], 61 | "source": [ 62 | "import asyncio\n", 63 | "import aiohttp\n", 64 | "import json\n", 65 | "\n", 66 | "# =======================\n", 67 | "# Configuration Constants\n", 68 | "# =======================\n", 69 | "OPENROUTER_API_KEY = \"REDACTED\" # Replace with your OpenRouter API key\n", 70 | "SERPAPI_API_KEY = \"REDACTED\" # Replace with your SERPAPI API key\n", 71 | "JINA_API_KEY = \"REDACTED\" # Replace with your JINA API key\n", 72 
| "\n", 73 | "# Endpoints\n", 74 | "OPENROUTER_URL = \"https://openrouter.ai/api/v1/chat/completions\"\n", 75 | "SERPAPI_URL = \"https://serpapi.com/search\"\n", 76 | "JINA_BASE_URL = \"https://r.jina.ai/\"\n", 77 | "\n", 78 | "# Default LLM model (can be changed if desired)\n", 79 | "DEFAULT_MODEL = \"anthropic/claude-3.5-haiku\"\n", 80 | "\n", 81 | "\n", 82 | "# ============================\n", 83 | "# Asynchronous Helper Functions\n", 84 | "# ============================\n", 85 | "\n", 86 | "async def call_openrouter_async(session, messages, model=DEFAULT_MODEL):\n", 87 | " \"\"\"\n", 88 | " Asynchronously call the OpenRouter chat completion API with the provided messages.\n", 89 | " Returns the content of the assistant’s reply.\n", 90 | " \"\"\"\n", 91 | " headers = {\n", 92 | " \"Authorization\": f\"Bearer {OPENROUTER_API_KEY}\",\n", 93 | " \"X-Title\": \"OpenDeepResearcher, by Matt Shumer\",\n", 94 | " \"Content-Type\": \"application/json\"\n", 95 | " }\n", 96 | " payload = {\n", 97 | " \"model\": model,\n", 98 | " \"messages\": messages\n", 99 | " }\n", 100 | " try:\n", 101 | " async with session.post(OPENROUTER_URL, headers=headers, json=payload) as resp:\n", 102 | " if resp.status == 200:\n", 103 | " result = await resp.json()\n", 104 | " try:\n", 105 | " return result['choices'][0]['message']['content']\n", 106 | " except (KeyError, IndexError) as e:\n", 107 | " print(\"Unexpected OpenRouter response structure:\", result)\n", 108 | " return None\n", 109 | " else:\n", 110 | " text = await resp.text()\n", 111 | " print(f\"OpenRouter API error: {resp.status} - {text}\")\n", 112 | " return None\n", 113 | " except Exception as e:\n", 114 | " print(\"Error calling OpenRouter:\", e)\n", 115 | " return None\n", 116 | "\n", 117 | "\n", 118 | "async def generate_search_queries_async(session, user_query):\n", 119 | " \"\"\"\n", 120 | " Ask the LLM to produce up to four precise search queries (in Python list format)\n", 121 | " based on the user’s query.\n", 122 | " 
\"\"\"\n", 123 | " prompt = (\n", 124 | " \"You are an expert research assistant. Given the user's query, generate up to four distinct, \"\n", 125 | " \"precise search queries that would help gather comprehensive information on the topic. \"\n", 126 | " \"Return only a Python list of strings, for example: ['query1', 'query2', 'query3'].\"\n", 127 | " )\n", 128 | " messages = [\n", 129 | " {\"role\": \"system\", \"content\": \"You are a helpful and precise research assistant.\"},\n", 130 | " {\"role\": \"user\", \"content\": f\"User Query: {user_query}\\n\\n{prompt}\"}\n", 131 | " ]\n", 132 | " response = await call_openrouter_async(session, messages)\n", 133 | " if response:\n", 134 | " try:\n", 135 | " # Expect exactly a Python list (e.g., \"['query1', 'query2']\")\n", 136 | " search_queries = eval(response)\n", 137 | " if isinstance(search_queries, list):\n", 138 | " return search_queries\n", 139 | " else:\n", 140 | " print(\"LLM did not return a list. Response:\", response)\n", 141 | " return []\n", 142 | " except Exception as e:\n", 143 | " print(\"Error parsing search queries:\", e, \"\\nResponse:\", response)\n", 144 | " return []\n", 145 | " return []\n", 146 | "\n", 147 | "\n", 148 | "async def perform_search_async(session, query):\n", 149 | " \"\"\"\n", 150 | " Asynchronously perform a Google search using SERPAPI for the given query.\n", 151 | " Returns a list of result URLs.\n", 152 | " \"\"\"\n", 153 | " params = {\n", 154 | " \"q\": query,\n", 155 | " \"api_key\": SERPAPI_API_KEY,\n", 156 | " \"engine\": \"google\"\n", 157 | " }\n", 158 | " try:\n", 159 | " async with session.get(SERPAPI_URL, params=params) as resp:\n", 160 | " if resp.status == 200:\n", 161 | " results = await resp.json()\n", 162 | " if \"organic_results\" in results:\n", 163 | " links = [item.get(\"link\") for item in results[\"organic_results\"] if \"link\" in item]\n", 164 | " return links\n", 165 | " else:\n", 166 | " print(\"No organic results in SERPAPI response.\")\n", 167 | " 
return []\n", 168 | " else:\n", 169 | " text = await resp.text()\n", 170 | " print(f\"SERPAPI error: {resp.status} - {text}\")\n", 171 | " return []\n", 172 | " except Exception as e:\n", 173 | " print(\"Error performing SERPAPI search:\", e)\n", 174 | " return []\n", 175 | "\n", 176 | "\n", 177 | "async def fetch_webpage_text_async(session, url):\n", 178 | " \"\"\"\n", 179 | " Asynchronously retrieve the text content of a webpage using Jina.\n", 180 | " The URL is appended to the Jina endpoint.\n", 181 | " \"\"\"\n", 182 | " full_url = f\"{JINA_BASE_URL}{url}\"\n", 183 | " headers = {\n", 184 | " \"Authorization\": f\"Bearer {JINA_API_KEY}\"\n", 185 | " }\n", 186 | " try:\n", 187 | " async with session.get(full_url, headers=headers) as resp:\n", 188 | " if resp.status == 200:\n", 189 | " return await resp.text()\n", 190 | " else:\n", 191 | " text = await resp.text()\n", 192 | " print(f\"Jina fetch error for {url}: {resp.status} - {text}\")\n", 193 | " return \"\"\n", 194 | " except Exception as e:\n", 195 | " print(\"Error fetching webpage text with Jina:\", e)\n", 196 | " return \"\"\n", 197 | "\n", 198 | "\n", 199 | "async def is_page_useful_async(session, user_query, page_text):\n", 200 | " \"\"\"\n", 201 | " Ask the LLM if the provided webpage content is useful for answering the user's query.\n", 202 | " The LLM must reply with exactly \"Yes\" or \"No\".\n", 203 | " \"\"\"\n", 204 | " prompt = (\n", 205 | " \"You are a critical research evaluator. Given the user's query and the content of a webpage, \"\n", 206 | " \"determine if the webpage contains information relevant and useful for addressing the query. \"\n", 207 | " \"Respond with exactly one word: 'Yes' if the page is useful, or 'No' if it is not. 
Do not include any extra text.\"\n",
208 | "    )\n",
209 | "    messages = [\n",
210 | "        {\"role\": \"system\", \"content\": \"You are a strict and concise evaluator of research relevance.\"},\n",
211 | "        {\"role\": \"user\", \"content\": f\"User Query: {user_query}\\n\\nWebpage Content (first 20000 characters):\\n{page_text[:20000]}\\n\\n{prompt}\"}\n",
212 | "    ]\n",
213 | "    response = await call_openrouter_async(session, messages)\n",
214 | "    if response:\n",
215 | "        answer = response.strip()\n",
216 | "        if answer in [\"Yes\", \"No\"]:\n",
217 | "            return answer\n",
218 | "        else:\n",
219 | "            # Fallback: try to extract Yes/No from the response.\n",
220 | "            if \"Yes\" in answer:\n",
221 | "                return \"Yes\"\n",
222 | "            elif \"No\" in answer:\n",
223 | "                return \"No\"\n",
224 | "    return \"No\"\n",
225 | "\n",
226 | "\n",
227 | "async def extract_relevant_context_async(session, user_query, search_query, page_text):\n",
228 | "    \"\"\"\n",
229 | "    Given the original query, the search query used, and the page content,\n",
230 | "    have the LLM extract all information relevant for answering the query.\n",
231 | "    \"\"\"\n",
232 | "    prompt = (\n",
233 | "        \"You are an expert information extractor. Given the user's query, the search query that led to this page, \"\n",
234 | "        \"and the webpage content, extract all pieces of information that are relevant to answering the user's query. \"\n",
235 | "        \"Return only the relevant context as plain text without commentary.\"\n",
236 | "    )\n",
237 | "    messages = [\n",
238 | "        {\"role\": \"system\", \"content\": \"You are an expert in extracting and summarizing relevant information.\"},\n",
239 | "        {\"role\": \"user\", \"content\": f\"User Query: {user_query}\\nSearch Query: {search_query}\\n\\nWebpage Content (first 20000 characters):\\n{page_text[:20000]}\\n\\n{prompt}\"}\n",
240 | "    ]\n",
241 | "    response = await call_openrouter_async(session, messages)\n",
242 | "    if response:\n",
243 | "        return response.strip()\n",
244 | "    return \"\"\n",
245 | "\n",
246 | "\n",
247 | "async def get_new_search_queries_async(session, user_query, previous_search_queries, all_contexts):\n",
248 | "    \"\"\"\n",
249 | "    Based on the original query, the previously used search queries, and all the extracted contexts,\n",
250 | "    ask the LLM whether additional search queries are needed. If yes, return a Python list of up to four queries;\n",
251 | "    if the LLM thinks research is complete, it should return \"<done>\".\n",
252 | "    \"\"\"\n",
253 | "    context_combined = \"\\n\".join(all_contexts)\n",
254 | "    prompt = (\n",
255 | "        \"You are an analytical research assistant. Based on the original query, the search queries performed so far, \"\n",
256 | "        \"and the extracted contexts from webpages, determine if further research is needed. \"\n",
257 | "        \"If further research is needed, provide up to four new search queries as a Python list (for example, \"\n",
258 | "        \"['new query1', 'new query2']). If you believe no further research is needed, respond with exactly <done>.\"\n",
259 | "        \"\\nOutput only a Python list or the token <done> without any additional text.\"\n",
260 | "    )\n",
261 | "    messages = [\n",
262 | "        {\"role\": \"system\", \"content\": \"You are a systematic research planner.\"},\n",
263 | "        {\"role\": \"user\", \"content\": f\"User Query: {user_query}\\nPrevious Search Queries: {previous_search_queries}\\n\\nExtracted Relevant Contexts:\\n{context_combined}\\n\\n{prompt}\"}\n",
264 | "    ]\n",
265 | "    response = await call_openrouter_async(session, messages)\n",
266 | "    if response:\n",
267 | "        cleaned = response.strip()\n",
268 | "        if cleaned == \"<done>\":\n",
269 | "            return \"<done>\"\n",
270 | "        try:\n",
271 | "            new_queries = eval(cleaned)\n",
272 | "            if isinstance(new_queries, list):\n",
273 | "                return new_queries\n",
274 | "            else:\n",
275 | "                print(\"LLM did not return a list for new search queries. Response:\", response)\n",
276 | "                return []\n",
277 | "        except Exception as e:\n",
278 | "            print(\"Error parsing new search queries:\", e, \"\\nResponse:\", response)\n",
279 | "            return []\n",
280 | "    return []\n",
281 | "\n",
282 | "\n",
283 | "async def generate_final_report_async(session, user_query, all_contexts):\n",
284 | "    \"\"\"\n",
285 | "    Generate the final comprehensive report using all gathered contexts.\n",
286 | "    \"\"\"\n",
287 | "    context_combined = \"\\n\".join(all_contexts)\n",
288 | "    prompt = (\n",
289 | "        \"You are an expert researcher and report writer. Based on the gathered contexts below and the original query, \"\n",
290 | "        \"write a comprehensive, well-structured, and detailed report that addresses the query thoroughly. \"\n",
291 | "        \"Include all relevant insights and conclusions without extraneous commentary.\"\n",
292 | "    )\n",
293 | "    messages = [\n",
294 | "        {\"role\": \"system\", \"content\": \"You are a skilled report writer.\"},\n",
295 | "        {\"role\": \"user\", \"content\": f\"User Query: {user_query}\\n\\nGathered Relevant Contexts:\\n{context_combined}\\n\\n{prompt}\"}\n",
296 | "    ]\n",
297 | "    report = await call_openrouter_async(session, messages)\n",
298 | "    return report\n",
299 | "\n",
300 | "\n",
301 | "async def process_link(session, link, user_query, search_query):\n",
302 | "    \"\"\"\n",
303 | "    Process a single link: fetch its content, judge its usefulness, and if useful, extract the relevant context.\n",
304 | "    \"\"\"\n",
305 | "    print(f\"Fetching content from: {link}\")\n",
306 | "    page_text = await fetch_webpage_text_async(session, link)\n",
307 | "    if not page_text:\n",
308 | "        return None\n",
309 | "    usefulness = await is_page_useful_async(session, user_query, page_text)\n",
310 | "    print(f\"Page usefulness for {link}: {usefulness}\")\n",
311 | "    if usefulness == \"Yes\":\n",
312 | "        context = await extract_relevant_context_async(session, user_query, search_query, page_text)\n",
313 | "        if context:\n",
314 | "            print(f\"Extracted context from {link} (first 200 chars): {context[:200]}\")\n",
315 | "            return context\n",
316 | "    return None\n",
317 | "\n",
318 | "\n",
319 | "# =========================\n",
320 | "# Main Asynchronous Routine\n",
321 | "# =========================\n",
322 | "\n",
323 | "async def async_main():\n",
324 | "    user_query = input(\"Enter your research query/topic: \").strip()\n",
325 | "    iter_limit_input = input(\"Enter maximum number of iterations (default 10): \").strip()\n",
326 | "    iteration_limit = int(iter_limit_input) if iter_limit_input.isdigit() else 10\n",
327 | "\n",
328 | "    aggregated_contexts = []  # All useful contexts from every iteration\n",
329 | "    all_search_queries = []  # Every search query used across iterations\n",
330 | " iteration = 0\n", 331 | "\n", 332 | " async with aiohttp.ClientSession() as session:\n", 333 | " # ----- INITIAL SEARCH QUERIES -----\n", 334 | " new_search_queries = await generate_search_queries_async(session, user_query)\n", 335 | " if not new_search_queries:\n", 336 | " print(\"No search queries were generated by the LLM. Exiting.\")\n", 337 | " return\n", 338 | " all_search_queries.extend(new_search_queries)\n", 339 | "\n", 340 | " # ----- ITERATIVE RESEARCH LOOP -----\n", 341 | " while iteration < iteration_limit:\n", 342 | " print(f\"\\n=== Iteration {iteration + 1} ===\")\n", 343 | " iteration_contexts = []\n", 344 | "\n", 345 | " # For each search query, perform SERPAPI searches concurrently.\n", 346 | " search_tasks = [perform_search_async(session, query) for query in new_search_queries]\n", 347 | " search_results = await asyncio.gather(*search_tasks)\n", 348 | "\n", 349 | " # Aggregate all unique links from all search queries of this iteration.\n", 350 | " # Map each unique link to the search query that produced it.\n", 351 | " unique_links = {}\n", 352 | " for idx, links in enumerate(search_results):\n", 353 | " query = new_search_queries[idx]\n", 354 | " for link in links:\n", 355 | " if link not in unique_links:\n", 356 | " unique_links[link] = query\n", 357 | "\n", 358 | " print(f\"Aggregated {len(unique_links)} unique links from this iteration.\")\n", 359 | "\n", 360 | " # Process each link concurrently: fetch, judge, and extract context.\n", 361 | " link_tasks = [\n", 362 | " process_link(session, link, user_query, unique_links[link])\n", 363 | " for link in unique_links\n", 364 | " ]\n", 365 | " link_results = await asyncio.gather(*link_tasks)\n", 366 | "\n", 367 | " # Collect non-None contexts.\n", 368 | " for res in link_results:\n", 369 | " if res:\n", 370 | " iteration_contexts.append(res)\n", 371 | "\n", 372 | " if iteration_contexts:\n", 373 | " aggregated_contexts.extend(iteration_contexts)\n", 374 | " else:\n", 375 | " print(\"No 
useful contexts were found in this iteration.\")\n", 376 | "\n", 377 | " # ----- ASK THE LLM IF MORE SEARCHES ARE NEEDED -----\n", 378 | " new_search_queries = await get_new_search_queries_async(session, user_query, all_search_queries, aggregated_contexts)\n", 379 | " if new_search_queries == \"\":\n", 380 | " print(\"LLM indicated that no further research is needed.\")\n", 381 | " break\n", 382 | " elif new_search_queries:\n", 383 | " print(\"LLM provided new search queries:\", new_search_queries)\n", 384 | " all_search_queries.extend(new_search_queries)\n", 385 | " else:\n", 386 | " print(\"LLM did not provide any new search queries. Ending the loop.\")\n", 387 | " break\n", 388 | "\n", 389 | " iteration += 1\n", 390 | "\n", 391 | " # ----- FINAL REPORT -----\n", 392 | " print(\"\\nGenerating final report...\")\n", 393 | " final_report = await generate_final_report_async(session, user_query, aggregated_contexts)\n", 394 | " print(\"\\n==== FINAL REPORT ====\\n\")\n", 395 | " print(final_report)\n", 396 | "\n", 397 | "\n", 398 | "def main():\n", 399 | " asyncio.run(async_main())\n", 400 | "\n", 401 | "\n", 402 | "if __name__ == \"__main__\":\n", 403 | " main()\n" 404 | ] 405 | }, 406 | { 407 | "cell_type": "code", 408 | "source": [], 409 | "metadata": { 410 | "id": "46Q5XpapDJZT" 411 | }, 412 | "execution_count": null, 413 | "outputs": [] 414 | } 415 | ] 416 | } --------------------------------------------------------------------------------
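A note on the query-parsing step in the notebook above: it runs `eval` on the LLM's response, which executes arbitrary code if the model ever returns something other than a plain list literal. A safer alternative is the standard library's `ast.literal_eval`, which only evaluates literals. Below is a minimal standalone sketch of that approach; the helper name `parse_query_list` is hypothetical and not part of the notebook:

```python
import ast


def parse_query_list(cleaned: str):
    """Parse text expected to be a Python list of strings.

    Returns the list on success, or [] for anything unparseable or
    of the wrong shape -- mirroring the notebook's fallback behavior.
    """
    try:
        # literal_eval accepts only literals (lists, strings, numbers, ...)
        # and never executes code, unlike eval.
        queries = ast.literal_eval(cleaned)
    except (ValueError, SyntaxError):
        return []
    if isinstance(queries, list) and all(isinstance(q, str) for q in queries):
        return queries
    return []
```

For example, `parse_query_list("['query one', 'query two']")` returns the two queries, while free-form prose, a dict, or an expression like `__import__('os')` all fall through to `[]` instead of being executed.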