├── LICENSE
├── README.md
├── open_deep_researcher_gradio.ipynb
└── open_deep_researcher.ipynb


/LICENSE:
--------------------------------------------------------------------------------
 1 | MIT License
 2 | 
 3 | Copyright (c) 2025 mshumer
 4 | 
 5 | Permission is hereby granted, free of charge, to any person obtaining a copy
 6 | of this software and associated documentation files (the "Software"), to deal
 7 | in the Software without restriction, including without limitation the rights
 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
 9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 | 
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 | 
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 | 


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
 1 | # OpenDeepResearcher
 2 | 
 3 | This notebook implements an **AI researcher** that continuously searches for information based on a user query until the system is confident that it has gathered all the necessary details. It makes use of several services to do so:
 4 | 
 5 | - **SERPAPI**: To perform Google searches.
 6 | - **Jina**: To fetch and extract webpage content.
 7 | - **OpenRouter** (default model: `anthropic/claude-3.5-haiku`): To interact with a LLM for generating search queries, evaluating page relevance, and extracting context.
 8 | 
 9 | ## Features
10 | 
11 | - **Iterative Research Loop:** The system refines its search queries iteratively until no further queries are required.
12 | - **Asynchronous Processing:** Searches, webpage fetching, evaluation, and context extraction are performed concurrently to improve speed.
13 | - **Duplicate Filtering:** Aggregates and deduplicates links within each round, ensuring that the same link isn’t processed twice.
14 | - **LLM-Powered Decision Making:** Uses the LLM to generate new search queries, decide on page usefulness, extract relevant context, and produce a final comprehensive report.
15 | - **Gradio Interface:** Use the `open-deep-researcher - gradio` notebook if you want to use this in a functional UI
16 | 
17 | ## Requirements
18 | 
19 | - API access and keys for:
20 |   - **OpenRouter API**
21 |   - **SERPAPI API**
22 |   - **Jina API**
23 | 
24 | ## Setup
25 | 
26 | 1. **Clone or Open the Notebook:**
27 |    - Download the notebook file or open it directly in [Google Colab](https://colab.research.google.com/github/mshumer/OpenDeepResearcher/blob/main/open_deep_researcher.ipynb).
28 | 
29 | 2. **Install `nest_asyncio`:**
30 | 
31 |    Run the first cell to set up `nest_asyncio`.
32 | 
33 | 3. **Configure API Keys:**
34 |    - Replace the placeholder values in the notebook for `OPENROUTER_API_KEY`, `SERPAPI_API_KEY`, and `JINA_API_KEY` with your actual API keys.
35 | 
36 | ## Usage
37 | 
38 | 1. **Run the Notebook Cells:**
39 |    Execute all cells in order. The notebook will prompt you for:
40 |    - A research query/topic.
41 |    - An optional maximum number of iterations (default is 10).
42 | 
43 | 2. **Follow the Research Process:**
44 |    - **Initial Query & Search Generation:** The notebook uses the LLM to generate initial search queries.
45 |    - **Asynchronous Searches & Extraction:** It performs SERPAPI searches for all queries concurrently, aggregates unique links, and processes each link in parallel to determine page usefulness and extract relevant context.
46 |    - **Iterative Refinement:** After each round, the aggregated context is analyzed by the LLM to determine if further search queries are needed.
47 |    - **Final Report:** Once the LLM indicates that no further research is needed (or the iteration limit is reached), a final report is generated based on all gathered context.
48 | 
49 | 3. **View the Final Report:**
50 |    The final comprehensive report will be printed in the output.
51 | 
52 | ## How It Works
53 | 
54 | 1. **Input & Query Generation:**  
55 |    The user enters a research topic, and the LLM generates up to four distinct search queries.
56 | 
57 | 2. **Concurrent Search & Processing:**  
58 |    - **SERPAPI:** Each search query is sent to SERPAPI concurrently.
59 |    - **Deduplication:** All retrieved links are aggregated and deduplicated within the current iteration.
60 |    - **Jina & LLM:** Each unique link is processed concurrently to fetch webpage content via Jina, evaluate its usefulness with the LLM, and extract relevant information if the page is deemed useful.
61 | 
62 | 3. **Iterative Refinement:**  
63 |    The system passes the aggregated context to the LLM to determine if further search queries are needed. New queries are generated if required; otherwise, the loop terminates.
64 | 
65 | 4. **Final Report Generation:**  
66 |    All gathered context is compiled and sent to the LLM to produce a final, comprehensive report addressing the original query.
67 | 
68 | ## Troubleshooting
69 | 
70 | - **RuntimeError with asyncio:**  
71 |   If you encounter an error like:
72 |   ```
73 |   RuntimeError: asyncio.run() cannot be called from a running event loop
74 |   ```
75 |   Ensure you have applied `nest_asyncio` as shown in the setup section.
76 | 
77 | - **API Issues:**  
78 |   Verify that your API keys are correct and that you are not exceeding any rate limits.
79 | 
80 | ---
81 | 
82 | Follow me on [X](https://x.com/mattshumer_) for updates on this and other AI things I'm working on.
83 | 
84 | Head to [ShumerPrompt](https://shumerprompt.com), my "Github for Prompts"!
85 | 
86 | OpenDeepResearcher is released under the MIT License. See the LICENSE file for more details.
87 | 


--------------------------------------------------------------------------------
/open_deep_researcher_gradio.ipynb:
--------------------------------------------------------------------------------
  1 | {
  2 |   "nbformat": 4,
  3 |   "nbformat_minor": 0,
  4 |   "metadata": {
  5 |     "colab": {
  6 |       "provenance": [],
  7 |       "authorship_tag": "ABX9TyM6XDBP8oqAaLL0GMT0mBj+",
  8 |       "include_colab_link": true
  9 |     },
 10 |     "kernelspec": {
 11 |       "name": "python3",
 12 |       "display_name": "Python 3"
 13 |     },
 14 |     "language_info": {
 15 |       "name": "python"
 16 |     }
 17 |   },
 18 |   "cells": [
 19 |     {
 20 |       "cell_type": "markdown",
 21 |       "metadata": {
 22 |         "id": "view-in-github",
 23 |         "colab_type": "text"
 24 |       },
 25 |       "source": [
 26 |         "<a href=\"https://colab.research.google.com/github/mshumer/OpenDeepResearcher/blob/main/open_deep_researcher_gradio.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
 27 |       ]
 28 |     },
 29 |     {
 30 |       "cell_type": "code",
 31 |       "execution_count": null,
 32 |       "metadata": {
 33 |         "id": "bSpd75_8O8rr"
 34 |       },
 35 |       "outputs": [],
 36 |       "source": [
 37 |         "!pip install nest_asyncio gradio aiohttp\n",
 38 |         "\n",
 39 |         "import nest_asyncio\n",
 40 |         "nest_asyncio.apply()\n",
 41 |         "\n",
 42 |         "import asyncio\n",
 43 |         "import aiohttp\n",
 44 |         "import gradio as gr\n",
 45 |         "import json\n",
 46 |         "\n",
 47 |         "# ---------------------------\n",
 48 |         "# Configuration Constants\n",
 49 |         "# ---------------------------\n",
 50 |         "OPENROUTER_API_KEY = \"REDACTED\" # Replace with your OpenRouter API key\n",
 51 |         "SERPAPI_API_KEY = \"REDACTED\" # Replace with your SERPAPI API key\n",
 52 |         "JINA_API_KEY = \"REDACTED\" # Replace with your Jina API key\n",
 53 |         "\n",
 54 |         "OPENROUTER_URL = \"https://openrouter.ai/api/v1/chat/completions\"\n",
 55 |         "SERPAPI_URL = \"https://serpapi.com/search\"\n",
 56 |         "JINA_BASE_URL = \"https://r.jina.ai/\"\n",
 57 |         "\n",
 58 |         "DEFAULT_MODEL = \"anthropic/claude-3.5-haiku\"\n",
 59 |         "\n",
 60 |         "# -------------------------------\n",
 61 |         "# Asynchronous Helper Functions\n",
 62 |         "# -------------------------------\n",
 63 |         "\n",
 64 |         "async def call_openrouter_async(session, messages, model=DEFAULT_MODEL):\n",
 65 |         "    headers = {\n",
 66 |         "        \"Authorization\": f\"Bearer {OPENROUTER_API_KEY}\",\n",
 67 |         "        \"X-Title\": \"OpenDeepResearcher, by Matt Shumer\",\n",
 68 |         "        \"Content-Type\": \"application/json\"\n",
 69 |         "    }\n",
 70 |         "    payload = {\n",
 71 |         "        \"model\": model,\n",
 72 |         "        \"messages\": messages\n",
 73 |         "    }\n",
 74 |         "    try:\n",
 75 |         "        async with session.post(OPENROUTER_URL, headers=headers, json=payload) as resp:\n",
 76 |         "            if resp.status == 200:\n",
 77 |         "                result = await resp.json()\n",
 78 |         "                try:\n",
 79 |         "                    return result['choices'][0]['message']['content']\n",
 80 |         "                except (KeyError, IndexError):\n",
 81 |         "                    print(\"Unexpected OpenRouter response structure:\", result)\n",
 82 |         "                    return None\n",
 83 |         "            else:\n",
 84 |         "                text = await resp.text()\n",
 85 |         "                print(f\"OpenRouter API error: {resp.status} - {text}\")\n",
 86 |         "                return None\n",
 87 |         "    except Exception as e:\n",
 88 |         "        print(\"Error calling OpenRouter:\", e)\n",
 89 |         "        return None\n",
 90 |         "\n",
 91 |         "async def generate_search_queries_async(session, user_query):\n",
 92 |         "    prompt = (\n",
 93 |         "        \"You are an expert research assistant. Given the user's query, generate up to four distinct, \"\n",
 94 |         "        \"precise search queries that would help gather complete information on the topic. \"\n",
 95 |         "        \"Return only a Python list of strings, for example: ['query1', 'query2', 'query3'].\"\n",
 96 |         "    )\n",
 97 |         "    messages = [\n",
 98 |         "        {\"role\": \"system\", \"content\": \"You are a helpful and precise research assistant.\"},\n",
 99 |         "        {\"role\": \"user\", \"content\": f\"User Query: {user_query}\\n\\n{prompt}\"}\n",
100 |         "    ]\n",
101 |         "    response = await call_openrouter_async(session, messages)\n",
102 |         "    if response:\n",
103 |         "        try:\n",
104 |         "            search_queries = eval(response)\n",
105 |         "            if isinstance(search_queries, list):\n",
106 |         "                return search_queries\n",
107 |         "            else:\n",
108 |         "                print(\"LLM did not return a list. Response:\", response)\n",
109 |         "                return []\n",
110 |         "        except Exception as e:\n",
111 |         "            print(\"Error parsing search queries:\", e, \"\\nResponse:\", response)\n",
112 |         "            return []\n",
113 |         "    return []\n",
114 |         "\n",
115 |         "async def perform_search_async(session, query):\n",
116 |         "    params = {\n",
117 |         "        \"q\": query,\n",
118 |         "        \"api_key\": SERPAPI_API_KEY,\n",
119 |         "        \"engine\": \"google\"\n",
120 |         "    }\n",
121 |         "    try:\n",
122 |         "        async with session.get(SERPAPI_URL, params=params) as resp:\n",
123 |         "            if resp.status == 200:\n",
124 |         "                results = await resp.json()\n",
125 |         "                if \"organic_results\" in results:\n",
126 |         "                    links = [item.get(\"link\") for item in results[\"organic_results\"] if \"link\" in item]\n",
127 |         "                    return links\n",
128 |         "                else:\n",
129 |         "                    print(\"No organic results in SERPAPI response.\")\n",
130 |         "                    return []\n",
131 |         "            else:\n",
132 |         "                text = await resp.text()\n",
133 |         "                print(f\"SERPAPI error: {resp.status} - {text}\")\n",
134 |         "                return []\n",
135 |         "    except Exception as e:\n",
136 |         "        print(\"Error performing SERPAPI search:\", e)\n",
137 |         "        return []\n",
138 |         "\n",
139 |         "async def fetch_webpage_text_async(session, url):\n",
140 |         "    full_url = f\"{JINA_BASE_URL}{url}\"\n",
141 |         "    headers = {\n",
142 |         "        \"Authorization\": f\"Bearer {JINA_API_KEY}\"\n",
143 |         "    }\n",
144 |         "    try:\n",
145 |         "        async with session.get(full_url, headers=headers) as resp:\n",
146 |         "            if resp.status == 200:\n",
147 |         "                return await resp.text()\n",
148 |         "            else:\n",
149 |         "                text = await resp.text()\n",
150 |         "                print(f\"Jina fetch error for {url}: {resp.status} - {text}\")\n",
151 |         "                return \"\"\n",
152 |         "    except Exception as e:\n",
153 |         "        print(\"Error fetching webpage text with Jina:\", e)\n",
154 |         "        return \"\"\n",
155 |         "\n",
156 |         "async def is_page_useful_async(session, user_query, page_text):\n",
157 |         "    prompt = (\n",
158 |         "        \"You are a critical research evaluator. Given the user's query and the content of a webpage, \"\n",
159 |         "        \"determine if the webpage contains information that is useful for addressing the query. \"\n",
160 |         "        \"Respond with exactly one word: 'Yes' if the page is useful, or 'No' if it is not. Do not include any extra text.\"\n",
161 |         "    )\n",
162 |         "    messages = [\n",
163 |         "        {\"role\": \"system\", \"content\": \"You are a strict and concise evaluator of research relevance.\"},\n",
164 |         "        {\"role\": \"user\", \"content\": f\"User Query: {user_query}\\n\\nWebpage Content (first 20000 characters):\\n{page_text[:20000]}\\n\\n{prompt}\"}\n",
165 |         "    ]\n",
166 |         "    response = await call_openrouter_async(session, messages)\n",
167 |         "    if response:\n",
168 |         "        answer = response.strip()\n",
169 |         "        if answer in [\"Yes\", \"No\"]:\n",
170 |         "            return answer\n",
171 |         "        else:\n",
172 |         "            if \"Yes\" in answer:\n",
173 |         "                return \"Yes\"\n",
174 |         "            elif \"No\" in answer:\n",
175 |         "                return \"No\"\n",
176 |         "    return \"No\"\n",
177 |         "\n",
178 |         "async def extract_relevant_context_async(session, user_query, search_query, page_text):\n",
179 |         "    prompt = (\n",
180 |         "        \"You are an expert information extractor. Given the user's query, the search query that led to this page, \"\n",
181 |         "        \"and the webpage content, extract all pieces of information that are useful for answering the user's query. \"\n",
182 |         "        \"Return only the relevant context as plain text without extra commentary.\"\n",
183 |         "    )\n",
184 |         "    messages = [\n",
185 |         "        {\"role\": \"system\", \"content\": \"You are an expert in extracting and summarizing relevant information.\"},\n",
186 |         "        {\"role\": \"user\", \"content\": f\"User Query: {user_query}\\nSearch Query: {search_query}\\n\\nWebpage Content (first 20000 characters):\\n{page_text[:20000]}\\n\\n{prompt}\"}\n",
187 |         "    ]\n",
188 |         "    response = await call_openrouter_async(session, messages)\n",
189 |         "    if response:\n",
190 |         "        return response.strip()\n",
191 |         "    return \"\"\n",
192 |         "\n",
193 |         "async def get_new_search_queries_async(session, user_query, previous_search_queries, all_contexts):\n",
194 |         "    context_combined = \"\\n\".join(all_contexts)\n",
195 |         "    prompt = (\n",
196 |         "        \"You are an analytical research assistant. Based on the original query, the search queries performed so far, \"\n",
197 |         "        \"and the extracted contexts from webpages, decide if further research is needed. \"\n",
198 |         "        \"If further research is needed, provide up to four new search queries as a Python list (for example, \"\n",
199 |         "        \"['new query1', 'new query2']). If you believe no further research is needed, respond with exactly <done>.\"\n",
200 |         "        \"\\nOutput only a Python list or the token <done> without any extra text.\"\n",
201 |         "    )\n",
202 |         "    messages = [\n",
203 |         "        {\"role\": \"system\", \"content\": \"You are a systematic research planner.\"},\n",
204 |         "        {\"role\": \"user\", \"content\": f\"User Query: {user_query}\\nPrevious Search Queries: {previous_search_queries}\\n\\nExtracted Relevant Contexts:\\n{context_combined}\\n\\n{prompt}\"}\n",
205 |         "    ]\n",
206 |         "    response = await call_openrouter_async(session, messages)\n",
207 |         "    if response:\n",
208 |         "        cleaned = response.strip()\n",
209 |         "        if cleaned == \"<done>\":\n",
210 |         "            return \"<done>\"\n",
211 |         "        try:\n",
212 |         "            new_queries = eval(cleaned)\n",
213 |         "            if isinstance(new_queries, list):\n",
214 |         "                return new_queries\n",
215 |         "            else:\n",
216 |         "                print(\"LLM did not return a list for new search queries. Response:\", response)\n",
217 |         "                return []\n",
218 |         "        except Exception as e:\n",
219 |         "            print(\"Error parsing new search queries:\", e, \"\\nResponse:\", response)\n",
220 |         "            return []\n",
221 |         "    return []\n",
222 |         "\n",
223 |         "async def generate_final_report_async(session, user_query, all_contexts):\n",
224 |         "    context_combined = \"\\n\".join(all_contexts)\n",
225 |         "    prompt = (\n",
226 |         "        \"You are an expert researcher and report writer. Based on the gathered contexts below and the original query, \"\n",
227 |         "        \"write a complete, well-structured, and detailed report that addresses the query thoroughly. \"\n",
228 |         "        \"Include all useful insights and conclusions without extra commentary.\"\n",
229 |         "    )\n",
230 |         "    messages = [\n",
231 |         "        {\"role\": \"system\", \"content\": \"You are a skilled report writer.\"},\n",
232 |         "        {\"role\": \"user\", \"content\": f\"User Query: {user_query}\\n\\nGathered Relevant Contexts:\\n{context_combined}\\n\\n{prompt}\"}\n",
233 |         "    ]\n",
234 |         "    report = await call_openrouter_async(session, messages)\n",
235 |         "    return report\n",
236 |         "\n",
237 |         "async def process_link(session, link, user_query, search_query, log):\n",
238 |         "    log.append(f\"Fetching content from: {link}\")\n",
239 |         "    page_text = await fetch_webpage_text_async(session, link)\n",
240 |         "    if not page_text:\n",
241 |         "        log.append(f\"Failed to fetch content from: {link}\")\n",
242 |         "        return None\n",
243 |         "    usefulness = await is_page_useful_async(session, user_query, page_text)\n",
244 |         "    log.append(f\"Page usefulness for {link}: {usefulness}\")\n",
245 |         "    if usefulness == \"Yes\":\n",
246 |         "        context = await extract_relevant_context_async(session, user_query, search_query, page_text)\n",
247 |         "        if context:\n",
248 |         "            log.append(f\"Extracted context from {link} (first 200 chars): {context[:200]}\")\n",
249 |         "            return context\n",
250 |         "    return None\n",
251 |         "\n",
252 |         "# -----------------------------\n",
253 |         "# Main Asynchronous Routine\n",
254 |         "# -----------------------------\n",
255 |         "\n",
256 |         "async def async_research(user_query, iteration_limit):\n",
257 |         "    aggregated_contexts = []\n",
258 |         "    all_search_queries = []\n",
259 |         "    log_messages = []  # List to store intermediate steps\n",
260 |         "    iteration = 0\n",
261 |         "\n",
262 |         "    async with aiohttp.ClientSession() as session:\n",
263 |         "        log_messages.append(\"Generating initial search queries...\")\n",
264 |         "        new_search_queries = await generate_search_queries_async(session, user_query)\n",
265 |         "        if not new_search_queries:\n",
266 |         "            log_messages.append(\"No search queries were generated by the LLM. Exiting.\")\n",
267 |         "            return \"No search queries were generated by the LLM. Exiting.\", \"\\n\".join(log_messages)\n",
268 |         "        all_search_queries.extend(new_search_queries)\n",
269 |         "        log_messages.append(f\"Initial search queries: {new_search_queries}\")\n",
270 |         "\n",
271 |         "        while iteration < iteration_limit:\n",
272 |         "            log_messages.append(f\"\\n=== Iteration {iteration + 1} ===\")\n",
273 |         "            iteration_contexts = []\n",
274 |         "            search_tasks = [perform_search_async(session, query) for query in new_search_queries]\n",
275 |         "            search_results = await asyncio.gather(*search_tasks)\n",
276 |         "            unique_links = {}\n",
277 |         "            for idx, links in enumerate(search_results):\n",
278 |         "                query_used = new_search_queries[idx]\n",
279 |         "                for link in links:\n",
280 |         "                    if link not in unique_links:\n",
281 |         "                        unique_links[link] = query_used\n",
282 |         "\n",
283 |         "            log_messages.append(f\"Aggregated {len(unique_links)} unique links from this iteration.\")\n",
284 |         "            link_tasks = [\n",
285 |         "                process_link(session, link, user_query, unique_links[link], log_messages)\n",
286 |         "                for link in unique_links\n",
287 |         "            ]\n",
288 |         "            link_results = await asyncio.gather(*link_tasks)\n",
289 |         "            for res in link_results:\n",
290 |         "                if res:\n",
291 |         "                    iteration_contexts.append(res)\n",
292 |         "\n",
293 |         "            if iteration_contexts:\n",
294 |         "                aggregated_contexts.extend(iteration_contexts)\n",
295 |         "                log_messages.append(f\"Found {len(iteration_contexts)} useful contexts in this iteration.\")\n",
296 |         "            else:\n",
297 |         "                log_messages.append(\"No useful contexts were found in this iteration.\")\n",
298 |         "\n",
299 |         "            new_search_queries = await get_new_search_queries_async(session, user_query, all_search_queries, aggregated_contexts)\n",
300 |         "            if new_search_queries == \"<done>\":\n",
301 |         "                log_messages.append(\"LLM indicated that no further research is needed.\")\n",
302 |         "                break\n",
303 |         "            elif new_search_queries:\n",
304 |         "                log_messages.append(f\"LLM provided new search queries: {new_search_queries}\")\n",
305 |         "                all_search_queries.extend(new_search_queries)\n",
306 |         "            else:\n",
307 |         "                log_messages.append(\"LLM did not provide any new search queries. Ending the loop.\")\n",
308 |         "                break\n",
309 |         "\n",
310 |         "            iteration += 1\n",
311 |         "\n",
312 |         "        log_messages.append(\"\\nGenerating final report...\")\n",
313 |         "        final_report = await generate_final_report_async(session, user_query, aggregated_contexts)\n",
314 |         "        return final_report, \"\\n\".join(log_messages)\n",
315 |         "\n",
316 |         "def run_research(user_query, iteration_limit=10):\n",
317 |         "    return asyncio.run(async_research(user_query, iteration_limit))\n",
318 |         "\n",
319 |         "# -----------------------------\n",
320 |         "# Gradio UI Setup\n",
321 |         "# -----------------------------\n",
322 |         "\n",
323 |         "def gradio_run(user_query, iteration_limit):\n",
324 |         "    try:\n",
325 |         "        final_report, logs = run_research(user_query, int(iteration_limit))\n",
326 |         "        return final_report, logs\n",
327 |         "    except Exception as e:\n",
328 |         "        return f\"An error occurred: {e}\", \"\"\n",
329 |         "\n",
330 |         "iface = gr.Interface(\n",
331 |         "    fn=gradio_run,\n",
332 |         "    inputs=[\n",
333 |         "        gr.Textbox(lines=2, label=\"Research Query/Topic\"),\n",
334 |         "        gr.Number(value=10, label=\"Max Iterations\")\n",
335 |         "    ],\n",
336 |         "    outputs=[\n",
337 |         "        gr.Textbox(label=\"Final Report\"),\n",
338 |         "        gr.Textbox(label=\"Intermediate Steps Log\")\n",
339 |         "    ],\n",
340 |         "    title=\"Research Assistant\",\n",
341 |         "    description=\"Enter your query and a maximum iteration count to generate a report. The log will show the steps taken.\"\n",
342 |         ")\n",
343 |         "\n",
344 |         "iface.launch()"
345 |       ]
346 |     }
347 |   ]
348 | }


--------------------------------------------------------------------------------
/open_deep_researcher.ipynb:
--------------------------------------------------------------------------------
  1 | {
  2 |   "nbformat": 4,
  3 |   "nbformat_minor": 0,
  4 |   "metadata": {
  5 |     "colab": {
  6 |       "provenance": [],
  7 |       "authorship_tag": "ABX9TyOe5BsaH0aplNCjknkFtnjg",
  8 |       "include_colab_link": true
  9 |     },
 10 |     "kernelspec": {
 11 |       "name": "python3",
 12 |       "display_name": "Python 3"
 13 |     },
 14 |     "language_info": {
 15 |       "name": "python"
 16 |     }
 17 |   },
 18 |   "cells": [
 19 |     {
 20 |       "cell_type": "markdown",
 21 |       "metadata": {
 22 |         "id": "view-in-github",
 23 |         "colab_type": "text"
 24 |       },
 25 |       "source": [
 26 |         "<a href=\"https://colab.research.google.com/github/mshumer/OpenDeepResearcher/blob/main/open_deep_researcher.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
 27 |       ]
 28 |     },
 29 |     {
 30 |       "cell_type": "code",
 31 |       "source": [
 32 |         "!pip install nest_asyncio\n",
 33 |         "import nest_asyncio\n",
 34 |         "nest_asyncio.apply()"
 35 |       ],
 36 |       "metadata": {
 37 |         "colab": {
 38 |           "base_uri": "https://localhost:8080/"
 39 |         },
 40 |         "id": "y7cTpP9rDZW-",
 41 |         "outputId": "5a443ad2-7a8d-4fef-f315-12108c28f1a2"
 42 |       },
 43 |       "execution_count": null,
 44 |       "outputs": [
 45 |         {
 46 |           "output_type": "stream",
 47 |           "name": "stdout",
 48 |           "text": [
 49 |             "Requirement already satisfied: nest_asyncio in /usr/local/lib/python3.11/dist-packages (1.6.0)\n"
 50 |           ]
 51 |         }
 52 |       ]
 53 |     },
 54 |     {
 55 |       "cell_type": "code",
 56 |       "execution_count": null,
 57 |       "metadata": {
 58 |         "id": "GJTo96a7DGUz"
 59 |       },
 60 |       "outputs": [],
 61 |       "source": [
 62 |         "import asyncio\n",
 63 |         "import aiohttp\n",
 64 |         "import json\n",
 65 |         "\n",
 66 |         "# =======================\n",
 67 |         "# Configuration Constants\n",
 68 |         "# =======================\n",
 69 |         "OPENROUTER_API_KEY = \"REDACTED\" # Replace with your OpenRouter API key\n",
 70 |         "SERPAPI_API_KEY = \"REDACTED\" # Replace with your SERPAPI API key\n",
 71 |         "JINA_API_KEY = \"REDACTED\" # Replace with your JINA API key\n",
 72 |         "\n",
 73 |         "# Endpoints\n",
 74 |         "OPENROUTER_URL = \"https://openrouter.ai/api/v1/chat/completions\"\n",
 75 |         "SERPAPI_URL = \"https://serpapi.com/search\"\n",
 76 |         "JINA_BASE_URL = \"https://r.jina.ai/\"\n",
 77 |         "\n",
 78 |         "# Default LLM model (can be changed if desired)\n",
 79 |         "DEFAULT_MODEL = \"anthropic/claude-3.5-haiku\"\n",
 80 |         "\n",
 81 |         "\n",
 82 |         "# ============================\n",
 83 |         "# Asynchronous Helper Functions\n",
 84 |         "# ============================\n",
 85 |         "\n",
 86 |         "async def call_openrouter_async(session, messages, model=DEFAULT_MODEL):\n",
 87 |         "    \"\"\"\n",
 88 |         "    Asynchronously call the OpenRouter chat completion API with the provided messages.\n",
 89 |         "    Returns the content of the assistant’s reply.\n",
 90 |         "    \"\"\"\n",
 91 |         "    headers = {\n",
 92 |         "        \"Authorization\": f\"Bearer {OPENROUTER_API_KEY}\",\n",
 93 |         "        \"X-Title\": \"OpenDeepResearcher, by Matt Shumer\",\n",
 94 |         "        \"Content-Type\": \"application/json\"\n",
 95 |         "    }\n",
 96 |         "    payload = {\n",
 97 |         "        \"model\": model,\n",
 98 |         "        \"messages\": messages\n",
 99 |         "    }\n",
100 |         "    try:\n",
101 |         "        async with session.post(OPENROUTER_URL, headers=headers, json=payload) as resp:\n",
102 |         "            if resp.status == 200:\n",
103 |         "                result = await resp.json()\n",
104 |         "                try:\n",
105 |         "                    return result['choices'][0]['message']['content']\n",
106 |         "                except (KeyError, IndexError) as e:\n",
107 |         "                    print(\"Unexpected OpenRouter response structure:\", result)\n",
108 |         "                    return None\n",
109 |         "            else:\n",
110 |         "                text = await resp.text()\n",
111 |         "                print(f\"OpenRouter API error: {resp.status} - {text}\")\n",
112 |         "                return None\n",
113 |         "    except Exception as e:\n",
114 |         "        print(\"Error calling OpenRouter:\", e)\n",
115 |         "        return None\n",
116 |         "\n",
117 |         "\n",
118 |         "async def generate_search_queries_async(session, user_query):\n",
119 |         "    \"\"\"\n",
120 |         "    Ask the LLM to produce up to four precise search queries (in Python list format)\n",
121 |         "    based on the user’s query.\n",
122 |         "    \"\"\"\n",
123 |         "    prompt = (\n",
124 |         "        \"You are an expert research assistant. Given the user's query, generate up to four distinct, \"\n",
125 |         "        \"precise search queries that would help gather comprehensive information on the topic. \"\n",
126 |         "        \"Return only a Python list of strings, for example: ['query1', 'query2', 'query3'].\"\n",
127 |         "    )\n",
128 |         "    messages = [\n",
129 |         "        {\"role\": \"system\", \"content\": \"You are a helpful and precise research assistant.\"},\n",
130 |         "        {\"role\": \"user\", \"content\": f\"User Query: {user_query}\\n\\n{prompt}\"}\n",
131 |         "    ]\n",
132 |         "    response = await call_openrouter_async(session, messages)\n",
133 |         "    if response:\n",
134 |         "        try:\n",
135 |         "            # Expect exactly a Python list (e.g., \"['query1', 'query2']\")\n",
136 |         "            search_queries = eval(response)\n",
137 |         "            if isinstance(search_queries, list):\n",
138 |         "                return search_queries\n",
139 |         "            else:\n",
140 |         "                print(\"LLM did not return a list. Response:\", response)\n",
141 |         "                return []\n",
142 |         "        except Exception as e:\n",
143 |         "            print(\"Error parsing search queries:\", e, \"\\nResponse:\", response)\n",
144 |         "            return []\n",
145 |         "    return []\n",
146 |         "\n",
147 |         "\n",
148 |         "async def perform_search_async(session, query):\n",
149 |         "    \"\"\"\n",
150 |         "    Asynchronously perform a Google search using SERPAPI for the given query.\n",
151 |         "    Returns a list of result URLs.\n",
152 |         "    \"\"\"\n",
153 |         "    params = {\n",
154 |         "        \"q\": query,\n",
155 |         "        \"api_key\": SERPAPI_API_KEY,\n",
156 |         "        \"engine\": \"google\"\n",
157 |         "    }\n",
158 |         "    try:\n",
159 |         "        async with session.get(SERPAPI_URL, params=params) as resp:\n",
160 |         "            if resp.status == 200:\n",
161 |         "                results = await resp.json()\n",
162 |         "                if \"organic_results\" in results:\n",
163 |         "                    links = [item.get(\"link\") for item in results[\"organic_results\"] if \"link\" in item]\n",
164 |         "                    return links\n",
165 |         "                else:\n",
166 |         "                    print(\"No organic results in SERPAPI response.\")\n",
167 |         "                    return []\n",
168 |         "            else:\n",
169 |         "                text = await resp.text()\n",
170 |         "                print(f\"SERPAPI error: {resp.status} - {text}\")\n",
171 |         "                return []\n",
172 |         "    except Exception as e:\n",
173 |         "        print(\"Error performing SERPAPI search:\", e)\n",
174 |         "        return []\n",
175 |         "\n",
176 |         "\n",
177 |         "async def fetch_webpage_text_async(session, url):\n",
178 |         "    \"\"\"\n",
179 |         "    Asynchronously retrieve the text content of a webpage using Jina.\n",
180 |         "    The URL is appended to the Jina endpoint.\n",
181 |         "    \"\"\"\n",
182 |         "    full_url = f\"{JINA_BASE_URL}{url}\"\n",
183 |         "    headers = {\n",
184 |         "        \"Authorization\": f\"Bearer {JINA_API_KEY}\"\n",
185 |         "    }\n",
186 |         "    try:\n",
187 |         "        async with session.get(full_url, headers=headers) as resp:\n",
188 |         "            if resp.status == 200:\n",
189 |         "                return await resp.text()\n",
190 |         "            else:\n",
191 |         "                text = await resp.text()\n",
192 |         "                print(f\"Jina fetch error for {url}: {resp.status} - {text}\")\n",
193 |         "                return \"\"\n",
194 |         "    except Exception as e:\n",
195 |         "        print(\"Error fetching webpage text with Jina:\", e)\n",
196 |         "        return \"\"\n",
197 |         "\n",
198 |         "\n",
199 |         "async def is_page_useful_async(session, user_query, page_text):\n",
200 |         "    \"\"\"\n",
201 |         "    Ask the LLM if the provided webpage content is useful for answering the user's query.\n",
202 |         "    The LLM must reply with exactly \"Yes\" or \"No\".\n",
203 |         "    \"\"\"\n",
204 |         "    prompt = (\n",
205 |         "        \"You are a critical research evaluator. Given the user's query and the content of a webpage, \"\n",
206 |         "        \"determine if the webpage contains information relevant and useful for addressing the query. \"\n",
207 |         "        \"Respond with exactly one word: 'Yes' if the page is useful, or 'No' if it is not. Do not include any extra text.\"\n",
208 |         "    )\n",
209 |         "    messages = [\n",
210 |         "        {\"role\": \"system\", \"content\": \"You are a strict and concise evaluator of research relevance.\"},\n",
211 |         "        {\"role\": \"user\", \"content\": f\"User Query: {user_query}\\n\\nWebpage Content (first 20000 characters):\\n{page_text[:20000]}\\n\\n{prompt}\"}\n",
212 |         "    ]\n",
213 |         "    response = await call_openrouter_async(session, messages)\n",
214 |         "    if response:\n",
215 |         "        answer = response.strip()\n",
216 |         "        if answer in [\"Yes\", \"No\"]:\n",
217 |         "            return answer\n",
218 |         "        else:\n",
219 |         "            # Fallback: try to extract Yes/No from the response.\n",
220 |         "            if \"Yes\" in answer:\n",
221 |         "                return \"Yes\"\n",
222 |         "            elif \"No\" in answer:\n",
223 |         "                return \"No\"\n",
224 |         "    return \"No\"\n",
225 |         "\n",
226 |         "\n",
227 |         "async def extract_relevant_context_async(session, user_query, search_query, page_text):\n",
228 |         "    \"\"\"\n",
229 |         "    Given the original query, the search query used, and the page content,\n",
230 |         "    have the LLM extract all information relevant for answering the query.\n",
231 |         "    \"\"\"\n",
232 |         "    prompt = (\n",
233 |         "        \"You are an expert information extractor. Given the user's query, the search query that led to this page, \"\n",
234 |         "        \"and the webpage content, extract all pieces of information that are relevant to answering the user's query. \"\n",
235 |         "        \"Return only the relevant context as plain text without commentary.\"\n",
236 |         "    )\n",
237 |         "    messages = [\n",
238 |         "        {\"role\": \"system\", \"content\": \"You are an expert in extracting and summarizing relevant information.\"},\n",
239 |         "        {\"role\": \"user\", \"content\": f\"User Query: {user_query}\\nSearch Query: {search_query}\\n\\nWebpage Content (first 20000 characters):\\n{page_text[:20000]}\\n\\n{prompt}\"}\n",
240 |         "    ]\n",
241 |         "    response = await call_openrouter_async(session, messages)\n",
242 |         "    if response:\n",
243 |         "        return response.strip()\n",
244 |         "    return \"\"\n",
245 |         "\n",
246 |         "\n",
247 |         "async def get_new_search_queries_async(session, user_query, previous_search_queries, all_contexts):\n",
248 |         "    \"\"\"\n",
249 |         "    Based on the original query, the previously used search queries, and all the extracted contexts,\n",
250 |         "    ask the LLM whether additional search queries are needed. If yes, return a Python list of up to four queries;\n",
251 |         "    if the LLM thinks research is complete, it should return \"<done>\".\n",
252 |         "    \"\"\"\n",
253 |         "    context_combined = \"\\n\".join(all_contexts)\n",
254 |         "    prompt = (\n",
255 |         "        \"You are an analytical research assistant. Based on the original query, the search queries performed so far, \"\n",
256 |         "        \"and the extracted contexts from webpages, determine if further research is needed. \"\n",
257 |         "        \"If further research is needed, provide up to four new search queries as a Python list (for example, \"\n",
258 |         "        \"['new query1', 'new query2']). If you believe no further research is needed, respond with exactly <done>.\"\n",
259 |         "        \"\\nOutput only a Python list or the token <done> without any additional text.\"\n",
260 |         "    )\n",
261 |         "    messages = [\n",
262 |         "        {\"role\": \"system\", \"content\": \"You are a systematic research planner.\"},\n",
263 |         "        {\"role\": \"user\", \"content\": f\"User Query: {user_query}\\nPrevious Search Queries: {previous_search_queries}\\n\\nExtracted Relevant Contexts:\\n{context_combined}\\n\\n{prompt}\"}\n",
264 |         "    ]\n",
265 |         "    response = await call_openrouter_async(session, messages)\n",
266 |         "    if response:\n",
267 |         "        cleaned = response.strip()\n",
268 |         "        if cleaned == \"<done>\":\n",
269 |         "            return \"<done>\"\n",
270 |         "        try:\n",
271 |         "            new_queries = eval(cleaned)\n",
272 |         "            if isinstance(new_queries, list):\n",
273 |         "                return new_queries\n",
274 |         "            else:\n",
275 |         "                print(\"LLM did not return a list for new search queries. Response:\", response)\n",
276 |         "                return []\n",
277 |         "        except Exception as e:\n",
278 |         "            print(\"Error parsing new search queries:\", e, \"\\nResponse:\", response)\n",
279 |         "            return []\n",
280 |         "    return []\n",
281 |         "\n",
282 |         "\n",
283 |         "async def generate_final_report_async(session, user_query, all_contexts):\n",
284 |         "    \"\"\"\n",
285 |         "    Generate the final comprehensive report using all gathered contexts.\n",
286 |         "    \"\"\"\n",
287 |         "    context_combined = \"\\n\".join(all_contexts)\n",
288 |         "    prompt = (\n",
289 |         "        \"You are an expert researcher and report writer. Based on the gathered contexts below and the original query, \"\n",
290 |         "        \"write a comprehensive, well-structured, and detailed report that addresses the query thoroughly. \"\n",
291 |         "        \"Include all relevant insights and conclusions without extraneous commentary.\"\n",
292 |         "    )\n",
293 |         "    messages = [\n",
294 |         "        {\"role\": \"system\", \"content\": \"You are a skilled report writer.\"},\n",
295 |         "        {\"role\": \"user\", \"content\": f\"User Query: {user_query}\\n\\nGathered Relevant Contexts:\\n{context_combined}\\n\\n{prompt}\"}\n",
296 |         "    ]\n",
297 |         "    report = await call_openrouter_async(session, messages)\n",
298 |         "    return report\n",
299 |         "\n",
300 |         "\n",
301 |         "async def process_link(session, link, user_query, search_query):\n",
302 |         "    \"\"\"\n",
303 |         "    Process a single link: fetch its content, judge its usefulness, and if useful, extract the relevant context.\n",
304 |         "    \"\"\"\n",
305 |         "    print(f\"Fetching content from: {link}\")\n",
306 |         "    page_text = await fetch_webpage_text_async(session, link)\n",
307 |         "    if not page_text:\n",
308 |         "        return None\n",
309 |         "    usefulness = await is_page_useful_async(session, user_query, page_text)\n",
310 |         "    print(f\"Page usefulness for {link}: {usefulness}\")\n",
311 |         "    if usefulness == \"Yes\":\n",
312 |         "        context = await extract_relevant_context_async(session, user_query, search_query, page_text)\n",
313 |         "        if context:\n",
314 |         "            print(f\"Extracted context from {link} (first 200 chars): {context[:200]}\")\n",
315 |         "            return context\n",
316 |         "    return None\n",
317 |         "\n",
318 |         "\n",
319 |         "# =========================\n",
320 |         "# Main Asynchronous Routine\n",
321 |         "# =========================\n",
322 |         "\n",
323 |         "async def async_main():\n",
324 |         "    user_query = input(\"Enter your research query/topic: \").strip()\n",
325 |         "    iter_limit_input = input(\"Enter maximum number of iterations (default 10): \").strip()\n",
326 |         "    iteration_limit = int(iter_limit_input) if iter_limit_input.isdigit() else 10\n",
327 |         "\n",
328 |         "    aggregated_contexts = []    # All useful contexts from every iteration\n",
329 |         "    all_search_queries = []     # Every search query used across iterations\n",
330 |         "    iteration = 0\n",
331 |         "\n",
332 |         "    async with aiohttp.ClientSession() as session:\n",
333 |         "        # ----- INITIAL SEARCH QUERIES -----\n",
334 |         "        new_search_queries = await generate_search_queries_async(session, user_query)\n",
335 |         "        if not new_search_queries:\n",
336 |         "            print(\"No search queries were generated by the LLM. Exiting.\")\n",
337 |         "            return\n",
338 |         "        all_search_queries.extend(new_search_queries)\n",
339 |         "\n",
340 |         "        # ----- ITERATIVE RESEARCH LOOP -----\n",
341 |         "        while iteration < iteration_limit:\n",
342 |         "            print(f\"\\n=== Iteration {iteration + 1} ===\")\n",
343 |         "            iteration_contexts = []\n",
344 |         "\n",
345 |         "            # For each search query, perform SERPAPI searches concurrently.\n",
346 |         "            search_tasks = [perform_search_async(session, query) for query in new_search_queries]\n",
347 |         "            search_results = await asyncio.gather(*search_tasks)\n",
348 |         "\n",
349 |         "            # Aggregate all unique links from all search queries of this iteration.\n",
350 |         "            # Map each unique link to the search query that produced it.\n",
351 |         "            unique_links = {}\n",
352 |         "            for idx, links in enumerate(search_results):\n",
353 |         "                query = new_search_queries[idx]\n",
354 |         "                for link in links:\n",
355 |         "                    if link not in unique_links:\n",
356 |         "                        unique_links[link] = query\n",
357 |         "\n",
358 |         "            print(f\"Aggregated {len(unique_links)} unique links from this iteration.\")\n",
359 |         "\n",
360 |         "            # Process each link concurrently: fetch, judge, and extract context.\n",
361 |         "            link_tasks = [\n",
362 |         "                process_link(session, link, user_query, unique_links[link])\n",
363 |         "                for link in unique_links\n",
364 |         "            ]\n",
365 |         "            link_results = await asyncio.gather(*link_tasks)\n",
366 |         "\n",
367 |         "            # Collect non-None contexts.\n",
368 |         "            for res in link_results:\n",
369 |         "                if res:\n",
370 |         "                    iteration_contexts.append(res)\n",
371 |         "\n",
372 |         "            if iteration_contexts:\n",
373 |         "                aggregated_contexts.extend(iteration_contexts)\n",
374 |         "            else:\n",
375 |         "                print(\"No useful contexts were found in this iteration.\")\n",
376 |         "\n",
377 |         "            # ----- ASK THE LLM IF MORE SEARCHES ARE NEEDED -----\n",
378 |         "            new_search_queries = await get_new_search_queries_async(session, user_query, all_search_queries, aggregated_contexts)\n",
379 |         "            if new_search_queries == \"<done>\":\n",
380 |         "                print(\"LLM indicated that no further research is needed.\")\n",
381 |         "                break\n",
382 |         "            elif new_search_queries:\n",
383 |         "                print(\"LLM provided new search queries:\", new_search_queries)\n",
384 |         "                all_search_queries.extend(new_search_queries)\n",
385 |         "            else:\n",
386 |         "                print(\"LLM did not provide any new search queries. Ending the loop.\")\n",
387 |         "                break\n",
388 |         "\n",
389 |         "            iteration += 1\n",
390 |         "\n",
391 |         "        # ----- FINAL REPORT -----\n",
392 |         "        print(\"\\nGenerating final report...\")\n",
393 |         "        final_report = await generate_final_report_async(session, user_query, aggregated_contexts)\n",
394 |         "        print(\"\\n==== FINAL REPORT ====\\n\")\n",
395 |         "        print(final_report)\n",
396 |         "\n",
397 |         "\n",
398 |         "def main():\n",
399 |         "    asyncio.run(async_main())\n",
400 |         "\n",
401 |         "\n",
402 |         "if __name__ == \"__main__\":\n",
403 |         "    main()\n"
404 |       ]
405 |     },
406 |     {
407 |       "cell_type": "code",
408 |       "source": [],
409 |       "metadata": {
410 |         "id": "46Q5XpapDJZT"
411 |       },
412 |       "execution_count": null,
413 |       "outputs": []
414 |     }
415 |   ]
416 | }


--------------------------------------------------------------------------------