├── .devcontainer
│   └── devcontainer.json
├── .gitignore
├── README.md
├── dockerfile
├── main.py
├── requirements.txt
└── src
    ├── __init__.py
    ├── __pycache__
    │   ├── __init__.cpython-39.pyc
    │   └── deep_research.cpython-39.pyc
    └── deep_research.py

/.devcontainer/devcontainer.json:
--------------------------------------------------------------------------------
1 | // For format details, see https://aka.ms/devcontainer.json. For config options, see the
2 | // README at: https://github.com/devcontainers/templates/tree/main/src/docker-existing-dockerfile
3 | {
4 |     "name": "Existing Dockerfile",
5 |     "build": {
6 |         // Sets the run context to one level up instead of the .devcontainer folder.
7 |         "context": "..",
8 |         // Update the 'dockerfile' property if you aren't using the standard 'Dockerfile' filename.
9 |         "dockerfile": "../dockerfile"
10 |     }
11 | 
12 |     // Features to add to the dev container. More info: https://containers.dev/features.
13 |     // "features": {},
14 | 
15 |     // Use 'forwardPorts' to make a list of ports inside the container available locally.
16 |     // "forwardPorts": [],
17 | 
18 |     // Uncomment the next line to run commands after the container is created.
19 |     // "postCreateCommand": "cat /etc/os-release",
20 | 
21 |     // Configure tool-specific properties.
22 |     // "customizations": {},
23 | 
24 |     // Uncomment to connect as an existing user other than the container default. More info: https://aka.ms/dev-containers-non-root.
25 |     // "remoteUser": "devcontainer"
26 | }
27 | 
--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
1 | .env
2 | 
3 | *.json
4 | 
5 | *.pyc
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Open Gemini Deep Research
2 | 
3 | A powerful open-source research assistant powered by Google's Gemini AI that performs deep, multi-layered research on any topic.
4 | 
5 | ## Features
6 | 
7 | - Automated deep research with adjustable breadth and depth
8 | - Follow-up question generation for better context
9 | - Concurrent processing of multiple research queries
10 | - Comprehensive final report generation with citations
11 | - Three research modes: fast, balanced, and comprehensive
12 | - Progress tracking and detailed logging
13 | - Source tracking and citation management
14 | 
15 | ## Prerequisites
16 | 
17 | - Python 3.9+
18 | - Google Gemini API key
19 | - Docker (if using dev container)
20 | - VS Code with Dev Containers extension (if using dev container)
21 | 
22 | ## Installation
23 | 
24 | You can set up this project in one of two ways:
25 | 
26 | ### Option 1: Using Dev Container (Recommended)
27 | 
28 | 1. Open the project in VS Code
29 | 2. When prompted, click "Reopen in Container" or run the "Dev Containers: Reopen in Container" command
30 | 3. Create a `.env` file in the root directory and add your Gemini API key:
31 |    ```
32 |    GEMINI_KEY=your_api_key_here
33 |    ```
34 | 
35 | ### Option 2: Local Installation
36 | 
37 | 1. Clone the repository:
38 |    ```bash
39 |    git clone https://github.com/eRuaro/open-gemini-deep-research.git
40 |    cd open-gemini-deep-research
41 |    ```
42 | 
43 | 2. Create and activate a virtual environment (recommended):
44 |    ```bash
45 |    python -m venv venv
46 |    source venv/bin/activate  # On Windows: venv\Scripts\activate
47 |    ```
48 | 
49 | 3. Install dependencies:
50 |    ```bash
51 |    pip install -r requirements.txt
52 |    ```
53 | 
54 | 4. Create a `.env` file in the root directory and add your Gemini API key:
55 |    ```
56 |    GEMINI_KEY=your_api_key_here
57 |    ```
58 | 
59 | ## Usage
60 | 
61 | Run the main script with your research query:
62 | ```bash
63 | python main.py "your research query here"
64 | ```
65 | 
66 | ### Optional Arguments
67 | 
68 | - `--mode`: Research mode (choices: fast, balanced, comprehensive) [default: balanced]
69 | - `--num-queries`: Number of queries to generate [default: 3]
70 | - `--learnings`: List of previous learnings [optional]
71 | 
72 | Example:
73 | ```bash
74 | python main.py "Impact of artificial intelligence on healthcare" --mode comprehensive --num-queries 5
75 | ```
76 | 
77 | 
78 | ## Output
79 | 
80 | The script will:
81 | 1. Analyze your query for optimal research parameters
82 | 2. Ask follow-up questions for clarification
83 | 3. Conduct multi-layered research
84 | 4. Generate a comprehensive report saved as `final_report.md`
85 | 5. Show progress updates throughout the process
86 | 
87 | ## Project Structure
88 | 
89 | ```
90 | open-gemini-deep-research/
91 | ├── .devcontainer/
92 | │   └── devcontainer.json
93 | ├── src/
94 | │   ├── __init__.py
95 | │   └── deep_research.py
96 | ├── .env
97 | ├── .gitignore
98 | ├── dockerfile
99 | ├── main.py
100 | ├── README.md
101 | └── requirements.txt
102 | ```
103 | 
104 | ## How It Works
105 | 
106 | ### Research Modes
107 | 
108 | The application offers three research modes that affect how deeply and broadly the research is conducted:
109 | 
110 | 1. **Fast Mode**
111 |    - Performs quick, surface-level research
112 |    - Maximum of 5 concurrent queries
113 |    - No recursive deep diving
114 |    - Typically generates 2-3 follow-up questions per query
115 |    - Best for time-sensitive queries or initial exploration
116 |    - Processing time: ~1-3 minutes
117 | 
118 | 2. **Balanced Mode** (Default)
119 |    - Provides moderate depth and breadth
120 |    - Maximum of 10 concurrent queries
121 |    - No recursive deep diving
122 |    - Generates 3-5 follow-up questions per query
123 |    - Explores main concepts and their immediate relationships
124 |    - Processing time: ~3-6 minutes
125 |    - Recommended for most research needs
126 | 
127 | 3. **Comprehensive Mode**
128 |    - Conducts exhaustive, in-depth research
129 |    - Maximum of 7 initial queries, but includes recursive deep diving
130 |    - Each query can spawn sub-queries that go deeper into the topic
131 |    - Generates 5-7 follow-up questions with recursive exploration
132 |    - Explores primary, secondary, and tertiary relationships
133 |    - Includes counter-arguments and alternative viewpoints
134 |    - Processing time: ~5-12 minutes
135 |    - Best for academic or detailed analysis
136 | 
137 | ### Research Process
138 | 
139 | 1. **Query Analysis**
140 |    - Analyzes the initial query to determine optimal research parameters
141 |    - Assigns breadth (1-10 scale) and depth (1-5 scale) values
142 |    - Adjusts parameters based on query complexity and chosen mode
143 | 
144 | 2. **Query Generation**
145 |    - Creates unique, non-overlapping search queries
146 |    - Uses semantic similarity checking to avoid redundant queries
147 |    - Maintains query history to prevent duplicates
148 |    - Adapts the number of queries to mode settings, as illustrated below
149 | 
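150 |    For example, each batch of search queries is requested from Gemini as
151 |    structured JSON validated against the `QueryResponse` schema in
152 |    `src/deep_research.py` (the queries shown here are invented examples):
153 | 
154 |    ```json
155 |    {
156 |      "queries": [
157 |        "current applications of AI in medical imaging diagnostics",
158 |        "economic impact of AI triage tools on hospital staffing",
159 |        "regulatory hurdles for AI clinical decision support"
160 |      ]
161 |    }
162 |    ```
163 | 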
164 | 3. **Research Tree Building**
165 |    - Implements a tree structure to track research progress
166 |    - Each query gets a unique UUID for tracking
167 |    - Maintains parent-child relationships between queries
168 |    - Tracks query order and completion status
169 |    - Provides detailed progress visualization through a JSON tree structure
170 | 
171 | 4. **Deep Research** (Comprehensive Mode)
172 |    - Implements a recursive research strategy
173 |    - Each query can spawn up to three follow-up queries
174 |    - Reduces breadth at deeper levels (roughly breadth/1.5, capped at 5)
175 |    - Maintains visited URLs to avoid duplicates
176 |    - Combines learnings from all levels
177 | 
178 | 5. **Report Generation**
179 |    - Synthesizes findings into a coherent narrative
180 |    - Produces a long-form, detailed report
181 |    - Includes inline citations and source tracking
182 |    - Organizes information by relevance and relationship
183 |    - Adds creative elements like scenarios and analogies
184 |    - Maintains factual accuracy while being engaging
185 | 
186 | ### Technical Implementation
187 | 
188 | - Uses Google's Gemini AI for:
189 |   - Query analysis and generation
190 |   - Content processing and synthesis
191 |   - Semantic similarity checking
192 |   - Report generation
193 | - Implements concurrent processing for queries
194 | - Uses a progress tracking system with tree visualization
195 | - Maintains a research tree structure for relationship mapping
196 | 
197 | #### Research Tree Implementation
198 | 
199 | The research tree is implemented through the `ResearchProgress` class, which tracks:
200 | - Query relationships (parent-child)
201 | - Query completion status
202 | - Learnings per query
203 | - Query order
204 | - Unique IDs for each query
205 | 
206 | The complete research tree structure is automatically saved to `research_tree.json` at the end of each research run, allowing for later analysis or visualization of the research process.
207 | 
208 | Example tree structure:
209 | ```json
210 | {
211 |   "query": "root query",
212 |   "id": "uuid-1",
213 |   "status": "completed",
214 |   "depth": 2,
215 |   "learnings": ["learning 1", "learning 2"],
216 |   "sources": [],
217 |   "sub_queries": [
218 |     {
219 |       "query": "sub-query 1",
220 |       "id": "uuid-2",
221 |       "status": "completed",
222 |       "depth": 1,
223 |       "learnings": ["learning 3"],
224 |       "sources": [],
225 |       "sub_queries": [],
226 |       "parent_query": "root query"
227 |     }
228 |   ],
229 |   "parent_query": null
230 | }
231 | ```
232 | 
--------------------------------------------------------------------------------
/dockerfile:
--------------------------------------------------------------------------------
1 | FROM python:3.9-slim-buster
2 | 
3 | # Install build tools needed by some Python dependencies
4 | RUN apt-get update && apt-get install gcc g++ git build-essential -y
5 | 
6 | # Make the working directory
7 | RUN mkdir -p /open-gemini-deep-research
8 | WORKDIR /open-gemini-deep-research
9 | 
10 | # Copy the requirements.txt file to the container
11 | COPY requirements.txt .
12 | 
13 | # Install dependencies
14 | RUN pip install --upgrade pip
15 | 
16 | RUN pip install -r requirements.txt
17 | 
18 | # Copy the .env file to the container
19 | COPY .env .
20 | 
21 | # Copy every file in the source folder to the created working directory
22 | COPY . .
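23 | # NOTE: main.py requires the research query as a positional argument, so the
24 | # bare CMD below will exit with a usage error; pass a query at run time, e.g.
25 | #   docker run <image> python3.9 main.py "your research query"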
23 | 24 | # Expose the port that the application will run on 25 | EXPOSE 8080 26 | 27 | # Start the application 28 | CMD ["python3.9", "main.py"] -------------------------------------------------------------------------------- /main.py: -------------------------------------------------------------------------------- 1 | import argparse 2 | import asyncio 3 | import os 4 | import time 5 | 6 | from src.deep_research import DeepSearch 7 | 8 | 9 | if __name__ == "__main__": 10 | parser = argparse.ArgumentParser(description='Run deep search queries') 11 | parser.add_argument('query', type=str, help='The search query') 12 | parser.add_argument('--mode', type=str, choices=['fast', 'balanced', 'comprehensive'], 13 | default='balanced', help='Research mode (default: balanced)') 14 | parser.add_argument('--num-queries', type=int, default=3, 15 | help='Number of queries to generate (default: 3)') 16 | parser.add_argument('--learnings', nargs='*', default=[], 17 | help='List of previous learnings') 18 | 19 | args = parser.parse_args() 20 | 21 | # Start the timer 22 | start_time = time.time() 23 | 24 | # Get API key from environment variable 25 | api_key = os.getenv('GEMINI_KEY') 26 | if not api_key: 27 | raise ValueError("Please set GEMINI_KEY environment variable") 28 | 29 | deep_search = DeepSearch(api_key, mode=args.mode) 30 | 31 | breadth_and_depth = deep_search.determine_research_breadth_and_depth( 32 | args.query) 33 | 34 | breadth = breadth_and_depth["breadth"] 35 | depth = breadth_and_depth["depth"] 36 | explanation = breadth_and_depth["explanation"] 37 | 38 | print(f"Breadth: {breadth}") 39 | print(f"Depth: {depth}") 40 | print(f"Explanation: {explanation}") 41 | 42 | print("To better understand your research needs, please answer these follow-up questions:") 43 | 44 | follow_up_questions = deep_search.generate_follow_up_questions(args.query) 45 | 46 | # get answers to the follow up questions 47 | answers = [] 48 | for question in follow_up_questions: 49 | answer = input(f"{question}: ") 50 | answers.append({ 51 | "question": question, 52 | "answer": answer 53 | }) 54 | 55 | questions_and_answers = "\n".join( 56 | [f"{answer['question']}: {answer['answer']}" for answer in answers]) 57 | 58 | combined_query = f"Initial query: {args.query}\n\n Follow up questions and answers: {questions_and_answers}" 59 | 60 | print(f"\nHere is the combined query: {combined_query}\n\n") 61 | 62 | print("Starting research... 
\n") 63 | 64 | # Run the deep research 65 | results = asyncio.run(deep_search.deep_research( 66 | query=combined_query, 67 | breadth=breadth, 68 | depth=depth, 69 | learnings=[], 70 | visited_urls={} 71 | )) 72 | 73 | # Generate and print the final report 74 | final_report = asyncio.run(deep_search.generate_final_report( 75 | query=combined_query, 76 | learnings=results["learnings"], 77 | visited_urls=results["visited_urls"] 78 | )) 79 | 80 | # Calculate elapsed time 81 | elapsed_time = time.time() - start_time 82 | minutes = int(elapsed_time // 60) 83 | seconds = int(elapsed_time % 60) 84 | 85 | print("\nFinal Research Report:") 86 | print("=====================") 87 | print(final_report) 88 | print(f"\nTotal research time: {minutes} minutes and {seconds} seconds") 89 | 90 | # Save the report to a file 91 | with open("final_report.md", "w") as f: 92 | f.write(final_report) 93 | f.write( 94 | f"\n\nTotal research time: {minutes} minutes and {seconds} seconds") 95 | -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | annotated-types==0.7.0 2 | anyio==4.8.0 3 | cachetools==5.5.2 4 | certifi==2025.1.31 5 | charset-normalizer==3.4.1 6 | exceptiongroup==1.2.2 7 | google-ai-generativelanguage==0.6.10 8 | google-api-core==2.24.1 9 | google-api-python-client==2.161.0 10 | google-auth==2.38.0 11 | google-auth-httplib2==0.2.0 12 | google-genai==1.4.0 13 | google-generativeai==0.8.3 14 | googleapis-common-protos==1.68.0 15 | grpcio==1.70.0 16 | grpcio-status==1.70.0 17 | h11==0.14.0 18 | httpcore==1.0.7 19 | httplib2==0.22.0 20 | httpx==0.28.1 21 | idna==3.10 22 | pillow==11.1.0 23 | proto-plus==1.26.0 24 | protobuf==5.29.3 25 | pyasn1==0.6.1 26 | pyasn1_modules==0.4.1 27 | pydantic==2.11.0a2 28 | pydantic_core==2.29.0 29 | pyparsing==3.2.1 30 | python-dotenv==1.0.1 31 | requests==2.32.3 32 | rsa==4.9 33 | sniffio==1.3.1 34 | tqdm==4.67.1 35 | typing_extensions==4.12.2 36 | uritemplate==4.1.1 37 | urllib3==2.3.0 38 | websockets==14.2 39 | -------------------------------------------------------------------------------- /src/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/eRuaro/open-gemini-deep-research/1fe4892d267a7624737b935194c12c43c1e69ee5/src/__init__.py -------------------------------------------------------------------------------- /src/__pycache__/__init__.cpython-39.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/eRuaro/open-gemini-deep-research/1fe4892d267a7624737b935194c12c43c1e69ee5/src/__pycache__/__init__.cpython-39.pyc -------------------------------------------------------------------------------- /src/__pycache__/deep_research.cpython-39.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/eRuaro/open-gemini-deep-research/1fe4892d267a7624737b935194c12c43c1e69ee5/src/__pycache__/deep_research.cpython-39.pyc -------------------------------------------------------------------------------- /src/deep_research.py: -------------------------------------------------------------------------------- 1 | import asyncio 2 | import datetime 3 | import json 4 | import os 5 | from typing import List 6 | import uuid 7 | import argparse 8 | import time 9 | import math 10 | 11 | from dotenv import load_dotenv 12 | 13 | from google import genai 14 | from 
google.genai import types 15 | from pydantic import BaseModel 16 | 17 | 18 | class ResearchProgress: 19 | def __init__(self, depth: int, breadth: int): 20 | self.total_depth = depth 21 | self.total_breadth = breadth 22 | self.current_depth = depth 23 | self.current_breadth = 0 24 | self.queries_by_depth = {} 25 | self.query_order = [] # Track order of queries 26 | self.query_parents = {} # Track parent-child relationships 27 | self.total_queries = 0 # Total number of queries including sub-queries 28 | self.completed_queries = 0 29 | self.query_ids = {} # Store persistent IDs for queries 30 | self.root_query = None # Store the root query 31 | 32 | async def start_query(self, query: str, depth: int, parent_query: str = None): 33 | """Record the start of a new query""" 34 | # Generate a unique ID for this query 35 | query_id = str(uuid.uuid4()) 36 | self.query_ids[query] = query_id 37 | 38 | # If this is the first query, set it as the root 39 | if self.root_query is None: 40 | self.root_query = query 41 | 42 | # Initialize the depth level if it doesn't exist 43 | if depth not in self.queries_by_depth: 44 | self.queries_by_depth[depth] = {} 45 | 46 | # Add the query to the appropriate depth level if it's not already there 47 | if query not in self.queries_by_depth[depth]: 48 | self.queries_by_depth[depth][query] = { 49 | "completed": False, 50 | "learnings": [], 51 | "sources": [], # Add sources list to store source information 52 | "id": self.query_ids[query] # Use persistent ID 53 | } 54 | self.query_order.append(query) 55 | if parent_query: 56 | self.query_parents[query] = parent_query 57 | self.total_queries += 1 58 | 59 | self.current_depth = depth 60 | self.current_breadth = len(self.queries_by_depth[depth]) 61 | await self._report_progress("query_started") 62 | 63 | async def add_learning(self, query: str, depth: int, learning: str): 64 | if depth in self.queries_by_depth and query in self.queries_by_depth[depth]: 65 | self.queries_by_depth[depth][query]["learnings"].append( 66 | learning) 67 | await self._report_progress("learning_added") 68 | 69 | async def complete_query(self, query: str, depth: int): 70 | """Mark a query as completed""" 71 | if depth in self.queries_by_depth and query in self.queries_by_depth[depth]: 72 | if not self.queries_by_depth[depth][query]["completed"]: 73 | self.queries_by_depth[depth][query]["completed"] = True 74 | self.completed_queries += 1 75 | await self._report_progress(f"Completed query: {query}") 76 | 77 | # Check if parent query exists and update its status if all children are complete 78 | parent_query = self.query_parents.get(query) 79 | if parent_query: 80 | await self._update_parent_status(parent_query) 81 | 82 | async def add_sources(self, query: str, depth: int, sources: list[dict[str, str]]): 83 | """Record sources for a specific query""" 84 | if depth in self.queries_by_depth and query in self.queries_by_depth[depth]: 85 | # Add new sources that aren't already in the list 86 | current_sources = self.queries_by_depth[depth][query]["sources"] 87 | current_urls = {source["url"] for source in current_sources} 88 | 89 | for source in sources: 90 | if source["url"] not in current_urls: 91 | current_sources.append(source) 92 | current_urls.add(source["url"]) 93 | 94 | await self._report_progress(f"Added sources for query: {query}") 95 | 96 | async def _update_parent_status(self, parent_query: str): 97 | """Update parent query status based on children completion""" 98 | # Find all children of this parent 99 | children = [q for q, p in 
self.query_parents.items() if p == 100 | parent_query] 101 | 102 | # Check if all children are complete 103 | parent_depth = next((d for d, queries in self.queries_by_depth.items() 104 | if parent_query in queries), None) 105 | 106 | if parent_depth is not None: 107 | all_children_complete = all( 108 | self.queries_by_depth[d][q]["completed"] 109 | for q in children 110 | for d in self.queries_by_depth 111 | if q in self.queries_by_depth[d] 112 | ) 113 | 114 | if all_children_complete: 115 | # Complete the parent query 116 | await self.complete_query(parent_query, parent_depth) 117 | 118 | async def _report_progress(self, action: str): 119 | """Report current progress and stream to client if callback provided""" 120 | # Build event data for streaming 121 | progress_data = { 122 | "type": "research_progress", 123 | "action": action, 124 | "timestamp": datetime.datetime.now().isoformat(), 125 | "completed_queries": self.completed_queries, 126 | "total_queries": self.total_queries, 127 | "progress_percentage": int((self.completed_queries / max(1, self.total_queries)) * 100) 128 | } 129 | 130 | # Add tree structure if root query exists 131 | if self.root_query: 132 | progress_data["tree"] = self._build_research_tree() 133 | 134 | # Print progress to console 135 | print( 136 | f"[Progress] {action}: {progress_data['progress_percentage']}% complete") 137 | 138 | def _build_research_tree(self): 139 | """Build a tree structure of the research queries""" 140 | def build_node(query): 141 | """Recursively build the tree node""" 142 | # Find the depth for this query 143 | depth = next((d for d, queries in self.queries_by_depth.items() 144 | if query in queries), 0) 145 | 146 | data = self.queries_by_depth[depth][query] 147 | 148 | # Find all children of this query 149 | children = [q for q, p in self.query_parents.items() if p == query] 150 | 151 | return { 152 | "query": query, 153 | "id": self.query_ids[query], 154 | "status": "completed" if data["completed"] else "in_progress", 155 | "depth": depth, 156 | "learnings": data["learnings"], 157 | "sources": data["sources"], # Include sources in the tree 158 | "sub_queries": [build_node(child) for child in children], 159 | "parent_query": self.query_parents.get(query) 160 | } 161 | 162 | # Start building from the root query 163 | if self.root_query: 164 | return build_node(self.root_query) 165 | return {} 166 | 167 | def get_learnings_by_query(self): 168 | """Get all learnings organized by query""" 169 | learnings = {} 170 | for depth, queries in self.queries_by_depth.items(): 171 | for query, data in queries.items(): 172 | if data["learnings"]: 173 | learnings[query] = data["learnings"] 174 | return learnings 175 | 176 | 177 | load_dotenv() 178 | 179 | 180 | class DeepSearch: 181 | def __init__(self, api_key: str, mode: str = "balanced"): 182 | """ 183 | Initialize DeepSearch with a mode parameter: 184 | - "fast": Prioritizes speed (reduced breadth/depth, highest concurrency) 185 | - "balanced": Default balance of speed and comprehensiveness 186 | - "comprehensive": Maximum detail and coverage 187 | """ 188 | self.api_key = api_key 189 | self.model_name = "gemini-2.0-flash" 190 | self.query_history = set() 191 | self.mode = mode 192 | self.client = genai.Client(api_key=self.api_key) 193 | 194 | def determine_research_breadth_and_depth(self, query: str): 195 | """Determine the appropriate research breadth and depth based on the query complexity""" 196 | class ResearchParameters(BaseModel): 197 | breadth: int 198 | depth: int 199 | explanation: str 200 | 
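201 |         # The response_schema entry below makes the SDK validate Gemini's JSON
202 |         # reply against ResearchParameters and expose it via response.parsed.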
201 |         user_prompt = f"""
202 |         Analyze this research query and determine the appropriate breadth (number of parallel search queries)
203 |         and depth (levels of follow-up questions) needed for thorough research:
204 | 
205 |         Query: {query}
206 | 
207 |         Consider:
208 |         1. Complexity of the topic
209 |         2. Breadth of knowledge required
210 |         3. Depth of expertise needed
211 |         4. Potential for follow-up exploration
212 | 
213 |         Return a JSON object with:
214 |         - "breadth": integer between 1-10 (number of parallel search queries)
215 |         - "depth": integer between 1-5 (levels of follow-up questions)
216 |         - "explanation": brief explanation of your reasoning
217 |         """
218 | 
219 |         generation_config = {
220 |             "temperature": 0.2,
221 |             "top_p": 0.95,
222 |             "top_k": 40,
223 |             "max_output_tokens": 1024,
224 |             "response_mime_type": "application/json",
225 |             "response_schema": ResearchParameters,
226 |         }
227 | 
228 |         try:
229 |             response = self.client.models.generate_content(
230 |                 model="gemini-2.0-flash",
231 |                 contents=user_prompt,
232 |                 config=generation_config
233 |             )
234 | 
235 |             # Get the parsed response using the Pydantic model
236 |             parsed_response = response.parsed
237 | 
238 |             return {
239 |                 "breadth": parsed_response.breadth,
240 |                 "depth": parsed_response.depth,
241 |                 "explanation": parsed_response.explanation
242 |             }
243 | 
244 |         except Exception as e:
245 |             print(f"Error determining research parameters: {str(e)}")
246 |             # Default values based on mode; include "explanation" so callers reading that key don't hit a KeyError
247 |             defaults = {
248 |                 "fast": {"breadth": 3, "depth": 1, "explanation": "Using fast-mode defaults."},
249 |                 "balanced": {"breadth": 5, "depth": 2, "explanation": "Using balanced-mode defaults."},
250 |                 "comprehensive": {"breadth": 7, "depth": 3, "explanation": "Using comprehensive-mode defaults."}
251 |             }
252 |             return defaults.get(self.mode, {"breadth": 5, "depth": 2, "explanation": "Using default values."})
253 | 
254 |     def generate_follow_up_questions(
255 |         self,
256 |         query: str,
257 |         max_questions: int = 3,
258 |     ):
259 |         """Generate follow-up questions based on the initial query"""
260 |         class FollowUpQuestions(BaseModel):
261 |             follow_up_queries: list[str]
262 | 
263 |         user_prompt = f"""
264 |         Based on the following user query, generate {max_questions} follow-up questions that would help clarify what the user wants to know.
265 |         These questions should:
266 |         1. Seek to understand the user's specific information needs
267 |         2. Clarify ambiguous terms or concepts in the original query
268 |         3. Determine the scope or boundaries of what the user is looking for
269 |         4. Identify the user's level of familiarity with the topic
270 |         5. Uncover the user's purpose or goal for seeking this information
271 | 
272 |         User Query: {query}
273 | 
274 |         Format your response as a JSON object with a single key "follow_up_queries" containing an array of strings.
275 |         Example:
276 |         ```json
277 |         {{
278 |             "follow_up_queries": [
279 |                 "Could you specify what aspects of electric vehicles you're most interested in learning about?",
280 |                 "Are you looking for information about a specific brand or type of electric vehicle?",
281 |                 "Would you like to know about the technical details, environmental impact, or consumer aspects?"
282 | ] 283 | }} 284 | ``` 285 | """ 286 | 287 | generation_config = { 288 | "temperature": 0.7, 289 | "top_p": 0.95, 290 | "top_k": 40, 291 | "max_output_tokens": 1024, 292 | "response_mime_type": "application/json", 293 | "response_schema": FollowUpQuestions, 294 | } 295 | 296 | try: 297 | response = self.client.models.generate_content( 298 | model="gemini-2.0-flash", 299 | contents=user_prompt, 300 | config=generation_config 301 | ) 302 | 303 | try: 304 | # Get the parsed response using the Pydantic model 305 | parsed_response = response.parsed 306 | return parsed_response.follow_up_queries 307 | except Exception as e: 308 | print(f"Error parsing follow-up questions: {str(e)}") 309 | # Fallback to simple text parsing 310 | lines = response.text.strip().split('\n') 311 | questions = [] 312 | for line in lines: 313 | line = line.strip() 314 | if line and '?' in line: 315 | questions.append(line) 316 | return questions[:max_questions] 317 | 318 | except Exception as e: 319 | print(f"Error generating follow-up questions: {str(e)}") 320 | return [f"What are the key aspects of {query}?"] 321 | 322 | async def generate_queries( 323 | self, 324 | query: str, 325 | num_queries: int = 3, 326 | learnings: list[str] = [], 327 | previous_queries: set[str] = None # Add previous_queries parameter 328 | ): 329 | """Generate search queries based on the initial query and learnings""" 330 | if previous_queries is None: 331 | previous_queries = set() 332 | 333 | # Adjust the prompt based on the mode 334 | prompt_by_mode = { 335 | "fast": "Generate concise, focused search queries", 336 | "balanced": "Generate balanced search queries that explore different aspects", 337 | "comprehensive": "Generate comprehensive search queries that deeply explore the topic" 338 | } 339 | 340 | mode_prompt = prompt_by_mode.get(self.mode, prompt_by_mode["balanced"]) 341 | 342 | # Format learnings for the prompt 343 | learnings_text = "\n".join([f"- {learning}" for learning in learnings]) 344 | learnings_section = f"\nBased on what we've learned so far:\n{learnings_text}" if learnings else "" 345 | 346 | # Format previous queries for the prompt 347 | previous_queries_text = "\n".join([f"- {q}" for q in previous_queries]) 348 | previous_queries_section = f"\nPrevious search queries (avoid repeating these):\n{previous_queries_text}" if previous_queries else "" 349 | 350 | user_prompt = f""" 351 | You are a research assistant helping to explore the topic: "{query}" 352 | 353 | {mode_prompt}. 354 | {learnings_section} 355 | {previous_queries_section} 356 | 357 | Generate {num_queries} specific search queries that would help gather comprehensive information about this topic. 358 | Each query should focus on a different aspect or subtopic. 359 | Make the queries specific and well-formed for a search engine. 360 | 361 | Format your response as a JSON object with a "queries" field containing an array of query strings. 
362 | """ 363 | 364 | class QueryResponse(BaseModel): 365 | queries: list[str] 366 | 367 | generation_config = { 368 | "temperature": 0.7, 369 | "top_p": 0.95, 370 | "top_k": 40, 371 | "max_output_tokens": 1024, 372 | "response_mime_type": "application/json", 373 | "response_schema": QueryResponse, 374 | } 375 | 376 | try: 377 | response = await self.client.aio.models.generate_content( 378 | model="gemini-2.0-flash", 379 | contents=user_prompt, 380 | config=generation_config 381 | ) 382 | 383 | # Parse the response 384 | try: 385 | # Get the parsed response using the Pydantic model 386 | parsed_response = response.parsed 387 | queries = set(parsed_response.queries) 388 | 389 | # Filter out any queries that are too similar to previous ones 390 | unique_queries = set() 391 | for q in queries: 392 | is_similar = False 393 | for prev_q in previous_queries: 394 | if await self._are_queries_similar(q, prev_q): 395 | is_similar = True 396 | break 397 | 398 | if not is_similar: 399 | unique_queries.add(q) 400 | 401 | return unique_queries 402 | except Exception as e: 403 | print(f"Error parsing query response: {str(e)}") 404 | # Fallback to simple text parsing if JSON parsing fails 405 | lines = response.text.strip().split('\n') 406 | queries = set() 407 | for line in lines: 408 | line = line.strip() 409 | if line and not line.startswith('{') and not line.startswith('}'): 410 | queries.add(line) 411 | return queries 412 | 413 | except Exception as e: 414 | print(f"Error generating queries: {str(e)}") 415 | # Fallback to basic queries if generation fails 416 | return {f"{query} - aspect {i+1}" for i in range(num_queries)} 417 | 418 | def format_text_with_sources(self, response_dict: dict, answer: str): 419 | """ 420 | Format text with sources from Gemini response, adding citations at specified positions. 421 | Returns tuple of (formatted_text, sources_dict). 
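422 |         Citations are inserted as markdown links of the form [[n]](url), placed at
423 |         the end_index of each grounding segment in the response's grounding metadata.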
422 | """ 423 | if not response_dict or not response_dict.get('candidates'): 424 | return answer, {} 425 | 426 | # Get grounding metadata from the response 427 | grounding_metadata = response_dict['candidates'][0].get( 428 | 'grounding_metadata') 429 | if not grounding_metadata: 430 | return answer, {} 431 | 432 | # Get grounding chunks and supports 433 | grounding_chunks = grounding_metadata.get('grounding_chunks', []) 434 | grounding_supports = grounding_metadata.get('grounding_supports', []) 435 | 436 | if not grounding_chunks or not grounding_supports: 437 | return answer, {} 438 | 439 | try: 440 | # Create mapping of URLs 441 | sources = { 442 | i: { 443 | 'link': chunk.get('web', {}).get('uri', ''), 444 | 'title': chunk.get('web', {}).get('title', '') 445 | } 446 | for i, chunk in enumerate(grounding_chunks) 447 | if chunk.get('web') 448 | } 449 | 450 | # Create a list of (position, citation) tuples 451 | citations = [] 452 | for support in grounding_supports: 453 | segment = support.get('segment', {}) 454 | indices = support.get('grounding_chunk_indices', []) 455 | 456 | if indices and segment and segment.get('end_index') is not None: 457 | end_index = segment['end_index'] 458 | source_idx = indices[0] 459 | if source_idx in sources: 460 | citation = f"[[{source_idx + 1}]]({sources[source_idx]['link']})" 461 | citations.append((end_index, citation)) 462 | 463 | # Sort citations by position (end_index) 464 | citations.sort(key=lambda x: x[0]) 465 | 466 | # Insert citations into the text 467 | result = "" 468 | last_pos = 0 469 | for pos, citation in citations: 470 | result += answer[last_pos:pos] 471 | result += citation 472 | last_pos = pos 473 | 474 | # Add any remaining text 475 | result += answer[last_pos:] 476 | 477 | return result, sources 478 | 479 | except Exception as e: 480 | print(f"Error processing grounding metadata: {e}") 481 | return answer, {} 482 | 483 | async def search(self, query: str): 484 | model_id = "gemini-2.0-flash" 485 | 486 | google_search_tool = types.Tool( 487 | google_search=types.GoogleSearch() 488 | ) 489 | 490 | generation_config = { 491 | "temperature": 1, 492 | "top_p": 0.95, 493 | "top_k": 40, 494 | "max_output_tokens": 8192, 495 | "response_mime_type": "text/plain", 496 | "response_modalities": ["TEXT"], 497 | "tools": [google_search_tool] 498 | } 499 | 500 | response = await self.client.aio.models.generate_content( 501 | model=model_id, 502 | contents=query, 503 | config=generation_config 504 | ) 505 | 506 | response_dict = response.model_dump() 507 | 508 | formatted_text, sources = self.format_text_with_sources( 509 | response_dict, response.text) 510 | 511 | return formatted_text, sources 512 | 513 | async def process_result( 514 | self, 515 | query: str, 516 | result: str, 517 | num_learnings: int = 3, 518 | num_follow_up_questions: int = 3, 519 | ): 520 | """Process search results to extract learnings and generate follow-up questions""" 521 | class ProcessedResult(BaseModel): 522 | learnings: list[str] 523 | follow_up_questions: list[str] 524 | 525 | user_prompt = f""" 526 | Analyze the following search results for the query: "{query}" 527 | 528 | Search Results: 529 | {result} 530 | 531 | Please extract: 532 | 1. The {num_learnings} most important learnings or insights from these results 533 | 2. 
{num_follow_up_questions} follow-up questions that would help explore this topic further 534 | 535 | Format your response as a JSON object with: 536 | - "learnings": array of learning strings 537 | - "follow_up_questions": array of question strings 538 | """ 539 | 540 | generation_config = { 541 | "temperature": 0.7, 542 | "top_p": 0.95, 543 | "top_k": 40, 544 | "max_output_tokens": 2048, 545 | "response_mime_type": "application/json", 546 | "response_schema": ProcessedResult, 547 | } 548 | 549 | try: 550 | response = await self.client.aio.models.generate_content( 551 | model="gemini-2.0-flash", 552 | contents=user_prompt, 553 | config=generation_config 554 | ) 555 | 556 | try: 557 | # Get the parsed response using the Pydantic model 558 | parsed_response = response.parsed 559 | return { 560 | "learnings": parsed_response.learnings, 561 | "follow_up_questions": parsed_response.follow_up_questions 562 | } 563 | except Exception as e: 564 | print(f"Error parsing process_result: {str(e)}") 565 | # Fallback to generating follow-up questions separately 566 | follow_up_questions = self.generate_follow_up_questions( 567 | query, num_follow_up_questions 568 | ) 569 | 570 | # Extract some basic learnings from the result 571 | learnings = [ 572 | f"Information about {query}", 573 | f"Details related to {query}" 574 | ] 575 | 576 | return { 577 | "learnings": learnings, 578 | "follow_up_questions": follow_up_questions 579 | } 580 | 581 | except Exception as e: 582 | print(f"Error processing result: {str(e)}") 583 | return { 584 | "learnings": [f"Information about {query}"], 585 | "follow_up_questions": [f"What are the key aspects of {query}?"] 586 | } 587 | 588 | async def _are_queries_similar(self, query1: str, query2: str) -> bool: 589 | """Check if two queries are semantically similar""" 590 | # Simple string comparison for exact matches 591 | if query1.lower() == query2.lower(): 592 | return True 593 | 594 | # For very short queries, use substring check 595 | if len(query1) < 10 or len(query2) < 10: 596 | return query1.lower() in query2.lower() or query2.lower() in query1.lower() 597 | 598 | # For more complex queries, use Gemini to check similarity 599 | class SimilarityResult(BaseModel): 600 | are_similar: bool 601 | 602 | user_prompt = f""" 603 | Compare these two search queries and determine if they are semantically similar 604 | (would likely return similar search results): 605 | 606 | Query 1: {query1} 607 | Query 2: {query2} 608 | 609 | Return a JSON object with a single boolean field "are_similar" indicating if the queries are similar. 
610 | """ 611 | 612 | generation_config = { 613 | "temperature": 0.2, 614 | "top_p": 0.95, 615 | "top_k": 40, 616 | "max_output_tokens": 1024, 617 | "response_mime_type": "application/json", 618 | "response_schema": SimilarityResult, 619 | } 620 | 621 | try: 622 | response = await self.client.aio.models.generate_content( 623 | model="gemini-2.0-flash", 624 | contents=user_prompt, 625 | config=generation_config 626 | ) 627 | 628 | # Get the parsed response using the Pydantic model 629 | parsed_response = response.parsed 630 | return parsed_response.are_similar 631 | except Exception as e: 632 | print(f"Error comparing queries: {str(e)}") 633 | # In case of error, assume queries are different to avoid missing potentially unique results 634 | return False 635 | 636 | async def deep_research(self, query: str, breadth: int, depth: int, learnings: list[str] = [], visited_urls: dict[int, dict] = {}, parent_query: str = None): 637 | progress = ResearchProgress(depth, breadth) 638 | 639 | # Start the root query 640 | await progress.start_query(query, depth, parent_query) 641 | 642 | # Adjust number of queries based on mode 643 | max_queries = { 644 | "fast": 5, 645 | "balanced": 10, 646 | "comprehensive": 7 # Changed from 15 to 7 647 | }[self.mode] 648 | 649 | queries = await self.generate_queries( 650 | query, 651 | min(breadth, max_queries), 652 | learnings, 653 | previous_queries=self.query_history 654 | ) 655 | 656 | self.query_history.update(queries) 657 | unique_queries = list(queries)[:breadth] 658 | 659 | async def process_query(query_str: str, current_depth: int, parent: str = None): 660 | try: 661 | # Start this query as a sub-query of the parent 662 | await progress.start_query(query_str, current_depth, parent) 663 | 664 | result = await self.search(query_str) 665 | # The search method returns a tuple (formatted_text, sources) 666 | formatted_text, new_urls = result 667 | 668 | # Add sources to the progress tracker for this query 669 | if new_urls: 670 | sources_list = [ 671 | {"url": url_data["link"], "title": url_data["title"]} 672 | for url_data in new_urls.values() 673 | if "link" in url_data and "title" in url_data 674 | ] 675 | await progress.add_sources(query_str, current_depth, sources_list) 676 | 677 | processed_result = await self.process_result( 678 | query=query_str, 679 | result=formatted_text, 680 | num_learnings=min(5, math.ceil(breadth / 1.5)), 681 | num_follow_up_questions=min(5, math.ceil(breadth / 1.5)) 682 | ) 683 | 684 | # Record learnings 685 | for learning in processed_result["learnings"]: 686 | await progress.add_learning(query_str, current_depth, learning) 687 | 688 | new_urls = result[1] 689 | max_idx = max(visited_urls.keys()) if visited_urls else -1 690 | all_urls = { 691 | **visited_urls, 692 | **{(i + max_idx + 1): url_data for i, url_data in new_urls.items()} 693 | } 694 | 695 | # Only go deeper if in comprehensive mode and depth > 1 696 | if self.mode == "comprehensive" and current_depth > 1: 697 | # Reduced breadth for deeper levels, but increased from previous implementation 698 | # Less aggressive reduction 699 | new_breadth = min(5, math.ceil(breadth / 1.5)) 700 | new_depth = current_depth - 1 701 | 702 | # Select most important follow-up questions instead of just one 703 | if processed_result['follow_up_questions']: 704 | # Take up to 3 most relevant questions instead of just 1 705 | follow_up_questions = processed_result['follow_up_questions'][:3] 706 | 707 | # Process each sub-query 708 | for next_query in follow_up_questions: 709 | sub_results = 
await process_query( 710 | next_query, 711 | new_depth, 712 | query_str # Pass current query as parent 713 | ) 714 | 715 | # Merge the sub-results with the current results 716 | if sub_results: 717 | # Add sub-query learnings to all_urls 718 | if "visited_urls" in sub_results: 719 | for url_key, url_data in sub_results["visited_urls"].items(): 720 | if url_data['link'] not in [u['link'] for u in all_urls.values()]: 721 | max_idx = max( 722 | all_urls.keys()) if all_urls else -1 723 | all_urls[max_idx + 1] = url_data 724 | 725 | await progress.complete_query(query_str, current_depth) 726 | return { 727 | "learnings": processed_result["learnings"], 728 | "visited_urls": all_urls 729 | } 730 | 731 | except Exception as e: 732 | print(f"Error processing query {query_str}: {str(e)}") 733 | await progress.complete_query(query_str, current_depth) 734 | return { 735 | "learnings": [], 736 | "visited_urls": {} 737 | } 738 | 739 | # Process queries concurrently 740 | tasks = [process_query(q, depth, query) for q in unique_queries] 741 | results = await asyncio.gather(*tasks) 742 | 743 | # Combine results 744 | all_learnings = list(set( 745 | learning 746 | for result in results 747 | for learning in result["learnings"] 748 | )) 749 | 750 | all_urls = {} 751 | current_idx = 0 752 | seen_urls = set() 753 | for result in results: 754 | for url_data in result["visited_urls"].values(): 755 | if url_data['link'] not in seen_urls: 756 | all_urls[current_idx] = url_data 757 | seen_urls.add(url_data['link']) 758 | current_idx += 1 759 | 760 | # Complete the root query after all sub-queries are done 761 | await progress.complete_query(query, depth) 762 | 763 | # Build the research tree 764 | research_tree = progress._build_research_tree() 765 | 766 | print(f"Research tree built with {len(all_learnings)} learnings") 767 | # save the research tree to a file 768 | with open("research_tree.json", "w") as f: 769 | json.dump(research_tree, f) 770 | 771 | return { 772 | "learnings": all_learnings, 773 | "visited_urls": all_urls, 774 | "tree": research_tree # Return the tree structure 775 | } 776 | 777 | async def generate_final_report(self, query: str, learnings: list[str], visited_urls: dict[int, dict]) -> str: 778 | # Format learnings for the prompt 779 | learnings_text = "\n".join([ 780 | f"- {learning}" for learning in learnings 781 | ]) 782 | 783 | user_prompt = f""" 784 | You are a creative storyteller tasked with transforming research into an engaging and distinctive report. 785 | 786 | Research Query: {query} 787 | 788 | Key Discoveries: 789 | {learnings_text} 790 | 791 | Craft a captivating report that: 792 | 793 | # CREATIVE APPROACH 794 | 1. Opens with an imaginative introduction that draws readers into the topic 795 | 2. Transforms key discoveries into a compelling narrative with your unique voice 796 | 3. Adds fresh perspectives and unexpected connections between ideas 797 | 4. Experiments freely with tone, style, and expression 798 | 5. 
Concludes with thought-provoking reflections that linger with the reader
799 | 
800 |         # FORMATTING TOOLS
801 |         - Create evocative, imaginative headings
802 |         - Use markdown formatting (##, ###, **bold**, *italics*) for visual interest
803 |         - Incorporate blockquotes for emphasis or contrast
804 |         - Deploy bullet points or numbered lists where they enhance clarity
805 |         - Insert tables to organize information in visually appealing ways
806 |         - Use horizontal rules (---) to create dramatic pauses or section breaks
807 | 
808 |         Feel free to be bold, experimental, and expressive while maintaining clarity and coherence. There are no academic conventions to follow - let your creativity flow!
809 |         """
810 | 
811 |         generation_config = {
812 |             "temperature": 0.9,  # Increased for more creativity
813 |             "top_p": 0.95,
814 |             "top_k": 40,
815 |             "max_output_tokens": 8192,
816 |         }
817 | 
818 |         try:
819 |             response = await self.client.aio.models.generate_content(
820 |                 model="gemini-2.0-flash",
821 |                 contents=user_prompt,
822 |                 config=generation_config
823 |             )
824 |             report = response.text
825 | 
826 |             # Use the tracked sources: append the consulted URLs so the report lists its references
827 |             if visited_urls:
828 |                 sources = "\n".join(
829 |                     f"- [{data.get('title') or data['link']}]({data['link']})"
830 |                     for data in visited_urls.values() if data.get('link')
831 |                 )
832 |                 report += f"\n\n## Sources\n{sources}"
833 | 
834 |             return report
835 |         except Exception as e:
836 |             print(f"Error generating final report: {str(e)}")
837 |             return f"Error generating report: {str(e)}"
838 | 
--------------------------------------------------------------------------------