├── .gitignore
├── README.md
├── app.py
├── backend
    ├── __init__.py
    ├── classes
    │   ├── __init__.py
    │   ├── classes.py
    │   └── research_state.py
    ├── graph.py
    ├── nodes
    │   ├── __init__.py
    │   ├── cluster.py
    │   ├── enrich_docs.py
    │   ├── eval.py
    │   ├── generate_report.py
    │   ├── initial_grounding.py
    │   ├── manual_cluster_select.py
    │   ├── publish.py
    │   ├── research.py
    │   └── sub_questions.py
    └── utils
    │   ├── routing_helper.py
    │   └── utils.py
├── frontend
    ├── static
    │   ├── script.js
    │   └── styles.css
    └── templates
    │   └── index.html
├── langgraph.json
├── langgraph_entry.py
└── requirements.txt


/.gitignore:
--------------------------------------------------------------------------------
1 | .env
2 | .venv
3 | *.py[cod]
4 | __pycache__/
5 | .DS_Store
6 | 


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
  1 | # Company Researcher with Tavily and LangGraph
  2 | 
  3 | The **Company Researcher** is an open-source tool designed for in-depth company analysis. Built with **Tavily’s `search` and `extract` capabilities** and powered by **LangGraph**, it delivers precise, real-time insights in a structured format. Ideal for competitive intelligence, lead research, and Go-to-Market (GTM) strategies, this tool leverages advanced AI-driven workflows to provide comprehensive, reliable reports for data-driven decision-making.
  4 | 
  5 | ## Table of Contents
  6 | 1. [Overview](#overview)
  7 | 2. [Key Workflow Features](#key-workflow-features)
  8 | 3. [Running the Tool Locally](#running-the-tool-locally)
  9 |    - [Prerequisites](#prerequisites)
 10 |    - [Installation](#installation)
 11 |    - [Running the Application](#running-the-application)
 12 | 4. [Running the Tool in LangGraph Studio](#running-the-tool-in-langgraph-studio)
 13 | 5. [Customization](#customization)
 14 | 6. [Future Directions](#future-directions)
 15 | 
 16 | ---
 17 | 
 18 | ## Overview
 19 | 
 20 | The **Company Researcher** is an open-source tool designed for in-depth company analysis. Built with **Tavily’s search and extract capabilities** and powered by **LangGraph**, it gathers both general and targeted information, using feedback loops and optional human validation for accuracy. It is designed to handle complex scenarios, such as distinguishing similarly named companies or gathering data in sparsely documented fields, and can be easily adapted to other research domains.
 21 | 
 22 | ---
 23 | ![workflow](https://i.imgur.com/92E2kcj.jpeg)
 24 | 
 25 | ---
 26 | 
 27 | ## Key Workflow Features
 28 | 1. **Establishing a Ground Truth with Tavily Extract**: Each session begins by setting a “ground truth” with Tavily’s `extract` tool, using a user-provided company name and URL. This foundational data anchors the subsequent search, ensuring all steps stay within accurate and verified data boundaries.
 29 | 2. **Sub-Question Generation and Tavily Search**: The workflow dynamically generates specific research questions to drive Tavily’s `search`, focusing the retrieval on relevant, high-value information rather than conducting broad, unfocused searches.
 30 | 3. **AI-Driven Document Clustering**: Retrieved documents are clustered based on relevance to the target company. This process, anchored by the ground truth, filters out unrelated content, a critical feature for similarly named companies or entities with minimal online presence.
 31 | 4. **Human-on-the-Loop Validation**: In cases where clustering yields ambiguous results, optional human review allows for manual cluster selection, ensuring the data aligns accurately with the target entity.
 32 | 5. **Document Curation and Enrichment with Tavily Extract**: Once the appropriate cluster is identified, Tavily’s `extract` further refines and enriches the content, adding substantial depth to the research. This step enhances the precision and comprehensiveness of the final output.
 33 | 6. **Report Generation and Evaluation with Feedback Loops**: An LLM synthesizes the enriched data into a structured report. If gaps are detected, feedback loops prompt additional information gathering, enabling iterative improvements without restarting the entire workflow.
 34 | 7. **Multi-Format Output**: The finalized report can be exported in PDF or Markdown formats, making it ready for easy sharing and integration.
 35 | 
 36 | ---
 37 | 
 38 | ## Running the Tool Locally
 39 | 
 40 | ### Prerequisites
 41 | 
 42 | - Python 3.11 or later: [Python Installation Guide](https://www.tutorialsteacher.com/python/install-python)
 43 | - Tavily API Key - [Sign Up](https://tavily.com/)
 44 | - Anthropic API Key - [Sign Up](https://console.anthropic.com/settings/keys)
 45 | 
 46 | ### Installation
 47 | 
 48 | 1. **Clone the Repository**:
 49 | 
 50 |    ```bash
 51 |    git clone https://github.com/danielleyahalom/company-researcher.git
 52 |    cd company-researcher
 53 |    ```
 54 | 
 55 | 2. **Create a Virtual Environment**:
 56 | 
 57 |    To avoid dependency conflicts, it's recommended to create and activate a virtual environment using `venv`:
 58 | 
 59 |    ```bash
 60 |    python -m venv venv
 61 |    source venv/bin/activate    # macOS/Linux
 62 |    venv\Scripts\activate       # Windows
 63 |    ```
 64 | 
 65 | 3. **Set Up API Keys**:
 66 |    Configure your Anthropic and Tavily API keys as environment variables or place them in a `.env` file:
 67 | 
 68 |    ```bash
 69 |    export TAVILY_API_KEY={Your Tavily API Key here}
 70 |    export ANTHROPIC_API_KEY={Your Anthropic API Key here}
 71 |    ```
 72 | 
 73 | 4. **Install Dependencies**:
 74 | 
 75 |    Install the required Python packages:
 76 |    ```bash
 77 |    pip install -r requirements.txt
 78 |    ```
 79 | 
 80 | 5. **Run the Application**:
 81 | 
 82 |    ```bash
 83 |    python app.py
 84 |    ```
 85 | 
 86 | 6. **Open the App in Your Browser**:
 87 | 
 88 |    ```bash
 89 |    http://localhost:5000
 90 |    ```
 91 | 
 92 | ---
 93 | 
 94 | ## Running the Tool in LangGraph Studio
 95 | 
 96 | ---
 97 | <div align="center">
 98 |   <img src="https://i.imgur.com/FEAUhNW.png" alt="Langgraph Studio" height="500">
 99 | </div>
100 | 
101 | ---
102 | 
103 | **LangGraph Studio** enables visualization, debugging, and real-time interaction with the Company Researcher's workflow. Here’s how to set it up:
104 | 
105 | ### Prerequisites
106 | 
107 | 1. **Download LangGraph Studio**:
108 |    - For macOS, download the latest `.dmg` file for LangGraph Studio from [here](https://langgraph-studio.vercel.app/api/mac/latest) or visit the [releases page](https://github.com/langchain-ai/langgraph-studio/releases).
109 |    - **Note**: Currently, only macOS is supported.
110 | 
111 | 2. **Install Docker**:
112 |    - Ensure [Docker Desktop](https://docs.docker.com/engine/install/) is installed and running. LangGraph Studio requires Docker Compose version 2.22.0 or higher.
113 | 
114 | ### Setting Up in LangGraph Studio
115 | 
116 | 1. **Clone the Repository**:
117 |       ```bash
118 |       git clone https://github.com/danielleyahalom/company-researcher.git
119 |       cd company-researcher
120 |       ```
121 |    - **Note**: This repository includes all required files except for the `.env` file, which you need to create to store your API keys.
122 | 
123 | 2. **Configure the Environment**:
124 |    - Create a `.env` file in the root directory to store your API keys:
125 |       ```bash
126 |       touch .env
127 |       ```
128 |    - Add your API keys to the `.env` file:
129 |       ```bash
130 |       TAVILY_API_KEY={Your Tavily API Key here}
131 |       ANTHROPIC_API_KEY={Your Anthropic API Key here}
132 |       ```
133 | 
134 | 3. **Ensure LangGraph Configuration Files Are in Place**:
135 |    - The repository includes `langgraph.json` and `langgraph_entry.py`, defining the entry point and configuration for LangGraph Studio.
136 | 
137 | 4. **Start LangGraph Studio**:
138 |    - Open LangGraph Studio and select the `company-researcher` directory from the dashboard.
139 | 
140 | 5. **Running the Workflow in Studio**:
141 |    - Visualize each step of the workflow, make real-time edits, and monitor the workflow’s state.
142 |    - **Important Note**: If a cluster cannot be automatically selected, the tool will attempt to re-cluster instead.
143 | 
144 | LangGraph Studio provides a hands-on approach to refining the workflow, enhancing both development efficiency and output reliability.
145 | 
146 | ---
147 | 
148 | ## Customization
149 | 
150 | The tool’s modular structure makes it adaptable to various research applications:
151 | 
152 | - **Modify Prompts**: Adjust prompts in question generation or report synthesis for different research needs.
153 | - **Extend Workflow Nodes**: Add, remove, or modify nodes to focus on specific types of analysis.
154 | - **Customize Output Formats**: Tailor output formats (e.g., CSS for PDF styling) to suit organizational standards.
155 | 
156 | ---
157 | 
158 | ## Future Directions
159 | 
160 | This adaptable workflow can be fine-tuned for a range of applications beyond company research:
161 | 
162 | - **Market Analysis**: Apply the workflow to track trends, competitors, and emerging tech.
163 | - **Lead Generation**: Compile detailed profiles on potential clients for targeted outreach.
164 | - **Ongoing Knowledge Bases**: Build continuously updated research repositories in fields like law, finance, or healthcare.
165 | 
166 | This tool exemplifies how AI-driven workflows, backed by precise data extraction and real-time search, can reshape research and analysis across domains.
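
The README's "Set Up API Keys" step can be checked before launching the app. The following is a minimal sketch, not part of the repository: the helper name `missing_api_keys` is hypothetical, and it assumes only the two variable names the README exports (`TAVILY_API_KEY` and `ANTHROPIC_API_KEY`).

```python
import os

# The two variable names exported in the README's setup step.
REQUIRED_KEYS = ("TAVILY_API_KEY", "ANTHROPIC_API_KEY")


def missing_api_keys(env=None):
    """Return the required key names that are unset or blank.

    `env` defaults to os.environ; passing a plain dict makes the
    check easy to exercise without touching the real environment.
    """
    env = os.environ if env is None else env
    return [key for key in REQUIRED_KEYS if not env.get(key, "").strip()]
```

Calling `missing_api_keys()` after `load_dotenv('.env')` would list any key that still needs to be configured; an empty list means both keys are present.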


--------------------------------------------------------------------------------
/app.py:
--------------------------------------------------------------------------------
 1 | from fastapi import FastAPI, WebSocket, WebSocketDisconnect, Request 
 2 | from fastapi.responses import HTMLResponse
 3 | from fastapi.staticfiles import StaticFiles
 4 | from fastapi.templating import Jinja2Templates
 5 | import uvicorn
 6 | from backend.graph import Graph  # Adjust this import if necessary 
 7 | 
 8 | from dotenv import load_dotenv
 9 | load_dotenv('.env')
10 | 
11 | app = FastAPI()
12 | app.mount("/static", StaticFiles(directory="frontend/static"), name="static")
13 | templates = Jinja2Templates(directory="frontend/templates")
14 | 
15 | @app.get("/", response_class=HTMLResponse)
16 | async def index(request: Request):  # Add the type hint here
17 |     return templates.TemplateResponse("index.html", {"request": request})
18 | 
19 | @app.websocket("/ws")
20 | async def websocket_endpoint(websocket: WebSocket):
21 |     await websocket.accept()
22 |     try:
23 |         # Receive initial data from the WebSocket client
24 |         data = await websocket.receive_json()
25 |         company_name = data.get("companyName")
26 |         company_url = data.get("companyUrl")
27 |         output_format = data.get("outputFormat", "pdf")
28 |         
29 |         # Initialize the Graph with company, URL, and output format
30 |         graph = Graph(company=company_name, url=company_url, output_format=output_format, websocket=websocket)
31 |         
32 |         # Progress callback to send messages back to the client
33 |         async def progress_callback(message):
34 |             await websocket.send_text(message)
35 | 
36 |         # Run the graph process without additional arguments
37 |         await graph.run(progress_callback=progress_callback)
38 | 
39 |         await websocket.send_text("✔️ Research completed.")
40 |     except WebSocketDisconnect:
41 |         print("WebSocket disconnected")
42 |     finally:
43 |         await websocket.close()
44 | 
45 | if __name__ == "__main__":
46 |     uvicorn.run(
47 |         "app:app",
48 |         host="127.0.0.1",
49 |         port=5000,
50 |         reload=True
51 |     )
52 |    
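
The `/ws` endpoint above reads a single JSON message with the keys `companyName`, `companyUrl`, and `outputFormat` (defaulting to `"pdf"`). The sketch below shows how a client might build that first message. The helper name `build_research_request` is hypothetical and does not exist in the repository; which `outputFormat` values the backend accepts beyond the `"pdf"` default is not shown in `app.py`, so the value is passed through unvalidated.

```python
import json


def build_research_request(company_name, company_url, output_format="pdf"):
    """Serialize the first message that the /ws endpoint reads via receive_json().

    The key names mirror the data.get(...) calls in app.py.
    """
    payload = {
        "companyName": company_name,
        "companyUrl": company_url,
        "outputFormat": output_format,
    }
    return json.dumps(payload)
```

A WebSocket client would send the returned string as a text frame right after connecting, then read the progress messages the server streams back.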


--------------------------------------------------------------------------------
/backend/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/danielleyahalom/company-researcher/03f596b05f69f7b1802620385e512abb729cb87d/backend/__init__.py


--------------------------------------------------------------------------------
/backend/classes/__init__.py:
--------------------------------------------------------------------------------
1 | from .classes import TavilySearchInput, TavilyQuery, DocumentCluster, DocumentClusters, ReportEvaluation
2 | from .research_state import ResearchState
3 | 
4 | 


--------------------------------------------------------------------------------
/backend/classes/classes.py:
--------------------------------------------------------------------------------
 1 | from pydantic import BaseModel, Field
 2 | from typing import List, Optional
 3 | 
 4 | # Add Tavily's arguments to enhance the web search tool's capabilities
 5 | class TavilyQuery(BaseModel):
 6 |     query: str = Field(description="web search query")
 7 |  
 8 | 
 9 | # Define the args_schema for the tavily_search tool using a multi-query approach, enabling more precise queries for Tavily.
10 | class TavilySearchInput(BaseModel):
11 |     sub_queries: List[TavilyQuery] = Field(description="set of sub-queries that can be answered in isolation")
12 | 
13 | 
14 | # Define the structure for clustering output
15 | class DocumentCluster(BaseModel):
16 |     company_name: str = Field(
17 |         ...,
18 |         description="The name or identifier of the company these documents belong to."
19 |     )
20 |     cluster: List[str] = Field(
21 |         ...,
22 |         description="A list of URLs relevant to the identified company."
23 |     )
24 | 
25 | class DocumentClusters(BaseModel):
26 |     clusters: List[DocumentCluster] = Field(default_factory=list, description="List of document clusters")
27 | 
28 | # Define the ReportEvaluation structure
29 | class ReportEvaluation(BaseModel):
30 |     grade: int = Field(
31 |         ..., 
32 |         description="Overall grade of the report on a scale from 1 to 3 (1 = needs improvement, 3 = complete and thorough)."
33 |     )
34 |     critical_gaps: Optional[List[str]] = Field(
35 |         None, 
36 |         description="List of critical gaps to address if the grade is 1."
37 |     ) 
38 | 


--------------------------------------------------------------------------------
/backend/classes/research_state.py:
--------------------------------------------------------------------------------
 1 | from langgraph.graph import add_messages
 2 | from langchain_core.messages import AnyMessage
 3 | from . import TavilySearchInput, DocumentCluster, ReportEvaluation
 4 | from typing import TypedDict, List, Annotated, Dict, Union
 5 | 
 6 | # Import directly from each specific module within format_classes
 7 | from .classes import TavilySearchInput, DocumentCluster, ReportEvaluation
 8 | 
 9 | # Define the research state
10 | class ResearchState(TypedDict):
11 |     company: str 
12 |     company_url: str
13 |     initial_documents: Dict[str, Dict[Union[str, int], Union[str, float]]]
14 |     sub_questions: TavilySearchInput
15 |     documents: Dict[str, Dict[Union[str, int], Union[str, float]]]
16 |     document_clusters: List[DocumentCluster]
17 |     chosen_cluster: int
18 |     report: str
19 |     eval: ReportEvaluation
20 |     output_format: str
21 |     messages: Annotated[list[AnyMessage], add_messages]
22 | 
23 | class InputState(TypedDict):
24 |     company: str
25 |     company_url: str
26 | 
27 | 
28 | class OutputState(TypedDict):
29 |     report: str


--------------------------------------------------------------------------------
/backend/graph.py:
--------------------------------------------------------------------------------
  1 | from langchain_core.messages import SystemMessage, AIMessage
  2 | from functools import partial
  3 | from langgraph.graph import StateGraph
  4 | from langgraph.checkpoint.memory import MemorySaver
  5 | 
  6 | # Import research state class
  7 | from backend.classes.research_state import ResearchState, InputState, OutputState
  8 | 
  9 | # Import node classes
 10 | from backend.nodes import (
 11 |     InitialGroundingNode, 
 12 |     SubQuestionsNode, 
 13 |     ResearcherNode, 
 14 |     ClusterNode, 
 15 |     ManualSelectionNode, 
 16 |     EnrichDocsNode, 
 17 |     GenerateNode,
 18 |     EvaluationNode,
 19 |     PublishNode
 20 | )
 21 | from backend.utils.routing_helper import (
 22 |     route_based_on_cluster, 
 23 |     route_after_manual_selection, 
 24 |     should_continue_research,
 25 |     route_based_on_evaluation
 26 | )
 27 | 
 28 | class Graph:
 29 |     def __init__(self, company=None, url=None, output_format="pdf", websocket=None):
 30 |         # Initial setup of ResearchState and messages
 31 |         self.messages = [
 32 |             SystemMessage(content="You are an expert researcher ready to begin the information gathering process.")
 33 |         ]
 34 | 
 35 |         # Initialize ResearchState
 36 |         self.state = ResearchState(
 37 |             company=company,
 38 |             company_url=url,
 39 |             output_format=output_format,
 40 |             messages=self.messages
 41 |         )
 42 |         
 43 |         # Initialize nodes as attributes
 44 |         self.initial_search_node = InitialGroundingNode()
 45 |         self.sub_questions_node = SubQuestionsNode()
 46 |         self.researcher_node = ResearcherNode()
 47 |         self.cluster_node = ClusterNode()
 48 |         self.manual_selection_node = ManualSelectionNode()
 49 |         self.curate_node = EnrichDocsNode()
 50 |         self.generate_node = GenerateNode()
 51 |         self.evaluation_node = EvaluationNode()
 52 |         self.publish_node = PublishNode()
 53 | 
 54 |         # Initialize workflow for the graph
 55 |         self.workflow = StateGraph(ResearchState, input=InputState, output=OutputState)
 56 | 
 57 |         # Add nodes to the workflow
 58 |         self.workflow.add_node("initial_grounding", self.initial_search_node.run)
 59 |         self.workflow.add_node("sub_questions_gen", self.sub_questions_node.run)
 60 |         self.workflow.add_node("research", self.researcher_node.run)
 61 |         self.workflow.add_node("cluster", self.curried_node(self.cluster_node.run))
 62 |         self.workflow.add_node("manual_cluster_selection", self.curried_node(self.manual_selection_node.run))
 63 |         self.workflow.add_node("enrich_docs", self.curate_node.run)               
 64 |         self.workflow.add_node("generate_report", self.curried_node(self.generate_node.run))
 65 |         self.workflow.add_node("eval_report", self.evaluation_node.run)
 66 |         self.workflow.add_node("publish", self.publish_node.run)
 67 | 
 68 |         # Add edges to the graph
 69 |         self.workflow.add_edge("initial_grounding", "sub_questions_gen")
 70 |         self.workflow.add_edge("sub_questions_gen", "research")
 71 |         self.workflow.add_edge("research", "cluster")
 72 |         self.workflow.add_conditional_edges("cluster", route_based_on_cluster)
 73 |         self.workflow.add_conditional_edges("manual_cluster_selection", route_after_manual_selection)
 74 |         self.workflow.add_conditional_edges("enrich_docs", should_continue_research)
 75 |         self.workflow.add_edge("generate_report", "eval_report")
 76 |         self.workflow.add_conditional_edges("eval_report", route_based_on_evaluation)
 77 | 
 78 |         # Set start and end nodes
 79 |         self.workflow.set_entry_point("initial_grounding")
 80 |         self.workflow.set_finish_point("publish")
 81 | 
 82 |         self.memory = MemorySaver()
 83 |         self.websocket = websocket
 84 | 
 85 |     async def run(self, progress_callback=None):
 86 |         # Compile the graph
 87 |         graph = self.workflow.compile(checkpointer=self.memory)
 88 |         thread = {"configurable": {"thread_id": "2"}}
 89 | 
 90 |         # Execute the graph asynchronously and send progress updates
 91 |         async for s in graph.astream(self.state, thread, stream_mode="values"):
 92 |             if "messages" in s and s["messages"]:  # Check if "messages" exists and is non-empty
 93 |                 message = s["messages"][-1]
 94 |                 output_message = message.content if hasattr(message, "content") else str(message)
 95 |                 if progress_callback and not getattr(message, "is_manual_selection", False):
 96 |                     await progress_callback(output_message)
 97 | 
 98 |     def curried_node(self, node_run_method):
 99 |         # Curried wrapper for handling websocket
100 |         async def wrapper(state):
101 |             return await node_run_method(state, self.websocket)
102 |         return wrapper
103 | 
104 |     # Compile for langgraph studio
105 |     def compile(self):
106 |         # Use a consistent thread ID for state persistence
107 |         thread = {"configurable": {"thread_id": "2"}}
108 | 
109 |         # Compile the workflow with checkpointer and interrupt configuration
110 |         graph = self.workflow.compile(
111 |             checkpointer=self.memory
112 |             # interrupt_before=["manual_cluster_selection"]
113 |         )
114 |         return graph
115 | 


--------------------------------------------------------------------------------
/backend/nodes/__init__.py:
--------------------------------------------------------------------------------
1 | from .initial_grounding import InitialGroundingNode
2 | from .sub_questions import SubQuestionsNode
3 | from .research import ResearcherNode
4 | from .cluster import ClusterNode
5 | from .manual_cluster_select import ManualSelectionNode
6 | from .enrich_docs import EnrichDocsNode
7 | from .generate_report import GenerateNode
8 | from .eval import EvaluationNode
9 | from .publish import PublishNode


--------------------------------------------------------------------------------
/backend/nodes/cluster.py:
--------------------------------------------------------------------------------
  1 | from langchain_core.messages import AIMessage
  2 | from langchain_anthropic import ChatAnthropic
  3 | 
  4 | from ..classes import ResearchState,DocumentClusters
  5 | 
  6 | 
  7 | 
  8 | class ClusterNode:
  9 |     def __init__(self):
 10 |         self.model = ChatAnthropic(
 11 |             model="claude-3-5-haiku-20241022",
 12 |             temperature=0
 13 |         )
 14 | 
 15 |     async def cluster(self, state: ResearchState):
 16 |         company = state['company']
 17 |         company_url = state['company_url']
 18 |         initial_docs = state['initial_documents']
 19 |         documents = state.get('documents', {})
 20 |    
 21 |         # Extract company domain from URL
 22 |         target_domain = company_url.split("//")[-1].split("/")[0]
 23 | 
 24 |         # Collect all retrieved documents without duplicates
 25 |         unique_urls = []
 26 |         seen_urls = set()
 27 |         for url, doc in documents.items():
 28 |             if url not in seen_urls:
 29 |                 unique_urls.append({'url': url, 'content': doc.get('content', '')})
 30 |                 seen_urls.add(url)
 31 | 
 32 |         # Pass in the first 25 URLs
 33 |         urls = unique_urls[:25]
 34 | 
 35 |         # LLM prompt to categorize documents accurately
 36 |         prompt = f"""
 37 |             We conducted a search for a company called '{company}', but the results may include documents from other companies with similar names or domains.
 38 |             Your task is to accurately categorize these retrieved documents based on which specific company they pertain to, using the initial company information as "ground truth."
 39 | 
 40 |             ### Target Company Information
 41 |             - **Company Name**: '{company}'
 42 |             - **Primary Domain**: '{target_domain}'
 43 |             - **Initial Context (Ground Truth)**: Information below should act as a verification baseline. Use it to confirm that the document content aligns directly with {company}.
 44 |             - **{initial_docs}**
 45 | 
 46 |             ### Retrieved Documents for Clustering
 47 |             Below are the retrieved documents, including URLs and brief content snippets:
 48 |             {[{'url': doc['url'], 'snippet': doc['content']} for doc in urls]}
 49 | 
 50 |             ### Clustering Instructions
 51 |             - **Primary Domain Priority**: Documents with URLs containing '{target_domain}' should be prioritized for the main cluster for '{company}'.
 52 |             - **Include Relevant Third-Party Sources**: Documents from third-party domains (e.g., news sites, industry reports) should also be included in the '{company}' cluster if they provide specific information about '{company}', reference '{target_domain}', or closely match the initial company context.
 53 |             - **Separate Similar But Distinct Domains**: Documents from similar but distinct domains (e.g., '{target_domain.replace('.com', '.io')}') should be placed in separate clusters unless they explicitly reference the target domain and align with the company's context.
 54 |             - **Handle Ambiguities Separately**: Documents that lack clear alignment with '{company}' should be placed in an "Ambiguous" cluster for further review.
 55 | 
 56 |             ### Example Output Format
 57 |             {{
 58 |                 "clusters": [
 59 |                     {{
 60 |                         "company_name": "Name of Company A",
 61 |                         "cluster": [
 62 |                             "http://example.com/doc1",
 63 |                             "http://example.com/doc2"
 64 |                         ]
 65 |                     }},
 66 |                     {{
 67 |                         "company_name": "Name of Company B",
 68 |                         "cluster": [
 69 |                             "http://example.com/doc3"
 70 |                         ]
 71 |                     }},
 72 |                     {{
 73 |                         "company_name": "Ambiguous",
 74 |                         "cluster": [
 75 |                             "http://example.com/doc4"
 76 |                         ]
 77 |                     }}
 78 |                 ]
 79 |             }}
 80 | 
 81 |             ### Key Points
 82 |             - **Focus on Relevant Content**: Documents that contain relevant references to '{company}' (even from third-party domains) should be clustered with '{company}' if they align well with the initial information and context provided.
 83 |             - **Identify Ambiguities**: Any documents without clear relevance to '{company}' should be placed in the "Ambiguous" cluster for manual review.
 84 |         """
 85 | 
 86 |         # LLM call with structured output using DocumentClusters
 87 |         messages = [("system", f"Your job is to generate clusters for the company: '{company}'.\n"),
 88 |                 ("human", prompt)]
 89 |         
 90 |         msg = ""
 91 |         try:
 92 |             # Use the model's structured output with DocumentClusters format
 93 |             response = await self.model.with_structured_output(DocumentClusters).ainvoke(messages)
 94 |             clusters = response.clusters  # Access the structured clusters directly
 95 |       
 96 |         except Exception as e:
 97 |             msg = f"Error: {str(e)}\n"
 98 |             clusters = []
 99 | 
100 | 
101 |         # Summarize the results
102 |         if not clusters:
103 |             msg += "No valid clusters generated. Please check the document formats.\n"
104 |         else:
105 |             msg += "Clusters generated successfully:\n"
106 |             urls = set()
107 |             for idx, cluster in enumerate(clusters, start=1):
108 |                 msg += f"   📂 Company {idx}: {cluster.company_name}\n"
109 |                 for url in cluster.cluster:
110 |                     domain = url.split("://")[-1].split("/")[0]
111 |                     if domain not in urls:
112 |                         urls.add(domain)
113 |                         msg += f"       📄 {domain}\n"
114 |         
115 |         return {"messages": [AIMessage(content=msg)], "document_clusters": clusters}
116 |     
117 |     # Define the function to choose the correct cluster as a conditional edge
118 |     async def choose_cluster(self, state: ResearchState):
119 |         company_url = state['company_url']
120 |         clusters = state['document_clusters']
121 | 
122 |         # Attempt to automatically choose the correct cluster
123 |         for index,cluster in enumerate(clusters):
124 |             # Check if any URL in the cluster starts with the company URL
125 |             if any(url.startswith(company_url) for url in cluster.cluster):
126 |                 # state['chosen_cluster'] = index
127 |                 msg = f"Automatically selected cluster for '{company_url}' as {cluster.company_name}."
128 |                 return {"messages": [AIMessage(content=msg)], "chosen_cluster": index}
129 | 
130 |         # If no automatic match, indicate that manual selection is needed
131 |         msg = "No automatic cluster match found. Please select the correct cluster manually."
132 |         return {"messages": [AIMessage(content=msg)], "document_clusters": clusters, "chosen_cluster": None}
133 | 
134 |     async def run(self, state: ResearchState, websocket):
135 |         if websocket:
136 |             await websocket.send_text("🔄 Beginning clustering process...")
137 | 
138 |         cluster_result = await self.cluster(state)
139 |         state['document_clusters'] = cluster_result['document_clusters'] 
140 |         choose_cluster_result = await self.choose_cluster(state)
141 |         result = {'chosen_cluster': choose_cluster_result['chosen_cluster']}
142 |         result.update(cluster_result)
143 |         return result
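
`cluster()` derives the target domain with string splitting, and `choose_cluster()` matches URLs with `startswith(company_url)`, which is sensitive to differences such as `http` vs `https` or a leading `www.`. The following is a sketch of a more tolerant comparison using the standard library. It is an alternative illustration, not the repository's method, and the helper names `extract_domain` and `same_site` are hypothetical.

```python
import re
from urllib.parse import urlparse

_SCHEME = re.compile(r"^[a-zA-Z][a-zA-Z0-9+.-]*://")


def extract_domain(url):
    """Return the lowercase host of a URL, tolerating a missing scheme."""
    candidate = url.strip()
    if not _SCHEME.match(candidate) and not candidate.startswith("//"):
        # Without a scheme, urlparse treats the text as a path;
        # prefixing "//" makes it parse as a network location.
        candidate = "//" + candidate
    host = urlparse(candidate).hostname or ""
    # Treat "www.example.com" and "example.com" as the same site.
    return host[4:] if host.startswith("www.") else host


def same_site(url, company_url):
    """True when both URLs resolve to the same non-empty host."""
    domain = extract_domain(url)
    return domain != "" and domain == extract_domain(company_url)
```

With helpers like these, the automatic selection could test `same_site(url, company_url)` for each URL in a cluster instead of a raw prefix match; whether that looser match is desirable depends on how strictly subdomains should be separated.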


--------------------------------------------------------------------------------
/backend/nodes/enrich_docs.py:
--------------------------------------------------------------------------------
 1 | from langchain_core.messages import AIMessage
 2 | from tavily import AsyncTavilyClient
 3 | import os
 4 | from ..classes import ResearchState
 5 | 
 6 | class EnrichDocsNode:
 7 |     """
 8 |     Curates documents based on the selected cluster stored in `chosen_cluster`,
 9 |     then enriches the content with Tavily Extract for more detailed information.
10 |     """
11 |     def __init__(self):
12 |         self.tavily_client = AsyncTavilyClient(api_key=os.getenv("TAVILY_API_KEY"))
13 |     async def curate(self, state: ResearchState):
14 |         chosen_cluster_index = state['chosen_cluster']
15 |         clusters = state['document_clusters']
16 |         chosen_cluster = clusters[chosen_cluster_index]
17 |         msg = f"🚀 Enriching documents for selected cluster '{chosen_cluster.company_name}'...\n"
18 | 
19 |         # Filter `documents` to include only those in the chosen cluster
20 |         selected_docs = {url: state['documents'][url] for url in chosen_cluster.cluster if url in state['documents']}
21 | 
22 |         # Limit to first 15 URLs 
23 |         urls_to_extract = list(selected_docs.keys())[:15]
24 |         
25 |         # Enrich the content using Tavily Extract
26 |         try:
27 |             extracted_content = await self.tavily_client.extract(urls=urls_to_extract)
28 |             enriched_docs = {}
29 |             
30 |             # Update `documents` with enriched content from Tavily Extract
31 |             for item in extracted_content["results"]:
32 |                 url = item['url']
33 |                 if url in selected_docs:
34 |                     enriched_docs[url] = {
35 |                         **selected_docs[url],  # Existing doc data
36 |                         "raw_content": item.get("raw_content", ""),
37 |                         "extracted_details": item.get("details", {}),
38 |                     }
39 |             
40 |             state['documents'] = enriched_docs  # Update documents with enriched data
41 | 
42 |         except Exception as e:
43 |             msg += f"Error occurred during Tavily Extract: {str(e)}\n"
44 |             msg += f"Extracted URLs: {urls_to_extract}\n"  # Log the urls_to_extract if error
45 | 
46 |         return {"messages": [AIMessage(content=msg)], "documents": state['documents']}
47 |     
48 |     async def run(self, state: ResearchState):
49 |         result = await self.curate(state)
50 |         return result
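
The merge step in `curate` can be sketched in isolation as below. This is a minimal illustration, not the node itself: `merge_extracted` and the sample data are hypothetical, and the extract response is assumed to be a list of dicts with `url` and `raw_content` keys, as the loop above reads them.

```python
def merge_extracted(selected_docs: dict, extract_results: list) -> dict:
    """Combine existing document data with extracted raw content, keyed by URL."""
    enriched = {}
    for item in extract_results:
        url = item["url"]
        # Only URLs that belong to the chosen cluster are kept.
        if url in selected_docs:
            enriched[url] = {
                **selected_docs[url],  # existing search-result fields
                "raw_content": item.get("raw_content", ""),
            }
    return enriched


selected = {
    "https://example.com/a": {"title": "A"},
    "https://example.com/b": {"title": "B"},
}
results = [{"url": "https://example.com/a", "raw_content": "full text of A"}]
merged = merge_extracted(selected, results)
print(merged)
```

Note that a URL missing from the extract results (here `https://example.com/b`) is dropped from the merged dictionary. If keeping such documents is preferred, one option is to start from a copy of `selected_docs` and overwrite only the entries that were extracted.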


--------------------------------------------------------------------------------
/backend/nodes/eval.py:
--------------------------------------------------------------------------------
 1 | from langchain_core.messages import AIMessage
 2 | from ..classes import ResearchState, TavilySearchInput, TavilyQuery, ReportEvaluation
 3 | from langchain_anthropic import ChatAnthropic
 4 | 
 5 | 
 6 | class EvaluationNode:
 7 |     def __init__(self):
 8 |         self.model = ChatAnthropic(
 9 |             model="claude-3-5-haiku-20241022",
10 |             temperature=0
11 |         )
12 | 
13 |     # Evaluation function assigns an overall grade from 1 to 3.
14 |     async def evaluate_report(self, state: ResearchState):
15 |         """
16 |         Evaluates the generated report by assigning an overall grade from 1 to 3.
17 |         If the grade is 1, includes critical gaps in the output.
18 |         """
19 |         prompt = f"""
20 |             You have created a report on '{state['company']}' based on the gathered information.
21 |             Grade the report on a scale of 1 to 3 based on completeness, accuracy, and depth of information:
22 |             - **3** indicates a thorough and well-supported report with no major gaps.
23 |             - **2** indicates adequate coverage, but could be improved.
24 |             - **1** indicates significant gaps or missing essential sections.
25 | 
26 |             If the grade is 1, specify any critical gaps that need addressing.
27 |             
28 |             Here is the report for evaluation:
29 |             {state['report']}
30 |         """
31 | 
32 |         # Invoke the model for report evaluation
33 | 
34 |         messages = [("system", "Your task is to evaluate a report on a scale of 1 to 3."),
35 |                     ("human", prompt)]
36 |         evaluation = await self.model.with_structured_output(ReportEvaluation).ainvoke(messages)
37 |         
38 |         # Determine if additional questions are needed based on grade
39 |         if evaluation.grade == 1:
40 |             msg = f"❌ The report received a grade of 1. Critical gaps identified: {', '.join(evaluation.critical_gaps or ['None specified'])}"
41 |             # Create new sub-questions for critical gaps
42 |             new_sub_queries = [
43 |                 TavilyQuery(query=f"Gather information on {gap} for {state['company']}", topic="general", days=30)
44 |                 for gap in evaluation.critical_gaps or []
45 |             ]
46 |             if state.get('sub_questions'):  # also covers a None left by a failed sub-question step
47 |                 state['sub_questions'].sub_queries.extend(new_sub_queries)
48 |             else:
49 |                 state['sub_questions'] = TavilySearchInput(sub_queries=new_sub_queries)
50 |             return {"messages": [AIMessage(content=msg)], "eval": evaluation, "sub_questions": state['sub_questions']}
51 |         else:
52 |             msg = f"✅ The report received a grade of {evaluation.grade}/3 and is marked as complete."
53 |             return {"messages": [AIMessage(content=msg)], "eval": evaluation}
54 | 
55 |     async def run(self, state: ResearchState):
56 |         result = await self.evaluate_report(state)
57 |         return result
58 | 
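
The way a grade of 1 turns critical gaps into follow-up queries can be sketched without the model call. `FollowUpQuery` is a hypothetical stand-in for `TavilyQuery` (whose real definition lives in `backend/classes`), and `follow_up_queries` is an illustrative helper, not part of the repository.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class FollowUpQuery:
    # Hypothetical stand-in for TavilyQuery, with the same fields used above.
    query: str
    topic: str = "general"
    days: int = 30


def follow_up_queries(company: str, grade: int,
                      critical_gaps: Optional[List[str]]) -> List[FollowUpQuery]:
    """Return one follow-up query per critical gap, but only for a grade of 1."""
    if grade != 1:
        return []
    return [
        FollowUpQuery(query=f"Gather information on {gap} for {company}")
        for gap in critical_gaps or []
    ]


queries = follow_up_queries("Acme", 1, ["funding history", "leadership"])
for q in queries:
    print(q.query)
```

Grades 2 and 3 yield no extra queries, which matches the branch that marks the report as complete.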


--------------------------------------------------------------------------------
/backend/nodes/generate_report.py:
--------------------------------------------------------------------------------
 1 | from datetime import datetime
 2 | from langchain_core.messages import AIMessage
 3 | from langchain_anthropic import ChatAnthropic
 4 | from ..classes import ResearchState
 5 | 
 6 | 
 7 | 
 8 | class GenerateNode:
 9 |     def __init__(self):
10 |         self.model = ChatAnthropic(
11 |             model="claude-3-5-haiku-20241022",
12 |             temperature=0
13 |         )
14 |     def extract_markdown_content(self, content):
15 |         # Strip out extra preamble or conversational text, retaining only Markdown.
16 |         start_index_hash = content.find("#")
17 |         start_index_bold = content.find("**")
18 |         
19 |         if start_index_hash != -1 and (start_index_bold == -1 or start_index_hash < start_index_bold):
20 |             # '#' found and it comes before '**' (or '**' not found)
21 |             return content[start_index_hash:].strip()
22 |         elif start_index_bold != -1:
23 |             # '**' found
24 |             return content[start_index_bold:].strip()
25 |         else:
26 |             # Neither '#' nor '**' found, return the whole content stripped
27 |             return content.strip()
28 | 
29 |     async def generate_report(self, state: ResearchState):
30 |         report_title = f"Weekly Report on {state['company']}"
31 |         report_date = datetime.now().strftime('%B %d, %Y')
32 | 
33 |         prompt = f"""
34 |         You are an expert researcher tasked with writing a fact-based report on recent developments for the company **{state['company']}**. Write the report in Markdown format, but **do not include a title**. Each section must be written in well-structured paragraphs, not lists or bullet points.
35 |         Ensure the report includes:
36 |         - **Inline citations** as Markdown hyperlinks directly in the main sections (e.g., Company X is an innovative leader in AI ([LinkedIn](https://linkedin.com))).
37 |         - A **Citations Section** at the end that lists all URLs used.
38 | 
39 |         ### Report Structure:
40 |         1. **Executive Summary**:
41 |             - High-level overview of the company, its services, location, employee count, and achievements.
42 |             - Make sure to include the general information necessary to understand the company well including any notable achievements.
43 | 
44 |         2. **Leadership and Vision**:
45 |             - Details on the CEO and key team members, their experience, and alignment with company goals.
46 |             - Any personnel changes and their strategic impact.
47 | 
48 |         3. **Product and Service Overview**:
49 |             - Summary of current products/services, features, updates, and market fit.
50 |             - Include details from the company's website, tools, or new integrations.
51 | 
52 |         4. **Financial Performance**:
53 |             - For public companies: key metrics (e.g., revenue, market cap).
54 |             - For startups: funding rounds, investors, and milestones.
55 | 
56 |         5. **Recent Developments**:
57 |             - New product enhancements, partnerships, competitive moves, or market entries.
58 | 
59 |         6. **Citations**:
60 |             - Ensure every source cited in the report is listed in the text as Markdown hyperlinks.
61 |             - Also include a list of all URLs as Markdown hyperlinks in this section.
62 | 
63 |         ### Documents to Base the Report On:
64 |         {state['documents']}
65 |         """
66 | 
67 |         messages = [("system", "Your task is to generate a Markdown report."), ("human", prompt)]
68 | 
69 |         try:
70 |             # Invoke the model
71 |             response = await self.model.ainvoke(messages)
72 | 
73 |             # Extract the Markdown content
74 |             markdown_content = self.extract_markdown_content(response.content)
75 | 
76 |             # Add the title and date to the response
77 |             full_report = f"# {report_title}\n\n*{report_date}*\n\n{markdown_content}"
78 |             return {"messages": [AIMessage(content=f"Report generated successfully!\n{full_report}")], "report": full_report}
79 |         except Exception as e:
80 |             error_message = f"Error generating report: {str(e)}"
81 |             return {
82 |                 "messages": [AIMessage(content=error_message)],
83 |                 "report": f"# Error Generating Report\n\n*{report_date}*\n\n{error_message}"
84 |             }
85 | 
86 | 
87 |     async def run(self, state: ResearchState, websocket):
88 |         if websocket:
89 |             await websocket.send_text("⌛️ Generating report...")
90 |         result = await self.generate_report(state)
91 |         return result
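
The preamble-stripping rule in `extract_markdown_content` can be tried on its own. The sketch below is a standalone function with the same behavior, written with `min` over the markers that were found; the sample strings are made up for illustration.

```python
def extract_markdown_content(content: str) -> str:
    """Return content starting at the first '#' or '**', whichever comes first."""
    hash_idx = content.find("#")
    bold_idx = content.find("**")
    # str.find returns -1 when the marker is absent, so keep only real hits.
    found = [i for i in (hash_idx, bold_idx) if i != -1]
    if not found:
        return content.strip()
    return content[min(found):].strip()


sample = "Sure, here is the report:\n\n## Executive Summary\nAcme builds rockets."
print(extract_markdown_content(sample))
```

One limitation follows directly from the rule: a `#` or `**` inside the preamble itself (for example "issue #3") becomes the cut point, so part of the preamble would be kept.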


--------------------------------------------------------------------------------
/backend/nodes/initial_grounding.py:
--------------------------------------------------------------------------------
 1 | from langchain_core.messages import AIMessage
 2 | from tavily import AsyncTavilyClient
 3 | import os
 4 | 
 5 | from ..classes import ResearchState
 6 | 
 7 | 
 8 | class InitialGroundingNode:
 9 |     def __init__(self) -> None:
10 |         self.tavily_client = AsyncTavilyClient(api_key=os.getenv("TAVILY_API_KEY"))
11 | 
12 |     # Use Tavily Extract to get base content from provided company URL
13 |     async def initial_search(self, state: ResearchState):
14 |         msg = f"🔎 Initiating initial grounding for company '{state['company']}'...\n"
15 | 
16 |         urls = []
17 |         urls.append(state['company_url'])
18 |         state['initial_documents'] = {}
19 |         
20 |         try:
21 |             search_results = await self.tavily_client.extract(urls=urls)
22 |             for item in search_results["results"]:
23 |                 url = item['url']
24 |                 raw_content = item["raw_content"]
25 |                 state['initial_documents'][url] = {'url': url, 'raw_content': raw_content}
26 |                 # msg += f"Extracted raw content for URL: {url}\n"
27 |                 
28 |         except Exception as e:
29 |             print(f"Error occurred during Tavily Extract request: {e}")
30 |         
31 |         return {"messages": [AIMessage(content=msg)], "initial_documents": state['initial_documents']}
32 |     
33 |     async def run(self, state: ResearchState):
34 |         result = await self.initial_search(state)
35 |         return result
36 | 
37 |     


--------------------------------------------------------------------------------
/backend/nodes/manual_cluster_select.py:
--------------------------------------------------------------------------------
 1 | # Node that lets the user manually pick the correct document cluster
 2 | from langchain_core.messages import AIMessage
 3 | from langgraph.errors import NodeInterrupt
 4 | from ..classes import ResearchState
 5 | 
 6 | class ManualSelectionNode:
 7 |     async def manual_cluster_selection(self, state: ResearchState, websocket):
 8 |         clusters = state['document_clusters']
 9 |         msg = "Multiple clusters were identified. Please review the options and select the correct cluster for the target company.\n\n"
10 |         msg += "Enter '0' if none of these clusters match the target company.\n"
11 | 
12 |         if websocket:
13 |             # Send cluster options to the frontend via WebSocket
14 |             await websocket.send_text(msg)
15 | 
16 |             # Wait for user selection from WebSocket
17 |             while True:
18 |                 try:
19 |                     selection_text = await websocket.receive_text()
20 |                     selected_cluster_index = int(selection_text) - 1
21 | 
22 |                     if selected_cluster_index == -1:
23 |                         msg = "No suitable cluster found. Trying to cluster again.\n"
24 |                         await websocket.send_text(msg)
25 |                         return {"messages": [AIMessage(content=msg, is_manual_selection=True)], "chosen_cluster": selected_cluster_index}
26 |                     elif 0 <= selected_cluster_index < len(clusters):
27 |                         chosen_cluster = clusters[selected_cluster_index]
28 |                         msg = f"You selected cluster '{chosen_cluster.company_name}' as the correct cluster."
29 |                         await websocket.send_text(msg)
30 |                         return {"messages": [AIMessage(content=msg, is_manual_selection=True)], "chosen_cluster": selected_cluster_index}
31 |                     else:
32 |                         await websocket.send_text("Invalid choice. Please enter a number corresponding to the listed clusters or '0' to re-cluster.")
33 |                 except ValueError:
34 |                     await websocket.send_text("Invalid input. Please enter a valid number.")
35 |         else:
36 |             # Handle selection for studio, attempt to cluster again for now
37 |             
38 |             msg = "Manual selection needed, trying to cluster again.\n"
39 |             return {"messages": [AIMessage(content=msg, is_manual_selection=True)], "chosen_cluster": -1}
40 |             # selected_cluster_index = state.get('chosen_cluster', -1)  # Default to -1 if not set
41 |             # if selected_cluster_index == -1:
42 |             #     raise NodeInterrupt(
43 |             #         "Please input the chosen cluster index for manual selection in LangGraph Studio. "
44 |             #         "Set the chosen cluster index in the state attribute 'chosen_cluster'."
45 |             #     )
46 |             #     msg = "No suitable cluster found. Trying to cluster again.\n"
47 |             #     return {"messages": [AIMessage(content=msg, is_manual_selection=True)], "chosen_cluster": selected_cluster_index}
48 |             # elif 0 <= selected_cluster_index < len(clusters):
49 |             #     chosen_cluster = clusters[selected_cluster_index]
50 |             #     msg = f"You selected cluster '{chosen_cluster.company_name}' as the correct cluster."
51 |             #     return {"messages": [AIMessage(content=msg, is_manual_selection=True)], "chosen_cluster": selected_cluster_index}
52 |             # else:
53 |             #     msg = "Invalid cluster selection in state. Please provide a valid cluster index."
54 |             #     return {"messages": [AIMessage(content=msg)], "chosen_cluster": None}
55 |     async def run(self, state: ResearchState, websocket=None):
56 |         return await self.manual_cluster_selection(state, websocket)
57 | 
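
The input handling inside the WebSocket loop reduces to a small mapping from user text to a cluster index. The helper below is a hypothetical sketch of that mapping (`parse_cluster_choice` does not exist in the repository): input is 1-based, `0` means "re-cluster" and maps to `-1`, and anything else out of range is rejected.

```python
from typing import Optional


def parse_cluster_choice(text: str, cluster_count: int) -> Optional[int]:
    """Map user input to a 0-based index, -1 for "re-cluster", or None if invalid."""
    try:
        index = int(text) - 1
    except ValueError:
        # Non-numeric input such as "abc" or "1.5".
        return None
    if index == -1 or 0 <= index < cluster_count:
        return index
    return None


for raw in ["0", "1", "3", "4", "abc"]:
    print(repr(raw), "->", parse_cluster_choice(raw, cluster_count=3))
```

Separating the parsing from the socket I/O like this would make the rule easy to unit test, while the loop only decides whether to re-prompt.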


--------------------------------------------------------------------------------
/backend/nodes/publish.py:
--------------------------------------------------------------------------------
 1 | 
 2 | import os
 3 | from datetime import datetime
 4 | from langchain_core.messages import AIMessage
 5 | from ..utils.utils import generate_pdf_from_md
 6 | from ..classes import ResearchState
 7 | 
 8 | class PublishNode:
 9 |     def __init__(self, output_dir="reports"):
10 |         self.output_dir = output_dir
11 |         # Create the output directory if it does not already exist
12 |         os.makedirs(self.output_dir, exist_ok=True)
13 | 
14 |     async def markdown_to_pdf(self, markdown_content: str, output_path: str):
15 |         try:  
16 |             # Generate the PDF from Markdown content
17 |             generate_pdf_from_md(markdown_content, output_path)
18 |         except Exception as e:
19 |             raise RuntimeError(f"Failed to generate PDF: {e}") from e
20 | 
21 |     async def format_output(self, state: ResearchState):
22 |         report = state["report"]
23 |         output_format = state.get("output_format", "pdf")  # Default to PDF
24 | 
25 |         # Set up the directory and file paths
26 |         timestamp = datetime.now().strftime('%Y-%m-%d_%H-%M-%S')
27 |         file_base = f"{self.output_dir}/{state['company']}_Weekly_Report_{timestamp}"
28 |         
29 |         if output_format == "pdf":
30 |             pdf_file_path = f"{file_base}.pdf"
31 |             await self.markdown_to_pdf(markdown_content=report, output_path=pdf_file_path)
32 |             formatted_report = f"📥 PDF report saved at {pdf_file_path}"
33 |         else:
34 |             markdown_file_path = f"{file_base}.md"
35 |             with open(markdown_file_path, "w", encoding="utf-8") as md_file:
36 |                 md_file.write(report)
37 |             formatted_report = f"📥 Markdown report saved at {markdown_file_path}"
38 | 
39 |         return {"messages": [AIMessage(content=formatted_report)]}
40 | 
41 |     async def run(self, state: ResearchState):
42 |         result = await self.format_output(state)
43 |         return result
44 |     
45 | 
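
`format_output` interpolates the company name straight into the file path, so a name containing `/` or spaces ends up in the path unchanged. A minimal sketch of one way to guard against that is shown below. `build_report_path` is a hypothetical helper, and the choice to replace every run of other characters with `_` is an assumption, not the project's behavior.

```python
import re
from datetime import datetime
from pathlib import Path


def build_report_path(output_dir: str, company: str, extension: str,
                      now: datetime) -> Path:
    """Build a report path with a file-system-friendly company name."""
    # Keep letters, digits, '-' and '_'; collapse everything else to '_'.
    safe_company = re.sub(r"[^A-Za-z0-9_-]+", "_", company).strip("_") or "company"
    timestamp = now.strftime("%Y-%m-%d_%H-%M-%S")
    return Path(output_dir) / f"{safe_company}_Weekly_Report_{timestamp}.{extension}"


path = build_report_path("reports", "A/B Corp", "pdf", datetime(2024, 1, 2, 3, 4, 5))
print(path.as_posix())
```

Passing the timestamp in as an argument keeps the function deterministic, which makes it straightforward to test.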


--------------------------------------------------------------------------------
/backend/nodes/research.py:
--------------------------------------------------------------------------------
 1 | from langchain_core.messages import AIMessage
 2 | from tavily import AsyncTavilyClient
 3 | import os
 4 | import asyncio
 5 | from datetime import datetime
 6 | from typing import List
 7 | 
 8 | 
 9 | from ..classes import ResearchState, TavilyQuery
10 | 
11 | class ResearcherNode():
12 |     def __init__(self):
13 |         self.tavily_client = AsyncTavilyClient(api_key=os.getenv("TAVILY_API_KEY"))
14 | 
15 | 
16 |     async def tavily_search(self, sub_queries: List[TavilyQuery]):
17 |         """Perform searches for each sub-query using the Tavily search tool concurrently."""  
18 |         # Define a coroutine function to perform a single search with error handling
19 |         async def perform_search(itm):
20 |             try:
21 |                 # Add date to the query as we need the most recent results
22 |                 query_with_date = f"{itm.query} {datetime.now().strftime('%m-%Y')}"
23 |                 # Run a general-topic search capped at 7 results (`days` only applies when topic is "news")
24 |                 response = await self.tavily_client.search(query=query_with_date, topic="general", max_results=7)
25 |                 return response['results']
26 |             except Exception as e:
27 |                 # Handle any exceptions, log them, and return an empty list
28 |                 print(f"Error occurred during search for query '{itm.query}': {str(e)}")
29 |                 return []
30 |         
31 |         # Run all the search tasks in parallel
32 |         search_tasks = [perform_search(itm) for itm in sub_queries]
33 |         search_responses = await asyncio.gather(*search_tasks)
34 |         
35 |         # Combine the results from all the responses
36 |         search_results = []
37 |         for response in search_responses:
38 |             search_results.extend(response)
39 |         
40 |         return search_results
41 | 
42 | 
43 |     async def research(self, state: ResearchState):
44 |         """
45 |         Conducts a Tavily Search and stores all documents in a unified 'documents' attribute.
46 |         """
47 |         msg = "🚀 Conducting Tavily Search for the specified company...\n"
48 |         state['documents'] = {}  # Start from an empty document set on every run
49 | 
50 |         # Perform the search with this node's own client and gather results
51 |         sub_queries = state['sub_questions'].sub_queries
52 |         response = await self.tavily_search(sub_queries)
53 | 
54 |         # Process each set of search results and add to documents
55 |         for doc in response:
56 |             url = doc.get('url')
57 |             if url and url not in state['documents']:  # Avoid duplicates
58 |                 state['documents'][url] = doc
59 | 
60 |         return {"messages": [AIMessage(content=msg)], "documents": state['documents']}
61 |     
62 |     async def run(self, state: ResearchState):
63 |         result = await self.research(state)
64 |         return result
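
The fan-out and de-duplication pattern used by `tavily_search` and `research` can be run without an API key by swapping the client for a stub. Everything named `fake_search`, `safe_search`, and `gather_documents` below is hypothetical; the sketch only shows the `asyncio.gather` pattern with per-query error handling and first-wins de-duplication by URL.

```python
import asyncio
from typing import Dict, List


async def fake_search(query: str) -> List[dict]:
    """Hypothetical stand-in for a search client; fails on one query."""
    if query == "bad":
        raise RuntimeError("simulated search failure")
    return [
        {"url": f"https://example.com/{query}", "title": query},
        {"url": "https://example.com/shared", "title": "shared"},
    ]


async def safe_search(query: str) -> List[dict]:
    # A failed query contributes no results instead of aborting the batch.
    try:
        return await fake_search(query)
    except Exception as exc:
        print(f"search failed for {query!r}: {exc}")
        return []


async def gather_documents(queries: List[str]) -> Dict[str, dict]:
    # Run all searches concurrently, then keep the first document per URL.
    responses = await asyncio.gather(*(safe_search(q) for q in queries))
    documents: Dict[str, dict] = {}
    for results in responses:
        for doc in results:
            url = doc.get("url")
            if url and url not in documents:
                documents[url] = doc
    return documents


docs = asyncio.run(gather_documents(["funding", "bad", "products"]))
print(sorted(docs))
```

Catching exceptions inside each task is what lets one failing query leave the other results intact; `asyncio.gather` would otherwise propagate the first exception.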


--------------------------------------------------------------------------------
/backend/nodes/sub_questions.py:
--------------------------------------------------------------------------------
 1 | from langchain_core.messages import AIMessage
 2 | from langchain_anthropic import ChatAnthropic
 3 | from ..classes import ResearchState, TavilySearchInput
 4 | 
 5 | class SubQuestionsNode:
 6 |     def __init__(self) -> None:
 7 |         self.model = ChatAnthropic(
 8 |             model="claude-3-5-haiku-20241022",
 9 |             temperature=0
10 |         )
11 |      
12 |     # Function to generate sub-questions based on initial search data
13 |     async def generate_sub_questions(self, state: ResearchState):
14 |         try:
15 |             msg = "🤔 Generating sub-questions based on the initial search results...\n"
16 |             
17 |             if 'sub_questions_data' not in state:
18 |                 state['sub_questions_data'] = []
19 |                 
20 |             # Prompt to generate detailed sub-questions
21 |             prompt = f"""
22 |             You are an expert researcher focusing on company analysis to generate a report.
23 |             Your task is to generate 4 specific sub-questions that will provide a thorough understanding of the company: '{state['company']}'.
24 |             
25 |             ### Key Areas to Explore:
26 |             - **Company Background**: Include history, mission, headquarters location, CEO, and number of employees.
27 |             - **Products and Services**: Focus on main offerings, unique features, and target customer segments.
28 |             - **Market Position**: Address competitive standing, market reach, and industry impact.
29 |             - **Financials**: Seek recent funding, revenue milestones, financial performance, and growth indicators.
30 | 
31 |             Use the initial information provided from the company's website below to keep questions directly relevant to **{state['company']}**.
32 | 
33 |             Official URL: {state['company_url']}
34 |             Initial Company Information:
35 |             {state["initial_documents"]}
36 |             
37 |             Ensure questions are clear, specific, and well-aligned with the company's context.
38 |             """
39 |             
40 |             # Use LLM to generate sub-questions
41 |             messages = ["system","Your task is to generate sub-questions based on the initial search results.",
42 |                 ("human",f"{prompt}")]
43 | 
44 |             sub_questions = await self.model.with_structured_output(TavilySearchInput).ainvoke(messages)
45 |             
46 |         except Exception as e:
47 |             msg = f"An error occurred during sub-question generation: {str(e)}"
48 |             return {"messages": [AIMessage(content=msg)], "sub_questions": None, "initial_documents": state['initial_documents']}
49 |             
50 |         
51 |         return {"messages": [AIMessage(content=msg)], "sub_questions": sub_questions, "initial_documents": state['initial_documents']}
52 |             
53 |     async def run(self, state: ResearchState):
54 |         result = await self.generate_sub_questions(state)
55 |         return result


--------------------------------------------------------------------------------
/backend/utils/routing_helper.py:
--------------------------------------------------------------------------------
 1 | from typing import Literal
 2 | from ..classes import ResearchState
 3 | 
 4 | from langchain_core.messages import AIMessage
 5 | 
 6 | 
 7 | def route_based_on_cluster(state: ResearchState) -> Literal["enrich_docs", "manual_cluster_selection"]:
 8 |     if state.get('chosen_cluster') is not None:
 9 |         return "enrich_docs"
10 |     return "manual_cluster_selection"
11 | 
12 | def route_after_manual_selection(state: ResearchState) -> Literal["enrich_docs", "cluster"]:
13 |     if state.get('chosen_cluster') >= 0:
14 |         return "enrich_docs"
15 |     return "cluster"
16 | 
17 | def should_continue_research(state: ResearchState) -> Literal["research", "generate_report"]:
18 |     # Minimum threshold for documents
19 |     min_doc_count = 2  # Adjust this as needed 
20 |     # Check document count
21 |     if len(state["documents"]) < min_doc_count:
22 |         return "research"
23 |     return "generate_report"
24 | 
25 | # Define the conditional edge function based on report grade
26 | def route_based_on_evaluation(state: ResearchState) -> Literal["research", "publish"]:
27 |     evaluation = state.get("eval")
28 |     
29 |     # If the report has critical gaps, route back to research for additional questions; otherwise, proceed to publish
30 |     return "research" if evaluation.grade == 1 else "publish"
31 | 
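
`route_after_manual_selection` compares `state.get('chosen_cluster')` with `0`, which raises a `TypeError` if the key is missing or `None`. A defensive variant might look like the sketch below. `RoutingState` is a hypothetical, reduced stand-in for `ResearchState`, and treating a missing value as "re-cluster" is an assumption about the intended behavior.

```python
from typing import Literal, Optional, TypedDict


class RoutingState(TypedDict, total=False):
    # Hypothetical subset of the research state, used only for this sketch.
    chosen_cluster: Optional[int]


def route_after_manual_selection(state: RoutingState) -> Literal["enrich_docs", "cluster"]:
    chosen = state.get("chosen_cluster")
    # A missing or None selection is treated like -1 ("re-cluster").
    if chosen is not None and chosen >= 0:
        return "enrich_docs"
    return "cluster"


for example in ({}, {"chosen_cluster": None}, {"chosen_cluster": -1}, {"chosen_cluster": 0}):
    print(example, "->", route_after_manual_selection(example))
```

Index `0` is a valid cluster, so the check has to test for `None` explicitly rather than rely on truthiness.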


--------------------------------------------------------------------------------
/backend/utils/utils.py:
--------------------------------------------------------------------------------
 1 | 
 2 | import re
 3 | from fpdf import FPDF
 4 | 
 5 | 
 6 | class CustomPDF(FPDF):
 7 |     def __init__(self):
 8 |         super().__init__()
 9 |         self.set_left_margin(25)
10 |         self.set_right_margin(25)
11 |         self.set_auto_page_break(auto=True, margin=25)
12 | 
13 |     def footer(self):
14 |         self.set_y(-15)
15 |         self.set_font("Arial", "I", 8)
16 |         self.cell(0, 10, f"Page {self.page_no()}", 0, 0, "C")
17 | 
18 | def sanitize_content(content):
19 |     # Encode and decode to ensure consistent handling of special characters
20 |     return content.encode('utf-8', 'ignore').decode('utf-8')
21 | 
22 | def replace_problematic_characters(content):
23 |     replacements = {
24 |         '\u2013': '-',  # en dash to hyphen
25 |         '\u2014': '--',  # em dash to double hyphen
26 |         '\u2018': "'",   # left single quote to apostrophe
27 |         '\u2019': "'",   # right single quote to apostrophe
28 |         '\u201c': '"',   # left double quote to double quote
29 |         '\u201d': '"',   # right double quote to double quote
30 |         '\u2026': '...', # ellipsis
31 |         '\u2022': '*',   # bullet
32 |         '\u2122': 'TM'   # trademark symbol
33 |     }
34 |     for char, replacement in replacements.items():
35 |         content = content.replace(char, replacement)
36 |     return content
37 | 
38 | def generate_pdf_from_md(content, filename='output.pdf'):
39 |     try:
40 |         pdf = CustomPDF()
41 |         pdf.add_page()
42 |         pdf.set_font('Arial', '', 12)
43 | 
44 |         # Sanitize and replace problematic characters in content
45 |         content = sanitize_content(content)
46 |         content = replace_problematic_characters(content)
47 | 
48 |         lines = content.split('\n')
49 |         for line in lines:
50 |             if line.startswith('#'):
51 |                 header_level = min(line.count('#'), 4)
52 |                 header_text = re.sub(r'\*{2,}', '', line.strip('# ').strip())
53 |                 pdf.set_font('Arial', 'B', 18 - header_level * 2)
54 |                 pdf.ln(8 if header_level == 1 else 5)
55 |                 pdf.multi_cell(0, 10, header_text)
56 |                 pdf.set_font('Arial', '', 12)
57 |             else:
58 |                 process_markdown_line(pdf, line)
59 |                 pdf.ln(8)
60 | 
61 |         pdf.output(filename)
62 |         return f"PDF generated: {filename}"
63 | 
64 |     except Exception as e:
65 |         raise RuntimeError(f"Error generating PDF: {e}") from e
66 | 
67 | def process_markdown_line(pdf, line):
68 |     """Parses line for Markdown styling, including bold, italics, and links."""
69 |     parts = re.split(r'(\*\*.*?\*\*|\*.*?\*|\[.*?\]\(.*?\)|https?://\S+)', line)
70 |     for part in parts:
71 |         if re.match(r'\*\*.*?\*\*', part):  # Bold
72 |             text = part.strip('*')
73 |             pdf.set_font('Arial', 'B', 12)
74 |             pdf.write(10, text)
75 |         elif re.match(r'\*.*?\*', part):  # Italics
76 |             text = part.strip('*')
77 |             pdf.set_font('Arial', 'I', 12)
78 |             pdf.write(10, text)
79 |         elif re.match(r'\[.*?\]\(.*?\)', part):  # Markdown link
80 |             display_text = re.search(r'\[(.*?)\]', part).group(1)
81 |             url = re.search(r'\((.*?)\)', part).group(1)
82 |             pdf.set_text_color(0, 0, 255)
83 |             pdf.set_font('Arial', 'U', 12)
84 |             pdf.write(10, display_text, url)
85 |             pdf.set_text_color(0, 0, 0)  # Reset color
86 |             pdf.set_font('Arial', '', 12)
87 |         elif re.match(r'https?://\S+', part):  # Plain URL
88 |             url = part
89 |             pdf.set_text_color(0, 0, 255)
90 |             pdf.set_font('Arial', 'U', 12)
91 |             pdf.write(10, url, url)
92 |             pdf.set_text_color(0, 0, 0)
93 |             pdf.set_font('Arial', '', 12)
94 |         else:
95 |             pdf.set_font('Arial', '', 12)
96 |             pdf.write(10, part)
97 | 
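
To see how `process_markdown_line` breaks a line apart before styling it, the split and the branch order can be reproduced without FPDF. The regular expression is the one used above; `classify` and `tokenize` are hypothetical helpers that only label each piece instead of writing it to a PDF.

```python
import re

# Same pattern as process_markdown_line: bold, italics, Markdown link, plain URL.
PATTERN = r'(\*\*.*?\*\*|\*.*?\*|\[.*?\]\(.*?\)|https?://\S+)'


def classify(part: str) -> str:
    """Label a piece using the same check order as the PDF writer."""
    if re.match(r'\*\*.*?\*\*', part):
        return "bold"
    if re.match(r'\*.*?\*', part):
        return "italic"
    if re.match(r'\[.*?\]\(.*?\)', part):
        return "link"
    if re.match(r'https?://\S+', part):
        return "url"
    return "text"


def tokenize(line: str) -> list:
    # re.split keeps the matched pieces because the pattern is a capturing group.
    return [(classify(part), part) for part in re.split(PATTERN, line) if part]


tokens = tokenize("See **Acme** and [docs](https://acme.test) now")
for kind, part in tokens:
    print(f"{kind:6} {part!r}")
```

The bold check has to come before the italics check, because `\*.*?\*` also matches the start of a `**bold**` span.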


--------------------------------------------------------------------------------
/frontend/static/script.js:
--------------------------------------------------------------------------------
  1 | let ws;
  2 | let currentMarkdownContent = '';
  3 | 
  4 | function validateInputs() {
  5 |     const companyName = document.getElementById("companyName").value.trim();
  6 |     const companyUrl = document.getElementById("companyUrl").value.trim();
  7 |     
  8 |     if (!companyName) {
  9 |         alert("Please enter a company name");
 10 |         return false;
 11 |     }
 12 |     
 13 |     if (!companyUrl) {
 14 |         alert("Please enter a company URL");
 15 |         return false;
 16 |     }
 17 |     
 18 |     // Basic URL validation
 19 |     try {
 20 |         new URL(companyUrl);
 21 |     } catch (error) {
 22 |         alert("Please enter a valid URL (including http:// or https://)");
 23 |         return false;
 24 |     }
 25 |     
 26 |     return true;
 27 | }
 28 | function startResearch() {
 29 |     if (!validateInputs()) {
 30 |         return;
 31 |     }
 32 |     
 33 |     const progressDiv = document.getElementById("progress");
 34 |     const clusterSelectionDiv = document.getElementById("cluster-selection");
 35 |     const reportDiv = document.getElementById("report");
 36 |     const copyButton = document.getElementById("copyButton");
 37 |     
 38 |     // Clear previous content
 39 |     progressDiv.innerHTML = "";
 40 |     reportDiv.innerHTML = "";
 41 |     clusterSelectionDiv.style.display = "none";
 42 |     copyButton.style.display = "none";
 43 |     currentMarkdownContent = '';
 44 |     
 45 |     // Open WebSocket connection
 46 |     ws = new WebSocket("ws://127.0.0.1:5000/ws");
 47 |     
 48 |     ws.onmessage = function(event) {
 49 |         const message = event.data;
 50 |         if (message.includes("Please review the options and select the correct cluster")) {
 51 |             clusterSelectionDiv.style.display = "block";
 52 |         }
 53 |         // Handle final report differently
 54 |         if (message.startsWith("Report generated successfully!")) {
 55 |             currentMarkdownContent = message.replace("Report generated successfully!", "").trim();
 56 |             // Render Markdown content
 57 |             reportDiv.innerHTML = marked.parse(currentMarkdownContent);
 58 |             // Show copy button
 59 |             copyButton.style.display = "block";
 60 |             // Add progress message
 61 |             // const messageElement = document.createElement("div");
 62 |             // messageElement.className = "progress-message";
 63 |             // messageElement.textContent = "Report generated successfully!";
 64 |             // progressDiv.appendChild(messageElement);
 65 |         } else {
 66 |             // Create message element for all other messages
 67 |             const messageElement = document.createElement("div");
 68 |             messageElement.className = "progress-message";
 69 |             messageElement.textContent = message;
 70 |             progressDiv.appendChild(messageElement);
 71 |         }
 72 |         // Ensure automatic scrolling to the latest message
 73 |         requestAnimationFrame(() => {
 74 |             progressDiv.scrollTop = progressDiv.scrollHeight;
 75 |         });
 76 |     };
 77 |     
 78 |     ws.onopen = function() {
 79 |         const companyName = document.getElementById("companyName").value;
 80 |         const companyUrl = document.getElementById("companyUrl").value;
 81 |         const outputFormat = document.getElementById("outputFormat").value;
 82 |         const payload = { companyName, companyUrl, outputFormat };
 83 |         console.log("Sending WebSocket payload:", payload);
 84 |         ws.send(JSON.stringify(payload));
 85 |     };
 86 |     
 87 |     ws.onerror = function(error) {
 88 |         const messageElement = document.createElement("div");
 89 |         messageElement.className = "progress-message";
 90 |         messageElement.textContent = "Error: WebSocket connection failed."; // WebSocket error events carry no message property
 91 |         messageElement.style.borderLeftColor = "#FE363B"; // Red border for errors
 92 |         progressDiv.appendChild(messageElement);
 93 |         progressDiv.scrollTop = progressDiv.scrollHeight;
 94 |     };
 95 | }
 96 | 
 97 | function submitClusterSelection() {
 98 |     const clusterSelection = document.getElementById("cluster-input").value;
 99 |     if (ws && ws.readyState === WebSocket.OPEN && clusterSelection) {
100 |         ws.send(clusterSelection);
101 |         document.getElementById("cluster-selection").style.display = "none";
102 |         document.getElementById("cluster-input").value = "";
103 |     }
104 | }
105 | 
106 | async function copyReport() {
107 |     if (currentMarkdownContent) {
108 |         try {
109 |             await navigator.clipboard.writeText(currentMarkdownContent);
110 |             const copyButton = document.getElementById("copyButton");
111 |             const originalText = copyButton.textContent;
112 |             copyButton.textContent = "Copied!";
113 |             setTimeout(() => {
114 |                 copyButton.textContent = originalText;
115 |             }, 2000);
116 |         } catch (err) {
117 |             console.error('Failed to copy text: ', err);
118 |         }
119 |     }
120 | }


--------------------------------------------------------------------------------
/frontend/static/styles.css:
--------------------------------------------------------------------------------
  1 | * { box-sizing: border-box; margin: 0; padding: 0; font-family: Arial, sans-serif; }
  2 | 
  3 | body { 
  4 |     display: flex; 
  5 |     flex-direction: column; 
  6 |     align-items: center; 
  7 |     min-height: 100vh; 
  8 |     background-color: #F9F7F3;
  9 |     padding: 20px; 
 10 |     overflow-x: hidden; 
 11 | }
 12 | 
 13 | .main-title { 
 14 |     color: #2C3E50; 
 15 |     margin-bottom: 10px;
 16 |     font-size: 36px;
 17 |     font-weight: 600;
 18 |     letter-spacing: 0.5px;
 19 |     text-align: center;
 20 |     font-family: 'Segoe UI', Arial, sans-serif;
 21 |     position: relative;
 22 |     padding-bottom: 15px;
 23 | }
 24 | 
 25 | .main-title::after {
 26 |     content: '';
 27 |     position: absolute;
 28 |     bottom: 0;
 29 |     left: 50%;
 30 |     transform: translateX(-50%);
 31 |     width: 60px;
 32 |     height: 3px;
 33 |     background-color: #468BFF;
 34 |     border-radius: 2px;
 35 | }
 36 | 
 37 | .container { 
 38 |     display: flex; 
 39 |     gap: 20px; 
 40 |     width: 100%; 
 41 |     max-width: 1400px; 
 42 |     height: calc(100vh - 100px); 
 43 | }
 44 | 
 45 | .input-section, .progress-section, .final-report-section { 
 46 |     height: 100%; 
 47 |     padding: 20px; 
 48 |     background-color: #ffffff; 
 49 |     border: 1px solid #ddd; 
 50 |     border-radius: 8px; 
 51 |     box-shadow: 0 4px 8px rgba(0, 0, 0, 0.1); 
 52 |     display: flex; 
 53 |     flex-direction: column; 
 54 |     flex-shrink: 0; 
 55 |     overflow-y: auto; 
 56 | }
 57 | 
 58 | .section-header {
 59 |     border: 2px solid #e0e0e0;
 60 |     border-radius: 6px;
 61 |     padding: 12px 16px;
 62 |     margin: -5px -5px 15px -5px;
 63 |     background: linear-gradient(to bottom, #ffffff, #f8f9fa);
 64 |     display: flex;
 65 |     align-items: center;
 66 |     justify-content: space-between;
 67 | }
 68 | 
 69 | .section-header h2 {
 70 |     margin: 0;
 71 |     padding: 0;
 72 |     font-size: 20px;
 73 |     color: #2C3E50;
 74 |     font-weight: 600;
 75 | }
 76 | 
 77 | .input-section { flex: 1.5; flex-basis: 20%; }
 78 | 
 79 | .progress-section, .final-report-section { flex: 2; flex-basis: 40%; }
 80 | 
 81 | /* Progress section specific styles */
 82 | .progress-section { background-color: #f8f9fa; }
 83 | 
 84 | #progress { 
 85 |     flex: 1; 
 86 |     overflow-y: auto; 
 87 |     padding: 10px; 
 88 |     margin-top: 10px; 
 89 |     display: flex; 
 90 |     flex-direction: column; 
 91 |     gap: 8px; 
 92 | }
 93 | 
 94 | /* Individual progress messages */
 95 | .progress-message { 
 96 |     padding: 12px; 
 97 |     background-color: white; 
 98 |     border-left: 3px solid #468BFF; 
 99 |     border-radius: 4px; 
100 |     word-wrap: break-word; 
101 |     white-space: pre-wrap; 
102 |     box-shadow: 0 1px 3px rgba(0, 0, 0, 0.05); 
103 | }
104 | 
105 | .copy-button {
106 |     background-color: #f0f0f0;
107 |     color: #666;
108 |     border: 1px solid #ddd;
109 |     padding: 2px 6px;  /* Reduced padding */
110 |     border-radius: 3px;
111 |     cursor: pointer;
112 |     font-size: 11px;   /* Reduced font size */
113 |     transition: all 0.2s ease;
114 |     width: auto;  /* Override the global 100% button width */
115 |     height: 20px;      /* Reduced height */
116 |     line-height: 1;
117 | }
118 | 
119 | .copy-button:hover {
120 |     background-color: #e0e0e0;
121 | }
122 | 
123 | .copy-button:active {
124 |     background-color: #d0d0d0;
125 | }
126 | 
127 | .form-group { margin-bottom: 20px; }
128 | 
129 | .form-group label { 
130 |     display: block; 
131 |     margin-bottom: 8px; 
132 |     color: #555; 
133 |     font-weight: 500; 
134 | }
135 | 
136 | .form-group input, .form-group select { 
137 |     width: 100%; 
138 |     padding: 12px; 
139 |     border: 1px solid #ddd; 
140 |     border-radius: 6px; 
141 |     font-size: 14px; 
142 |     transition: all 0.2s ease; 
143 | }
144 | 
145 | .form-group input:focus, .form-group select:focus { 
146 |     outline: none; 
147 |     border-color: #468BFF; 
148 |     box-shadow: 0 0 0 3px rgba(70, 139, 255, 0.1); 
149 | }
150 | 
151 | button { 
152 |     background-color: #468BFF; 
153 |     color: white; 
154 |     border: none; 
155 |     padding: 12px 20px; 
156 |     cursor: pointer; 
157 |     border-radius: 6px; 
158 |     width: 100%; 
159 |     font-size: 16px; 
160 |     font-weight: 500; 
161 |     transition: background-color 0.2s ease; 
162 | }
163 | 
164 | button:hover { background-color: #357ABD; }
165 | 
166 | #cluster-selection { 
167 |     display: none; 
168 |     margin-top: 20px; 
169 |     padding-top: 20px; 
170 |     border-top: 1px solid #ddd; 
171 | }
172 | 
173 | /* Scrollbar styling */
174 | #progress::-webkit-scrollbar { width: 8px; }
175 | 
176 | #progress::-webkit-scrollbar-track { 
177 |     background: #f1f1f1; 
178 |     border-radius: 4px; 
179 | }
180 | 
181 | #progress::-webkit-scrollbar-thumb { 
182 |     background: #ccc; 
183 |     border-radius: 4px; 
184 | }
185 | 
186 | #progress::-webkit-scrollbar-thumb:hover { background: #999; }
187 | 
188 | 
189 | .final-report-section {
190 |     background-color: #ffffff;
191 | }
192 | 
193 | #report {
194 |     flex: 1;
195 |     overflow-y: auto;
196 |     padding: 15px 30px;
197 |     color: #2C3E50;
198 |     font-size: 15px;
199 |     line-height: 1.8;
200 | }
201 | #report h1:first-child {
202 |     margin-top: 0; 
203 | }
204 | 
205 | #report h1 {
206 |     font-size: 28px;
207 |     margin: 32px 0 20px 0;
208 |     color: #1a2634;
209 |     border-bottom: 2px solid #eee;
210 |     padding-bottom: 10px;
211 | }
212 | 
213 | #report h2 {
214 |     font-size: 24px;
215 |     margin: 28px 0 16px 0;
216 |     color: #1a2634;
217 | }
218 | 
219 | #report h3 {
220 |     font-size: 20px;
221 |     margin: 24px 0 14px 0;
222 |     color: #1a2634;
223 | }
224 | 
225 | #report p {
226 |     margin: 0 0 20px 0;
227 |     line-height: 1.8;
228 | }
229 | 
230 | #report ul, 
231 | #report ol {
232 |     margin: 0 0 20px 0;
233 |     padding-left: 24px;
234 | }
235 | 
236 | #report li {
237 |     margin-bottom: 12px;
238 |     line-height: 1.6;
239 | }
240 | 
241 | #report blockquote {
242 |     margin: 20px 0;
243 |     padding: 10px 20px;
244 |     border-left: 4px solid #468BFF;
245 |     background-color: #f8f9fa;
246 | }
247 | 
248 | #report code {
249 |     background-color: #f6f8fa;
250 |     padding: 2px 6px;
251 |     border-radius: 4px;
252 |     font-family: monospace;
253 | }
254 | 
255 | #report pre {
256 |     background-color: #f6f8fa;
257 |     padding: 16px;
258 |     border-radius: 6px;
259 |     overflow-x: auto;
260 |     margin: 20px 0;
261 | }
262 | 
263 | #report hr {
264 |     margin: 30px 0;
265 |     border: none;
266 |     border-top: 1px solid #eee;
267 | }
268 | 
269 | #report table {
270 |     border-collapse: collapse;
271 |     width: 100%;
272 |     margin: 20px 0;
273 | }
274 | 
275 | #report th,
276 | #report td {
277 |     border: 1px solid #ddd;
278 |     padding: 12px;
279 |     text-align: left;
280 | }
281 | 
282 | #report th {
283 |     background-color: #f8f9fa;
284 | }
285 | 
286 | #report tr:nth-child(even) {
287 |     background-color: #f8f9fa;
288 | }


--------------------------------------------------------------------------------
/frontend/templates/index.html:
--------------------------------------------------------------------------------
 1 | <!DOCTYPE html>
 2 | <html lang="en">
 3 | <head>
 4 |     <meta charset="UTF-8">
 5 |     <meta name="viewport" content="width=device-width, initial-scale=1.0">
 6 |     <title>Company Research Tool</title>
 7 |     <link rel="stylesheet" href="/static/styles.css">
 8 |     <script src="https://cdn.jsdelivr.net/npm/marked/marked.min.js"></script>
 9 | </head>
10 | <body>
11 |     <h1 class="main-title">Company Research Tool</h1>
12 |     
13 |     <div class="container">
14 |         <div class="input-section">
15 |             <div class="section-header">
16 |                 <h2>User Input</h2>
17 |             </div>
18 |             <div class="form-group">
19 |                 <label for="companyName">Company Name</label>
20 |                 <input type="text" id="companyName" placeholder="Enter company name">
21 |             </div>
22 |             
23 |             <div class="form-group">
24 |                 <label for="companyUrl">Company URL</label>
25 |                 <input type="url" id="companyUrl" placeholder="Enter company website">
26 |             </div>
27 |             
28 |             <div class="form-group">
29 |                 <label for="outputFormat">Output Format</label>
30 |                 <select id="outputFormat">
31 |                     <option value="pdf">PDF</option>
32 |                     <option value="markdown">Markdown</option>
33 |                 </select>
34 |             </div>
35 |             
36 |             <button onclick="startResearch()">Generate Report</button>
37 |             
38 |             <div id="cluster-selection">
39 |                 <div class="form-group">
40 |                     <label for="cluster-input">Select Cluster</label>
41 |                     <input type="text" id="cluster-input" placeholder="Enter cluster number">
42 |                 </div>
43 |                 <button onclick="submitClusterSelection()">Submit Selection</button>
44 |             </div>
45 |         </div>
46 |         <div class="progress-section">
47 |             <div class="section-header">
48 |                 <h2>Progress</h2>
49 |             </div>
50 |             <div id="progress"></div>
51 |         </div>
52 |         <div class="final-report-section">
53 |             <div class="section-header">
54 |                 <h2>Final Report</h2>
55 |                 <button id="copyButton" class="copy-button" onclick="copyReport()" style="display: none;">
56 |                     Copy
57 |                 </button>
58 |             </div>
59 |             <div id="report"></div>
60 |         </div>
61 |     </div>
62 |     
63 |     <script src="/static/script.js"></script>
64 | </body>
65 | </html>


--------------------------------------------------------------------------------
/langgraph.json:
--------------------------------------------------------------------------------
 1 | {
 2 |   "dockerfile_lines": [],
 3 |   "graphs": {
 4 |     "agent": "./langgraph_entry.py:graph"
 5 |   },
 6 |   "env": ".env",
 7 |   "python_version": "3.11",
 8 |   "dependencies": [
 9 |     "."
10 |   ]
11 | }


--------------------------------------------------------------------------------
/langgraph_entry.py:
--------------------------------------------------------------------------------
1 | # langgraph_entry.py
2 | from backend.graph import Graph  # Adjust if your Graph class is in a different path
3 | 
4 | graph = Graph().compile()


--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
 1 | langgraph==0.2.50
 2 | tavily-python==0.5.0
 3 | langchain==0.3.7
 4 | langchain-anthropic==0.3.0
 5 | pypandoc==1.14
 6 | pandoc==2.4
 7 | tldextract==5.1.3
 8 | fpdf==1.7.2
 9 | flask==3.1.0
10 | fastapi==0.115.5
11 | uvicorn==0.32.0
12 | websockets==14.1
13 | python-multipart==0.0.17
14 | python-dotenv==1.0.1
15 | markdown2==2.5.1
16 | 


--------------------------------------------------------------------------------