├── .gitignore ├── README.md ├── app.py ├── backend ├── __init__.py ├── classes │ ├── __init__.py │ ├── classes.py │ └── research_state.py ├── graph.py ├── nodes │ ├── __init__.py │ ├── cluster.py │ ├── enrich_docs.py │ ├── eval.py │ ├── generate_report.py │ ├── initial_grounding.py │ ├── manual_cluster_select.py │ ├── publish.py │ ├── research.py │ └── sub_questions.py └── utils │ ├── routing_helper.py │ └── utils.py ├── frontend ├── static │ ├── script.js │ └── styles.css └── templates │ └── index.html ├── langgraph.json ├── langgraph_entry.py └── requirements.txt /.gitignore: -------------------------------------------------------------------------------- 1 | .env 2 | .venv 3 | *.py[cod] 4 | __pycache__/ 5 | .DS_Store 6 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Company Researcher with Tavily and LangGraph 2 | 3 | The **Company Researcher** is an open-source tool designed for in-depth company analysis. Built with **Tavily’s `search` and `extract` capabilities** and powered by **LangGraph**, it delivers precise, real-time insights in a structured format. Ideal for competitive intelligence, lead research, and Go-to-Market (GTM) strategies, this tool leverages advanced AI-driven workflows to provide comprehensive, reliable reports for data-driven decision-making. 4 | 5 | ## Table of Contents 6 | 1. [Overview](#overview) 7 | 2. [Key Workflow Features](#key-workflow-features) 8 | 3. [Running the Tool Locally](#running-the-tool-locally) 9 | - [Prerequisites](#prerequisites) 10 | - [Installation](#installation) 11 | - [Running the Application](#running-the-application) 12 | 4. [Running the Tool in LangGraph Studio](#running-the-tool-in-langgraph-studio) 13 | 5. [Customization](#customization) 14 | 6. 
[Future Directions](#future-directions) 15 | 16 | --- 17 | 18 | ## Overview 19 | 20 | The **Company Researcher** is an open-source tool designed for in-depth company analysis. Built with **Tavily’s search and extract capabilities** and powered by **LangGraph**, it gathers both general and targeted information, using feedback loops and optional human validation for accuracy. It is designed to handle complex scenarios, such as distinguishing similarly named companies or gathering data in sparsely documented fields, and can be easily adapted to other research domains. 21 | 22 | --- 23 | ![workflow](https://i.imgur.com/92E2kcj.jpeg) 24 | 25 | --- 26 | 27 | ## Key Workflow Features 28 | 1. **Establishing a Ground Truth with Tavily Extract**: Each session begins by setting a “ground truth” with Tavily’s `extract` tool, using a user-provided company name and URL. This foundational data anchors the subsequent search, ensuring all steps stay within accurate and verified data boundaries. 29 | 2. **Sub-Question Generation and Tavily Search**: The workflow dynamically generates specific research questions to drive Tavily’s `search`, focusing the retrieval on relevant, high-value information rather than conducting broad, unfocused searches. 30 | 3. **AI-Driven Document Clustering**: Retrieved documents are clustered based on relevance to the target company. This process, anchored by the ground truth, filters out unrelated content, a critical feature for similarly named companies or entities with minimal online presence. 31 | 4. **Human-on-the-Loop Validation**: In cases where clustering yields ambiguous results, optional human review allows for manual cluster selection, ensuring the data aligns accurately with the target entity. 32 | 5. **Document Curation and Enrichment with Tavily Extract**: Once the appropriate cluster is identified, Tavily’s `extract` further refines and enriches the content, adding substantial depth to the research. 
This step enhances the precision and comprehensiveness of the final output. 33 | 6. **Report Generation and Evaluation with Feedback Loops**: An LLM synthesizes the enriched data into a structured report. If gaps are detected, feedback loops prompt additional information gathering, enabling iterative improvements without restarting the entire workflow. 34 | 7. **Multi-Format Output**: The finalized report can be exported in PDF or Markdown formats, making it ready for easy sharing and integration. 35 | 36 | --- 37 | 38 | ## Running the Tool Locally 39 | 40 | ### Prerequisites 41 | 42 | - Python 3.11 or later: [Python Installation Guide](https://www.tutorialsteacher.com/python/install-python) 43 | - Tavily API Key - [Sign Up](https://tavily.com/) 44 | - Anthropic API Key - [Sign Up](https://console.anthropic.com/settings/keys) 45 | 46 | ### Installation 47 | 48 | 1. **Clone the Repository**: 49 | 50 | ```bash 51 | git clone https://github.com/danielleyahalom/company-researcher.git 52 | cd company-researcher 53 | ``` 54 | 55 | 2. **Create a Virtual Environment**: 56 | 57 | To avoid dependency conflicts, it's recommended to create and activate a virtual environment using `venv`: 58 | 59 | ```bash 60 | python -m venv venv 61 | source venv/bin/activate # macOS/Linux 62 | venv\Scripts\activate # Windows 63 | ``` 64 | 65 | 3. **Set Up API Keys**: 66 | Configure your Anthropic and Tavily API keys as environment variables or place them in a `.env` file: 67 | 68 | ```bash 69 | export TAVILY_API_KEY={Your Tavily API Key here} 70 | export ANTHROPIC_API_KEY={Your Anthropic API Key here} 71 | ``` 72 | 73 | 4. **Install Dependencies**: 74 | 75 | Install the required Python packages: 76 | ```bash 77 | pip install -r requirements.txt 78 | ``` 79 | 80 | 5. **Run the Application**: 81 | 82 | ```bash 83 | python app.py 84 | ``` 85 | 86 | 6. 
**Open the App in Your Browser**: 87 | 88 | ```bash 89 | http://localhost:5000 90 | ``` 91 | 92 | --- 93 | 94 | ## Running the Tool in LangGraph Studio 95 | 96 | --- 97 |
98 | Langgraph Studio 99 |
100 | 101 | --- 102 | 103 | **LangGraph Studio** enables visualization, debugging, and real-time interaction with the Company Researcher's workflow. Here’s how to set it up: 104 | 105 | ### Prerequisites 106 | 107 | 1. **Download LangGraph Studio**: 108 | - For macOS, download the latest `.dmg` file for LangGraph Studio from [here](https://langgraph-studio.vercel.app/api/mac/latest) or visit the [releases page](https://github.com/langchain-ai/langgraph-studio/releases). 109 | - **Note**: Currently, only macOS is supported. 110 | 111 | 2. **Install Docker**: 112 | - Ensure [Docker Desktop](https://docs.docker.com/engine/install/) is installed and running. LangGraph Studio requires Docker Compose version 2.22.0 or higher. 113 | 114 | ### Setting Up in LangGraph Studio 115 | 116 | 1. **Clone the Repository**: 117 | ```bash 118 | git clone https://github.com/danielleyahalom/company-researcher.git 119 | cd company-researcher 120 | ``` 121 | - **Note**: This repository includes all required files except for the `.env` file, which you need to create to store your API keys. 122 | 123 | 2. **Configure the Environment**: 124 | - Create a `.env` file in the root directory to store your API keys: 125 | ```bash 126 | touch .env 127 | ``` 128 | - Add your API keys to the `.env` file: 129 | ```bash 130 | TAVILY_API_KEY={Your Tavily API Key here} 131 | ANTHROPIC_API_KEY={Your Anthropic API Key here} 132 | ``` 133 | 134 | 3. **Ensure LangGraph Configuration Files Are in Place**: 135 | - The repository includes `langgraph.json` and `langgraph_entry.py`, defining the entry point and configuration for LangGraph Studio. 136 | 137 | 4. **Start LangGraph Studio**: 138 | - Open LangGraph Studio and select the `company-researcher` directory from the dashboard. 139 | 140 | 5. **Running the Workflow in Studio**: 141 | - Visualize each step of the workflow, make real-time edits, and monitor the workflow’s state. 
142 | - **Important Note**: If a cluster cannot be automatically selected, the tool will attempt to re-cluster instead. 143 | 144 | LangGraph Studio provides a hands-on approach to refining the workflow, enhancing both development efficiency and output reliability. 145 | 146 | --- 147 | 148 | ## Customization 149 | 150 | The tool’s modular structure makes it adaptable to various research applications: 151 | 152 | - **Modify Prompts**: Adjust prompts in question generation or report synthesis for different research needs. 153 | - **Extend Workflow Nodes**: Add, remove, or modify nodes to focus on specific types of analysis. 154 | - **Customize Output Formats**: Tailor output formats (e.g., CSS for PDF styling) to suit organizational standards. 155 | 156 | --- 157 | 158 | ## Future Directions 159 | 160 | This adaptable workflow can be fine-tuned for a range of applications beyond company research: 161 | 162 | - **Market Analysis**: Apply the workflow to track trends, competitors, and emerging tech. 163 | - **Lead Generation**: Compile detailed profiles on potential clients for targeted outreach. 164 | - **Ongoing Knowledge Bases**: Build continuously updated research repositories in fields like law, finance, or healthcare. 165 | 166 | This tool exemplifies how AI-driven workflows, backed by precise data extraction and real-time search, can reshape research and analysis across domains. 
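As a rough illustration of the "Extend Workflow Nodes" option under Customization, the sketch below shows the shape a new node could take: a class with an async `run(state)` method that returns a partial state update containing a `messages` list, which is the pattern the nodes in `backend/nodes` follow. `DemoState`, `SourceCountNode`, and `run_demo` are hypothetical names introduced only for this example, and plain strings stand in for the `AIMessage` objects the repository uses so that the snippet runs without any dependencies.

```python
import asyncio
from typing import TypedDict


class DemoState(TypedDict, total=False):
    # Hypothetical, trimmed-down stand-in for the repo's ResearchState.
    company: str
    documents: dict
    messages: list


class SourceCountNode:
    """Hypothetical extra node: reports how many documents were collected."""

    async def run(self, state: DemoState) -> dict:
        count = len(state.get("documents", {}))
        msg = f"Collected {count} document(s) for {state['company']}."
        # Assumption: the real nodes wrap this text in AIMessage;
        # a plain string keeps the sketch dependency-free.
        return {"messages": [msg]}


def run_demo() -> dict:
    # Minimal state with one collected document, keyed by URL.
    state: DemoState = {
        "company": "Tavily",
        "documents": {"https://tavily.com": {"content": "..."}},
    }
    return asyncio.run(SourceCountNode().run(state))


if __name__ == "__main__":
    print(run_demo())
```

In the actual graph, such a node would presumably be registered in `backend/graph.py` with `self.workflow.add_node(...)` and connected with `add_edge`, in the same way as the existing nodes.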
-------------------------------------------------------------------------------- /app.py: -------------------------------------------------------------------------------- 1 | from fastapi import FastAPI, WebSocket, WebSocketDisconnect, Request 2 | from fastapi.responses import HTMLResponse 3 | from fastapi.staticfiles import StaticFiles 4 | from fastapi.templating import Jinja2Templates 5 | import uvicorn 6 | from backend.graph import Graph # Adjust this import if necessary 7 | 8 | from dotenv import load_dotenv 9 | load_dotenv('.env') 10 | 11 | app = FastAPI() 12 | app.mount("/static", StaticFiles(directory="frontend/static"), name="static") 13 | templates = Jinja2Templates(directory="frontend/templates") 14 | 15 | @app.get("/", response_class=HTMLResponse) 16 | async def index(request: Request): # Add the type hint here 17 | return templates.TemplateResponse("index.html", {"request": request}) 18 | 19 | @app.websocket("/ws") 20 | async def websocket_endpoint(websocket: WebSocket): 21 | await websocket.accept() 22 | try: 23 | # Receive initial data from the WebSocket client 24 | data = await websocket.receive_json() 25 | company_name = data.get("companyName") 26 | company_url = data.get("companyUrl") 27 | output_format = data.get("outputFormat", "pdf") 28 | 29 | # Initialize the Graph with company, URL, and output format 30 | graph = Graph(company=company_name, url=company_url, output_format=output_format, websocket=websocket) 31 | 32 | # Progress callback to send messages back to the client 33 | async def progress_callback(message): 34 | await websocket.send_text(message) 35 | 36 | # Run the graph process without additional arguments 37 | await graph.run(progress_callback=progress_callback) 38 | 39 | await websocket.send_text("✔️ Research completed.") 40 | except WebSocketDisconnect: 41 | print("WebSocket disconnected") 42 | finally: 43 | await websocket.close() 44 | 45 | if __name__ == "__main__": 46 | uvicorn.run( 47 | "app:app", 48 | host="127.0.0.1", 49 | 
port=5000, 50 | reload=True 51 | ) 52 | -------------------------------------------------------------------------------- /backend/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/danielleyahalom/company-researcher/03f596b05f69f7b1802620385e512abb729cb87d/backend/__init__.py -------------------------------------------------------------------------------- /backend/classes/__init__.py: -------------------------------------------------------------------------------- 1 | from .classes import TavilySearchInput, TavilyQuery, DocumentCluster, DocumentClusters, ReportEvaluation 2 | from .research_state import ResearchState 3 | 4 | -------------------------------------------------------------------------------- /backend/classes/classes.py: -------------------------------------------------------------------------------- 1 | from pydantic import BaseModel, Field 2 | from typing import List, Optional 3 | 4 | # Add Tavily's arguments to enhance the web search tool's capabilities 5 | class TavilyQuery(BaseModel): 6 | query: str = Field(description="web search query") 7 | 8 | 9 | # Define the args_schema for the tavily_search tool using a multi-query approach, enabling more precise queries for Tavily. 10 | class TavilySearchInput(BaseModel): 11 | sub_queries: List[TavilyQuery] = Field(description="set of sub-queries that can be answered in isolation") 12 | 13 | 14 | # Define the structure for clustering output 15 | class DocumentCluster(BaseModel): 16 | company_name: str = Field( 17 | ..., 18 | description="The name or identifier of the company these documents belong to." 19 | ) 20 | cluster: List[str] = Field( 21 | ..., 22 | description="A list of URLs relevant to the identified company." 
23 | ) 24 | 25 | class DocumentClusters(BaseModel): 26 | clusters: List[DocumentCluster] = Field(default_factory=list, description="List of document clusters") 27 | 28 | # Define the ReportEvaluation structure 29 | class ReportEvaluation(BaseModel): 30 | grade: int = Field( 31 | ..., 32 | description="Overall grade of the report on a scale from 1 to 3 (1 = needs improvement, 3 = complete and thorough)." 33 | ) 34 | critical_gaps: Optional[List[str]] = Field( 35 | None, 36 | description="List of critical gaps to address if the grade is 1." 37 | ) 38 | -------------------------------------------------------------------------------- /backend/classes/research_state.py: -------------------------------------------------------------------------------- 1 | from langgraph.graph import add_messages 2 | from langchain_core.messages import AnyMessage 3 | from . import TavilySearchInput, DocumentCluster, ReportEvaluation 4 | from typing import TypedDict, List, Annotated, Dict, Union 5 | 6 | # Import directly from each specific module within format_classes 7 | from .classes import TavilySearchInput, DocumentCluster, ReportEvaluation 8 | 9 | # Define the research state 10 | class ResearchState(TypedDict): 11 | company: str 12 | company_url: str 13 | initial_documents: Dict[str, Dict[Union[str, int], Union[str, float]]] 14 | sub_questions: TavilySearchInput 15 | documents: Dict[str, Dict[Union[str, int], Union[str, float]]] 16 | document_clusters: List[DocumentCluster] 17 | chosen_cluster: int 18 | report: str 19 | eval: ReportEvaluation 20 | output_format: str 21 | messages: Annotated[list[AnyMessage], add_messages] 22 | 23 | class InputState(TypedDict): 24 | company: str 25 | company_url: str 26 | 27 | 28 | class OutputState(TypedDict): 29 | report: str -------------------------------------------------------------------------------- /backend/graph.py: -------------------------------------------------------------------------------- 1 | from langchain_core.messages import 
SystemMessage, AIMessage 2 | from functools import partial 3 | from langgraph.graph import StateGraph 4 | from langgraph.checkpoint.memory import MemorySaver 5 | 6 | # Import research state class 7 | from backend.classes.research_state import ResearchState, InputState, OutputState 8 | 9 | # Import node classes 10 | from backend.nodes import ( 11 | InitialGroundingNode, 12 | SubQuestionsNode, 13 | ResearcherNode, 14 | ClusterNode, 15 | ManualSelectionNode, 16 | EnrichDocsNode, 17 | GenerateNode, 18 | EvaluationNode, 19 | PublishNode 20 | ) 21 | from backend.utils.routing_helper import ( 22 | route_based_on_cluster, 23 | route_after_manual_selection, 24 | should_continue_research, 25 | route_based_on_evaluation 26 | ) 27 | 28 | class Graph: 29 | def __init__(self, company=None, url=None, output_format="pdf", websocket=None): 30 | # Initial setup of ResearchState and messages 31 | self.messages = [ 32 | SystemMessage(content="You are an expert researcher ready to begin the information gathering process.") 33 | ] 34 | 35 | # Initialize ResearchState 36 | self.state = ResearchState( 37 | company=company, 38 | company_url=url, 39 | output_format=output_format, 40 | messages=self.messages 41 | ) 42 | 43 | # Initialize nodes as attributes 44 | self.initial_search_node = InitialGroundingNode() 45 | self.sub_questions_node = SubQuestionsNode() 46 | self.researcher_node = ResearcherNode() 47 | self.cluster_node = ClusterNode() 48 | self.manual_selection_node = ManualSelectionNode() 49 | self.curate_node = EnrichDocsNode() 50 | self.generate_node = GenerateNode() 51 | self.evaluation_node = EvaluationNode() 52 | self.publish_node = PublishNode() 53 | 54 | # Initialize workflow for the graph 55 | self.workflow = StateGraph(ResearchState, input=InputState, output=OutputState) 56 | 57 | # Add nodes to the workflow 58 | self.workflow.add_node("initial_grounding", self.initial_search_node.run) 59 | self.workflow.add_node("sub_questions_gen", self.sub_questions_node.run) 60 | 
self.workflow.add_node("research", self.researcher_node.run) 61 | self.workflow.add_node("cluster", self.curried_node(self.cluster_node.run)) 62 | self.workflow.add_node("manual_cluster_selection", self.curried_node(self.manual_selection_node.run)) 63 | self.workflow.add_node("enrich_docs", self.curate_node.run) 64 | self.workflow.add_node("generate_report", self.curried_node(self.generate_node.run)) 65 | self.workflow.add_node("eval_report", self.evaluation_node.run) 66 | self.workflow.add_node("publish", self.publish_node.run) 67 | 68 | # Add edges to the graph 69 | self.workflow.add_edge("initial_grounding", "sub_questions_gen") 70 | self.workflow.add_edge("sub_questions_gen", "research") 71 | self.workflow.add_edge("research", "cluster") 72 | self.workflow.add_conditional_edges("cluster", route_based_on_cluster) 73 | self.workflow.add_conditional_edges("manual_cluster_selection", route_after_manual_selection) 74 | self.workflow.add_conditional_edges("enrich_docs", should_continue_research) 75 | self.workflow.add_edge("generate_report", "eval_report") 76 | self.workflow.add_conditional_edges("eval_report", route_based_on_evaluation) 77 | 78 | # Set start and end nodes 79 | self.workflow.set_entry_point("initial_grounding") 80 | self.workflow.set_finish_point("publish") 81 | 82 | self.memory = MemorySaver() 83 | self.websocket = websocket 84 | 85 | async def run(self, progress_callback=None): 86 | # Compile the graph 87 | graph = self.workflow.compile(checkpointer=self.memory) 88 | thread = {"configurable": {"thread_id": "2"}} 89 | 90 | # Execute the graph asynchronously and send progress updates 91 | async for s in graph.astream(self.state, thread, stream_mode="values"): 92 | if "messages" in s and s["messages"]: # Check if "messages" exists and is non-empty 93 | message = s["messages"][-1] 94 | output_message = message.content if hasattr(message, "content") else str(message) 95 | if progress_callback and not getattr(message, "is_manual_selection", False): 96 | 
await progress_callback(output_message) 97 | 98 | def curried_node(self, node_run_method): 99 | # Curried wrapper for handling websocket 100 | async def wrapper(state): 101 | return await node_run_method(state, self.websocket) 102 | return wrapper 103 | 104 | # Compile for langgraph studio 105 | def compile(self): 106 | # Use a consistent thread ID for state persistence 107 | thread = {"configurable": {"thread_id": "2"}} 108 | 109 | # Compile the workflow with checkpointer and interrupt configuration 110 | graph = self.workflow.compile( 111 | checkpointer=self.memory 112 | # interrupt_before=["manual_cluster_selection"] 113 | ) 114 | return graph 115 | -------------------------------------------------------------------------------- /backend/nodes/__init__.py: -------------------------------------------------------------------------------- 1 | from .initial_grounding import InitialGroundingNode 2 | from .sub_questions import SubQuestionsNode 3 | from .research import ResearcherNode 4 | from .cluster import ClusterNode 5 | from .manual_cluster_select import ManualSelectionNode 6 | from .enrich_docs import EnrichDocsNode 7 | from .generate_report import GenerateNode 8 | from .eval import EvaluationNode 9 | from .publish import PublishNode -------------------------------------------------------------------------------- /backend/nodes/cluster.py: -------------------------------------------------------------------------------- 1 | from langchain_core.messages import AIMessage 2 | from langchain_anthropic import ChatAnthropic 3 | 4 | from ..classes import ResearchState,DocumentClusters 5 | 6 | 7 | 8 | class ClusterNode: 9 | def __init__(self): 10 | self.model = ChatAnthropic( 11 | model="claude-3-5-haiku-20241022", 12 | temperature=0 13 | ) 14 | 15 | async def cluster(self, state: ResearchState): 16 | company = state['company'] 17 | company_url = state['company_url'] 18 | initial_docs = state['initial_documents'] 19 | documents = state.get('documents', {}) 20 | 21 | # 
Extract company domain from URL 22 | target_domain = company_url.split("//")[-1].split("/")[0] 23 | 24 | # Collect all retrieved documents without duplicates 25 | unique_urls = [] 26 | seen_urls = set() 27 | for url, doc in documents.items(): 28 | if url not in seen_urls: 29 | unique_urls.append({'url': url, 'content': doc.get('content', '')}) 30 | seen_urls.add(url) 31 | 32 | # Pass in the first 25 URLs 33 | urls = unique_urls[:25] 34 | 35 | # LLM prompt to categorize documents accurately 36 | prompt = f""" 37 | We conducted a search for a company called '{company}', but the results may include documents from other companies with similar names or domains. 38 | Your task is to accurately categorize these retrieved documents based on which specific company they pertain to, using the initial company information as "ground truth." 39 | 40 | ### Target Company Information 41 | - **Company Name**: '{company}' 42 | - **Primary Domain**: '{target_domain}' 43 | - **Initial Context (Ground Truth)**: Information below should act as a verification baseline. Use it to confirm that the document content aligns directly with {company}. 44 | - **{initial_docs}** 45 | 46 | ### Retrieved Documents for Clustering 47 | Below are the retrieved documents, including URLs and brief content snippets: 48 | {[{'url': doc['url'], 'snippet': doc['content']} for doc in urls]} 49 | 50 | ### Clustering Instructions 51 | - **Primary Domain Priority**: Documents with URLs containing '{target_domain}' should be prioritized for the main cluster for '{company}'. 52 | - **Include Relevant Third-Party Sources**: Documents from third-party domains (e.g., news sites, industry reports) should also be included in the '{company}' cluster if they provide specific information about '{company}', reference '{target_domain}', or closely match the initial company context. 
53 | - **Separate Similar But Distinct Domains**: Documents from similar but distinct domains (e.g., '{target_domain.replace('.com', '.io')}') should be placed in separate clusters unless they explicitly reference the target domain and align with the company's context. 54 | - **Handle Ambiguities Separately**: Documents that lack clear alignment with '{company}' should be placed in an "Ambiguous" cluster for further review. 55 | 56 | ### Example Output Format 57 | {{ 58 | "clusters": [ 59 | {{ 60 | "company_name": "Name of Company A", 61 | "cluster": [ 62 | "http://example.com/doc1", 63 | "http://example.com/doc2" 64 | ] 65 | }}, 66 | {{ 67 | "company_name": "Name of Company B", 68 | "cluster": [ 69 | "http://example.com/doc3" 70 | ] 71 | }}, 72 | {{ 73 | "company_name": "Ambiguous", 74 | "cluster": [ 75 | "http://example.com/doc4" 76 | ] 77 | }} 78 | ] 79 | }} 80 | 81 | ### Key Points 82 | - **Focus on Relevant Content**: Documents that contain relevant references to '{company}' (even from third-party domains) should be clustered with '{company}' if they align well with the initial information and context provided. 83 | - **Identify Ambiguities**: Any documents without clear relevance to '{company}' should be placed in the "Ambiguous" cluster for manual review. 84 | """ 85 | 86 | # LLM call with structured output using DocumentClusters 87 | # Each message is a (role, content) tuple; the system prompt is an f-string so '{company}' is interpolated 88 | messages = [("system", f"Your job is to generate clusters for the company: '{company}'.\n"), ("human", f"{prompt}")] 89 | 90 | msg = "" 91 | try: 92 | # Use the model's structured output with DocumentClusters format 93 | response = await self.model.with_structured_output(DocumentClusters).ainvoke(messages) 94 | clusters = response.clusters # Access the structured clusters directly 95 | 96 | except Exception as e: 97 | msg = f"Error: {str(e)}\n" 98 | clusters = [] 99 | 100 | 101 | # Summarize the results 102 | if not clusters: 103 | msg += "No valid clusters generated. 
Please check the document formats.\n" 104 | else: 105 | msg += "Clusters generated successfully:\n" 106 | urls = set() 107 | for idx, cluster in enumerate(clusters, start=1): 108 | msg += f" 📂 Company {idx}: {cluster.company_name}\n" 109 | for url in cluster.cluster: 110 | domain = url.split("://")[-1].split("/")[0] 111 | if domain not in urls: 112 | urls.add(domain) 113 | msg += f" 📄 {domain}\n" 114 | 115 | return {"messages": [AIMessage(content=msg)], "document_clusters": clusters} 116 | 117 | # Define the function to choose the correct cluster as a conditional edge 118 | async def choose_cluster(self, state: ResearchState): 119 | company_url = state['company_url'] 120 | clusters = state['document_clusters'] 121 | 122 | # Attempt to automatically choose the correct cluster 123 | for index,cluster in enumerate(clusters): 124 | # Check if any URL in the cluster starts with the company URL 125 | if any(url.startswith(company_url) for url in cluster.cluster): 126 | # state['chosen_cluster'] = index 127 | msg = f"Automatically selected cluster for '{company_url}' as {cluster.company_name}." 128 | return {"messages": [AIMessage(content=msg)], "chosen_cluster": index} 129 | 130 | # If no automatic match, indicate that manual selection is needed 131 | msg = "No automatic cluster match found. Please select the correct cluster manually." 
132 | return {"messages": [AIMessage(content=msg)], "document_clusters": clusters, "chosen_cluster": None} 133 | 134 | async def run(self, state: ResearchState, websocket): 135 | if websocket: 136 | await websocket.send_text("🔄 Beginning clustering process...") 137 | 138 | cluster_result = await self.cluster(state) 139 | state['document_clusters'] = cluster_result['document_clusters'] 140 | choose_cluster_result = await self.choose_cluster(state) 141 | result = {'chosen_cluster': choose_cluster_result['chosen_cluster']} 142 | result.update(cluster_result) 143 | return result -------------------------------------------------------------------------------- /backend/nodes/enrich_docs.py: -------------------------------------------------------------------------------- 1 | from langchain_core.messages import AIMessage 2 | from tavily import AsyncTavilyClient 3 | import os 4 | from ..classes import ResearchState 5 | 6 | class EnrichDocsNode: 7 | """ 8 | Curates documents based on the selected cluster stored in `chosen_cluster`, 9 | then enriches the content with Tavily Extract for more detailed information. 
10 | """ 11 | def __init__(self): 12 | self.tavily_client = AsyncTavilyClient(api_key=os.getenv("TAVILY_API_KEY")) 13 | async def curate(self, state: ResearchState): 14 | chosen_cluster_index = state['chosen_cluster'] 15 | clusters = state['document_clusters'] 16 | chosen_cluster = clusters[chosen_cluster_index] 17 | msg = f"🚀 Enriching documents for selected cluster '{chosen_cluster.company_name}'...\n" 18 | 19 | # Filter `documents` to include only those in the chosen cluster 20 | selected_docs = {url: state['documents'][url] for url in chosen_cluster.cluster if url in state['documents']} 21 | 22 | # Limit to first 15 URLs 23 | urls_to_extract = list(selected_docs.keys())[:15] 24 | 25 | # Enrich the content using Tavily Extract 26 | try: 27 | extracted_content = await self.tavily_client.extract(urls=urls_to_extract) 28 | enriched_docs = {} 29 | 30 | # Update `documents` with enriched content from Tavily Extract 31 | for item in extracted_content["results"]: 32 | url = item['url'] 33 | if url in selected_docs: 34 | enriched_docs[url] = { 35 | **selected_docs[url], # Existing doc data 36 | "raw_content": item.get("raw_content", ""), 37 | "extracted_details": item.get("details", {}), 38 | } 39 | 40 | state['documents'] = enriched_docs # Update documents with enriched data 41 | 42 | except Exception as e: 43 | msg += f"Error occurred during Tavily Extract: {str(e)}\n" 44 | msg += f"Extracted URLs: {urls_to_extract}\n" # Log the urls_to_extract if error 45 | 46 | return {"messages": [AIMessage(content=msg)], "documents": state['documents']} 47 | 48 | async def run(self, state: ResearchState): 49 | result = await self.curate(state) 50 | return result -------------------------------------------------------------------------------- /backend/nodes/eval.py: -------------------------------------------------------------------------------- 1 | from langchain_core.messages import AIMessage 2 | from ..classes import ResearchState, TavilySearchInput, TavilyQuery, 
ReportEvaluation 3 | from langchain_anthropic import ChatAnthropic 4 | 5 | 6 | class EvaluationNode: 7 | def __init__(self): 8 | self.model = ChatAnthropic( 9 | model="claude-3-5-haiku-20241022", 10 | temperature=0 11 | ) 12 | 13 | # Evaluation function assigns an overall grade from 1 to 3. 14 | async def evaluate_report(self, state: ResearchState): 15 | """ 16 | Evaluates the generated report by assigning an overall grade from 1 to 3. 17 | If the grade is 1, includes critical gaps in the output. 18 | """ 19 | prompt = f""" 20 | You have created a report on '{state['company']}' based on the gathered information. 21 | Grade the report on a scale of 1 to 3 based on completeness, accuracy, and depth of information: 22 | - **3** indicates a thorough and well-supported report with no major gaps. 23 | - **2** indicates adequate coverage, but could be improved. 24 | - **1** indicates significant gaps or missing essential sections. 25 | 26 | If the grade is 1, specify any critical gaps that need addressing. 27 | 28 | Here is the report for evaluation: 29 | {state['report']} 30 | """ 31 | 32 | # Invoke the model for report evaluation 33 | # Each message is a (role, content) tuple 34 | messages = [("system", "Your task is to evaluate a report on a scale of 1 to 3."), 35 | ("human", f"{prompt}")] 36 | evaluation = await self.model.with_structured_output(ReportEvaluation).ainvoke(messages) 37 | 38 | # Determine if additional questions are needed based on grade 39 | if evaluation.grade == 1: 40 | msg = f"❌ The report received a grade of 1. 
Critical gaps identified: {', '.join(evaluation.critical_gaps or ['None specified'])}" 41 | # Create new sub-questions for critical gaps 42 | new_sub_queries = [ 43 | TavilyQuery(query=f"Gather information on {gap} for {state['company']}", topic="general", days=30) 44 | for gap in evaluation.critical_gaps or [] 45 | ] 46 | if 'sub_questions' in state: 47 | state['sub_questions'].sub_queries.extend(new_sub_queries) 48 | else: 49 | state['sub_questions'] = TavilySearchInput(sub_queries=new_sub_queries) 50 | return {"messages": [AIMessage(content=msg)], "eval": evaluation, "sub_questions": state['sub_questions']} 51 | else: 52 | msg = f"✅ The report received a grade of {evaluation.grade}/3 and is marked as complete." 53 | return {"messages": [AIMessage(content=msg)], "eval": evaluation} 54 | 55 | async def run(self, state: ResearchState): 56 | result = await self.evaluate_report(state) 57 | return result 58 | -------------------------------------------------------------------------------- /backend/nodes/generate_report.py: -------------------------------------------------------------------------------- 1 | from datetime import datetime 2 | from langchain_core.messages import AIMessage 3 | from langchain_anthropic import ChatAnthropic 4 | from ..classes import ResearchState 5 | 6 | 7 | 8 | class GenerateNode: 9 | def __init__(self): 10 | self.model = ChatAnthropic( 11 | model="claude-3-5-haiku-20241022", 12 | temperature=0 13 | ) 14 | def extract_markdown_content(self, content): 15 | # Strip out extra preamble or conversational text, retaining only Markdown. 
        start_index_hash = content.find("#")
        start_index_bold = content.find("**")

        if start_index_hash != -1 and (start_index_bold == -1 or start_index_hash < start_index_bold):
            # '#' found and it comes before '**' (or '**' not found)
            return content[start_index_hash:].strip()
        elif start_index_bold != -1:
            # '**' found
            return content[start_index_bold:].strip()
        else:
            # Neither '#' nor '**' found, return the whole content stripped
            return content.strip()

    async def generate_report(self, state: ResearchState):
        report_title = f"Weekly Report on {state['company']}"
        report_date = datetime.now().strftime('%B %d, %Y')

        prompt = f"""
        You are an expert researcher tasked with writing a fact-based report on recent developments for the company **{state['company']}**. Write the report in Markdown format, but **do not include a title**. Each section must be written in well-structured paragraphs, not lists or bullet points.
        Ensure the report includes:
        - **Inline citations** as Markdown hyperlinks directly in the main sections (e.g., Company X is an innovative leader in AI ([LinkedIn](https://linkedin.com))).
        - A **Citations Section** at the end that lists all URLs used.

        ### Report Structure:
        1. **Executive Summary**:
            - High-level overview of the company, its services, location, employee count, and achievements.
            - Make sure to include the general information necessary to understand the company well including any notable achievements.

        2. **Leadership and Vision**:
            - Details on the CEO and key team members, their experience, and alignment with company goals.
            - Any personnel changes and their strategic impact.

        3. **Product and Service Overview**:
            - Summary of current products/services, features, updates, and market fit.
            - Include details from the company's website, tools, or new integrations.

        4. **Financial Performance**:
            - For public companies: key metrics (e.g., revenue, market cap).
            - For startups: funding rounds, investors, and milestones.

        5. **Recent Developments**:
            - New product enhancements, partnerships, competitive moves, or market entries.

        6. **Citations**:
            - Ensure every source cited in the report is listed in the text as Markdown hyperlinks.
            - Also include a list of all URLs as Markdown hyperlinks in this section.

        ### Documents to Base the Report On:
        {state['documents']}
        """

        messages = [("system", "Your task is to generate a Markdown report."), ("human", prompt)]

        try:
            # Invoke the model
            response = await self.model.ainvoke(messages)

            # Extract the Markdown content
            markdown_content = self.extract_markdown_content(response.content)

            # Add the title and date to the response
            full_report = f"# {report_title}\n\n*{report_date}*\n\n{markdown_content}"
            return {"messages": [AIMessage(content=f"Report generated successfully!\n{full_report}")], "report": full_report}
        except Exception as e:
            error_message = f"Error generating report: {str(e)}"
            return {
                "messages": [AIMessage(content=error_message)],
                "report": f"# Error Generating Report\n\n*{report_date}*\n\n{error_message}"
            }


    async def run(self, state: ResearchState, websocket):
        if websocket:
            await websocket.send_text("⌛️ Generating report...")
        result = await self.generate_report(state)
        return result
--------------------------------------------------------------------------------
/backend/nodes/initial_grounding.py:
--------------------------------------------------------------------------------
from langchain_core.messages import AIMessage
from tavily import AsyncTavilyClient
import os

from ..classes import ResearchState


class InitialGroundingNode:
    def __init__(self) -> None:
        self.tavily_client = AsyncTavilyClient(api_key=os.getenv("TAVILY_API_KEY"))

    # Use Tavily Extract to get base content from provided company URL
    async def initial_search(self, state: ResearchState):
        msg = f"🔎 Initiating initial grounding for company '{state['company']}'...\n"

        urls = [state['company_url']]
        state['initial_documents'] = {}

        try:
            search_results = await self.tavily_client.extract(urls=urls)
            for item in search_results["results"]:
                url = item['url']
                raw_content = item["raw_content"]
                state['initial_documents'][url] = {'url': url, 'raw_content': raw_content}
                # msg += f"Extracted raw content for URL: {url}\n"

        except Exception as e:
            print(f"Error occurred during Tavily Extract request: {e}")

        return {"messages": [AIMessage(content=msg)], "initial_documents": state['initial_documents']}

    async def run(self, state: ResearchState):
        result = await self.initial_search(state)
        return result


--------------------------------------------------------------------------------
/backend/nodes/manual_cluster_select.py:
--------------------------------------------------------------------------------
# In your node file
from langchain_core.messages import AIMessage
from langgraph.errors import NodeInterrupt
from ..classes import ResearchState

class ManualSelectionNode:
    async def manual_cluster_selection(self, state: ResearchState, websocket):
        clusters = state['document_clusters']
        msg = "Multiple clusters were identified. Please review the options and select the correct cluster for the target company.\n\n"
        msg += "Enter '0' if none of these clusters match the target company.\n"

        if websocket:
            # Send cluster options to the frontend via WebSocket
            await websocket.send_text(msg)

            # Wait for user selection from WebSocket
            while True:
                try:
                    selection_text = await websocket.receive_text()
                    # The user enters a 1-based choice; '0' maps to -1, meaning "re-cluster"
                    selected_cluster_index = int(selection_text) - 1

                    if selected_cluster_index == -1:
                        msg = "No suitable cluster found. Trying to cluster again.\n"
                        await websocket.send_text(msg)
                        return {"messages": [AIMessage(content=msg, is_manual_selection=True)], "chosen_cluster": selected_cluster_index}
                    elif 0 <= selected_cluster_index < len(clusters):
                        chosen_cluster = clusters[selected_cluster_index]
                        msg = f"You selected cluster '{chosen_cluster.company_name}' as the correct cluster."
                        await websocket.send_text(msg)
                        return {"messages": [AIMessage(content=msg, is_manual_selection=True)], "chosen_cluster": selected_cluster_index}
                    else:
                        await websocket.send_text("Invalid choice. Please enter a number corresponding to the listed clusters or '0' to re-cluster.")
                except ValueError:
                    await websocket.send_text("Invalid input. Please enter a valid number.")
        else:
            # Handle selection for studio, attempt to cluster again for now

            msg = "Manual selection needed, trying to cluster again.\n"
            return {"messages": [AIMessage(content=msg, is_manual_selection=True)], "chosen_cluster": -1}
            # selected_cluster_index = state.get('chosen_cluster', -1)  # Default to -1 if not set
            # if selected_cluster_index == -1:
            #     raise NodeInterrupt(
            #         "Please input the chosen cluster index for manual selection in LangGraph Studio. "
            #         "Set the chosen cluster index in the state attribute 'chosen_cluster'."
            #     )
            #     msg = "No suitable cluster found. Trying to cluster again.\n"
            #     return {"messages": [AIMessage(content=msg, is_manual_selection=True)], "chosen_cluster": selected_cluster_index}
            # elif 0 <= selected_cluster_index < len(clusters):
            #     chosen_cluster = clusters[selected_cluster_index]
            #     msg = f"You selected cluster '{chosen_cluster.company_name}' as the correct cluster."
            #     return {"messages": [AIMessage(content=msg, is_manual_selection=True)], "chosen_cluster": selected_cluster_index}
            # else:
            #     msg = "Invalid cluster selection in state. Please provide a valid cluster index."
            #     return {"messages": [AIMessage(content=msg)], "chosen_cluster": None}

    async def run(self, state: ResearchState, websocket=None):
        return await self.manual_cluster_selection(state, websocket)

--------------------------------------------------------------------------------
/backend/nodes/publish.py:
--------------------------------------------------------------------------------

import os
from datetime import datetime
from langchain_core.messages import AIMessage
from ..utils.utils import generate_pdf_from_md
from ..classes import ResearchState

class PublishNode:
    def __init__(self, output_dir="reports"):
        self.output_dir = output_dir
        # Create the output directory if it does not exist yet
        os.makedirs(self.output_dir, exist_ok=True)

    async def markdown_to_pdf(self, markdown_content: str, output_path: str):
        try:
            # Generate the PDF from Markdown content
            status = generate_pdf_from_md(markdown_content, output_path)
        except Exception as e:
            raise RuntimeError(f"Failed to generate PDF: {e}") from e
        # generate_pdf_from_md reports failures through its return value instead of raising
        if isinstance(status, str) and status.startswith("Error"):
            raise RuntimeError(f"Failed to generate PDF: {status}")

    async def format_output(self, state: ResearchState):
        report = state["report"]
        output_format = state.get("output_format", "pdf")  # Default to PDF

        # Set up the directory and file paths
        timestamp = datetime.now().strftime('%Y-%m-%d_%H-%M-%S')
        file_base = f"{self.output_dir}/{state['company']}_Weekly_Report_{timestamp}"

        if output_format == "pdf":
            pdf_file_path = f"{file_base}.pdf"
            await self.markdown_to_pdf(markdown_content=report, output_path=pdf_file_path)
            formatted_report = f"📥 PDF report saved at {pdf_file_path}"
        else:
            markdown_file_path = f"{file_base}.md"
            # Write as UTF-8 so non-ASCII characters in the report are preserved
            with open(markdown_file_path, "w", encoding="utf-8") as md_file:
                md_file.write(report)
            formatted_report = f"📥 Markdown report saved at {markdown_file_path}"

        return {"messages": [AIMessage(content=formatted_report)]}

    async def run(self, state: ResearchState):
        result = await self.format_output(state)
        return result


--------------------------------------------------------------------------------
/backend/nodes/research.py:
--------------------------------------------------------------------------------
from langchain_core.messages import AIMessage
from tavily import AsyncTavilyClient
import os
import asyncio
from datetime import datetime
from typing import List


from ..classes import ResearchState, TavilyQuery

class ResearcherNode:
    def __init__(self):
        self.tavily_client = AsyncTavilyClient(api_key=os.getenv("TAVILY_API_KEY"))


    async def tavily_search(self, sub_queries: List[TavilyQuery]):
        """Perform searches for each sub-query using the Tavily search tool concurrently."""
        # Define a coroutine function to perform a single search with error handling
        async def perform_search(itm):
            try:
                # Add date to the query as we need the most recent results
                query_with_date = f"{itm.query} {datetime.now().strftime('%m-%Y')}"
                # Search the "general" topic, capped at 7 results per query
                # (`days` is not passed: it is used only when topic is "news")
                response = await self.tavily_client.search(query=query_with_date, topic="general", max_results=7)
                return response['results']
            except Exception as e:
                # Handle any exceptions, log them, and return an empty list
                print(f"Error occurred during search for query '{itm.query}': {str(e)}")
                return []

        # Run all the search tasks in parallel
        search_tasks = [perform_search(itm) for itm in sub_queries]
        search_responses = await asyncio.gather(*search_tasks)

        # Combine the results from all the responses
        search_results = []
        for response in search_responses:
            search_results.extend(response)

        return search_results


    async def research(self, state: ResearchState):
        """
        Conducts a Tavily Search and stores all documents in a unified 'documents' attribute.
        """
        msg = "🚀 Conducting Tavily Search for the specified company...\n"
        state['documents'] = {}  # Start each research pass with an empty document store

        # Perform the search and gather results
        response = await self.tavily_search(state['sub_questions'].sub_queries)

        # Process each set of search results and add to documents
        for doc in response:
            url = doc.get('url')
            if url and url not in state['documents']:  # Avoid duplicates
                state['documents'][url] = doc

        return {"messages": [AIMessage(content=msg)], "documents": state['documents']}

    async def run(self, state: ResearchState):
        result = await self.research(state)
        return result
--------------------------------------------------------------------------------
/backend/nodes/sub_questions.py:
--------------------------------------------------------------------------------
from langchain_core.messages import AIMessage
from langchain_anthropic import ChatAnthropic
from ..classes import ResearchState, TavilySearchInput

class SubQuestionsNode:
    def __init__(self) -> None:
        self.model = ChatAnthropic(
            model="claude-3-5-haiku-20241022",
            temperature=0
        )

    # Function to generate sub-questions based on initial search data
    async def generate_sub_questions(self, state: ResearchState):
        try:
            msg = "🤔 Generating sub-questions based on the initial search results...\n"

            if 'sub_questions_data' not in state:
                state['sub_questions_data'] = []

            # Prompt to generate detailed sub-questions
            prompt = f"""
            You are an expert researcher focusing on company analysis to generate a report.
            Your task is to generate 4 specific sub-questions that will provide a thorough understanding of the company: '{state['company']}'.

            ### Key Areas to Explore:
            - **Company Background**: Include history, mission, headquarters location, CEO, and number of employees.
            - **Products and Services**: Focus on main offerings, unique features, and target customer segments.
            - **Market Position**: Address competitive standing, market reach, and industry impact.
            - **Financials**: Seek recent funding, revenue milestones, financial performance, and growth indicators.

            Use the initial information provided from the company's website below to keep questions directly relevant to **{state['company']}**.

            Official URL: {state['company_url']}
            Initial Company Information:
            {state["initial_documents"]}

            Ensure questions are clear, specific, and well-aligned with the company's context.
            """

            # Use LLM to generate sub-questions.
            # Each message must be a (role, content) tuple.
            messages = [
                ("system", "Your task is to generate sub-questions based on the initial search results."),
                ("human", prompt),
            ]

            sub_questions = await self.model.with_structured_output(TavilySearchInput).ainvoke(messages)

        except Exception as e:
            msg = f"An error occurred during sub-question generation: {str(e)}"
            return {"messages": [AIMessage(content=msg)], "sub_questions": None, "initial_documents": state['initial_documents']}


        return {"messages": [AIMessage(content=msg)], "sub_questions": sub_questions, "initial_documents": state['initial_documents']}

    async def run(self, state: ResearchState):
        result = await self.generate_sub_questions(state)
        return result
--------------------------------------------------------------------------------
/backend/utils/routing_helper.py:
--------------------------------------------------------------------------------
from typing import Literal
from ..classes import ResearchState


def route_based_on_cluster(state: ResearchState) -> Literal["enrich_docs", "manual_cluster_selection"]:
    if state.get('chosen_cluster') is not None:
        return "enrich_docs"
    return "manual_cluster_selection"

def route_after_manual_selection(state: ResearchState) -> Literal["enrich_docs", "cluster"]:
    # Guard against a missing selection: comparing None with 0 would raise a TypeError
    chosen_cluster = state.get('chosen_cluster')
    if chosen_cluster is not None and chosen_cluster >= 0:
        return "enrich_docs"
    return "cluster"

def should_continue_research(state: ResearchState) -> Literal["research", "generate_report"]:
    # Minimum threshold for documents
    min_doc_count = 2  # Adjust this as needed
    # Check document count
    if len(state["documents"]) < min_doc_count:
        return "research"
    return "generate_report"

# Define the conditional edge function based on report grade
def route_based_on_evaluation(state: ResearchState) -> Literal["research", "publish"]:
    evaluation = state.get("eval")

    # If the report has critical gaps, route to research for additional questions; otherwise, proceed to publish
    return "research" if evaluation.grade == 1 else "publish"

--------------------------------------------------------------------------------
/backend/utils/utils.py:
--------------------------------------------------------------------------------

import re
from fpdf import FPDF


class CustomPDF(FPDF):
    def __init__(self):
        super().__init__()
        self.set_left_margin(25)
        self.set_right_margin(25)
        self.set_auto_page_break(auto=True, margin=25)

    def footer(self):
        self.set_y(-15)
        self.set_font("Arial", "I", 8)
        self.cell(0, 10, f"Page {self.page_no()}", 0, 0, "C")

def sanitize_content(content):
    # Encode and decode to ensure consistent handling of special characters
    return content.encode('utf-8', 'ignore').decode('utf-8')

def replace_problematic_characters(content):
    replacements = {
        '\u2013': '-',    # en dash to hyphen
        '\u2014': '--',   # em dash to double hyphen
        '\u2018': "'",    # left single quote to apostrophe
        '\u2019': "'",    # right single quote to apostrophe
        '\u201c': '"',    # left double quote to double quote
        '\u201d': '"',    # right double quote to double quote
        '\u2026': '...',  # ellipsis
        '\u2022': '*',    # bullet
        '\u2122': 'TM'    # trademark symbol
    }
    for char, replacement in replacements.items():
        content = content.replace(char, replacement)
    return content

def generate_pdf_from_md(content, filename='output.pdf'):
    try:
        pdf = CustomPDF()
        pdf.add_page()
        pdf.set_font('Arial', '', 12)

        # Sanitize and replace problematic characters in content
        content = sanitize_content(content)
        content = replace_problematic_characters(content)

        lines = content.split('\n')
        for line in lines:
            if line.startswith('#'):
                # Count only the leading '#' characters, not every '#' in the line
                header_level = min(len(line) - len(line.lstrip('#')), 4)
                header_text = re.sub(r'\*{2,}', '', line.strip('# ').strip())
                pdf.set_font('Arial', 'B', 18 - header_level * 2)
                pdf.ln(8 if header_level == 1 else 5)
                pdf.multi_cell(0, 10, header_text)
                pdf.set_font('Arial', '', 12)
            else:
                process_markdown_line(pdf, line)
                pdf.ln(8)

        pdf.output(filename)
        return f"PDF generated: {filename}"

    except Exception as e:
        return f"Error generating PDF: {e}"

def process_markdown_line(pdf, line):
    """Parses line for Markdown styling, including bold, italics, and links."""
    parts = re.split(r'(\*\*.*?\*\*|\*.*?\*|\[.*?\]\(.*?\)|https?://\S+)', line)
    for part in parts:
        if re.match(r'\*\*.*?\*\*', part):  # Bold
            text = part.strip('*')
            pdf.set_font('Arial', 'B', 12)
            pdf.write(10, text)
        elif re.match(r'\*.*?\*', part):  # Italics
            text = part.strip('*')
            pdf.set_font('Arial', 'I', 12)
            pdf.write(10, text)
        elif re.match(r'\[.*?\]\(.*?\)', part):  # Markdown link
            display_text = re.search(r'\[(.*?)\]', part).group(1)
            url = re.search(r'\((.*?)\)', part).group(1)
            pdf.set_text_color(0, 0, 255)
            pdf.set_font('Arial', 'U', 12)
            pdf.write(10, display_text, url)
            pdf.set_text_color(0, 0, 0)  # Reset color
            pdf.set_font('Arial', '', 12)
        elif re.match(r'https?://\S+', part):  # Plain URL
            url = part
            pdf.set_text_color(0, 0, 255)
            pdf.set_font('Arial', 'U', 12)
            pdf.write(10, url, url)
            pdf.set_text_color(0, 0, 0)
            pdf.set_font('Arial', '', 12)
        else:
            pdf.set_font('Arial', '', 12)
            pdf.write(10, part)

--------------------------------------------------------------------------------
/frontend/static/script.js:
--------------------------------------------------------------------------------
let ws;
let currentMarkdownContent = '';

function validateInputs() {
    const companyName = document.getElementById("companyName").value.trim();
    const companyUrl = document.getElementById("companyUrl").value.trim();

    if (!companyName) {
        alert("Please enter a company name");
        return false;
    }

    if (!companyUrl) {
        alert("Please enter a company URL");
        return false;
    }

    // Basic URL validation
    try {
        new URL(companyUrl);
    } catch (error) {
        alert("Please enter a valid URL (including http:// or https://)");
        return false;
    }

    return true;
}
function startResearch() {
    if (!validateInputs()) {
        return;
    }

    const progressDiv = document.getElementById("progress");
    const clusterSelectionDiv = document.getElementById("cluster-selection");
    const reportDiv = document.getElementById("report");
    const copyButton = document.getElementById("copyButton");

    // Clear previous content
    progressDiv.innerHTML = "";
    reportDiv.innerHTML = "";
    clusterSelectionDiv.style.display = "none";
    copyButton.style.display = "none";
    currentMarkdownContent = '';

    // Open WebSocket connection
    ws = new WebSocket("ws://127.0.0.1:5000/ws");

    ws.onmessage = function(event) {
        const message = event.data;
        if (message.includes("Please review the options and select the correct cluster")) {
            clusterSelectionDiv.style.display = "block";
        }
        // Handle final report differently
        if (message.startsWith("Report generated successfully!")) {
            currentMarkdownContent = message.replace("Report generated successfully!", "").trim();
            // Render Markdown content
            reportDiv.innerHTML = marked.parse(currentMarkdownContent);
            // Show copy button
            copyButton.style.display = "block";
            // Add progress message
            // const messageElement = document.createElement("div");
            // messageElement.className = "progress-message";
            // messageElement.textContent = "Report generated successfully!";
            // progressDiv.appendChild(messageElement);
        } else {
            // Create message element for all other messages
            const messageElement = document.createElement("div");
            messageElement.className = "progress-message";
            messageElement.textContent = message;
            progressDiv.appendChild(messageElement);
        }
        // Ensure automatic scrolling to the latest message
        requestAnimationFrame(() => {
            progressDiv.scrollTop = progressDiv.scrollHeight;
        });
    };

    ws.onopen = function() {
        const companyName = document.getElementById("companyName").value;
        const companyUrl = document.getElementById("companyUrl").value;
        const outputFormat = document.getElementById("outputFormat").value;
        const payload = { companyName, companyUrl, outputFormat };
        console.log("Sending WebSocket payload:", payload);
        ws.send(JSON.stringify(payload));
    };

    ws.onerror = function(error) {
        const messageElement = document.createElement("div");
        messageElement.className = "progress-message";
        // WebSocket error events usually carry no message, so fall back to a generic one
        messageElement.textContent = "Error: " + (error.message || "WebSocket connection failed");
        messageElement.style.borderLeftColor = "#FE363B"; // Red border for errors
        progressDiv.appendChild(messageElement);
        progressDiv.scrollTop = progressDiv.scrollHeight;
    };
}

function submitClusterSelection() {
    const clusterSelection = document.getElementById("cluster-input").value;
    if (ws && clusterSelection) {
        ws.send(clusterSelection);
        document.getElementById("cluster-selection").style.display = "none";
        document.getElementById("cluster-input").value = "";
    }
}

async function copyReport() {
    if (currentMarkdownContent) {
        try {
            await navigator.clipboard.writeText(currentMarkdownContent);
            const copyButton = document.getElementById("copyButton");
            const originalText = copyButton.textContent;
            copyButton.textContent = "Copied!";
            setTimeout(() => {
                copyButton.textContent = originalText;
            }, 2000);
        } catch (err) {
            console.error('Failed to copy text: ', err);
        }
    }
}
--------------------------------------------------------------------------------
/frontend/static/styles.css:
--------------------------------------------------------------------------------
* { box-sizing: border-box; margin: 0; padding: 0; font-family: Arial, sans-serif; }

body {
    display: flex;
    flex-direction: column;
    align-items: center;
    min-height: 100vh;
    background-color: #F9F7F3;
    padding: 20px;
    overflow-x: hidden;
}

.main-title {
    color: #2C3E50;
    margin-bottom: 10px;
    font-size: 36px;
    font-weight: 600;
    letter-spacing: 0.5px;
    text-align: center;
    font-family: 'Segoe UI', Arial, sans-serif;
    position: relative;
    padding-bottom: 15px;
}

.main-title::after {
    content: '';
    position: absolute;
    bottom: 0;
    left: 50%;
    transform: translateX(-50%);
    width: 60px;
    height: 3px;
    background-color: #468BFF;
    border-radius: 2px;
}

.container {
    display: flex;
    gap: 20px;
    width: 100%;
    max-width: 1400px;
    height: calc(100vh - 100px);
}

.input-section, .progress-section, .final-report-section {
    height: 100%;
    padding: 20px;
    background-color: #ffffff;
    border: 1px solid #ddd;
    border-radius: 8px;
    box-shadow: 0 4px 8px rgba(0, 0, 0, 0.1);
    display: flex;
    flex-direction: column;
    flex-shrink: 0;
    overflow-y: auto;
}

.section-header {
    border: 2px solid #e0e0e0;
    border-radius: 6px;
    padding: 12px 16px;
    margin: -5px -5px 15px -5px;
    background: linear-gradient(to bottom, #ffffff, #f8f9fa);
    display: flex;
    align-items: center;
    justify-content: space-between;
}

.section-header h2 {
    margin: 0;
    padding: 0;
    font-size: 20px;
    color: #2C3E50;
    font-weight: 600;
}

.input-section { flex: 1.5; flex-basis: 20%; }

.progress-section, .final-report-section { flex: 2; flex-basis: 40%; }

/* Progress section specific styles */
.progress-section { background-color: #f8f9fa; }

#progress {
    flex: 1;
    overflow-y: auto;
    padding: 10px;
    margin-top: 10px;
    display: flex;
    flex-direction: column;
    gap: 8px;
}

/* Individual progress messages */
.progress-message {
    padding: 12px;
    background-color: white;
    border-left: 3px solid #468BFF;
    border-radius: 4px;
    word-wrap: break-word;
    white-space: pre-wrap;
    box-shadow: 0 1px 3px rgba(0, 0, 0, 0.05);
}

.copy-button {
    background-color: #f0f0f0;
    color: #666;
    border: 1px solid #ddd;
    padding: 2px 6px; /* Reduced padding */
    border-radius: 3px;
    cursor: pointer;
    font-size: 11px; /* Reduced font size */
    transition: all 0.2s ease;
    width: auto; /* Reduced min-width */
    height: 20px; /* Reduced height */
    line-height: 1;
}

.copy-button:hover {
    background-color: #e0e0e0;
}

.copy-button:active {
    background-color: #d0d0d0;
}

.form-group { margin-bottom: 20px; }

.form-group label {
    display: block;
    margin-bottom: 8px;
    color: #555;
    font-weight: 500;
}

.form-group input, .form-group select {
    width: 100%;
    padding: 12px;
    border: 1px solid #ddd;
    border-radius: 6px;
    font-size: 14px;
    transition: all 0.2s ease;
}

.form-group input:focus, .form-group select:focus {
    outline: none;
    border-color: #468BFF;
    box-shadow: 0 0 0 3px rgba(70, 139, 255, 0.1);
}

button {
    background-color: #468BFF;
    color: white;
    border: none;
    padding: 12px 20px;
    cursor: pointer;
    border-radius: 6px;
    width: 100%;
    font-size: 16px;
    font-weight: 500;
    transition: background-color 0.2s ease;
}

button:hover { background-color: #357ABD; }

#cluster-selection {
    display: none;
    margin-top: 20px;
    padding-top: 20px;
    border-top: 1px solid #ddd;
}

/* Scrollbar styling */
#progress::-webkit-scrollbar { width: 8px; }

#progress::-webkit-scrollbar-track {
    background: #f1f1f1;
    border-radius: 4px;
}

#progress::-webkit-scrollbar-thumb {
    background: #ccc;
    border-radius: 4px;
}

#progress::-webkit-scrollbar-thumb:hover { background: #999; }


.final-report-section {
    background-color: #ffffff;
}

#report {
    flex: 1;
    overflow-y: auto;
    padding: 15px 30px;
    color: #2C3E50;
    font-size: 15px;
    line-height: 1.8;
}
#report h1:first-child {
    margin-top: 0;
}

#report h1 {
    font-size: 28px;
    margin: 32px 0 20px 0;
    color: #1a2634;
    border-bottom: 2px solid #eee;
    padding-bottom: 10px;
}

#report h2 {
    font-size: 24px;
    margin: 28px 0 16px 0;
    color: #1a2634;
}

#report h3 {
    font-size: 20px;
    margin: 24px 0 14px 0;
    color: #1a2634;
}

#report p {
    margin: 0 0 20px 0;
    line-height: 1.8;
}

#report ul,
#report ol {
    margin: 0 0 20px 0;
    padding-left: 24px;
}

#report li {
    margin-bottom: 12px;
    line-height: 1.6;
}

#report blockquote {
    margin: 20px 0;
    padding: 10px 20px;
    border-left: 4px solid #468BFF;
    background-color: #f8f9fa;
}

#report code {
    background-color: #f6f8fa;
    padding: 2px 6px;
    border-radius: 4px;
    font-family: monospace;
}

#report pre {
    background-color: #f6f8fa;
    padding: 16px;
    border-radius: 6px;
    overflow-x: auto;
    margin: 20px 0;
}

#report hr {
    margin: 30px 0;
    border: none;
    border-top: 1px solid #eee;
}

#report table {
    border-collapse: collapse;
    width: 100%;
    margin: 20px 0;
}

#report th,
#report td {
    border: 1px solid #ddd;
    padding: 12px;
    text-align: left;
}

#report th {
    background-color: #f8f9fa;
}

#report tr:nth-child(even) {
    background-color: #f8f9fa;
}
--------------------------------------------------------------------------------
/frontend/templates/index.html:
--------------------------------------------------------------------------------
[index.html markup not preserved in this listing. Surviving text: "Company Research Tool" (page title and main heading) and the section headings "User Input", "Progress", "Final Report".]
--------------------------------------------------------------------------------
/langgraph.json:
--------------------------------------------------------------------------------
{
  "dockerfile_lines": [],
  "graphs": {
    "agent": "./langgraph_entry.py:graph"
  },
  "env": ".env",
  "python_version": "3.11",
  "dependencies": [
    "."
  ]
}
--------------------------------------------------------------------------------
/langgraph_entry.py:
--------------------------------------------------------------------------------
# langgraph_entry.py
from backend.graph import Graph  # Adjust if your Graph class is in a different path

graph = Graph().compile()
--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
langgraph==0.2.50
tavily-python==0.5.0
langchain==0.3.7
langchain-anthropic==0.3.0
pypandoc==1.14
pandoc==2.4
tldextract==5.1.3
fpdf==1.7.2
flask==3.1.0
fastapi==0.115.5
uvicorn==0.32.0
websockets==14.1
python-multipart==0.0.17
python-dotenv==1.0.1
markdown2==2.5.1

--------------------------------------------------------------------------------
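
The preamble-stripping heuristic in `GenerateNode.extract_markdown_content` (`/backend/nodes/generate_report.py`) can be checked in isolation. The sketch below is not the repository's code: it is a standalone reimplementation as a plain function, written on the assumption that "start at whichever of `#` or `**` appears first" is the intended behavior, and the sample reply string is made up for illustration.

```python
def extract_markdown_content(content: str) -> str:
    """Drop any text before the first '#' or '**' (standalone sketch of the node's heuristic)."""
    hash_at = content.find("#")
    bold_at = content.find("**")
    # Keep only the marker positions that were actually found (find returns -1 on a miss)
    found = [i for i in (hash_at, bold_at) if i != -1]
    if not found:
        # Neither marker present: return the whole content, stripped
        return content.strip()
    # Start at the earliest marker and drop the conversational preamble before it
    return content[min(found):].strip()


if __name__ == "__main__":
    # Hypothetical model reply with a conversational preamble
    reply = "Sure, here is the report:\n\n## Executive Summary\nAcme builds **rockets**."
    print(extract_markdown_content(reply))
```

One consequence of this heuristic is that a `#` inside the preamble itself (for example "issue #3") is also treated as the start of the Markdown, so it may be worth tightening if model replies tend to include such text.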