9 |
10 | ### DeepGit 2.0 🤯 — now **hardware‑aware** & **ColBERT‑powered**
11 |
12 | ## DeepGit
13 |
14 | **DeepGit** is an advanced, LangGraph-based agentic workflow designed to perform deep research across GitHub repositories. It intelligently searches, analyzes, and ranks repositories based on user intent, even uncovering lesser-known but highly relevant tools. DeepGit combines hybrid dense retrieval, advanced cross-encoder re-ranking, and comprehensive activity analysis in a unified, open-source platform for intelligent repository discovery.
15 |
16 | ---
17 | ### Try out the Lite version here 🧑🎓
18 |
19 | DeepGit-lite is a lightweight version of DeepGit that runs with zero GPU on a Hugging Face Space [here](https://huggingface.co/spaces/zamal/DeepGit-lite).
20 | It may not perform as well as the full version, but it's great for a quick first-hand preview.
21 |
22 | ---
23 |
24 |
25 | The latest release makes it even **deeper, smarter, and faster**:
26 |
27 | | New feature | What it gives you |
28 | |-------------|------------------|
29 | | **⚛️ Multi‑dimensional ColBERT v2 embeddings** | Fine‑grained token‑level similarity for nuanced matches that single‑vector embeddings miss. |
30 | | **🔩 Smart Hardware Filter** | Tell DeepGit your device specs — CPU-only, low RAM, or mobile. It filters out repos that won’t run smoothly, so you only see ones that fit your setup. |
31 |
32 | DeepGit still unifies hybrid dense retrieval, cross‑encoder re‑ranking, activity & quality analysis—but now every step is both *smarter* and *leaner*.
33 |
34 | ---
35 |
36 | ## ⚙️ How It Works — Agentic Workflow *v2*
37 |
38 | When the user submits a query, the **DeepGit Orchestrator Agent** triggers a relay of expert tools:
39 |
40 | 1. **Query Expansion**
41 | An LLM turns your natural‑language question into high‑signal GitHub tags for precise searching.
42 |
43 | 2. **Hardware Spec Detector**
44 | The same pass scans your wording for hints like “GPU‑poor”, “low‑memory”, or “mobile‑only” and records the constraint.
45 |
46 | 3. **ColBERT‑v2 Semantic Retriever**
47 | Every README & doc block is embedded with multi‑dimensional token vectors; MaxSim scoring surfaces nuanced matches (see the sketch just after this list).
48 |
49 | 4. **Cross‑Encoder Re‑ranker**
50 | A lightweight BERT (`MiniLM‑L‑6‑v2`) re‑orders the top K results for passage‑level accuracy.
51 |
52 | 5. **Hardware‑aware Dependency Filter**
53 | The reasoning engine inspects each repo’s `requirements.txt` / `pyproject.toml` and discards any that can’t run on your declared hardware.
54 |
55 | 6. **Community & Code Insight**
56 | Collects stars, forks, issue cadence, commit history, plus quick code‑quality metrics.
57 |
58 | 7. **Multi‑factor Ranking & Delivery**
59 | Merges all scores into one ranking and serves a clean table with links, similarity %, and “Runs on CPU‑only” badges where relevant.
60 |
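A minimal sketch of the MaxSim idea behind step 3, assuming pre-computed, L2-normalized token embeddings (the real retriever relies on ColBERT‑v2's own encoder and indexing rather than this toy code):

```python
import numpy as np

def maxsim_score(query_tokens: np.ndarray, doc_tokens: np.ndarray) -> float:
    """MaxSim: for each query token, take its best-matching doc token's
    similarity, then sum those maxima over the query. Inputs are
    L2-normalized token embeddings of shape (n_q, d) and (n_d, d)."""
    sim = query_tokens @ doc_tokens.T        # (n_q, n_d) cosine similarities
    return float(sim.max(axis=1).sum())      # best doc token per query token

# Toy usage with random vectors standing in for ColBERT-v2 token embeddings.
rng = np.random.default_rng(0)
q = rng.normal(size=(8, 128));   q /= np.linalg.norm(q, axis=1, keepdims=True)
d = rng.normal(size=(300, 128)); d /= np.linalg.norm(d, axis=1, keepdims=True)
print(f"MaxSim score: {maxsim_score(q, d):.3f}")
```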
61 | ---
62 |
63 |
64 | ## 🚀 Goals
65 |
66 | - **Uncover Hidden Gems:**
67 | Surface powerful but under-the-radar open-source tools, now with a hardware-spec filter so recommendations match your device.
68 |
69 | - **Empower Research:**
70 | Build an intelligent discovery layer over GitHub tailored for research-focused developers.
71 |
72 | - **Promote Open Innovation:**
73 | Open-source the entire workflow to foster transparency and collaboration in research.
74 |
75 | ---
76 |
77 | ## 🖥️ User Interface
78 |
79 | DeepGit provides an intuitive interface for exploring repository recommendations. The main page is where users enter a raw natural-language query; this is the primary interaction point for initiating deep semantic searches.
80 |
81 |
82 |
83 |
84 |
85 | *Output:* Showcases the tabulated results with clickable links and different threshold scores, making it easy to compare and understand the ranking criteria.
86 |
87 |
88 |
89 |
90 |
91 |
92 | ---
93 |
94 |
95 | ### 🔧 Recommended Environment
96 |
97 | - **Python:** 3.11+ (The repo has been tested on Python 3.11.x)
98 | - **pip:** 24.0+ (Ensure you have an up-to-date pip version)
99 |
100 | ---
101 |
102 | ### 👨🏭 Setup Instructions
103 |
104 | #### 1. Clone the Repository
105 | ```bash
106 | git clone https://github.com/zamalali/DeepGit.git
107 | cd DeepGit
108 | ```
109 |
110 | #### 2. Create a Virtual Environment (Recommended)
111 | ```bash
112 | python3 -m venv venv
113 | source venv/bin/activate # On Windows: venv\Scripts\activate
114 | ```
115 |
116 | #### 3. Upgrade pip (Optional but Recommended)
117 | ```bash
118 | pip install --upgrade pip
119 | ```
120 |
121 | #### 4. Install Dependencies
122 | ```bash
123 | pip install -r requirements.txt
124 | ```
125 |
126 | #### 5. 🚀 Running DeepGit via App
127 |
128 | To run DeepGit locally, simply execute:
129 |
130 | ```bash
131 | python app.py
132 | ```
133 |
134 |
135 | ### 🛠️ Troubleshooting
136 |
137 | - **Python Version:** Use Python 3.11 or higher as the repo has been tested on Python 3.11.x.
138 | - **pip Version:** Make sure you’re running pip 24.0 or later.
139 | - **Dependency Issues:** If you encounter any, try reinstalling in a new virtual environment.
140 |
141 |
142 | ---
143 |
144 | ### 🛠️ Running DeepGit
145 |
146 | For detailed documentation on using DeepGit, check out the [docs](docs).
147 |
148 | DeepGit leverages LangGraph for orchestration. To launch the LangSmith dashboard and start the workflow, simply run:
149 |
150 | ```bash
151 | langgraph dev
152 | ```
153 | This command opens the LangSmith dashboard, where you can enter your raw query as a JSON snippet and monitor the entire agentic workflow.
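For example, since the graph's input schema in `agent.py` exposes a single `user_query` field, a run can be started from the dashboard with a snippet like:

```json
{
  "user_query": "I am looking for chain-of-thought prompting for reasoning models and I am GPU-poor, so I need something lightweight."
}
```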
154 |
155 |
156 | ### DeepGit on Docker
157 | For instructions on using Docker with DeepGit, please refer to our [Docker Documentation](docs/docker.md).
158 |
--------------------------------------------------------------------------------
/agent.py:
--------------------------------------------------------------------------------
1 | import os
2 | import logging
3 | import getpass
4 | from pathlib import Path
5 | from dotenv import load_dotenv
6 | from langgraph.graph import START, END, StateGraph
7 | from pydantic import BaseModel, Field
8 | from dataclasses import dataclass, field
9 | from typing import List, Any
10 |
11 | # ---------------------------
12 | # Import node functions
13 | # ---------------------------
14 | from tools.convert_query import convert_searchable_query
15 | from tools.parse_hardware import parse_hardware_spec
16 | from tools.github import ingest_github_repos
17 | from tools.dense_retrieval import hybrid_dense_retrieval
18 | from tools.cross_encoder_reranking import cross_encoder_reranking
19 | from tools.filtering import threshold_filtering
20 | from tools.dependency_analysis import dependency_analysis
21 | from tools.activity_analysis import repository_activity_analysis
22 | from tools.decision_maker import decision_maker
23 | from tools.code_quality import code_quality_analysis
24 | from tools.merge_analysis import merge_analysis
25 | from tools.ranking import multi_factor_ranking
26 | from tools.output_presentation import output_presentation
27 |
28 | # ---------------------------
29 | # Logging & Environment Setup
30 | # ---------------------------
31 | logging.basicConfig(level=logging.INFO, format="%(asctime)s - %(levelname)s - %(message)s")
32 | logger = logging.getLogger(__name__)
33 |
34 | dotenv_path = Path(__file__).resolve().parent / ".env"  # .env sits next to agent.py at the repo root
35 | if dotenv_path.exists():
36 | load_dotenv(dotenv_path)
37 |
38 | if "GITHUB_API_KEY" not in os.environ:
39 | os.environ["GITHUB_API_KEY"] = getpass.getpass("Enter your GitHub API key: ")
40 |
41 | # ---------------------------
42 | # State & Configuration
43 | # ---------------------------
44 | @dataclass(kw_only=True)
45 | class AgentState:
46 | user_query: str = field(default="")
47 | searchable_query: str = field(default="")
48 | hardware_spec: str = field(default="") # extracted hardware hint
49 | repositories: List[Any] = field(default_factory=list)
50 | semantic_ranked: List[Any] = field(default_factory=list)
51 | reranked_candidates: List[Any] = field(default_factory=list)
52 | filtered_candidates: List[Any] = field(default_factory=list)
53 | hardware_filtered: List[Any] = field(default_factory=list)
54 | activity_candidates: List[Any] = field(default_factory=list)
55 | quality_candidates: List[Any] = field(default_factory=list)
56 | final_ranked: List[Any] = field(default_factory=list)
57 |
58 | @dataclass(kw_only=True)
59 | class AgentStateInput:
60 | user_query: str = field(default="")
61 |
62 | @dataclass(kw_only=True)
63 | class AgentStateOutput:
64 | final_results: str = field(default="")
65 |
66 | class AgentConfiguration(BaseModel):
67 | max_results: int = Field(100, title="Max Results", description="Max GitHub results")
68 | per_page: int = Field(25, title="Per Page", description="GitHub results per page")
69 | dense_retrieval_k: int = Field(100, title="Dense K", description="Top‑K for dense retrieval")
70 | cross_encoder_top_n: int = Field(50, title="Cross‑encoder N", description="Top‑N after re‑rank")
71 | min_stars: int = Field(50, title="Min Stars", description="Minimum star count")
72 | cross_encoder_threshold: float = Field(5.5, title="CE Threshold", description="Cross‑encoder score cutoff")
73 | sem_model_name: str = Field("all-mpnet-base-v2", title="SentenceTransformer model")
74 | cross_encoder_model_name: str = Field("cross-encoder/ms-marco-MiniLM-L-6-v2", title="Cross‑encoder model")
75 |
76 | @classmethod
77 | def from_runnable_config(cls, config: Any = None) -> "AgentConfiguration":
78 | cfg = (config or {}).get("configurable", {})
79 | raw = {k: os.environ.get(k.upper(), cfg.get(k)) for k in cls.__fields__.keys()}
80 | values = {k: v for k, v in raw.items() if v is not None}
81 | return cls(**values)
82 |
83 | # -------------------------------------------------------
84 | # Build & Compile the Workflow Graph
85 | # -------------------------------------------------------
86 | builder = StateGraph(
87 | AgentState,
88 | input=AgentStateInput,
89 | output=AgentStateOutput,
90 | config_schema=AgentConfiguration
91 | )
92 |
93 | # Core nodes
94 | builder.add_node("convert_searchable_query", convert_searchable_query)
95 | builder.add_node("parse_hardware", parse_hardware_spec)
96 | builder.add_node("ingest_github_repos", ingest_github_repos)
97 | builder.add_node("neural_dense_retrieval", hybrid_dense_retrieval)
98 | builder.add_node("cross_encoder_reranking", cross_encoder_reranking)
99 | builder.add_node("threshold_filtering", threshold_filtering)
100 | builder.add_node("dependency_analysis", dependency_analysis)
101 | builder.add_node("repository_activity_analysis", repository_activity_analysis)
102 | builder.add_node("decision_maker", decision_maker)
103 | builder.add_node("code_quality_analysis", code_quality_analysis)
104 | builder.add_node("merge_analysis", merge_analysis)
105 | builder.add_node("multi_factor_ranking", multi_factor_ranking)
106 | builder.add_node("output_presentation", output_presentation)
107 |
108 | # Edges (dataflow)
109 | builder.add_edge(START, "convert_searchable_query")
110 | builder.add_edge("convert_searchable_query", "parse_hardware")
111 | builder.add_edge("parse_hardware", "ingest_github_repos")
112 | builder.add_edge("ingest_github_repos", "neural_dense_retrieval")
113 | builder.add_edge("neural_dense_retrieval", "cross_encoder_reranking")
114 | builder.add_edge("cross_encoder_reranking", "threshold_filtering")
115 |
116 | # **Parallel branches** after filtering:
117 | builder.add_edge("threshold_filtering", "dependency_analysis")
118 | builder.add_edge("threshold_filtering", "repository_activity_analysis")
119 | builder.add_edge("threshold_filtering", "decision_maker")
120 |
121 | # Merge the outputs of the three parallel paths:
122 | builder.add_edge("dependency_analysis", "code_quality_analysis")
123 | builder.add_edge("decision_maker", "code_quality_analysis")
124 | builder.add_edge("repository_activity_analysis", "merge_analysis")
125 | builder.add_edge("code_quality_analysis", "merge_analysis")
126 |
127 | builder.add_edge("merge_analysis", "multi_factor_ranking")
128 | builder.add_edge("multi_factor_ranking", "output_presentation")
129 | builder.add_edge("output_presentation", END)
130 |
131 | graph = builder.compile()
132 |
133 | # -------------------------------------------------------
134 | # CLI entrypoint
135 | # -------------------------------------------------------
136 | if __name__ == "__main__":
137 | initial = AgentStateInput(
138 | user_query=(
139 | "I am looking for chain-of-thought prompting for reasoning models "
140 | "and I am GPU-poor, so I need something lightweight."
141 | )
142 | )
143 | result = graph.invoke(initial)
144 | print(result["final_results"])
145 |
--------------------------------------------------------------------------------
/app.py:
--------------------------------------------------------------------------------
1 | import gradio as gr
2 | import os
3 | import json
4 | import time
5 | import threading
6 | import logging
7 | from agent import graph # Your DeepGit langgraph workflow
8 |
9 | # ---------------------------
10 | # Set environment variables to prevent thread/multiprocessing issues on macOS/Linux
11 | # os.environ["TOKENIZERS_PARALLELISM"] = "false"
12 | # os.environ["OMP_NUM_THREADS"] = "1"
13 | # os.environ["MKL_NUM_THREADS"] = "1"
14 | # ---------------------------
15 |
16 |
17 |
18 | # ---------------------------
19 | # Global Logging Buffer Setup
20 | # ---------------------------
21 | LOG_BUFFER = []
22 | LOG_BUFFER_LOCK = threading.Lock()
23 |
24 | class BufferLogHandler(logging.Handler):
25 | def emit(self, record):
26 | log_entry = self.format(record)
27 | with LOG_BUFFER_LOCK:
28 | LOG_BUFFER.append(log_entry)
29 |
30 | # Attach the custom logging handler if not already attached.
31 | root_logger = logging.getLogger()
32 | if not any(isinstance(h, BufferLogHandler) for h in root_logger.handlers):
33 | handler = BufferLogHandler()
34 | formatter = logging.Formatter("%(asctime)s - %(levelname)s - %(message)s")
35 | handler.setFormatter(formatter)
36 | root_logger.addHandler(handler)
37 |
38 | # ---------------------------
39 | # Helper to Filter Log Messages
40 | # ---------------------------
41 | def filter_logs(logs):
42 | """
43 | Processes a list of log messages so that any log containing
44 | "HTTP Request:" is replaced with a generic message, and adjacent
45 | HTTP logs are deduplicated.
46 | """
47 | filtered = []
48 | last_was_fetching = False
49 | for log in logs:
50 | if "HTTP Request:" in log:
51 | if not last_was_fetching:
52 | filtered.append("Fetching repositories...")
53 | last_was_fetching = True
54 | else:
55 | filtered.append(log)
56 | last_was_fetching = False
57 | return filtered
58 |
59 | # ---------------------------
60 | # Title, Favicon & Description
61 | # ---------------------------
62 | #custom_theme = gr.Theme.load("gstaff/sketch")
63 |
64 | favicon_html = """
65 |
66 |
67 | DeepGit Research Agent
68 |
69 | """
70 |
71 | title = """
72 |
73 |
74 |
75 | DeepGit
76 |
77 |
78 | ⚙️ Built for open-source, by an open-sourcer — DeepGit finds gold in the GitHub haystack.
79 |
80 |
81 | """
82 |
83 | description = """
84 | DeepGit is a multi‑stage research agent that digs through GitHub so you don’t have to.
85 | Just describe what you’re hunting for — and, if you like, add a hint about your hardware (“GPU‑poor”, “mobile‑only”, etc.).
86 | Behind the scenes, DeepGit now orchestrates an upgraded tool‑chain:
87 | • Query Expansion → ColBERT‑v2 token‑level Semantic Retrieval → Cross‑Encoder Re‑ranking
88 | • Hardware‑aware Dependency Filter that discards repos your device can’t run
89 | • Codebase & Community Insight modules for quality and activity signals
90 | Feed it a topic below; the agent will analyze, rank, and explain the most relevant, runnable repositories.
91 | A short wait earns you a gold‑curated list.
92 |
"""
93 |
94 |
95 | consent_text = """
96 |
97 |
98 | By using DeepGit, you consent to the collection and temporary processing of your query for semantic search and ranking purposes.
99 | No data is stored permanently, and your input is only used to power the DeepGit agent workflow.
100 |
101 |
102 | ⭐ Star us on GitHub if you find this tool useful!
103 | GitHub
104 |
--------------------------------------------------------------------------------
/docs/architecture.md:
--------------------------------------------------------------------------------
4 |
5 | DeepGit leverages a state graph to process a user's query and deliver a ranked list of repositories. Each node in the graph handles a specific function. Below is an overview of the entire architecture:
6 |
7 | 1. **Query Conversion**
8 | - **Function:** Converts the raw user query into colon-separated search tags using an LLM.
9 | - **Module:** `tools/convert_query.py`
10 |
11 | 2. **Repository Ingestion**
12 | - **Function:** Uses the GitHub API (with asynchronous calls) to fetch repository metadata and documentation.
13 | - **Module:** `tools/github.py`
14 | - **Details:**
15 | - Fetches README and additional markdown files.
16 | - Combines the content into a single `combined_doc` for each repository.
17 |
18 | 3. **Neural Dense Retrieval**
19 | - **Function:** Encodes repository documentation using a Sentence Transformer and computes semantic similarity with the query using FAISS.
20 | - **Module:** `tools/dense_retrieval.py`
21 | - **Details:**
22 | - Normalizes embeddings.
23 | - Returns a ranked list of candidates based on semantic similarity.
24 |
25 | 4. **Cross-Encoder Re-Ranking**
26 | - **Function:** Reranks candidates by comparing the user query with the complete markdown documentation of each repository.
27 | - **Module:** `tools/cross_encoder_reranking.py`
28 | - **Details:**
29 | - For long documentation, splits text into chunks.
30 | - Aggregates chunk scores (using the maximum value) to produce a final score.
31 |
32 | 5. **Threshold Filtering**
33 | - **Function:** Filters out repositories that do not meet certain thresholds (e.g., minimum stars, cross encoder score).
34 | - **Module:** `tools/filtering.py`
35 |
36 | 6. **Decision Maker**
37 | - **Function:** Determines if code quality analysis should be run based on the query and repository count.
38 | - **Module:** `tools/decision_maker.py`
39 |
40 | 7. **Repository Activity Analysis**
41 | - **Function:** Computes an activity score based on factors like pull requests, commits, and open issues.
42 | - **Module:** `tools/activity_analysis.py`
43 |
44 | 8. **Code Quality Analysis**
45 | - **Function:** (Conditional) Clones repositories and uses static analysis (flake8) to score code quality.
46 | - **Module:** `tools/code_quality.py`
47 |
48 | 9. **Merge Analysis**
49 | - **Function:** Merges results from activity and code quality analyses.
50 | - **Module:** `tools/merge_analysis.py`
51 |
52 | 10. **Multi-Factor Ranking**
53 | - **Function:** Normalizes and weights multiple metrics (semantic similarity, cross encoder score, activity score, code quality score, stars) to compute a final ranking.
54 | - **Module:** `tools/ranking.py`
55 |
56 | 11. **Output Presentation**
57 | - **Function:** Formats the final ranked repositories into a human-readable output.
58 | - **Module:** `tools/output_presentation.py`
59 |
60 | These nodes are connected in the state graph defined in `agent.py`, ensuring smooth data flow from the initial query to the final presentation of results.
61 |
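As a quick orientation, the compiled graph can also be driven directly from Python, mirroring the CLI entrypoint at the bottom of `agent.py` (a minimal sketch; it assumes the required environment variables such as `GITHUB_API_KEY` are already set):

```python
from agent import graph, AgentStateInput

# Run the full node pipeline end-to-end for a single query.
state_in = AgentStateInput(user_query="lightweight vector databases that run on CPU only")
result = graph.invoke(state_in)
print(result["final_results"])  # formatted, ranked repository list
```
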
62 | ---
63 |
64 |
--------------------------------------------------------------------------------
/docs/docker.md:
--------------------------------------------------------------------------------
1 |
2 | # DeepGit on Docker
3 |
4 |
5 | ### Build the Docker Image
6 | From the root directory of your project, run:
7 |
8 | ```bash
9 | docker build -t deepgit-app .
10 | ```
11 |
12 | This command builds a Docker image tagged as deepgit-app using Python 3.10-slim as the base image and installs all the necessary dependencies.
13 |
14 | ### Run the Docker Container
15 | Once the image is built, start a container with:
16 |
17 | ```bash
18 | docker run -p 7860:7860 deepgit-app
19 | ```
20 | This command maps port 7860 from the container to your local machine, allowing you to access the Gradio interface of DeepGit in your web browser.
21 |
22 |
--------------------------------------------------------------------------------
/docs/modules.md:
--------------------------------------------------------------------------------
1 |
2 | # DeepGit Modules
3 |
4 |
5 | This document provides detailed descriptions of each module in the DeepGit workflow.
6 |
7 | ### 1. Query Conversion (`tools/convert_query.py`)
8 | - **Purpose:** Convert the raw user query into colon-separated search tags.
9 | - **Mechanism:** Uses an LLM prompt to generate precise tags.
10 | - **Outcome:** Updates `state.searchable_query`.
11 |
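A rough sketch of this kind of LLM tagging step, assuming a Groq-hosted chat model via `langchain_groq` (listed in `requirements.txt`); the model name and prompt text below are illustrative placeholders, not the module's actual prompt:

```python
from langchain_groq import ChatGroq
from langchain_core.prompts import ChatPromptTemplate

# Illustrative only; requires GROQ_API_KEY in the environment.
llm = ChatGroq(model="llama-3.1-8b-instant", temperature=0)
prompt = ChatPromptTemplate.from_messages([
    ("system", "Convert the user's request into colon-separated GitHub search tags."),
    ("human", "{query}"),
])
tags = (prompt | llm).invoke({"query": "chain-of-thought prompting, but I am GPU-poor"}).content
print(tags)  # e.g. "chain-of-thought:prompting:reasoning"
```
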
12 | ### 2. Repository Ingestion (`tools/github.py`)
13 | - **Purpose:** Retrieve GitHub repositories based on search tags.
14 | - **Mechanism:**
15 | - Uses asynchronous HTTP calls (via `httpx.AsyncClient`) to query GitHub.
16 | - Fetches README files and additional markdown documentation.
17 | - Combines all documentation into `combined_doc`.
18 | - **Outcome:** Populates `state.repositories` with repository metadata and documentation.
19 |
20 | ### 3. Neural Dense Retrieval (`tools/dense_retrieval.py`)
21 | - **Purpose:** Compute semantic similarity between the user query and repository documentation.
22 | - **Mechanism:**
23 | - Encodes text using a SentenceTransformer.
24 | - Normalizes embeddings and uses FAISS to search for nearest neighbors.
25 | - **Outcome:** Produces a sorted list (`state.semantic_ranked`) with similarity scores.
26 |
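A minimal sketch of this retrieval pattern with `sentence-transformers` and FAISS; the model name matches the default `sem_model_name` in `agent.py`, but the snippet is illustrative rather than the module's exact code:

```python
import faiss
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-mpnet-base-v2")  # default sem_model_name in agent.py
docs = ["README text of repo one ...", "README text of repo two ..."]

doc_emb = model.encode(docs, convert_to_numpy=True).astype("float32")
faiss.normalize_L2(doc_emb)                       # inner product == cosine similarity
index = faiss.IndexFlatIP(doc_emb.shape[1])
index.add(doc_emb)

query_emb = model.encode(["chain-of-thought prompting"], convert_to_numpy=True).astype("float32")
faiss.normalize_L2(query_emb)
scores, ids = index.search(query_emb, k=min(10, len(docs)))
print(list(zip(ids[0].tolist(), scores[0].tolist())))  # nearest repos with similarity scores
```
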
27 | ### 4. Cross-Encoder Re-Ranking (`tools/cross_encoder_reranking.py`)
28 | - **Purpose:** Refine ranking by comparing the complete markdown documentation against the query.
29 | - **Mechanism:**
30 | - For short documentation, scores the full text directly.
31 | - For long documentation, splits it into chunks (with configurable chunk size and max length) and scores each chunk.
32 | - Uses the maximum score as the repository's final cross-encoder score.
33 | - **Outcome:** Updates `state.reranked_candidates` with enhanced relevance scores.
34 |
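The chunk-and-max strategy can be sketched roughly as follows, assuming a `sentence-transformers` `CrossEncoder` and an illustrative character-based chunk size (the module's real chunking parameters are configurable):

```python
from sentence_transformers import CrossEncoder

def chunked_ce_score(query: str, doc: str, model: CrossEncoder, chunk_chars: int = 2000) -> float:
    """Split long documentation into fixed-size chunks, score each (query, chunk)
    pair, and keep the best chunk's score as the repository's final score."""
    chunks = [doc[i:i + chunk_chars] for i in range(0, len(doc), chunk_chars)] or [doc]
    scores = model.predict([(query, chunk) for chunk in chunks])
    return float(max(scores))

model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")  # default in agent.py
print(chunked_ce_score("chain-of-thought prompting", "Some long README text ...", model))
```
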
35 | ### 5. Threshold Filtering (`tools/filtering.py`)
36 | - **Purpose:** Filter out repositories that don't meet quality thresholds.
37 | - **Mechanism:**
38 | - Evaluates candidates based on star count and cross-encoder score.
39 | - Discards repositories failing to meet the thresholds.
40 | - **Outcome:** Sets `state.filtered_candidates`.
41 |
42 | ### 6. Decision Maker (`tools/decision_maker.py`)
43 | - **Purpose:** Decide if code quality analysis is needed.
44 | - **Mechanism:**
45 | - Uses an LLM prompt that evaluates the user's query and repository count.
46 | - Outputs a decision (1 to run analysis, 0 to skip).
47 | - **Outcome:** Sets `state.run_code_analysis`.
48 |
49 | ### 7. Repository Activity Analysis (`tools/activity_analysis.py`)
50 | - **Purpose:** Assess the repository's activity level.
51 | - **Mechanism:**
52 | - Fetches pull requests, commit dates, and open issues.
53 | - Computes an `activity_score` based on these metrics.
54 | - **Outcome:** Populates `state.activity_candidates`.
55 |
56 | ### 8. Code Quality Analysis (`tools/code_quality.py`)
57 | - **Purpose:** Evaluate code quality if required.
58 | - **Mechanism:**
59 | - Clones repositories locally.
60 | - Runs flake8 to count style errors.
61 | - Computes a score based on issues per file.
62 | - **Outcome:** Populates `state.quality_candidates`.
63 |
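A simplified sketch of the clone-and-lint idea, assuming `git` and `flake8` are available on the PATH (note that `flake8` is not pinned in `requirements.txt`, so treat this as illustrative rather than the module's exact scoring):

```python
import os
import subprocess
import tempfile

def flake8_issue_density(clone_url: str) -> float:
    """Shallow-clone a repository, run flake8 over it, and return
    style issues per Python file (lower is better)."""
    with tempfile.TemporaryDirectory() as tmp:
        subprocess.run(["git", "clone", "--depth", "1", clone_url, tmp],
                       check=True, capture_output=True)
        result = subprocess.run(["flake8", tmp], capture_output=True, text=True)
        issues = len([line for line in result.stdout.splitlines() if line.strip()])
        py_files = sum(f.endswith(".py") for _, _, files in os.walk(tmp) for f in files)
        return issues / py_files if py_files else 0.0

print(flake8_issue_density("https://github.com/zamalali/DeepGit.git"))
```
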
64 | ### 9. Merge Analysis (`tools/merge_analysis.py`)
65 | - **Purpose:** Combine results from the activity and code quality analyses.
66 | - **Mechanism:**
67 | - Merges candidates based on repository `full_name`.
68 | - **Outcome:** Updates `state.filtered_candidates` with merged information.
69 |
70 | ### 10. Multi-Factor Ranking (`tools/ranking.py`)
71 | - **Purpose:** Compute a final ranking score by combining various metrics.
72 | - **Mechanism:**
73 | - Normalizes scores (semantic, cross-encoder, activity, code quality, stars).
74 | - Applies predetermined weights and sums them to produce a final score.
75 | - **Outcome:** Produces a sorted `state.final_ranked` list.
76 |
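A simplified sketch of the normalize-and-weight idea; the weights below are illustrative placeholders rather than the repository's actual values, while the log-damped star score mirrors what `tests/test_ranking.py` checks:

```python
import math

def normalize(values):
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) if hi > lo else 0.5 for v in values]

def multi_factor_rank(candidates, weights=(0.3, 0.3, 0.2, 0.1, 0.1)):
    """Weighted sum of min-max normalized metrics; weights are illustrative."""
    metrics = ["semantic_similarity", "cross_encoder_score",
               "activity_score", "code_quality_score"]
    columns = [normalize([c[m] for c in candidates]) for m in metrics]
    columns.append(normalize([math.log(c["stars"] + 1) for c in candidates]))  # damp raw star counts
    for i, candidate in enumerate(candidates):
        candidate["final_score"] = sum(w * col[i] for w, col in zip(weights, columns))
    return sorted(candidates, key=lambda c: c["final_score"], reverse=True)
```
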
77 | ### 11. Output Presentation (`tools/output_presentation.py`)
78 | - **Purpose:** Format and display the final ranked repositories.
79 | - **Mechanism:**
80 | - Constructs a string output with details of the top-ranked repositories.
81 | - **Outcome:** Returns the final results in `state.final_results`.
82 |
83 | ---
84 |
85 |
--------------------------------------------------------------------------------
/docs/testing.md:
--------------------------------------------------------------------------------
1 |
2 | # Testing DeepGit
3 |
4 |
5 | DeepGit includes a comprehensive suite of tests to ensure the reliability of each module.
6 |
7 | ## Test Suite Overview
8 |
9 | Tests are organized under the `tests/` directory, covering the following modules:
10 |
11 | - **Query Conversion:** `tests/test_convert_query.py`
12 | - **Repository Ingestion:** `tests/test_github.py`
13 | - **Neural Dense Retrieval:** `tests/test_dense_retrieval.py`
14 | - **Cross-Encoder Re-Ranking:** `tests/test_cross_encoder_reranking.py`
15 | - **Threshold Filtering:** `tests/test_filtering.py`
16 | - **Repository Activity Analysis:** `tests/test_activity_analysis.py`
17 | - **Decision Maker:** `tests/test_decision_maker.py`
18 | - **Code Quality Analysis:** `tests/test_code_quality.py`
19 | - **Merge Analysis:** `tests/test_merge_analysis.py`
20 | - **Multi-Factor Ranking:** `tests/test_ranking.py`
21 | - **Output Presentation:** `tests/test_output_presentation.py`
22 |
23 | ## Running the Tests
24 |
25 | To run the entire test suite, use one of the following commands from the project root:
26 |
27 | ```bash
28 | pytest
29 | ```
30 |
31 | Or, if you have a test runner script (e.g., run_tests.py):
32 |
33 | ```bash
34 | python run_tests.py
35 | ```
36 |
37 | ## Test Environment
38 | **Mocking:**
39 | External dependencies such as HTTP requests (to GitHub) and model predictions are mocked using monkeypatch and dummy functions to ensure tests are deterministic.
40 |
41 | **Coverage:**
42 | Each test file simulates various scenarios (e.g., valid data, error handling) to cover the full functionality of each module.
43 |
44 | ## Troubleshooting
45 | Ensure your project structure includes all necessary `__init__.py` files.
46 |
47 | Check that any required dummy data or configuration is correctly set in the test files.
48 |
49 | Use verbose mode (`pytest -v`) for more detailed output if tests fail.
50 |
51 |
--------------------------------------------------------------------------------
/langgraph.json:
--------------------------------------------------------------------------------
1 | {
2 | "dockerfile_lines": [],
3 | "graphs": {
4 | "researcher": "./agent.py:graph"
5 | },
6 | "python_version": "3.11",
7 | "env": "./.env",
8 | "dependencies": [
9 | "."
10 | ]
11 | }
12 |
--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
1 | toml>=0.10.2
2 | requests==2.32.3
3 | numpy==1.25.2
4 | python-dotenv==1.0.1
5 | sentence-transformers==3.4.1
6 | faiss-cpu==1.9.0.post1
7 | pydantic==2.10.6
8 | httpx==0.27.2
9 | gradio==5.23.1
10 | langgraph==0.2.62
11 | langchain_groq==0.2.4
12 | langchain_core==0.3.47
13 | rank_bm25==0.2.2
14 | langgraph-cli==0.1.79
--------------------------------------------------------------------------------
/run_tests.py:
--------------------------------------------------------------------------------
1 | import sys
2 | import os
3 | import pytest
4 |
5 | if __name__ == "__main__":
6 | # Set the project root (assuming agent.py, tools, and tests are in the "DeepGit" folder)
7 | project_root = os.path.abspath(os.path.join(os.path.dirname(__file__), ""))
8 | if project_root not in sys.path:
9 | sys.path.insert(0, project_root)
10 |
11 | print("Project root added to sys.path:", project_root)
12 |
13 | # Define the tests directory (inside the DeepGit folder)
14 | tests_dir = os.path.join(project_root, "tests")
15 | print("Running tests from:", tests_dir)
16 |
17 | # Run pytest on the tests folder
18 | result = pytest.main([tests_dir])
19 |
20 | if result == 0:
21 | print("All tests passed!")
22 | else:
23 | print("Some tests failed.")
24 | sys.exit(result)
25 |
--------------------------------------------------------------------------------
/tests/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/zamalali/DeepGit/14ea95cb4b24f03e146b5061f59c8cba49eeb885/tests/__init__.py
--------------------------------------------------------------------------------
/tests/test_activity_analysis.py:
--------------------------------------------------------------------------------
1 | import pytest
2 | import datetime
3 | from tools.activity_analysis import repository_activity_analysis
4 |
5 | class DummyResponse:
6 | def __init__(self, status_code, json_data):
7 | self.status_code = status_code
8 | self._json = json_data
9 | def json(self):
10 | return self._json
11 |
12 | def dummy_requests_get(url, headers=None, params=None):
13 | # Return dummy responses based on URL.
14 | if "pulls" in url:
15 | # Return 2 open pull requests.
16 | return DummyResponse(200, [{"dummy": "pr1"}, {"dummy": "pr2"}])
17 | elif "commits" in url:
18 | # Return a commit with a recent date.
19 | recent_date = (datetime.datetime.utcnow() - datetime.timedelta(days=10)).isoformat() + "Z"
20 | return DummyResponse(200, [{"commit": {"committer": {"date": recent_date}}}])
21 | return DummyResponse(200, {})
22 |
23 | class DummyState:
24 | def __init__(self):
25 | self.filtered_candidates = [
26 | {"full_name": "dummy/repo1", "open_issues_count": 5},
27 | {"full_name": "dummy/repo2", "open_issues_count": 10}
28 | ]
29 |
30 | class DummyConfig:
31 | def __init__(self):
32 | self.configurable = {}
33 |
34 | def test_repository_activity_analysis(monkeypatch):
35 | monkeypatch.setattr("tools.activity_analysis.requests.get", dummy_requests_get)
36 | state = DummyState()
37 | config = DummyConfig().__dict__
38 | result = repository_activity_analysis(state, config)
39 | # Each candidate should now have an "activity_score" key.
40 | for repo in state.activity_candidates:
41 | assert "activity_score" in repo
42 | # Since the dummy returns 2 PRs and a commit 10 days ago, score should be computed.
43 | assert isinstance(repo["activity_score"], float)
44 |
--------------------------------------------------------------------------------
/tests/test_convert_query.py:
--------------------------------------------------------------------------------
1 | import pytest
2 | from tools.convert_query import convert_searchable_query
3 | from dataclasses import dataclass, field
4 | from typing import List, Any
5 |
6 | # Create a dummy AgentState and AgentStateInput for testing.
7 | @dataclass
8 | class DummyState:
9 | user_query: str = "Test query for convert"
10 | searchable_query: str = ""
11 |
12 | @dataclass
13 | class DummyConfig:
14 | configurable: dict = field(default_factory=lambda: {})
15 |
16 | def test_convert_searchable_query():
17 | state = DummyState()
18 | config = DummyConfig().__dict__
19 | result = convert_searchable_query(state, config)
20 | # Expect the searchable_query to be non-empty and contain a colon.
21 | assert ":" in state.searchable_query
22 | assert "searchable_query" in result
23 |
--------------------------------------------------------------------------------
/tests/test_cross_encoder_reranking.py:
--------------------------------------------------------------------------------
1 | import numpy as np
2 | import pytest
3 | from tools.cross_encoder_reranking import cross_encoder_reranking
4 |
5 | class DummyCrossEncoder:
6 | def __init__(self, model_name):
7 | self.model_name = model_name
8 | def predict(self, pairs, show_progress_bar=False):
9 | # Return a score equal to the length of the second element (chunk) modulo 10.
10 | if isinstance(pairs, list):
11 | scores = [len(pair[1]) % 10 for pair in pairs]
12 | return np.array(scores)
13 | else:
14 | return np.array([len(pairs[1]) % 10])
15 |
16 | class DummyState:
17 | def __init__(self):
18 | self.user_query = "dummy query"
19 | # Create two repositories with different lengths of documentation.
20 | self.semantic_ranked = [
21 | {"combined_doc": "Short doc."},
22 | {"combined_doc": "This is a longer document that should produce a higher score due to more content."}
23 | ]
24 |
25 | class DummyConfig:
26 | def __init__(self):
27 | self.configurable = {
28 | "cross_encoder_model_name": "dummy-cross-encoder",
29 | "cross_encoder_top_n": 2
30 | }
31 |
32 | def test_cross_encoder_reranking(monkeypatch):
33 | monkeypatch.setattr("tools.cross_encoder_reranking.CrossEncoder", lambda model_name: DummyCrossEncoder(model_name))
34 | state = DummyState()
35 | config = DummyConfig().__dict__
36 | result = cross_encoder_reranking(state, config)
37 | # Verify that the reranked_candidates list is of length top_n (2) and that the candidate with the longer doc is ranked higher.
38 | assert len(state.reranked_candidates) == 2
39 | scores = [cand["cross_encoder_score"] for cand in state.reranked_candidates]
40 | # Since the dummy score is based on length mod 10, we can check that the max is first.
41 | assert scores[0] >= scores[1]
42 |
--------------------------------------------------------------------------------
/tests/test_decision_maker.py:
--------------------------------------------------------------------------------
1 | import pytest
2 | from tools.decision_maker import decision_maker
3 |
4 | # Create a dummy should_run_code_analysis function.
5 | def dummy_should_run_code_analysis(query, repo_count):
6 | # Return 1 if repo_count is less than 50, else 0.
7 | return 1 if repo_count < 50 else 0
8 |
9 | class DummyState:
10 | def __init__(self):
11 | self.user_query = "dummy query"
12 | self.filtered_candidates = [{}] * 30 # 30 repos.
13 |
14 | class DummyConfig:
15 | def __init__(self):
16 | self.configurable = {}
17 |
18 | def test_decision_maker(monkeypatch):
19 | monkeypatch.setattr("tools.decision_maker.should_run_code_analysis", dummy_should_run_code_analysis)
20 | state = DummyState()
21 | config = DummyConfig().__dict__
22 | result = decision_maker(state, config)
23 | # Since repo_count is 30 (<50), decision should be 1 => run_code_analysis True.
24 | assert state.run_code_analysis is True
25 |
--------------------------------------------------------------------------------
/tests/test_dense_retrieval.py:
--------------------------------------------------------------------------------
1 | import numpy as np
2 | import pytest
3 | from tools.dense_retrieval import hybrid_dense_retrieval
4 |
5 | class DummySentenceTransformer:
6 | def __init__(self, model_name):
7 | self.model_name = model_name
8 | def encode(self, texts, convert_to_numpy=True, **kwargs):
9 | # Return a deterministic vector for each text (e.g., using length)
10 | if isinstance(texts, list):
11 | return np.array([[len(t)] for t in texts], dtype=float)
12 | else:
13 | return np.array([len(texts)], dtype=float)
14 |
15 | class DummyState:
16 | def __init__(self):
17 | self.user_query = "dummy query"
18 | self.repositories = [{"combined_doc": "Test document one."},
19 | {"combined_doc": "Another test document with more text."}]
20 |
21 | class DummyConfig:
22 | def __init__(self):
23 | self.configurable = {
24 | "sem_model_name": "dummy-model",
25 | "dense_retrieval_k": 10
26 | }
27 |
28 | def test_neural_dense_retrieval(monkeypatch):
29 | monkeypatch.setattr("tools.dense_retrieval.SentenceTransformer", lambda model_name: DummySentenceTransformer(model_name))
30 | state = DummyState()
31 | config = DummyConfig().__dict__
32 | result = hybrid_dense_retrieval(state, config)
33 | # Expect semantic_ranked to be sorted in descending order of embedding (here, length).
34 | ranked = state.semantic_ranked
35 | assert len(ranked) == len(state.repositories)
36 | assert ranked[0]["semantic_similarity"] >= ranked[-1]["semantic_similarity"]
37 |
--------------------------------------------------------------------------------
/tests/test_filtering.py:
--------------------------------------------------------------------------------
1 | import pytest
2 | from tools.filtering import threshold_filtering
3 |
4 | class DummyState:
5 | def __init__(self):
6 | # Create dummy reranked candidates with stars and cross_encoder_score.
7 | self.reranked_candidates = [
8 | {"stars": 60, "cross_encoder_score": 6.0},
9 | {"stars": 30, "cross_encoder_score": 4.0}, # Should be filtered out if both criteria fail.
10 | {"stars": 80, "cross_encoder_score": 5.5}
11 | ]
12 |
13 | class DummyConfig:
14 | def __init__(self):
15 | self.configurable = {
16 | "min_stars": 50,
17 | "cross_encoder_threshold": 5.5
18 | }
19 |
20 | def test_threshold_filtering():
21 | state = DummyState()
22 | config = DummyConfig().__dict__
23 | result = threshold_filtering(state, config)
24 | # Expect that the candidate with 30 stars and 4.0 score is filtered out.
25 | filtered = state.filtered_candidates
26 | assert len(filtered) == 2
27 | for repo in filtered:
28 | assert repo["stars"] >= 50 or repo["cross_encoder_score"] >= 5.5
29 |
--------------------------------------------------------------------------------
/tests/test_github.py:
--------------------------------------------------------------------------------
1 | import pytest
2 | import asyncio
3 | import base64
4 | from tools.github import ingest_github_repos
5 |
6 | # Dummy response class to simulate httpx responses.
7 | class DummyResponse:
8 | def __init__(self, status_code, json_data, text_data="Dummy text"):
9 | self.status_code = status_code
10 | self._json = json_data
11 | self.text = text_data
12 |
13 | def json(self):
14 | return self._json
15 |
16 | # Dummy async get function simulating httpx.AsyncClient.get.
17 | async def dummy_get(url, headers=None, params=None):
18 | # For repository search:
19 | if "search/repositories" in url:
20 | return DummyResponse(200, {"items": [{
21 | "html_url": "https://github.com/dummy/repo",
22 | "full_name": "dummy/repo",
23 | "clone_url": "https://github.com/dummy/repo.git",
24 | "stargazers_count": 100,
25 | "open_issues_count": 5,
26 | "name": "repo"
27 | }]})
28 | # For contents endpoint:
29 | elif "contents" in url:
30 | if "readme" in url.lower():
31 | # Return a dummy README encoded in base64.
32 | encoded = base64.b64encode("Dummy README".encode("utf-8")).decode("utf-8")
33 | return DummyResponse(200, {"content": encoded})
34 | else:
35 | # Return a dummy markdown file list.
36 | return DummyResponse(200, [{"type": "file", "name": "README.md", "download_url": "https://dummy/readme"}])
37 | return DummyResponse(200, {})
38 |
39 | # Dummy fetch_file_content function for asynchronous calls.
40 | async def dummy_fetch_file_content(download_url, client):
41 | return "Dummy documentation content."
42 |
43 | # Patch the httpx.AsyncClient.get and file content retrieval in the module.
44 | @pytest.fixture(autouse=True)
45 | def patch_httpx(monkeypatch):
46 | import httpx
47 | # Patch the get method of AsyncClient to use dummy_get.
48 | monkeypatch.setattr(httpx.AsyncClient, "get", lambda self, url, headers=None, params=None: dummy_get(url, headers, params))
49 | # Patch fetch_file_content function in the tools.github module.
50 | monkeypatch.setattr("tools.github.fetch_file_content", dummy_fetch_file_content)
51 |
52 | # Define dummy state and configuration classes.
53 | class DummyState:
54 | def __init__(self):
55 | self.searchable_query = "dummy:repo"
56 | self.repositories = []
57 |
58 | class DummyConfig:
59 | def __init__(self):
60 | self.configurable = {}
61 |
62 | # Since ingest_github_repos is a synchronous wrapper calling asyncio.run,
63 | # we can test it directly.
64 | def test_ingest_github_repos():
65 | state = DummyState()
66 | config = DummyConfig().__dict__
67 | result = ingest_github_repos(state, config)
68 | # We expect at least one repository with the required keys.
69 | assert len(state.repositories) >= 1
70 | repo = state.repositories[0]
71 | for key in ["title", "link", "clone_url", "combined_doc", "stars", "full_name", "open_issues_count"]:
72 | assert key in repo
73 |
--------------------------------------------------------------------------------
/tests/test_merge_analysis.py:
--------------------------------------------------------------------------------
1 | import pytest
2 | from tools.merge_analysis import merge_analysis
3 |
4 | class DummyState:
5 | def __init__(self):
6 | self.activity_candidates = [
7 | {"full_name": "dummy/repo", "activity_score": 8.0},
8 | {"full_name": "dummy/repo2", "activity_score": 7.0}
9 | ]
10 | self.quality_candidates = [
11 | {"full_name": "dummy/repo", "code_quality_score": 90},
12 | {"full_name": "dummy/repo3", "code_quality_score": 80}
13 | ]
14 | self.filtered_candidates = []
15 |
16 | class DummyConfig:
17 | def __init__(self):
18 | self.configurable = {}
19 |
20 | def test_merge_analysis():
21 | state = DummyState()
22 | config = DummyConfig().__dict__
23 | result = merge_analysis(state, config)
24 | # Expect merged repos: dummy/repo, dummy/repo2, dummy/repo3.
25 | merged = state.filtered_candidates
26 | assert len(merged) == 3
27 | # Check that for dummy/repo, both scores are merged.
28 | for repo in merged:
29 | if repo["full_name"] == "dummy/repo":
30 | assert "activity_score" in repo
31 | assert "code_quality_score" in repo
32 |
--------------------------------------------------------------------------------
/tests/test_output_presentation.py:
--------------------------------------------------------------------------------
1 | import pytest
2 | from tools.output_presentation import output_presentation
3 |
4 | class DummyState:
5 | def __init__(self):
6 | # Create dummy final_ranked list.
7 | self.final_ranked = [
8 | {"title": "Repo1", "link": "https://github.com/repo1", "stars": 100,
9 | "semantic_similarity": 0.8, "cross_encoder_score": 7.0, "activity_score": 5.0,
10 | "code_quality_score": 90, "final_score": 0.95, "combined_doc": "This is a documentation snippet for Repo1."},
11 | {"title": "Repo2", "link": "https://github.com/repo2", "stars": 50,
12 | "semantic_similarity": 0.6, "cross_encoder_score": 6.5, "activity_score": 3.0,
13 | "code_quality_score": 80, "final_score": 0.85, "combined_doc": "Documentation snippet for Repo2."}
14 | ]
15 |
16 | class DummyConfig:
17 | def __init__(self):
18 | self.configurable = {}
19 |
20 | def test_output_presentation():
21 | state = DummyState()
22 | config = DummyConfig().__dict__
23 | result = output_presentation(state, config)
24 | output_str = result["final_results"]
25 | # Check that the output string contains expected repository titles and snippets.
26 | assert "Repo1" in output_str
27 | assert "https://github.com/repo1" in output_str
28 | assert "Documentation snippet" in output_str
29 |
--------------------------------------------------------------------------------
/tests/test_ranking.py:
--------------------------------------------------------------------------------
1 | import pytest
2 | from tools.ranking import multi_factor_ranking
3 | import math
4 |
5 | class DummyState:
6 | def __init__(self):
7 | # Create dummy filtered_candidates with varying scores.
8 | self.filtered_candidates = [
9 | {"semantic_similarity": 0.8, "cross_encoder_score": 7.0, "activity_score": 5.0, "code_quality_score": 90, "stars": 100},
10 | {"semantic_similarity": 0.6, "cross_encoder_score": 8.0, "activity_score": 3.0, "code_quality_score": 80, "stars": 50},
11 | {"semantic_similarity": 0.9, "cross_encoder_score": 6.0, "activity_score": 7.0, "code_quality_score": 95, "stars": 150}
12 | ]
13 | self.final_ranked = []
14 |
15 | class DummyConfig:
16 | def __init__(self):
17 | self.configurable = {}
18 |
19 | def test_multi_factor_ranking():
20 | state = DummyState()
21 | config = DummyConfig().__dict__
22 | result = multi_factor_ranking(state, config)
23 | # Check that final_ranked is sorted in descending order by final_score.
24 | final = state.final_ranked
25 | assert len(final) == 3
26 | assert final[0]["final_score"] >= final[1]["final_score"] >= final[2]["final_score"]
27 | # Verify that star score is computed as log(stars+1)
28 | expected_star_score = math.log(100 + 1)
29 | # Normalize check is more complex; here we just ensure the final score key exists.
30 | for repo in final:
31 | assert "final_score" in repo
32 |
--------------------------------------------------------------------------------
/themes/theme_schema@0.0.1.json:
--------------------------------------------------------------------------------
1 | {"theme": {"_font": [{"__gradio_font__": true, "name": "Shantell Sans", "class": "google"}, {"__gradio_font__": true, "name": "ui-sans-serif", "class": "font"}, {"__gradio_font__": true, "name": "sans-serif", "class": "font"}], "_font_mono": [{"__gradio_font__": true, "name": "IBM Plex Mono", "class": "google"}, {"__gradio_font__": true, "name": "ui-monospace", "class": "font"}, {"__gradio_font__": true, "name": "monospace", "class": "font"}], "_stylesheets": ["https://fonts.googleapis.com/css2?family=Shantell+Sans:wght@400;600&display=swap", "https://fonts.googleapis.com/css2?family=IBM+Plex+Mono:wght@400;600&display=swap"], "background_fill_primary": "white", "background_fill_primary_dark": "*neutral_900", "background_fill_secondary": "*neutral_50", "background_fill_secondary_dark": "*neutral_800", "block_background_fill": "*background_fill_primary", "block_background_fill_dark": "*neutral_800", "block_border_color": "*border_color_primary", "block_border_color_dark": "*border_color_primary", "block_border_width": "0px", "block_border_width_dark": "1px", "block_info_text_color": "*body_text_color_subdued", "block_info_text_color_dark": "*body_text_color_subdued", "block_info_text_size": "*text_sm", "block_info_text_weight": "400", "block_label_background_fill": "*background_fill_primary", "block_label_background_fill_dark": "*background_fill_secondary", "block_label_border_color": "*border_color_primary", "block_label_border_color_dark": "*border_color_primary", "block_label_border_width": "1px", "block_label_margin": "0", "block_label_padding": "*spacing_sm *spacing_lg", "block_label_radius": "calc(*radius_lg - 1px) 0 calc(*radius_lg - 1px) 0", "block_label_right_radius": "0 calc(*radius_lg - 1px) 0 calc(*radius_lg - 1px)", "block_label_text_color": "*body_text_color", "block_label_text_color_dark": "*neutral_200", "block_label_text_size": "*text_md", "block_label_text_weight": "600", "block_padding": "*spacing_xl calc(*spacing_xl + 2px)", "block_radius": "*radius_lg", "block_shadow": "*shadow_drop_lg", "block_shadow_dark": "none", "block_title_background_fill": "none", "block_title_border_color": "none", "block_title_border_width": "0px", "block_title_padding": "0", "block_title_radius": "none", "block_title_text_color": "*body_text_color", "block_title_text_color_dark": "*neutral_200", "block_title_text_size": "*text_md", "block_title_text_weight": "600", "body_background_fill": "*background_fill_primary", "body_background_fill_dark": "*background_fill_primary", "body_text_color": "*neutral_900", "body_text_color_dark": "*neutral_100", "body_text_color_subdued": "*neutral_700", "body_text_color_subdued_dark": "*neutral_400", "body_text_size": "*text_md", "body_text_weight": "400", "border_color_accent": "*primary_300", "border_color_accent_dark": "*neutral_600", "border_color_primary": "*neutral_200", "border_color_primary_dark": "*neutral_700", "button_border_width": "*input_border_width", "button_border_width_dark": "*input_border_width", "button_cancel_background_fill": "*button_primary_background_fill", "button_cancel_background_fill_dark": "*button_secondary_background_fill", "button_cancel_background_fill_hover": "*button_primary_background_fill_hover", "button_cancel_background_fill_hover_dark": "*button_cancel_background_fill", "button_cancel_border_color": "*button_secondary_border_color", "button_cancel_border_color_dark": "*button_secondary_border_color", "button_cancel_border_color_hover": "*button_cancel_border_color", "button_cancel_border_color_hover_dark": 
"*button_cancel_border_color", "button_cancel_text_color": "*button_primary_text_color", "button_cancel_text_color_dark": "*button_secondary_text_color", "button_cancel_text_color_hover": "*button_cancel_text_color", "button_cancel_text_color_hover_dark": "*button_cancel_text_color", "button_large_padding": "*spacing_lg", "button_large_radius": "*radius_lg", "button_large_text_size": "*text_lg", "button_large_text_weight": "600", "button_primary_background_fill": "*neutral_900", "button_primary_background_fill_dark": "*neutral_600", "button_primary_background_fill_hover": "*neutral_700", "button_primary_background_fill_hover_dark": "*neutral_600", "button_primary_border_color": "*primary_200", "button_primary_border_color_dark": "*primary_600", "button_primary_border_color_hover": "*button_primary_border_color", "button_primary_border_color_hover_dark": "*button_primary_border_color", "button_primary_text_color": "white", "button_primary_text_color_dark": "white", "button_primary_text_color_hover": "*button_primary_text_color", "button_primary_text_color_hover_dark": "*button_primary_text_color", "button_secondary_background_fill": "*button_primary_background_fill", "button_secondary_background_fill_dark": "*neutral_600", "button_secondary_background_fill_hover": "*button_primary_background_fill_hover", "button_secondary_background_fill_hover_dark": "*button_secondary_background_fill", "button_secondary_border_color": "*neutral_200", "button_secondary_border_color_dark": "*neutral_600", "button_secondary_border_color_hover": "*button_secondary_border_color", "button_secondary_border_color_hover_dark": "*button_secondary_border_color", "button_secondary_text_color": "*button_primary_text_color", "button_secondary_text_color_dark": "white", "button_secondary_text_color_hover": "*button_secondary_text_color", "button_secondary_text_color_hover_dark": "*button_secondary_text_color", "button_shadow": "none", "button_shadow_active": "none", "button_shadow_hover": "none", "button_small_padding": "*spacing_sm", "button_small_radius": "*radius_lg", "button_small_text_size": "*text_md", "button_small_text_weight": "400", "button_transition": "background-color 0.2s ease", "checkbox_background_color": "*background_fill_primary", "checkbox_background_color_dark": "*neutral_700", "checkbox_background_color_focus": "*checkbox_background_color", "checkbox_background_color_focus_dark": "*checkbox_background_color", "checkbox_background_color_hover": "*checkbox_background_color", "checkbox_background_color_hover_dark": "*checkbox_background_color", "checkbox_background_color_selected": "*neutral_600", "checkbox_background_color_selected_dark": "*neutral_700", "checkbox_border_color": "*neutral_300", "checkbox_border_color_dark": "*neutral_700", "checkbox_border_color_focus": "*secondary_500", "checkbox_border_color_focus_dark": "*secondary_500", "checkbox_border_color_hover": "*neutral_300", "checkbox_border_color_hover_dark": "*neutral_600", "checkbox_border_color_selected": "*secondary_600", "checkbox_border_color_selected_dark": "*neutral_800", "checkbox_border_radius": "*radius_sm", "checkbox_border_width": "*input_border_width", "checkbox_border_width_dark": "*input_border_width", "checkbox_check": "url(\"data:image/svg+xml,%3csvg viewBox='0 0 16 16' fill='white' xmlns='http://www.w3.org/2000/svg'%3e%3cpath d='M12.207 4.793a1 1 0 010 1.414l-5 5a1 1 0 01-1.414 0l-2-2a1 1 0 011.414-1.414L6.5 9.086l4.293-4.293a1 1 0 011.414 0z'/%3e%3c/svg%3e\")", "checkbox_label_background_fill": 
"*button_primary_background_fill", "checkbox_label_background_fill_dark": "*button_secondary_background_fill", "checkbox_label_background_fill_hover": "*button_primary_background_fill_hover", "checkbox_label_background_fill_hover_dark": "*button_secondary_background_fill_hover", "checkbox_label_background_fill_selected": "*checkbox_label_background_fill", "checkbox_label_background_fill_selected_dark": "*checkbox_label_background_fill", "checkbox_label_border_color": "*border_color_primary", "checkbox_label_border_color_dark": "*border_color_primary", "checkbox_label_border_color_hover": "*checkbox_label_border_color", "checkbox_label_border_color_hover_dark": "*checkbox_label_border_color", "checkbox_label_border_width": "*input_border_width", "checkbox_label_border_width_dark": "*input_border_width", "checkbox_label_gap": "*spacing_lg", "checkbox_label_padding": "*spacing_md", "checkbox_label_shadow": "none", "checkbox_label_text_color": "*button_primary_text_color", "checkbox_label_text_color_dark": "*body_text_color", "checkbox_label_text_color_selected": "*checkbox_label_text_color", "checkbox_label_text_color_selected_dark": "*checkbox_label_text_color", "checkbox_label_text_size": "*text_md", "checkbox_label_text_weight": "400", "checkbox_shadow": "*input_shadow", "color_accent": "*primary_500", "color_accent_soft": "*primary_50", "color_accent_soft_dark": "*neutral_700", "container_radius": "*radius_lg", "embed_radius": "*radius_lg", "error_background_fill": "#fee2e2", "error_background_fill_dark": "*background_fill_primary", "error_border_color": "#fecaca", "error_border_color_dark": "*border_color_primary", "error_border_width": "1px", "error_text_color": "#ef4444", "error_text_color_dark": "#ef4444", "font": "'Shantell Sans', 'ui-sans-serif', sans-serif", "font_mono": "'IBM Plex Mono', 'ui-monospace', monospace", "form_gap_width": "0px", "input_background_fill": "*neutral_100", "input_background_fill_dark": "*neutral_700", "input_background_fill_focus": "*secondary_500", "input_background_fill_focus_dark": "*secondary_600", "input_background_fill_hover": "*input_background_fill", "input_background_fill_hover_dark": "*input_background_fill", "input_border_color": "*border_color_primary", "input_border_color_dark": "*border_color_primary", "input_border_color_focus": "*secondary_300", "input_border_color_focus_dark": "*neutral_700", "input_border_color_hover": "*input_border_color", "input_border_color_hover_dark": "*input_border_color", "input_border_width": "0px", "input_padding": "*spacing_xl", "input_placeholder_color": "*neutral_400", "input_placeholder_color_dark": "*neutral_500", "input_radius": "*radius_lg", "input_shadow": "none", "input_shadow_focus": "*input_shadow", "input_text_size": "*text_md", "input_text_weight": "400", "layout_gap": "*spacing_xxl", "link_text_color": "*secondary_600", "link_text_color_active": "*secondary_600", "link_text_color_active_dark": "*secondary_500", "link_text_color_dark": "*secondary_500", "link_text_color_hover": "*secondary_700", "link_text_color_hover_dark": "*secondary_400", "link_text_color_visited": "*secondary_500", "link_text_color_visited_dark": "*secondary_600", "loader_color": "*color_accent", "neutral_100": "#f5f5f5", "neutral_200": "#e5e5e5", "neutral_300": "#d4d4d4", "neutral_400": "#a3a3a3", "neutral_50": "#fafafa", "neutral_500": "#737373", "neutral_600": "#525252", "neutral_700": "#404040", "neutral_800": "#262626", "neutral_900": "#171717", "neutral_950": "#0f0f0f", "panel_background_fill": 
"*background_fill_secondary", "panel_background_fill_dark": "*background_fill_secondary", "panel_border_color": "*border_color_primary", "panel_border_color_dark": "*border_color_primary", "panel_border_width": "0", "primary_100": "#f5f5f5", "primary_200": "#e5e5e5", "primary_300": "#d4d4d4", "primary_400": "#a3a3a3", "primary_50": "#fafafa", "primary_500": "#737373", "primary_600": "#525252", "primary_700": "#404040", "primary_800": "#262626", "primary_900": "#171717", "primary_950": "#0f0f0f", "prose_header_text_weight": "600", "prose_text_size": "*text_md", "prose_text_weight": "400", "radio_circle": "url(\"data:image/svg+xml,%3csvg viewBox='0 0 16 16' fill='white' xmlns='http://www.w3.org/2000/svg'%3e%3ccircle cx='8' cy='8' r='3'/%3e%3c/svg%3e\")", "radius_lg": "8px", "radius_md": "6px", "radius_sm": "4px", "radius_xl": "12px", "radius_xs": "2px", "radius_xxl": "22px", "radius_xxs": "1px", "secondary_100": "#f5f5f5", "secondary_200": "#e5e5e5", "secondary_300": "#d4d4d4", "secondary_400": "#a3a3a3", "secondary_50": "#fafafa", "secondary_500": "#737373", "secondary_600": "#525252", "secondary_700": "#404040", "secondary_800": "#262626", "secondary_900": "#171717", "secondary_950": "#0f0f0f", "section_header_text_size": "*text_md", "section_header_text_weight": "400", "shadow_drop": "rgba(0,0,0,0.05) 0px 1px 2px 0px", "shadow_drop_lg": "0 1px 4px 0 rgb(0 0 0 / 0.1)", "shadow_inset": "rgba(0,0,0,0.05) 0px 2px 4px 0px inset", "shadow_spread": "3px", "shadow_spread_dark": "1px", "slider_color": "*neutral_900", "slider_color_dark": "*neutral_500", "spacing_lg": "8px", "spacing_md": "6px", "spacing_sm": "4px", "spacing_xl": "10px", "spacing_xs": "2px", "spacing_xxl": "16px", "spacing_xxs": "1px", "stat_background_fill": "*primary_300", "stat_background_fill_dark": "*primary_500", "table_border_color": "*neutral_300", "table_border_color_dark": "*neutral_700", "table_even_background_fill": "white", "table_even_background_fill_dark": "*neutral_950", "table_odd_background_fill": "*neutral_50", "table_odd_background_fill_dark": "*neutral_900", "table_radius": "*radius_lg", "table_row_focus": "*color_accent_soft", "table_row_focus_dark": "*color_accent_soft", "text_lg": "20px", "text_md": "16px", "text_sm": "14px", "text_xl": "24px", "text_xs": "12px", "text_xxl": "28px", "text_xxs": "10px"}, "version": "0.0.1"}
--------------------------------------------------------------------------------
/tools/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/zamalali/DeepGit/14ea95cb4b24f03e146b5061f59c8cba49eeb885/tools/__init__.py
--------------------------------------------------------------------------------
/tools/activity_analysis.py:
--------------------------------------------------------------------------------
1 | # tools/activity_analysis.py
2 | import os
3 | import datetime
4 | import logging
5 | import requests
6 |
7 | logger = logging.getLogger(__name__)
8 |
9 | def get_commit_frequency(full_name, headers):
10 | """
11 | Returns the number of commits in the last 30 days.
12 | """
13 | since_date = (datetime.datetime.utcnow() - datetime.timedelta(days=30)).isoformat() + "Z"
14 | commits_url = f"https://api.github.com/repos/{full_name}/commits"
15 | commits_params = {"since": since_date, "per_page": 100}
16 | try:
17 | response = requests.get(commits_url, headers=headers, params=commits_params)
18 | if response.status_code == 200:
19 | commits = response.json()
20 | return len(commits)
21 | except Exception as e:
22 | logger.error(f"Error fetching commit frequency for {full_name}: {e}")
23 | return 0
24 |
25 | def repository_activity_analysis(state, config):
26 | headers = {
27 | "Authorization": f"token {os.getenv('GITHUB_API_KEY')}",
28 | "Accept": "application/vnd.github.v3+json"
29 | }
30 | def analyze_repository_activity(repo):
31 | full_name = repo.get("full_name")
32 | # Pull Requests analysis
33 | pr_url = f"https://api.github.com/repos/{full_name}/pulls"
34 | pr_params = {"state": "open", "per_page": 100}
35 | pr_response = requests.get(pr_url, headers=headers, params=pr_params)
36 | pr_count = len(pr_response.json()) if pr_response.status_code == 200 else 0
37 |
38 | # Latest commit analysis
39 | commits_url = f"https://api.github.com/repos/{full_name}/commits"
40 | commits_params = {"per_page": 1}
41 | commits_response = requests.get(commits_url, headers=headers, params=commits_params)
42 | if commits_response.status_code == 200:
43 | commit_data = commits_response.json()
44 | if commit_data:
45 | commit_date_str = commit_data[0]["commit"]["committer"]["date"]
46 | commit_date = datetime.datetime.fromisoformat(commit_date_str.rstrip("Z"))
47 | days_diff = (datetime.datetime.utcnow() - commit_date).days
48 | else:
49 | days_diff = 999
50 | else:
51 | days_diff = 999
52 |
53 | # Issues analysis: subtract PRs from total open issues.
54 | open_issues = repo.get("open_issues_count", 0)
55 | non_pr_issues = max(0, open_issues - pr_count)
56 |
57 | # New: Commit frequency in the last 30 days.
58 | commit_frequency = get_commit_frequency(full_name, headers)
59 |
60 | # Combine signals into an activity score.
61 | # Here, we give weight to PR count, subtract a penalty for stale commits,
62 | # add non-PR issues, and add a bonus for higher commit frequency.
63 | activity_score = (3 * pr_count) + non_pr_issues - (days_diff / 30) + (0.1 * commit_frequency)
64 | # Optionally, store commit frequency for further ranking analysis.
65 | repo["commit_frequency"] = commit_frequency
66 | repo["pr_count"] = pr_count
67 | repo["latest_commit_days"] = days_diff
68 | repo["activity_score"] = activity_score
69 | return repo
70 |
71 | activity_list = []
72 | # It is assumed that activity analysis runs on filtered candidates.
73 | for repo in state.filtered_candidates:
74 | data = analyze_repository_activity(repo)
75 | activity_list.append(data)
76 | state.activity_candidates = activity_list
77 | logger.info("Repository activity analysis complete.")
78 | return {"activity_candidates": state.activity_candidates}
79 |
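
As a quick sanity check of the scoring heuristic above, here is a toy calculation with invented numbers (not taken from any real repository):

```python
# Hypothetical repo: 4 open PRs, 10 non-PR issues, last commit 15 days ago,
# 20 commits in the past 30 days. All values are made up for illustration.
pr_count, non_pr_issues, days_diff, commit_frequency = 4, 10, 15, 20
activity_score = (3 * pr_count) + non_pr_issues - (days_diff / 30) + (0.1 * commit_frequency)
print(activity_score)  # 12 + 10 - 0.5 + 2.0 = 23.5
```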
--------------------------------------------------------------------------------
/tools/chat.py:
--------------------------------------------------------------------------------
1 | import os
2 | import re
3 | from langchain_groq import ChatGroq
4 | from langchain_core.prompts import ChatPromptTemplate
5 | from dotenv import load_dotenv
6 | from pathlib import Path
7 |
8 | # Load environment variables
9 | dotenv_path = Path(__file__).resolve().parent.parent / ".env"
10 | if dotenv_path.exists():
11 | load_dotenv(dotenv_path)
12 |
13 | # Step 1: Instantiate the Groq model with appropriate settings.
14 | llm = ChatGroq(
15 | model="deepseek-r1-distill-llama-70b",
16 | temperature=0.3,
17 | max_tokens=512,
18 | max_retries=3,
19 | )
20 |
21 | # Step 2: Build the prompt with enhanced instructions for iterative thinking and target language detection.
22 | prompt = ChatPromptTemplate.from_messages([
23 | ("system",
24 | """You are a GitHub search optimization expert.
25 |
26 | Your job is to:
27 | 1. Read a user's query about tools, research, or tasks.
28 | 2. Detect if the query mentions a specific programming language other than Python (for example, JavaScript or JS). If so, record that language as the target language.
29 | 3. Think iteratively and generate your internal chain-of-thought enclosed in <think> ... </think> tags.
30 | 4. After your internal reasoning, output up to five GitHub-style search tags or library names that maximize repository discovery.
31 | Use as many tags as necessary based on the query's complexity, but never more than five.
32 | 5. If you detected a non-Python target language, append an additional tag at the end in the format target-[language] (e.g., target-javascript).
33 | If no specific language is mentioned, do not include any target tag.
34 |
35 | Output Format:
36 | tag1:tag2[:tag3[:tag4[:tag5[:target-language]]]]
37 |
38 | Rules:
39 | - Use lowercase and hyphenated keywords (e.g., image-augmentation, chain-of-thought).
40 | - Use terms commonly found in GitHub repo names, topics, or descriptions.
41 | - Avoid generic terms like "python", "ai", "tool", "project".
42 | - Do NOT use full phrases or vague words like "no-code", "framework", or "approach".
43 | - Prefer real tools, popular methods, or dataset names when mentioned.
44 | - If your output does not strictly match the required format, correct it after your internal reasoning.
45 | - Choose high-signal keywords to ensure the search yields the most relevant GitHub repositories.
46 |
47 | Excellent Examples:
48 |
49 | Input: "No code tool to augment image and annotation"
50 | Output: image-augmentation:albumentations
51 |
52 | Input: "Open-source tool for labeling datasets with UI"
53 | Output: label-studio:streamlit
54 |
55 | Input: "Visual reasoning models trained on multi-modal datasets"
56 | Output: multimodal-reasoning:vlm
57 |
58 | Input: "I want repos related to instruction-based finetuning for LLaMA 2"
59 | Output: instruction-tuning:llama2
60 |
61 | Input: "Repos around chain of thought prompting mainly for finetuned models"
62 | Output: chain-of-thought:finetuned-llm
63 |
64 | Input: "I want to fine-tune Gemini 1.5 Flash model"
65 | Output: gemini-finetuning:flash002
66 |
67 | Input: "Need repos for document parsing with vision-language models"
68 | Output: document-understanding:vlm
69 |
70 | Input: "How to train custom object detection models using YOLO"
71 | Output: object-detection:yolov5
72 |
73 | Input: "Segment anything-like models for interactive segmentation"
74 | Output: interactive-segmentation:segment-anything
75 |
76 | Input: "Synthetic data generation for vision model training"
77 | Output: synthetic-data:image-augmentation
78 |
79 | Input: "OCR pipeline for scanned documents"
80 | Output: ocr:document-processing
81 |
82 | Input: "LLMs with self-reflection or reasoning chains"
83 | Output: self-reflection:chain-of-thought
84 |
85 | Input: "Chatbot development using open-source LLMs"
86 | Output: chatbot:llm
87 |
88 | Input: "Deep learning-based object detection with YOLO and transformer architecture"
89 | Output: object-detection:yolov5:transformer
90 |
91 | Input: "Semantic segmentation for medical images using UNet with attention mechanism"
92 | Output: semantic-segmentation:unet:attention
93 |
94 | Input: "Find repositories implementing data augmentation pipelines in JavaScript"
95 | Output: data-augmentation:target-javascript
96 |
97 | Output must be ONLY the search tags separated by colons. Do not include any extra text, bullet points, or explanations.
98 | """),
99 | ("human", "{query}")
100 | ])
101 |
102 | # Step 3: Chain the prompt with the LLM.
103 | chain = prompt | llm
104 |
105 | # Step 4: Define a function to parse the final search tags from the model's response.
106 | def parse_search_tags(response: str) -> str:
107 | """
108 | Removes any internal commentary enclosed in <think> ... </think> tags
109 | and returns only the final searchable tags.
110 | """
111 | if "" in response and "" in response:
112 | end_index = response.index("") + len("")
113 | tags = response[end_index:].strip()
114 | return tags
115 | else:
116 | return response.strip()
117 |
118 | # Step 5: Helper function to validate the output tags format using regex.
119 | def valid_tags(tags: str) -> bool:
120 | """
121 | Validates that the output is one to six colon-separated tokens composed of lowercase letters, numbers, and hyphens.
122 | This allows up to five search tags and optionally one target tag.
123 | """
124 | pattern = r'^[a-z0-9-]+(?::[a-z0-9-]+){0,5}$'
125 | return re.match(pattern, tags) is not None
126 |
127 | # Step 6: Define an iterative conversion function that refines the output if needed.
128 | def iterative_convert_to_search_tags(query: str, max_iterations: int = 2) -> str:
129 | print(f"\n[iterative_convert_to_search_tags] Input Query: {query}")
130 | refined_query = query
131 | for iteration in range(max_iterations):
132 | print(f"\nIteration {iteration+1}")
133 | response = chain.invoke({"query": refined_query})
134 | full_output = response.content.strip()
135 | tags_output = parse_search_tags(full_output)
136 | print(f"Output Tags: {tags_output}")
137 | if valid_tags(tags_output):
138 | print("Valid tags format detected.")
139 | return tags_output
140 | else:
141 | print("Invalid tags format. Requesting refinement...")
142 | refined_query = f"{query}\nPlease refine your answer so that the output strictly matches the format: tag1:tag2[:tag3[:tag4[:tag5[:target-language]]]]."
143 | print("Final output (may be invalid):", tags_output)
144 | return tags_output
145 |
146 | # Example usage
147 | if __name__ == "__main__":
148 | # Example queries for testing:
149 | example_queries = [
150 | "I am looking for repositories for data augmentation pipelines for fine-tuning LLMs", # Default (Python)
151 | "Find repositories implementing data augmentation pipelines in JavaScript", # Should return target-javascript
152 | "Searching for tools for instruction-based finetuning for LLaMA 2", # Default (Python)
153 | "Looking for open-source libraries for object detection using YOLO", # Default (Python)
154 | "Repos implementing chatbots in JavaScript with self-reflection capabilities" # Should return target-javascript
155 | ]
156 |
157 | for q in example_queries:
158 | github_query = iterative_convert_to_search_tags(q)
159 | print("\nGitHub Search Query:")
160 | print(github_query)
161 |
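
A minimal sketch of how the two helpers above behave, assuming the model wraps its reasoning in `<think>` tags. The sample response is invented, and importing `tools.chat` assumes a valid `GROQ_API_KEY` is configured, since the module builds the LLM at import time:

```python
from tools.chat import parse_search_tags, valid_tags

sample = "<think>The user wants image augmentation tools...</think>\nimage-augmentation:albumentations"
tags = parse_search_tags(sample)
print(tags)              # image-augmentation:albumentations
print(valid_tags(tags))  # True
print(valid_tags("Image Augmentation, please"))  # False: uppercase and spaces break the pattern
```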
--------------------------------------------------------------------------------
/tools/code_quality.py:
--------------------------------------------------------------------------------
1 | # tools/code_quality.py
2 | import os
3 | import subprocess
4 | import tempfile
5 | import shutil
6 | import stat
7 | import logging
8 | from pathlib import Path
9 | import asyncio
10 |
11 | logger = logging.getLogger(__name__)
12 |
13 | def remove_readonly(func, path, exc_info):
14 | os.chmod(path, stat.S_IWRITE)
15 | func(path)
16 |
17 | def analyze_code_quality(repo_info):
18 | """
19 | Synchronously clones the repository, runs flake8 to determine code quality,
20 | and returns the updated repo_info with quality scores.
21 | """
22 | full_name = repo_info.get('full_name', 'unknown')
23 | clone_url = repo_info.get('clone_url')
24 | if not clone_url:
25 | repo_info["code_quality_score"] = 0
26 | repo_info["code_quality_issues"] = 0
27 | repo_info["python_files"] = 0
28 | return repo_info
29 |
30 | temp_dir = tempfile.mkdtemp()
31 | repo_path = os.path.join(temp_dir, full_name.split("/")[-1])
32 | try:
33 | from git import Repo
34 | Repo.clone_from(clone_url, repo_path, depth=1, no_single_branch=True)
35 | py_files = list(Path(repo_path).rglob("*.py"))
36 | total_files = len(py_files)
37 | if total_files == 0:
38 | logger.info(f"No Python files found in {full_name}.")
39 | repo_info["code_quality_score"] = 0
40 | repo_info["code_quality_issues"] = 0
41 | repo_info["python_files"] = 0
42 | return repo_info
43 |
44 | process = subprocess.run(
45 | ["flake8", "--max-line-length=120", repo_path],
46 | stdout=subprocess.PIPE,
47 | stderr=subprocess.PIPE,
48 | text=True
49 | )
50 | output = process.stdout.strip()
51 | error_count = len(output.splitlines()) if output else 0
52 | issues_per_file = error_count / total_files
53 | if issues_per_file <= 2:
54 | score = 95 + (2 - issues_per_file) * 2.5
55 | elif issues_per_file <= 5:
56 | score = 70 + (5 - issues_per_file) * 6.5
57 | elif issues_per_file <= 10:
58 | score = 40 + (10 - issues_per_file) * 3
59 | else:
60 | score = max(10, 40 - (issues_per_file - 10) * 2)
61 | repo_info["code_quality_score"] = round(score)
62 | repo_info["code_quality_issues"] = error_count
63 | repo_info["python_files"] = total_files
64 | return repo_info
65 | except Exception as e:
66 | logger.error(f"Error analyzing {full_name}: {e}.")
67 | repo_info["code_quality_score"] = 0
68 | repo_info["code_quality_issues"] = 0
69 | repo_info["python_files"] = 0
70 | return repo_info
71 | finally:
72 | try:
73 | shutil.rmtree(temp_dir, onerror=remove_readonly)
74 | except Exception as cleanup_e:
75 | logger.error(f"Cleanup error for {full_name}: {cleanup_e}")
76 |
77 | async def analyze_code_quality_async(repo_info: dict) -> dict:
78 | """
79 | Asynchronous wrapper that offloads the blocking analyze_code_quality function
80 | to a background thread.
81 | """
82 | return await asyncio.to_thread(analyze_code_quality, repo_info)
83 |
84 | async def code_quality_analysis_async(state, config) -> dict:
85 | """
86 | Concurrently analyzes code quality for all repositories in state.filtered_candidates.
87 | If the decision maker flag (run_code_analysis) is False, it skips analysis.
88 | """
89 | if not getattr(state, "run_code_analysis", False):
90 | logger.info("Skipping code quality analysis as per decision maker.")
91 | state.quality_candidates = []
92 | return {"quality_candidates": state.quality_candidates}
93 |
94 | tasks = []
95 | for repo in state.filtered_candidates:
96 | if "clone_url" not in repo:
97 | repo["clone_url"] = f"https://github.com/{repo['full_name']}.git"
98 | tasks.append(analyze_code_quality_async(repo))
99 | quality_list = await asyncio.gather(*tasks, return_exceptions=True)
100 | # Optionally, filter out any exceptions if they occurred
101 | quality_list = [res for res in quality_list if not isinstance(res, Exception)]
102 | state.quality_candidates = quality_list
103 | logger.info("Code quality analysis complete.")
104 | return {"quality_candidates": state.quality_candidates}
105 |
106 | def code_quality_analysis(state, config):
107 | """
108 | Synchronous wrapper for code quality analysis to maintain the current interface.
109 | """
110 | return asyncio.run(code_quality_analysis_async(state, config))
111 |
--------------------------------------------------------------------------------
/tools/convert_query.py:
--------------------------------------------------------------------------------
1 | # tools/convert_query.py
2 | import logging
3 | from tools.chat import iterative_convert_to_search_tags
4 | from tools.parse_hardware import parse_hardware_spec
5 |
6 | logger = logging.getLogger(__name__)
7 |
8 | def convert_searchable_query(state, config):
9 | # 1) Extract hardware_spec so we can remove it from the tags
10 | parse_hardware_spec(state, config)
11 | hw = state.hardware_spec or ""
12 |
13 | # 2) Generate the raw colon-separated tags
14 | raw = iterative_convert_to_search_tags(state.user_query)
15 |
16 | # 3) Filter out any tag that matches the hardware spec token
17 | filtered = [tag for tag in raw.split(":") if tag and tag != hw]
18 | searchable = ":".join(filtered)
19 |
20 | # 4) Store and log the cleaned searchable query
21 | state.searchable_query = searchable
22 | logger.info(f"Converted searchable query (hardware removed): {searchable}")
23 | return {"searchable_query": searchable}
24 |
--------------------------------------------------------------------------------
/tools/cross_encoder_reranking.py:
--------------------------------------------------------------------------------
1 | # tools/cross_encoder_reranking.py
2 | import numpy as np
3 | import logging
4 | from sentence_transformers import CrossEncoder
5 |
6 | logger = logging.getLogger(__name__)
7 |
8 | def cross_encoder_reranking(state, config):
9 | from agent import AgentConfiguration
10 | agent_config = AgentConfiguration.from_runnable_config(config)
11 | cross_encoder = CrossEncoder(agent_config.cross_encoder_model_name)
12 | # Use top candidates from semantic ranking (e.g., top 100)
13 | candidates_for_rerank = state.semantic_ranked[:100]
14 | logger.info(f"Re-ranking {len(candidates_for_rerank)} candidates with cross-encoder...")
15 |
16 | # Configuration for chunking
17 | CHUNK_SIZE = 2000 # characters per chunk
18 | MAX_DOC_LENGTH = 5000 # cap for long docs
19 | MIN_DOC_LENGTH = 200 # threshold for short docs
20 |
21 | def split_text(text, chunk_size=CHUNK_SIZE):
22 | return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
23 |
24 | def cross_encoder_rerank_func(query, candidates, top_n):
25 | for candidate in candidates:
26 | doc = candidate.get("combined_doc", "")
27 | # Limit document length if needed.
28 | if len(doc) > MAX_DOC_LENGTH:
29 | doc = doc[:MAX_DOC_LENGTH]
30 | try:
31 | if len(doc) < MIN_DOC_LENGTH:
32 | # For very short docs, score directly.
33 | score = cross_encoder.predict([[query, doc]], show_progress_bar=False)
34 | candidate["cross_encoder_score"] = float(score[0])
35 | else:
36 | # For longer docs, split into chunks.
37 | chunks = split_text(doc)
38 | pairs = [[query, chunk] for chunk in chunks]
39 | scores = cross_encoder.predict(pairs, show_progress_bar=False)
40 | # Combine scores: weighted average of max and mean scores.
41 | max_score = np.max(scores) if scores is not None else 0.0
42 | avg_score = np.mean(scores) if scores is not None else 0.0
43 | candidate["cross_encoder_score"] = float(0.5 * max_score + 0.5 * avg_score)
44 | except Exception as e:
45 | logger.error(f"Error scoring candidate {candidate.get('full_name', 'unknown')}: {e}")
46 | candidate["cross_encoder_score"] = 0.0
47 |
48 | # Postprocessing: Shift all scores upward if any are negative.
49 | all_scores = [candidate["cross_encoder_score"] for candidate in candidates]
50 | min_score = min(all_scores)
51 | if min_score < 0:
52 | shift = -min_score
53 | for candidate in candidates:
54 | candidate["cross_encoder_score"] += shift
55 |
56 | # Return top N candidates sorted by cross_encoder_score (descending)
57 | return sorted(candidates, key=lambda x: x["cross_encoder_score"], reverse=True)[:top_n]
58 |
59 | state.reranked_candidates = cross_encoder_rerank_func(
60 | state.user_query,
61 | candidates_for_rerank,
62 | int(agent_config.cross_encoder_top_n)
63 | )
64 | logger.info(f"Cross-encoder re-ranking complete: {len(state.reranked_candidates)} candidates remain.")
65 | return {"reranked_candidates": state.reranked_candidates}
66 |
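
The per-document score above is a fixed 50/50 blend of the best and the average chunk scores; a toy example with invented chunk scores:

```python
import numpy as np

scores = np.array([0.2, 0.9, 0.4])  # hypothetical cross-encoder scores for three chunks
combined = 0.5 * np.max(scores) + 0.5 * np.mean(scores)
print(round(float(combined), 2))    # 0.5 * 0.9 + 0.5 * 0.5 = 0.7
```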
--------------------------------------------------------------------------------
/tools/decision.py:
--------------------------------------------------------------------------------
1 | from langchain_groq import ChatGroq
2 | from langchain_core.prompts import ChatPromptTemplate
3 | import os
4 | from dotenv import load_dotenv
5 | from pathlib import Path
6 |
7 | # Load .env variables
8 | dotenv_path = Path(__file__).resolve().parent.parent / ".env"
9 | if dotenv_path.exists():
10 | load_dotenv(dotenv_path)
11 |
12 | # LLM setup: DeepSeek-R1-Distill
13 | llm = ChatGroq(
14 | model="deepseek-r1-distill-llama-70b",
15 | temperature=0.3,
16 | max_tokens=512,
17 | max_retries=2,
18 | )
19 |
20 | # Prompt for decision making
21 | prompt = ChatPromptTemplate.from_messages([
22 | ("system",
23 | """You are a minimal, resource-efficient filtering agent for a GitHub research tool.
24 |
25 | Your job is to decide whether code-level analysis (e.g., flake8, static checks, linting) should be run on a set of repositories.
26 | **Code analysis should almost never run** — only when the user is **explicitly and repeatedly focused on code structure, correctness, or quality**.
27 |
28 | You must return:
29 | - `0` → **90 percent of the time**. For nearly all queries, especially high-level, research, exploratory, or implementation-related queries.
30 | - `1` → Only if the user uses keywords like: "clean code", "linting", "flake8", "code correctness", "static analysis", or **explicitly demands code quality checks**.
31 |
32 | Also skip analysis if:
33 | - The number of repositories is above 30.
34 | - The query is about concepts, papers, models, architecture, tutorials, demos, agents, or research.
35 | - The user does not emphasize code hygiene or correctness.
36 |
37 | Examples:
38 | - "Show me Gemini agents using ReAct" with 25 repos → `0`
39 | - "Find repos with solid implementation of MoE routing" with 35 repos → `0`
40 | - "Repos with perfect flake8 compliance" with 20 repos → `1`
41 | - "Production-level, bug-free codebases only!" with 15 repos → `1`
42 | - "Tutorials for dataset loaders in PyTorch" with 80 repos → `0`
43 |
44 | Only return one digit: `0` or `1`. No comments, no formatting, no explanations.
45 | """),
46 | ("human", "Query: {query}\nRepo count: {repo_count}")
47 | ])
48 |
49 |
50 | chain = prompt | llm
51 |
52 | # Final function
53 | def should_run_code_analysis(query: str, repo_count: int) -> int:
54 | print(f"\n[Decision Maker] Query: {query} | Repo Count: {repo_count}")
55 | response = chain.invoke({"query": query, "repo_count": repo_count})
56 |
57 | full_output = response.content.strip()
58 | print(f"\n[thinking]\n{full_output}\n")
59 |
60 | # Parse final line for the decision
61 | lines = full_output.splitlines()
62 | # Try to get last non-empty line
63 | for line in reversed(lines):
64 | line = line.strip()
65 | if line in ["0", "1"]:
66 | print(f"[Decision Maker] Decision: {line}")
67 | return int(line)
68 |
69 | print("[Decision Maker] Failed to extract a valid decision. Defaulting to 0.")
70 | return 0
71 |
72 |
73 | # Example usage
74 | if __name__ == "__main__":
75 | query = "I want to find a real quick guide on custom training yolo"
76 | repo_count = 34
77 | decision = should_run_code_analysis(query, repo_count)
78 | print("Should run code analysis?", decision)
79 |
--------------------------------------------------------------------------------
/tools/decision_maker.py:
--------------------------------------------------------------------------------
1 | # tools/decision_maker.py
2 | import logging
3 | from tools.decision import should_run_code_analysis
4 |
5 | logger = logging.getLogger(__name__)
6 |
7 | def decision_maker(state, config):
8 | repo_count = len(state.filtered_candidates)
9 | decision = should_run_code_analysis(state.user_query, repo_count)
10 | state.run_code_analysis = (decision == 1)
11 | logger.info(f"Decision Maker: run_code_analysis = {state.run_code_analysis}")
12 | return {"run_code_analysis": state.run_code_analysis}
13 |
--------------------------------------------------------------------------------
/tools/dense_retrieval.py:
--------------------------------------------------------------------------------
1 | import logging
2 | import torch
3 | import numpy as np
4 | from transformers import AutoTokenizer, AutoModel
5 | from rank_bm25 import BM25Okapi
6 |
7 | logger = logging.getLogger(__name__)
8 |
9 |
10 | def hybrid_dense_retrieval(state, config):
11 | """
12 | Performs advanced hybrid dense retrieval using ColBERTv2 embeddings (CPU by default)
13 | fused with BM25 sparse retrieval on the combined repository documentation.
14 |
15 | Args:
16 | state: Agent state containing `user_query` (str) and `repositories` (list of dicts with 'combined_doc').
17 | config: Runnable configuration dict, optionally containing `configurable` overrides.
18 |
19 | Returns:
20 | dict with key 'semantic_ranked' containing the repositories sorted by combined score.
21 | """
22 | # Extract parameters directly from the config dict without importing AgentConfiguration
23 | cfg = config.get("configurable", {}) if isinstance(config, dict) else {}
24 | colbert_model_name = cfg.get("colbert_model_name", "colbert-ir/colbertv2.0")
25 | alpha = cfg.get("retrieval_alpha", 0.7)
26 |
27 | logger.info(f"Loading ColBERT model '{colbert_model_name}' for advanced vector embeddings...")
28 | device = cfg.get("device", "cpu")
29 | tokenizer = AutoTokenizer.from_pretrained(colbert_model_name)
30 | colbert_model = AutoModel.from_pretrained(colbert_model_name)
31 | colbert_model.to(device)
32 | colbert_model.eval()
33 |
34 | # Gather documents
35 | docs = [repo.get("combined_doc", "") for repo in state.repositories]
36 | if not docs:
37 | logger.warning("No documentation found in any repository. Skipping dense retrieval.")
38 | state.semantic_ranked = []
39 | return {"semantic_ranked": state.semantic_ranked}
40 |
41 | def encode_colbert(text: str) -> np.ndarray:
42 | """
43 | Token-level normalized embeddings for a single text via ColBERT.
44 | Returns an array of shape (num_tokens, embedding_dim).
45 | """
46 | inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True)
47 | inputs = {k: v.to(device) for k, v in inputs.items()}
48 | with torch.no_grad():
49 | outputs = colbert_model(**inputs)
50 | embeddings = outputs.last_hidden_state.squeeze(0)
51 | # Normalize each token embedding
52 | embeddings = embeddings / (embeddings.norm(dim=1, keepdim=True) + 1e-10)
53 | return embeddings.cpu().numpy()
54 |
55 | # Encode the user query
56 | logger.info("Encoding user query using ColBERT model...")
57 | query_embeddings = encode_colbert(state.user_query)
58 |
59 | # Compute ColBERT-based scores for each document
60 | logger.info(f"Scoring {len(docs)} documents with ColBERT embeddings...")
61 | colbert_scores = []
62 | for idx, doc in enumerate(docs):
63 | if not doc.strip():
64 | colbert_scores.append(0.0)
65 | continue
66 | try:
67 | doc_embeddings = encode_colbert(doc)
68 | # similarity matrix: query tokens vs doc tokens
69 | sim_matrix = np.dot(query_embeddings, doc_embeddings.T)
70 | # for each query token, take its max match in the document
71 | max_per_query = sim_matrix.max(axis=1)
72 | score = float(max_per_query.sum())
73 | colbert_scores.append(score)
74 | except Exception as e:
75 | logger.error(f"Error in ColBERT scoring for doc {idx}: {e}")
76 | colbert_scores.append(0.0)
77 |
78 | colbert_arr = np.array(colbert_scores)
79 | c_min, c_max = colbert_arr.min(), colbert_arr.max()
80 | norm_colbert = (colbert_arr - c_min) / (c_max - c_min + 1e-10)
81 |
82 | # BM25 sparse retrieval
83 | tokenized_docs = [doc.split() for doc in docs]
84 | bm25 = BM25Okapi(tokenized_docs)
85 | query_tokens = state.user_query.split()
86 | bm25_scores = np.array(bm25.get_scores(query_tokens))
87 | b_min, b_max = bm25_scores.min(), bm25_scores.max()
88 | norm_bm25 = (bm25_scores - b_min) / (b_max - b_min + 1e-10)
89 |
90 | # Combine dense and sparse signals
91 | combined = alpha * norm_colbert + (1 - alpha) * norm_bm25
92 |
93 | # Attach scores and sort repositories
94 | for idx, repo in enumerate(state.repositories):
95 | repo["semantic_similarity"] = float(combined[idx])
96 |
97 | state.semantic_ranked = sorted(
98 | state.repositories,
99 | key=lambda x: x.get("semantic_similarity", 0),
100 | reverse=True
101 | )
102 | logger.info(f"Hybrid ColBERT retrieval complete: {len(state.semantic_ranked)} candidates ranked.")
103 |
104 | return {"semantic_ranked": state.semantic_ranked}
105 |
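
For intuition on the scoring loop above: each query token is matched against its best document token and the maxima are summed (the ColBERT "MaxSim" idea). A toy example with hand-picked unit vectors:

```python
import numpy as np

query_embeddings = np.array([[1.0, 0.0], [0.0, 1.0]])            # 2 query tokens
doc_embeddings = np.array([[0.8, 0.6], [0.0, 1.0], [1.0, 0.0]])  # 3 document tokens
sim_matrix = np.dot(query_embeddings, doc_embeddings.T)          # token-level similarities
score = float(sim_matrix.max(axis=1).sum())                      # best match per query token, summed
print(score)  # 1.0 + 1.0 = 2.0
```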
--------------------------------------------------------------------------------
/tools/dependency_analysis.py:
--------------------------------------------------------------------------------
1 | # tools/dependency_analysis.py
2 | import os, logging, httpx, toml, base64
3 | from functools import lru_cache
4 | from tools.chat import chain
5 |
6 | logger = logging.getLogger(__name__)
7 |
8 | QUESTION_TMPL = (
9 | "Given the following dependency list, can this project run on {hw}? "
10 | "Answer YES or NO and a short reason.\n\nDependencies:\n{deps}"
11 | )
12 |
13 | @lru_cache(maxsize=1024)
14 | def _gh_raw(owner: str, repo: str, path: str, token: str) -> str | None:
15 | url = f"https://api.github.com/repos/{owner}/{repo}/contents/{path}"
16 | r = httpx.get(url, headers={"Authorization": f"token {token}"})
17 | if r.status_code != 200:
18 | return None
19 | data = r.json()
20 | if data.get("encoding") == "base64":
21 | return base64.b64decode(data["content"]).decode("utf-8")
22 | return data.get("content", "")
23 |
24 | def _collect_deps(owner: str, repo: str, token: str) -> list[str]:
25 | reqs = _gh_raw(owner, repo, "requirements.txt", token) or ""
26 | py = _gh_raw(owner, repo, "pyproject.toml", token) or ""
27 | deps = [l.strip() for l in reqs.splitlines() if l.strip() and not l.startswith("#")]
28 | if py:
29 | try:
30 | deps += list(toml.loads(py).get("tool",{}).get("poetry",{}).get("dependencies",{}).keys())
31 | except Exception:
32 | pass
33 | return deps
34 |
35 | def dependency_analysis(state, config):
36 | hw = state.hardware_spec # None means “no constraint”
37 | cand = state.filtered_candidates
38 | if not hw:
39 | state.hardware_filtered = cand
40 | return {"hardware_filtered": cand}
41 |
42 | token = os.getenv("GITHUB_API_KEY", "")
43 | kept = []
44 |
45 | for repo in cand:
46 | full = repo.get("full_name", "")
47 | if "/" not in full:
48 | kept.append(repo); continue
49 | o, n = full.split("/", 1)
50 |
51 | deps = _collect_deps(o, n, token)
52 | if not deps: # empty list = assume lightweight
53 | kept.append(repo); continue
54 |
55 | prompt = QUESTION_TMPL.format(hw=hw, deps=", ".join(deps[:25]))
56 | ans = chain.invoke({"query": prompt}).content.strip().split()[0].upper()
57 | if ans == "YES":
58 | kept.append(repo)
59 | else:
60 | logger.info(f"[Deps] drop {full} for {hw}")
61 |
62 | state.hardware_filtered = kept
63 | return {"hardware_filtered": kept}
64 |
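
Only the first token of the model's reply is inspected; a sketch of that parse on an invented reply (the real answer comes from `chain.invoke`):

```python
reply = "NO - flash-attn and deepspeed generally require a CUDA GPU."  # hypothetical LLM reply
ans = reply.strip().split()[0].upper()
print(ans)           # NO
print(ans == "YES")  # False -> the repo would be dropped for a cpu-only hardware spec
```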
--------------------------------------------------------------------------------
/tools/evaluation.py:
--------------------------------------------------------------------------------
1 | import os
2 | import base64
3 | import requests
4 | import numpy as np
5 | import datetime
6 | import torch
7 | import math
8 | import logging
9 | import getpass
10 | from typing import Dict, List, Any, TypedDict
11 | from pathlib import Path
12 | from dotenv import load_dotenv
13 |
14 | from sentence_transformers import SentenceTransformer, CrossEncoder
15 | import faiss
16 |
17 | from langchain_groq import ChatGroq
18 | from langchain_core.tools import tool
19 | from langchain_core.prompts import ChatPromptTemplate
20 | from langgraph.graph import StateGraph, END
21 |
22 |
23 | # ---------------------------
24 | # Environment and .env Setup
25 | # ---------------------------
26 | # Resolve the path to the root directory's .env file
27 | dotenv_path = Path(__file__).resolve().parent.parent / ".env"
28 | load_dotenv(dotenv_path=str(dotenv_path))
29 | # ------------------------------------------------------------------
30 | # Bitsandbytes & Environment Setup
31 | # ------------------------------------------------------------------
32 | os.environ["BITSANDBYTES_NOWELCOME"] = "1"
33 | os.environ["BITSANDBYTES_DISABLE_GPU"] = "1"
34 |
35 | # Load .env if available
36 | dotenv_path = Path(__file__).resolve().parent.parent / ".env"
37 | if dotenv_path.exists():
38 | load_dotenv(dotenv_path=str(dotenv_path))
39 |
40 | if "GITHUB_API_KEY" not in os.environ:
41 | os.environ["GITHUB_API_KEY"] = getpass.getpass("Enter your GitHub API key: ")
42 |
43 | # ------------------------------------------------------------------
44 | # Logging Setup
45 | # ------------------------------------------------------------------
46 | logging.basicConfig(level=logging.INFO, format="%(asctime)s - %(levelname)s - %(message)s")
47 | logger = logging.getLogger(__name__)
48 |
49 | # ------------------------------------------------------------------
50 | # ChatGroq Setup (for query enhancement and justification)
51 | # ------------------------------------------------------------------
52 | llm_groq = ChatGroq(
53 | model="llama-3.1-8b-instant",
54 | temperature=0.2,
55 | max_tokens=100,
56 | timeout=15,
57 | max_retries=2
58 | )
59 |
60 | # ------------------------------------------------------------------
61 | # GitHub Headers Setup
62 | # ------------------------------------------------------------------
63 | gh_headers = {
64 | "Authorization": f"token {os.environ.get('GITHUB_API_KEY')}",
65 | "Accept": "application/vnd.github.v3+json"
66 | }
67 |
68 | # ------------------------------------------------------------------
69 | # Define the Agent State
70 | # ------------------------------------------------------------------
71 | class AgentState(TypedDict):
72 | original_query: str
73 | enhanced_query: str
74 | github_query: str
75 | candidates: List[Dict[str, Any]]
76 | final_ranked: List[Dict[str, Any]]
77 | justifications: Dict[str, str]
78 |
79 | # ------------------------------------------------------------------
80 | # Helper Functions for Repository Documentation
81 | # ------------------------------------------------------------------
82 | def fetch_file_content(download_url: str) -> str:
83 | try:
84 | response = requests.get(download_url)
85 | if response.status_code == 200:
86 | return response.text
87 | except Exception as e:
88 | logger.error(f"Error fetching file from {download_url}: {e}")
89 | return ""
90 |
91 | def fetch_directory_markdown(repo_full_name: str, path: str) -> str:
92 | md_content = ""
93 | url = f"https://api.github.com/repos/{repo_full_name}/contents/{path}"
94 | response = requests.get(url, headers=gh_headers)
95 | if response.status_code == 200:
96 | items = response.json()
97 | for item in items:
98 | if item["type"] == "file" and item["name"].lower().endswith(".md"):
99 | content = fetch_file_content(item["download_url"])
100 | md_content += f"\n\n# {item['name']}\n" + content
101 | return md_content
102 |
103 | def fetch_repo_documentation(repo_full_name: str) -> str:
104 | doc_text = ""
105 | readme_url = f"https://api.github.com/repos/{repo_full_name}/readme"
106 | response = requests.get(readme_url, headers=gh_headers)
107 | if response.status_code == 200:
108 | readme_data = response.json()
109 | try:
110 | decoded = base64.b64decode(readme_data['content']).decode('utf-8')
111 | except Exception as e:
112 | decoded = ""
113 | logger.error(f"Error decoding readme for {repo_full_name}: {e}")
114 | doc_text += "# README\n" + decoded
115 | root_url = f"https://api.github.com/repos/{repo_full_name}/contents"
116 | response = requests.get(root_url, headers=gh_headers)
117 | if response.status_code == 200:
118 | items = response.json()
119 | for item in items:
120 | if item["type"] == "file" and item["name"].lower().endswith(".md") and item["name"].lower() != "readme.md":
121 | content = fetch_file_content(item["download_url"])
122 | doc_text += f"\n\n# {item['name']}\n" + content
123 | elif item["type"] == "dir" and item["name"].lower() in ["docs", "documentation"]:
124 | doc_text += f"\n\n# {item['name']} folder\n" + fetch_directory_markdown(repo_full_name, item["name"])
125 | return doc_text if doc_text.strip() else "No documentation available."
126 |
127 | # ------------------------------------------------------------------
128 | # Helper: Extract Enhanced Query as a String
129 | # ------------------------------------------------------------------
130 | def get_enhanced_query(original_query: str) -> str:
131 | result = enhance_query_tool.invoke({"original_query": original_query})
132 | # If the result has a 'content' attribute, extract it; otherwise convert to string.
133 | return result.content if hasattr(result, "content") else str(result)
134 |
135 | # ------------------------------------------------------------------
136 | # Tool Definitions for Each Stage of the Pipeline
137 | # ------------------------------------------------------------------
138 |
139 | @tool
140 | def enhance_query_tool(original_query: str) -> str:
141 | """
142 | Enhances the query for GitHub search by adding technical keywords and context,
143 | then returns only a valid GitHub search query using GitHub search syntax.
144 | """
145 | prompt = f"""You are an expert GitHub search assistant. Given the research topic: "{original_query}",
146 | generate a highly effective GitHub search query. Use only GitHub search syntax (e.g., language:python, keywords, filters).
147 | Return ONLY the optimized query with no additional explanation."""
148 | messages = [
149 | ("system", "You are a helpful research assistant specializing in GitHub search."),
150 | ("human", prompt)
151 | ]
152 | result = llm_groq.invoke(messages)
153 | logger.info(f"Enhanced Query: {result}")
154 | # Extract the query text (assuming it is returned in the 'content' field)
155 | return result.content if hasattr(result, "content") else str(result)
156 |
157 |
158 | @tool
159 | def fetch_github_repositories_tool(query: str, max_results: int = 1000, per_page: int = 100) -> List[Dict[str, Any]]:
160 | """
161 | Searches GitHub repositories using the given query.
162 | """
163 | url = "https://api.github.com/search/repositories"
164 | repositories = []
165 | num_pages = max_results // per_page
166 | for page in range(1, num_pages + 1):
167 | params = {
168 | "q": query,
169 | "sort": "stars",
170 | "order": "desc",
171 | "per_page": per_page,
172 | "page": page
173 | }
174 | response = requests.get(url, headers=gh_headers, params=params)
175 | if response.status_code != 200:
176 | logger.error(f"Error {response.status_code}: {response.json().get('message')}")
177 | break
178 | items = response.json().get('items', [])
179 | if not items:
180 | break
181 | for repo in items:
182 | full_name = repo.get("full_name", "")
183 | doc_content = fetch_repo_documentation(full_name)
184 | repositories.append({
185 | "title": repo.get("name", "No title available"),
186 | "link": repo.get("html_url", ""),
187 | "combined_doc": doc_content,
188 | "stars": repo.get("stargazers_count", 0),
189 | "full_name": full_name,
190 | "open_issues_count": repo.get("open_issues_count", 0)
191 | })
192 | logger.info(f"Fetched {len(repositories)} repositories from GitHub.")
193 | return repositories
194 |
195 | @tool
196 | def semantic_ranking_tool(query: str, candidates: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
197 | """
198 | Ranks candidates using SentenceTransformer and FAISS.
199 | """
200 | if not candidates:
201 | logger.info("No candidates provided for semantic ranking. Returning empty list.")
202 | return []
203 | docs = [repo.get("combined_doc", "") for repo in candidates]
204 | sem_model = SentenceTransformer("all-mpnet-base-v2")
205 | logger.info(f"Encoding {len(docs)} documents for dense retrieval...")
206 | doc_embeddings = sem_model.encode(docs, convert_to_numpy=True, show_progress_bar=True, batch_size=16)
207 | if doc_embeddings.ndim == 1:
208 | doc_embeddings = doc_embeddings.reshape(1, -1)
209 | else:
210 | norms = np.linalg.norm(doc_embeddings, axis=1, keepdims=True)
211 | doc_embeddings = doc_embeddings / (norms + 1e-10)
212 | query_embedding = sem_model.encode(query, convert_to_numpy=True).reshape(1, -1)
213 | query_norm = np.linalg.norm(query_embedding, axis=1, keepdims=True)
214 | query_embedding = query_embedding / (query_norm + 1e-10)
215 | dim = doc_embeddings.shape[1]
216 | index = faiss.IndexFlatIP(dim)
217 | index.add(doc_embeddings)
218 | k = min(100, len(candidates))
219 | distances, indices = index.search(query_embedding, k)
220 | for idx, score in zip(indices[0], distances[0]):
221 | candidates[idx]["semantic_similarity"] = score
222 | ranked = sorted(candidates, key=lambda x: x.get("semantic_similarity", 0), reverse=True)
223 | logger.info(f"Semantic ranking complete: {len(ranked)} candidates.")
224 | return ranked
225 |
226 | @tool
227 | def cross_encoder_rerank_tool(query: str, candidates: List[Dict[str, Any]], top_n: int = 50) -> List[Dict[str, Any]]:
228 | """
229 | Re-ranks candidates using a CrossEncoder.
230 | """
231 | if not candidates:
232 | logger.info("No candidates for cross-encoder reranking. Returning empty list.")
233 | return []
234 | cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
235 | pairs = [[query, candidate["combined_doc"]] for candidate in candidates]
236 | scores = cross_encoder.predict(pairs, show_progress_bar=True)
237 | for candidate, score in zip(candidates, scores):
238 | candidate["cross_encoder_score"] = score
239 | reranked = sorted(candidates, key=lambda x: x["cross_encoder_score"], reverse=True)[:top_n]
240 | logger.info(f"Cross-encoder reranking complete: {len(reranked)} candidates.")
241 | return reranked
242 |
243 | @tool
244 | def filter_low_star_repos_tool(candidates: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
245 | """
246 | Filters out repositories with low star counts (unless they have high cross-encoder scores).
247 | """
248 | if not candidates:
249 | logger.info("No candidates for filtering. Returning empty list.")
250 | return []
251 | filtered = [repo for repo in candidates if repo["stars"] >= 50 or repo.get("cross_encoder_score", 0) >= 5.5]
252 | if not filtered:
253 | filtered = candidates
254 | logger.info(f"Filtered {len(filtered)} candidates after low-star filtering.")
255 | return filtered
256 |
257 | @tool
258 | def analyze_activity_tool(candidates: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
259 | """
260 | Analyzes repository activity based on PRs, commits, and issues.
261 | """
262 | if not candidates:
263 | logger.info("No candidates for activity analysis. Returning empty list.")
264 | return []
265 | for repo in candidates:
266 | full_name = repo.get("full_name", "")
267 | pr_url = f"https://api.github.com/repos/{full_name}/pulls"
268 | pr_response = requests.get(pr_url, headers=gh_headers, params={"state": "open", "per_page": 100})
269 | pr_count = len(pr_response.json()) if pr_response.status_code == 200 else 0
270 | commits_url = f"https://api.github.com/repos/{full_name}/commits"
271 | commits_response = requests.get(commits_url, headers=gh_headers, params={"per_page": 1})
272 | if commits_response.status_code == 200 and commits_response.json():
273 | commit_date_str = commits_response.json()[0]["commit"]["committer"]["date"]
274 | commit_date = datetime.datetime.fromisoformat(commit_date_str.rstrip("Z"))
275 | days_diff = (datetime.datetime.utcnow() - commit_date).days
276 | else:
277 | days_diff = 999
278 | open_issues = repo.get("open_issues_count", 0)
279 | non_pr_issues = max(0, open_issues - pr_count)
280 | activity_score = (3 * pr_count) + non_pr_issues - (days_diff / 30)
281 | repo.update({"pr_count": pr_count, "latest_commit_days": days_diff, "activity_score": activity_score})
282 | logger.info("Activity analysis complete.")
283 | return candidates
284 |
285 | @tool
286 | def final_scoring_tool(candidates: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
287 | """
288 | Combines semantic, cross-encoder, activity, and star scores for final ranking.
289 | """
290 | if not candidates:
291 | logger.info("No candidates for final scoring. Returning empty list.")
292 | return []
293 | semantic_scores = [repo.get("semantic_similarity", 0) for repo in candidates]
294 | cross_encoder_scores = [repo.get("cross_encoder_score", 0) for repo in candidates]
295 | activity_scores = [repo.get("activity_score", -100) for repo in candidates]
296 | star_scores = [math.log(repo.get("stars", 0) + 1) for repo in candidates]
297 | min_sem, max_sem = min(semantic_scores), max(semantic_scores)
298 | min_ce, max_ce = min(cross_encoder_scores), max(cross_encoder_scores)
299 | min_act, max_act = min(activity_scores), max(activity_scores)
300 | min_star, max_star = min(star_scores), max(star_scores)
301 | def normalize(val, min_val, max_val):
302 | if max_val - min_val == 0:
303 | return 0.5
304 | return (val - min_val) / (max_val - min_val)
305 | for repo in candidates:
306 | norm_sem = normalize(repo.get("semantic_similarity", 0), min_sem, max_sem)
307 | norm_ce = normalize(repo.get("cross_encoder_score", 0), min_ce, max_ce)
308 | norm_act = normalize(repo.get("activity_score", -100), min_act, max_act)
309 | norm_star = normalize(math.log(repo.get("stars", 0) + 1), min_star, max_star)
310 | repo["final_score"] = 0.3 * norm_ce + 0.2 * norm_sem + 0.2 * norm_act + 0.3 * norm_star
311 | ranked = sorted(candidates, key=lambda x: x["final_score"], reverse=True)
312 | logger.info(f"Final scoring complete: {len(ranked)} candidates.")
313 | return ranked
314 |
315 | @tool
316 | def justify_candidates_tool(candidates: List[Dict[str, Any]], top_n: int = 10) -> Dict[str, str]:
317 | """
318 | Generates a brief justification for each of the top candidates.
319 | """
320 | if not candidates:
321 | logger.info("No candidates for justification. Returning empty dictionary.")
322 | return {}
323 | justifications = {}
324 | for repo in candidates[:top_n]:
325 | prompt = f"""You are a highly knowledgeable AI research assistant. In one to two lines, explain why the repository titled "{repo['title']}" is a good match for a query on Chain of Thought prompting in large language models within a Python environment. Mention key factors such as documentation quality, activity, and community validation if relevant.
326 |
327 | Repository Details:
328 | - Stars: {repo['stars']}
329 | - Semantic Similarity: {repo.get('semantic_similarity', 0):.4f}
330 | - Cross-Encoder Score: {repo.get('cross_encoder_score', 0):.4f}
331 | - Activity Score: {repo.get('activity_score', 0):.2f}
332 |
333 | Provide a concise justification:"""
334 | messages = [
335 | ("system", "You are a highly knowledgeable AI research assistant that can succinctly justify repository matches."),
336 | ("human", prompt)
337 | ]
338 | result = llm_groq.invoke(messages)
339 | justifications[repo["title"]] = result
340 | logger.info(f"Justification for {repo['title']}: {result}")
341 | return justifications
342 |
343 | # ------------------------------------------------------------------
344 | # Workflow Definition using LangGraph
345 | # ------------------------------------------------------------------
346 | workflow = StateGraph(AgentState)
347 |
348 | # Use the helper function to ensure we get a plain string for the enhanced query.
349 | workflow.add_node("enhance_query", lambda state: {
350 | "enhanced_query": get_enhanced_query(state["original_query"])
351 | })
352 | workflow.add_node("set_github_query", lambda state: {
353 | "github_query": state["enhanced_query"] + " language:python"
354 | })
355 | workflow.add_node("fetch_repos", lambda state: {
356 | "candidates": fetch_github_repositories_tool.invoke({"query": state["github_query"]})
357 | })
358 | workflow.add_node("semantic_rank", lambda state: {
359 | "candidates": semantic_ranking_tool.invoke({
360 | "query": state["original_query"],
361 | "candidates": state["candidates"]
362 | })
363 | })
364 | workflow.add_node("cross_encoder_rerank", lambda state: {
365 | "candidates": cross_encoder_rerank_tool.invoke({
366 | "query": state["original_query"],
367 | "candidates": state["candidates"]
368 | })
369 | })
370 | workflow.add_node("filter_low_star", lambda state: {
371 | "candidates": filter_low_star_repos_tool.invoke({"candidates": state["candidates"]})
372 | })
373 | workflow.add_node("analyze_activity", lambda state: {
374 | "candidates": analyze_activity_tool.invoke({"candidates": state["candidates"]})
375 | })
376 | workflow.add_node("final_scoring", lambda state: {
377 | "final_ranked": final_scoring_tool.invoke({"candidates": state["candidates"]})
378 | })
379 | workflow.add_node("justify", lambda state: {
380 | "justifications": justify_candidates_tool.invoke({"candidates": state["final_ranked"]})
381 | })
382 |
383 | workflow.set_entry_point("enhance_query")
384 | workflow.add_edge("enhance_query", "set_github_query")
385 | workflow.add_edge("set_github_query", "fetch_repos")
386 | workflow.add_edge("fetch_repos", "semantic_rank")
387 | workflow.add_edge("semantic_rank", "cross_encoder_rerank")
388 | workflow.add_edge("cross_encoder_rerank", "filter_low_star")
389 | workflow.add_edge("filter_low_star", "analyze_activity")
390 | workflow.add_edge("analyze_activity", "final_scoring")
391 | workflow.add_edge("final_scoring", "justify")
392 | workflow.add_edge("justify", END)
393 |
394 | agent = workflow.compile()
395 |
396 | # ------------------------------------------------------------------
397 | # Execute the Agent Workflow
398 | # ------------------------------------------------------------------
399 | initial_state = {
400 | "original_query": "I am looking for finetuning gemini models."
401 | }
402 | result = agent.invoke(initial_state)
403 |
404 | # ------------------------------------------------------------------
405 | # Final Output
406 | # ------------------------------------------------------------------
407 | print("\n=== Final Ranked Repositories ===")
408 | for rank, repo in enumerate(result["final_ranked"][:10], 1):
409 | print(f"Final Rank: {rank}")
410 | print(f"Title: {repo['title']}")
411 | print(f"Link: {repo['link']}")
412 | print(f"Stars: {repo['stars']}")
413 | print(f"Semantic Similarity: {repo.get('semantic_similarity', 0):.4f}")
414 | print(f"Cross-Encoder Score: {repo.get('cross_encoder_score', 0):.4f}")
415 | print(f"Activity Score: {repo.get('activity_score', 0):.2f}")
416 | print(f"Final Score: {repo.get('final_score', 0):.4f}")
417 | print(f"Justification: {result['justifications'].get(repo['title'], 'No justification available')}")
418 | print(f"Combined Doc Snippet: {repo['combined_doc'][:200]}...")
419 | print('-' * 80)
420 | print("\n=== End of Results ===")
421 |
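
The final score above is a fixed-weight blend of min-max normalized signals (0.3 cross-encoder, 0.2 semantic, 0.2 activity, 0.3 stars); a toy calculation with invented, already-normalized values:

```python
norm_ce, norm_sem, norm_act, norm_star = 0.9, 0.8, 0.4, 0.6  # hypothetical normalized scores
final_score = 0.3 * norm_ce + 0.2 * norm_sem + 0.2 * norm_act + 0.3 * norm_star
print(round(final_score, 2))  # 0.27 + 0.16 + 0.08 + 0.18 = 0.69
```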
--------------------------------------------------------------------------------
/tools/filtering.py:
--------------------------------------------------------------------------------
1 | # tools/filtering.py
2 | import logging
3 |
4 | logger = logging.getLogger(__name__)
5 |
6 | def threshold_filtering(state, config):
7 | """
8 | 1) Filters out repos with too few stars AND too-low cross-encoder scores.
9 | 2) If the user specified hardware constraints (state.hardware_spec),
10 | narrows down to state.hardware_filtered (populated in dependency_analysis).
11 | """
12 | # Import config schema lazily to avoid circular dependency
13 | from agent import AgentConfiguration
14 | agent_config = AgentConfiguration.from_runnable_config(config)
15 |
16 | # 1) Basic star + cross-encoder cutoff
17 | filtered = []
18 | for repo in state.reranked_candidates:
19 | stars = repo.get("stars", 0)
20 | ce_score = repo.get("cross_encoder_score", 0.0)
21 | # drop only if BOTH the star count AND cross-encoder score are too low
22 | if stars < agent_config.min_stars and ce_score < agent_config.cross_encoder_threshold:
23 | continue
24 | filtered.append(repo)
25 |
26 | # if nothing passes, keep all reranked candidates
27 | if not filtered:
28 | filtered = list(state.reranked_candidates)
29 |
30 | # 2) Apply hardware filter if specified by user
31 | if getattr(state, "hardware_spec", None):
32 | hw_filtered = getattr(state, "hardware_filtered", None)
33 | if hw_filtered:
34 | filtered = hw_filtered
35 | else:
36 | logger.info(
37 | "Hardware spec provided but no hardware_filtered list found; "
38 | "skipping hardware filter."
39 | )
40 |
41 | state.filtered_candidates = filtered
42 | logger.info(
43 | f"Filtering complete: {len(filtered)} candidates remain "
44 | f"(after thresholds{' + hardware filter' if state.hardware_spec else ''})."
45 | )
46 | return {"filtered_candidates": filtered}
47 |
--------------------------------------------------------------------------------
/tools/github.py:
--------------------------------------------------------------------------------
1 | # tools/github.py
2 | import os
3 | import base64
4 | import logging
5 | import asyncio
6 | from pathlib import Path
7 | import httpx
8 | from tools.mcp_adapter import mcp_adapter # Import our MCP adapter
9 |
10 | logger = logging.getLogger(__name__)
11 |
12 | # In-memory cache to store file content for given URLs
13 | FILE_CONTENT_CACHE = {}
14 |
15 | async def fetch_readme_content(repo_full_name: str, headers: dict, client: httpx.AsyncClient) -> str:
16 | readme_url = f"https://api.github.com/repos/{repo_full_name}/readme"
17 | try:
18 | response = await mcp_adapter.fetch(readme_url, headers=headers, client=client)
19 | if response.status_code == 200:
20 | readme_data = response.json()
21 | content = readme_data.get('content', '')
22 | if content:
23 | return base64.b64decode(content).decode('utf-8')
24 | except Exception as e:
25 | logger.error(f"Error fetching README for {repo_full_name}: {e}")
26 | return ""
27 |
28 | async def fetch_file_content(download_url: str, client: httpx.AsyncClient) -> str:
29 | if download_url in FILE_CONTENT_CACHE:
30 | return FILE_CONTENT_CACHE[download_url]
31 | try:
32 | response = await mcp_adapter.fetch(download_url, client=client)
33 | if response.status_code == 200:
34 | text = response.text
35 | FILE_CONTENT_CACHE[download_url] = text
36 | return text
37 | except Exception as e:
38 | logger.error(f"Error fetching file from {download_url}: {e}")
39 | return ""
40 |
41 | async def fetch_directory_markdown(repo_full_name: str, path: str, headers: dict, client: httpx.AsyncClient) -> str:
42 | md_content = ""
43 | url = f"https://api.github.com/repos/{repo_full_name}/contents/{path}"
44 | try:
45 | response = await mcp_adapter.fetch(url, headers=headers, client=client)
46 | if response.status_code == 200:
47 | items = response.json()
48 | tasks = []
49 | for item in items:
50 | if item["type"] == "file" and item["name"].lower().endswith(".md"):
51 | tasks.append(fetch_file_content(item["download_url"], client))
52 | if tasks:
53 | results = await asyncio.gather(*tasks, return_exceptions=True)
54 | for item, content in zip(items, results):
55 | if item["type"] == "file" and item["name"].lower().endswith(".md") and not isinstance(content, Exception):
56 | md_content += f"\n\n# {item['name']}\n" + content
57 | except Exception as e:
58 | logger.error(f"Error fetching directory markdown for {repo_full_name}/{path}: {e}")
59 | return md_content
60 |
61 | async def fetch_repo_documentation(repo_full_name: str, headers: dict, client: httpx.AsyncClient) -> str:
62 | doc_text = ""
63 | readme_task = asyncio.create_task(fetch_readme_content(repo_full_name, headers, client))
64 | root_url = f"https://api.github.com/repos/{repo_full_name}/contents"
65 | try:
66 | response = await mcp_adapter.fetch(root_url, headers=headers, client=client)
67 | if response.status_code == 200:
68 | items = response.json()
69 | tasks = []
70 | for item in items:
71 | if item["type"] == "file" and item["name"].lower().endswith(".md"):
72 | if item["name"].lower() != "readme.md":
73 | tasks.append(asyncio.create_task(fetch_file_content(item["download_url"], client)))
74 | elif item["type"] == "dir" and item["name"].lower() in ["docs", "documentation"]:
75 | tasks.append(asyncio.create_task(fetch_directory_markdown(repo_full_name, item["name"], headers, client)))
76 | results = await asyncio.gather(*tasks, return_exceptions=True)
77 | for res in results:
78 | if not isinstance(res, Exception):
79 | doc_text += "\n\n" + res
80 | except Exception as e:
81 | logger.error(f"Error fetching repository contents for {repo_full_name}: {e}")
82 | readme = await readme_task
83 | if readme:
84 | doc_text = "# README\n" + readme + doc_text
85 | return doc_text if doc_text.strip() else "No documentation available."
86 |
87 | async def fetch_github_repositories(query: str, max_results: int, per_page: int, headers: dict) -> list:
88 | url = "https://api.github.com/search/repositories"
89 | repositories = []
90 | num_pages = max_results // per_page
91 | async with httpx.AsyncClient() as client:
92 | for page in range(1, num_pages + 1):
93 | params = {
94 | "q": query,
95 | "sort": "stars",
96 | "order": "desc",
97 | "per_page": per_page,
98 | "page": page
99 | }
100 | try:
101 | response = await mcp_adapter.fetch(url, headers=headers, params=params, client=client)
102 | if response.status_code != 200:
103 | logger.error(f"Error {response.status_code}: {response.json().get('message')}")
104 | break
105 | items = response.json().get('items', [])
106 | if not items:
107 | break
108 | tasks = []
109 | for repo in items:
110 | full_name = repo.get('full_name', '')
111 | tasks.append(asyncio.create_task(fetch_repo_documentation(full_name, headers, client)))
112 | docs = await asyncio.gather(*tasks, return_exceptions=True)
113 | for repo, doc in zip(items, docs):
114 | repo_link = repo['html_url']
115 | full_name = repo.get('full_name', '')
116 | clone_url = repo.get('clone_url', f"https://github.com/{full_name}.git")
117 | star_count = repo.get('stargazers_count', 0)
118 | repositories.append({
119 | "title": repo.get('name', 'No title available'),
120 | "link": repo_link,
121 | "clone_url": clone_url,
122 | "combined_doc": doc if not isinstance(doc, Exception) else "",
123 | "stars": star_count,
124 | "full_name": full_name,
125 | "open_issues_count": repo.get('open_issues_count', 0)
126 | })
127 | except Exception as e:
128 | logger.error(f"Error fetching repositories for query {query}: {e}")
129 | break
130 | logger.info(f"Fetched {len(repositories)} repositories for query '{query}'.")
131 | return repositories
132 |
133 | async def ingest_github_repos_async(state, config) -> dict:
134 | headers = {
135 | "Authorization": f"token {os.getenv('GITHUB_API_KEY')}",
136 | "Accept": "application/vnd.github.v3+json"
137 | }
138 | keyword_list = [kw.strip() for kw in state.searchable_query.split(":") if kw.strip()]
139 | logger.info(f"Searchable keywords (raw): {keyword_list}")
140 |
141 | target_language = "python"
142 | filtered_keywords = []
143 | for kw in keyword_list:
144 | if kw.startswith("target-"):
145 | target_language = kw.split("target-")[-1]
146 | else:
147 | filtered_keywords.append(kw)
148 | keyword_list = filtered_keywords
149 | logger.info(f"Filtered keywords: {keyword_list} | Target language: {target_language}")
150 |
151 | all_repos = []
152 | from agent import AgentConfiguration
153 | agent_config = AgentConfiguration.from_runnable_config(config)
154 | tasks = []
155 | for keyword in keyword_list:
156 | query = f"{keyword} language:{target_language}"
157 | tasks.append(asyncio.create_task(fetch_github_repositories(query, agent_config.max_results, agent_config.per_page, headers)))
158 | results = await asyncio.gather(*tasks, return_exceptions=True)
159 | for result in results:
160 | if not isinstance(result, Exception):
161 | all_repos.extend(result)
162 | else:
163 | logger.error(f"Error in fetching repositories for a keyword: {result}")
164 | seen = set()
165 | unique_repos = []
166 | for repo in all_repos:
167 | if repo["full_name"] not in seen:
168 | seen.add(repo["full_name"])
169 | unique_repos.append(repo)
170 | state.repositories = unique_repos
171 | logger.info(f"Total unique repositories fetched: {len(state.repositories)}")
172 | return {"repositories": state.repositories}
173 |
174 | def ingest_github_repos(state, config):
175 | return asyncio.run(ingest_github_repos_async(state, config))
176 |
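A quick sketch of how `ingest_github_repos_async` splits the colon-separated `searchable_query` into keywords and an optional `target-<language>` token (the query string below is hypothetical — it just mimics what the tag-generation prompt produces):

```python
# Hypothetical searchable_query in the "tag-one:tag-two" format emitted by the
# query-expansion prompt, with an optional target-language token appended.
searchable_query = "chain-of-thought:finetuned-llm:target-python"

keywords = [kw.strip() for kw in searchable_query.split(":") if kw.strip()]
target_language = "python"  # default used when no target- token is present
filtered = []
for kw in keywords:
    if kw.startswith("target-"):
        target_language = kw.split("target-")[-1]
    else:
        filtered.append(kw)

# One GitHub search query per keyword, as built in the ingestion node.
queries = [f"{kw} language:{target_language}" for kw in filtered]
print(queries)  # ['chain-of-thought language:python', 'finetuned-llm language:python']
```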
--------------------------------------------------------------------------------
/tools/mcp_adapter.py:
--------------------------------------------------------------------------------
1 | import httpx
2 | import logging
3 |
4 | logger = logging.getLogger(__name__)
5 |
6 | class MCPAdapter:
7 | def __init__(self):
8 | self.adapter_name = "GitHub MCP Adapter"
9 | # Optionally, initialize shared client settings or cache here.
10 |
11 | async def fetch(self, url: str, headers: dict = None, params: dict = None, client: httpx.AsyncClient = None):
12 | """
13 | A standardized fetch method that wraps HTTP GET calls.
14 | If a client is provided, it uses it; otherwise, it creates a temporary client.
15 | """
16 | try:
17 | if client is None:
18 | async with httpx.AsyncClient() as temp_client:
19 | response = await temp_client.get(url, headers=headers, params=params)
20 | else:
21 | response = await client.get(url, headers=headers, params=params)
22 | logger.info(f"[{self.adapter_name}] Fetched URL: {url} with status {response.status_code}")
23 | return response
24 | except Exception as e:
25 | logger.error(f"[{self.adapter_name}] Error fetching {url}: {e}")
26 | raise e
27 |
28 | # Provide a singleton instance for use in other modules.
29 | mcp_adapter = MCPAdapter()
30 |
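A minimal usage sketch of the adapter, assuming it is imported as `tools.mcp_adapter`; the URL is just an example endpoint:

```python
import asyncio
import httpx
from tools.mcp_adapter import mcp_adapter

async def main():
    # Reuse a single client across calls, as the ingestion helpers do.
    async with httpx.AsyncClient() as client:
        resp = await mcp_adapter.fetch("https://api.github.com/rate_limit", client=client)
        print(resp.status_code)

    # Or let the adapter spin up a temporary client for a one-off request.
    resp = await mcp_adapter.fetch("https://api.github.com/rate_limit")
    print(resp.status_code)

asyncio.run(main())
```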
--------------------------------------------------------------------------------
/tools/merge_analysis.py:
--------------------------------------------------------------------------------
1 | # tools/merge_analysis.py
2 | import logging
3 |
4 | logger = logging.getLogger(__name__)
5 |
6 | def merge_analysis(state, config):
7 | merged = {}
8 | # Merge activity_candidates and quality_candidates by full_name.
9 | for repo in state.activity_candidates:
10 | merged[repo["full_name"]] = repo.copy()
11 | for repo in state.quality_candidates:
12 | if repo["full_name"] in merged:
13 | merged[repo["full_name"]].update(repo)
14 | else:
15 | merged[repo["full_name"]] = repo.copy()
16 | merged_list = list(merged.values())
17 | state.filtered_candidates = merged_list
18 | logger.info(f"Merged analysis results: {len(merged_list)} candidates.")
19 | return {"filtered_candidates": merged_list}
20 |
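A toy illustration of the merge semantics: quality fields extend the matching activity entry, and quality-only repos are added as-is (all values below are invented):

```python
activity_candidates = [{"full_name": "org/repo-a", "activity_score": 12.0}]
quality_candidates = [
    {"full_name": "org/repo-a", "code_quality_score": 83},
    {"full_name": "org/repo-b", "code_quality_score": 91},
]

merged = {}
for repo in activity_candidates:
    merged[repo["full_name"]] = repo.copy()
for repo in quality_candidates:
    if repo["full_name"] in merged:
        merged[repo["full_name"]].update(repo)
    else:
        merged[repo["full_name"]] = repo.copy()

print(list(merged.values()))
# [{'full_name': 'org/repo-a', 'activity_score': 12.0, 'code_quality_score': 83},
#  {'full_name': 'org/repo-b', 'code_quality_score': 91}]
```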
--------------------------------------------------------------------------------
/tools/output_presentation.py:
--------------------------------------------------------------------------------
1 | # tools/output_presentation.py
2 | """
3 | def output_presentation(state, config):
4 | results_str = "\n=== Final Ranked Repositories ===\n"
5 | top_n = 10
6 | for rank, repo in enumerate(state.final_ranked[:top_n], 1):
7 | results_str += f"\nFinal Rank: {rank}\n"
8 | results_str += f"Title: {repo['title']}\n"
9 | results_str += f"Link: {repo['link']}\n"
10 | results_str += f"Stars: {repo['stars']}\n"
11 | results_str += f"Semantic Similarity: {repo.get('semantic_similarity', 0):.4f}\n"
12 | results_str += f"Cross-Encoder Score: {repo.get('cross_encoder_score', 0):.4f}\n"
13 | results_str += f"Activity Score: {repo.get('activity_score', 0):.2f}\n"
14 | results_str += f"Code Quality Score: {repo.get('code_quality_score', 0)}\n"
15 | results_str += f"Final Score: {repo.get('final_score', 0):.4f}\n"
16 | results_str += f"Combined Doc Snippet: {repo['combined_doc'][:200]}...\n"
17 | results_str += '-' * 80 + "\n"
18 | return {"final_results": results_str}
19 | """
20 |
21 | def output_presentation(state, config):
22 | results_str = "\n=== Final Ranked Repositories ===\n"
23 | top_n = 10
24 | for rank, repo in enumerate(state.final_ranked[:top_n], 1):
25 | results_str += f"\nFinal Rank: {rank}\n"
26 | results_str += f"Title: {repo['title']}\n"
27 | results_str += f"Link: {repo['link']}\n"
28 | results_str += f"Stars: {repo['stars']}\n"
29 | results_str += f"Semantic Similarity: {repo.get('semantic_similarity', 0):.4f}\n"
30 | results_str += f"Cross-Encoder Score: {repo.get('cross_encoder_score', 0):.4f}\n"
31 | results_str += f"Activity Score: {repo.get('activity_score', 0):.2f}\n"
32 | results_str += f"Code Quality Score: {repo.get('code_quality_score', 0)}\n"
33 | results_str += f"Final Score: {repo.get('final_score', 0):.4f}\n"
34 | results_str += f"Combined Doc Snippet: {repo['combined_doc'][:200]}...\n"
35 | results_str += '-' * 80 + "\n"
36 | # Do not update state.final_ranked here.
37 | return {"final_results": results_str}
38 |
--------------------------------------------------------------------------------
/tools/parse_hardware.py:
--------------------------------------------------------------------------------
1 | # tools/parse_hardware.py
2 | import re, logging
3 | from tools.chat import chain
4 |
5 | logger = logging.getLogger(__name__)
6 |
7 | VALID_SPECS = ("cpu-only", "low-memory", "mobile")
8 |
9 | HARDWARE_PATTERNS = {
10 | "cpu-only": [r"cpu[- ]only", r"no[- ]?gpu", r"gpu[- ]poor", r"lightweight"],
11 | "low-memory": [r"low[- ]?memory", r"small[- ]?memory"],
12 | "mobile": [r"mobile", r"raspberry", r"android"],
13 | }
14 |
15 | PROMPT_TEMPLATE = (
16 | "Extract any hardware constraints from the user query. "
17 | "Return exactly one of: cpu-only, low-memory, mobile, NONE."
18 | )
19 |
20 | def parse_hardware_spec(state, config):
21 | q = state.user_query.lower()
22 |
23 | # 1) Fast heuristic
24 | for spec, patterns in HARDWARE_PATTERNS.items():
25 | if any(re.search(pat, q) for pat in patterns):
26 | logger.info(f"[Hardware] regex -> {spec}")
27 | state.hardware_spec = spec
28 | return {"hardware_spec": spec}
29 |
30 | # 2) LLM fallback
31 | full = f"{PROMPT_TEMPLATE}\n\nUser query:\n{state.user_query}"
32 | resp = chain.invoke({"query": full}).content.strip().lower()
33 | spec = resp if resp in VALID_SPECS else None
34 | logger.info(f"[Hardware] LLM -> {spec}")
35 | state.hardware_spec = spec
36 | return {"hardware_spec": spec}
37 |
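A self-contained check of the regex heuristic on a few invented queries (only the fast path is exercised here; anything unmatched would fall through to the LLM prompt):

```python
import re

HARDWARE_PATTERNS = {
    "cpu-only": [r"cpu[- ]only", r"no[- ]?gpu", r"gpu[- ]poor", r"lightweight"],
    "low-memory": [r"low[- ]?memory", r"small[- ]?memory"],
    "mobile": [r"mobile", r"raspberry", r"android"],
}

def detect(query: str):
    q = query.lower()
    for spec, patterns in HARDWARE_PATTERNS.items():
        if any(re.search(pat, q) for pat in patterns):
            return spec
    return None  # parse_hardware_spec would ask the LLM at this point

for q in [
    "OCR pipeline that runs on a GPU-poor laptop",
    "label studio alternative that works on android",
    "chain of thought prompting repos",
]:
    print(q, "->", detect(q))
# -> cpu-only, mobile, None
```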
--------------------------------------------------------------------------------
/tools/rank.py:
--------------------------------------------------------------------------------
1 | import os
2 | import base64
3 | import requests
4 | import numpy as np
5 | import datetime
6 | import math
7 | import logging
8 | import getpass
9 | import faiss
10 | from pathlib import Path
11 | from dotenv import load_dotenv
12 | from sentence_transformers import SentenceTransformer, CrossEncoder
13 |
14 | from langchain_core.runnables import RunnableConfig
15 | from langgraph.graph import START, END, StateGraph
16 | from pydantic import BaseModel, Field
17 | from dataclasses import dataclass, field
18 | from typing import List, Any
19 |
20 | # ---------------------------
21 | # Logging and Environment Setup
22 | # ---------------------------
23 | logging.basicConfig(level=logging.INFO, format="%(asctime)s - %(levelname)s - %(message)s")
24 | logger = logging.getLogger(__name__)
25 |
26 | # Load .env from the parent directory
27 | dotenv_path = Path(__file__).resolve().parent.parent / ".env"
28 | if dotenv_path.exists():
29 | from dotenv import load_dotenv
30 | load_dotenv(dotenv_path)
31 |
32 | if "GITHUB_API_KEY" not in os.environ:
33 | os.environ["GITHUB_API_KEY"] = getpass.getpass("Enter your GitHub API key: ")
34 |
35 | # ---------------------------
36 | # State and Configuration
37 | # ---------------------------
38 | @dataclass(kw_only=True)
39 | class AgentState:
40 | github_query: str = field(default="Chain of Thought prompting language:python")
41 | user_query: str = field(default="I am researching the application of Chain of Thought prompting for improving reasoning in large language models within a Python environment.")
42 | repositories: List[Any] = field(default_factory=list)
43 | semantic_ranked: List[Any] = field(default_factory=list)
44 | reranked_candidates: List[Any] = field(default_factory=list)
45 | filtered_candidates: List[Any] = field(default_factory=list)
46 | final_ranked: List[Any] = field(default_factory=list)
47 |
48 | @dataclass(kw_only=True)
49 | class AgentStateInput:
50 | github_query: str = field(default="Chain of Thought prompting language:python")
51 | user_query: str = field(default="I am researching the application of Chain of Thought prompting for improving reasoning in large language models within a Python environment.")
52 |
53 | @dataclass(kw_only=True)
54 | class AgentStateOutput:
55 | final_ranked: List[Any] = field(default_factory=list)
56 |
57 | class AgentConfiguration(BaseModel):
58 | max_results: int = Field(default=1000, title="Max Results", description="Maximum results to fetch from GitHub")
59 | per_page: int = Field(default=100, title="Per Page", description="Results per page for GitHub API")
60 | dense_retrieval_k: int = Field(default=100, title="Dense Retrieval Top K", description="Top K candidates to retrieve from FAISS")
61 | cross_encoder_top_n: int = Field(default=50, title="Cross Encoder Top N", description="Top N candidates after re-ranking")
62 | min_stars: int = Field(default=50, title="Minimum Stars", description="Minimum star count threshold for filtering")
63 | cross_encoder_threshold: float = Field(default=5.5, title="Cross Encoder Threshold", description="Threshold for cross encoder score filtering")
64 |
65 | sem_model_name: str = Field(default="all-mpnet-base-v2", title="Sentence Transformer Model", description="Model for dense retrieval")
66 | cross_encoder_model_name: str = Field(default="cross-encoder/ms-marco-MiniLM-L-6-v2", title="Cross Encoder Model", description="Model for re-ranking")
67 |
68 | @classmethod
69 | def from_runnable_config(cls, config: Any = None) -> "AgentConfiguration":
70 | configurable = config["configurable"] if config and "configurable" in config else {}
71 | raw_values = {name: os.environ.get(name.upper(), configurable.get(name)) for name in cls.__fields__.keys()}
72 | values = {k: v for k, v in raw_values.items() if v is not None}
73 | return cls(**values)
74 |
75 | # ---------------------------
76 | # Node 1: Fetch Repositories
77 | # ---------------------------
78 | def fetch_repositories(state: AgentState, config: RunnableConfig):
79 | headers = {
80 | "Authorization": f"token {os.getenv('GITHUB_API_KEY')}",
81 | "Accept": "application/vnd.github.v3+json"
82 | }
83 | # Helper functions for GitHub API
84 | def fetch_readme_content(repo_full_name, headers):
85 | readme_url = f"https://api.github.com/repos/{repo_full_name}/readme"
86 | response = requests.get(readme_url, headers=headers)
87 | if response.status_code == 200:
88 | readme_data = response.json()
89 | return base64.b64decode(readme_data['content']).decode('utf-8')
90 | return ""
91 |
92 | def fetch_file_content(download_url):
93 | try:
94 | response = requests.get(download_url)
95 | if response.status_code == 200:
96 | return response.text
97 | except Exception as e:
98 | logger.error(f"Error fetching file: {e}")
99 | return ""
100 |
101 | def fetch_directory_markdown(repo_full_name, path, headers):
102 | md_content = ""
103 | url = f"https://api.github.com/repos/{repo_full_name}/contents/{path}"
104 | response = requests.get(url, headers=headers)
105 | if response.status_code == 200:
106 | items = response.json()
107 | for item in items:
108 | if item["type"] == "file" and item["name"].lower().endswith(".md"):
109 | content = fetch_file_content(item["download_url"])
110 | md_content += f"\n\n# {item['name']}\n" + content
111 | return md_content
112 |
113 | def fetch_repo_documentation(repo_full_name, headers):
114 | doc_text = ""
115 | readme = fetch_readme_content(repo_full_name, headers)
116 | if readme:
117 | doc_text += "# README\n" + readme
118 | root_url = f"https://api.github.com/repos/{repo_full_name}/contents"
119 | response = requests.get(root_url, headers=headers)
120 | if response.status_code == 200:
121 | items = response.json()
122 | for item in items:
123 | if item["type"] == "file" and item["name"].lower().endswith(".md"):
124 | if item["name"].lower() != "readme.md":
125 | content = fetch_file_content(item["download_url"])
126 | doc_text += f"\n\n# {item['name']}\n" + content
127 | elif item["type"] == "dir" and item["name"].lower() in ["docs", "documentation"]:
128 | doc_text += f"\n\n# {item['name']} folder\n" + fetch_directory_markdown(repo_full_name, item["name"], headers)
129 | return doc_text if doc_text.strip() else "No documentation available."
130 |
131 | def fetch_github_repositories(query, max_results, per_page):
132 | url = "https://api.github.com/search/repositories"
133 | repositories = []
134 | num_pages = max_results // per_page
135 | for page in range(1, num_pages + 1):
136 | params = {
137 | "q": query,
138 | "sort": "stars",
139 | "order": "desc",
140 | "per_page": per_page,
141 | "page": page
142 | }
143 | response = requests.get(url, headers=headers, params=params)
144 | if response.status_code != 200:
145 | logger.error(f"Error {response.status_code}: {response.json().get('message')}")
146 | break
147 | items = response.json().get('items', [])
148 | if not items:
149 | break
150 | for repo in items:
151 | repo_link = repo['html_url']
152 | full_name = repo.get('full_name', '')
153 | doc_content = fetch_repo_documentation(full_name, headers)
154 | star_count = repo.get('stargazers_count', 0)
155 | repositories.append({
156 | "title": repo.get('name', 'No title available'),
157 | "link": repo_link,
158 | "combined_doc": doc_content,
159 | "stars": star_count,
160 | "full_name": full_name,
161 | "open_issues_count": repo.get('open_issues_count', 0)
162 | })
163 | logger.info(f"Fetched {len(repositories)} repositories from GitHub.")
164 | return repositories
165 |
166 | agent_config = AgentConfiguration.from_runnable_config(config)
167 | state.repositories = fetch_github_repositories(state.github_query, int(agent_config.max_results), int(agent_config.per_page))
168 | return {"repositories": state.repositories}
169 |
170 | # ---------------------------
171 | # Node 2: Dense Retrieval with FAISS
172 | # ---------------------------
173 | def dense_retrieval(state: AgentState, config: RunnableConfig):
174 | agent_config = AgentConfiguration.from_runnable_config(config)
175 | sem_model = SentenceTransformer(agent_config.sem_model_name)
176 |
177 | docs = [repo.get("combined_doc", "") for repo in state.repositories]
178 | if not docs:
179 | logger.warning("No documents found. Skipping dense retrieval.")
180 | state.semantic_ranked = []
181 | return {"semantic_ranked": state.semantic_ranked}
182 |
183 | logger.info(f"Encoding {len(docs)} documents for dense retrieval...")
184 | doc_embeddings = sem_model.encode(docs, convert_to_numpy=True, show_progress_bar=True, batch_size=16)
185 |
186 | # Handle 1D shape if there's exactly 1 doc
187 | if doc_embeddings.ndim == 1:
188 | doc_embeddings = doc_embeddings.reshape(1, -1)
189 |
190 | def normalize_embeddings(embeddings):
191 | norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
192 | return embeddings / (norms + 1e-10)
193 |
194 | doc_embeddings = normalize_embeddings(doc_embeddings)
195 |
196 | # Encode user query
197 | query_embedding = sem_model.encode(state.user_query, convert_to_numpy=True)
198 | if query_embedding.ndim == 1:
199 | query_embedding = query_embedding.reshape(1, -1)
200 | query_embedding = normalize_embeddings(query_embedding)[0]
201 |
202 | dim = doc_embeddings.shape[1]
203 | index = faiss.IndexFlatIP(dim)
204 | index.add(doc_embeddings)
205 | k = min(int(agent_config.dense_retrieval_k), doc_embeddings.shape[0])
206 | D, I = index.search(np.expand_dims(query_embedding, axis=0), k)
207 |
208 | for idx, score in zip(I[0], D[0]):
209 | state.repositories[idx]["semantic_similarity"] = score
210 |
211 | state.semantic_ranked = sorted(
212 | state.repositories, key=lambda x: x.get("semantic_similarity", 0), reverse=True
213 | )
214 | logger.info(f"Dense retrieval complete: {len(state.semantic_ranked)} candidates ranked by semantic similarity.")
215 | return {"semantic_ranked": state.semantic_ranked}
216 |
217 | # ---------------------------
218 | # Node 3: Re-Ranking with Cross-Encoder
219 | # ---------------------------
220 | def cross_encoder_rerank(state: AgentState, config: RunnableConfig):
221 | agent_config = AgentConfiguration.from_runnable_config(config)
222 | cross_encoder = CrossEncoder(agent_config.cross_encoder_model_name)
223 | candidates_for_rerank = state.semantic_ranked[:100]
224 | logger.info(f"Re-ranking {len(candidates_for_rerank)} candidates with cross-encoder...")
225 | def cross_encoder_rerank_func(query, candidates, top_n):
226 | pairs = [[query, candidate["combined_doc"]] for candidate in candidates]
227 | scores = cross_encoder.predict(pairs, show_progress_bar=True)
228 | for candidate, score in zip(candidates, scores):
229 | candidate["cross_encoder_score"] = score
230 | return sorted(candidates, key=lambda x: x["cross_encoder_score"], reverse=True)[:top_n]
231 | state.reranked_candidates = cross_encoder_rerank_func(state.user_query, candidates_for_rerank, int(agent_config.cross_encoder_top_n))
232 | logger.info(f"Re-ranking complete: {len(state.reranked_candidates)} candidates remain after cross-encoder re-ranking.")
233 | return {"reranked_candidates": state.reranked_candidates}
234 |
235 | # ---------------------------
236 | # Node 4: Filtering Low-Star Repositories
237 | # ---------------------------
238 | def filter_candidates(state: AgentState, config: RunnableConfig):
239 | agent_config = AgentConfiguration.from_runnable_config(config)
240 | filtered = []
241 | for repo in state.reranked_candidates:
242 | if repo["stars"] < agent_config.min_stars and repo.get("cross_encoder_score", 0) < agent_config.cross_encoder_threshold:
243 | continue
244 | filtered.append(repo)
245 | if not filtered:
246 | filtered = state.reranked_candidates
247 | state.filtered_candidates = filtered
248 | logger.info(f"Filtering complete: {len(state.filtered_candidates)} candidates remain after filtering low-star repositories.")
249 | return {"filtered_candidates": state.filtered_candidates}
250 |
251 | # ---------------------------
252 | # Node 5: Activity Analysis
253 | # ---------------------------
254 | def analyze_activity(state: AgentState, config: RunnableConfig):
255 | headers = {
256 | "Authorization": f"token {os.getenv('GITHUB_API_KEY')}",
257 | "Accept": "application/vnd.github.v3+json"
258 | }
259 | def analyze_repository_activity(repo, headers):
260 | full_name = repo.get("full_name")
261 | pr_url = f"https://api.github.com/repos/{full_name}/pulls"
262 | pr_params = {"state": "open", "per_page": 100}
263 | pr_response = requests.get(pr_url, headers=headers, params=pr_params)
264 | pr_count = len(pr_response.json()) if pr_response.status_code == 200 else 0
265 | commits_url = f"https://api.github.com/repos/{full_name}/commits"
266 | commits_params = {"per_page": 1}
267 | commits_response = requests.get(commits_url, headers=headers, params=commits_params)
268 | if commits_response.status_code == 200:
269 | commit_data = commits_response.json()
270 | if commit_data:
271 | commit_date_str = commit_data[0]["commit"]["committer"]["date"]
272 | commit_date = datetime.datetime.fromisoformat(commit_date_str.rstrip("Z"))
273 | days_diff = (datetime.datetime.utcnow() - commit_date).days
274 | else:
275 | days_diff = 999
276 | else:
277 | days_diff = 999
278 | open_issues = repo.get("open_issues_count", 0)
279 | non_pr_issues = max(0, open_issues - pr_count)
280 | activity_score = (3 * pr_count) + non_pr_issues - (days_diff / 30)
281 | return {"pr_count": pr_count, "latest_commit_days": days_diff, "activity_score": activity_score}
282 |
283 | for repo in state.filtered_candidates:
284 | activity_data = analyze_repository_activity(repo, headers)
285 | repo.update(activity_data)
286 | logger.info("Activity analysis complete for filtered candidates.")
287 | return {"filtered_candidates": state.filtered_candidates}
288 |
289 | # ---------------------------
290 | # Node 6: Final Ranking
291 | # ---------------------------
292 | def final_ranking(state: AgentState, config: RunnableConfig):
293 | semantic_scores = [repo.get("semantic_similarity", 0) for repo in state.filtered_candidates]
294 | cross_encoder_scores = [repo.get("cross_encoder_score", 0) for repo in state.filtered_candidates]
295 | activity_scores = [repo.get("activity_score", -100) for repo in state.filtered_candidates]
296 | star_scores = [math.log(repo.get("stars", 0) + 1) for repo in state.filtered_candidates]
297 | min_sem, max_sem = min(semantic_scores), max(semantic_scores)
298 | min_ce, max_ce = min(cross_encoder_scores), max(cross_encoder_scores)
299 | min_act, max_act = min(activity_scores), max(activity_scores)
300 | min_star, max_star = min(star_scores), max(star_scores)
301 | def normalize(val, min_val, max_val):
302 | if max_val - min_val == 0:
303 | return 0.5
304 | return (val - min_val) / (max_val - min_val)
305 | for repo in state.filtered_candidates:
306 | norm_sem = normalize(repo.get("semantic_similarity", 0), min_sem, max_sem)
307 | norm_ce = normalize(repo.get("cross_encoder_score", 0), min_ce, max_ce)
308 | norm_act = normalize(repo.get("activity_score", -100), min_act, max_act)
309 | norm_star = normalize(math.log(repo.get("stars", 0) + 1), min_star, max_star)
310 | repo["final_score"] = 0.3 * norm_ce + 0.2 * norm_sem + 0.2 * norm_act + 0.3 * norm_star
311 | state.final_ranked = sorted(state.filtered_candidates, key=lambda x: x["final_score"], reverse=True)
312 | logger.info(f"Final ranking computed for {len(state.final_ranked)} candidates.")
313 | return {"final_ranked": state.final_ranked}
314 |
315 | # ---------------------------
316 | # Node 7: Display Results
317 | # ---------------------------
318 | def display_results(state: AgentState, config: RunnableConfig):
319 | results_str = "\n=== Final Ranked Repositories ===\n"
320 | top_n = 10
321 | for rank, repo in enumerate(state.final_ranked[:top_n], 1):
322 | results_str += f"\nFinal Rank: {rank}\n"
323 | results_str += f"Title: {repo['title']}\n"
324 | results_str += f"Link: {repo['link']}\n"
325 | results_str += f"Stars: {repo['stars']}\n"
326 | results_str += f"Semantic Similarity: {repo.get('semantic_similarity', 0):.4f}\n"
327 | results_str += f"Cross-Encoder Score: {repo.get('cross_encoder_score', 0):.4f}\n"
328 | results_str += f"Activity Score: {repo.get('activity_score', 0):.2f}\n"
329 | results_str += f"Final Score: {repo.get('final_score', 0):.4f}\n"
330 | results_str += f"Combined Doc Snippet: {repo['combined_doc'][:200]}...\n"
331 | results_str += '-' * 80 + "\n"
332 | return {"final_results": results_str}
333 |
334 | # ---------------------------
335 | # Build and Compile the Graph
336 | # ---------------------------
337 | builder = StateGraph(AgentState, input=AgentStateInput, output=AgentStateOutput, config_schema=AgentConfiguration)
338 | builder.add_node("fetch_repositories", fetch_repositories)
339 | builder.add_node("dense_retrieval", dense_retrieval)
340 | builder.add_node("cross_encoder_rerank", cross_encoder_rerank)
341 | builder.add_node("filter_candidates", filter_candidates)
342 | builder.add_node("analyze_activity", analyze_activity)
343 | builder.add_node("final_ranking", final_ranking)
344 | builder.add_node("display_results", display_results)
345 |
346 | builder.add_edge(START, "fetch_repositories")
347 | builder.add_edge("fetch_repositories", "dense_retrieval")
348 | builder.add_edge("dense_retrieval", "cross_encoder_rerank")
349 | builder.add_edge("cross_encoder_rerank", "filter_candidates")
350 | builder.add_edge("filter_candidates", "analyze_activity")
351 | builder.add_edge("analyze_activity", "final_ranking")
352 | builder.add_edge("final_ranking", "display_results")
353 | builder.add_edge("display_results", END)
354 |
355 | graph = builder.compile()
356 |
357 | if __name__ == "__main__":
358 | # For local testing outside of LangGraph Studio.
359 | initial_state = AgentStateInput(
360 | github_query="Chain of Thought prompting language:python",
361 | user_query="I am researching the application of Chain of Thought prompting for improving reasoning in large language models within a Python environment."
362 | )
363 |     result = graph.invoke(initial_state)
364 |     # invoke() returns the output state as a dict; index the final ranked list.
365 |     print(result["final_ranked"])
366 |
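A sketch of running this compiled graph locally with configuration overrides; the keys mirror the `AgentConfiguration` fields, and the concrete values are arbitrary (environment variables with the same upper-cased names take precedence):

```python
from tools.rank import graph, AgentStateInput

config = {"configurable": {"max_results": 200, "per_page": 100, "cross_encoder_top_n": 25}}

state = AgentStateInput(
    github_query="chain-of-thought language:python",
    user_query="Repos around chain of thought prompting for finetuned models",
)
result = graph.invoke(state, config)
for repo in result["final_ranked"][:5]:
    print(repo["title"], repo.get("final_score"))
```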
--------------------------------------------------------------------------------
/tools/ranking.py:
--------------------------------------------------------------------------------
1 | # tools/ranking.py
2 | import math
3 | import logging
4 |
5 | logger = logging.getLogger(__name__)
6 |
7 |
8 |
9 | # Min–max normalization helper (originally tools/normalize.py).
10 | def normalize_scores(values):
11 | """
12 | Perform min–max normalization on a list of numeric values.
13 | Returns a list of values scaled to [0, 1]. If the range is zero,
14 | returns 0.5 for each value.
15 | """
16 | min_val = min(values)
17 | max_val = max(values)
18 | if max_val - min_val == 0:
19 | return [0.5 for _ in values]
20 | return [(val - min_val) / (max_val - min_val) for val in values]
21 |
22 |
23 | def multi_factor_ranking(state, config):
24 | # Gather raw scores from filtered candidates.
25 | semantic_scores = [repo.get("semantic_similarity", 0) for repo in state.filtered_candidates]
26 | cross_encoder_scores = [repo.get("cross_encoder_score", 0) for repo in state.filtered_candidates]
27 | activity_scores = [repo.get("activity_score", -100) for repo in state.filtered_candidates]
28 | quality_scores = [repo.get("code_quality_score", 0) for repo in state.filtered_candidates]
29 | star_scores = [math.log(repo.get("stars", 0) + 1) for repo in state.filtered_candidates]
30 |
31 | # Normalize each set of scores using the helper function.
32 | norm_sem_scores = normalize_scores(semantic_scores)
33 | norm_ce_scores = normalize_scores(cross_encoder_scores)
34 | norm_act_scores = normalize_scores(activity_scores)
35 | norm_quality_scores = normalize_scores(quality_scores)
36 | norm_star_scores = normalize_scores(star_scores)
37 |
38 | # Define weights for each signal.
39 | weights = {
40 | "cross_encoder": 0.30,
41 | "semantic": 0.20,
42 | "activity": 0.15,
43 | "quality": 0.15,
44 | "stars": 0.20
45 | }
46 |
47 | # Combine the normalized scores using the defined weights.
48 | for idx, repo in enumerate(state.filtered_candidates):
49 | repo["final_score"] = (
50 | weights["cross_encoder"] * norm_ce_scores[idx] +
51 | weights["semantic"] * norm_sem_scores[idx] +
52 | weights["activity"] * norm_act_scores[idx] +
53 | weights["quality"] * norm_quality_scores[idx] +
54 | weights["stars"] * norm_star_scores[idx]
55 | )
56 |
57 | # Sort repositories in descending order of final_score.
58 | state.final_ranked = sorted(state.filtered_candidates, key=lambda x: x["final_score"], reverse=True)
59 | logger.info(f"Final multi-factor ranking computed for {len(state.final_ranked)} candidates.")
60 | return {"final_ranked": state.final_ranked}
61 |
62 |
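A tiny worked example of the normalization and weighting above, using two of the five signals with invented raw scores:

```python
from tools.ranking import normalize_scores

cross_encoder = [7.2, 4.1, 1.3]   # raw cross-encoder scores (invented)
stars_log = [9.2, 5.1, 2.0]       # math.log(stars + 1) (invented)

norm_ce = normalize_scores(cross_encoder)   # [1.0, ~0.47, 0.0]
norm_star = normalize_scores(stars_log)     # [1.0, ~0.43, 0.0]

# With only these two signals contributing, candidate 0 would already collect
# 0.30 * 1.0 + 0.20 * 1.0 = 0.50 of its final score before the other terms.
print(norm_ce, norm_star)
```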
--------------------------------------------------------------------------------
/tools/search.py:
--------------------------------------------------------------------------------
1 | import os
2 | import base64
3 | import requests
4 | import numpy as np
5 | import datetime
6 | import torch
7 | from sentence_transformers import SentenceTransformer, CrossEncoder
8 | import faiss
9 | import getpass
10 | import math
11 | import logging
12 | from dotenv import load_dotenv
13 | from pathlib import Path
14 |
15 | # Resolve the path to the root directory
16 | dotenv_path = Path(__file__).resolve().parent.parent / ".env"
17 |
18 | # Load the .env file
19 | load_dotenv(dotenv_path)
20 | # ---------------------------
21 | # Logging Setup
22 | # ---------------------------
23 | logging.basicConfig(level=logging.INFO, format="%(asctime)s - %(levelname)s - %(message)s")
24 | logger = logging.getLogger(__name__)
25 |
26 | # ---------------------------
27 | # Environment Setup
28 | # ---------------------------
29 | if "GITHUB_API_KEY" not in os.environ:
30 | os.environ["GITHUB_API_KEY"] = getpass.getpass("Enter your GitHub API key: ")
31 |
32 | # ---------------------------
33 | # GitHub API Helper Functions
34 | # ---------------------------
35 | def fetch_readme_content(repo_full_name, headers):
36 | readme_url = f"https://api.github.com/repos/{repo_full_name}/readme"
37 | response = requests.get(readme_url, headers=headers)
38 | if response.status_code == 200:
39 | readme_data = response.json()
40 | return base64.b64decode(readme_data['content']).decode('utf-8')
41 | else:
42 | return ""
43 |
44 | def fetch_file_content(download_url):
45 | try:
46 | response = requests.get(download_url)
47 | if response.status_code == 200:
48 | return response.text
49 | except Exception as e:
50 | logger.error(f"Error fetching file: {e}")
51 | return ""
52 |
53 | def fetch_directory_markdown(repo_full_name, path, headers):
54 | md_content = ""
55 | url = f"https://api.github.com/repos/{repo_full_name}/contents/{path}"
56 | response = requests.get(url, headers=headers)
57 | if response.status_code == 200:
58 | items = response.json()
59 | for item in items:
60 | if item["type"] == "file" and item["name"].lower().endswith(".md"):
61 | content = fetch_file_content(item["download_url"])
62 | md_content += f"\n\n# {item['name']}\n" + content
63 | return md_content
64 |
65 | def fetch_repo_documentation(repo_full_name, headers):
66 | doc_text = ""
67 | # 1. Fetch README.
68 | readme = fetch_readme_content(repo_full_name, headers)
69 | if readme:
70 | doc_text += "# README\n" + readme
71 | # 2. List root directory files.
72 | root_url = f"https://api.github.com/repos/{repo_full_name}/contents"
73 | response = requests.get(root_url, headers=headers)
74 | if response.status_code == 200:
75 | items = response.json()
76 | for item in items:
77 | if item["type"] == "file" and item["name"].lower().endswith(".md"):
78 | if item["name"].lower() != "readme.md":
79 | content = fetch_file_content(item["download_url"])
80 | doc_text += f"\n\n# {item['name']}\n" + content
81 | elif item["type"] == "dir" and item["name"].lower() in ["docs", "documentation"]:
82 | doc_text += f"\n\n# {item['name']} folder\n" + fetch_directory_markdown(repo_full_name, item["name"], headers)
83 | return doc_text if doc_text.strip() else "No documentation available."
84 |
85 | def fetch_github_repositories(query, max_results=1000, per_page=100):
86 | url = "https://api.github.com/search/repositories"
87 | headers = {
88 | "Authorization": f"token {os.getenv('GITHUB_API_KEY')}",
89 | "Accept": "application/vnd.github.v3+json"
90 | }
91 | repositories = []
92 | num_pages = max_results // per_page
93 | for page in range(1, num_pages + 1):
94 | params = {
95 | "q": query, # e.g., "Chain of Thought prompting language:python"
96 | "sort": "stars",
97 | "order": "desc",
98 | "per_page": per_page,
99 | "page": page
100 | }
101 | response = requests.get(url, headers=headers, params=params)
102 | if response.status_code != 200:
103 | logger.error(f"Error {response.status_code}: {response.json().get('message')}")
104 | break
105 | items = response.json().get('items', [])
106 | if not items:
107 | break
108 | for repo in items:
109 | repo_link = repo['html_url']
110 | full_name = repo.get('full_name', '')
111 | doc_content = fetch_repo_documentation(full_name, headers)
112 | star_count = repo.get('stargazers_count', 0)
113 | repositories.append({
114 | "title": repo.get('name', 'No title available'),
115 | "link": repo_link,
116 | "combined_doc": doc_content,
117 | "stars": star_count,
118 | "full_name": full_name,
119 | "open_issues_count": repo.get('open_issues_count', 0)
120 | })
121 | logger.info(f"Fetched {len(repositories)} repositories from GitHub.")
122 | return repositories
123 |
124 | # ---------------------------
125 | # Stage 1: Dense Retrieval with FAISS
126 | # ---------------------------
127 | sem_model = SentenceTransformer("all-mpnet-base-v2")
128 | user_query_text = """
129 | I am researching the application of Chain of Thought prompting for improving reasoning in large language models within a Python environment.
130 | """
131 | github_query = "Chain of Thought prompting language:python"
132 | repos = fetch_github_repositories(github_query)
133 | docs = [repo.get("combined_doc", "") for repo in repos]
134 | logger.info(f"Encoding {len(docs)} documents for dense retrieval...")
135 | doc_embeddings = sem_model.encode(docs, convert_to_numpy=True, show_progress_bar=True, batch_size=16)
136 |
137 | def normalize_embeddings(embeddings):
138 | norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
139 | return embeddings / (norms + 1e-10)
140 |
141 | doc_embeddings = normalize_embeddings(doc_embeddings)
142 | query_embedding = sem_model.encode(user_query_text, convert_to_numpy=True)
143 | query_embedding = normalize_embeddings(np.expand_dims(query_embedding, axis=0))[0]
144 | dim = doc_embeddings.shape[1]
145 | index = faiss.IndexFlatIP(dim)
146 | index.add(doc_embeddings)
147 | k = min(100, doc_embeddings.shape[0])
148 | D, I = index.search(np.expand_dims(query_embedding, axis=0), k)
149 | for idx, score in zip(I[0], D[0]):
150 | repos[idx]["semantic_similarity"] = score
151 | ranked_by_semantic = sorted(repos, key=lambda x: x.get("semantic_similarity", 0), reverse=True)
152 | logger.info(f"Stage 1 complete: {len(ranked_by_semantic)} candidates ranked by semantic similarity.")
153 |
154 | # ---------------------------
155 | # Stage 2: Re-Ranking with Cross-Encoder
156 | # ---------------------------
157 | cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
158 | def cross_encoder_rerank(query, candidates, top_n=50):
159 | pairs = [[query, candidate["combined_doc"]] for candidate in candidates]
160 | scores = cross_encoder.predict(pairs, show_progress_bar=True)
161 | for candidate, score in zip(candidates, scores):
162 | candidate["cross_encoder_score"] = score
163 | return sorted(candidates, key=lambda x: x["cross_encoder_score"], reverse=True)[:top_n]
164 |
165 | candidates_for_rerank = ranked_by_semantic[:100]
166 | logger.info(f"Stage 2: Re-ranking {len(candidates_for_rerank)} candidates with cross-encoder...")
167 | reranked_candidates = cross_encoder_rerank(user_query_text, candidates_for_rerank, top_n=50)
168 | logger.info(f"Stage 2 complete: {len(reranked_candidates)} candidates remain after cross-encoder re-ranking.")
169 |
170 | # ---------------------------
171 | # Stage 2.5: Filtering Low-Star Repositories
172 | # ---------------------------
173 | filtered_candidates = []
174 | for repo in reranked_candidates:
175 | if repo["stars"] < 50 and repo.get("cross_encoder_score", 0) < 5.5:
176 | continue
177 | filtered_candidates.append(repo)
178 | if not filtered_candidates:
179 | filtered_candidates = reranked_candidates # fallback if filtering is too strict
180 | logger.info(f"Stage 2.5 complete: {len(filtered_candidates)} candidates remain after filtering low-star repositories.")
181 |
182 | # ---------------------------
183 | # Stage 3: Activity Analysis
184 | # ---------------------------
185 | def analyze_repository_activity(repo, headers):
186 | full_name = repo.get("full_name")
187 | pr_url = f"https://api.github.com/repos/{full_name}/pulls"
188 | pr_params = {"state": "open", "per_page": 100}
189 | pr_response = requests.get(pr_url, headers=headers, params=pr_params)
190 | pr_count = len(pr_response.json()) if pr_response.status_code == 200 else 0
191 | commits_url = f"https://api.github.com/repos/{full_name}/commits"
192 | commits_params = {"per_page": 1}
193 | commits_response = requests.get(commits_url, headers=headers, params=commits_params)
194 | if commits_response.status_code == 200:
195 | commit_data = commits_response.json()
196 | if commit_data:
197 | commit_date_str = commit_data[0]["commit"]["committer"]["date"]
198 | commit_date = datetime.datetime.fromisoformat(commit_date_str.rstrip("Z"))
199 | days_diff = (datetime.datetime.utcnow() - commit_date).days
200 | else:
201 | days_diff = 999
202 | else:
203 | days_diff = 999
204 | open_issues = repo.get("open_issues_count", 0)
205 | non_pr_issues = max(0, open_issues - pr_count)
206 | activity_score = (3 * pr_count) + non_pr_issues - (days_diff / 30)
207 | return {"pr_count": pr_count, "latest_commit_days": days_diff, "activity_score": activity_score}
208 |
209 | gh_headers = {
210 | "Authorization": f"token {os.getenv('GITHUB_API_KEY')}",
211 | "Accept": "application/vnd.github.v3+json"
212 | }
213 | for repo in filtered_candidates:
214 | activity_data = analyze_repository_activity(repo, gh_headers)
215 | repo.update(activity_data)
216 | logger.info("Stage 3 complete: Activity analysis done for filtered candidates.")
217 |
218 | # ---------------------------
219 | # Stage 4: Combine Scores for Final Ranking (Including Stars)
220 | # ---------------------------
221 | semantic_scores = [repo.get("semantic_similarity", 0) for repo in filtered_candidates]
222 | cross_encoder_scores = [repo.get("cross_encoder_score", 0) for repo in filtered_candidates]
223 | activity_scores = [repo.get("activity_score", -100) for repo in filtered_candidates]
224 | star_scores = [math.log(repo.get("stars", 0) + 1) for repo in filtered_candidates] # log transform
225 |
226 | min_sem, max_sem = min(semantic_scores), max(semantic_scores)
227 | min_ce, max_ce = min(cross_encoder_scores), max(cross_encoder_scores)
228 | min_act, max_act = min(activity_scores), max(activity_scores)
229 | min_star, max_star = min(star_scores), max(star_scores)
230 |
231 | def normalize(val, min_val, max_val):
232 | if max_val - min_val == 0:
233 | return 0.5
234 | return (val - min_val) / (max_val - min_val)
235 |
236 | for repo in filtered_candidates:
237 | norm_sem = normalize(repo.get("semantic_similarity", 0), min_sem, max_sem)
238 | norm_ce = normalize(repo.get("cross_encoder_score", 0), min_ce, max_ce)
239 | norm_act = normalize(repo.get("activity_score", -100), min_act, max_act)
240 | norm_star = normalize(math.log(repo.get("stars", 0) + 1), min_star, max_star)
241 | # Weights: 30% cross-encoder, 20% semantic, 20% activity, 30% stars.
242 | repo["final_score"] = 0.3 * norm_ce + 0.2 * norm_sem + 0.2 * norm_act + 0.3 * norm_star
243 |
244 | final_ranked = sorted(filtered_candidates, key=lambda x: x["final_score"], reverse=True)
245 | logger.info(f"Stage 4 complete: Final ranking computed for {len(final_ranked)} candidates.")
246 |
247 | # ---------------------------
248 | # Final Output
249 | # ---------------------------
250 | print("\n=== Final Ranked Repositories ===")
251 | for rank, repo in enumerate(final_ranked[:10], 1):
252 | print(f"Final Rank: {rank}")
253 | print(f"Title: {repo['title']}")
254 | print(f"Link: {repo['link']}")
255 | print(f"Stars: {repo['stars']}")
256 | print(f"Semantic Similarity: {repo.get('semantic_similarity', 0):.4f}")
257 | print(f"Cross-Encoder Score: {repo.get('cross_encoder_score', 0):.4f}")
258 | print(f"Activity Score: {repo.get('activity_score', 0):.2f}")
259 | print(f"Final Score: {repo.get('final_score', 0):.4f}")
260 | print(f"Combined Doc Snippet: {repo['combined_doc'][:200]}...")
261 | print('-' * 80)
262 | print("\n=== End of Results ===")
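For reference, a worked example of the activity score used above, with invented counts: 4 open PRs, 10 open issues, and a last commit 60 days ago.

```python
pr_count, open_issues, days_since_last_commit = 4, 10, 60
non_pr_issues = max(0, open_issues - pr_count)   # 6
activity_score = (3 * pr_count) + non_pr_issues - (days_since_last_commit / 30)
print(activity_score)  # 12 + 6 - 2 = 16.0
```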
--------------------------------------------------------------------------------
/tools/test.py:
--------------------------------------------------------------------------------
1 | import os
2 | import requests
3 | import subprocess
4 | import tempfile
5 | import shutil
6 | import stat
7 | from pathlib import Path
8 | from dotenv import load_dotenv
9 | from langchain_groq import ChatGroq
10 | from langchain_core.prompts import ChatPromptTemplate
11 | from git import Repo
12 |
13 | # Load environment variables
14 | dotenv_path = Path(__file__).resolve().parent.parent / ".env"
15 | if dotenv_path.exists():
16 | load_dotenv(dotenv_path)
17 |
18 | # ---------------------------
19 | # Step 1: Instantiate Groq model
20 | # ---------------------------
21 | llm = ChatGroq(
22 | model="llama-3.1-8b-instant",
23 | temperature=0.3,
24 | max_tokens=128,
25 | max_retries=2,
26 | )
27 |
28 | # ---------------------------
29 | # Step 2: Build the prompt
30 | # ---------------------------
31 | prompt = ChatPromptTemplate.from_messages([
32 | ("system",
33 | """You are a GitHub search optimization expert.
34 |
35 | Your job is to:
36 | 1. Read a user's query about tools, research, or tasks.
37 | 2. Return **exactly two GitHub-style search tags or library names** that maximize repository discovery.
38 | 3. Tags must represent:
39 | - The core task/technique (e.g., image-augmentation, instruction-tuning)
40 | - A specific tool, model name, or approach (e.g., albumentations, label-studio, llama2)
41 |
42 | Output Format:
43 | tag-one:tag-two
44 |
45 | Rules:
46 | - Use lowercase and hyphenated keywords (e.g., image-augmentation, chain-of-thought)
47 | - Use terms commonly found in GitHub repo names, topics, or descriptions
48 | - Avoid generic terms like "python", "ai", "tool", "project"
49 | - Do NOT use full phrases or vague words like "no-code", "framework", "approach"
50 | - Prefer *real tools*, *popular methods*, or *dataset names* if mentioned
51 | - Choose high-signal keywords. Be precise.
52 |
53 | Excellent Examples:
54 |
55 | Input: "No code tool to augment image and annotation"
56 | Output: image-augmentation:albumentations
57 |
58 | Input: "Open-source tool for labeling datasets with UI"
59 | Output: label-studio:streamlit
60 |
61 | Input: "Visual reasoning models trained on multi-modal datasets"
62 | Output: multimodal-reasoning:vlm
63 |
64 | Input: "I want repos related to instruction-based finetuning for LLaMA 2"
65 | Output: instruction-tuning:llama2
66 |
67 | Input: "Repos around chain of thought prompting mainly for finetuned models"
68 | Output: chain-of-thought:finetuned-llm
69 |
70 | Input: "I want to fine-tune Gemini 1.5 Flash model"
71 | Output: gemini-finetuning:instruction-tuning
72 |
73 | Input: "Need repos for document parsing with vision-language models"
74 | Output: document-understanding:vlm
75 |
76 | Input: "How to train custom object detection models using YOLO"
77 | Output: object-detection:yolov5
78 |
79 | Input: "Segment anything-like models for interactive segmentation"
80 | Output: interactive-segmentation:segment-anything
81 |
82 | Input: "Synthetic data generation for vision model training"
83 | Output: synthetic-data:image-augmentation
84 |
85 | Input: "OCR pipeline for scanned documents"
86 | Output: ocr:document-processing
87 |
88 | Input: "LLMs with self-reflection or reasoning chains"
89 | Output: self-reflection:chain-of-thought
90 |
91 | Input: "Chatbot development using open-source LLMs"
92 | Output: chatbot:llm
93 |
94 | Output must be ONLY two search terms separated by a colon. No extra text. No bullet points.
95 | """),
96 | ("human", "{query}")
97 | ])
98 |
99 | # ---------------------------
100 | # Step 3: Chain model and prompt
101 | # ---------------------------
102 | chain = prompt | llm
103 |
104 | # ---------------------------
105 | # Step 4: Define a function to convert queries
106 | # ---------------------------
107 | def convert_to_search_tags(query: str) -> str:
108 | print(f"\n🧠 [convert_to_search_tags] Input Query: {query}")
109 | response = chain.invoke({"query": query})
110 | print(f"🔁 [convert_to_search_tags] Output Tags: {response.content.strip()}")
111 | return response.content.strip()
112 |
113 | # ---------------------------
114 | # Safe File Delete for Windows
115 | # ---------------------------
116 | def remove_readonly(func, path, exc_info):
117 | os.chmod(path, stat.S_IWRITE)
118 | func(path)
119 |
120 | # ---------------------------
121 | # Code Quality Checker (Robust Function)
122 | # ---------------------------
123 | def analyze_code_quality(repo_info):
124 | """
125 | Clone the repository and analyze Python files with flake8.
126 | Returns the repo_info dictionary augmented with:
127 | - code_quality_score: The computed quality score.
128 | - code_quality_issues: Total flake8 issues found.
129 | - python_files: Number of Python files analyzed.
130 | Returns None if the repo encounters errors or has no Python files.
131 | """
132 | full_name = repo_info.get('full_name', 'unknown')
133 | clone_url = repo_info.get('clone_url')
134 | temp_dir = tempfile.mkdtemp()
135 | repo_path = os.path.join(temp_dir, full_name.split("/")[-1])
136 |
137 | try:
138 | # Attempt shallow clone to save time and space
139 | Repo.clone_from(clone_url, repo_path, depth=1, no_single_branch=True)
140 |
141 | # Find all Python files
142 | py_files = list(Path(repo_path).rglob("*.py"))
143 | total_files = len(py_files)
144 | if total_files == 0:
145 | print(f"⚠️ No Python files found in {full_name}. Skipping repo.")
146 | return None
147 |
148 | # Run flake8 to collect issues
149 | process = subprocess.run(
150 | ["flake8", "--max-line-length=120", repo_path],
151 | stdout=subprocess.PIPE, stderr=subprocess.PIPE, text=True
152 | )
153 | output = process.stdout.strip()
154 | error_count = len(output.splitlines()) if output else 0
155 | issues_per_file = error_count / total_files
156 |
157 | # Robust scoring logic based on issues per file
158 | if issues_per_file <= 2:
159 | score = 95 + (2 - issues_per_file) * 2.5 # Range: 95–100
160 | elif issues_per_file <= 5:
161 | score = 70 + (5 - issues_per_file) * 6.5 # Range: 70–89
162 | elif issues_per_file <= 10:
163 |             score = 40 + (10 - issues_per_file) * 3   # Range: 40–55
164 | else:
165 | score = max(10, 40 - (issues_per_file - 10) * 2)
166 |
167 | repo_info["code_quality_score"] = round(score)
168 | repo_info["code_quality_issues"] = error_count
169 | repo_info["python_files"] = total_files
170 | return repo_info
171 |
172 | except Exception as e:
173 | print(f"❌ Error analyzing {full_name}: {e}. Skipping repo.")
174 | return None
175 | finally:
176 | try:
177 | shutil.rmtree(temp_dir, onerror=remove_readonly)
178 | except Exception as cleanup_e:
179 | print(f"⚠️ Cleanup error for {full_name}: {cleanup_e}")
180 |
181 | # ---------------------------
182 | # Example usage (if run directly)
183 | # ---------------------------
184 | if __name__ == "__main__":
185 | # Test the search tag conversion
186 | user_query = "I am looking for repos around finetuning gemini models mainly 1.5 flash 002"
187 | github_query = convert_to_search_tags(user_query)
188 | print("🔍 GitHub Search Query:")
189 | print(github_query)
190 |
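The scoring brackets above map average flake8 issues per Python file onto a 10–100 scale; a quick check with invented counts (90 issues across 30 files):

```python
error_count, total_files = 90, 30
issues_per_file = error_count / total_files  # 3.0

if issues_per_file <= 2:
    score = 95 + (2 - issues_per_file) * 2.5
elif issues_per_file <= 5:
    score = 70 + (5 - issues_per_file) * 6.5  # this bracket: 70 + 2 * 6.5
elif issues_per_file <= 10:
    score = 40 + (10 - issues_per_file) * 3
else:
    score = max(10, 40 - (issues_per_file - 10) * 2)

print(round(score))  # 83
```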
--------------------------------------------------------------------------------