├── LICENSE
├── README.md
├── app.py
├── requirements.txt
└── templates
└── index.html
/LICENSE:
--------------------------------------------------------------------------------
1 | MIT License
2 |
3 | Copyright (c) 2025 KARRI VAMSI KRISHNA
4 |
5 | Permission is hereby granted, free of charge, to any person obtaining a copy
6 | of this software and associated documentation files (the "Software"), to deal
7 | in the Software without restriction, including without limitation the rights
8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 |
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 |
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 |
 63 | ## 📋 Table of Contents
64 |
65 |
66 |
67 |
68 | | | Section | Description |
69 | |----|----------------------------------------|--------------------------------------|
70 | | 📺 | [Demo Video](#-demo-video) | See KV in action |
71 | | 🌟 | [Key Features](#-key-features) | What makes KV special |
72 | | 🖼️ | [Screenshots](#-screenshots) | Visual previews |
73 | | 🚀 | [Why Choose KV](#-why-choose-kv) | Benefits & advantages |
74 | | ⚙️ | [Installation](#%EF%B8%8F-installation) | Get up and running |
75 | | 🎮 | [Usage Guide](#-usage-guide) | How to use KV effectively |
76 | | 🤝 | [Contribution](#-contribution) | Join our community |
77 | | 📄 | [License](#-license) | MIT License |
78 | | 📞 | [Contact & Support](#-contact--support) | Get in touch |
79 |
80 |
81 |
82 |
83 |
84 |
85 |
86 |
87 |
88 |
 89 | KV is an open-source platform that combines Google Gemini AI with advanced web scraping for deep research and job searching. It goes beyond conventional tools to provide deep insights, analyze resumes, identify skill gaps, and help you find your ideal career path.
90 |
91 |
92 |
93 |
94 |
95 |
96 |
97 | ## 📺 Demo Video
98 |
99 |
108 |
109 |
110 |
111 |
112 |
113 |
114 | ## 🌟 Key Features
115 |
116 |
117 |
118 | | 🔥 Feature | 💫 Description |
119 | |:--:|:--|
120 | | **Advanced AI Integration** | Leverages Google Gemini's powerful AI capabilities for deep analysis |
121 | | **Intelligent Web Scraping** | Gathers comprehensive data from across the internet |
122 | | **Multi-Search Engine Support** | Access Google, Bing, DuckDuckGo, and LinkedIn simultaneously |
123 | | **Iterative Research** | Self-refining search strategies for more precise results |
124 | | **Resume Analysis** | AI-powered evaluation of your resume with improvement suggestions |
125 | | **Job Match Algorithm** | Finds ideal job opportunities based on your profile |
126 | | **Skill Gap Detection** | Identifies missing skills needed for your target positions |
127 | | **Professional Reports** | Generates beautiful PDF reports with data visualizations |
128 | | **UI Customization** | Dark & light mode with responsive design for all devices |
129 |
130 |
131 |
132 |
133 |
134 |
135 |
136 |
137 | ## 🖼️ Screenshots
138 |
139 |
140 |
141 |
142 | **📱 Modern Chat Interface**
143 |
144 |
145 |
146 |
147 |
148 |
149 | **🔍 Deep Research in Action**
150 |
151 |
152 |
153 |
154 |
155 |
156 | **⚙️ Customization Options**
157 |
158 |
159 |
160 |
161 |
162 |
163 |
164 |
165 |
166 | ## 🚀 Why Choose KV
167 |
168 |
197 |
198 |
199 |
200 |
201 |
202 |
203 | ## ⚙️ Installation
204 |
205 |
206 |
207 | ### ⚡ Quick Setup
208 |
209 |
210 | ```bash
211 | # Clone repository
212 | git clone https://github.com/kvcops/Deep-Research-using-Gemini-api.git
213 |
214 | # Change directory
215 | cd Deep-Research-using-Gemini-api
216 |
217 | # Create virtual environment
218 | python -m venv venv
219 |
220 | # Activate virtual environment
221 | # On Windows:
222 | venv\Scripts\activate
223 | # On macOS/Linux:
224 | source venv/bin/activate
225 |
226 | # Install dependencies
227 | pip install -r requirements.txt
228 |
229 | # Create .env file with your API key
230 | echo "GEMINI_API_KEY=YOUR_ACTUAL_GEMINI_API_KEY" > .env
231 |
232 | # Launch KV
233 | uvicorn app:app --host 127.0.0.1 --port 8000 --reload
234 | ```
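
Once the server is up, you can sanity-check the API outside the browser. The snippet below is a minimal sketch (not part of the project) that posts a message to the `/api/chat` endpoint defined in `app.py`; it assumes the default host and port from the launch command above.

```python
# Minimal sketch: exercise the /api/chat endpoint of a locally running KV instance.
# Assumes the server was started with the uvicorn command above (127.0.0.1:8000).
import requests

resp = requests.post(
    "http://127.0.0.1:8000/api/chat",
    json={
        "message": "Give me a two-sentence overview of iterative deep research.",
        "model_name": "gemini-2.0-flash",  # same default model app.py uses
    },
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["response"])
```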
235 |
236 |
242 |
243 |
244 |
245 |
246 |
247 |
248 | ## 🎮 Usage Guide
249 |
250 |
251 | ### 🚀 Get Started in 3 Easy Steps
252 |
253 |
254 |
255 |
256 |
257 |
258 |
259 |
260 | 1. **Start a new chat session or research query**
261 |    Navigate to the homepage and select your desired research mode
262 |
263 |
264 |
265 |
266 |
267 |
268 |
269 | 2. **Upload your resume for a job search or enter a research topic**
270 |    Provide relevant documents and specify your preferences
271 |
272 |
273 |
274 |
275 |
276 |
277 |
278 | 3. **Review results, download reports, and implement recommendations**
279 |    KV provides actionable insights and suggestions for improvement
280 |
281 |
282 |
283 |
284 |
285 |
286 |
287 |
288 |
289 | ## 🤝 Contribution
290 |
291 |
292 |
293 |
294 |
295 |
296 | We welcome all contributions! Here's how to get involved:
297 |
298 |
299 |
323 |
324 |
332 |
333 |
334 |
335 |
336 |
337 |
338 | ## 📄 License
339 |
340 |
341 |
342 | This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
343 |
344 |
345 |
346 |
347 |
348 | ## 📞 Contact & Support
349 |
350 |
358 |
359 |
360 |
361 |
362 |
363 |
364 | ✨ Thank you for using KV! ✨
365 | If you found it helpful, please consider giving it a star! ⭐
366 |
--------------------------------------------------------------------------------
/app.py:
--------------------------------------------------------------------------------
1 | # app.py (Complete, Modified)
2 |
3 | from fastapi import FastAPI, Request, HTTPException, UploadFile, File, Form
4 | from fastapi.responses import HTMLResponse, JSONResponse, StreamingResponse
5 | from fastapi.templating import Jinja2Templates
6 | from fastapi.middleware.cors import CORSMiddleware
7 | from typing import List, Dict, Tuple, Optional, Union, Any
8 | import google.generativeai as genai
9 | import os
10 | from dotenv import load_dotenv
11 | import base64
12 | from PIL import Image
13 | import io
14 | import requests
15 | from bs4 import BeautifulSoup, SoupStrainer
16 | import re
17 | import random
18 | import logging
19 | from tenacity import retry, stop_after_attempt, wait_exponential, retry_if_exception_type, before_sleep_log
20 | import time
21 | from urllib.parse import urlparse, urljoin, quote_plus, unquote
22 | import json
23 | from io import BytesIO
24 | from reportlab.lib.pagesizes import letter, A4
25 | from reportlab.platypus import SimpleDocTemplate, Paragraph, Spacer, Image as RLImage, PageBreak, Table, TableStyle, KeepTogether
26 | from reportlab.lib.styles import getSampleStyleSheet, ParagraphStyle
27 | from reportlab.lib.units import inch
28 | from reportlab.lib.enums import TA_LEFT, TA_CENTER, TA_RIGHT, TA_JUSTIFY
29 | from reportlab.lib.colors import HexColor, black, lightgrey
30 | import csv
31 | from datetime import date, datetime
32 | import math
33 | import concurrent.futures
34 | import brotli
35 | from pdfminer.high_level import extract_text as pdf_extract_text
36 | import docx2txt
37 | import chardet
38 | import asyncio
39 |
40 |
41 | load_dotenv()
42 |
43 | app = FastAPI()
44 |
45 | # --- CORS Configuration ---
46 | app.add_middleware(
47 | CORSMiddleware,
48 | allow_origins=["*"], # Allows all origins
49 | allow_credentials=True,
50 | allow_methods=["*"], # Allows all methods
51 | allow_headers=["*"], # Allows all headers
52 | )
53 |
54 |
55 | templates = Jinja2Templates(directory="templates")
56 |
57 | class Config:
58 | API_KEY = os.getenv("GEMINI_API_KEY")
59 | LOG_LEVEL = os.getenv("LOG_LEVEL", "INFO").upper()
60 | SNIPPET_LENGTH = 5000
61 | DEEP_RESEARCH_SNIPPET_LENGTH = 10000
62 | MAX_TOKENS_PER_CHUNK = 25000
63 | REQUEST_TIMEOUT = 60
64 | USER_AGENTS = [
65 | 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
66 | 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.0 Safari/605.1.15',
67 | 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:89.0) Gecko/20100101 Firefox/89.0',
68 | 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:89.0) Gecko/20100101 Firefox/89.0',
69 | 'Mozilla/5.0 (iPad; CPU OS 14_4 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.0.3 Mobile/15E148 Safari/604.1',
70 | 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.107 Safari/537.36',
71 | 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0.4577.63 Safari/537.36',
72 | 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
73 | 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
74 | 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:90.0) Gecko/20100101 Firefox/90.0'
75 | ]
76 | SEARCH_ENGINES = ["google", "duckduckgo", "bing", "yahoo", "brave", "linkedin"]
77 | JOB_SEARCH_ENGINES = ["linkedin", "indeed", "glassdoor"]
78 | MAX_WORKERS = 10
79 | CACHE_ENABLED = True
80 | CACHE = {}
81 | CACHE_TIMEOUT = 300
82 | INDEED_BASE_DELAY = 2
83 | INDEED_MAX_DELAY = 10
84 | INDEED_RETRIES = 5
85 | JOB_RELEVANCE_MODEL = os.getenv("JOB_RELEVANCE_MODEL", "gemini-2.0-flash")
86 | # REMOVED PDF PAGE LIMIT
87 |
88 | # --- Prompts: More Modular and Specific ---
89 | DEEP_RESEARCH_TABLE_PROMPT = (
90 | "Create a detailed comparison table analyzing: '{query}'.\n\n"
91 | "**Strict Table Formatting:**\n"
92 | "* **Markdown table ONLY.**\n"
93 | "* **Structure:** Header row, separator row (---), data rows.\n"
94 | "* **Rows:** Start and end with a pipe (|), spaces around pipes.\n"
95 | "* **Separator:** Three dashes (---) per column, alignment colons (:---:).\n"
96 | "* **Cells:** Concise (max 2-3 lines), consistent capitalization, 'N/A' for empty.\n"
 97 |         "* **NO line breaks within cells.** Use <br> for internal line breaks if absolutely necessary.\n"
98 | "**Content Guidelines:**\n"
99 | "* 3-5 relevant columns.\n"
100 | "* 4-8 data rows.\n"
101 | "* Proper alignment (usually center or left).\n"
102 | "* Verify all pipe and spacing rules.\n"
103 | "* **Output ONLY the table, NO extra text.**"
104 | )
105 | DEEP_RESEARCH_REFINEMENT_PROMPT = (
106 | "Analyze the following research summaries to identify key themes and entities. "
107 | "Suggest 3-5 new, more specific search queries that are *directly related* to the original topic: '{original_query}'. "
108 | "Identify any gaps in the current research and suggest queries to address those gaps. "
109 | "Do not suggest overly broad or generic queries. Focus on refining the search and addressing specific aspects. "
110 | "Prioritize queries that are likely to yield *different* results than the previous searches."
111 | )
112 | DEEP_RESEARCH_SUMMARY_PROMPT = (
113 | "Analyze snippets for: '{query}'. Extract key facts, figures, and insights. "
114 | "Be concise, ignore irrelevant content, and prioritize authoritative sources. "
115 | "Focus on the main topic and avoid discussing the research process itself.\n\nContent Snippets:"
116 | )
117 |
118 | DEEP_RESEARCH_REPORT_PROMPT = (
119 | "DEEP RESEARCH REPORT: Synthesize a comprehensive report from web research on: '{search_query}'.\n\n"
120 | "{report_structure}\n\n"
121 | "Research Summaries (all iterations):\n{summaries}\n\n"
122 | "Generate the report in Markdown."
123 | )
124 |
125 | config = Config()
126 |
127 | # --- Logging Configuration ---
128 | logging.basicConfig(level=config.LOG_LEVEL, format='%(asctime)s - %(levelname)s - %(message)s')
129 |
130 | # --- Gemini Configuration ---
131 | if not config.API_KEY:
132 |     logging.error("GEMINI_API_KEY not set. Exiting.")
133 |     exit(1)
134 | genai.configure(api_key=config.API_KEY)
135 |
136 | conversation_history = []
137 | deep_research_rate_limits = {
138 | "gemini-2.0-flash": {"requests_per_minute": 15, "last_request": 0},
139 | "gemini-2.0-flash-thinking-exp-01-21": {"requests_per_minute": 10, "last_request": 0}
140 | }
141 | DEFAULT_DEEP_RESEARCH_MODEL = "gemini-2.0-flash"
142 |
143 | def rate_limit_model(model_name):
144 | if model_name in deep_research_rate_limits:
145 | rate_limit_data = deep_research_rate_limits[model_name]
146 | now = time.time()
147 | time_since_last_request = now - rate_limit_data["last_request"]
148 | requests_per_minute = rate_limit_data["requests_per_minute"]
149 | wait_time = max(0, 60 / requests_per_minute - time_since_last_request)
150 | if wait_time > 0:
151 | logging.info(f"Rate limiting {model_name}, waiting for {wait_time:.2f} seconds")
152 | time.sleep(wait_time)
153 | rate_limit_data["last_request"] = time.time()
154 |
155 | _user_agents = config.USER_AGENTS
156 |
157 | def get_random_user_agent():
158 | return random.choice(_user_agents)
159 |
160 | def process_base64_image(base64_string):
161 | try:
162 | if 'base64,' in base64_string:
163 | base64_string = base64_string.split('base64,')[1]
164 | image_data = base64.b64decode(base64_string)
165 | image_stream = io.BytesIO(image_data)
166 | image = Image.open(image_stream)
167 | if image.mode != 'RGB':
168 | image = image.convert('RGB')
169 | img_byte_arr = io.BytesIO()
170 | image.save(img_byte_arr, format='JPEG')
171 | return {'mime_type': 'image/jpeg', 'data': img_byte_arr.getvalue()}
172 | except Exception as e:
173 | logging.error(f"Error processing image: {e}")
174 | return None
175 |
176 | def get_shortened_url(url):
177 | try:
178 | parsed_url = urlparse(url)
179 | if not parsed_url.scheme:
180 | url = "http://" + url # Add scheme if missing
181 | tinyurl_api = f"https://tinyurl.com/api-create.php?url={quote_plus(url)}"
182 | response = requests.get(tinyurl_api, timeout=5)
183 | response.raise_for_status() # Raise HTTPError for bad responses (4xx or 5xx)
184 | return response.text
185 | except requests.exceptions.RequestException as e:
186 | logging.error(f"Error shortening URL '{url}': {e}")
187 | return url # Return original URL on error
188 | except Exception as e:
189 | logging.error(f"Unexpected error shortening URL '{url}': {e}")
190 | return url
191 |
192 | def fix_url(url):
193 | try:
194 | parsed = urlparse(url)
195 | if not parsed.scheme:
196 | url = "https://" + url
197 | parsed = urlparse(url) # Re-parse with scheme
198 | if not parsed.netloc:
199 | return None # Invalid URL
200 | return url.split("?")[0] # Removes parameters
201 | except Exception:
202 | return None
203 |
204 | def scrape_search_engine(search_query: str, engine_name: str) -> List[str]:
205 | """Scrapes search results from specified search engine."""
206 | if engine_name == "google":
207 | return scrape_google(search_query)
208 | elif engine_name == "duckduckgo":
209 | return scrape_duckduckgo(search_query)
210 | elif engine_name == "bing":
211 | return scrape_bing(search_query)
212 | elif engine_name == "yahoo":
213 | return scrape_yahoo(search_query)
214 | elif engine_name == "brave":
215 | return scrape_brave(search_query)
216 | elif engine_name == "linkedin":
217 | return scrape_linkedin(search_query)
218 | else:
219 | logging.warning(f"Unknown search engine: {engine_name}")
220 | return []
221 |
222 | def scrape_google(search_query: str) -> List[str]:
223 | """Scrapes Google search results."""
224 | search_results = []
225 | google_url = f"https://www.google.com/search?q={quote_plus(search_query)}&num=20"
226 | try:
227 | headers = {'User-Agent': get_random_user_agent()}
228 | response = requests.get(google_url, headers=headers, timeout=config.REQUEST_TIMEOUT)
229 | response.raise_for_status()
230 | logging.info(f"Google Status Code: {response.status_code} for query: {search_query}")
231 | if response.status_code == 200:
232 | only_results = SoupStrainer('div', class_='tF2Cxc')
233 | google_soup = BeautifulSoup(response.text, 'html.parser', parse_only=only_results)
234 | for result in google_soup.find_all('div', class_='tF2Cxc'):
235 | link = result.find('a', href=True)
236 | if link:
237 | href = link['href']
238 | fixed_url = fix_url(href)
239 | if fixed_url:
240 | search_results.append(fixed_url)
241 | elif response.status_code == 429:
242 | logging.warning("Google rate limit hit (429).")
243 | else:
244 | logging.warning(f"Google search failed with status code: {response.status_code}")
245 | except requests.exceptions.RequestException as e:
246 | logging.error(f"Error scraping Google: {e}")
247 | return list(set(search_results)) # Remove duplicates
248 |
249 | def scrape_duckduckgo(search_query: str) -> List[str]:
250 | """Scrapes DuckDuckGo search results."""
251 | search_results = []
252 | duck_url = f"https://html.duckduckgo.com/html/?q={quote_plus(search_query)}"
253 | try:
254 | response = requests.get(duck_url, headers={'User-Agent': get_random_user_agent()}, timeout=config.REQUEST_TIMEOUT)
255 | response.raise_for_status()
256 | logging.info(f"DuckDuckGo Status Code: {response.status_code}")
257 | if response.status_code == 200:
258 | only_results = SoupStrainer('a', class_='result__a')
259 | duck_soup = BeautifulSoup(response.text, 'html.parser', parse_only=only_results)
260 | for a_tag in duck_soup.find_all('a', class_='result__a', href=True):
261 | href = a_tag['href']
262 | fixed_url = fix_url(urljoin("https://html.duckduckgo.com/", href)) # Make absolute
263 | if fixed_url: search_results.append(fixed_url)
264 | except Exception as e:
265 | logging.error(f"Error scraping DuckDuckGo: {e}")
266 | return list(set(search_results))
267 |
268 | def scrape_bing(search_query: str) -> List[str]:
269 | """Scrapes Bing search results."""
270 | search_results = []
271 | bing_url = f"https://www.bing.com/search?q={quote_plus(search_query)}"
272 | try:
273 | response = requests.get(bing_url, headers={'User-Agent': get_random_user_agent()}, timeout=config.REQUEST_TIMEOUT)
274 | response.raise_for_status()
275 | logging.info(f"Bing Status Code: {response.status_code}")
276 | if response.status_code == 200:
277 | only_results = SoupStrainer('li', class_='b_algo')
278 | bing_soup = BeautifulSoup(response.text, 'html.parser', parse_only=only_results)
279 | for li in bing_soup.find_all('li', class_='b_algo'):
280 | for a_tag in li.find_all('a', href=True):
281 | href = a_tag['href']
282 | fixed_url = fix_url(href) # Fix the URL
283 | if fixed_url: search_results.append(fixed_url)
284 | except Exception as e:
285 | logging.error(f"Error scraping Bing: {e}")
286 | return list(set(search_results))
287 |
288 | def scrape_yahoo(search_query: str) -> List[str]:
289 | """Scrapes Yahoo search results."""
290 | search_results = []
291 | yahoo_url = f"https://search.yahoo.com/search?p={quote_plus(search_query)}"
292 | try:
293 | response = requests.get(yahoo_url, headers={'User-Agent': get_random_user_agent()}, timeout=config.REQUEST_TIMEOUT)
294 | response.raise_for_status()
295 | logging.info(f"Yahoo Status Code: {response.status_code}")
296 | if response.status_code == 200:
297 | only_dd_divs = SoupStrainer('div', class_=lambda x: x and x.startswith('dd'))
298 | yahoo_soup = BeautifulSoup(response.text, 'html.parser', parse_only=only_dd_divs)
299 | for div in yahoo_soup.find_all('div', class_=lambda x: x and x.startswith('dd')):
300 | for a_tag in div.find_all('a', href=True):
301 | href = a_tag['href']
302 | match = re.search(r'/RU=(.*?)/RK=', href) # Yahoo uses a redirect
303 | if match:
304 | try:
305 | decoded_url = unquote(match.group(1))
306 | fixed_url = fix_url(decoded_url) # Fix the URL
307 | if fixed_url: search_results.append(fixed_url)
308 | except:
309 | logging.warning(f"Error decoding Yahoo URL: {href}")
310 | elif href: # Sometimes the direct URL is present
311 | fixed_url = fix_url(href)
312 | if fixed_url: search_results.append(fixed_url)
313 | except Exception as e:
314 | logging.error(f"Error scraping Yahoo: {e}")
315 | return list(set(search_results))
316 |
317 | def scrape_brave(search_query: str) -> List[str]:
318 | """Scrapes Brave search results."""
319 | search_results = []
320 | brave_url = f"https://search.brave.com/search?q={quote_plus(search_query)}"
321 | try:
322 | response = requests.get(brave_url, headers={'User-Agent': get_random_user_agent()}, timeout=config.REQUEST_TIMEOUT)
323 | response.raise_for_status()
324 | logging.info(f"Brave Status Code: {response.status_code}")
325 |
326 | if response.status_code == 200:
327 | if response.headers.get('Content-Encoding') == 'br':
328 | try:
329 | content = brotli.decompress(response.content) # Decompress Brotli
330 | only_links = SoupStrainer('a', class_='result-title')
331 | brave_soup = BeautifulSoup(content, 'html.parser', parse_only=only_links)
332 |
333 | except brotli.error as e:
334 | logging.error(f"Error decoding Brotli content: {e}")
335 | return []
336 | else:
337 | only_links = SoupStrainer('a', class_='result-title')
338 | brave_soup = BeautifulSoup(response.text, 'html.parser', parse_only=only_links)
339 |
340 | for a_tag in brave_soup.find_all('a', class_='result-title', href=True):
341 | href = a_tag['href']
342 | fixed_url = fix_url(href) # Fix URL
343 | if fixed_url:
344 | search_results.append(fixed_url)
345 |
346 | elif response.status_code == 429:
347 | logging.warning("Brave rate limit hit (429).")
348 | else:
349 | logging.warning(f"Brave search failed with status code: {response.status_code}")
350 |
351 | except Exception as e:
352 | logging.error(f"Error scraping Brave: {e}")
353 | return list(set(search_results))
354 |
355 | def scrape_linkedin(search_query: str) -> List[str]:
356 | """Scrapes LinkedIn search results (people primarily)."""
357 | search_results = []
358 | linkedin_url = f"https://www.linkedin.com/search/results/all/?keywords={quote_plus(search_query)}"
359 | try:
360 | response = requests.get(linkedin_url, headers={'User-Agent': get_random_user_agent()}, timeout=config.REQUEST_TIMEOUT)
361 | response.raise_for_status()
362 | logging.info(f"LinkedIn Status Code: {response.status_code}")
363 | if response.status_code == 200:
364 | only_results = SoupStrainer('div', class_='entity-result__item')
365 | linkedin_soup = BeautifulSoup(response.text, 'html.parser', parse_only=only_results)
366 | for result in linkedin_soup.find_all('div', class_='entity-result__item'):
367 | try:
368 | link_tag = result.find('a', class_='app-aware-link')
369 | if not link_tag or not link_tag.get('href'):
370 | continue
371 | profile_url = fix_url(link_tag.get('href')) # Fix the URL
372 | if not profile_url or '/in/' not in profile_url: # Check for profile
373 | continue
374 | # check for company context
375 | if " at " in search_query.lower():
376 | context = search_query.lower().split(" at ")[1] # company name
377 | name = result.find('span', class_='entity-result__title-text')
378 | title_company = result.find('div', class_='entity-result__primary-subtitle')
379 | combined_text = ""
380 | if name:
381 | combined_text += name.get_text(strip=True).lower() + " "
382 | if title_company:
383 | combined_text += title_company.get_text(strip=True).lower()
384 | if context not in combined_text: # Check if the company matches
385 | continue
386 |
387 | search_results.append(profile_url)
388 | except Exception as e:
389 | logging.warning(f"Error processing LinkedIn result: {e}")
390 | continue
391 | except Exception as e:
392 | logging.error(f"Error scraping LinkedIn: {e}")
393 | return search_results # No need to remove duplicates for LinkedIn
394 |
395 |
396 | def _decode_content(response: requests.Response) -> str:
397 | """Decodes response content, handling different encodings."""
398 | detected_encoding = chardet.detect(response.content)['encoding']
399 | if detected_encoding is None:
400 | logging.warning(f"Chardet failed. Using UTF-8.")
401 | detected_encoding = 'utf-8'
402 | logging.debug(f"Detected encoding: {detected_encoding}")
403 | try:
404 | return response.content.decode(detected_encoding, errors='replace')
405 | except UnicodeDecodeError:
406 | logging.warning(f"Decoding failed with {detected_encoding}. Trying UTF-8.")
407 | try: return response.content.decode('utf-8', errors='replace')
408 | except:
409 | logging.warning("Decoding failed. Using latin-1 (may cause data loss).")
410 | return response.content.decode('latin-1', errors='replace')
411 |
412 | def fetch_page_content(url: str, snippet_length: Optional[int] = None,
413 | extract_links: bool = False, extract_emails: bool = False) -> Tuple[List[str], List[str], Dict[str, Any]]:
414 | """Fetches content, handles caching, extracts data."""
415 | if snippet_length is None:
416 | snippet_length = config.SNIPPET_LENGTH
417 | content_snippets = []
418 | references = []
419 | extracted_data = {}
420 |
421 | if config.CACHE_ENABLED:
422 | if url in config.CACHE:
423 | if time.time() - config.CACHE[url]['timestamp'] < config.CACHE_TIMEOUT:
424 | logging.info(f"Using cached content for: {url}")
425 | return config.CACHE[url]['content_snippets'], config.CACHE[url]['references'], config.CACHE[url]['extracted_data']
426 | else:
427 | logging.info(f"Cache expired for: {url}")
428 | del config.CACHE[url] # Remove expired entry
429 |
430 | try:
431 | response = requests.get(url, headers={'User-Agent': get_random_user_agent()}, timeout=config.REQUEST_TIMEOUT)
432 | response.raise_for_status() # Raise HTTPError for bad responses
433 | logging.debug(f"Fetching page content status: {response.status_code} for: {url}")
434 | if response.status_code == 200:
435 | page_text = _decode_content(response)
436 | if page_text:
437 | page_soup = BeautifulSoup(page_text, 'html.parser')
438 | for script in page_soup(["script", "style"]):
439 | script.decompose() # Remove script and style tags
440 |
441 | text = page_soup.get_text(separator=' ', strip=True)
442 |                 text = re.sub(r'[\ud800-\udbff](?![\udc00-\udfff])|(?<![\ud800-\udbff])[\udc00-\udfff]', '', text)  # Strip unpaired surrogate characters
473 | def generate_alternative_queries(original_query: str) -> List[str]:
474 | """Generates alternative search queries using Gemini."""
475 | prompt = f"Suggest 3 refined search queries for '{original_query}', optimizing for broad and effective web results."
476 | parts = [{"role": "user", "parts": [{"text": prompt}]}]
477 | safety_settings = { # Gemini Pro safety settings
478 | "HARM_CATEGORY_HARASSMENT": "BLOCK_NONE",
479 | "HARM_CATEGORY_HATE_SPEECH": "BLOCK_NONE",
480 | "HARM_CATEGORY_SEXUALLY_EXPLICIT": "BLOCK_NONE",
481 | "HARM_CATEGORY_DANGEROUS_CONTENT": "BLOCK_NONE",
482 | }
483 |     model = genai.GenerativeModel(model_name="gemini-2.0-flash")  # Using gemini-2.0-flash for generating alternative queries
484 | try:
485 | response = model.generate_content(parts, safety_settings=safety_settings)
486 | return [q.strip() for q in response.text.split('\n') if q.strip()] # returns refined prompts
487 | except Exception as e:
488 | logging.error(f"Error generating alternative queries: {e}")
489 | return []
490 |
491 | @retry(wait=wait_exponential(multiplier=1, min=4, max=10), stop=stop_after_attempt(3), retry=retry_if_exception_type(Exception))
492 | def generate_gemini_response(prompt: str, model_name: str = "gemini-2.0-flash", response_format: str = "markdown") -> Union[str, Dict, List]:
493 | """Generates a response from Gemini, handling retries/formats."""
494 |
495 | if model_name not in deep_research_rate_limits: # No rate limit for job relevance model
496 | logging.info(f"Using model: {model_name}")
497 | else:
498 | rate_limit_model(model_name) # Rate limit the deep research model
499 |
500 | parts = [{"role": "user", "parts": [{"text": prompt}]}]
501 | safety_settings = {
502 | "HARM_CATEGORY_HARASSMENT": "BLOCK_NONE",
503 | "HARM_CATEGORY_HATE_SPEECH": "BLOCK_NONE",
504 | "HARM_CATEGORY_SEXUALLY_EXPLICIT": "BLOCK_NONE",
505 | "HARM_CATEGORY_DANGEROUS_CONTENT": "BLOCK_NONE",
506 | }
507 |
508 | model = genai.GenerativeModel(model_name=model_name)
509 |
510 | try:
511 | response = model.generate_content(parts, safety_settings=safety_settings)
512 | text_response = response.text
513 |
514 | if response_format == "json":
515 | try:
516 | # More robust JSON parsing: Handle leading/trailing text, comments.
517 | response_text_cleaned = re.sub(r"```json\n?|```|[\s]*//.*|[\s]*/\*[\s\S]*?\*/[\s]*", "", text_response).strip()
518 | return json.loads(response_text_cleaned)
519 | except json.JSONDecodeError as e:
520 | logging.warning(f"Invalid JSON, returning raw text. Error: {e}, Response: {text_response}")
521 | return {"error": "Invalid JSON", "raw_text": text_response}
522 |
523 | elif response_format == "csv":
524 | try:
525 | csv_data = io.StringIO(text_response)
526 | return list(csv.reader(csv_data, delimiter=',', quotechar='"'))
527 | except Exception as e:
528 | logging.warning(f"Invalid CSV, returning raw text. Error: {e}")
529 | return {"error": "Invalid CSV", "raw_text": text_response}
530 | else: # Default to Markdown
531 | text_response = re.sub(r'\n+', '\n\n', text_response) # Consistent newlines
532 | text_response = re.sub(r' +', ' ', text_response) # Single spaces
533 | return text_response.replace("```markdown", "").replace("```", "").strip() # Remove Markdown fences
534 | except Exception as e:
535 | logging.error(f"Gemini error: {e}")
536 | raise
537 |
538 | # --- FastAPI Endpoints ---
539 |
540 | @app.get("/", response_class=HTMLResponse)
541 | async def read_root(request: Request):
542 | return templates.TemplateResponse("index.html", {"request": request})
543 |
544 | @app.post("/api/chat")
545 | async def chat_endpoint(request: Request):
546 | global conversation_history
547 | try:
548 | data = await request.json()
549 | user_message = data.get('message', '')
550 | image_data = data.get('image')
551 | custom_instruction = data.get('custom_instruction')
552 | model_name = data.get('model_name', 'gemini-2.0-flash')
553 |
554 | if custom_instruction and len(conversation_history) == 0:
555 | model = genai.GenerativeModel(model_name=model_name)
556 | chat = model.start_chat(history=[
557 | {"role": "user", "parts": [{"text": custom_instruction}]},
558 | {"role": "model", "parts": ["Understood."]}
559 | ])
560 | conversation_history = chat.history
561 | else:
562 | model = genai.GenerativeModel(model_name=model_name)
563 | chat = model.start_chat(history=conversation_history)
564 |
565 | if image_data:
566 | image_part = process_base64_image(image_data)
567 | if image_part:
568 | # Correctly pass the image to the model
569 | response = chat.send_message([user_message, {"mime_type": image_part["mime_type"], "data": image_part["data"]}], stream=False)
570 |
571 | else:
572 | raise HTTPException(status_code=400, detail="Failed to process image")
573 | else:
574 | response = chat.send_message(user_message, stream=False)
575 |
576 | response_text = response.text
577 | response_text = re.sub(r'\n+', '\n\n', response_text)
578 | response_text = re.sub(r' +', ' ', response_text)
579 | response_text = re.sub(r'^- ', '* ', response_text, flags=re.MULTILINE) # Correct bullet points
580 | response_text = response_text.replace("```markdown", "").replace("```", "").strip() # Remove markdown
581 |
582 | conversation_history = chat.history
583 |
584 | def content_to_dict(content):
585 | return {
586 | "role": content.role,
587 | "parts": [part.text if hasattr(part, 'text') else str(part) for part in content.parts]
588 | }
589 | serialized_history = [content_to_dict(content) for content in conversation_history]
590 |
591 | return JSONResponse({"response": response_text, "history": serialized_history})
592 |
593 | except Exception as e:
594 | logging.error(f"Chat error: {e}")
595 | raise HTTPException(status_code=500, detail=str(e))
596 |
597 | @app.post("/api/clear")
598 | async def clear_history_endpoint():
599 | global conversation_history
600 | conversation_history = []
601 | return JSONResponse({"message": "Cleared history."})
602 |
603 | def process_in_chunks(search_results: List[str], search_query: str, prompt_prefix: str = "",
604 | fetch_options: Optional[Dict] = None) -> Tuple[List[str], List[str], List[Dict]]:
605 | """Processes search results in chunks, fetching/summarizing."""
606 | chunk_summaries = []
607 | references = []
608 | processed_tokens = 0
609 | current_chunk_content = []
610 | extracted_data_all = []
611 |
612 | if fetch_options is None:
613 | fetch_options = {}
614 |
615 | with concurrent.futures.ThreadPoolExecutor(max_workers=config.MAX_WORKERS) as executor:
616 | futures = {executor.submit(fetch_page_content, url, config.DEEP_RESEARCH_SNIPPET_LENGTH, **fetch_options): url for url in search_results}
617 | for future in concurrent.futures.as_completed(futures):
618 | url = futures[future]
619 | try:
620 | page_snippets, page_refs, extracted_data = future.result()
621 | references.extend(page_refs)
622 | extracted_data_all.append({'url': url, 'data': extracted_data})
623 |
624 | for snippet in page_snippets:
625 | estimated_tokens = len(snippet) // 4 # Estimate tokens
626 | if processed_tokens + estimated_tokens > config.MAX_TOKENS_PER_CHUNK:
627 | # Combine, summarize, and reset
628 | combined_content = "\n\n".join(current_chunk_content)
629 | if combined_content.strip():
630 | summary_prompt = config.DEEP_RESEARCH_SUMMARY_PROMPT.format(query=search_query) + f"\n\n{combined_content}"
631 | summary = generate_gemini_response(summary_prompt, model_name=DEFAULT_DEEP_RESEARCH_MODEL)
632 | chunk_summaries.append(summary)
633 | current_chunk_content = []
634 | processed_tokens = 0
635 |
636 | current_chunk_content.append(snippet)
637 | processed_tokens += estimated_tokens
638 |
639 | except Exception as e:
640 | logging.error(f"Error processing {url}: {e}")
641 | continue
642 |
643 | # Process any remaining content
644 | if current_chunk_content:
645 | combined_content = "\n\n".join(current_chunk_content)
646 | if combined_content.strip():
647 | summary_prompt = config.DEEP_RESEARCH_SUMMARY_PROMPT.format(query=search_query) + f"\n\n{combined_content}"
648 | summary = generate_gemini_response(summary_prompt, model_name=DEFAULT_DEEP_RESEARCH_MODEL)
649 | chunk_summaries.append(summary)
650 |
651 | return chunk_summaries, references, extracted_data_all
652 |
653 | @app.post("/api/online")
654 | async def online_search_endpoint(request: Request):
655 | """Performs an online search and summarizes results."""
656 | try:
657 | data = await request.json()
658 | search_query = data.get('query', '')
659 | if not search_query:
660 | raise HTTPException(status_code=400, detail="No query provided")
661 |
662 | references = []
663 | search_results = []
664 | content_snippets = []
665 | search_engines_requested = data.get('search_engines', config.SEARCH_ENGINES)
666 |
667 | with concurrent.futures.ThreadPoolExecutor(max_workers=config.MAX_WORKERS) as executor:
668 | search_futures = [executor.submit(scrape_search_engine, search_query, engine) for engine in search_engines_requested]
669 | for future in concurrent.futures.as_completed(search_futures):
670 | try:
671 | search_results.extend(future.result())
672 | except Exception as e:
673 | logging.error(f"Search engine scrape error: {e}")
674 |
675 | if not search_results:
676 | logging.warning(f"Initial search failed: {search_query}. Trying alternatives.")
677 | alternative_queries = generate_alternative_queries(search_query)
678 | if alternative_queries:
679 | logging.info(f"Alternative queries: {alternative_queries}")
680 | for alt_query in alternative_queries:
681 | alt_search_futures = [executor.submit(scrape_search_engine, alt_query, engine) for engine in
682 | search_engines_requested]
683 | for future in concurrent.futures.as_completed(alt_search_futures):
684 | try:
685 | result = future.result()
686 | if result:
687 | search_results.extend(result)
688 | logging.info(f"Results found with alternative: {alt_query}")
689 | break # Stop on first result
690 | except Exception as e:
691 | logging.error(f"Alternative query scrape error: {e}")
692 | if search_results:
693 | break # Stop after finding results
694 | else:
695 | logging.warning("Gemini failed to generate alternatives.")
696 |
697 | if not search_results:
698 | raise HTTPException(status_code=404, detail="No results found")
699 |
700 | unique_search_results = list(set(search_results))
701 | logging.debug(f"Unique URLs to fetch: {unique_search_results}")
702 | # Fetch content concurrently
703 | fetch_futures = {executor.submit(fetch_page_content, url): url for url in unique_search_results}
704 | for future in concurrent.futures.as_completed(fetch_futures):
705 | url = fetch_futures[future]
706 | try:
707 | page_snippets, page_refs, _ = future.result()
708 | content_snippets.extend(page_snippets)
709 | references.extend(page_refs)
710 | except Exception as e:
711 | logging.error(f"Error fetching {url}: {e}")
712 |
713 | combined_content = "\n\n".join(content_snippets)
714 | prompt = (f"Analyze web content for: '{search_query}'. Extract key facts, figures, and details. Be concise. "
715 | f"Content:\n\n{combined_content}\n\nProvide a fact-based summary.")
716 | explanation = generate_gemini_response(prompt) #Default model
717 | global conversation_history # Access global variable
718 |
719 | def serialize_content(content):
720 | if isinstance(content, list):
721 | return [serialize_content(item) for item in content]
722 | elif hasattr(content, 'role') and hasattr(content, 'parts'):
723 | return {
724 | "role": content.role,
725 | "parts": [part.text if hasattr(part, 'text') else str(part) for part in content.parts]
726 | }
727 | else:
728 | return content
729 |
730 | conversation_history.append({"role": "user", "parts": [f"Online: {search_query}"]})
731 | conversation_history.append({"role": "model", "parts": [explanation]})
732 | serialized_history = [serialize_content(item) for item in conversation_history]
733 |
734 | with concurrent.futures.ThreadPoolExecutor(max_workers=config.MAX_WORKERS) as executor:
735 | shortened_references = list(executor.map(get_shortened_url, references)) #Shorten URLs concurrently
736 |
737 |
738 | return JSONResponse({"explanation": explanation, "references": shortened_references, "history": serialized_history})
739 |
740 | except HTTPException as e:
741 | raise e # Re-raise HTTP exceptions
742 | except Exception as e:
743 | logging.exception(f"Online search error: {e}") # Log with traceback
744 | raise HTTPException(status_code=500, detail=str(e))
745 |
746 |
747 | def serialize_content(content): #helper function
748 | if isinstance(content, list):
749 | return [serialize_content(item) for item in content]
750 | elif hasattr(content, 'role') and hasattr(content, 'parts'):
751 | return {
752 | "role": content.role,
753 | "parts": [part.text if hasattr(part, 'text') else str(part) for part in content.parts]
754 | }
755 | else:
756 | return content
757 |
758 |
759 | @app.post("/api/deep_research")
760 | async def deep_research_endpoint(request: Request):
761 | try:
762 | data = await request.json()
763 | search_query = data.get('query', '')
764 | if not search_query:
765 | raise HTTPException(status_code=400, detail="No query provided")
766 |
767 | model_name = data.get('model_name', DEFAULT_DEEP_RESEARCH_MODEL)
768 | start_time = time.time()
769 | search_engines_requested = data.get('search_engines', config.SEARCH_ENGINES)
770 | output_format = data.get('output_format', 'markdown') # Default to markdown
771 | extract_links = data.get('extract_links', False)
772 | extract_emails = data.get('extract_emails', False)
773 | download_pdf = data.get('download_pdf', True) # Default to True
774 | max_iterations = int(data.get('max_iterations', 3)) # Default to 3
775 |
776 | all_summaries = []
777 | all_references = []
778 | all_extracted_data = []
779 | current_query = search_query # initial query
780 |
781 |
782 | # --- Enhanced Report Structure Logic ---
783 | report_structure = (
784 | "**Structure your report with clear headings and subheadings.**\n"
785 | "Use bullet points and numbered lists where appropriate.\n"
786 | "Include a concise introduction and conclusion.\n\n"
787 | )
788 |
789 | # Conditionally add table instructions
790 | if "table" in output_format.lower():
791 | report_structure += (
792 | "**Include a comparison table summarizing key findings.** "
793 | "Use the detailed table formatting guidelines provided earlier.\n"
794 | )
795 | table_prompt = config.DEEP_RESEARCH_TABLE_PROMPT.format(query=search_query) #prepare the table prompt
796 | else:
797 | report_structure += "**Do NOT include a table.** Focus on a narrative report.\n"
798 |
799 |
800 |
801 | with concurrent.futures.ThreadPoolExecutor(max_workers=config.MAX_WORKERS) as executor:
802 | for iteration in range(max_iterations):
803 | logging.info(f"Iteration {iteration + 1}: {current_query}")
804 | search_results = []
805 | #current_query = search_query if iteration == 0 else current_query # To keep track of current
806 | search_futures = [executor.submit(scrape_search_engine, current_query, engine) for engine in
807 | search_engines_requested]
808 | for future in concurrent.futures.as_completed(search_futures):
809 | search_results.extend(future.result())
810 |
811 | unique_results = list(set(search_results)) # Remove duplicates
812 | logging.debug(f"Iteration {iteration + 1} - URLs: {unique_results}")
813 |
814 | prompt_prefix = config.DEEP_RESEARCH_SUMMARY_PROMPT.format(query=current_query)
815 | fetch_options = {'extract_links': extract_links, 'extract_emails': extract_emails}
816 |
817 | chunk_summaries, refs, extracted = process_in_chunks(unique_results, current_query, prompt_prefix,
818 | fetch_options)
819 | all_summaries.extend(chunk_summaries)
820 | all_references.extend(refs)
821 | all_extracted_data.extend(extracted)
822 |
823 | if iteration < max_iterations - 1:
824 | # Refine the search query
825 | if all_summaries: # Check if we have any summaries to work with.
826 | refinement_prompt = config.DEEP_RESEARCH_REFINEMENT_PROMPT.format(original_query=search_query) + "\n\nResearch Summaries:\n" + "\n".join(all_summaries)
827 | refined_response = generate_gemini_response(refinement_prompt, model_name=model_name)
828 | new_queries = [q.strip() for q in refined_response.split('\n') if q.strip()]
829 | current_query = " ".join(new_queries[:3]) # Use top queries
830 | else:
831 | logging.info("No summaries for refinement. Skipping to next iteration.")
832 | break # If no summaries, stop refining.
833 |
834 |
835 | # --- Final Report Generation (with structure) ---
836 | if all_summaries:
837 | final_prompt = config.DEEP_RESEARCH_REPORT_PROMPT.format(
838 | search_query=search_query,
839 | report_structure=report_structure,
840 | summaries="\n\n".join(all_summaries)
841 | )
842 |
843 | # If table is requested, prepend the table prompt.
844 | if "table" in output_format.lower():
845 | final_prompt = table_prompt + "\n\n" + final_prompt
846 |
847 |
848 | final_explanation = generate_gemini_response(final_prompt, response_format=output_format,
849 | model_name=model_name)
850 |
851 | # --- Table Parsing (if applicable) ---
852 | if "table" in output_format.lower():
853 | try:
854 | parsed_table = parse_markdown_table(final_explanation)
855 | if parsed_table:
856 | final_explanation = parsed_table # Use parsed table
857 | else:
858 | logging.warning("Table parsing failed. Returning raw response.")
859 | final_explanation = {"error": "Failed to parse table", "raw_text": final_explanation}
860 | except Exception as e:
861 | logging.error(f"Error during table parsing: {e}")
862 | final_explanation = {"error": "Failed to parse table", "raw_text": final_explanation}
863 | else:
864 | final_explanation = "No relevant content found for the given query."
865 |
866 |
867 |
868 | global conversation_history # Access global variable
869 | conversation_history.append({"role": "user", "parts": [f"Deep research query: {search_query}"]})
870 | conversation_history.append({"role": "model", "parts": [final_explanation]})
871 | serialized_history = [serialize_content(item) for item in conversation_history]
872 |
873 | end_time = time.time()
874 | elapsed_time = end_time - start_time
875 |
876 |
877 | response_data = {
878 | "explanation": final_explanation,
879 | "references": all_references,
880 | "history": serialized_history,
881 | "elapsed_time": f"{elapsed_time:.2f} seconds",
882 | "extracted_data": all_extracted_data,
883 | "current_query": current_query, # Include the final query used
884 | "iteration": iteration + 1 # Include the final iteration number
885 |
886 | }
887 | if download_pdf:
888 | pdf_buffer = generate_pdf(
889 | "", # Pass an EMPTY STRING as the title.
890 | final_explanation if isinstance(final_explanation, str)
891 | else "\n".join(str(row) for row in final_explanation),
892 | all_references
893 | )
894 | headers = {
895 | 'Content-Disposition': f'attachment; filename="{quote_plus(search_query)}_report.pdf"'
896 | }
897 | return StreamingResponse(iter([pdf_buffer.getvalue()]), media_type="application/pdf", headers=headers)
898 |
899 |
900 | if output_format == "json":
901 | if isinstance(final_explanation, dict):
902 | # If it's already a dict (like error case), return it directly
903 | response_data = final_explanation
904 | elif isinstance(final_explanation, list):
905 | # table data
906 | response_data = {"table_data": final_explanation}
907 | else:
908 | # text explanation
909 | response_data = {"explanation": final_explanation}
910 | # Add other data to the JSON response
911 | response_data.update({
912 | "references": all_references,
913 | "history": serialized_history,
914 | "elapsed_time": f"{elapsed_time:.2f} seconds",
915 | "extracted_data": all_extracted_data
916 | })
917 | return JSONResponse(response_data)
918 |
919 |
920 | elif output_format == "csv":
921 | if isinstance(final_explanation, list):
922 | output = io.StringIO()
923 | writer = csv.writer(output)
924 | writer.writerows(final_explanation) # Write the list of lists
925 | response_data["explanation"] = output.getvalue()
926 | elif isinstance(final_explanation, dict) and "raw_text" in final_explanation:
927 | # Handle potential error dict
928 | response_data = {"explanation": final_explanation["raw_text"]}
929 | else:
930 | response_data = {"explanation": final_explanation} # for normal text
931 |
932 | response_data.update({
933 | "references": all_references,
934 | "history": serialized_history,
935 | "elapsed_time": f"{elapsed_time:.2f} seconds",
936 | "extracted_data": all_extracted_data
937 | })
938 |
939 | return JSONResponse(response_data)
940 |
941 | # If not JSON or CSV, return as is (Markdown)
942 | return JSONResponse(response_data)
943 |
944 | except HTTPException as e:
945 | raise e # Re-raise HTTP exceptions
946 | except Exception as e:
947 | logging.exception(f"Error in deep research: {e}") # Log full traceback
948 | raise HTTPException(status_code=500, detail=str(e))
949 |
950 | def parse_markdown_table(markdown_table_string):
951 | """Parses a Markdown table string with improved robustness."""
952 | lines = [line.strip() for line in markdown_table_string.split('\n') if line.strip()]
953 | if not lines:
954 | return []
955 |
956 | table_data = []
957 | header_detected = False
958 |
959 | for line in lines:
960 | line = line.strip().strip('|').replace(' | ', '|').replace('| ', '|').replace(' |', '|') # Normalize spacing
961 | cells = [cell.strip() for cell in line.split('|')]
962 |
963 | if all(c in '-:| ' for c in line) and len(cells) > 1 and not header_detected:
964 | # Skip header separator, but only *before* processing the first non-separator row
965 | header_detected = True
966 | continue
967 |
968 | if cells:
969 | table_data.append(cells)
970 |
971 | # Handle missing cells and inconsistent column counts.
972 | if table_data:
973 | max_cols = len(table_data[0])
974 | normalized_data = []
975 | for row in table_data:
976 | normalized_data.append(row + [''] * (max_cols - len(row))) # Pad with empty strings
977 | return normalized_data
978 | else:
979 | return []
980 |
981 | def generate_pdf(report_title, content, references):
982 | buffer = BytesIO()
983 | doc = SimpleDocTemplate(buffer, pagesize=A4,
984 | rightMargin=0.7*inch, leftMargin=0.7*inch,
985 | topMargin=0.7*inch, bottomMargin=0.7*inch)
986 | styles = getSampleStyleSheet()
987 | today = date.today()
988 | formatted_date = today.strftime("%B %d, %Y")
989 |
990 | # --- Custom Styles ---
991 | custom_styles = {
992 | 'Title': ParagraphStyle(
993 | 'CustomTitle',
994 | parent=styles['Heading1'],
995 | fontSize=24,
996 | leading=32,
997 | spaceAfter=20,
998 | alignment=TA_CENTER,
999 | textColor=HexColor("#1a237e"),
1000 | fontName='Helvetica-Bold'
1001 | ),
1002 | 'Heading1': ParagraphStyle(
1003 | 'CustomHeading1',
1004 | parent=styles['Heading1'],
1005 | fontSize=18,
1006 | leading=24,
1007 | spaceBefore=20,
1008 | spaceAfter=12,
1009 | textColor=HexColor("#283593"),
1010 | fontName='Helvetica-Bold',
1011 | keepWithNext=True # Keep with the following paragraph
1012 | ),
1013 | 'Heading2': ParagraphStyle(
1014 | 'CustomHeading2',
1015 | parent=styles['Heading2'],
1016 | fontSize=16,
1017 | leading=22,
1018 | spaceBefore=16,
1019 | spaceAfter=10,
1020 | textColor=HexColor("#3949ab"),
1021 | fontName='Helvetica-Bold',
1022 | keepWithNext=True
1023 | ),
1024 | 'Heading3': ParagraphStyle(
1025 | 'CustomHeading3',
1026 | parent=styles['Heading3'],
1027 | fontSize=14,
1028 | leading=20,
1029 | spaceBefore=14,
1030 | spaceAfter=8,
1031 | textColor=HexColor("#455a64"),
1032 | fontName='Helvetica-Bold',
1033 | keepWithNext=True
1034 | ),
1035 | 'Paragraph': ParagraphStyle(
1036 | 'CustomParagraph',
1037 | parent=styles['Normal'],
1038 | fontSize=11,
1039 | leading=16,
1040 | spaceAfter=10,
1041 | alignment=TA_JUSTIFY,
1042 | textColor=HexColor("#212121"),
1043 | firstLineIndent=0.25*inch
1044 | ),
1045 | 'TableCell': ParagraphStyle(
1046 | 'CustomTableCell',
1047 | parent=styles['Normal'],
1048 | fontSize=10,
1049 | leading=14,
1050 | spaceBefore=4,
1051 | spaceAfter=4,
1052 | textColor=HexColor("#212121")
1053 | ),
1054 | 'Bullet': ParagraphStyle( # Corrected bullet style
1055 | 'CustomBullet',
1056 | parent=styles['Normal'],
1057 | fontSize=11,
1058 | leading=16,
1059 | leftIndent=0.5*inch,
1060 | rightIndent=0,
1061 | spaceBefore=4,
1062 | spaceAfter=4,
1063 | bulletIndent=0.3*inch,
1064 | textColor=HexColor("#212121"),
1065 | bulletFontName='Helvetica', # Ensure consistent font
1066 | bulletFontSize=11
1067 | ),
1068 | 'Reference': ParagraphStyle(
1069 | 'CustomReference',
1070 | parent=styles['Normal'],
1071 | fontSize=10,
1072 | leading=14,
1073 | spaceAfter=4,
1074 | textColor=HexColor("#1565c0"),
1075 | alignment=TA_LEFT,
1076 | leftIndent=0.5*inch
1077 | ),
1078 | 'Footer': ParagraphStyle(
1079 | 'CustomFooter',
1080 | parent=styles['Italic'],
1081 | fontSize=9,
1082 | alignment=TA_CENTER,
1083 | textColor=HexColor("#757575"),
1084 | spaceBefore=24 # Space above the footer
1085 | )
1086 | }
1087 |
1088 | def clean_text(text):
1089 | # Convert common Markdown formatting to ReportLab equivalents
1090 | text = re.sub(r'\*\*(.*?)\*\*', r'\1 ', text) # Bold
1091 | text = re.sub(r'\*(.*?)\*', r'\1 ', text) # Italics
1092 | text = re.sub(r'`(.*?)`', r'\1 ', text) # Inline code
1093 | text = re.sub(r'\[(.*?)\]\((.*?)\)', r'\1 ', text) # Links
1094 | # Escape HTML entities to prevent issues in Paragraph
1095 |         text = text.replace('&', '&amp;').replace('<', '&lt;').replace('>', '&gt;')
1096 | return text.strip()
1097 |
1098 |
1099 | def process_table(table_text):
1100 | rows = [row.strip() for row in table_text.split('\n') if row.strip()]
1101 | if len(rows) < 2:
1102 | return None # Not a valid table
1103 |
1104 | header = [clean_text(cell) for cell in rows[0].strip('|').split('|')]
1105 | data_rows = []
1106 | for row in rows[2:]: # Skip header and separator lines
1107 | cells = [clean_text(cell) for cell in row.strip('|').split('|')]
1108 | data_rows.append(cells)
1109 |
1110 | # Convert to ReportLab Paragraph objects
1111 | table_data = [[Paragraph(cell, custom_styles['TableCell']) for cell in header]]
1112 | for row in data_rows:
1113 | table_data.append([Paragraph(cell, custom_styles['TableCell']) for cell in row])
1114 |
1115 | table_style = TableStyle([
1116 | ('BACKGROUND', (0, 0), (-1, 0), HexColor("#f5f5f5")), # Header background
1117 | ('TEXTCOLOR', (0, 0), (-1, 0), HexColor("#1a237e")), # Header text color
1118 | ('ALIGN', (0, 0), (-1, -1), 'CENTER'), # Center-align all cells
1119 | ('FONTNAME', (0, 0), (-1, 0), 'Helvetica-Bold'), # Header font
1120 | ('FONTSIZE', (0, 0), (-1, 0), 10),
1121 | ('BOTTOMPADDING', (0, 0), (-1, 0), 8), # Header padding
1122 | ('TOPPADDING', (0, 0), (-1, 0), 8),
1123 | ('GRID', (0, 0), (-1, -1), 1, HexColor("#e0e0e0")), # Grid
1124 | ('ROWBACKGROUNDS', (0, 1), (-1, -1), [HexColor("#ffffff"), HexColor("#f8f9fa")]), # Alternating row colors
1125 | ('ALIGN', (0, 0), (-1, -1), 'LEFT'), # Left-align cell content
1126 | ('FONTNAME', (0, 1), (-1, -1), 'Helvetica'), # Body font
1127 | ('FONTSIZE', (0, 1), (-1, -1), 9), # Body font size
1128 | ('TOPPADDING', (0, 1), (-1, -1), 6), # padding
1129 | ('BOTTOMPADDING', (0, 1), (-1, -1), 6),
1130 | ('VALIGN', (0, 0), (-1, -1), 'MIDDLE'), # Vertically center cell content
1131 | ])
1132 |
1133 | col_widths = [doc.width/len(header) for _ in header] # Equal column widths
1134 | table = Table(table_data, colWidths=col_widths)
1135 | table.setStyle(table_style)
1136 | return table
1137 |
1138 | def footer(canvas, doc):
1139 | canvas.saveState()
1140 | footer_text = f"Generated by Kv - AI Companion & Deep Research Tool • {formatted_date}"
1141 | footer = Paragraph(footer_text, custom_styles['Footer'])
1142 | w, h = footer.wrap(doc.width, doc.bottomMargin)
1143 | footer.drawOn(canvas, doc.leftMargin, h) # Draw footer
1144 | canvas.restoreState()
1145 |
1146 | story = []
1147 | story.append(Paragraph(report_title, custom_styles['Title'])) # Title
1148 | story.append(Paragraph(formatted_date, custom_styles['Footer'])) # Date
1149 | story.append(Spacer(1, 0.2*inch)) # Initial space
1150 |
1151 | current_table = []
1152 | in_table = False
1153 | lines = content.split('\n')
1154 | i = 0
1155 | current_paragraph = [] # Accumulate lines for a paragraph
1156 |
1157 | while i < len(lines):
1158 | line = lines[i].strip()
1159 |
1160 | if not line:
1161 | # Empty line: End current paragraph (if any), add it to story.
1162 | if current_paragraph:
1163 | story.append(Paragraph(clean_text(" ".join(current_paragraph)), custom_styles['Paragraph']))
1164 | current_paragraph = []
1165 | # Handle any pending table
1166 | if in_table and current_table:
1167 | table = process_table('\n'.join(current_table))
1168 | if table:
1169 | story.append(table)
1170 | story.append(Spacer(1, 0.1*inch))
1171 | current_table = []
1172 | in_table = False
1173 | story.append(Spacer(1, 0.05*inch)) # Consistent spacing
1174 | i += 1
1175 | continue
1176 |
1177 | if '|' in line and (line.count('|') > 1 or (i + 1 < len(lines) and '|' in lines[i + 1])):
1178 | # Likely a table row. Start/continue accumulating table lines.
1179 | in_table = True
1180 | current_table.append(line)
1181 | # End any current paragraph before starting a table.
1182 | if current_paragraph:
1183 | story.append(Paragraph(clean_text(" ".join(current_paragraph)), custom_styles['Paragraph']))
1184 | current_paragraph = []
1185 | elif in_table:
1186 | # End of table. Process accumulated table lines.
1187 | if current_table:
1188 | table = process_table('\n'.join(current_table))
1189 | if table:
1190 | story.append(table)
1191 | story.append(Spacer(1, 0.1*inch))
1192 | current_table = []
1193 | in_table = False
1194 | continue #Crucial to continue here, and not add to current_paragraph below.
1195 |
1196 | elif line.startswith('# '):
1197 | # End current paragraph (if any) before starting a heading.
1198 | if current_paragraph:
1199 | story.append(Paragraph(clean_text(" ".join(current_paragraph)), custom_styles['Paragraph']))
1200 | current_paragraph = []
1201 | story.append(Paragraph(clean_text(line[2:]), custom_styles['Heading1'])) # H1
1202 | elif line.startswith('## '):
1203 | if current_paragraph:
1204 | story.append(Paragraph(clean_text(" ".join(current_paragraph)), custom_styles['Paragraph']))
1205 | current_paragraph = []
1206 | story.append(Paragraph(clean_text(line[3:]), custom_styles['Heading2'])) # H2
1207 | elif line.startswith('### '):
1208 | if current_paragraph:
1209 | story.append(Paragraph(clean_text(" ".join(current_paragraph)), custom_styles['Paragraph']))
1210 | current_paragraph = []
1211 | story.append(Paragraph(clean_text(line[4:]), custom_styles['Heading3'])) # H3
1212 | elif line.startswith('* ') or line.startswith('- '):
1213 | # End current paragraph before a list item.
1214 | if current_paragraph:
1215 | story.append(Paragraph(clean_text(" ".join(current_paragraph)), custom_styles['Paragraph']))
1216 | current_paragraph = []
1217 | story.append(Paragraph(f"• {clean_text(line[2:])}", custom_styles['Bullet'])) # Bullet points
1218 |
1219 | else:
1220 | # Regular text line: add to the current paragraph.
1221 | current_paragraph.append(line)
1222 |
1223 | i += 1
1224 | # Add any remaining paragraph content (important!).
1225 | if current_paragraph:
1226 | story.append(Paragraph(clean_text(" ".join(current_paragraph)), custom_styles['Paragraph']))
1227 |
1228 | if current_table:
1229 | table = process_table('\n'.join(current_table))
1230 | if table:
1231 | story.append(table)
1232 | story.append(Spacer(1, 0.1*inch))
1233 |
1234 | if references:
1235 | story.append(PageBreak()) # References on a new page
1236 | story.append(Paragraph("References", custom_styles['Heading1']))
1237 | story.append(Spacer(1, 0.1*inch))
1238 | for i, ref in enumerate(references, 1):
1239 | story.append(Paragraph(f"[{i}] {ref}", custom_styles['Reference']))
1240 |
1241 | doc.build(story, onLaterPages=footer, onFirstPage=footer) # Apply to all pages.
1242 | buffer.seek(0)
1243 | return buffer
1244 |
1245 | # --- Product Scraping ---
1246 | def scrape_product_details(url):
1247 | """Scrapes product details from a given URL."""
1248 | try:
1249 | response = requests.get(url, headers={'User-Agent': get_random_user_agent()}, timeout=config.REQUEST_TIMEOUT)
1250 | response.raise_for_status()
1251 |
1252 | soup = BeautifulSoup(response.text, 'html.parser')
1253 | product_data = {}
1254 | # Title
1255 | for tag in ['h1', 'h2', 'span', 'div']:
1256 | for class_name in ['product-title', 'title', 'productName', 'product-name']:
1257 | if title_element := soup.find(tag, class_=class_name):
1258 | product_data['title'] = title_element.get_text(strip=True)
1259 | break
1260 | if 'title' in product_data:
1261 | break
1262 |
1263 | # Price
1264 | for tag in ['span', 'div', 'p']:
1265 | for class_name in ['price', 'product-price', 'sales-price', 'regular-price']:
1266 | if price_element := soup.find(tag, class_=class_name):
1267 | product_data['price'] = price_element.get_text(strip=True)
1268 | break
1269 | if 'price' in product_data:
1270 | break
1271 |
1272 | # Description
1273 | if (description_element := soup.find('div', {'itemprop': 'description'})):
1274 | product_data['description'] = description_element.get_text(strip=True)
1275 | else:
1276 | for class_name in ['description', 'product-description', 'product-details', 'details']:
1277 | if desc_element := soup.find(['div', 'p'], class_=class_name):
1278 | product_data['description'] = desc_element.get_text(separator='\n', strip=True)
1279 | break
1280 |
1281 | # Image URL
1282 | if (image_element := soup.find('img', {'itemprop': 'image'})):
1283 | product_data['image_url'] = urljoin(url, image_element['src'])
1284 | else:
1285 | for tag in ['img', 'div']:
1286 | for class_name in ['product-image', 'image', 'main-image', 'productImage']:
1287 | if (image_element := soup.find(tag, class_=class_name)) and image_element.get('src'):
1288 | product_data['image_url'] = urljoin(url, image_element['src'])
1289 | break
1290 | if 'image_url' in product_data:
1291 | break
1292 |
1293 | # Rating
1294 | if (rating_element := soup.find(['span', 'div'], class_=['rating', 'star-rating', 'product-rating'])):
1295 | product_data['rating'] = rating_element.get_text(strip=True)
1296 |
1297 | return product_data
1298 |
1299 | except requests.exceptions.RequestException as e:
1300 | logging.error(f"Error scraping product details from {url}: {e}")
1301 | return None # Return None on error
1302 | except Exception as e:
1303 | logging.error(f"Unexpected error scraping {url}: {e}")
1304 | return None # Return None on unexpected error
1305 |
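# Illustrative use of scrape_product_details (hypothetical URL; the selectors above are
# best-effort guesses, so any key may be missing for a given page):
#
#   details = scrape_product_details("https://example.com/product/123")
#   if details:
#       print(details.get("title"), details.get("price"))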
1306 | @app.post("/api/scrape_product")
1307 | async def scrape_product_endpoint(request: Request):
1308 | try:
1309 | data = await request.json()
1310 | product_query = data.get('query', '')
1311 | if not product_query:
1312 | raise HTTPException(status_code=400, detail="No product query provided")
1313 |
1314 | search_results = []
1315 | with concurrent.futures.ThreadPoolExecutor(max_workers=config.MAX_WORKERS) as executor:
1316 | futures = [executor.submit(scrape_search_engine, product_query, engine)
1317 | for engine in config.SEARCH_ENGINES]
1318 | for future in concurrent.futures.as_completed(futures):
1319 | try:
1320 | search_results.extend(future.result())
1321 | except Exception as e:
1322 | logging.error(f"Error in search engine scrape: {e}")
1323 |
1324 | unique_urls = list(set(search_results)) # Remove duplicate URLs
1325 |
1326 | all_product_data = []
1327 | with concurrent.futures.ThreadPoolExecutor(max_workers=config.MAX_WORKERS) as executor:
1328 | futures = {executor.submit(scrape_product_details, url): url for url in unique_urls}
1329 | for future in concurrent.futures.as_completed(futures):
1330 | try:
1331 | product_data = future.result()
1332 | if product_data: # Only add if data was successfully scraped
1333 | all_product_data.append(product_data)
1334 | except Exception as e:
1335 | url = futures[future]
1336 | logging.error(f"Error processing {url}: {e}")
1337 |
1338 | if all_product_data:
1339 | # Create a prompt to summarize the product information
1340 | prompt = "Summarize the following product information:\n\n"
1341 | for product in all_product_data:
1342 | prompt += f"- Title: {product.get('title', 'N/A')}\n"
1343 | prompt += f" Price: {product.get('price', 'N/A')}\n"
1344 | prompt += f" Description: {product.get('description', 'N/A')}\n"
1345 | prompt += "\n" # Add a separator between products
1346 |
1347 | prompt += "\nProvide a concise summary, including key features and price range."
1348 |
1349 | summary = generate_gemini_response(prompt) # default model
1350 |
1351 | return JSONResponse({"summary": summary, "products": all_product_data})
1352 | else:
1353 | raise HTTPException(status_code=404, detail="No product information found")
1354 |
1355 | except HTTPException as e:
1356 | raise e # Re-raise HTTP exceptions
1357 | except Exception as e:
1358 | logging.error(f"Error in product scraping endpoint: {e}")
1359 | raise HTTPException(status_code=500, detail=str(e))
1360 |
1361 | # --- Job Scraping ---
1362 | def extract_text_from_resume(resume_data: bytes) -> str:
1363 | """Extracts text from a resume (PDF, DOCX, or plain text)."""
1364 | try:
1365 | if resume_data.startswith(b"%PDF"):
1366 | # PDF file
1367 | resume_text = pdf_extract_text(io.BytesIO(resume_data))
1368 |         elif resume_data.startswith(b"PK\x03\x04"): # ZIP signature (DOCX files are ZIP archives)
1369 | # DOCX file
1370 | resume_text = docx2txt.process(io.BytesIO(resume_data))
1371 | else:
1372 | # Assume plain text
1373 | try:
1374 | resume_text = resume_data.decode('utf-8')
1375 | except UnicodeDecodeError:
1376 | resume_text = resume_data.decode('latin-1', errors='replace') # Fallback encoding
1377 | return resume_text
1378 | except Exception as e:
1379 | logging.error(f"Error extracting resume text: {e}")
1380 | return ""
1381 |
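# Minimal usage sketch for extract_text_from_resume (assumed local file path). The magic-byte
# checks above route "%PDF" to pdfminer, "PK\x03\x04" (ZIP, i.e. DOCX) to docx2txt, and
# everything else through plain-text decoding.
#
#   with open("resume.pdf", "rb") as f:
#       resume_text = extract_text_from_resume(f.read())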
1382 | # --- Helper functions for job scraping ---
1383 |
1384 | def linkedin_params(job_title: str, job_location: str, start: int = 0, experience_level: Optional[str] = None) -> Dict:
1385 | """Generates parameters for a LinkedIn job search URL."""
1386 | params = {
1387 | 'keywords': job_title,
1388 | 'location': job_location,
1389 | 'f_TPR': 'r86400', # Past 24 hours (consider making configurable)
1390 | 'sortBy': 'R', # Sort by relevance
1391 | 'start': start # pagination
1392 | }
1393 | if experience_level:
1394 | # LinkedIn-specific experience level filters (add others as needed)
1395 | if experience_level.lower() == "fresher":
1396 | params['f_E'] = '1' # Internship
1397 | elif experience_level.lower() == "entry-level":
1398 | params['f_E'] = '2' # Entry-Level
1399 | elif experience_level.lower() == "mid-level":
1400 | params['f_E'] = '3' # Associate
1401 | elif experience_level.lower() == "senior":
1402 | params['f_E'] = '4' # Senior
1403 | elif experience_level.lower() == "executive":
1404 |             params['f_E'] = '5' # '5' = Director ('6' would be Executive)
1405 | # params['f_E'] = ['4', '5'] # Combine Senior/Executive for LinkedIn
1406 |
1407 | return params
1408 |
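# Example of what linkedin_params produces (illustrative values only):
#   linkedin_params("Data Scientist", "Remote", start=25, experience_level="entry-level")
#   -> {'keywords': 'Data Scientist', 'location': 'Remote', 'f_TPR': 'r86400',
#       'sortBy': 'R', 'start': 25, 'f_E': '2'}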
1409 | def indeed_params(job_title: str, job_location: str, start: int = 0, experience_level: Optional[str] = None) -> Dict:
1410 | """Generates parameters for an Indeed job search URL."""
1411 | params = {
1412 | 'q': job_title,
1413 | 'l': job_location,
1414 | 'sort': 'relevance', # Sort by relevance
1415 | 'fromage': '1', # Past 24 hours
1416 | 'limit': 50, # Fetch more results per page
1417 | 'start': start # Pagination
1418 | }
1419 | if experience_level:
1420 | params['q'] = f"{experience_level} {params['q']}" # Add to the main query
1421 |
1422 |     return params
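# Example of what indeed_params produces (illustrative values only); the experience level
# is simply folded into the free-text query:
#   indeed_params("Data Analyst", "Austin, TX", experience_level="senior")
#   -> {'q': 'senior Data Analyst', 'l': 'Austin, TX', 'sort': 'relevance',
#       'fromage': '1', 'limit': 50, 'start': 0}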
1423 | def parse_linkedin_job_card(job_card: BeautifulSoup) -> Dict:
1424 | """Parses a single LinkedIn job card and extracts relevant information."""
1425 | try:
1426 | job_url_element = job_card.find('a', class_='base-card__full-link')
1427 | job_url = job_url_element['href'] if job_url_element else None
1428 |
1429 | title_element = job_card.find('h3', class_='base-search-card__title')
1430 | title = title_element.get_text(strip=True) if title_element else "N/A"
1431 |
1432 | company_element = job_card.find('h4', class_='base-search-card__subtitle')
1433 | company = company_element.get_text(strip=True) if company_element else "N/A"
1434 |
1435 | location_element = job_card.find('span', class_='job-search-card__location')
1436 | location = location_element.get_text(strip=True) if location_element else "N/A"
1437 |
1438 | return {
1439 | 'url': job_url,
1440 | 'title': title,
1441 | 'company': company,
1442 | 'location': location,
1443 | 'relevance': 0.0,
1444 | 'missing_skills': [],
1445 | 'justification': "Relevance not assessed.",
1446 | 'experience': 'N/A' # default value
1447 | }
1448 | except Exception as e:
1449 | logging.error(f"Error parsing LinkedIn job card: {e}")
1450 | return { # Return defaults on error
1451 | 'url': None,
1452 | 'title': "N/A",
1453 | 'company': "N/A",
1454 | 'location': "N/A",
1455 | 'relevance': 0.0,
1456 | 'missing_skills': [],
1457 | 'justification': f"Error parsing job card: {type(e).__name__}",
1458 | 'experience': 'N/A'
1459 | }
1460 |
1461 | def parse_indeed_job_card(job_card: BeautifulSoup) -> Dict:
1462 | """Parses a single Indeed job card and extracts relevant information."""
1463 | try:
1464 | title_element = job_card.find(['h2', 'a'], class_=lambda x: x and ('title' in x or 'jobtitle' in x))
1465 | title = title_element.get_text(strip=True) if title_element else "N/A"
1466 |
1467 | company_element = job_card.find(['span', 'a'], class_='companyName')
1468 | company = company_element.get_text(strip=True) if company_element else "N/A"
1469 |
1470 | location_element = job_card.find('div', class_='companyLocation')
1471 | location = location_element.get_text(strip=True) if location_element else "N/A"
1472 |
1473 | job_url = None
1474 | link_element = job_card.find('a', href=True)
1475 | if link_element and 'pagead' not in link_element['href']:
1476 | job_url = urljoin("https://www.indeed.com/jobs", link_element['href'])
1477 | if not job_url:
1478 | data_jk = job_card.get('data-jk')
1479 | if data_jk:
1480 | job_url = f"https://www.indeed.com/viewjob?jk={data_jk}"
1481 |
1482 | return {
1483 | 'url': job_url,
1484 | 'title': title,
1485 | 'company': company,
1486 | 'location': location,
1487 | 'relevance': 0.0,
1488 | 'missing_skills': [],
1489 | 'justification': "Relevance not assessed.",
1490 | 'experience':'N/A' # Default
1491 | }
1492 | except Exception as e:
1493 | logging.error(f"Error parsing Indeed job card: {e}")
1494 | return { # Return defaults on error
1495 | 'url': None,
1496 | 'title': "N/A",
1497 | 'company': "N/A",
1498 | 'location': "N/A",
1499 | 'relevance': 0.0,
1500 | 'missing_skills': [],
1501 | 'justification': f"Error parsing job card: {type(e).__name__}",
1502 | 'experience':'N/A'
1503 | }
1504 |
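# Retry policy for the scraper below: exponentially growing delays derived from
# config.INDEED_BASE_DELAY and capped at config.INDEED_MAX_DELAY, for at most
# config.INDEED_RETRIES attempts, and only on network-level failures
# (requests.exceptions.RequestException); each retry is logged before sleeping.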
1505 | @retry(
1506 | wait=wait_exponential(multiplier=config.INDEED_BASE_DELAY, max=config.INDEED_MAX_DELAY),
1507 | stop=stop_after_attempt(config.INDEED_RETRIES),
1508 | retry=retry_if_exception_type(requests.exceptions.RequestException),
1509 | before_sleep=lambda retry_state: logging.warning(
1510 | f"Indeed request failed (attempt {retry_state.attempt_number}). Retrying in {retry_state.next_action.sleep} seconds..."
1511 | )
1512 | )
1513 | def scrape_job_site(job_title: str, job_location: str, resume_text: Optional[str],
1514 | base_url: str, params_func: callable, parse_func: callable, site_name:str,
1515 | experience_level: Optional[str] = None) -> List[Dict]:
1516 | """
1517 | Generic function to scrape job listings from a given site.
1518 |
1519 | Args:
1520 | job_title: The job title to search for.
1521 | job_location: The location to search for jobs.
1522 | resume_text: Optional resume text for relevance assessment.
1523 | base_url: The base URL of the job site.
1524 | params_func: A function that generates the URL parameters for the site.
1525 | parse_func: A function that parses a job card from the site's HTML.
1526 | site_name: The name of the job site (e.g., "LinkedIn", "Indeed").
1527 | experience_level: Optional experience level string
1528 |
1529 | Returns:
1530 | A list of dictionaries, where each dictionary represents a job listing.
1531 | """
1532 |
1533 | search_results = []
1534 | start = 0 # pagination
1535 | MAX_PAGES = 10 # Limit pages to prevent infinite loops. Adjust as needed.
1536 |
1537 | while True:
1538 | params = params_func(job_title, job_location, start, experience_level) # Pass experience
1539 | try:
1540 | headers = {'User-Agent': get_random_user_agent()} # Rotate User-Agent
1541 | response = requests.get(base_url, params=params, headers=headers, timeout=config.REQUEST_TIMEOUT)
1542 | response.raise_for_status() # Raises HTTPError for bad (4xx, 5xx) responses
1543 |
1544 | if "captcha" in response.text.lower():
1545 | logging.warning(f"{site_name} CAPTCHA detected. Stopping.")
1546 | break # Exit pagination
1547 |
1548 | soup = BeautifulSoup(response.text, 'html.parser')
1549 |
1550 | # Use a general way to find job cards (more robust to site changes)
1551 | job_cards = soup.find_all('div', class_=lambda x: x and x.startswith('job_'))
1552 | if not job_cards:
1553 |                 job_cards = soup.find_all('div', class_='base-card') # Fall back to LinkedIn's job-card selector
1554 |
1555 |
1556 | if not job_cards:
1557 | if start == 0:
1558 | logging.warning(f"No {site_name} jobs found for: {params}")
1559 | else:
1560 | logging.info(f"No more {site_name} jobs found (page {start//50 + 1}).")
1561 | break # No more jobs, stop pagination
1562 |
1563 | for job_card in job_cards:
1564 | try:
1565 | job_data = parse_func(job_card) # Parse Individual job card.
1566 |
1567 | if not job_data['url']: # Skip if no URL
1568 | continue
1569 |
1570 | if resume_text:
1571 | try:
1572 | # Fetch the full job description
1573 | job_response = requests.get(job_data['url'], headers={'User-Agent': get_random_user_agent()},
1574 | timeout=config.REQUEST_TIMEOUT)
1575 | job_response.raise_for_status()
1576 | job_soup = BeautifulSoup(job_response.text, 'html.parser')
1577 | description_element = job_soup.find('div', id='jobDescriptionText') # Indeed
1578 | if not description_element: # for linkedin
1579 | description_element = job_soup.find('div', class_='description__text')
1580 | job_description = description_element.get_text(separator='\n', strip=True) if description_element else ""
1581 |
1582 | # --- Experience Level Extraction (from description) ---
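                            # The pattern below matches phrases such as "3-5 years",
                            # "3 to 5 years", or "5+ years" (illustrative examples).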
1583 | experience_match = re.search(r'(\d+\+?)\s*(?:-|to)?\s*(\d*)\s*years?', job_description, re.IGNORECASE)
1584 | if experience_match:
1585 | if experience_match.group(2): # If range
1586 | job_data['experience'] = f"{experience_match.group(1)}-{experience_match.group(2)} years"
1587 | else: # Just single number
1588 | job_data['experience'] = f"{experience_match.group(1)} years"
1589 | else: # Check for keywords
1590 | exp_keywords = {
1591 | 'fresher': ['fresher', 'graduate', 'entry level', '0 years', 'no experience'],
1592 | 'entry-level': ['0-2 years', '1-3 years', 'entry level', 'junior'],
1593 | 'mid-level' : ['3-5 years','2-5 years','mid level','intermediate'],
1594 | 'senior' : ['5+ years','5-10 years', 'senior','expert', 'lead'],
1595 | 'executive': ['10+ years', 'executive', 'director', 'vp', 'c-level']
1596 | }
1597 | for level, keywords in exp_keywords.items():
1598 | for keyword in keywords:
1599 | if keyword.lower() in job_description.lower():
1600 | job_data['experience'] = level
1601 | break # Stop checking once a level is found
1602 | if job_data['experience'] != 'N/A':
1603 | break # Stop checking other levels
1604 |
1605 |
1606 |
1607 |
1608 | job_description = job_description[:2000] # Truncate!
1609 | resume_text_trunc = resume_text[:2000] # Truncate!
1610 |
1611 | relevance_prompt = (
1612 | f"Assess the relevance of the following job to the resume. "
1613 | f"Provide a JSON object with ONLY the following keys:\n"
1614 | f"'relevance': float (between 0.0 and 1.0, where 1.0 is perfectly relevant),\n"
1615 | f"'missing_skills': list of strings (skills in the job description but not in the resume, or an empty list if none),\n"
1616 | f"'justification': string (REQUIRED. Explain the relevance score, including factors like experience level mismatch, skill gaps, or industry differences.).\n\n"
1617 | f"Job Description:\n{job_description}\n\nResume:\n{resume_text_trunc}"
1618 | )
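                            # The requested JSON is expected to look roughly like (illustrative values):
                            #   {"relevance": 0.72, "missing_skills": ["Kubernetes"],
                            #    "justification": "Strong overlap, but no container experience."}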
1619 | relevance_assessment = generate_gemini_response(relevance_prompt, response_format="json", model_name=config.JOB_RELEVANCE_MODEL)
1620 |                             if isinstance(relevance_assessment, dict) and "error" not in relevance_assessment:
1621 | job_data['relevance'] = relevance_assessment.get('relevance', 0.0) # Provide default
1622 | job_data['missing_skills'] = relevance_assessment.get('missing_skills', [])
1623 | job_data['justification'] = relevance_assessment.get('justification', "Relevance assessed.")
1624 | # Basic validation (optional, but good practice):
1625 | if not isinstance(job_data['relevance'], (int, float)):
1626 | logging.warning(f"Invalid relevance value: {job_data['relevance']}")
1627 | job_data['relevance'] = 0.0
1628 | if not isinstance(job_data['missing_skills'], list):
1629 | logging.warning(f"Invalid missing_skills: {job_data['missing_skills']}")
1630 | job_data['missing_skills'] = []
1631 | if not isinstance(job_data['justification'], str):
1632 | logging.warning(f"Invalid justification: {job_data['justification']}")
1633 | job_data['justification'] = "Error: Could not assess relevance properly."
1634 | elif isinstance(relevance_assessment, dict) and "error" in relevance_assessment:
1635 | # Handle the "Invalid JSON" case specifically
1636 | logging.warning(f"Invalid JSON from Gemini: {relevance_assessment['raw_text']}")
1637 | job_data['relevance'] = 0.0 # Set default values
1638 | job_data['missing_skills'] = []
1639 | job_data['justification'] = "Error: Could not assess relevance due to invalid JSON response."
1640 |
1641 | else: # Unexpected return from Gemini
1642 | logging.warning(f"Unexpected response from relevance assessment: {relevance_assessment}")
1643 | job_data['relevance'] = 0.0
1644 | job_data['missing_skills'] = []
1645 | job_data['justification'] = "Error: Could not assess relevance (unexpected response)."
1646 |
1647 | except requests.exceptions.RequestException as e:
1648 | logging.warning(f"Failed to fetch job description from {job_data['url']}: {e}")
1649 | job_data['justification'] = f"Error: Could not fetch job description ({type(e).__name__})."
1650 | except Exception as e:
1651 | logging.exception(f"Error during relevance assessment for {job_data['url']}: {e}")
1652 | job_data['relevance'] = 0.0
1653 | job_data['missing_skills'] = []
1654 | job_data['justification'] = "Error: Could not assess relevance (unexpected error)."
1655 |
1656 | search_results.append(job_data) # Append even with errors
1657 |
1658 | except Exception as e:
1659 | logging.warning(f"Error processing a {site_name} job card: {e}")
1660 | continue # skip to the next job card
1661 |
1662 | start += 50 # pagination
1663 |             if start // 50 + 1 > MAX_PAGES:
1664 | logging.info(f"Reached max pages ({MAX_PAGES}) for {site_name}.")
1665 | break
1666 |
1667 | except requests.exceptions.HTTPError as e:
1668 | logging.error(f"{site_name} HTTP Error: {e}")
1669 | break # unrecoverable errors
1670 | except requests.exceptions.RequestException as e:
1671 | logging.error(f"{site_name} Request Exception: {e}")
1672 | break # network errors
1673 |
1674 | return search_results
1675 |
1676 |
1677 | @app.post("/api/scrape_jobs")
1678 | async def scrape_jobs_endpoint(job_title: Optional[str] = Form(""), job_location: str = Form(...), resume: UploadFile = File(None),
1679 | job_experience: Optional[str] = Form(None)): # New parameter
1680 | try:
1681 | # Removed the 'not job_title' check to allow it to be optional
1682 |
1683 | resume_text = None
1684 | if resume:
1685 | resume_content = await resume.read()
1686 | resume_text = extract_text_from_resume(resume_content)
1687 | if not resume_text:
1688 | raise HTTPException(status_code=400, detail="Could not extract text from resume.")
1689 |
1690 | all_job_results = []
1691 |
1692 | with concurrent.futures.ThreadPoolExecutor(max_workers=config.MAX_WORKERS) as executor:
1693 | # Submit both scraping tasks *concurrently*
1694 | linkedin_future = executor.submit(scrape_job_site, job_title, job_location, resume_text,
1695 | "https://www.linkedin.com/jobs/search", linkedin_params, parse_linkedin_job_card, "LinkedIn", job_experience) # Pass experience
1696 | indeed_future = executor.submit(scrape_job_site, job_title, job_location, resume_text,
1697 | "https://www.indeed.com/jobs", indeed_params, parse_indeed_job_card, "Indeed", job_experience) # Pass experience
1698 |
1699 | # Get results, handling exceptions gracefully. Don't stop if one fails.
1700 | try:
1701 | all_job_results.extend(linkedin_future.result())
1702 | except Exception as e:
1703 | logging.error(f"Error scraping LinkedIn: {e}") # Log, but don't stop
1704 | try:
1705 | all_job_results.extend(indeed_future.result())
1706 | except Exception as e:
1707 | logging.error(f"Error scraping Indeed: {e}") # Log, but don't stop
1708 |
1709 | # --- Filtering and Sorting ---
1710 | # Filter by experience first
1711 | if job_experience:
1712 | filtered_jobs = [job for job in all_job_results if job.get('experience', '').lower() == job_experience.lower()]
1713 | else:
1714 | filtered_jobs = all_job_results
1715 |
1716 |
1717 | # Then, sort by experience level, then by relevance WITHIN each experience level
1718 | experience_order = ['fresher', 'entry-level', 'mid-level', 'senior', 'executive', 'N/A']
1719 | def sort_key(job):
1720 | # Get experience level (default to 'N/A' if missing, put last)
1721 | exp = job.get('experience', 'N/A').lower()
1722 | if exp not in experience_order:
1723 | exp = 'N/A' # Normalize to 'N/A'
1724 |
1725 | return (experience_order.index(exp), -job.get('relevance', 0.0))
1726 |
1727 | filtered_jobs.sort(key=sort_key)
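        # Illustrative ordering (hypothetical jobs): ('fresher', relevance 0.9) comes before
        # ('fresher', 0.4), which comes before ('senior', 0.99); jobs with unknown/'N/A'
        # experience sort last.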
1728 |
1729 |
1730 |
1731 | if filtered_jobs:
1732 | return JSONResponse({'jobs': filtered_jobs, 'jobs_found': len(all_job_results)})
1733 | else:
1734 |             # No jobs remain after filtering: return 200 with an empty list; 'jobs_found' still reflects the pre-filter count.
1735 |             return JSONResponse({'jobs': [], 'jobs_found': len(all_job_results)}, status_code=200)
1736 |
1737 |
1738 |
1739 | except HTTPException as e:
1740 | raise e # Re-raise HTTP exceptions for FastAPI to handle
1741 | except Exception as e:
1742 | logging.error(f"Error in jobs scraping endpoint: {e}")
1743 | raise HTTPException(status_code=500, detail=str(e))
1744 |
1745 | # Image Analysis Tool
1746 | @app.post("/api/analyze_image")
1747 | async def analyze_image_endpoint(request: Request):
1748 | try:
1749 | data = await request.json()
1750 | image_data = data.get('image')
1751 | if not image_data:
1752 | return JSONResponse({"error": "No image provided"}, status_code=400)
1753 |
1754 | image_part = process_base64_image(image_data)
1755 | if not image_part:
1756 | return JSONResponse({"error": "Failed to process image"}, status_code=400)
1757 |
1758 | model = genai.GenerativeModel('gemini-2.0-flash')
1759 | # image = Image.open(io.BytesIO(image_part['data'])) # No longer needed
1760 |         response = model.generate_content(["Describe this image in detail", image_part]) # Simple description prompt; pass the image part directly
1761 | response.resolve() # Ensure generation is complete
1762 |
1763 | return JSONResponse({"description": response.text})
1764 |
1765 | except Exception as e:
1766 | logging.exception("Error in image analysis")
1767 | return JSONResponse({"error": "Image analysis failed."}, status_code=500)
1768 |
1769 |
1770 | # Sentiment Analysis Tool
1771 | @app.post("/api/analyze_sentiment")
1772 | async def analyze_sentiment_endpoint(request: Request):
1773 | try:
1774 | data = await request.json()
1775 | text = data.get('text')
1776 | if not text:
1777 | return JSONResponse({"error": "No text provided"}, status_code=400)
1778 |
1779 | prompt = f"Analyze the sentiment of the following text and classify it as 'Positive', 'Negative', or 'Neutral'. Provide a brief justification:\n\n{text}"
1780 | sentiment_result = generate_gemini_response(prompt)
1781 |
1782 | return JSONResponse({"sentiment": sentiment_result})
1783 | except Exception as e:
1784 | logging.exception("Error in sentiment analysis")
1785 | return JSONResponse({"error": "Sentiment analysis failed."}, status_code=500)
1786 |
1787 | # Website Summarization Tool
1788 | @app.post("/api/summarize_website")
1789 | async def summarize_website_endpoint(request: Request):
1790 | try:
1791 | data = await request.json()
1792 | url = data.get('url')
1793 | if not url:
1794 | return JSONResponse({"error": "No URL provided"}, status_code=400)
1795 |
1796 | content_snippets, _, _ = fetch_page_content(url, snippet_length=config.SNIPPET_LENGTH)
1797 | if not content_snippets:
1798 | return JSONResponse({"error": "Could not fetch website content"}, status_code=400)
1799 | combined_content = "\n\n".join(content_snippets)
1800 | prompt = f"Summarize the following webpage content concisely:\n\n{combined_content}"
1801 | summary = generate_gemini_response(prompt)
1802 | return JSONResponse({"summary": summary})
1803 | except Exception as e:
1804 | logging.exception("Error in website summarization")
1805 | return JSONResponse({"error": "Website summarization failed."}, status_code=500)
1806 |
--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
1 | fastapi==0.110.0
2 | uvicorn==0.29.0
3 | python-dotenv==1.0.1
4 | Pillow==10.2.0
5 | requests==2.31.0
6 | beautifulsoup4==4.12.3
7 | soupsieve==2.5
8 | google-generativeai==0.4.1
9 | reportlab==4.1.0
10 | python-multipart==0.0.9
11 | tenacity==8.2.3
12 | urllib3==2.2.1
13 | brotli==1.1.0
14 | pdfminer.six==20231228
15 | docx2txt==0.8
16 | chardet==5.2.0
17 | Jinja2==3.1.3
18 | aiofiles==23.2.1
19 |
--------------------------------------------------------------------------------
/templates/index.html:
--------------------------------------------------------------------------------
1 |
2 |
3 |
4 |
5 |
6 |
7 | KV AI Assistant | Next-Gen Research Tool
8 |
9 |
10 |
11 |
12 |
13 |
14 |
15 |
16 |
1356 |
1357 |
1358 |
1359 |
1360 |
1385 |
1386 |
1387 |
1390 |
1391 |
1420 |
1421 |
1422 |
1423 |
1424 |
1425 |
1434 |
1435 |
1436 |
1437 |
1455 |
1456 |
1457 |
1458 |
1521 |
1522 |
1523 |
1682 |
1683 |
1684 |
1685 |
1686 |
1687 |
1688 | ×
1689 | Settings
1690 |
1691 | Custom Instruction:
1692 |
1693 |
1694 |
1695 | AI Model:
1696 |
1697 | gemini-2.0-flash
1698 | gemini-2.0-flash-thinking-exp-01-21
1699 |
1700 |
1701 | Save Settings
1702 |
1703 |
1704 |
1705 |
1706 |
1707 |
1708 |
1709 | Generating Report... 🚀🧠💡
1710 |
1711 |
1712 |
1713 |
1714 |
1715 |
1716 |
1717 |
1718 |
1719 |
1720 |
2861 |
2862 |
2863 |
--------------------------------------------------------------------------------