├── LICENSE
├── README.md
├── app.py
├── requirements.txt
└── templates
└── index.html
/LICENSE:
--------------------------------------------------------------------------------
1 | MIT License
2 |
3 | Copyright (c) 2025 KARRI VAMSI KRISHNA
4 |
5 | Permission is hereby granted, free of charge, to any person obtaining a copy
6 | of this software and associated documentation files (the "Software"), to deal
7 | in the Software without restriction, including without limitation the rights
8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 |
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 |
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 |
 63 | ## 📋 Table of Contents
64 |
65 |
66 |
67 |
68 | | | Section | Description |
69 | |----|----------------------------------------|--------------------------------------|
70 | | 📺 | [Demo Video](#-demo-video) | See KV in action |
71 | | 🌟 | [Key Features](#-key-features) | What makes KV special |
72 | | 🖼️ | [Screenshots](#-screenshots) | Visual previews |
73 | | 🚀 | [Why Choose KV](#-why-choose-kv) | Benefits & advantages |
74 | | ⚙️ | [Installation](#%EF%B8%8F-installation) | Get up and running |
75 | | 🎮 | [Usage Guide](#-usage-guide) | How to use KV effectively |
76 | | 🤝 | [Contribution](#-contribution) | Join our community |
77 | | 📄 | [License](#-license) | MIT License |
78 | | 📞 | [Contact & Support](#-contact--support) | Get in touch |
79 |
80 |
81 |
82 |
83 |
84 |
85 |
86 |
87 |
88 |
 89 | KV is an open-source platform that combines Google Gemini AI with advanced web scraping for deep research and job searching. It goes beyond conventional tools to provide deep insights, analyze resumes, identify skill gaps, and help you find your ideal career path.
90 |
91 |
92 |
93 |
94 |
95 |
96 |
97 | ## 📺 Demo Video
98 |
99 |
108 |
109 |
110 |
111 |
112 |
113 |
114 | ## 🌟 Key Features
115 |
116 |
117 |
118 | | 🔥 Feature | 💫 Description |
119 | |:--:|:--|
120 | | **Advanced AI Integration** | Leverages Google Gemini's powerful AI capabilities for deep analysis |
121 | | **Intelligent Web Scraping** | Gathers comprehensive data from across the internet |
122 | | **Multi-Search Engine Support** | Access Google, Bing, DuckDuckGo, and LinkedIn simultaneously |
123 | | **Iterative Research** | Self-refining search strategies for more precise results |
124 | | **Resume Analysis** | AI-powered evaluation of your resume with improvement suggestions |
125 | | **Job Match Algorithm** | Finds ideal job opportunities based on your profile |
126 | | **Skill Gap Detection** | Identifies missing skills needed for your target positions |
127 | | **Professional Reports** | Generates beautiful PDF reports with data visualizations |
128 | | **UI Customization** | Dark & light mode with responsive design for all devices |
129 |
130 |
131 |
132 |
133 |
134 |
135 |
136 |
137 | ## 🖼️ Screenshots
138 |
139 |
140 |
141 |
142 | **📱 Modern Chat Interface**
143 |
144 |
145 |
146 |
147 |
148 |
149 | **🔍 Deep Research in Action**
150 |
151 |
152 |
153 |
154 |
155 |
156 | **⚙️ Customization Options**
157 |
158 |
159 |
160 |
161 |
162 |
163 |
164 |
165 |
166 | ## 🚀 Why Choose KV
167 |
168 |
197 |
198 |
199 |
200 |
201 |
202 |
203 | ## ⚙️ Installation
204 |
205 |
206 |
207 | ### ⚡ Quick Setup
208 |
209 |
210 | ```bash
211 | # Clone repository
212 | git clone https://github.com/kvcops/Deep-Research-using-Gemini-api.git
213 |
214 | # Change directory
215 | cd Deep-Research-using-Gemini-api
216 |
217 | # Create virtual environment
218 | python -m venv venv
219 |
220 | # Activate virtual environment
221 | # On Windows:
222 | venv\Scripts\activate
223 | # On macOS/Linux:
224 | source venv/bin/activate
225 |
226 | # Install dependencies
227 | pip install -r requirements.txt
228 |
229 | # Create .env file with your API key
230 | echo "GEMINI_API_KEY=YOUR_ACTUAL_GEMINI_API_KEY" > .env
231 |
232 | # Launch KV
233 | uvicorn app:app --host 127.0.0.1 --port 8000 --reload
234 | ```
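
Once the server is up, you can sanity-check the API outside the browser. The snippet below is a minimal sketch (not part of the project) that posts a message to the `/api/chat` endpoint defined in `app.py`; it assumes the default host and port from the launch command above.

```python
# Minimal sketch: exercise the /api/chat endpoint of a locally running KV instance.
# Assumes the server was started with the uvicorn command above (127.0.0.1:8000).
import requests

resp = requests.post(
    "http://127.0.0.1:8000/api/chat",
    json={
        "message": "Give me a two-sentence overview of iterative deep research.",
        "model_name": "gemini-2.0-flash",  # same default model app.py uses
    },
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["response"])
```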
235 |
236 |
242 |
243 |
244 |
245 |
246 |
247 |
248 | ## 🎮 Usage Guide
249 |
250 |
251 | ### 🚀 Get Started in 3 Easy Steps
252 |
253 |
254 |
255 |
256 |
257 |
258 |
259 |
260 | 1. **Start a new chat session or research query**
261 |    Navigate to the homepage and select your desired research mode
262 |
263 |
264 |
265 |
266 |
267 |
268 |
269 | 2. **Upload your resume for a job search or enter a research topic**
270 |    Provide relevant documents and specify your preferences
271 |
272 |
273 |
274 |
275 |
276 |
277 |
278 | 3. **Review results, download reports, and implement recommendations**
279 |    KV provides actionable insights and suggestions for improvement
280 |
281 |
282 |
283 |
284 |
285 |
286 |
287 |
288 |
289 | ## 🤝 Contribution
290 |
291 |
292 |
293 |
294 |
295 |
296 | We welcome all contributions! Here's how to get involved:
297 |
298 |
299 |
323 |
324 |
332 |
333 |
334 |
335 |
336 |
337 |
338 | ## 📄 License
339 |
340 |
341 |
342 | This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
343 |
344 |
345 |
346 |
347 |
348 | ## 📞 Contact & Support
349 |
350 |
358 |
359 |
360 |
361 |
362 |
363 |
364 | ✨ Thank you for using KV! ✨
365 | If you found it helpful, please consider giving it a star! ⭐
366 |
--------------------------------------------------------------------------------
/app.py:
--------------------------------------------------------------------------------
1 | # app.py (Complete, Modified)
2 |
3 | from fastapi import FastAPI, Request, HTTPException, UploadFile, File, Form
4 | from fastapi.responses import HTMLResponse, JSONResponse, StreamingResponse
5 | from fastapi.templating import Jinja2Templates
6 | from fastapi.middleware.cors import CORSMiddleware
7 | from typing import List, Dict, Tuple, Optional, Union, Any
8 | import google.generativeai as genai
9 | import os
10 | from dotenv import load_dotenv
11 | import base64
12 | from PIL import Image
13 | import io
14 | import requests
15 | from bs4 import BeautifulSoup, SoupStrainer
16 | import re
17 | import random
18 | import logging
19 | from tenacity import retry, stop_after_attempt, wait_exponential, retry_if_exception_type, before_sleep_log
20 | import time
21 | from urllib.parse import urlparse, urljoin, quote_plus, unquote
22 | import json
23 | from io import BytesIO
24 | from reportlab.lib.pagesizes import letter, A4
25 | from reportlab.platypus import SimpleDocTemplate, Paragraph, Spacer, Image as RLImage, PageBreak, Table, TableStyle, KeepTogether
26 | from reportlab.lib.styles import getSampleStyleSheet, ParagraphStyle
27 | from reportlab.lib.units import inch
28 | from reportlab.lib.enums import TA_LEFT, TA_CENTER, TA_RIGHT, TA_JUSTIFY
29 | from reportlab.lib.colors import HexColor, black, lightgrey
30 | import csv
31 | from datetime import date, datetime
32 | import math
33 | import concurrent.futures
34 | import brotli
35 | from pdfminer.high_level import extract_text as pdf_extract_text
36 | import docx2txt
37 | import chardet
38 | import asyncio
39 |
40 |
41 | load_dotenv()
42 |
43 | app = FastAPI()
44 |
45 | # --- CORS Configuration ---
46 | app.add_middleware(
47 | CORSMiddleware,
48 | allow_origins=["*"], # Allows all origins
49 | allow_credentials=True,
50 | allow_methods=["*"], # Allows all methods
51 | allow_headers=["*"], # Allows all headers
52 | )
53 |
54 |
55 | templates = Jinja2Templates(directory="templates")
56 |
57 | class Config:
58 | API_KEY = os.getenv("GEMINI_API_KEY")
59 | LOG_LEVEL = os.getenv("LOG_LEVEL", "INFO").upper()
60 | SNIPPET_LENGTH = 5000
61 | DEEP_RESEARCH_SNIPPET_LENGTH = 10000
62 | MAX_TOKENS_PER_CHUNK = 25000
63 | REQUEST_TIMEOUT = 60
64 | USER_AGENTS = [
65 | 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
66 | 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.0 Safari/605.1.15',
67 | 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:89.0) Gecko/20100101 Firefox/89.0',
68 | 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:89.0) Gecko/20100101 Firefox/89.0',
69 | 'Mozilla/5.0 (iPad; CPU OS 14_4 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.0.3 Mobile/15E148 Safari/604.1',
70 | 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.107 Safari/537.36',
71 | 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0.4577.63 Safari/537.36',
72 | 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
73 | 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
74 | 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:90.0) Gecko/20100101 Firefox/90.0'
75 | ]
76 | SEARCH_ENGINES = ["google", "duckduckgo", "bing", "yahoo", "brave", "linkedin"]
77 | JOB_SEARCH_ENGINES = ["linkedin", "indeed", "glassdoor"]
78 | MAX_WORKERS = 10
79 | CACHE_ENABLED = True
80 | CACHE = {}
81 | CACHE_TIMEOUT = 300
82 | INDEED_BASE_DELAY = 2
83 | INDEED_MAX_DELAY = 10
84 | INDEED_RETRIES = 5
85 | JOB_RELEVANCE_MODEL = os.getenv("JOB_RELEVANCE_MODEL", "gemini-2.0-flash")
86 | # REMOVED PDF PAGE LIMIT
87 |
88 | # --- Prompts: More Modular and Specific ---
89 | DEEP_RESEARCH_TABLE_PROMPT = (
90 | "Create a detailed comparison table analyzing: '{query}'.\n\n"
91 | "**Strict Table Formatting:**\n"
92 | "* **Markdown table ONLY.**\n"
93 | "* **Structure:** Header row, separator row (---), data rows.\n"
94 | "* **Rows:** Start and end with a pipe (|), spaces around pipes.\n"
95 | "* **Separator:** Three dashes (---) per column, alignment colons (:---:).\n"
96 | "* **Cells:** Concise (max 2-3 lines), consistent capitalization, 'N/A' for empty.\n"
 97 |         "* **NO line breaks within cells.** Use <br> for internal line breaks if absolutely necessary.\n"
98 | "**Content Guidelines:**\n"
99 | "* 3-5 relevant columns.\n"
100 | "* 4-8 data rows.\n"
101 | "* Proper alignment (usually center or left).\n"
102 | "* Verify all pipe and spacing rules.\n"
103 | "* **Output ONLY the table, NO extra text.**"
104 | )
105 | DEEP_RESEARCH_REFINEMENT_PROMPT = (
106 | "Analyze the following research summaries to identify key themes and entities. "
107 | "Suggest 3-5 new, more specific search queries that are *directly related* to the original topic: '{original_query}'. "
108 | "Identify any gaps in the current research and suggest queries to address those gaps. "
109 | "Do not suggest overly broad or generic queries. Focus on refining the search and addressing specific aspects. "
110 | "Prioritize queries that are likely to yield *different* results than the previous searches."
111 | )
112 | DEEP_RESEARCH_SUMMARY_PROMPT = (
113 | "Analyze snippets for: '{query}'. Extract key facts, figures, and insights. "
114 | "Be concise, ignore irrelevant content, and prioritize authoritative sources. "
115 | "Focus on the main topic and avoid discussing the research process itself.\n\nContent Snippets:"
116 | )
117 |
118 | DEEP_RESEARCH_REPORT_PROMPT = (
119 | "DEEP RESEARCH REPORT: Synthesize a comprehensive report from web research on: '{search_query}'.\n\n"
120 | "{report_structure}\n\n"
121 | "Research Summaries (all iterations):\n{summaries}\n\n"
122 | "Generate the report in Markdown."
123 | )
124 |
125 | config = Config()
126 |
127 | # --- Logging Configuration ---
128 | logging.basicConfig(level=config.LOG_LEVEL, format='%(asctime)s - %(levelname)s - %(message)s')
129 |
130 | # --- Gemini Configuration ---
131 | if not config.API_KEY:
132 |     logging.error("GEMINI_API_KEY not set. Exiting.")
133 |     exit(1)
134 | genai.configure(api_key=config.API_KEY)
135 |
136 | conversation_history = []
137 | deep_research_rate_limits = {
138 | "gemini-2.0-flash": {"requests_per_minute": 15, "last_request": 0},
139 | "gemini-2.0-flash-thinking-exp-01-21": {"requests_per_minute": 10, "last_request": 0}
140 | }
141 | DEFAULT_DEEP_RESEARCH_MODEL = "gemini-2.0-flash"
142 |
143 | def rate_limit_model(model_name):
144 | if model_name in deep_research_rate_limits:
145 | rate_limit_data = deep_research_rate_limits[model_name]
146 | now = time.time()
147 | time_since_last_request = now - rate_limit_data["last_request"]
148 | requests_per_minute = rate_limit_data["requests_per_minute"]
149 | wait_time = max(0, 60 / requests_per_minute - time_since_last_request)
150 | if wait_time > 0:
151 | logging.info(f"Rate limiting {model_name}, waiting for {wait_time:.2f} seconds")
152 | time.sleep(wait_time)
153 | rate_limit_data["last_request"] = time.time()
154 |
155 | _user_agents = config.USER_AGENTS
156 |
157 | def get_random_user_agent():
158 | return random.choice(_user_agents)
159 |
160 | def process_base64_image(base64_string):
161 | try:
162 | if 'base64,' in base64_string:
163 | base64_string = base64_string.split('base64,')[1]
164 | image_data = base64.b64decode(base64_string)
165 | image_stream = io.BytesIO(image_data)
166 | image = Image.open(image_stream)
167 | if image.mode != 'RGB':
168 | image = image.convert('RGB')
169 | img_byte_arr = io.BytesIO()
170 | image.save(img_byte_arr, format='JPEG')
171 | return {'mime_type': 'image/jpeg', 'data': img_byte_arr.getvalue()}
172 | except Exception as e:
173 | logging.error(f"Error processing image: {e}")
174 | return None
175 |
176 | def get_shortened_url(url):
177 | try:
178 | parsed_url = urlparse(url)
179 | if not parsed_url.scheme:
180 | url = "http://" + url # Add scheme if missing
181 | tinyurl_api = f"https://tinyurl.com/api-create.php?url={quote_plus(url)}"
182 | response = requests.get(tinyurl_api, timeout=5)
183 | response.raise_for_status() # Raise HTTPError for bad responses (4xx or 5xx)
184 | return response.text
185 | except requests.exceptions.RequestException as e:
186 | logging.error(f"Error shortening URL '{url}': {e}")
187 | return url # Return original URL on error
188 | except Exception as e:
189 | logging.error(f"Unexpected error shortening URL '{url}': {e}")
190 | return url
191 |
192 | def fix_url(url):
193 | try:
194 | parsed = urlparse(url)
195 | if not parsed.scheme:
196 | url = "https://" + url
197 | parsed = urlparse(url) # Re-parse with scheme
198 | if not parsed.netloc:
199 | return None # Invalid URL
200 | return url.split("?")[0] # Removes parameters
201 | except Exception:
202 | return None
203 |
204 | def scrape_search_engine(search_query: str, engine_name: str) -> List[str]:
205 | """Scrapes search results from specified search engine."""
206 | if engine_name == "google":
207 | return scrape_google(search_query)
208 | elif engine_name == "duckduckgo":
209 | return scrape_duckduckgo(search_query)
210 | elif engine_name == "bing":
211 | return scrape_bing(search_query)
212 | elif engine_name == "yahoo":
213 | return scrape_yahoo(search_query)
214 | elif engine_name == "brave":
215 | return scrape_brave(search_query)
216 | elif engine_name == "linkedin":
217 | return scrape_linkedin(search_query)
218 | else:
219 | logging.warning(f"Unknown search engine: {engine_name}")
220 | return []
221 |
222 | def scrape_google(search_query: str) -> List[str]:
223 | """Scrapes Google search results."""
224 | search_results = []
225 | google_url = f"https://www.google.com/search?q={quote_plus(search_query)}&num=20"
226 | try:
227 | headers = {'User-Agent': get_random_user_agent()}
228 | response = requests.get(google_url, headers=headers, timeout=config.REQUEST_TIMEOUT)
229 | response.raise_for_status()
230 | logging.info(f"Google Status Code: {response.status_code} for query: {search_query}")
231 | if response.status_code == 200:
232 | only_results = SoupStrainer('div', class_='tF2Cxc')
233 | google_soup = BeautifulSoup(response.text, 'html.parser', parse_only=only_results)
234 | for result in google_soup.find_all('div', class_='tF2Cxc'):
235 | link = result.find('a', href=True)
236 | if link:
237 | href = link['href']
238 | fixed_url = fix_url(href)
239 | if fixed_url:
240 | search_results.append(fixed_url)
241 | elif response.status_code == 429:
242 | logging.warning("Google rate limit hit (429).")
243 | else:
244 | logging.warning(f"Google search failed with status code: {response.status_code}")
245 | except requests.exceptions.RequestException as e:
246 | logging.error(f"Error scraping Google: {e}")
247 | return list(set(search_results)) # Remove duplicates
248 |
249 | def scrape_duckduckgo(search_query: str) -> List[str]:
250 | """Scrapes DuckDuckGo search results."""
251 | search_results = []
252 | duck_url = f"https://html.duckduckgo.com/html/?q={quote_plus(search_query)}"
253 | try:
254 | response = requests.get(duck_url, headers={'User-Agent': get_random_user_agent()}, timeout=config.REQUEST_TIMEOUT)
255 | response.raise_for_status()
256 | logging.info(f"DuckDuckGo Status Code: {response.status_code}")
257 | if response.status_code == 200:
258 | only_results = SoupStrainer('a', class_='result__a')
259 | duck_soup = BeautifulSoup(response.text, 'html.parser', parse_only=only_results)
260 | for a_tag in duck_soup.find_all('a', class_='result__a', href=True):
261 | href = a_tag['href']
262 | fixed_url = fix_url(urljoin("https://html.duckduckgo.com/", href)) # Make absolute
263 | if fixed_url: search_results.append(fixed_url)
264 | except Exception as e:
265 | logging.error(f"Error scraping DuckDuckGo: {e}")
266 | return list(set(search_results))
267 |
268 | def scrape_bing(search_query: str) -> List[str]:
269 | """Scrapes Bing search results."""
270 | search_results = []
271 | bing_url = f"https://www.bing.com/search?q={quote_plus(search_query)}"
272 | try:
273 | response = requests.get(bing_url, headers={'User-Agent': get_random_user_agent()}, timeout=config.REQUEST_TIMEOUT)
274 | response.raise_for_status()
275 | logging.info(f"Bing Status Code: {response.status_code}")
276 | if response.status_code == 200:
277 | only_results = SoupStrainer('li', class_='b_algo')
278 | bing_soup = BeautifulSoup(response.text, 'html.parser', parse_only=only_results)
279 | for li in bing_soup.find_all('li', class_='b_algo'):
280 | for a_tag in li.find_all('a', href=True):
281 | href = a_tag['href']
282 | fixed_url = fix_url(href) # Fix the URL
283 | if fixed_url: search_results.append(fixed_url)
284 | except Exception as e:
285 | logging.error(f"Error scraping Bing: {e}")
286 | return list(set(search_results))
287 |
288 | def scrape_yahoo(search_query: str) -> List[str]:
289 | """Scrapes Yahoo search results."""
290 | search_results = []
291 | yahoo_url = f"https://search.yahoo.com/search?p={quote_plus(search_query)}"
292 | try:
293 | response = requests.get(yahoo_url, headers={'User-Agent': get_random_user_agent()}, timeout=config.REQUEST_TIMEOUT)
294 | response.raise_for_status()
295 | logging.info(f"Yahoo Status Code: {response.status_code}")
296 | if response.status_code == 200:
297 | only_dd_divs = SoupStrainer('div', class_=lambda x: x and x.startswith('dd'))
298 | yahoo_soup = BeautifulSoup(response.text, 'html.parser', parse_only=only_dd_divs)
299 | for div in yahoo_soup.find_all('div', class_=lambda x: x and x.startswith('dd')):
300 | for a_tag in div.find_all('a', href=True):
301 | href = a_tag['href']
302 | match = re.search(r'/RU=(.*?)/RK=', href) # Yahoo uses a redirect
303 | if match:
304 | try:
305 | decoded_url = unquote(match.group(1))
306 | fixed_url = fix_url(decoded_url) # Fix the URL
307 | if fixed_url: search_results.append(fixed_url)
308 | except:
309 | logging.warning(f"Error decoding Yahoo URL: {href}")
310 | elif href: # Sometimes the direct URL is present
311 | fixed_url = fix_url(href)
312 | if fixed_url: search_results.append(fixed_url)
313 | except Exception as e:
314 | logging.error(f"Error scraping Yahoo: {e}")
315 | return list(set(search_results))
316 |
317 | def scrape_brave(search_query: str) -> List[str]:
318 | """Scrapes Brave search results."""
319 | search_results = []
320 | brave_url = f"https://search.brave.com/search?q={quote_plus(search_query)}"
321 | try:
322 | response = requests.get(brave_url, headers={'User-Agent': get_random_user_agent()}, timeout=config.REQUEST_TIMEOUT)
323 | response.raise_for_status()
324 | logging.info(f"Brave Status Code: {response.status_code}")
325 |
326 | if response.status_code == 200:
327 | if response.headers.get('Content-Encoding') == 'br':
328 | try:
329 | content = brotli.decompress(response.content) # Decompress Brotli
330 | only_links = SoupStrainer('a', class_='result-title')
331 | brave_soup = BeautifulSoup(content, 'html.parser', parse_only=only_links)
332 |
333 | except brotli.error as e:
334 | logging.error(f"Error decoding Brotli content: {e}")
335 | return []
336 | else:
337 | only_links = SoupStrainer('a', class_='result-title')
338 | brave_soup = BeautifulSoup(response.text, 'html.parser', parse_only=only_links)
339 |
340 | for a_tag in brave_soup.find_all('a', class_='result-title', href=True):
341 | href = a_tag['href']
342 | fixed_url = fix_url(href) # Fix URL
343 | if fixed_url:
344 | search_results.append(fixed_url)
345 |
346 | elif response.status_code == 429:
347 | logging.warning("Brave rate limit hit (429).")
348 | else:
349 | logging.warning(f"Brave search failed with status code: {response.status_code}")
350 |
351 | except Exception as e:
352 | logging.error(f"Error scraping Brave: {e}")
353 | return list(set(search_results))
354 |
355 | def scrape_linkedin(search_query: str) -> List[str]:
356 | """Scrapes LinkedIn search results (people primarily)."""
357 | search_results = []
358 | linkedin_url = f"https://www.linkedin.com/search/results/all/?keywords={quote_plus(search_query)}"
359 | try:
360 | response = requests.get(linkedin_url, headers={'User-Agent': get_random_user_agent()}, timeout=config.REQUEST_TIMEOUT)
361 | response.raise_for_status()
362 | logging.info(f"LinkedIn Status Code: {response.status_code}")
363 | if response.status_code == 200:
364 | only_results = SoupStrainer('div', class_='entity-result__item')
365 | linkedin_soup = BeautifulSoup(response.text, 'html.parser', parse_only=only_results)
366 | for result in linkedin_soup.find_all('div', class_='entity-result__item'):
367 | try:
368 | link_tag = result.find('a', class_='app-aware-link')
369 | if not link_tag or not link_tag.get('href'):
370 | continue
371 | profile_url = fix_url(link_tag.get('href')) # Fix the URL
372 | if not profile_url or '/in/' not in profile_url: # Check for profile
373 | continue
374 | # check for company context
375 | if " at " in search_query.lower():
376 | context = search_query.lower().split(" at ")[1] # company name
377 | name = result.find('span', class_='entity-result__title-text')
378 | title_company = result.find('div', class_='entity-result__primary-subtitle')
379 | combined_text = ""
380 | if name:
381 | combined_text += name.get_text(strip=True).lower() + " "
382 | if title_company:
383 | combined_text += title_company.get_text(strip=True).lower()
384 | if context not in combined_text: # Check if the company matches
385 | continue
386 |
387 | search_results.append(profile_url)
388 | except Exception as e:
389 | logging.warning(f"Error processing LinkedIn result: {e}")
390 | continue
391 | except Exception as e:
392 | logging.error(f"Error scraping LinkedIn: {e}")
393 | return search_results # No need to remove duplicates for LinkedIn
394 |
395 |
396 | def _decode_content(response: requests.Response) -> str:
397 | """Decodes response content, handling different encodings."""
398 | detected_encoding = chardet.detect(response.content)['encoding']
399 | if detected_encoding is None:
400 | logging.warning(f"Chardet failed. Using UTF-8.")
401 | detected_encoding = 'utf-8'
402 | logging.debug(f"Detected encoding: {detected_encoding}")
403 | try:
404 | return response.content.decode(detected_encoding, errors='replace')
405 | except UnicodeDecodeError:
406 | logging.warning(f"Decoding failed with {detected_encoding}. Trying UTF-8.")
407 | try: return response.content.decode('utf-8', errors='replace')
408 | except:
409 | logging.warning("Decoding failed. Using latin-1 (may cause data loss).")
410 | return response.content.decode('latin-1', errors='replace')
411 |
412 | def fetch_page_content(url: str, snippet_length: Optional[int] = None,
413 | extract_links: bool = False, extract_emails: bool = False) -> Tuple[List[str], List[str], Dict[str, Any]]:
414 | """Fetches content, handles caching, extracts data."""
415 | if snippet_length is None:
416 | snippet_length = config.SNIPPET_LENGTH
417 | content_snippets = []
418 | references = []
419 | extracted_data = {}
420 |
421 | if config.CACHE_ENABLED:
422 | if url in config.CACHE:
423 | if time.time() - config.CACHE[url]['timestamp'] < config.CACHE_TIMEOUT:
424 | logging.info(f"Using cached content for: {url}")
425 | return config.CACHE[url]['content_snippets'], config.CACHE[url]['references'], config.CACHE[url]['extracted_data']
426 | else:
427 | logging.info(f"Cache expired for: {url}")
428 | del config.CACHE[url] # Remove expired entry
429 |
430 | try:
431 | response = requests.get(url, headers={'User-Agent': get_random_user_agent()}, timeout=config.REQUEST_TIMEOUT)
432 | response.raise_for_status() # Raise HTTPError for bad responses
433 | logging.debug(f"Fetching page content status: {response.status_code} for: {url}")
434 | if response.status_code == 200:
435 | page_text = _decode_content(response)
436 | if page_text:
437 | page_soup = BeautifulSoup(page_text, 'html.parser')
438 | for script in page_soup(["script", "style"]):
439 | script.decompose() # Remove script and style tags
440 |
441 | text = page_soup.get_text(separator=' ', strip=True)
442 |                 text = re.sub(r'[\ud800-\udbff](?![\udc00-\udfff])|(?<![\ud800-\udbff])[\udc00-\udfff]', '', text)  # Strip unpaired surrogate characters
473 | def generate_alternative_queries(original_query: str) -> List[str]:
474 | """Generates alternative search queries using Gemini."""
475 | prompt = f"Suggest 3 refined search queries for '{original_query}', optimizing for broad and effective web results."
476 | parts = [{"role": "user", "parts": [{"text": prompt}]}]
477 | safety_settings = { # Gemini Pro safety settings
478 | "HARM_CATEGORY_HARASSMENT": "BLOCK_NONE",
479 | "HARM_CATEGORY_HATE_SPEECH": "BLOCK_NONE",
480 | "HARM_CATEGORY_SEXUALLY_EXPLICIT": "BLOCK_NONE",
481 | "HARM_CATEGORY_DANGEROUS_CONTENT": "BLOCK_NONE",
482 | }
483 |     model = genai.GenerativeModel(model_name="gemini-2.0-flash")  # Using gemini-2.0-flash for generating alternative queries
484 | try:
485 | response = model.generate_content(parts, safety_settings=safety_settings)
486 | return [q.strip() for q in response.text.split('\n') if q.strip()] # returns refined prompts
487 | except Exception as e:
488 | logging.error(f"Error generating alternative queries: {e}")
489 | return []
490 |
491 | @retry(wait=wait_exponential(multiplier=1, min=4, max=10), stop=stop_after_attempt(3), retry=retry_if_exception_type(Exception))
492 | def generate_gemini_response(prompt: str, model_name: str = "gemini-2.0-flash", response_format: str = "markdown") -> Union[str, Dict, List]:
493 | """Generates a response from Gemini, handling retries/formats."""
494 |
495 | if model_name not in deep_research_rate_limits: # No rate limit for job relevance model
496 | logging.info(f"Using model: {model_name}")
497 | else:
498 | rate_limit_model(model_name) # Rate limit the deep research model
499 |
500 | parts = [{"role": "user", "parts": [{"text": prompt}]}]
501 | safety_settings = {
502 | "HARM_CATEGORY_HARASSMENT": "BLOCK_NONE",
503 | "HARM_CATEGORY_HATE_SPEECH": "BLOCK_NONE",
504 | "HARM_CATEGORY_SEXUALLY_EXPLICIT": "BLOCK_NONE",
505 | "HARM_CATEGORY_DANGEROUS_CONTENT": "BLOCK_NONE",
506 | }
507 |
508 | model = genai.GenerativeModel(model_name=model_name)
509 |
510 | try:
511 | response = model.generate_content(parts, safety_settings=safety_settings)
512 | text_response = response.text
513 |
514 | if response_format == "json":
515 | try:
516 | # More robust JSON parsing: Handle leading/trailing text, comments.
517 | response_text_cleaned = re.sub(r"```json\n?|```|[\s]*//.*|[\s]*/\*[\s\S]*?\*/[\s]*", "", text_response).strip()
518 | return json.loads(response_text_cleaned)
519 | except json.JSONDecodeError as e:
520 | logging.warning(f"Invalid JSON, returning raw text. Error: {e}, Response: {text_response}")
521 | return {"error": "Invalid JSON", "raw_text": text_response}
522 |
523 | elif response_format == "csv":
524 | try:
525 | csv_data = io.StringIO(text_response)
526 | return list(csv.reader(csv_data, delimiter=',', quotechar='"'))
527 | except Exception as e:
528 | logging.warning(f"Invalid CSV, returning raw text. Error: {e}")
529 | return {"error": "Invalid CSV", "raw_text": text_response}
530 | else: # Default to Markdown
531 | text_response = re.sub(r'\n+', '\n\n', text_response) # Consistent newlines
532 | text_response = re.sub(r' +', ' ', text_response) # Single spaces
533 | return text_response.replace("```markdown", "").replace("```", "").strip() # Remove Markdown fences
534 | except Exception as e:
535 | logging.error(f"Gemini error: {e}")
536 | raise
537 |
538 | # --- FastAPI Endpoints ---
539 |
540 | @app.get("/", response_class=HTMLResponse)
541 | async def read_root(request: Request):
542 | return templates.TemplateResponse("index.html", {"request": request})
543 |
544 | @app.post("/api/chat")
545 | async def chat_endpoint(request: Request):
546 | global conversation_history
547 | try:
548 | data = await request.json()
549 | user_message = data.get('message', '')
550 | image_data = data.get('image')
551 | custom_instruction = data.get('custom_instruction')
552 | model_name = data.get('model_name', 'gemini-2.0-flash')
553 |
554 | if custom_instruction and len(conversation_history) == 0:
555 | model = genai.GenerativeModel(model_name=model_name)
556 | chat = model.start_chat(history=[
557 | {"role": "user", "parts": [{"text": custom_instruction}]},
558 | {"role": "model", "parts": ["Understood."]}
559 | ])
560 | conversation_history = chat.history
561 | else:
562 | model = genai.GenerativeModel(model_name=model_name)
563 | chat = model.start_chat(history=conversation_history)
564 |
565 | if image_data:
566 | image_part = process_base64_image(image_data)
567 | if image_part:
568 | # Correctly pass the image to the model
569 | response = chat.send_message([user_message, {"mime_type": image_part["mime_type"], "data": image_part["data"]}], stream=False)
570 |
571 | else:
572 | raise HTTPException(status_code=400, detail="Failed to process image")
573 | else:
574 | response = chat.send_message(user_message, stream=False)
575 |
576 | response_text = response.text
577 | response_text = re.sub(r'\n+', '\n\n', response_text)
578 | response_text = re.sub(r' +', ' ', response_text)
579 | response_text = re.sub(r'^- ', '* ', response_text, flags=re.MULTILINE) # Correct bullet points
580 | response_text = response_text.replace("```markdown", "").replace("```", "").strip() # Remove markdown
581 |
582 | conversation_history = chat.history
583 |
584 | def content_to_dict(content):
585 | return {
586 | "role": content.role,
587 | "parts": [part.text if hasattr(part, 'text') else str(part) for part in content.parts]
588 | }
589 | serialized_history = [content_to_dict(content) for content in conversation_history]
590 |
591 | return JSONResponse({"response": response_text, "history": serialized_history})
592 |
593 | except Exception as e:
594 | logging.error(f"Chat error: {e}")
595 | raise HTTPException(status_code=500, detail=str(e))
596 |
597 | @app.post("/api/clear")
598 | async def clear_history_endpoint():
599 | global conversation_history
600 | conversation_history = []
601 | return JSONResponse({"message": "Cleared history."})
602 |
603 | def process_in_chunks(search_results: List[str], search_query: str, prompt_prefix: str = "",
604 | fetch_options: Optional[Dict] = None) -> Tuple[List[str], List[str], List[Dict]]:
605 | """Processes search results in chunks, fetching/summarizing."""
606 | chunk_summaries = []
607 | references = []
608 | processed_tokens = 0
609 | current_chunk_content = []
610 | extracted_data_all = []
611 |
612 | if fetch_options is None:
613 | fetch_options = {}
614 |
615 | with concurrent.futures.ThreadPoolExecutor(max_workers=config.MAX_WORKERS) as executor:
616 | futures = {executor.submit(fetch_page_content, url, config.DEEP_RESEARCH_SNIPPET_LENGTH, **fetch_options): url for url in search_results}
617 | for future in concurrent.futures.as_completed(futures):
618 | url = futures[future]
619 | try:
620 | page_snippets, page_refs, extracted_data = future.result()
621 | references.extend(page_refs)
622 | extracted_data_all.append({'url': url, 'data': extracted_data})
623 |
624 | for snippet in page_snippets:
625 | estimated_tokens = len(snippet) // 4 # Estimate tokens
626 | if processed_tokens + estimated_tokens > config.MAX_TOKENS_PER_CHUNK:
627 | # Combine, summarize, and reset
628 | combined_content = "\n\n".join(current_chunk_content)
629 | if combined_content.strip():
630 | summary_prompt = config.DEEP_RESEARCH_SUMMARY_PROMPT.format(query=search_query) + f"\n\n{combined_content}"
631 | summary = generate_gemini_response(summary_prompt, model_name=DEFAULT_DEEP_RESEARCH_MODEL)
632 | chunk_summaries.append(summary)
633 | current_chunk_content = []
634 | processed_tokens = 0
635 |
636 | current_chunk_content.append(snippet)
637 | processed_tokens += estimated_tokens
638 |
639 | except Exception as e:
640 | logging.error(f"Error processing {url}: {e}")
641 | continue
642 |
643 | # Process any remaining content
644 | if current_chunk_content:
645 | combined_content = "\n\n".join(current_chunk_content)
646 | if combined_content.strip():
647 | summary_prompt = config.DEEP_RESEARCH_SUMMARY_PROMPT.format(query=search_query) + f"\n\n{combined_content}"
648 | summary = generate_gemini_response(summary_prompt, model_name=DEFAULT_DEEP_RESEARCH_MODEL)
649 | chunk_summaries.append(summary)
650 |
651 | return chunk_summaries, references, extracted_data_all
652 |
653 | @app.post("/api/online")
654 | async def online_search_endpoint(request: Request):
655 | """Performs an online search and summarizes results."""
656 | try:
657 | data = await request.json()
658 | search_query = data.get('query', '')
659 | if not search_query:
660 | raise HTTPException(status_code=400, detail="No query provided")
661 |
662 | references = []
663 | search_results = []
664 | content_snippets = []
665 | search_engines_requested = data.get('search_engines', config.SEARCH_ENGINES)
666 |
667 | with concurrent.futures.ThreadPoolExecutor(max_workers=config.MAX_WORKERS) as executor:
668 | search_futures = [executor.submit(scrape_search_engine, search_query, engine) for engine in search_engines_requested]
669 | for future in concurrent.futures.as_completed(search_futures):
670 | try:
671 | search_results.extend(future.result())
672 | except Exception as e:
673 | logging.error(f"Search engine scrape error: {e}")
674 |
675 | if not search_results:
676 | logging.warning(f"Initial search failed: {search_query}. Trying alternatives.")
677 | alternative_queries = generate_alternative_queries(search_query)
678 | if alternative_queries:
679 | logging.info(f"Alternative queries: {alternative_queries}")
680 | for alt_query in alternative_queries:
681 | alt_search_futures = [executor.submit(scrape_search_engine, alt_query, engine) for engine in
682 | search_engines_requested]
683 | for future in concurrent.futures.as_completed(alt_search_futures):
684 | try:
685 | result = future.result()
686 | if result:
687 | search_results.extend(result)
688 | logging.info(f"Results found with alternative: {alt_query}")
689 | break # Stop on first result
690 | except Exception as e:
691 | logging.error(f"Alternative query scrape error: {e}")
692 | if search_results:
693 | break # Stop after finding results
694 | else:
695 | logging.warning("Gemini failed to generate alternatives.")
696 |
697 | if not search_results:
698 | raise HTTPException(status_code=404, detail="No results found")
699 |
700 | unique_search_results = list(set(search_results))
701 | logging.debug(f"Unique URLs to fetch: {unique_search_results}")
702 | # Fetch content concurrently
703 | fetch_futures = {executor.submit(fetch_page_content, url): url for url in unique_search_results}
704 | for future in concurrent.futures.as_completed(fetch_futures):
705 | url = fetch_futures[future]
706 | try:
707 | page_snippets, page_refs, _ = future.result()
708 | content_snippets.extend(page_snippets)
709 | references.extend(page_refs)
710 | except Exception as e:
711 | logging.error(f"Error fetching {url}: {e}")
712 |
713 | combined_content = "\n\n".join(content_snippets)
714 | prompt = (f"Analyze web content for: '{search_query}'. Extract key facts, figures, and details. Be concise. "
715 | f"Content:\n\n{combined_content}\n\nProvide a fact-based summary.")
716 | explanation = generate_gemini_response(prompt) #Default model
717 | global conversation_history # Access global variable
718 |
719 | def serialize_content(content):
720 | if isinstance(content, list):
721 | return [serialize_content(item) for item in content]
722 | elif hasattr(content, 'role') and hasattr(content, 'parts'):
723 | return {
724 | "role": content.role,
725 | "parts": [part.text if hasattr(part, 'text') else str(part) for part in content.parts]
726 | }
727 | else:
728 | return content
729 |
730 | conversation_history.append({"role": "user", "parts": [f"Online: {search_query}"]})
731 | conversation_history.append({"role": "model", "parts": [explanation]})
732 | serialized_history = [serialize_content(item) for item in conversation_history]
733 |
734 | with concurrent.futures.ThreadPoolExecutor(max_workers=config.MAX_WORKERS) as executor:
735 | shortened_references = list(executor.map(get_shortened_url, references)) #Shorten URLs concurrently
736 |
737 |
738 | return JSONResponse({"explanation": explanation, "references": shortened_references, "history": serialized_history})
739 |
740 | except HTTPException as e:
741 | raise e # Re-raise HTTP exceptions
742 | except Exception as e:
743 | logging.exception(f"Online search error: {e}") # Log with traceback
744 | raise HTTPException(status_code=500, detail=str(e))
745 |
746 |
747 | def serialize_content(content): #helper function
748 | if isinstance(content, list):
749 | return [serialize_content(item) for item in content]
750 | elif hasattr(content, 'role') and hasattr(content, 'parts'):
751 | return {
752 | "role": content.role,
753 | "parts": [part.text if hasattr(part, 'text') else str(part) for part in content.parts]
754 | }
755 | else:
756 | return content
757 |
758 |
759 | @app.post("/api/deep_research")
760 | async def deep_research_endpoint(request: Request):
761 | try:
762 | data = await request.json()
763 | search_query = data.get('query', '')
764 | if not search_query:
765 | raise HTTPException(status_code=400, detail="No query provided")
766 |
767 | model_name = data.get('model_name', DEFAULT_DEEP_RESEARCH_MODEL)
768 | start_time = time.time()
769 | search_engines_requested = data.get('search_engines', config.SEARCH_ENGINES)
770 | output_format = data.get('output_format', 'markdown') # Default to markdown
771 | extract_links = data.get('extract_links', False)
772 | extract_emails = data.get('extract_emails', False)
773 | download_pdf = data.get('download_pdf', True) # Default to True
774 | max_iterations = int(data.get('max_iterations', 3)) # Default to 3
775 |
776 | all_summaries = []
777 | all_references = []
778 | all_extracted_data = []
779 | current_query = search_query # initial query
780 |
781 |
782 | # --- Enhanced Report Structure Logic ---
783 | report_structure = (
784 | "**Structure your report with clear headings and subheadings.**\n"
785 | "Use bullet points and numbered lists where appropriate.\n"
786 | "Include a concise introduction and conclusion.\n\n"
787 | )
788 |
789 | # Conditionally add table instructions
790 | if "table" in output_format.lower():
791 | report_structure += (
792 | "**Include a comparison table summarizing key findings.** "
793 | "Use the detailed table formatting guidelines provided earlier.\n"
794 | )
795 | table_prompt = config.DEEP_RESEARCH_TABLE_PROMPT.format(query=search_query) #prepare the table prompt
796 | else:
797 | report_structure += "**Do NOT include a table.** Focus on a narrative report.\n"
798 |
799 |
800 |
801 | with concurrent.futures.ThreadPoolExecutor(max_workers=config.MAX_WORKERS) as executor:
802 | for iteration in range(max_iterations):
803 | logging.info(f"Iteration {iteration + 1}: {current_query}")
804 | search_results = []
805 | #current_query = search_query if iteration == 0 else current_query # To keep track of current
806 | search_futures = [executor.submit(scrape_search_engine, current_query, engine) for engine in
807 | search_engines_requested]
808 | for future in concurrent.futures.as_completed(search_futures):
809 | search_results.extend(future.result())
810 |
811 | unique_results = list(set(search_results)) # Remove duplicates
812 | logging.debug(f"Iteration {iteration + 1} - URLs: {unique_results}")
813 |
814 | prompt_prefix = config.DEEP_RESEARCH_SUMMARY_PROMPT.format(query=current_query)
815 | fetch_options = {'extract_links': extract_links, 'extract_emails': extract_emails}
816 |
817 | chunk_summaries, refs, extracted = process_in_chunks(unique_results, current_query, prompt_prefix,
818 | fetch_options)
819 | all_summaries.extend(chunk_summaries)
820 | all_references.extend(refs)
821 | all_extracted_data.extend(extracted)
822 |
823 | if iteration < max_iterations - 1:
824 | # Refine the search query
825 | if all_summaries: # Check if we have any summaries to work with.
826 | refinement_prompt = config.DEEP_RESEARCH_REFINEMENT_PROMPT.format(original_query=search_query) + "\n\nResearch Summaries:\n" + "\n".join(all_summaries)
827 | refined_response = generate_gemini_response(refinement_prompt, model_name=model_name)
828 | new_queries = [q.strip() for q in refined_response.split('\n') if q.strip()]
829 | current_query = " ".join(new_queries[:3]) # Use top queries
830 | else:
831 | logging.info("No summaries for refinement. Skipping to next iteration.")
832 | break # If no summaries, stop refining.
833 |
834 |
835 | # --- Final Report Generation (with structure) ---
836 | if all_summaries:
837 | final_prompt = config.DEEP_RESEARCH_REPORT_PROMPT.format(
838 | search_query=search_query,
839 | report_structure=report_structure,
840 | summaries="\n\n".join(all_summaries)
841 | )
842 |
843 | # If table is requested, prepend the table prompt.
844 | if "table" in output_format.lower():
845 | final_prompt = table_prompt + "\n\n" + final_prompt
846 |
847 |
848 | final_explanation = generate_gemini_response(final_prompt, response_format=output_format,
849 | model_name=model_name)
850 |
851 | # --- Table Parsing (if applicable) ---
852 | if "table" in output_format.lower():
853 | try:
854 | parsed_table = parse_markdown_table(final_explanation)
855 | if parsed_table:
856 | final_explanation = parsed_table # Use parsed table
857 | else:
858 | logging.warning("Table parsing failed. Returning raw response.")
859 | final_explanation = {"error": "Failed to parse table", "raw_text": final_explanation}
860 | except Exception as e:
861 | logging.error(f"Error during table parsing: {e}")
862 | final_explanation = {"error": "Failed to parse table", "raw_text": final_explanation}
863 | else:
864 | final_explanation = "No relevant content found for the given query."
865 |
866 |
867 |
868 | global conversation_history # Access global variable
869 | conversation_history.append({"role": "user", "parts": [f"Deep research query: {search_query}"]})
870 | conversation_history.append({"role": "model", "parts": [final_explanation]})
871 | serialized_history = [serialize_content(item) for item in conversation_history]
872 |
873 | end_time = time.time()
874 | elapsed_time = end_time - start_time
875 |
876 |
877 | response_data = {
878 | "explanation": final_explanation,
879 | "references": all_references,
880 | "history": serialized_history,
881 | "elapsed_time": f"{elapsed_time:.2f} seconds",
882 | "extracted_data": all_extracted_data,
883 | "current_query": current_query, # Include the final query used
884 | "iteration": iteration + 1 # Include the final iteration number
885 |
886 | }
887 | if download_pdf:
888 | pdf_buffer = generate_pdf(
889 | "", # Pass an EMPTY STRING as the title.
890 | final_explanation if isinstance(final_explanation, str)
891 | else "\n".join(str(row) for row in final_explanation),
892 | all_references
893 | )
894 | headers = {
895 | 'Content-Disposition': f'attachment; filename="{quote_plus(search_query)}_report.pdf"'
896 | }
897 | return StreamingResponse(iter([pdf_buffer.getvalue()]), media_type="application/pdf", headers=headers)
898 |
899 |
900 | if output_format == "json":
901 | if isinstance(final_explanation, dict):
902 | # If it's already a dict (like error case), return it directly
903 | response_data = final_explanation
904 | elif isinstance(final_explanation, list):
905 | # table data
906 | response_data = {"table_data": final_explanation}
907 | else:
908 | # text explanation
909 | response_data = {"explanation": final_explanation}
910 | # Add other data to the JSON response
911 | response_data.update({
912 | "references": all_references,
913 | "history": serialized_history,
914 | "elapsed_time": f"{elapsed_time:.2f} seconds",
915 | "extracted_data": all_extracted_data
916 | })
917 | return JSONResponse(response_data)
918 |
919 |
920 | elif output_format == "csv":
921 | if isinstance(final_explanation, list):
922 | output = io.StringIO()
923 | writer = csv.writer(output)
924 | writer.writerows(final_explanation) # Write the list of lists
925 | response_data["explanation"] = output.getvalue()
926 | elif isinstance(final_explanation, dict) and "raw_text" in final_explanation:
927 | # Handle potential error dict
928 | response_data = {"explanation": final_explanation["raw_text"]}
929 | else:
930 | response_data = {"explanation": final_explanation} # for normal text
931 |
932 | response_data.update({
933 | "references": all_references,
934 | "history": serialized_history,
935 | "elapsed_time": f"{elapsed_time:.2f} seconds",
936 | "extracted_data": all_extracted_data
937 | })
938 |
939 | return JSONResponse(response_data)
940 |
941 | # If not JSON or CSV, return as is (Markdown)
942 | return JSONResponse(response_data)
943 |
944 | except HTTPException as e:
945 | raise e # Re-raise HTTP exceptions
946 | except Exception as e:
947 | logging.exception(f"Error in deep research: {e}") # Log full traceback
948 | raise HTTPException(status_code=500, detail=str(e))
949 |
950 | def parse_markdown_table(markdown_table_string):
951 | """Parses a Markdown table string with improved robustness."""
952 | lines = [line.strip() for line in markdown_table_string.split('\n') if line.strip()]
953 | if not lines:
954 | return []
955 |
956 | table_data = []
957 | header_detected = False
958 |
959 | for line in lines:
960 | line = line.strip().strip('|').replace(' | ', '|').replace('| ', '|').replace(' |', '|') # Normalize spacing
961 | cells = [cell.strip() for cell in line.split('|')]
962 |
963 | if all(c in '-:| ' for c in line) and len(cells) > 1 and not header_detected:
964 | # Skip header separator, but only *before* processing the first non-separator row
965 | header_detected = True
966 | continue
967 |
968 | if cells:
969 | table_data.append(cells)
970 |
971 | # Handle missing cells and inconsistent column counts.
972 | if table_data:
973 | max_cols = len(table_data[0])
974 | normalized_data = []
975 | for row in table_data:
976 | normalized_data.append(row + [''] * (max_cols - len(row))) # Pad with empty strings
977 | return normalized_data
978 | else:
979 | return []
980 |
981 | def generate_pdf(report_title, content, references):
982 | buffer = BytesIO()
983 | doc = SimpleDocTemplate(buffer, pagesize=A4,
984 | rightMargin=0.7*inch, leftMargin=0.7*inch,
985 | topMargin=0.7*inch, bottomMargin=0.7*inch)
986 | styles = getSampleStyleSheet()
987 | today = date.today()
988 | formatted_date = today.strftime("%B %d, %Y")
989 |
990 | # --- Custom Styles ---
991 | custom_styles = {
992 | 'Title': ParagraphStyle(
993 | 'CustomTitle',
994 | parent=styles['Heading1'],
995 | fontSize=24,
996 | leading=32,
997 | spaceAfter=20,
998 | alignment=TA_CENTER,
999 | textColor=HexColor("#1a237e"),
1000 | fontName='Helvetica-Bold'
1001 | ),
1002 | 'Heading1': ParagraphStyle(
1003 | 'CustomHeading1',
1004 | parent=styles['Heading1'],
1005 | fontSize=18,
1006 | leading=24,
1007 | spaceBefore=20,
1008 | spaceAfter=12,
1009 | textColor=HexColor("#283593"),
1010 | fontName='Helvetica-Bold',
1011 | keepWithNext=True # Keep with the following paragraph
1012 | ),
1013 | 'Heading2': ParagraphStyle(
1014 | 'CustomHeading2',
1015 | parent=styles['Heading2'],
1016 | fontSize=16,
1017 | leading=22,
1018 | spaceBefore=16,
1019 | spaceAfter=10,
1020 | textColor=HexColor("#3949ab"),
1021 | fontName='Helvetica-Bold',
1022 | keepWithNext=True
1023 | ),
1024 | 'Heading3': ParagraphStyle(
1025 | 'CustomHeading3',
1026 | parent=styles['Heading3'],
1027 | fontSize=14,
1028 | leading=20,
1029 | spaceBefore=14,
1030 | spaceAfter=8,
1031 | textColor=HexColor("#455a64"),
1032 | fontName='Helvetica-Bold',
1033 | keepWithNext=True
1034 | ),
1035 | 'Paragraph': ParagraphStyle(
1036 | 'CustomParagraph',
1037 | parent=styles['Normal'],
1038 | fontSize=11,
1039 | leading=16,
1040 | spaceAfter=10,
1041 | alignment=TA_JUSTIFY,
1042 | textColor=HexColor("#212121"),
1043 | firstLineIndent=0.25*inch
1044 | ),
1045 | 'TableCell': ParagraphStyle(
1046 | 'CustomTableCell',
1047 | parent=styles['Normal'],
1048 | fontSize=10,
1049 | leading=14,
1050 | spaceBefore=4,
1051 | spaceAfter=4,
1052 | textColor=HexColor("#212121")
1053 | ),
1054 | 'Bullet': ParagraphStyle( # Corrected bullet style
1055 | 'CustomBullet',
1056 | parent=styles['Normal'],
1057 | fontSize=11,
1058 | leading=16,
1059 | leftIndent=0.5*inch,
1060 | rightIndent=0,
1061 | spaceBefore=4,
1062 | spaceAfter=4,
1063 | bulletIndent=0.3*inch,
1064 | textColor=HexColor("#212121"),
1065 | bulletFontName='Helvetica', # Ensure consistent font
1066 | bulletFontSize=11
1067 | ),
1068 | 'Reference': ParagraphStyle(
1069 | 'CustomReference',
1070 | parent=styles['Normal'],
1071 | fontSize=10,
1072 | leading=14,
1073 | spaceAfter=4,
1074 | textColor=HexColor("#1565c0"),
1075 | alignment=TA_LEFT,
1076 | leftIndent=0.5*inch
1077 | ),
1078 | 'Footer': ParagraphStyle(
1079 | 'CustomFooter',
1080 | parent=styles['Italic'],
1081 | fontSize=9,
1082 | alignment=TA_CENTER,
1083 | textColor=HexColor("#757575"),
1084 | spaceBefore=24 # Space above the footer
1085 | )
1086 | }
1087 |
1088 | def clean_text(text):
1089 | # Convert common Markdown formatting to ReportLab equivalents
1090 | text = re.sub(r'\*\*(.*?)\*\*', r'\1 ', text) # Bold
1091 | text = re.sub(r'\*(.*?)\*', r'\1 ', text) # Italics
1092 | text = re.sub(r'`(.*?)`', r'\1 ', text) # Inline code
1093 | text = re.sub(r'\[(.*?)\]\((.*?)\)', r'\1 ', text) # Links
1094 | # Escape HTML entities to prevent issues in Paragraph
1095 |         text = text.replace('&', '&amp;').replace('<', '&lt;').replace('>', '&gt;')
1096 | return text.strip()
1097 |
1098 |
1099 | def process_table(table_text):
1100 | rows = [row.strip() for row in table_text.split('\n') if row.strip()]
1101 | if len(rows) < 2:
1102 | return None # Not a valid table
1103 |
1104 | header = [clean_text(cell) for cell in rows[0].strip('|').split('|')]
1105 | data_rows = []
1106 | for row in rows[2:]: # Skip header and separator lines
1107 | cells = [clean_text(cell) for cell in row.strip('|').split('|')]
1108 | data_rows.append(cells)
1109 |
1110 | # Convert to ReportLab Paragraph objects
1111 | table_data = [[Paragraph(cell, custom_styles['TableCell']) for cell in header]]
1112 | for row in data_rows:
1113 | table_data.append([Paragraph(cell, custom_styles['TableCell']) for cell in row])
1114 |
1115 | table_style = TableStyle([
1116 | ('BACKGROUND', (0, 0), (-1, 0), HexColor("#f5f5f5")), # Header background
1117 | ('TEXTCOLOR', (0, 0), (-1, 0), HexColor("#1a237e")), # Header text color
1118 | ('ALIGN', (0, 0), (-1, -1), 'CENTER'), # Center-align all cells
1119 | ('FONTNAME', (0, 0), (-1, 0), 'Helvetica-Bold'), # Header font
1120 | ('FONTSIZE', (0, 0), (-1, 0), 10),
1121 | ('BOTTOMPADDING', (0, 0), (-1, 0), 8), # Header padding
1122 | ('TOPPADDING', (0, 0), (-1, 0), 8),
1123 | ('GRID', (0, 0), (-1, -1), 1, HexColor("#e0e0e0")), # Grid
1124 | ('ROWBACKGROUNDS', (0, 1), (-1, -1), [HexColor("#ffffff"), HexColor("#f8f9fa")]), # Alternating row colors
1125 | ('ALIGN', (0, 0), (-1, -1), 'LEFT'), # Left-align cell content
1126 | ('FONTNAME', (0, 1), (-1, -1), 'Helvetica'), # Body font
1127 | ('FONTSIZE', (0, 1), (-1, -1), 9), # Body font size
1128 | ('TOPPADDING', (0, 1), (-1, -1), 6), # padding
1129 | ('BOTTOMPADDING', (0, 1), (-1, -1), 6),
1130 | ('VALIGN', (0, 0), (-1, -1), 'MIDDLE'), # Vertically center cell content
1131 | ])
1132 |
1133 | col_widths = [doc.width/len(header) for _ in header] # Equal column widths
1134 | table = Table(table_data, colWidths=col_widths)
1135 | table.setStyle(table_style)
1136 | return table
1137 |
1138 | def footer(canvas, doc):
1139 | canvas.saveState()
1140 | footer_text = f"Generated by Kv - AI Companion & Deep Research Tool • {formatted_date}"
1141 | footer = Paragraph(footer_text, custom_styles['Footer'])
1142 | w, h = footer.wrap(doc.width, doc.bottomMargin)
1143 | footer.drawOn(canvas, doc.leftMargin, h) # Draw footer
1144 | canvas.restoreState()
1145 |
1146 | story = []
1147 | story.append(Paragraph(report_title, custom_styles['Title'])) # Title
1148 | story.append(Paragraph(formatted_date, custom_styles['Footer'])) # Date
1149 | story.append(Spacer(1, 0.2*inch)) # Initial space
1150 |
1151 | current_table = []
1152 | in_table = False
1153 | lines = content.split('\n')
1154 | i = 0
1155 | current_paragraph = [] # Accumulate lines for a paragraph
1156 |
1157 | while i < len(lines):
1158 | line = lines[i].strip()
1159 |
1160 | if not line:
1161 | # Empty line: End current paragraph (if any), add it to story.
1162 | if current_paragraph:
1163 | story.append(Paragraph(clean_text(" ".join(current_paragraph)), custom_styles['Paragraph']))
1164 | current_paragraph = []
1165 | # Handle any pending table
1166 | if in_table and current_table:
1167 | table = process_table('\n'.join(current_table))
1168 | if table:
1169 | story.append(table)
1170 | story.append(Spacer(1, 0.1*inch))
1171 | current_table = []
1172 | in_table = False
1173 | story.append(Spacer(1, 0.05*inch)) # Consistent spacing
1174 | i += 1
1175 | continue
1176 |
1177 | if '|' in line and (line.count('|') > 1 or (i + 1 < len(lines) and '|' in lines[i + 1])):
1178 | # Likely a table row. Start/continue accumulating table lines.
1179 | in_table = True
1180 | current_table.append(line)
1181 | # End any current paragraph before starting a table.
1182 | if current_paragraph:
1183 | story.append(Paragraph(clean_text(" ".join(current_paragraph)), custom_styles['Paragraph']))
1184 | current_paragraph = []
1185 | elif in_table:
1186 | # End of table. Process accumulated table lines.
1187 | if current_table:
1188 | table = process_table('\n'.join(current_table))
1189 | if table:
1190 | story.append(table)
1191 | story.append(Spacer(1, 0.1*inch))
1192 | current_table = []
1193 | in_table = False
1194 | continue #Crucial to continue here, and not add to current_paragraph below.
1195 |
1196 | elif line.startswith('# '):
1197 | # End current paragraph (if any) before starting a heading.
1198 | if current_paragraph:
1199 | story.append(Paragraph(clean_text(" ".join(current_paragraph)), custom_styles['Paragraph']))
1200 | current_paragraph = []
1201 | story.append(Paragraph(clean_text(line[2:]), custom_styles['Heading1'])) # H1
1202 | elif line.startswith('## '):
1203 | if current_paragraph:
1204 | story.append(Paragraph(clean_text(" ".join(current_paragraph)), custom_styles['Paragraph']))
1205 | current_paragraph = []
1206 | story.append(Paragraph(clean_text(line[3:]), custom_styles['Heading2'])) # H2
1207 | elif line.startswith('### '):
1208 | if current_paragraph:
1209 | story.append(Paragraph(clean_text(" ".join(current_paragraph)), custom_styles['Paragraph']))
1210 | current_paragraph = []
1211 | story.append(Paragraph(clean_text(line[4:]), custom_styles['Heading3'])) # H3
1212 | elif line.startswith('* ') or line.startswith('- '):
1213 | # End current paragraph before a list item.
1214 | if current_paragraph:
1215 | story.append(Paragraph(clean_text(" ".join(current_paragraph)), custom_styles['Paragraph']))
1216 | current_paragraph = []
1217 | story.append(Paragraph(f"• {clean_text(line[2:])}", custom_styles['Bullet'])) # Bullet points
1218 |
1219 | else:
1220 | # Regular text line: add to the current paragraph.
1221 | current_paragraph.append(line)
1222 |
1223 | i += 1
1224 | # Add any remaining paragraph content (important!).
1225 | if current_paragraph:
1226 | story.append(Paragraph(clean_text(" ".join(current_paragraph)), custom_styles['Paragraph']))
1227 |
1228 | if current_table:
1229 | table = process_table('\n'.join(current_table))
1230 | if table:
1231 | story.append(table)
1232 | story.append(Spacer(1, 0.1*inch))
1233 |
1234 | if references:
1235 | story.append(PageBreak()) # References on a new page
1236 | story.append(Paragraph("References", custom_styles['Heading1']))
1237 | story.append(Spacer(1, 0.1*inch))
1238 | for i, ref in enumerate(references, 1):
1239 | story.append(Paragraph(f"[{i}] {ref}", custom_styles['Reference']))
1240 |
1241 | doc.build(story, onLaterPages=footer, onFirstPage=footer) # Apply to all pages.
1242 | buffer.seek(0)
1243 | return buffer
1244 |
1245 | # --- Product Scraping ---
1246 | def scrape_product_details(url):
1247 | """Scrapes product details from a given URL."""
1248 | try:
1249 | response = requests.get(url, headers={'User-Agent': get_random_user_agent()}, timeout=config.REQUEST_TIMEOUT)
1250 | response.raise_for_status()
1251 |
1252 | soup = BeautifulSoup(response.text, 'html.parser')
1253 | product_data = {}
1254 | # Title
1255 | for tag in ['h1', 'h2', 'span', 'div']:
1256 | for class_name in ['product-title', 'title', 'productName', 'product-name']:
1257 | if title_element := soup.find(tag, class_=class_name):
1258 | product_data['title'] = title_element.get_text(strip=True)
1259 | break
1260 | if 'title' in product_data:
1261 | break
1262 |
1263 | # Price
1264 | for tag in ['span', 'div', 'p']:
1265 | for class_name in ['price', 'product-price', 'sales-price', 'regular-price']:
1266 | if price_element := soup.find(tag, class_=class_name):
1267 | product_data['price'] = price_element.get_text(strip=True)
1268 | break
1269 | if 'price' in product_data:
1270 | break
1271 |
1272 | # Description
1273 | if (description_element := soup.find('div', {'itemprop': 'description'})):
1274 | product_data['description'] = description_element.get_text(strip=True)
1275 | else:
1276 | for class_name in ['description', 'product-description', 'product-details', 'details']:
1277 | if desc_element := soup.find(['div', 'p'], class_=class_name):
1278 | product_data['description'] = desc_element.get_text(separator='\n', strip=True)
1279 | break
1280 |
1281 | # Image URL
1282 | if (image_element := soup.find('img', {'itemprop': 'image'})):
1283 | product_data['image_url'] = urljoin(url, image_element['src'])
1284 | else:
1285 | for tag in ['img', 'div']:
1286 | for class_name in ['product-image', 'image', 'main-image', 'productImage']:
1287 | if (image_element := soup.find(tag, class_=class_name)) and image_element.get('src'):
1288 | product_data['image_url'] = urljoin(url, image_element['src'])
1289 | break
1290 | if 'image_url' in product_data:
1291 | break
1292 |
1293 | # Rating
1294 | if (rating_element := soup.find(['span', 'div'], class_=['rating', 'star-rating', 'product-rating'])):
1295 | product_data['rating'] = rating_element.get_text(strip=True)
1296 |
1297 | return product_data
1298 |
1299 | except requests.exceptions.RequestException as e:
1300 | logging.error(f"Error scraping product details from {url}: {e}")
1301 | return None # Return None on error
1302 | except Exception as e:
1303 | logging.error(f"Unexpected error scraping {url}: {e}")
1304 | return None # Return None on unexpected error
1305 |
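# Illustrative use of scrape_product_details (hypothetical URL; the selectors above are
# best-effort guesses, so any key may be missing for a given page):
#
#   details = scrape_product_details("https://example.com/product/123")
#   if details:
#       print(details.get("title"), details.get("price"))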
1306 | @app.post("/api/scrape_product")
1307 | async def scrape_product_endpoint(request: Request):
1308 | try:
1309 | data = await request.json()
1310 | product_query = data.get('query', '')
1311 | if not product_query:
1312 | raise HTTPException(status_code=400, detail="No product query provided")
1313 |
1314 | search_results = []
1315 | with concurrent.futures.ThreadPoolExecutor(max_workers=config.MAX_WORKERS) as executor:
1316 | futures = [executor.submit(scrape_search_engine, product_query, engine)
1317 | for engine in config.SEARCH_ENGINES]
1318 | for future in concurrent.futures.as_completed(futures):
1319 | try:
1320 | search_results.extend(future.result())
1321 | except Exception as e:
1322 | logging.error(f"Error in search engine scrape: {e}")
1323 |
1324 | unique_urls = list(set(search_results)) # Remove duplicate URLs
1325 |
1326 | all_product_data = []
1327 | with concurrent.futures.ThreadPoolExecutor(max_workers=config.MAX_WORKERS) as executor:
1328 | futures = {executor.submit(scrape_product_details, url): url for url in unique_urls}
1329 | for future in concurrent.futures.as_completed(futures):
1330 | try:
1331 | product_data = future.result()
1332 | if product_data: # Only add if data was successfully scraped
1333 | all_product_data.append(product_data)
1334 | except Exception as e:
1335 | url = futures[future]
1336 | logging.error(f"Error processing {url}: {e}")
1337 |
1338 | if all_product_data:
1339 | # Create a prompt to summarize the product information
1340 | prompt = "Summarize the following product information:\n\n"
1341 | for product in all_product_data:
1342 | prompt += f"- Title: {product.get('title', 'N/A')}\n"
1343 | prompt += f" Price: {product.get('price', 'N/A')}\n"
1344 | prompt += f" Description: {product.get('description', 'N/A')}\n"
1345 | prompt += "\n" # Add a separator between products
1346 |
1347 | prompt += "\nProvide a concise summary, including key features and price range."
1348 |
1349 | summary = generate_gemini_response(prompt) # default model
1350 |
1351 | return JSONResponse({"summary": summary, "products": all_product_data})
1352 | else:
1353 | raise HTTPException(status_code=404, detail="No product information found")
1354 |
1355 | except HTTPException as e:
1356 | raise e # Re-raise HTTP exceptions
1357 | except Exception as e:
1358 | logging.error(f"Error in product scraping endpoint: {e}")
1359 | raise HTTPException(status_code=500, detail=str(e))
1360 |
1361 | # --- Job Scraping ---
1362 | def extract_text_from_resume(resume_data: bytes) -> str:
1363 | """Extracts text from a resume (PDF, DOCX, or plain text)."""
1364 | try:
1365 | if resume_data.startswith(b"%PDF"):
1366 | # PDF file
1367 | resume_text = pdf_extract_text(io.BytesIO(resume_data))
1368 |         elif resume_data.startswith(b"PK\x03\x04"): # ZIP signature (DOCX files are ZIP archives)
1369 | # DOCX file
1370 | resume_text = docx2txt.process(io.BytesIO(resume_data))
1371 | else:
1372 | # Assume plain text
1373 | try:
1374 | resume_text = resume_data.decode('utf-8')
1375 | except UnicodeDecodeError:
1376 | resume_text = resume_data.decode('latin-1', errors='replace') # Fallback encoding
1377 | return resume_text
1378 | except Exception as e:
1379 | logging.error(f"Error extracting resume text: {e}")
1380 | return ""
1381 |
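# Minimal usage sketch for extract_text_from_resume (assumed local file path). The magic-byte
# checks above route "%PDF" to pdfminer, "PK\x03\x04" (ZIP, i.e. DOCX) to docx2txt, and
# everything else through plain-text decoding.
#
#   with open("resume.pdf", "rb") as f:
#       resume_text = extract_text_from_resume(f.read())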
1382 | # --- Helper functions for job scraping ---
1383 |
1384 | def linkedin_params(job_title: str, job_location: str, start: int = 0, experience_level: Optional[str] = None) -> Dict:
1385 | """Generates parameters for a LinkedIn job search URL."""
1386 | params = {
1387 | 'keywords': job_title,
1388 | 'location': job_location,
1389 | 'f_TPR': 'r86400', # Past 24 hours (consider making configurable)
1390 | 'sortBy': 'R', # Sort by relevance
1391 | 'start': start # pagination
1392 | }
1393 | if experience_level:
1394 | # LinkedIn-specific experience level filters (add others as needed)
1395 | if experience_level.lower() == "fresher":
1396 | params['f_E'] = '1' # Internship
1397 | elif experience_level.lower() == "entry-level":
1398 | params['f_E'] = '2' # Entry-Level
1399 | elif experience_level.lower() == "mid-level":
1400 | params['f_E'] = '3' # Associate
1401 | elif experience_level.lower() == "senior":
1402 | params['f_E'] = '4' # Senior
1403 | elif experience_level.lower() == "executive":
1404 |             params['f_E'] = '5' # '5' = Director ('6' would be Executive)
1405 | # params['f_E'] = ['4', '5'] # Combine Senior/Executive for LinkedIn
1406 |
1407 | return params
1408 |
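# Example of what linkedin_params produces (illustrative values only):
#   linkedin_params("Data Scientist", "Remote", start=25, experience_level="entry-level")
#   -> {'keywords': 'Data Scientist', 'location': 'Remote', 'f_TPR': 'r86400',
#       'sortBy': 'R', 'start': 25, 'f_E': '2'}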
1409 | def indeed_params(job_title: str, job_location: str, start: int = 0, experience_level: Optional[str] = None) -> Dict:
1410 | """Generates parameters for an Indeed job search URL."""
1411 | params = {
1412 | 'q': job_title,
1413 | 'l': job_location,
1414 | 'sort': 'relevance', # Sort by relevance
1415 | 'fromage': '1', # Past 24 hours
1416 | 'limit': 50, # Fetch more results per page
1417 | 'start': start # Pagination
1418 | }
1419 | if experience_level:
1420 | params['q'] = f"{experience_level} {params['q']}" # Add to the main query
1421 |
1422 |     return params
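# Example of what indeed_params produces (illustrative values only); the experience level
# is simply folded into the free-text query:
#   indeed_params("Data Analyst", "Austin, TX", experience_level="senior")
#   -> {'q': 'senior Data Analyst', 'l': 'Austin, TX', 'sort': 'relevance',
#       'fromage': '1', 'limit': 50, 'start': 0}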
1423 | def parse_linkedin_job_card(job_card: BeautifulSoup) -> Dict:
1424 | """Parses a single LinkedIn job card and extracts relevant information."""
1425 | try:
1426 | job_url_element = job_card.find('a', class_='base-card__full-link')
1427 | job_url = job_url_element['href'] if job_url_element else None
1428 |
1429 | title_element = job_card.find('h3', class_='base-search-card__title')
1430 | title = title_element.get_text(strip=True) if title_element else "N/A"
1431 |
1432 | company_element = job_card.find('h4', class_='base-search-card__subtitle')
1433 | company = company_element.get_text(strip=True) if company_element else "N/A"
1434 |
1435 | location_element = job_card.find('span', class_='job-search-card__location')
1436 | location = location_element.get_text(strip=True) if location_element else "N/A"
1437 |
1438 | return {
1439 | 'url': job_url,
1440 | 'title': title,
1441 | 'company': company,
1442 | 'location': location,
1443 | 'relevance': 0.0,
1444 | 'missing_skills': [],
1445 | 'justification': "Relevance not assessed.",
1446 | 'experience': 'N/A' # default value
1447 | }
1448 | except Exception as e:
1449 | logging.error(f"Error parsing LinkedIn job card: {e}")
1450 | return { # Return defaults on error
1451 | 'url': None,
1452 | 'title': "N/A",
1453 | 'company': "N/A",
1454 | 'location': "N/A",
1455 | 'relevance': 0.0,
1456 | 'missing_skills': [],
1457 | 'justification': f"Error parsing job card: {type(e).__name__}",
1458 | 'experience': 'N/A'
1459 | }
1460 |
1461 | def parse_indeed_job_card(job_card: BeautifulSoup) -> Dict:
1462 | """Parses a single Indeed job card and extracts relevant information."""
1463 | try:
1464 | title_element = job_card.find(['h2', 'a'], class_=lambda x: x and ('title' in x or 'jobtitle' in x))
1465 | title = title_element.get_text(strip=True) if title_element else "N/A"
1466 |
1467 | company_element = job_card.find(['span', 'a'], class_='companyName')
1468 | company = company_element.get_text(strip=True) if company_element else "N/A"
1469 |
1470 | location_element = job_card.find('div', class_='companyLocation')
1471 | location = location_element.get_text(strip=True) if location_element else "N/A"
1472 |
1473 | job_url = None
1474 | link_element = job_card.find('a', href=True)
1475 | if link_element and 'pagead' not in link_element['href']:
1476 | job_url = urljoin("https://www.indeed.com/jobs", link_element['href'])
1477 | if not job_url:
1478 | data_jk = job_card.get('data-jk')
1479 | if data_jk:
1480 | job_url = f"https://www.indeed.com/viewjob?jk={data_jk}"
1481 |
1482 | return {
1483 | 'url': job_url,
1484 | 'title': title,
1485 | 'company': company,
1486 | 'location': location,
1487 | 'relevance': 0.0,
1488 | 'missing_skills': [],
1489 | 'justification': "Relevance not assessed.",
1490 | 'experience':'N/A' # Default
1491 | }
1492 | except Exception as e:
1493 | logging.error(f"Error parsing Indeed job card: {e}")
1494 | return { # Return defaults on error
1495 | 'url': None,
1496 | 'title': "N/A",
1497 | 'company': "N/A",
1498 | 'location': "N/A",
1499 | 'relevance': 0.0,
1500 | 'missing_skills': [],
1501 | 'justification': f"Error parsing job card: {type(e).__name__}",
1502 | 'experience':'N/A'
1503 | }
1504 |
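# Retry policy for the scraper below: exponentially growing delays derived from
# config.INDEED_BASE_DELAY and capped at config.INDEED_MAX_DELAY, for at most
# config.INDEED_RETRIES attempts, and only on network-level failures
# (requests.exceptions.RequestException); each retry is logged before sleeping.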
1505 | @retry(
1506 | wait=wait_exponential(multiplier=config.INDEED_BASE_DELAY, max=config.INDEED_MAX_DELAY),
1507 | stop=stop_after_attempt(config.INDEED_RETRIES),
1508 | retry=retry_if_exception_type(requests.exceptions.RequestException),
1509 | before_sleep=lambda retry_state: logging.warning(
1510 | f"Indeed request failed (attempt {retry_state.attempt_number}). Retrying in {retry_state.next_action.sleep} seconds..."
1511 | )
1512 | )
1513 | def scrape_job_site(job_title: str, job_location: str, resume_text: Optional[str],
1514 | base_url: str, params_func: callable, parse_func: callable, site_name:str,
1515 | experience_level: Optional[str] = None) -> List[Dict]:
1516 | """
1517 | Generic function to scrape job listings from a given site.
1518 |
1519 | Args:
1520 | job_title: The job title to search for.
1521 | job_location: The location to search for jobs.
1522 | resume_text: Optional resume text for relevance assessment.
1523 | base_url: The base URL of the job site.
1524 | params_func: A function that generates the URL parameters for the site.
1525 | parse_func: A function that parses a job card from the site's HTML.
1526 | site_name: The name of the job site (e.g., "LinkedIn", "Indeed").
1527 | experience_level: Optional experience level string
1528 |
1529 | Returns:
1530 | A list of dictionaries, where each dictionary represents a job listing.
1531 | """
1532 |
1533 | search_results = []
1534 | start = 0 # pagination
1535 | MAX_PAGES = 10 # Limit pages to prevent infinite loops. Adjust as needed.
1536 |
1537 | while True:
1538 | params = params_func(job_title, job_location, start, experience_level) # Pass experience
1539 | try:
1540 | headers = {'User-Agent': get_random_user_agent()} # Rotate User-Agent
1541 | response = requests.get(base_url, params=params, headers=headers, timeout=config.REQUEST_TIMEOUT)
1542 | response.raise_for_status() # Raises HTTPError for bad (4xx, 5xx) responses
1543 |
1544 | if "captcha" in response.text.lower():
1545 | logging.warning(f"{site_name} CAPTCHA detected. Stopping.")
1546 | break # Exit pagination
1547 |
1548 | soup = BeautifulSoup(response.text, 'html.parser')
1549 |
1550 | # Use a general way to find job cards (more robust to site changes)
1551 | job_cards = soup.find_all('div', class_=lambda x: x and x.startswith('job_'))
1552 | if not job_cards:
1553 |                 job_cards = soup.find_all('div', class_='base-card') # Fall back to LinkedIn's job-card selector
1554 |
1555 |
1556 | if not job_cards:
1557 | if start == 0:
1558 | logging.warning(f"No {site_name} jobs found for: {params}")
1559 | else:
1560 | logging.info(f"No more {site_name} jobs found (page {start//50 + 1}).")
1561 | break # No more jobs, stop pagination
1562 |
1563 | for job_card in job_cards:
1564 | try:
1565 | job_data = parse_func(job_card) # Parse Individual job card.
1566 |
1567 | if not job_data['url']: # Skip if no URL
1568 | continue
1569 |
1570 | if resume_text:
1571 | try:
1572 | # Fetch the full job description
1573 | job_response = requests.get(job_data['url'], headers={'User-Agent': get_random_user_agent()},
1574 | timeout=config.REQUEST_TIMEOUT)
1575 | job_response.raise_for_status()
1576 | job_soup = BeautifulSoup(job_response.text, 'html.parser')
1577 | description_element = job_soup.find('div', id='jobDescriptionText') # Indeed
1578 | if not description_element: # for linkedin
1579 | description_element = job_soup.find('div', class_='description__text')
1580 | job_description = description_element.get_text(separator='\n', strip=True) if description_element else ""
1581 |
1582 | # --- Experience Level Extraction (from description) ---
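                            # The pattern below matches phrases such as "3-5 years",
                            # "3 to 5 years", or "5+ years" (illustrative examples).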
1583 | experience_match = re.search(r'(\d+\+?)\s*(?:-|to)?\s*(\d*)\s*years?', job_description, re.IGNORECASE)
1584 | if experience_match:
1585 | if experience_match.group(2): # If range
1586 | job_data['experience'] = f"{experience_match.group(1)}-{experience_match.group(2)} years"
1587 | else: # Just single number
1588 | job_data['experience'] = f"{experience_match.group(1)} years"
1589 | else: # Check for keywords
1590 | exp_keywords = {
1591 | 'fresher': ['fresher', 'graduate', 'entry level', '0 years', 'no experience'],
1592 | 'entry-level': ['0-2 years', '1-3 years', 'entry level', 'junior'],
1593 | 'mid-level' : ['3-5 years','2-5 years','mid level','intermediate'],
1594 | 'senior' : ['5+ years','5-10 years', 'senior','expert', 'lead'],
1595 | 'executive': ['10+ years', 'executive', 'director', 'vp', 'c-level']
1596 | }
1597 | for level, keywords in exp_keywords.items():
1598 | for keyword in keywords:
1599 | if keyword.lower() in job_description.lower():
1600 | job_data['experience'] = level
1601 | break # Stop checking once a level is found
1602 | if job_data['experience'] != 'N/A':
1603 | break # Stop checking other levels
1604 |
1605 |
1606 |
1607 |
1608 | job_description = job_description[:2000] # Truncate!
1609 | resume_text_trunc = resume_text[:2000] # Truncate!
1610 |
1611 | relevance_prompt = (
1612 | f"Assess the relevance of the following job to the resume. "
1613 | f"Provide a JSON object with ONLY the following keys:\n"
1614 | f"'relevance': float (between 0.0 and 1.0, where 1.0 is perfectly relevant),\n"
1615 | f"'missing_skills': list of strings (skills in the job description but not in the resume, or an empty list if none),\n"
1616 | f"'justification': string (REQUIRED. Explain the relevance score, including factors like experience level mismatch, skill gaps, or industry differences.).\n\n"
1617 | f"Job Description:\n{job_description}\n\nResume:\n{resume_text_trunc}"
1618 | )
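                            # The requested JSON is expected to look roughly like (illustrative values):
                            #   {"relevance": 0.72, "missing_skills": ["Kubernetes"],
                            #    "justification": "Strong overlap, but no container experience."}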
1619 | relevance_assessment = generate_gemini_response(relevance_prompt, response_format="json", model_name=config.JOB_RELEVANCE_MODEL)
1620 |                             if isinstance(relevance_assessment, dict) and "error" not in relevance_assessment:
1621 | job_data['relevance'] = relevance_assessment.get('relevance', 0.0) # Provide default
1622 | job_data['missing_skills'] = relevance_assessment.get('missing_skills', [])
1623 | job_data['justification'] = relevance_assessment.get('justification', "Relevance assessed.")
1624 | # Basic validation (optional, but good practice):
1625 | if not isinstance(job_data['relevance'], (int, float)):
1626 | logging.warning(f"Invalid relevance value: {job_data['relevance']}")
1627 | job_data['relevance'] = 0.0
1628 | if not isinstance(job_data['missing_skills'], list):
1629 | logging.warning(f"Invalid missing_skills: {job_data['missing_skills']}")
1630 | job_data['missing_skills'] = []
1631 | if not isinstance(job_data['justification'], str):
1632 | logging.warning(f"Invalid justification: {job_data['justification']}")
1633 | job_data['justification'] = "Error: Could not assess relevance properly."
1634 | elif isinstance(relevance_assessment, dict) and "error" in relevance_assessment:
1635 | # Handle the "Invalid JSON" case specifically
1636 | logging.warning(f"Invalid JSON from Gemini: {relevance_assessment['raw_text']}")
1637 | job_data['relevance'] = 0.0 # Set default values
1638 | job_data['missing_skills'] = []
1639 | job_data['justification'] = "Error: Could not assess relevance due to invalid JSON response."
1640 |
1641 | else: # Unexpected return from Gemini
1642 | logging.warning(f"Unexpected response from relevance assessment: {relevance_assessment}")
1643 | job_data['relevance'] = 0.0
1644 | job_data['missing_skills'] = []
1645 | job_data['justification'] = "Error: Could not assess relevance (unexpected response)."
1646 |
1647 | except requests.exceptions.RequestException as e:
1648 | logging.warning(f"Failed to fetch job description from {job_data['url']}: {e}")
1649 | job_data['justification'] = f"Error: Could not fetch job description ({type(e).__name__})."
1650 | except Exception as e:
1651 | logging.exception(f"Error during relevance assessment for {job_data['url']}: {e}")
1652 | job_data['relevance'] = 0.0
1653 | job_data['missing_skills'] = []
1654 | job_data['justification'] = "Error: Could not assess relevance (unexpected error)."
1655 |
1656 | search_results.append(job_data) # Append even with errors
1657 |
1658 | except Exception as e:
1659 | logging.warning(f"Error processing a {site_name} job card: {e}")
1660 | continue # skip to the next job card
1661 |
1662 | start += 50 # pagination
1663 |             if start // 50 + 1 > MAX_PAGES:
1664 | logging.info(f"Reached max pages ({MAX_PAGES}) for {site_name}.")
1665 | break
1666 |
1667 | except requests.exceptions.HTTPError as e:
1668 | logging.error(f"{site_name} HTTP Error: {e}")
1669 | break # unrecoverable errors
1670 | except requests.exceptions.RequestException as e:
1671 | logging.error(f"{site_name} Request Exception: {e}")
1672 | break # network errors
1673 |
1674 | return search_results
1675 |
1676 |
1677 | @app.post("/api/scrape_jobs")
1678 | async def scrape_jobs_endpoint(job_title: Optional[str] = Form(""), job_location: str = Form(...), resume: UploadFile = File(None),
1679 | job_experience: Optional[str] = Form(None)): # New parameter
1680 | try:
1681 | # Removed the 'not job_title' check to allow it to be optional
1682 |
1683 | resume_text = None
1684 | if resume:
1685 | resume_content = await resume.read()
1686 | resume_text = extract_text_from_resume(resume_content)
1687 | if not resume_text:
1688 | raise HTTPException(status_code=400, detail="Could not extract text from resume.")
1689 |
1690 | all_job_results = []
1691 |
1692 | with concurrent.futures.ThreadPoolExecutor(max_workers=config.MAX_WORKERS) as executor:
1693 | # Submit both scraping tasks *concurrently*
1694 | linkedin_future = executor.submit(scrape_job_site, job_title, job_location, resume_text,
1695 | "https://www.linkedin.com/jobs/search", linkedin_params, parse_linkedin_job_card, "LinkedIn", job_experience) # Pass experience
1696 | indeed_future = executor.submit(scrape_job_site, job_title, job_location, resume_text,
1697 | "https://www.indeed.com/jobs", indeed_params, parse_indeed_job_card, "Indeed", job_experience) # Pass experience
1698 |
1699 | # Get results, handling exceptions gracefully. Don't stop if one fails.
1700 | try:
1701 | all_job_results.extend(linkedin_future.result())
1702 | except Exception as e:
1703 | logging.error(f"Error scraping LinkedIn: {e}") # Log, but don't stop
1704 | try:
1705 | all_job_results.extend(indeed_future.result())
1706 | except Exception as e:
1707 | logging.error(f"Error scraping Indeed: {e}") # Log, but don't stop
1708 |
1709 | # --- Filtering and Sorting ---
1710 | # Filter by experience first
1711 | if job_experience:
1712 | filtered_jobs = [job for job in all_job_results if job.get('experience', '').lower() == job_experience.lower()]
1713 | else:
1714 | filtered_jobs = all_job_results
1715 |
1716 |
1717 | # Then, sort by experience level, then by relevance WITHIN each experience level
1718 | experience_order = ['fresher', 'entry-level', 'mid-level', 'senior', 'executive', 'N/A']
1719 | def sort_key(job):
1720 | # Get experience level (default to 'N/A' if missing, put last)
1721 | exp = job.get('experience', 'N/A').lower()
1722 | if exp not in experience_order:
1723 | exp = 'N/A' # Normalize to 'N/A'
1724 |
1725 | return (experience_order.index(exp), -job.get('relevance', 0.0))
1726 |
1727 | filtered_jobs.sort(key=sort_key)
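        # Illustrative ordering (hypothetical jobs): ('fresher', relevance 0.9) comes before
        # ('fresher', 0.4), which comes before ('senior', 0.99); jobs with unknown/'N/A'
        # experience sort last.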
1728 |
1729 |
1730 |
1731 | if filtered_jobs:
1732 | return JSONResponse({'jobs': filtered_jobs, 'jobs_found': len(all_job_results)})
1733 | else:
1734 |             # No jobs remain after filtering: return 200 with an empty list; 'jobs_found' still reflects the pre-filter count.
1735 |             return JSONResponse({'jobs': [], 'jobs_found': len(all_job_results)}, status_code=200)
1736 |
1737 |
1738 |
1739 | except HTTPException as e:
1740 | raise e # Re-raise HTTP exceptions for FastAPI to handle
1741 | except Exception as e:
1742 | logging.error(f"Error in jobs scraping endpoint: {e}")
1743 | raise HTTPException(status_code=500, detail=str(e))
1744 |
1745 | # Image Analysis Tool
1746 | @app.post("/api/analyze_image")
1747 | async def analyze_image_endpoint(request: Request):
1748 | try:
1749 | data = await request.json()
1750 | image_data = data.get('image')
1751 | if not image_data:
1752 | return JSONResponse({"error": "No image provided"}, status_code=400)
1753 |
1754 | image_part = process_base64_image(image_data)
1755 | if not image_part:
1756 | return JSONResponse({"error": "Failed to process image"}, status_code=400)
1757 |
1758 | model = genai.GenerativeModel('gemini-2.0-flash')
1759 | # image = Image.open(io.BytesIO(image_part['data'])) # No longer needed
1760 |         response = model.generate_content(["Describe this image in detail", image_part]) # Simple description prompt; pass the image part directly
1761 | response.resolve() # Ensure generation is complete
1762 |
1763 | return JSONResponse({"description": response.text})
1764 |
1765 | except Exception as e:
1766 | logging.exception("Error in image analysis")
1767 | return JSONResponse({"error": "Image analysis failed."}, status_code=500)
1768 |
1769 |
1770 | # Sentiment Analysis Tool
1771 | @app.post("/api/analyze_sentiment")
1772 | async def analyze_sentiment_endpoint(request: Request):
1773 | try:
1774 | data = await request.json()
1775 | text = data.get('text')
1776 | if not text:
1777 | return JSONResponse({"error": "No text provided"}, status_code=400)
1778 |
1779 | prompt = f"Analyze the sentiment of the following text and classify it as 'Positive', 'Negative', or 'Neutral'. Provide a brief justification:\n\n{text}"
1780 | sentiment_result = generate_gemini_response(prompt)
1781 |
1782 | return JSONResponse({"sentiment": sentiment_result})
1783 | except Exception as e:
1784 | logging.exception("Error in sentiment analysis")
1785 | return JSONResponse({"error": "Sentiment analysis failed."}, status_code=500)
1786 |
1787 | # Website Summarization Tool
1788 | @app.post("/api/summarize_website")
1789 | async def summarize_website_endpoint(request: Request):
1790 | try:
1791 | data = await request.json()
1792 | url = data.get('url')
1793 | if not url:
1794 | return JSONResponse({"error": "No URL provided"}, status_code=400)
1795 |
1796 | content_snippets, _, _ = fetch_page_content(url, snippet_length=config.SNIPPET_LENGTH)
1797 | if not content_snippets:
1798 | return JSONResponse({"error": "Could not fetch website content"}, status_code=400)
1799 | combined_content = "\n\n".join(content_snippets)
1800 | prompt = f"Summarize the following webpage content concisely:\n\n{combined_content}"
1801 | summary = generate_gemini_response(prompt)
1802 | return JSONResponse({"summary": summary})
1803 | except Exception as e:
1804 | logging.exception("Error in website summarization")
1805 | return JSONResponse({"error": "Website summarization failed."}, status_code=500)
1806 |
--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
1 | fastapi==0.110.0
2 | uvicorn==0.29.0
3 | python-dotenv==1.0.1
4 | Pillow==10.2.0
5 | requests==2.31.0
6 | beautifulsoup4==4.12.3
7 | soupsieve==2.5
8 | google-generativeai==0.4.1
9 | reportlab==4.1.0
10 | python-multipart==0.0.9
11 | tenacity==8.2.3
12 | urllib3==2.2.1
13 | brotli==1.1.0
14 | pdfminer.six==20231228
15 | docx2txt==0.8
16 | chardet==5.2.0
17 | Jinja2==3.1.3
18 | aiofiles==23.2.1
19 |
--------------------------------------------------------------------------------
/templates/index.html:
--------------------------------------------------------------------------------
1 |
2 |
3 |
4 |
5 |
6 |
7 | KV AI Assistant | Next-Gen Research Tool
8 |
9 |
10 |
11 |
12 |
13 |
14 |
15 |
16 |
1356 |
1357 |
1358 |
1359 |
1360 |
1385 |
1386 |
1387 |
1390 |
1391 |
1420 |
1421 |
1422 |
1423 |
1424 |
1425 |
1434 |
1435 |
1436 |
1437 |
1455 |
1456 |
1457 |
1458 |
1521 |
1522 |
1523 |
1682 |
1683 |
1684 |
1685 |
1686 |
1687 |
1688 | ×
1689 | Settings
1690 |
1691 | Custom Instruction:
1692 |
1693 |
1694 |
1695 | AI Model:
1696 |
1697 | gemini-2.0-flash
1698 | gemini-2.0-flash-thinking-exp-01-21
1699 |
1700 |
1701 | Save Settings
1702 |
1703 |
1704 |
1705 |
1706 |
1707 |
1708 |
1709 | Generating Report... 🚀🧠💡
1710 |
1711 |
1712 |
1713 |
1714 |
1715 |
1716 |
1717 |
1718 |
1719 |
1720 |
2861 |
2862 |
2863 |
--------------------------------------------------------------------------------