├── .env.template ├── .gitignore ├── CODE_OF_CONDUCT.md ├── CONTRIBUTING.md ├── LICENSE ├── README.md ├── anthropic_api.py ├── app.py ├── flask_server.py ├── index.html ├── main.css ├── main.js └── requirements.txt /.env.template: -------------------------------------------------------------------------------- 1 | ANTHROPIC_API_KEY=your-api-key-here -------------------------------------------------------------------------------- /.gitignore: -------------------------------------------------------------------------------- 1 | venv/ 2 | __pycache__/ 3 | .vscode/ 4 | .env 5 | links.json -------------------------------------------------------------------------------- /CODE_OF_CONDUCT.md: -------------------------------------------------------------------------------- 1 | # Code Of Conduct 2 | 3 | ## Our Pledge 4 | 5 | We as members, contributors and leaders pledge to make participation in our project and community a harassment-free experience for everyone, regardless of age, body size, visible or invisible disability, ethnicity, sex characteristics, gender identity, gender expression, level of experience, education, socio-economic status, nationality, personal appearance, race, religion or sexual identity and orientation. 6 | 7 | We pledge to act and interact in ways that contribute to an open, welcoming, diverse, inclusive and healthy community. 8 | 9 | ## Our Standards 10 | 11 | Examples of behavior that contributes to a positive environment include: 12 | 13 | - Using welcoming and inclusive language 14 | - Being respectful of differing viewpoints and experiences 15 | - Giving and gracefully accepting constructive feedback 16 | - Showing empathy towards other community members 17 | 18 | Examples of unacceptable behavior include: 19 | 20 | - Trolling, insulting/derogatory comments and personal or political attacks 21 | - Public or private harassment 22 | - Publishing others’ private information, such as a physical or email address, without their explicit permission 23 | - Other conduct which could reasonably be considered inappropriate 24 | 25 | ## Enforcement Responsibilities 26 | 27 | Project maintainers are responsible for clarifying and enforcing standards of acceptable behavior and will take appropriate and fair corrective action in response to any instances of unacceptable behavior. 28 | 29 | ## Scope 30 | 31 | This Code Of Conduct applies within all project spaces and also applies when an individual is representing the project or its community in public spaces. 32 | 33 | ## Enforcement 34 | 35 | Instances of abusive, harassing or otherwise unacceptable behavior may be reported to the project team at contact@bretbernhoft.com. All complaints will be reviewed and investigated promptly and fairly. 36 | 37 | ## Attribution 38 | 39 | This Code of Conduct is adapted from the [Contributor Covenant](https://www.contributor-covenant.org), version 2.1, available [here](https://www.contributor-covenant.org/version/2/1/code_of_conduct/). 40 | -------------------------------------------------------------------------------- /CONTRIBUTING.md: -------------------------------------------------------------------------------- 1 | # Contributing To Mapping A Website's Internal Links 2 | 3 | Thanks for taking the time to contribute. 4 | 5 | ## How To Contribute 6 | 7 | - Report bugs by opening an issue. 8 | - Suggest enhancements by opening an issue and labeling it as an enhancement. 9 | - Fork the repo, create a new branch, and submit a pull request. 
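A typical workflow for that last step might look like the following (a minimal sketch; the fork URL and branch name are placeholders to replace with your own):

```bash
# Clone your fork of the repository (example URL; use your own fork).
git clone git@github.com:your-username/website-internal-links.git
cd website-internal-links

# Create a branch that follows the naming pattern described below.
git checkout -b feature/your-description

# Stage and commit your changes using the Conventional Commits format.
git add .
git commit -m "feat: briefly describe your change"

# Push the branch to your fork, then open a pull request on GitHub.
git push origin feature/your-description
```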
10 | 11 | ## Branching Strategy 12 | 13 | - Work from a feature branch, not from `main`. 14 | - Name your branch using this pattern: `feature/your-description` or `bugfix/short-description`. 15 | 16 | ## Before Submitting A Pull Request 17 | 18 | - Ensure the PR description clearly explains the problem and solution. 19 | - Format your code with Prettier. 20 | - Update documentation if necessary. 21 | 22 | ## Code Standards 23 | 24 | - Use 2 spaces for indentation. 25 | - Follow the existing code style. 26 | - Run Prettier before pushing code. 27 | 28 | ## Running Locally 29 | 30 | 1. Clone the repo 31 | 2. Install dependencies 32 | 3. Run `python3 app.py` 33 | 34 | ## Commit Messages 35 | 36 | Follow [Conventional Commits](https://www.conventionalcommits.org/en/v1.0.0/). 37 | 38 | ## Code Of Conduct 39 | 40 | Please read our [Code Of Conduct](CODE_OF_CONDUCT.md) before contributing. 41 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2025 Bret Bernhoft 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Mapping A Website's Internal Links 2 | 3 | 4 | 5 | Explore a website's internal links, then visualize those connections as a network graph with scorecards and analysis using Claude AI. 6 | 7 | ## Set Up 8 | 9 | ### Programs Needed 10 | 11 | - [Git](https://git-scm.com/downloads) 12 | - [Python](https://www.python.org/downloads/) (When installing on Windows, make sure you check the "[Add python 3.xx to PATH](https://hosting.photobucket.com/images/i/bernhoftbret/python.png)" box.) 13 | 14 | ### Steps 15 | 16 | 1. Install the above programs. 17 | 18 | 2. Open a shell window (for Windows open PowerShell, for macOS open Terminal, and for Linux open your distro's terminal emulator). 19 | 20 | 3. Clone this repository using `git` by running the following command: `git clone git@github.com:devbret/website-internal-links.git`. 21 | 22 | 4. Navigate to the repo's directory by running: `cd website-internal-links`. 23 | 24 | 5. Create a virtual environment with this command: `python3 -m venv venv`. Then activate your virtual environment using: `source venv/bin/activate`. 25 | 26 | 6. 
Install the dependencies needed to run the script: `pip install -r requirements.txt`. 27 | 28 | 7. Set the environment variable for the Anthropic API key by renaming the `.env.template` file to `.env` and placing your API key immediately after the `=` character. 29 | 30 | 8. Edit the `WEBSITE_TO_CRAWL` variable in the app.py file (on line 21); this is the website you would like to visualize. 31 | 32 | - Also edit the `MAX_PAGES_TO_CRAWL` variable in the app.py file (on line 24); this specifies how many pages you would like to crawl. 33 | 34 | 9. Run the script with the command `python3 app.py`. 35 | 36 | 10. To view the website's connections using the index.html file, you will need to run a local web server. To do this, run `python3 -m http.server` in a new terminal. 37 | 38 | 11. Once the network map has been launched, hover over any given node for more information about that particular web page, as well as the option to submit its data for analysis via Claude AI. Clicking a node opens the related URL in a new tab. 39 | 40 | 12. To exit the virtual environment (venv), type `deactivate` in the terminal. 41 | 42 | ## Performance Considerations 43 | 44 | Generating visualizations for this app takes a surprisingly large amount of processing power. It is therefore advisable to start by mapping fewer than one hundred pages per run. 45 | 46 | ## Additional Notes 47 | 48 | The analysis uses textstat for readability scoring and TextBlob for sentiment analysis. 49 | 50 | The crawler checks for SEO and accessibility markers such as: 51 | 52 | - Heading structure 53 | 54 | - Image alt tags 55 | 56 | - Form label usage 57 | 58 | - Semantic HTML elements 59 | 60 | ## Troubleshooting 61 | 62 | If working with GitHub Codespaces, you may need to: 63 | 64 | - Run `python -m nltk.downloader punkt_tab` 65 | 66 | - Then reattempt steps 6-9. 67 | 68 | If all else fails, please contact the maintainer here on GitHub or via [LinkedIn](https://www.linkedin.com/in/bernhoftbret/). 69 | 70 | Cheers! 71 | -------------------------------------------------------------------------------- /anthropic_api.py: -------------------------------------------------------------------------------- 1 | import os 2 | import json 3 | from anthropic import Anthropic 4 | from dotenv import load_dotenv 5 | 6 | load_dotenv() 7 | 8 | api_key = os.getenv("ANTHROPIC_API_KEY") 9 | if not api_key: 10 | raise ValueError("ANTHROPIC_API_KEY not found in environment variables. Please set it in your .env file.") 11 | 12 | try: 13 | anthropic = Anthropic(api_key=api_key) 14 | except Exception as e: 15 | raise ValueError(f"Failed to initialize Anthropic client: {e}") 16 | 17 | 18 | def analyze_with_anthropic(page_data): 19 | system_prompt = """You are an expert analyst. Your task is to review structured JSON data from a webpage. 20 | Summarize the strengths and weaknesses of this page in terms of SEO, accessibility, and semantic HTML structure. 21 | Provide specific, actionable suggestions for improvements. 22 | Structure your response clearly, using Markdown for headings (e.g., ## Strengths, ## Weaknesses, ## Suggestions).""" 23 | 24 | user_message_content = f""" 25 | Here is a structured JSON of a webpage: 26 | 27 | {json.dumps(page_data, indent=2)} 28 | 29 | Please analyze it based on the instructions provided. 
30 | """ 31 | 32 | try: 33 | response = anthropic.messages.create( 34 | model="claude-3-7-sonnet-20250219", 35 | max_tokens=1500, 36 | temperature=0.5, 37 | system=system_prompt, 38 | messages=[ 39 | { 40 | "role": "user", 41 | "content": user_message_content 42 | } 43 | ] 44 | ) 45 | if response.content and len(response.content) > 0: 46 | return response.content[0].text.strip() 47 | else: 48 | return "No content returned from API." 49 | 50 | except Exception as e: 51 | error_message = f"Anthropic API error: {e}" 52 | print(error_message) 53 | if hasattr(e, 'response') and hasattr(e.response, 'json'): 54 | try: 55 | error_details = e.response.json() 56 | error_message += f" | Details: {json.dumps(error_details)}" 57 | except json.JSONDecodeError: 58 | error_message += f" | Details: (Could not decode JSON error response from API)" 59 | 60 | 61 | raise Exception(error_message) -------------------------------------------------------------------------------- /app.py: -------------------------------------------------------------------------------- 1 | import requests 2 | import subprocess 3 | from urllib.parse import urljoin, urlparse 4 | from bs4 import BeautifulSoup 5 | import json 6 | import time 7 | import textstat 8 | from textblob import TextBlob 9 | import nltk 10 | from nltk.corpus import stopwords 11 | from collections import Counter 12 | import re 13 | from dotenv import load_dotenv 14 | 15 | load_dotenv() 16 | 17 | nltk.download('punkt', quiet=True) 18 | nltk.download('stopwords', quiet=True) 19 | 20 | # The website you would like to visualize. 21 | WEBSITE_TO_CRAWL = 'https://example.com/' 22 | 23 | # Specify how many pages you would like to crawl. 24 | MAX_PAGES_TO_CRAWL = 20 25 | 26 | def is_internal(url, base): 27 | return urlparse(url).netloc == urlparse(base).netloc 28 | 29 | def check_heading_structure(soup): 30 | headings = [int(tag.name[1]) for tag in soup.find_all(re.compile('^h[1-6]$'))] 31 | skipped_levels = [] 32 | prev_level = 0 33 | for level in headings: 34 | if prev_level and level > prev_level + 1: 35 | skipped_levels.append((prev_level, level)) 36 | prev_level = level 37 | return skipped_levels 38 | 39 | def check_semantic_elements(soup): 40 | semantic_tags = ['main', 'nav', 'article', 'section', 'header', 'footer', 'aside'] 41 | used_semantics = {tag: bool(soup.find(tag)) for tag in semantic_tags} 42 | return used_semantics 43 | 44 | def check_image_alts(soup): 45 | images = soup.find_all('img') 46 | images_without_alt = [img['src'] for img in images if not img.get('alt') or img.get('alt').strip() == ''] 47 | return images_without_alt 48 | 49 | def check_form_labels(soup): 50 | inputs = soup.find_all(['input', 'textarea', 'select']) 51 | labeled_inputs = set() 52 | for label in soup.find_all('label'): 53 | if label.get('for'): 54 | labeled_inputs.add(label['for']) 55 | inputs_without_labels = [] 56 | for field in inputs: 57 | if field.get('id') and field.get('type') not in ['hidden', 'submit', 'button', 'reset']: 58 | if field['id'] not in labeled_inputs: 59 | parent_label = field.find_parent('label') 60 | if not parent_label: 61 | inputs_without_labels.append(field['id']) 62 | return inputs_without_labels 63 | 64 | 65 | def crawl_site(start_url, max_links=MAX_PAGES_TO_CRAWL): 66 | visited = set() 67 | site_structure = {} 68 | to_visit = [start_url.rstrip('/')] 69 | 70 | while to_visit and len(visited) < max_links: 71 | url = to_visit.pop(0) 72 | if url in visited: 73 | continue 74 | 75 | normalized_url = url.rstrip('/') 76 | if normalized_url in visited: 77 | 
continue 78 | 79 | visited.add(normalized_url) 80 | print(f"Crawling: {normalized_url} ({len(visited)}/{max_links})") 81 | 82 | try: 83 | start_time = time.time() 84 | response = requests.get(normalized_url, timeout=10, headers={'User-Agent': 'MyCrawler/1.0'}) 85 | response_time = time.time() - start_time 86 | status_code = response.status_code 87 | 88 | if response.status_code != 200: 89 | print(f"Skipping {normalized_url} due to status code: {status_code}") 90 | site_structure[normalized_url] = {"status_code": status_code, "error": "Failed to fetch"} 91 | continue 92 | 93 | content_type = response.headers.get('content-type', '').lower() 94 | if 'text/html' not in content_type: 95 | print(f"Skipping {normalized_url} as content type is not HTML: {content_type}") 96 | site_structure[normalized_url] = {"status_code": status_code, "error": "Not HTML content"} 97 | continue 98 | 99 | soup = BeautifulSoup(response.text, 'html.parser') 100 | page_title = soup.title.string.strip() if soup.title else '' 101 | 102 | meta_desc_tag = soup.find('meta', attrs={'name': 'description'}) 103 | meta_description = meta_desc_tag['content'].strip() if meta_desc_tag and 'content' in meta_desc_tag.attrs else '' 104 | 105 | meta_keywords_tag = soup.find('meta', attrs={'name': 'keywords'}) 106 | meta_keywords = meta_keywords_tag['content'].strip() if meta_keywords_tag and 'content' in meta_keywords_tag.attrs else '' 107 | 108 | h1_tags = [h1.get_text(strip=True) for h1 in soup.find_all('h1')] 109 | 110 | text_content_for_analysis = [] 111 | for element in soup.find_all(['p', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6', 'li', 'span', 'article']): 112 | text_content_for_analysis.append(element.get_text(separator=' ', strip=True)) 113 | text = " ".join(text_content_for_analysis) 114 | 115 | word_count = len(text.split()) if text else 0 116 | 117 | readability_score = textstat.flesch_kincaid_grade(text) if text else 0 118 | sentiment = TextBlob(text).sentiment.polarity if text else 0 119 | 120 | keyword_density = {} 121 | if text: 122 | text_clean = re.sub(r'[^\w\s]', '', text.lower()) 123 | tokens = nltk.word_tokenize(text_clean) 124 | stop_words = set(stopwords.words('english')) 125 | filtered_tokens = [word for word in tokens if word not in stop_words and word.isalpha() and len(word) > 1] 126 | if filtered_tokens: 127 | word_freq = Counter(filtered_tokens) 128 | total_filtered_words = sum(word_freq.values()) 129 | most_common = word_freq.most_common(10) 130 | keyword_density = {word: round(count / total_filtered_words, 4) for word, count in most_common} 131 | 132 | 133 | image_count = len(soup.find_all('img')) 134 | script_count = len(soup.find_all('script')) 135 | stylesheet_count = len(soup.find_all('link', rel='stylesheet')) 136 | has_viewport_meta = bool(soup.find('meta', attrs={'name': 'viewport'})) 137 | heading_count = len(soup.find_all(['h2', 'h3', 'h4', 'h5', 'h6'])) 138 | paragraph_count = len(soup.find_all('p')) 139 | 140 | semantic_elements = check_semantic_elements(soup) 141 | heading_issues = check_heading_structure(soup) 142 | unlabeled_inputs = check_form_labels(soup) 143 | images_without_alt = check_image_alts(soup) 144 | 145 | internal_links_found = [] 146 | external_links_found = [] 147 | 148 | for link_tag in soup.find_all('a', href=True): 149 | href = link_tag.get('href') 150 | if not href or href.startswith('#') or href.startswith('mailto:') or href.startswith('tel:'): 151 | continue 152 | 153 | absolute_href = urljoin(normalized_url, href).split('#')[0].rstrip('/') 154 | 155 | if 
is_internal(absolute_href, start_url): 156 | internal_links_found.append(absolute_href) 157 | if absolute_href not in visited and absolute_href not in to_visit and len(visited) + len(to_visit) < max_links : 158 | to_visit.append(absolute_href) 159 | else: 160 | external_links_found.append(absolute_href) 161 | 162 | site_structure[normalized_url] = { 163 | "url": normalized_url, 164 | "title": page_title, 165 | "meta_description": meta_description, 166 | "meta_keywords": meta_keywords, 167 | "h1_tags": h1_tags, 168 | "word_count": word_count, 169 | "readability_score": readability_score, 170 | "sentiment": sentiment, 171 | "keyword_density": keyword_density, 172 | "image_count": image_count, 173 | "script_count": script_count, 174 | "stylesheet_count": stylesheet_count, 175 | "has_viewport_meta": has_viewport_meta, 176 | "heading_count": heading_count, 177 | "paragraph_count": paragraph_count, 178 | "status_code": status_code, 179 | "response_time": round(response_time, 2), 180 | "internal_links": list(set(internal_links_found)), 181 | "external_links": list(set(external_links_found)), 182 | "semantic_elements": semantic_elements, 183 | "heading_issues": heading_issues, 184 | "unlabeled_inputs": unlabeled_inputs, 185 | "images_without_alt": images_without_alt 186 | } 187 | 188 | except requests.exceptions.Timeout: 189 | print(f"Timeout crawling {normalized_url}") 190 | site_structure[normalized_url] = {"status_code": "Timeout", "error": "Request timed out"} 191 | except requests.exceptions.RequestException as e: 192 | print(f"Failed to crawl {normalized_url}: {e}") 193 | site_structure[normalized_url] = {"status_code": "Error", "error": str(e)} 194 | except Exception as e: 195 | print(f"An unexpected error occurred while processing {normalized_url}: {e}") 196 | site_structure[normalized_url] = {"status_code": "Processing Error", "error": str(e)} 197 | 198 | 199 | return site_structure 200 | 201 | 202 | def save_links_as_json(site_structure, filename='links.json'): 203 | with open(filename, 'w') as file: 204 | json.dump(site_structure, file, indent=2) 205 | print(f"Site structure saved to {filename}") 206 | 207 | if __name__ == "__main__": 208 | crawled_site_structure = crawl_site(WEBSITE_TO_CRAWL, MAX_PAGES_TO_CRAWL) 209 | save_links_as_json(crawled_site_structure) 210 | print("Crawling complete. Starting Flask server subprocess...") 211 | subprocess.run(["python", "flask_server.py"]) 212 | print("Flask server subprocess has been initiated.") -------------------------------------------------------------------------------- /flask_server.py: -------------------------------------------------------------------------------- 1 | from flask import Flask, request, jsonify 2 | from flask_cors import CORS 3 | from anthropic_api import analyze_with_anthropic 4 | import json 5 | 6 | site_structure = {} 7 | 8 | app = Flask(__name__) 9 | CORS(app) 10 | 11 | def load_crawled_data(filename='links.json'): 12 | global site_structure 13 | try: 14 | with open(filename, 'r') as f: 15 | site_structure = json.load(f) 16 | print(f"Successfully loaded {len(site_structure)} URLs from {filename}") 17 | except FileNotFoundError: 18 | print(f"ERROR: {filename} not found. Make sure app.py has run and created it.") 19 | site_structure = {} 20 | except json.JSONDecodeError: 21 | print(f"ERROR: Could not decode JSON from {filename}. 
It might be corrupted.") 22 | site_structure = {} 23 | 24 | @app.route('/api/analyze', methods=['POST']) 25 | def analyze(): 26 | data = request.json 27 | if not data or 'url' not in data: 28 | return jsonify({"error": "Missing 'url' in request body"}), 400 29 | 30 | requested_url = data['url'].rstrip("/") 31 | page_data = site_structure.get(requested_url) or site_structure.get(requested_url + "/") 32 | 33 | if not page_data: 34 | print(f"Debug: URL '{requested_url}' not found in site_structure.") 35 | print(f"Debug: Available keys: {list(site_structure.keys())[:5]}") 36 | return jsonify({"error": "No data found for this URL"}), 404 37 | 38 | try: 39 | analysis = analyze_with_anthropic(page_data) 40 | return jsonify({"analysis": analysis}) 41 | except Exception as e: 42 | print(f"Error during analysis for {requested_url}: {e}") 43 | return jsonify({"error": f"An error occurred during analysis: {str(e)}"}), 500 44 | 45 | @app.route('/api/urls') 46 | def list_urls(): 47 | return jsonify(list(site_structure.keys())) 48 | 49 | def attach_data(structure): 50 | global site_structure 51 | print("attach_data called. Note: Server primarily loads data from links.json on startup.") 52 | site_structure = structure 53 | 54 | if __name__ == "__main__": 55 | load_crawled_data() 56 | app.run(debug=True) -------------------------------------------------------------------------------- /index.html: -------------------------------------------------------------------------------- 1 | 2 | 3 |
4 |