├── package.json ├── requirements.txt ├── .gitignore ├── LICENSE ├── README.md ├── singlefile.py └── export.py /package.json: -------------------------------------------------------------------------------- 1 | { 2 | "dependencies": { 3 | "single-file-cli": "2.0.75" 4 | } 5 | } 6 | -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | beautifulsoup4 2 | requests 3 | jsonpickle 4 | canvasapi 5 | python-dateutil 6 | PyYAML 7 | -------------------------------------------------------------------------------- /.gitignore: -------------------------------------------------------------------------------- 1 | .vscode 2 | __pycache__/ 3 | node_modules/ 4 | output/ 5 | 6 | credentials.yaml 7 | cookies.txt 8 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2025 David Katsandres 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. 
IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Introduction 2 | 3 | The Canvas Student Data Export Tool exports nearly all of a student's data from the Instructure Canvas Learning Management System (Canvas LMS). 4 | This is useful when you are graduating or leaving your college or university and want a backup of all the data you had in Canvas. 5 | 6 | The tool exports the following data: 7 | - Course Assignments (including submissions and attachments) 8 | - Course Announcements 9 | - Course Discussions 10 | - Course Pages 11 | - Course Files 12 | - Course Modules 13 | - (Optional) HTML snapshots of: 14 | - Course Home Page 15 | - Grades Page 16 | - Assignments 17 | - Announcements 18 | - Discussions 19 | - Modules 20 | 21 | Data is saved in JSON (and optionally HTML) format and organized into folders by academic term and course. 22 | 23 | Example output structure: 24 | - Fall 2023 25 | - CS 101 26 | - announcements/ 27 | - First Announcement/ 28 | - announcement_1.html 29 | - announcement_list.html 30 | - assignments/ 31 | - Sample Assignment/ 32 | - assignment.html 33 | - submission.html 34 | - assignment_list.html 35 | - course files/ 36 | - file_1.docx 37 | - file_2.png 38 | - discussions/ 39 | - Sample Discussion/ 40 | - discussion_1.html 41 | - discussion_list.html 42 | - modules/ 43 | - Sample Module/ 44 | - Sample Assignment.html 45 | - Sample Discussion.html 46 | - Sample Page.html 47 | - Sample Quiz.html 48 | - modules_list.html 49 | - grades.html 50 | - homepage.html 51 | - CS 101.json 52 | - ENGL 101 53 | - ...
54 | - Spring 2024 55 | - ... 56 | - all_output.json 57 | 58 | # Getting Started 59 | 60 | ## Dependencies 61 | - Python 3.8 or newer 62 | - Node.js 16 or newer (only needed for HTML snapshots) 63 | 64 | 1. **Install Python dependencies:** 65 | ```bash 66 | pip install -r requirements.txt 67 | ``` 68 | 69 | 2. **(Optional) Install SingleFile for HTML snapshots:** 70 | This step requires Node.js. 71 | ```bash 72 | npm install 73 | ``` 74 | 75 | ## Configuration 76 | 77 | To use the tool, you must create a `credentials.yaml` file in the project root. You can also specify a different path using the `-c` or `--config` command-line option. 78 | 79 | Create the `credentials.yaml` file with the following content: 80 | 81 | ```yaml 82 | # The URL of your Canvas instance (e.g., https://your-school.instructure.com) 83 | API_URL: https://example.instructure.com 84 | # Your Canvas API token 85 | API_KEY: 86 | # Your Canvas User ID 87 | USER_ID: 123456 88 | # Path to your browser cookies file (Netscape format). 89 | # This is only required when using the --singlefile flag. 90 | COOKIES_PATH: ./cookies.txt 91 | # (Optional) Path to your Chrome/Chromium executable if SingleFile cannot find it. 92 | # CHROME_PATH: C:\Program Files\Google\Chrome\Application\chrome.exe 93 | # (Optional) Timeout in seconds for SingleFile to capture a page. Default: 60 94 | # Increase this if you see "Capture timeout" errors during HTML snapshots. 95 | # SINGLEFILE_TIMEOUT: 180 96 | # (Optional) A list of course IDs to skip when exporting data. 97 | # COURSES_TO_SKIP: 98 | # - 12345 99 | # - 67890 100 | ``` 101 | 102 | ### Finding Your Credentials 103 | 104 | - **`API_URL`**: Your institution's Canvas URL. 105 | - **`API_KEY`**: In Canvas, go to `Account` > `Settings`, scroll down to `Approved Integrations`, and click `+ New Access Token`. 106 | - **`USER_ID`**: After logging into Canvas, visit `https:///api/v1/users/self`. Your browser will show a JSON response; find the `id` field. 
107 | - **`COOKIES_PATH`**: Required **only if** you use the `--singlefile` flag. Browser cookies are needed to download complete HTML pages as if you were logged in. The script detects expired or invalid cookies and stops downloading HTML pages to prevent errors. For best results, log in to Canvas and export your cookies right before running the script. Use a browser extension such as "Get cookies.txt Clean" for Chrome to export them in Netscape format. 108 | - **`CHROME_PATH`** (Optional): The script attempts to auto-detect Chrome/Chromium on Windows, macOS, and Linux. If it fails, you can specify the path here. 109 | - **`SINGLEFILE_TIMEOUT`** (Optional): Maximum time in seconds to wait for SingleFile to capture a single HTML page. Default is `60` seconds. If you have a slow connection or a busy computer and see "Capture timeout" errors, increase this value. 110 | - **`COURSES_TO_SKIP`** (Optional): A list of course IDs to exclude from the export. To find a course ID, go to the course's homepage and look at the URL for the number that follows `/courses/`. 111 | 112 | ## Running the Exporter 113 | 114 | Once your `credentials.yaml` is set up, run the script: 115 | 116 | ```bash 117 | python export.py [options] 118 | ``` 119 | 120 | **Options:** 121 | 122 | | Flag | Description | Default | 123 | | ----------------------- | --------------------------------------------- | ------------------ | 124 | | `-c`, `--config ` | Path to your YAML credentials file. | `credentials.yaml` | 125 | | `-o`, `--output ` | Directory to store exported data. | `./output` | 126 | | `--singlefile` | Enable HTML snapshot capture with SingleFile. | Disabled | 127 | | `-v`, `--verbose` | Enable verbose output for debugging. | Disabled | 128 | | `--version` | Show the version of the tool and exit.
| N/A | 129 | 130 | **Example:** 131 | 132 | ```bash 133 | # Run with default settings (uses ./credentials.yaml, outputs to ./output) 134 | python export.py 135 | 136 | # Run with a custom output directory and enable HTML snapshots 137 | python export.py -o /path/to/my-canvas-backup --singlefile 138 | ``` 139 | 140 | After the export is complete, the tool will display a detailed summary of all the data that was successfully extracted, including counts of assignments, files, and pages, as well as any warnings or errors encountered. 141 | 142 | # Contribute 143 | 144 | I would love to see this script's functionality expanded and improved! I welcome all pull requests 🙂 145 | Thank you! 146 | -------------------------------------------------------------------------------- /singlefile.py: -------------------------------------------------------------------------------- 1 | from subprocess import CalledProcessError, run 2 | import os 3 | import platform 4 | import shutil 5 | import time 6 | 7 | if platform.system() == "Windows": 8 | SINGLEFILE_BINARY_PATH = os.path.join("node_modules", ".bin", "single-file.cmd") 9 | else: 10 | SINGLEFILE_BINARY_PATH = os.path.join("node_modules", ".bin", "single-file") 11 | 12 | # Prefer calling the Node entry directly for reliable cross-platform arg passing 13 | SINGLEFILE_NODE_ENTRY = os.path.join("node_modules", "single-file-cli", "single-file-node.js") 14 | 15 | # Default Chrome/Chromium executable path is determined heuristically per-OS. 
16 | 17 | 18 | def _detect_chrome_path() -> str: 19 | """Return a best-guess path to a Chrome/Chromium executable for the current OS.""" 20 | system = platform.system().lower() 21 | 22 | candidates = [] 23 | 24 | if system == "windows": 25 | candidates = [ 26 | r"C:\Program Files\Google\Chrome\Application\chrome.exe", 27 | r"C:\Program Files (x86)\Google\Chrome\Application\chrome.exe", 28 | r"C:\Program Files\Chromium\Application\chrome.exe", 29 | ] 30 | elif system == "darwin": # macOS 31 | candidates = [ 32 | "/Applications/Google Chrome.app/Contents/MacOS/Google Chrome", 33 | "/Applications/Chromium.app/Contents/MacOS/Chromium", 34 | "/Applications/Google Chrome Canary.app/Contents/MacOS/Google Chrome Canary", 35 | ] 36 | else: # assume Linux/Unix 37 | for name in ["google-chrome", "google-chrome-stable", "chromium-browser", "chromium", "chrome"]: 38 | path = shutil.which(name) 39 | if path: 40 | return path 41 | 42 | for path in candidates: 43 | if os.path.exists(path): 44 | return path 45 | 46 | # Fallback – rely on SingleFile auto-detect; returns empty string 47 | return "" 48 | 49 | 50 | # Mutable global – can be overridden at runtime by export.py 51 | CHROME_PATH = _detect_chrome_path() 52 | 53 | 54 | # Default timeout in seconds for SingleFile to complete. Can be overridden. 
55 | SINGLEFILE_TIMEOUT = 60.0 # 1 minute 56 | 57 | 58 | def override_chrome_path(path: str): 59 | """Allow callers to override the detected Chrome path at runtime.""" 60 | global CHROME_PATH 61 | CHROME_PATH = path.strip() 62 | 63 | 64 | def override_singlefile_timeout(timeout: float): 65 | """Allow callers to override the SingleFile timeout at runtime.""" 66 | global SINGLEFILE_TIMEOUT 67 | if timeout > 0: 68 | SINGLEFILE_TIMEOUT = timeout 69 | 70 | 71 | def addQuotes(value): 72 | return "\"" + value.strip("\"") + "\"" 73 | 74 | 75 | def download_page(url, cookies_path, output_path, output_name_template="", additional_args=(), verbose=False): 76 | # Build full output path we expect SingleFile to create 77 | expected_output = os.path.join(output_path, output_name_template) if output_name_template else output_path 78 | 79 | # Prepare argument list for robust cross-platform execution 80 | node_path = shutil.which("node") 81 | use_shell_string = False 82 | 83 | # Convert timeout to milliseconds for SingleFile CLI argument 84 | timeout_ms = str(int(SINGLEFILE_TIMEOUT * 1000)) 85 | 86 | if node_path and os.path.exists(SINGLEFILE_NODE_ENTRY): 87 | cmd_args = [ 88 | node_path, 89 | SINGLEFILE_NODE_ENTRY, 90 | url, 91 | expected_output, 92 | "--filename-conflict-action=overwrite", 93 | "--browser-capture-max-time=" + timeout_ms, 94 | ] 95 | if CHROME_PATH: 96 | cmd_args.append("--browser-executable-path=" + CHROME_PATH.strip("\"")) 97 | if cookies_path: 98 | cmd_args.append("--browser-cookies-file=" + cookies_path) 99 | # Append any additional CLI args as-is 100 | cmd_args.extend(list(additional_args)) 101 | else: 102 | # Fallback to the shim in node_modules/.bin using a shell command 103 | use_shell_string = True 104 | args = [ 105 | addQuotes(SINGLEFILE_BINARY_PATH), 106 | addQuotes(url), 107 | addQuotes(expected_output), 108 | "--filename-conflict-action=overwrite", 109 | "--browser-capture-max-time=" + timeout_ms, 110 | ] 111 | if CHROME_PATH: 112 |
args.append("--browser-executable-path=" + addQuotes(CHROME_PATH.strip("\""))) 113 | if cookies_path: 114 | args.append("--browser-cookies-file=" + addQuotes(cookies_path)) 115 | args.extend(additional_args) 116 | cmd_args = " ".join(args) 117 | 118 | try: 119 | if verbose: 120 | if isinstance(cmd_args, list): 121 | print(f" Executing: {' '.join(cmd_args)}") 122 | else: 123 | print(f" Executing: {cmd_args}") 124 | 125 | proc = run(cmd_args, shell=use_shell_string, check=True, capture_output=True) 126 | 127 | # Decode outputs immediately so we can surface them even if the file check fails 128 | stdout_text = proc.stdout.decode("utf-8", errors="replace").strip() 129 | stderr_text = proc.stderr.decode("utf-8", errors="replace").strip() 130 | 131 | # Optionally show SingleFile logs right after the process exits 132 | if verbose: 133 | if stdout_text: 134 | print(stdout_text) 135 | if stderr_text: 136 | # SingleFile prints non-error info to stderr; show only in verbose mode 137 | print(stderr_text) 138 | 139 | # Wait for the file to exist and be readable (handles Windows write/lock delays) 140 | start_time = time.monotonic() 141 | deadline = start_time + SINGLEFILE_TIMEOUT + 5.0 # seconds, add buffer 142 | delay = 0.1 143 | while True: 144 | try: 145 | if not os.path.exists(expected_output): 146 | raise FileNotFoundError(expected_output) 147 | with open(expected_output, "r", encoding="utf-8") as f: 148 | content = f.read() 149 | 150 | # Detect login page content 151 | login_indicators = [ 152 | "Log in to Canvas", 153 | 'id="new_login_data"', 154 | 'autocomplete="current-password"', 155 | ] 156 | if any(indicator in content for indicator in login_indicators): 157 | # Clean up the invalid file 158 | try: 159 | os.remove(expected_output) 160 | except Exception: 161 | pass 162 | raise Exception("Authentication failed, downloaded a login page. 
Please update your cookies.") 163 | 164 | break # success 165 | except (PermissionError, FileNotFoundError) as e: 166 | now = time.monotonic() 167 | if now >= deadline: 168 | # Enrich the error with SingleFile logs for better diagnostics 169 | elapsed = now - start_time 170 | details = [ 171 | f"SingleFile produced no readable output within {elapsed:.1f}s", 172 | f"URL: {url}", 173 | f"Expected path: {expected_output}", 174 | f"Exit code: {proc.returncode}", 175 | ] 176 | if stdout_text: 177 | details.append(f"stdout:\n{stdout_text}") 178 | if stderr_text: 179 | details.append(f"stderr:\n{stderr_text}") 180 | raise Exception("\n".join(details)) from e 181 | time.sleep(min(delay, deadline - now)) 182 | delay = min(delay * 1.5, 1.0) 183 | 184 | except CalledProcessError as e: 185 | # Re-raise with more context including both stdout and stderr 186 | stderr_text = "" 187 | stdout_text = "" 188 | try: 189 | stderr_text = e.stderr.decode('utf-8', errors='replace') if e.stderr is not None else "" 190 | except Exception: 191 | pass 192 | try: 193 | stdout_text = e.stdout.decode('utf-8', errors='replace') if e.stdout is not None else "" 194 | except Exception: 195 | pass 196 | msg_parts = [f"SingleFile failed for {url}."] 197 | if stdout_text: 198 | msg_parts.append(f"stdout:\n{stdout_text}") 199 | if stderr_text: 200 | msg_parts.append(f"stderr:\n{stderr_text}") 201 | raise Exception("\n".join(msg_parts)) from e 202 | except Exception as e: 203 | # Propagate our own exceptions 204 | raise e 205 | 206 | #if __name__ == "__main__": 207 | #download_page("https://www.google.com/", "", "./output/test", "test.html") 208 | -------------------------------------------------------------------------------- /export.py: -------------------------------------------------------------------------------- 1 | # built in 2 | import json 3 | import os 4 | import itertools 5 | import re 6 | import string 7 | import unicodedata 8 | import argparse 9 | import sys 10 | 11 | # external 12 | from 
bs4 import BeautifulSoup 13 | from canvasapi import Canvas 14 | from canvasapi.exceptions import ResourceDoesNotExist, Unauthorized, Forbidden, InvalidAccessToken, CanvasException 15 | from singlefile import download_page, override_chrome_path, override_singlefile_timeout 16 | import dateutil.parser 17 | import jsonpickle 18 | import requests 19 | import yaml 20 | 21 | # Canvas API Error Handling Utility 22 | class CanvasErrorHandler: 23 | @staticmethod 24 | def handle_canvas_exception(e, operation_description="operation"): 25 | """ 26 | Handle Canvas API exceptions with appropriate messaging and classification. 27 | Returns (error_type, message) 28 | """ 29 | if isinstance(e, InvalidAccessToken): 30 | return "authentication", f"Invalid Canvas API token. Please check your credentials.yaml file." 31 | 32 | elif isinstance(e, Unauthorized): 33 | # Check if this is a known student limitation 34 | if "submissions" in operation_description.lower(): 35 | return "student_limitation", f"Not authorized to download every student's assignment submission. This is normal for student accounts." 36 | elif "file" in operation_description.lower(): 37 | return "student_limitation", f"Not authorized to download some course files. This is normal for student accounts." 38 | else: 39 | return "authorization", f"Not authorized to perform {operation_description}. Check your Canvas permissions." 40 | 41 | elif isinstance(e, Forbidden): 42 | return "student_limitation", f"Access forbidden for {operation_description}. This may be normal for student accounts." 43 | 44 | elif isinstance(e, ResourceDoesNotExist): 45 | return "not_found", f"Resource not found for {operation_description}. It may have been deleted or moved." 
46 | 47 | elif isinstance(e, CanvasException): 48 | return "canvas_error", f"Canvas API error during {operation_description}: {str(e)}" 49 | 50 | else: 51 | return "unknown_error", f"Unexpected error during {operation_description}: {str(e)}" 52 | 53 | @staticmethod 54 | def log_error(error_type, message, show_details=True, verbose=False): 55 | """Log error messages with appropriate formatting""" 56 | if error_type == "student_limitation": 57 | if show_details: 58 | print(f" Note: {message}") 59 | elif error_type == "not_found": 60 | print(f" Skipping: {message}") 61 | elif error_type in ["authentication", "authorization", "canvas_error", "unknown_error"]: 62 | print(f" ERROR: {message}") 63 | if verbose: 64 | import traceback 65 | traceback.print_exc() 66 | else: 67 | print(f" {message}") 68 | 69 | @staticmethod 70 | def is_fatal_error(error_type): 71 | """Check if an error type should stop execution""" 72 | return error_type in ["authentication", "canvas_error", "authorization"] 73 | 74 | # Add counters for tracking successful extractions 75 | class ExtractionStats: 76 | def __init__(self): 77 | self.assignments_found = 0 78 | self.submissions_found = 0 79 | self.announcements_found = 0 80 | self.discussions_found = 0 81 | self.pages_found = 0 82 | self.modules_found = 0 83 | self.module_items_found = 0 84 | self.files_downloaded = 0 85 | self.attachments_downloaded = 0 86 | self.html_pages_downloaded = 0 87 | self.json_files_created = 0 88 | self.student_limitation_warnings = 0 89 | self.error_count = 0 90 | 91 | def summary(self, dl_location, singlefile_enabled=False): 92 | summary_text = f""" 93 | Data Extraction Summary: 94 | • {self.assignments_found} assignments found 95 | • {self.submissions_found} submissions found (your own) 96 | • {self.announcements_found} announcements found 97 | • {self.discussions_found} discussions found 98 | • {self.pages_found} pages found 99 | • {self.modules_found} modules found 100 | • {self.module_items_found} module items 
found 101 | 102 | Files Downloaded: 103 | • {self.files_downloaded} course files downloaded 104 | • {self.attachments_downloaded} assignment attachments downloaded""" 105 | 106 | if singlefile_enabled: 107 | summary_text += f"\n • {self.html_pages_downloaded} HTML pages captured" 108 | 109 | summary_text += f""" 110 | 111 | Data Exports Created: 112 | • {self.json_files_created} JSON data files created 113 | • Individual course data: {dl_location}/[Term]/[Course]/[Course].json 114 | • Combined data: {dl_location}/all_output.json 115 | 116 | Student Account Limitations: {self.student_limitation_warnings} (expected) 117 | Errors Encountered: {self.error_count} 118 | """ 119 | return summary_text 120 | 121 | # Global stats tracker 122 | extraction_stats = ExtractionStats() 123 | 124 | def _load_credentials(path: str) -> dict: 125 | """Return a dict with API_URL, API_KEY, USER_ID, COOKIES_PATH or empty dict if file missing.""" 126 | try: 127 | with open(path, "r", encoding="utf-8") as f: 128 | return yaml.safe_load(f) or {} 129 | except FileNotFoundError: 130 | return {} 131 | 132 | # Placeholder globals – will be overwritten in __main__ once we have parsed CLI args. 133 | API_URL = "" 134 | API_KEY = "" 135 | USER_ID = 0 136 | COOKIES_PATH = "" 137 | 138 | # Directory into which course information is downloaded (will be created if not 139 | # present) 140 | DL_LOCATION = "./output" 141 | # List of Course IDs that should be skipped 142 | COURSES_TO_SKIP = [] 143 | 144 | DATE_TEMPLATE = "%B %d, %Y %I:%M %p" 145 | 146 | # Max PATH length is 260 characters on Windows. 70 is just an estimate for a reasonable max folder name to reduce the chance of reaching the limit 147 | # Applies to modules, assignments, announcements, and discussions 148 | # If a folder exceeds this limit, a "-" will be added to the end to indicate it was shortened ("..."
not valid) 149 | MAX_FOLDER_NAME_SIZE = 70 150 | 151 | # Global flag to stop HTML downloads if cookies are invalid 152 | stop_html_downloads = False 153 | 154 | 155 | class moduleItemView(): 156 | id = 0 157 | 158 | title = "" 159 | content_type = "" 160 | 161 | url = "" 162 | external_url = "" 163 | 164 | 165 | class moduleView(): 166 | id = 0 167 | 168 | name = "" 169 | items = [] 170 | 171 | def __init__(self): 172 | self.items = [] 173 | 174 | 175 | class pageView(): 176 | id = 0 177 | 178 | title = "" 179 | body = "" 180 | created_date = "" 181 | last_updated_date = "" 182 | 183 | 184 | class topicReplyView(): 185 | id = 0 186 | 187 | author = "" 188 | posted_date = "" 189 | body = "" 190 | 191 | 192 | class topicEntryView(): 193 | id = 0 194 | 195 | author = "" 196 | posted_date = "" 197 | body = "" 198 | topic_replies = [] 199 | 200 | def __init__(self): 201 | self.topic_replies = [] 202 | 203 | 204 | class discussionView(): 205 | id = 0 206 | 207 | title = "" 208 | author = "" 209 | posted_date = "" 210 | body = "" 211 | topic_entries = [] 212 | 213 | url = "" 214 | amount_pages = 0 215 | 216 | def __init__(self): 217 | self.topic_entries = [] 218 | 219 | 220 | class submissionView(): 221 | id = 0 222 | 223 | attachments = [] 224 | grade = "" 225 | raw_score = "" 226 | submission_comments = "" 227 | total_possible_points = "" 228 | attempt = 0 229 | user_id = "no-id" 230 | 231 | preview_url = "" 232 | ext_url = "" 233 | 234 | def __init__(self): 235 | self.attachments = [] 236 | 237 | class attachmentView(): 238 | id = 0 239 | 240 | filename = "" 241 | url = "" 242 | 243 | class assignmentView(): 244 | id = 0 245 | 246 | title = "" 247 | description = "" 248 | assigned_date = "" 249 | due_date = "" 250 | submissions = [] 251 | 252 | html_url = "" 253 | ext_url = "" 254 | updated_url = "" 255 | 256 | def __init__(self): 257 | self.submissions = [] 258 | 259 | 260 | class courseView(): 261 | course_id = 0 262 | 263 | term = "" 264 | course_code = "" 265 | 
name = "" 266 | assignments = [] 267 | announcements = [] 268 | discussions = [] 269 | modules = [] 270 | 271 | def __init__(self): 272 | self.assignments = [] 273 | self.announcements = [] 274 | self.discussions = [] 275 | self.modules = [] 276 | 277 | def makeValidFilename(input_str): 278 | if not input_str: 279 | return input_str 280 | 281 | # Normalize Unicode and whitespace 282 | input_str = unicodedata.normalize('NFKC', input_str) 283 | input_str = input_str.replace("\u00A0", " ") # NBSP to space 284 | input_str = re.sub(r"\s+", " ", input_str) 285 | 286 | # Remove invalid characters 287 | valid_chars = "-_.() %s%s" % (string.ascii_letters, string.digits) 288 | input_str = input_str.replace("+", " ") # Canvas default for spaces 289 | input_str = input_str.replace(":", "-") 290 | input_str = input_str.replace("/", "-") 291 | input_str = "".join(c for c in input_str if c in valid_chars) 292 | 293 | # Remove leading and trailing whitespace 294 | input_str = input_str.strip() 295 | 296 | # Remove trailing periods 297 | input_str = input_str.rstrip(".") 298 | 299 | return input_str 300 | 301 | def makeValidFolderPath(input_str): 302 | # Normalize Unicode and whitespace 303 | input_str = unicodedata.normalize('NFKC', input_str) 304 | input_str = input_str.replace("\u00A0", " ") # NBSP to space 305 | input_str = re.sub(r"\s+", " ", input_str) 306 | 307 | # Remove invalid characters 308 | valid_chars = "-_.()/ %s%s" % (string.ascii_letters, string.digits) 309 | input_str = input_str.replace("+", " ") # Canvas default for spaces 310 | input_str = input_str.replace(":", "-") 311 | input_str = "".join(c for c in input_str if c in valid_chars) 312 | 313 | # Remove leading and trailing whitespace, separators 314 | input_str = input_str.strip().strip("/").strip("\\") 315 | 316 | # Remove trailing periods 317 | input_str = input_str.rstrip(".") 318 | 319 | # Replace path separators with OS default 320 | input_str = input_str.replace("/", os.sep) 321 | 322 |
return input_str 323 | 324 | def shortenFileName(string, shorten_by) -> str: 325 | if (not string or shorten_by <= 0): 326 | return string 327 | 328 | # Shorten string by specified value + 1 for "-" to indicate incomplete file name (trailing periods not allowed) 329 | string = string[:len(string)-(shorten_by + 1)] 330 | 331 | string = string.rstrip().rstrip(".").rstrip("-") 332 | string += "-" 333 | 334 | return string 335 | 336 | 337 | def findCourseModules(course, course_view): 338 | modules_dir = os.path.join(DL_LOCATION, course_view.term, 339 | course_view.course_code, "modules") 340 | 341 | # Create modules directory if not present 342 | if not os.path.exists(modules_dir): 343 | os.makedirs(modules_dir) 344 | 345 | module_views = [] 346 | 347 | try: 348 | modules = course.get_modules() 349 | modules_list = list(modules) # Convert to list to get count 350 | 351 | if not modules_list: 352 | print(" No modules found in this course") 353 | else: 354 | print(f" Found {len(modules_list)} modules") 355 | 356 | for module in modules_list: 357 | module_view = moduleView() 358 | 359 | # ID 360 | module_view.id = module.id if hasattr(module, "id") else 0 361 | 362 | # Name 363 | module_view.name = str(module.name) if hasattr(module, "name") else "" 364 | print(f" Processing module: {module_view.name}") 365 | 366 | try: 367 | # Get module items 368 | module_items = module.get_module_items() 369 | module_items_list = list(module_items) 370 | 371 | if module_items_list: 372 | print(f" Found {len(module_items_list)} items") 373 | 374 | for module_item in module_items_list: 375 | module_item_view = moduleItemView() 376 | 377 | # ID 378 | module_item_view.id = module_item.id if hasattr(module_item, "id") else 0 379 | 380 | # Title 381 | module_item_view.title = str(module_item.title) if hasattr(module_item, "title") else "" 382 | # Type 383 | module_item_view.content_type = str(module_item.type) if hasattr(module_item, "type") else "" 384 | 385 | # URL 386 | 
module_item_view.url = str(module_item.html_url) if hasattr(module_item, "html_url") else "" 387 | # External URL 388 | module_item_view.external_url = str(module_item.external_url) if hasattr(module_item, "external_url") else "" 389 | 390 | if module_item_view.content_type == "File": 391 | # If problems arise due to long pathnames, changing module.name to module.id might help 392 | # A change would also have to be made in downloadCourseModulePages(api_url, course_view, cookies_path) 393 | module_name = makeValidFilename(str(module.name)) 394 | module_name = shortenFileName(module_name, len(module_name) - MAX_FOLDER_NAME_SIZE) 395 | module_dir = os.path.join(modules_dir, module_name, "files") 396 | 397 | try: 398 | # Create directory for current module if not present 399 | if not os.path.exists(module_dir): 400 | os.makedirs(module_dir) 401 | 402 | # Get the file object 403 | module_file = course.get_file(str(module_item.content_id)) 404 | 405 | # Create path for module file download 406 | module_file_path = os.path.join(module_dir, makeValidFilename(str(module_file.display_name))) 407 | 408 | # Download file if it doesn't already exist 409 | if not os.path.exists(module_file_path): 410 | module_file.download(module_file_path) 411 | extraction_stats.files_downloaded += 1 412 | print(f" Downloaded: {module_file.display_name}") 413 | else: 414 | print(f" File already exists: {module_file.display_name}") 415 | except Exception as e: 416 | error_type, message = CanvasErrorHandler.handle_canvas_exception( 417 | e, "module file download" 418 | ) 419 | if error_type == "student_limitation": 420 | extraction_stats.student_limitation_warnings += 1 421 | elif error_type == "not_found": 422 | pass # Already handled by log_error 423 | else: 424 | extraction_stats.error_count += 1 425 | CanvasErrorHandler.log_error(error_type, message) 426 | 427 | module_view.items.append(module_item_view) 428 | extraction_stats.module_items_found += 1 429 | except Exception as e: 430 | 
error_type, message = CanvasErrorHandler.handle_canvas_exception( 431 | e, "module item processing" 432 | ) 433 | CanvasErrorHandler.log_error(error_type, message, verbose=args.verbose) 434 | extraction_stats.error_count += 1 435 | 436 | module_views.append(module_view) 437 | extraction_stats.modules_found += 1 438 | 439 | except Exception as e: 440 | error_type, message = CanvasErrorHandler.handle_canvas_exception( 441 | e, "module processing" 442 | ) 443 | CanvasErrorHandler.log_error(error_type, message, verbose=args.verbose) 444 | extraction_stats.error_count += 1 445 | 446 | return module_views 447 | 448 | 449 | def downloadCourseFiles(course, course_view): 450 | # file full_name starts with "course files" 451 | dl_dir = os.path.join(DL_LOCATION, course_view.term, 452 | course_view.course_code) 453 | 454 | # Create directory if not present 455 | if not os.path.exists(dl_dir): 456 | os.makedirs(dl_dir) 457 | 458 | try: 459 | files = course.get_files() 460 | files_list = list(files) # Convert to list for consistency and count 461 | 462 | for file in files_list: 463 | file_folder=course.get_folder(file.folder_id) 464 | 465 | folder_dl_dir=os.path.join(dl_dir, makeValidFolderPath(file_folder.full_name)) 466 | 467 | if not os.path.exists(folder_dl_dir): 468 | os.makedirs(folder_dl_dir) 469 | 470 | dl_path = os.path.join(folder_dl_dir, makeValidFilename(str(file.display_name))) 471 | 472 | print(f" Downloading: {file.display_name}...") 473 | if not os.path.exists(dl_path): 474 | try: 475 | file.download(dl_path) 476 | extraction_stats.files_downloaded += 1 477 | print(f" ✓ Saved: {file.display_name}") 478 | except Exception as e: 479 | error_type, message = CanvasErrorHandler.handle_canvas_exception(e, f"file download for {file.display_name}") 480 | CanvasErrorHandler.log_error(error_type, message, verbose=args.verbose) 481 | extraction_stats.error_count += 1 482 | else: 483 | print(f" ✓ Already exists: {file.display_name}") 484 | 485 | except Exception as e: 486 | 
error_type, message = CanvasErrorHandler.handle_canvas_exception( 487 | e, "course file download" 488 | ) 489 | if error_type == "student_limitation": 490 | extraction_stats.student_limitation_warnings += 1 491 | else: 492 | extraction_stats.error_count += 1 493 | CanvasErrorHandler.log_error(error_type, message, verbose=args.verbose) 494 | 495 | 496 | def download_submission_attachments(course, course_view): 497 | course_dir = os.path.join(DL_LOCATION, course_view.term, 498 | course_view.course_code) 499 | 500 | # Create directory if not present 501 | if not os.path.exists(course_dir): 502 | os.makedirs(course_dir) 503 | 504 | for assignment in course_view.assignments: 505 | for submission in assignment.submissions: 506 | assignment_title = makeValidFilename(str(assignment.title)) 507 | assignment_title = shortenFileName(assignment_title, len(assignment_title) - MAX_FOLDER_NAME_SIZE) 508 | attachment_dir = os.path.join(course_dir, "assignments", assignment_title) 509 | if(len(assignment.submissions)!=1): 510 | attachment_dir = os.path.join(attachment_dir,str(submission.user_id)) 511 | if (not os.path.exists(attachment_dir)) and (submission.attachments): 512 | os.makedirs(attachment_dir) 513 | for attachment in submission.attachments: 514 | filepath = os.path.join(attachment_dir, makeValidFilename(str(attachment.id) + 515 | "_" + attachment.filename)) 516 | 517 | print(f" Downloading attachment: {attachment.filename}...") 518 | if not os.path.exists(filepath): 519 | try: 520 | r = requests.get(attachment.url, allow_redirects=True) 521 | r.raise_for_status() 522 | with open(filepath, 'wb') as f: 523 | f.write(r.content) 524 | extraction_stats.attachments_downloaded += 1 525 | print(f" ✓ Saved: {attachment.filename}") 526 | except Exception as e: 527 | print(f" ❌ Failed to download {attachment.filename}: {e}") 528 | extraction_stats.error_count += 1 529 | else: 530 | print(f" ✓ Already exists: {attachment.filename}") 531 | 532 | 533 | def getCoursePageUrls(course): 
534 | page_urls = [] 535 | 536 | try: 537 | # Get all pages 538 | pages = course.get_pages() 539 | 540 | for page in pages: 541 | if hasattr(page, "url"): 542 | page_urls.append(str(page.url)) 543 | except Exception as e: 544 | error_msg = str(e) 545 | if "Not Found" not in error_msg: 546 | error_type, message = CanvasErrorHandler.handle_canvas_exception( 547 | e, "page URL retrieval" 548 | ) 549 | CanvasErrorHandler.log_error(error_type, message, verbose=args.verbose) 550 | if error_type != "student_limitation": 551 | extraction_stats.error_count += 1 552 | else: 553 | extraction_stats.student_limitation_warnings += 1 554 | 555 | return page_urls 556 | 557 | 558 | def findCoursePages(course): 559 | page_views = [] 560 | 561 | try: 562 | # Get all page URLs 563 | page_urls = getCoursePageUrls(course) 564 | 565 | for url in page_urls: 566 | page = course.get_page(url) 567 | 568 | page_view = pageView() 569 | 570 | # ID 571 | page_view.id = page.id if hasattr(page, "id") else 0 572 | 573 | # Title 574 | page_view.title = str(page.title) if hasattr(page, "title") else "" 575 | # Body 576 | page_view.body = str(page.body) if hasattr(page, "body") else "" 577 | # Date created 578 | try: 579 | page_view.created_date = dateutil.parser.parse(page.created_at).strftime(DATE_TEMPLATE) if \ 580 | hasattr(page, "created_at") else "" 581 | except (ValueError, TypeError): 582 | page_view.created_date = "" 583 | 584 | # Date last updated 585 | try: 586 | page_view.last_updated_date = dateutil.parser.parse(page.updated_at).strftime(DATE_TEMPLATE) if \ 587 | hasattr(page, "updated_at") else "" 588 | except (ValueError, TypeError): 589 | page_view.last_updated_date = "" 590 | 591 | page_views.append(page_view) 592 | extraction_stats.pages_found += 1 593 | except Exception as e: 594 | error_type, message = CanvasErrorHandler.handle_canvas_exception( 595 | e, "page download" 596 | ) 597 | CanvasErrorHandler.log_error(error_type, message, verbose=args.verbose) 598 | 
extraction_stats.error_count += 1 599 | 600 | return page_views 601 | 602 | 603 | def findCourseAssignments(course): 604 | assignment_views = [] 605 | 606 | # Get all assignments 607 | assignments = course.get_assignments() 608 | assignments_list = list(assignments) # Convert to list for consistency 609 | 610 | try: 611 | for assignment in assignments_list: 612 | # Create a new assignment view 613 | assignment_view = assignmentView() 614 | 615 | #ID 616 | assignment_view.id = assignment.id if \ 617 | hasattr(assignment, "id") else 0 618 | 619 | # Title 620 | assignment_view.title = makeValidFilename(str(assignment.name)) if \ 621 | hasattr(assignment, "name") else "" 622 | # Description 623 | assignment_view.description = str(assignment.description) if \ 624 | hasattr(assignment, "description") else "" 625 | 626 | # Assigned date 627 | try: 628 | assignment_view.assigned_date = dateutil.parser.parse(assignment.created_at).strftime(DATE_TEMPLATE) if \ 629 | hasattr(assignment, "created_at") and assignment.created_at else "" 630 | except (ValueError, TypeError): 631 | assignment_view.assigned_date = "" 632 | 633 | # Due date 634 | try: 635 | assignment_view.due_date = dateutil.parser.parse(assignment.due_at).strftime(DATE_TEMPLATE) if \ 636 | hasattr(assignment, "due_at") and assignment.due_at else "" 637 | except (ValueError, TypeError): 638 | assignment_view.due_date = "" 639 | 640 | # HTML Url 641 | assignment_view.html_url = assignment.html_url if \ 642 | hasattr(assignment, "html_url") else "" 643 | # External URL 644 | assignment_view.ext_url = str(assignment.url) if \ 645 | hasattr(assignment, "url") else "" 646 | # Other URL (more up-to-date) 647 | assignment_view.updated_url = str(assignment.submissions_download_url).split("submissions?")[0] if \ 648 | hasattr(assignment, "submissions_download_url") else "" 649 | 650 | try: 651 | try: # Download all submissions for entire class 652 | submissions = assignment.get_submissions() 653 | submissions[0] # Trigger 
Unauthorized if not allowed 654 | except (Unauthorized, Forbidden) as e: 655 | error_type, message = CanvasErrorHandler.handle_canvas_exception( 656 | e, "class submission download" 657 | ) 658 | if error_type == "student_limitation": 659 | extraction_stats.student_limitation_warnings += 1 660 | if extraction_stats.student_limitation_warnings == 1: 661 | print(f" Note: Not authorized to download every student's assignment submission. Downloading submission for user {USER_ID} only.") 662 | else: 663 | CanvasErrorHandler.log_error(error_type, message, verbose=args.verbose) 664 | extraction_stats.error_count += 1 665 | 666 | # Download submission for this user only 667 | submissions = [assignment.get_submission(USER_ID)] 668 | submissions[0] #throw error if no submissions found at all but without error 669 | except (ResourceDoesNotExist, NameError, IndexError) as e: 670 | error_type, message = CanvasErrorHandler.handle_canvas_exception( 671 | e, "submission retrieval" 672 | ) 673 | CanvasErrorHandler.log_error(error_type, message, verbose=args.verbose) 674 | extraction_stats.error_count += 1 675 | except Exception as e: 676 | error_type, message = CanvasErrorHandler.handle_canvas_exception( 677 | e, "submission retrieval" 678 | ) 679 | CanvasErrorHandler.log_error(error_type, message, verbose=args.verbose) 680 | extraction_stats.error_count += 1 681 | else: 682 | try: 683 | for submission in submissions: 684 | 685 | sub_view = submissionView() 686 | 687 | # Submission ID 688 | sub_view.id = submission.id if \ 689 | hasattr(submission, "id") else 0 690 | 691 | # My grade 692 | sub_view.grade = str(submission.grade) if \ 693 | hasattr(submission, "grade") else "" 694 | # My raw score 695 | sub_view.raw_score = str(submission.score) if \ 696 | hasattr(submission, "score") else "" 697 | # Total possible score 698 | sub_view.total_possible_points = str(assignment.points_possible) if \ 699 | hasattr(assignment, "points_possible") else "" 700 | # Submission comments 701 | 
sub_view.submission_comments = str(submission.submission_comments) if \ 702 | hasattr(submission, "submission_comments") else "" 703 | # Attempt 704 | sub_view.attempt = submission.attempt if \ 705 | hasattr(submission, "attempt") and submission.attempt is not None else 0 706 | # User ID 707 | sub_view.user_id = str(submission.user_id) if \ 708 | hasattr(submission, "user_id") else "" 709 | 710 | # Submission URL 711 | sub_view.preview_url = str(submission.preview_url) if \ 712 | hasattr(submission, "preview_url") else "" 713 | # External URL 714 | sub_view.ext_url = str(submission.url) if \ 715 | hasattr(submission, "url") else "" 716 | 717 | try: 718 | submission.attachments 719 | except AttributeError: 720 | pass # No attachments message removed for cleaner output 721 | else: 722 | attachment_count = len(submission.attachments) if submission.attachments else 0 723 | if attachment_count > 0: 724 | print(f" Found {attachment_count} attachments") 725 | for attachment in submission.attachments: 726 | attach_view = attachmentView() 727 | attach_view.url = attachment.url 728 | attach_view.id = attachment.id 729 | attach_view.filename = attachment.filename 730 | sub_view.attachments.append(attach_view) 731 | assignment_view.submissions.append(sub_view) 732 | extraction_stats.submissions_found += 1 733 | except Exception as e: 734 | error_type, message = CanvasErrorHandler.handle_canvas_exception( 735 | e, "submission processing" 736 | ) 737 | CanvasErrorHandler.log_error(error_type, message, verbose=args.verbose) 738 | extraction_stats.error_count += 1 739 | 740 | assignment_views.append(assignment_view) 741 | extraction_stats.assignments_found += 1 742 | except Exception as e: 743 | error_type, message = CanvasErrorHandler.handle_canvas_exception( 744 | e, "course assignments processing" 745 | ) 746 | CanvasErrorHandler.log_error(error_type, message, verbose=args.verbose) 747 | extraction_stats.error_count += 1 748 | 749 | return assignment_views 750 | 751 | 752 | def 
findCourseAnnouncements(course): 753 | announcement_views = [] 754 | 755 | try: 756 | announcements = course.get_discussion_topics(only_announcements=True) 757 | 758 | for announcement in announcements: 759 | discussion_view = getDiscussionView(announcement) 760 | 761 | announcement_views.append(discussion_view) 762 | extraction_stats.announcements_found += 1 763 | except Exception as e: 764 | error_type, message = CanvasErrorHandler.handle_canvas_exception( 765 | e, "announcement processing" 766 | ) 767 | CanvasErrorHandler.log_error(error_type, message, verbose=args.verbose) 768 | extraction_stats.error_count += 1 769 | 770 | return announcement_views 771 | 772 | 773 | def getDiscussionView(discussion_topic): 774 | # Create discussion view 775 | discussion_view = discussionView() 776 | 777 | #ID 778 | discussion_view.id = discussion_topic.id if hasattr(discussion_topic, "id") else 0 779 | 780 | # Title 781 | discussion_view.title = str(discussion_topic.title) if hasattr(discussion_topic, "title") else "" 782 | # Author 783 | discussion_view.author = str(discussion_topic.user_name) if hasattr(discussion_topic, "user_name") else "" 784 | # Posted date 785 | try: 786 | discussion_view.posted_date = dateutil.parser.parse(discussion_topic.created_at).strftime("%B %d, %Y %I:%M %p") if \ 787 | hasattr(discussion_topic, "created_at") and discussion_topic.created_at else "" 788 | except (ValueError, TypeError): 789 | discussion_view.posted_date = "" 790 | # Body 791 | discussion_view.body = str(discussion_topic.message) if hasattr(discussion_topic, "message") else "" 792 | 793 | # URL 794 | discussion_view.url = str(discussion_topic.html_url) if hasattr(discussion_topic, "html_url") else "" 795 | 796 | # Keeps track of how many topic_entries there are. 
797 | topic_entries_counter = 0 798 | 799 | # Topic entries 800 | if hasattr(discussion_topic, "discussion_subentry_count") and discussion_topic.discussion_subentry_count > 0: 801 | # Need to get replies to entries recursively? 802 | 803 | discussion_topic_entries = discussion_topic.get_topic_entries() 804 | 805 | try: 806 | for topic_entry in discussion_topic_entries: 807 | topic_entries_counter += 1 808 | 809 | # Create new discussion view for the topic_entry 810 | topic_entry_view = topicEntryView() 811 | 812 | # ID 813 | topic_entry_view.id = topic_entry.id if hasattr(topic_entry, "id") else 0 814 | # Author 815 | topic_entry_view.author = str(topic_entry.user_name) if hasattr(topic_entry, "user_name") else "" 816 | # Posted date 817 | try: 818 | topic_entry_view.posted_date = dateutil.parser.parse(topic_entry.created_at).strftime("%B %d, %Y %I:%M %p") if \ 819 | hasattr(topic_entry, "created_at") and topic_entry.created_at else "" 820 | except (ValueError, TypeError): 821 | topic_entry_view.posted_date = "" 822 | # Body 823 | topic_entry_view.body = str(topic_entry.message) if hasattr(topic_entry, "message") else "" 824 | 825 | # Get this topic's replies 826 | topic_entry_replies = topic_entry.get_replies() 827 | 828 | try: 829 | for topic_reply in topic_entry_replies: 830 | # Create new topic reply view 831 | topic_reply_view = topicReplyView() 832 | 833 | # ID 834 | topic_reply_view.id = topic_reply.id if hasattr(topic_reply, "id") else 0 835 | 836 | # Author 837 | topic_reply_view.author = str(topic_reply.user_name) if hasattr(topic_reply, "user_name") else "" 838 | # Posted Date 839 | try: 840 | topic_reply_view.posted_date = dateutil.parser.parse(topic_reply.created_at).strftime("%B %d, %Y %I:%M %p") if \ 841 | hasattr(topic_reply, "created_at") and topic_reply.created_at else "" 842 | except (ValueError, TypeError): 843 | topic_reply_view.posted_date = "" 844 | # Body 845 | topic_reply_view.body = str(topic_reply.message) if hasattr(topic_reply, 
"message") else "" 846 | 847 | topic_entry_view.topic_replies.append(topic_reply_view) 848 | except Exception as e: 849 | error_type, message = CanvasErrorHandler.handle_canvas_exception( 850 | e, "discussion topic reply processing" 851 | ) 852 | CanvasErrorHandler.log_error(error_type, message, verbose=args.verbose) 853 | if error_type == "student_limitation": 854 | extraction_stats.student_limitation_warnings += 1 855 | elif error_type == "not_found": 856 | pass # Already handled by log_error 857 | else: 858 | extraction_stats.error_count += 1 859 | 860 | discussion_view.topic_entries.append(topic_entry_view) 861 | except Exception as e: 862 | error_type, message = CanvasErrorHandler.handle_canvas_exception( 863 | e, "discussion topic entry processing" 864 | ) 865 | CanvasErrorHandler.log_error(error_type, message, verbose=args.verbose) 866 | if error_type == "student_limitation": 867 | extraction_stats.student_limitation_warnings += 1 868 | elif error_type == "not_found": 869 | pass # Already handled by log_error 870 | else: 871 | extraction_stats.error_count += 1 872 | 873 | # Amount of pages 874 | discussion_view.amount_pages = int(topic_entries_counter/50) + 1 # Typically 50 topic entries are stored on a page before it creates another page. 
875 | 876 | return discussion_view 877 | 878 | 879 | def findCourseDiscussions(course): 880 | discussion_views = [] 881 | 882 | try: 883 | discussion_topics = course.get_discussion_topics() 884 | 885 | for discussion_topic in discussion_topics: 886 | discussion_view = None 887 | discussion_view = getDiscussionView(discussion_topic) 888 | 889 | discussion_views.append(discussion_view) 890 | extraction_stats.discussions_found += 1 891 | except Exception as e: 892 | error_type, message = CanvasErrorHandler.handle_canvas_exception( 893 | e, "discussion processing" 894 | ) 895 | CanvasErrorHandler.log_error(error_type, message, verbose=args.verbose) 896 | extraction_stats.error_count += 1 897 | 898 | return discussion_views 899 | 900 | 901 | def getCourseView(course): 902 | course_view = courseView() 903 | 904 | # Course ID 905 | course_view.course_id = course.id if hasattr(course, "id") else 0 906 | 907 | # Course term 908 | course_view.term = makeValidFilename(course.term.name if hasattr(course, "term") and hasattr(course.term, "name") else "") 909 | 910 | # Course code 911 | course_view.course_code = makeValidFilename(course.course_code if hasattr(course, "course_code") else "") 912 | 913 | # Course name 914 | course_view.name = course.name if hasattr(course, "name") else "" 915 | 916 | print(f"Working on: {course_view.term}: {course_view.name}") 917 | 918 | # Track HTML pages saved per course 919 | html_pages_saved_in_course = 0 920 | 921 | # Course assignments 922 | print(" Getting assignments") 923 | course_view.assignments = findCourseAssignments(course) 924 | print(f" Found {len(course_view.assignments)} assignments") 925 | 926 | # Course announcements 927 | print(" Getting announcements") 928 | course_view.announcements = findCourseAnnouncements(course) 929 | print(f" Found {len(course_view.announcements)} announcements") 930 | 931 | # Course discussions 932 | print(" Getting discussions") 933 | course_view.discussions = findCourseDiscussions(course) 934 | 
print(f" Found {len(course_view.discussions)} discussions") 935 | 936 | # Course pages 937 | print(" Getting pages") 938 | course_view.pages = findCoursePages(course) 939 | print(f" Found {len(course_view.pages)} pages") 940 | 941 | return course_view 942 | 943 | 944 | def exportAllCourseData(course_view): 945 | json_str = json.dumps(json.loads(jsonpickle.encode(course_view, unpicklable = False)), indent = 4) 946 | 947 | course_output_dir = os.path.join(DL_LOCATION, course_view.term, 948 | course_view.course_code) 949 | 950 | # Create directory if not present 951 | if not os.path.exists(course_output_dir): 952 | os.makedirs(course_output_dir) 953 | 954 | course_output_path = os.path.join(course_output_dir, 955 | course_view.course_code + ".json") 956 | 957 | print(f" Exporting JSON data for {course_view.course_code}...") 958 | with open(course_output_path, "w") as out_file: 959 | out_file.write(json_str) 960 | 961 | extraction_stats.json_files_created += 1 962 | print(f" ✓ Data saved to: {course_output_path}") 963 | 964 | def _download_page_if_not_exists(url, output_path, cookies_path, additional_args=(), verbose=False): 965 | """ 966 | Downloads a single HTML page if it doesn't exist, updating stats. 967 | Returns True if downloaded, False otherwise. 
968 | """ 969 | global stop_html_downloads 970 | if stop_html_downloads: 971 | return False 972 | 973 | filename = os.path.basename(output_path) 974 | print(f" Downloading: {filename}...") 975 | 976 | if not os.path.exists(output_path): 977 | output_dir = os.path.dirname(output_path) 978 | os.makedirs(output_dir, exist_ok=True) 979 | 980 | try: 981 | download_page(url, cookies_path, output_dir, filename, additional_args, verbose) 982 | extraction_stats.html_pages_downloaded += 1 983 | print(f" ✓ Saved: {filename}") 984 | return True 985 | except Exception as e: 986 | print(f" ❌ Failed: {e}") 987 | extraction_stats.error_count += 1 988 | if "Authentication failed" in str(e): 989 | print(" Stopping all subsequent HTML downloads.") 990 | stop_html_downloads = True 991 | return False 992 | else: 993 | print(f" ✓ Already exists: {filename}") 994 | return True # Return True because the file exists, which is a success condition for the caller 995 | 996 | def downloadCourseHTML(api_url, cookies_path, verbose=False): 997 | if not cookies_path or stop_html_downloads: 998 | return 0 999 | 1000 | course_list_path = os.path.join(DL_LOCATION, "course_list.html") 1001 | url = f"{api_url}/courses/" 1002 | 1003 | if _download_page_if_not_exists(url, course_list_path, cookies_path, verbose=verbose): 1004 | return 1 1005 | return 0 1006 | 1007 | def downloadCourseHomePageHTML(api_url, course_view, cookies_path, verbose=False): 1008 | if not cookies_path or stop_html_downloads: 1009 | return 0 1010 | 1011 | dl_dir = os.path.join(DL_LOCATION, course_view.term, course_view.course_code) 1012 | homepage_path = os.path.join(dl_dir, "homepage.html") 1013 | url = f"{api_url}/courses/{course_view.course_id}" 1014 | 1015 | if _download_page_if_not_exists(url, homepage_path, cookies_path, verbose=verbose): 1016 | return 1 1017 | return 0 1018 | 1019 | def downloadCourseGradesHTML(api_url, course_view, cookies_path, verbose=False): 1020 | if not cookies_path or stop_html_downloads: 1021 | return 
0 1022 | 1023 | dl_dir = os.path.join(DL_LOCATION, course_view.term, 1024 | course_view.course_code) 1025 | grades_path = os.path.join(dl_dir, "grades.html") 1026 | url = f"{api_url}/courses/{course_view.course_id}/grades" 1027 | additional_args=("--remove-hidden-elements=false",) 1028 | 1029 | if _download_page_if_not_exists(url, grades_path, cookies_path, additional_args, verbose=verbose): 1030 | # We only proceed with BeautifulSoup modifications if the file was newly downloaded or already existed. 1031 | with open(grades_path, "r+t", encoding="utf-8") as grades_file: 1032 | grades_html = BeautifulSoup(grades_file, "html.parser") 1033 | 1034 | button = grades_html.select_one("#show_all_details_button") 1035 | if button is not None: 1036 | button_class = button.get_attribute_list("class", []) 1037 | if "showAll" not in button_class: 1038 | button_class.append("showAll") 1039 | button["class"] = button_class 1040 | button.string = "Hide All Details" # Unfortunately this cannot handle i18n. 

            assignments = grades_html.select("tr.student_assignment.editable")
            for assignment in assignments:
                assignment_id = str(assignment.get("id", "")).removeprefix("submission_")
                muted = str(assignment.get("data-muted", "")).casefold() in {"true"}
                if not muted:
                    # Un-hide the detail rows that belong to this assignment
                    for element in itertools.chain(
                        grades_html.select(f"#comments_thread_{assignment_id}"),
                        grades_html.select(f"#rubric_{assignment_id}"),
                        grades_html.select(f"#grade_info_{assignment_id}"),
                        grades_html.select(f"#final_grade_info_{assignment_id}"),
                        grades_html.select(f".parent_assignment_id_{assignment_id}"),
                    ):
                        element_style = str(element.get("style", ""))
                        element_style = re.sub(r"display:\s*none", "", element_style)
                        element["style"] = element_style

                    assignment_arrow = grades_html.select_one(f"#parent_assignment_id_{assignment_id} i")
                    if assignment_arrow is not None:
                        assignment_arrow_class = assignment_arrow.get_attribute_list("class", [])
                        # Guard both steps: on a re-run the saved file has already been
                        # rewritten, and list.remove() raises ValueError for a missing class.
                        if "icon-arrow-open-end" in assignment_arrow_class:
                            assignment_arrow_class.remove("icon-arrow-open-end")
                        if "icon-arrow-open-down" not in assignment_arrow_class:
                            assignment_arrow_class.append("icon-arrow-open-down")
                        assignment_arrow["class"] = assignment_arrow_class

            grades_file.seek(0)
            grades_file.write(grades_html.prettify(formatter="html"))
            grades_file.truncate()
        return 1
    return 0

def downloadAssignmentPages(api_url, course_view, cookies_path, verbose=False):
    pages_saved = 0
    if not cookies_path or not course_view.assignments or stop_html_downloads:
        return pages_saved

    base_assign_dir = os.path.join(DL_LOCATION, course_view.term,
                                   course_view.course_code, "assignments")

    # Download assignment list page
    assignment_list_path = os.path.join(base_assign_dir, "assignment_list.html")
    list_url = f"{api_url}/courses/{course_view.course_id}/assignments/"
    if _download_page_if_not_exists(list_url,
assignment_list_path, cookies_path, verbose=verbose): 1083 | pages_saved += 1 1084 | 1085 | for assignment in course_view.assignments: 1086 | assignment_title = makeValidFilename(str(assignment.title)) 1087 | assignment_title = shortenFileName(assignment_title, len(assignment_title) - MAX_FOLDER_NAME_SIZE) 1088 | assign_dir = os.path.join(base_assign_dir, assignment_title) 1089 | 1090 | if assignment.html_url: 1091 | assignment_page_path = os.path.join(assign_dir, "assignment.html") 1092 | if _download_page_if_not_exists(assignment.html_url, assignment_page_path, cookies_path, verbose=verbose): 1093 | pages_saved += 1 1094 | 1095 | for submission in assignment.submissions: 1096 | submission_dir = assign_dir 1097 | 1098 | if len(assignment.submissions) != 1: 1099 | submission_dir = os.path.join(assign_dir, str(submission.user_id)) 1100 | 1101 | if submission.preview_url: 1102 | submission_page_path = os.path.join(submission_dir, "submission.html") 1103 | if _download_page_if_not_exists(submission.preview_url, submission_page_path, cookies_path, verbose=verbose): 1104 | pages_saved += 1 1105 | 1106 | if (submission.attempt and submission.attempt > 1 and assignment.updated_url and assignment.html_url 1107 | and assignment.html_url.rstrip("/") != assignment.updated_url.rstrip("/")): 1108 | attempts_dir = os.path.join(assign_dir, "attempts") 1109 | 1110 | for i in range(submission.attempt): 1111 | filename = f"attempt_{i+1}.html" 1112 | attempt_path = os.path.join(attempts_dir, filename) 1113 | attempt_url = f"{assignment.updated_url}/history?version={i+1}" 1114 | if _download_page_if_not_exists(attempt_url, attempt_path, cookies_path, verbose=verbose): 1115 | pages_saved += 1 1116 | return pages_saved 1117 | 1118 | def downloadCourseModulePages(api_url, course_view, cookies_path, verbose=False): 1119 | pages_saved = 0 1120 | if not cookies_path or not course_view.modules or stop_html_downloads: 1121 | return pages_saved 1122 | 1123 | modules_dir = 
os.path.join(DL_LOCATION, course_view.term, 1124 | course_view.course_code, "modules") 1125 | 1126 | # Downloads the modules page 1127 | module_list_path = os.path.join(modules_dir, "modules_list.html") 1128 | list_url = f"{api_url}/courses/{course_view.course_id}/modules/" 1129 | if _download_page_if_not_exists(list_url, module_list_path, cookies_path, verbose=verbose): 1130 | pages_saved += 1 1131 | 1132 | for module in course_view.modules: 1133 | for item in module.items: 1134 | module_name = makeValidFilename(str(module.name)) 1135 | module_name = shortenFileName(module_name, len(module_name) - MAX_FOLDER_NAME_SIZE) 1136 | items_dir = os.path.join(modules_dir, module_name) 1137 | 1138 | if item.url: 1139 | filename = makeValidFilename(str(item.title)) + ".html" 1140 | module_item_path = os.path.join(items_dir, filename) 1141 | if _download_page_if_not_exists(item.url, module_item_path, cookies_path, verbose=verbose): 1142 | pages_saved += 1 1143 | return pages_saved 1144 | 1145 | def downloadCourseAnnouncementPages(api_url, course_view, cookies_path, verbose=False): 1146 | pages_saved = 0 1147 | if not cookies_path or not course_view.announcements or stop_html_downloads: 1148 | return pages_saved 1149 | 1150 | base_announce_dir = os.path.join(DL_LOCATION, course_view.term, 1151 | course_view.course_code, "announcements") 1152 | 1153 | # Download announcement list 1154 | announcement_list_path = os.path.join(base_announce_dir, "announcement_list.html") 1155 | list_url = f"{api_url}/courses/{course_view.course_id}/announcements/" 1156 | if _download_page_if_not_exists(list_url, announcement_list_path, cookies_path, verbose=verbose): 1157 | pages_saved += 1 1158 | 1159 | for announcement in course_view.announcements: 1160 | if not announcement.url: 1161 | continue 1162 | 1163 | announcements_title = makeValidFilename(str(announcement.title)) 1164 | announcements_title = shortenFileName(announcements_title, len(announcements_title) - MAX_FOLDER_NAME_SIZE) 1165 | 
announce_dir = os.path.join(base_announce_dir, announcements_title) 1166 | 1167 | if not os.path.exists(announce_dir): 1168 | os.makedirs(announce_dir) 1169 | 1170 | for i in range(announcement.amount_pages): 1171 | filename = f"announcement_{i+1}.html" 1172 | page_path = os.path.join(announce_dir, filename) 1173 | page_url = f"{announcement.url}/page-{i+1}" 1174 | if _download_page_if_not_exists(page_url, page_path, cookies_path, verbose=verbose): 1175 | pages_saved += 1 1176 | return pages_saved 1177 | 1178 | def downloadCourseDiscussionPages(api_url, course_view, cookies_path, verbose=False): 1179 | pages_saved = 0 1180 | if not cookies_path or not course_view.discussions or stop_html_downloads: 1181 | return pages_saved 1182 | 1183 | base_discussion_dir = os.path.join(DL_LOCATION, course_view.term, 1184 | course_view.course_code, "discussions") 1185 | 1186 | # Download discussion list 1187 | discussion_list_path = os.path.join(base_discussion_dir, "discussion_list.html") 1188 | list_url = f"{api_url}/courses/{course_view.course_id}/discussion_topics/" 1189 | if _download_page_if_not_exists(list_url, discussion_list_path, cookies_path, verbose=verbose): 1190 | pages_saved += 1 1191 | 1192 | for discussion in course_view.discussions: 1193 | if not discussion.url: 1194 | continue 1195 | 1196 | discussion_title = makeValidFilename(str(discussion.title)) 1197 | discussion_title = shortenFileName(discussion_title, len(discussion_title) - MAX_FOLDER_NAME_SIZE) 1198 | discussion_dir = os.path.join(base_discussion_dir, discussion_title) 1199 | 1200 | if not os.path.exists(discussion_dir): 1201 | os.makedirs(discussion_dir) 1202 | 1203 | for i in range(discussion.amount_pages): 1204 | filename = f"discussion_{i+1}.html" 1205 | page_path = os.path.join(discussion_dir, filename) 1206 | page_url = f"{discussion.url}/page-{i+1}" 1207 | if _download_page_if_not_exists(page_url, page_path, cookies_path, verbose=verbose): 1208 | pages_saved += 1 1209 | return pages_saved 1210 | 
1211 | if __name__ == "__main__": 1212 | 1213 | print("Welcome to the Canvas Student Data Export Tool\n") 1214 | 1215 | parser = argparse.ArgumentParser(description="Export nearly all of a student's Canvas LMS data.") 1216 | parser.add_argument("-c", "--config", default="credentials.yaml", help="Path to YAML credentials file (default: credentials.yaml)") 1217 | parser.add_argument("-o", "--output", default="./output", help="Directory to store exported data (default: ./output)") 1218 | parser.add_argument("--singlefile", action="store_true", help="Enable HTML snapshot capture with SingleFile.") 1219 | parser.add_argument("-v", "--verbose", action="store_true", help="Enable verbose output for debugging.") 1220 | parser.add_argument("--version", action="version", version="Canvas Student Data Export Tool 1.0") 1221 | 1222 | args = parser.parse_args() 1223 | 1224 | # Load credentials from YAML 1225 | creds = _load_credentials(args.config) 1226 | 1227 | # Validate credentials 1228 | required = ["API_URL", "API_KEY", "USER_ID"] 1229 | missing = [k for k in required if not creds.get(k)] 1230 | 1231 | # COOKIES_PATH is required if singlefile is active, but it can be missing. 1232 | if args.singlefile: 1233 | print("Note: --singlefile is enabled. 
Please ensure your browser cookies") 1234 | print(" are fresh by logging into Canvas and then re-exporting") 1235 | print(" them using the chrome extension right before running this script.\n") 1236 | input("Press Enter to continue...") 1237 | if "COOKIES_PATH" not in creds or not creds["COOKIES_PATH"]: 1238 | missing.append("COOKIES_PATH") 1239 | 1240 | if missing: 1241 | print(f"Error: {args.config} is missing required field(s): {', '.join(missing)}.") 1242 | print("Please create the YAML file with the following structure:\n" 1243 | "API_URL: https://.instructure.com\n" 1244 | "API_KEY: \n" 1245 | "USER_ID: 123456\n" 1246 | "COOKIES_PATH: path/to/cookies.txt\n") 1247 | sys.exit(1) 1248 | 1249 | # Populate globals expected throughout the script 1250 | API_URL = creds["API_URL"].strip().rstrip('/') 1251 | API_KEY = creds["API_KEY"].strip() # Remove leading/trailing whitespace which is a common issue 1252 | USER_ID = creds["USER_ID"] 1253 | # Use .get() to safely access optional/conditionally required keys 1254 | COOKIES_PATH = creds.get("COOKIES_PATH", "") 1255 | COURSES_TO_SKIP = creds.get("COURSES_TO_SKIP", []) 1256 | 1257 | chrome_path_override = creds.get("CHROME_PATH") 1258 | if chrome_path_override: 1259 | override_chrome_path(chrome_path_override) 1260 | 1261 | # Optional: Override SingleFile capture timeout (in seconds) 1262 | singlefile_timeout_override = creds.get("SINGLEFILE_TIMEOUT") 1263 | if singlefile_timeout_override is not None: 1264 | try: 1265 | override_singlefile_timeout(float(singlefile_timeout_override)) 1266 | except (ValueError, TypeError): 1267 | print(f"Warning: Invalid SINGLEFILE_TIMEOUT value in {args.config}; using default.") 1268 | 1269 | # Update output directory 1270 | DL_LOCATION = args.output 1271 | 1272 | print("\nConnecting to Canvas…\n") 1273 | 1274 | # Initialize a new Canvas object 1275 | canvas = Canvas(API_URL, API_KEY) 1276 | 1277 | # Test the connection and API key 1278 | try: 1279 | user = canvas.get_current_user() 1280 
        print(f"Successfully authenticated as: {user.name} (ID: {user.id})")
        # Compare as strings: USER_ID comes from YAML and may be an int or a quoted string
        if str(user.id) != str(USER_ID):
            print(f"Warning: Authenticated user ID ({user.id}) does not match configured USER_ID ({USER_ID})")
    except Exception as e:
        error_type, message = CanvasErrorHandler.handle_canvas_exception(
            e, "Canvas authentication"
        )
        if CanvasErrorHandler.is_fatal_error(error_type):
            print(f"FATAL: {message}")
            sys.exit(1)
        else:
            CanvasErrorHandler.log_error(error_type, message, verbose=args.verbose)

    print(f"Creating output directory: {DL_LOCATION}\n")
    os.makedirs(DL_LOCATION, exist_ok=True)

    all_courses_views = []

    print("Getting list of all courses\n")
    courses_list = [
        canvas.get_courses(enrollment_state="active", include="term"),
        canvas.get_courses(enrollment_state="completed", include="term")
    ]

    # COURSES_TO_SKIP may be present but empty (None) in the YAML file
    skip = set(COURSES_TO_SKIP or [])


    if COOKIES_PATH and args.singlefile:
        print(" Downloading course list page")
        downloadCourseHTML(API_URL, COOKIES_PATH, verbose=args.verbose)

    for courses in courses_list:
        for course in courses:
            if course.id in skip or not hasattr(course, "name") or not hasattr(course, "term"):
                continue

            html_pages_saved_in_course = 0

            course_view = getCourseView(course)

            all_courses_views.append(course_view)

            print(" Downloading all files")
            downloadCourseFiles(course, course_view)

            print(" Downloading submission attachments")
            download_submission_attachments(course, course_view)

            print(" Getting modules and downloading module files")
            course_view.modules = findCourseModules(course, course_view)

            if COOKIES_PATH and args.singlefile:
                print(" Downloading course home page")
                html_pages_saved_in_course += 
downloadCourseHomePageHTML(API_URL, course_view, COOKIES_PATH, verbose=args.verbose) 1334 | 1335 | print(" Downloading course grades") 1336 | html_pages_saved_in_course += downloadCourseGradesHTML(API_URL, course_view, COOKIES_PATH, verbose=args.verbose) 1337 | 1338 | print(" Downloading assignment pages") 1339 | html_pages_saved_in_course += downloadAssignmentPages(API_URL, course_view, COOKIES_PATH, verbose=args.verbose) 1340 | 1341 | print(" Downloading course module pages") 1342 | html_pages_saved_in_course += downloadCourseModulePages(API_URL, course_view, COOKIES_PATH, verbose=args.verbose) 1343 | 1344 | print(" Downloading course announcements pages") 1345 | html_pages_saved_in_course += downloadCourseAnnouncementPages(API_URL, course_view, COOKIES_PATH, verbose=args.verbose) 1346 | 1347 | print(" Downloading course discussion pages") 1348 | html_pages_saved_in_course += downloadCourseDiscussionPages(API_URL, course_view, COOKIES_PATH, verbose=args.verbose) 1349 | 1350 | print(" Exporting all course data") 1351 | exportAllCourseData(course_view) 1352 | 1353 | # Show mini-summary for this course 1354 | assignments_count = len(course_view.assignments) 1355 | submissions_count = sum(len(a.submissions) for a in course_view.assignments) 1356 | modules_count = len(course_view.modules) 1357 | pages_count = len(course_view.pages) 1358 | announcements_count = len(course_view.announcements) 1359 | discussions_count = len(course_view.discussions) 1360 | 1361 | print(f" ✓ Course data exported:") 1362 | print(f" • {assignments_count} assignments with {submissions_count} submissions (JSON)") 1363 | print(f" • {modules_count} modules (JSON)") 1364 | print(f" • {pages_count} pages (JSON)") 1365 | print(f" • {announcements_count} announcements (JSON)") 1366 | print(f" • {discussions_count} discussions (JSON)") 1367 | if COOKIES_PATH and args.singlefile: 1368 | print(f" • {html_pages_saved_in_course} HTML snapshots saved") 1369 | print() 1370 | 1371 | print("Exporting data 
from all courses combined as one file: " 1372 | "all_output.json") 1373 | json_str = jsonpickle.encode(all_courses_views, unpicklable=False, indent=4) 1374 | 1375 | all_output_path = os.path.join(DL_LOCATION, "all_output.json") 1376 | 1377 | with open(all_output_path, "w") as out_file: 1378 | out_file.write(json_str) 1379 | 1380 | extraction_stats.json_files_created += 1 1381 | print(f"Combined JSON data exported to: {all_output_path}") 1382 | 1383 | print("\nProcess complete. All canvas data exported!") 1384 | print(extraction_stats.summary(DL_LOCATION, singlefile_enabled=args.singlefile)) 1385 | --------------------------------------------------------------------------------
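The credential checks at the start of this part of export.py can be sketched in isolation, as below. This is a minimal illustration and not the script's own code: the helper names `find_missing_fields` and `parse_timeout` are hypothetical, and treating `API_URL`, `API_KEY`, and `USER_ID` as always required (with `COOKIES_PATH` required only for HTML snapshots) is an assumption inferred from the error message and the direct `creds[...]` lookups.

```python
from typing import Any, List, Mapping, Optional

# Assumption: these three keys are always required; the script's real list
# is defined outside the code shown here.
REQUIRED_FIELDS = ("API_URL", "API_KEY", "USER_ID")


def find_missing_fields(creds: Mapping[str, Any], singlefile: bool = False) -> List[str]:
    """Return the required keys that are absent or empty in creds."""
    required = list(REQUIRED_FIELDS)
    if singlefile:
        # Cookies are only needed when HTML snapshots are requested.
        required.append("COOKIES_PATH")
    return [name for name in required if not creds.get(name)]


def parse_timeout(value: Any, default: Optional[float] = None) -> Optional[float]:
    """Convert a SINGLEFILE_TIMEOUT-style value to float, falling back to default."""
    if value is None:
        return default
    try:
        return float(value)
    except (ValueError, TypeError):
        return default


if __name__ == "__main__":
    creds = {"API_URL": "https://example.invalid", "API_KEY": "token", "USER_ID": 123456}
    print(find_missing_fields(creds, singlefile=True))
    print(parse_timeout("90"))
```

Collecting every missing key before exiting, rather than failing on the first one, lets the user fix the whole YAML file in a single pass.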