├── .gitignore ├── assets ├── application-tab.png ├── obtaining-token.png ├── sequential-ids.png ├── discord-icon.svg └── youtube-icon.svg ├── README.md └── canvas.py /.gitignore: -------------------------------------------------------------------------------- 1 | env/ 2 | __pycache__/ 3 | output/ 4 | **.DS_Store 5 | -------------------------------------------------------------------------------- /assets/application-tab.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/erict963/canvas-tool/HEAD/assets/application-tab.png -------------------------------------------------------------------------------- /assets/obtaining-token.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/erict963/canvas-tool/HEAD/assets/obtaining-token.png -------------------------------------------------------------------------------- /assets/sequential-ids.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/erict963/canvas-tool/HEAD/assets/sequential-ids.png -------------------------------------------------------------------------------- /assets/discord-icon.svg: -------------------------------------------------------------------------------- 1 | -------------------------------------------------------------------------------- /assets/youtube-icon.svg: -------------------------------------------------------------------------------- 1 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Canvas File Explorer 2 | 3 | A tool to explore a canvas course, and potentially find homework solutions and exams from previous semesters. **This does not work for every course.** This only works if a professor re-initializes a course that he/she has taught before. 
4 | 5 | **[NEW - WebUI that executes this exact procedure](https://studysavers.com/schoolhacks/canvas)** 6 | 7 | **[![YouTube Icon](assets/youtube-icon.svg) Video demo](https://youtu.be/7f0Lu8lJ3iI)** 8 | 9 | **[![Discord Icon](assets/discord-icon.svg) Discord Server for tech support](https://discord.gg/k7yNftGEAA)** 10 | 11 | ## Disclaimer 12 | This tool is not intended for malicious use. You will be sending network requests to Canvas servers at a rate faster than a human user would to accomplish the "scan". By setting `--num-files` and `--num-workers` to very high values, you may inadvertently launch a denial-of-service attack on the target server. This is **NOT** the intended use of this tool. While Canvas has built-in protections against such attacks, it is your responsibility to ensure that your use of this tool complies with all applicable laws and terms of service. Use this tool responsibly and ethically. 13 | 14 | ## Why does this work? 15 | 16 | This script works because Canvas uses a sequential numbering system for file IDs. This is known as an **auto-incrementing primary key**. When a file is uploaded to Canvas, it is assigned a unique ID that is one greater than the previous file's ID. This means that the IDs are sequential and predictable. So if you know the ID of one file, you can easily find the IDs of other files by simply incrementing or decrementing the known ID. 17 | As an example, consider this: 18 | 19 | ``` 20 | https://canvas.example.edu/courses/123/files/455000 // Homework 10 Solutions 21 | https://canvas.example.edu/courses/123/files/455001 // Midterm Exam 22 | https://canvas.example.edu/courses/123/files/455002 // Final Exam Solutions 23 | 24 | 25 | https://canvas.example.edu/courses/123/files/456000 // Syllabus 26 | ``` 27 | 28 | If you have the location of the syllabus (`456000`), you can easily find the location of the other files by simply decrementing the file ID. This is what this script does. 
It starts at a given file ID and checks as many files as you want (specified by the `--num-files` flag) by decrementing the file ID. 29 | 30 | Our goal is to find the old files when a professor re-initializes a course. If a professor re-initializes a course, the old file IDs will appear concentrated and in close proximity due to the sequential, auto-incrementing nature of the file IDs in the database. 31 | 32 | ![Sequential IDs](assets/sequential-ids.png) 33 | 34 | ## Prerequisites 35 | 36 | - Install Python 3 ([Awesome Tutorial](https://realpython.com/installing-python/)) 37 | - Open a terminal (On macOS, press `Command + Space` and type `Terminal`. On Windows, press `Windows + R` and type `cmd`) 38 | - Clone this repo and navigate to the directory where you cloned it. You can do this by running the following commands in your terminal: 39 | 40 | ```bash 41 | git clone https://github.com/erict963/canvas-tool.git 42 | cd canvas-tool 43 | ``` 44 | 45 | This script was built with only the standard library, so no additional packages are required! You should be able to run it without any additional installations. 46 | 47 | ## Usage 48 | 49 | ### Example 50 | 51 | The following example command will start at `https://canvas.example.edu/courses/123/files/456000` 52 | and search all the URLs from `456000` to `(456000 - 10000 = 446000)`. 53 | 54 | Example command to paste into your terminal: 55 | 56 | ``` 57 | python3 canvas.py --canvas-session Ggx-OQY... \ 58 | --url https://canvas.example.edu/courses/123/files/456000 \ 59 | --num-files 10000 \ 60 | --use-api 61 | ``` 62 | 63 | The `--use-api` flag is optional. It will use the Canvas API to get file names instead of the default method, which is to scrape the frontend page. The API method is faster, but it may not work for all files. See the Canvas API section on why this is the case. 
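As a rough illustration of the range described in the example above, the sketch below derives the candidate URLs from a starting URL. The helper name `candidate_urls` is hypothetical and is not part of `canvas.py`. It makes no network requests and simply follows the README's description of the range, so the script's exact handling of the endpoints may differ.

```python
import re
from urllib.parse import urlparse


def candidate_urls(url: str, num_files: int, use_api: bool = False):
    """Yield URLs from the starting file ID down to (start - num_files).

    Hypothetical helper for illustration only; it builds strings and
    never contacts a server.
    """
    url = url.rstrip('/')  # tolerate a trailing slash
    match = re.search(r'/files/(\d+)$', url)
    if not match:
        raise ValueError('URL must end with /files/{file_id}')

    start = int(match.group(1))
    stop = start - num_files
    if stop < 0:
        raise ValueError('num_files is larger than the starting file ID')

    parsed = urlparse(url)
    # Drop the trailing "/{file_id}" so the path ends in ".../files".
    base_path = parsed.path[:parsed.path.rfind('/')]
    # The API variant only differs by the "/api/v1" prefix on the path.
    prefix = '/api/v1' if use_api else ''
    base = f'{parsed.scheme}://{parsed.netloc}{prefix}{base_path}'

    for file_id in range(start, stop - 1, -1):
        yield f'{base}/{file_id}'


if __name__ == '__main__':
    example = 'https://canvas.example.edu/courses/123/files/456000'
    for candidate in candidate_urls(example, num_files=3, use_api=True):
        print(candidate)
```

Writing it as a generator means a large `--num-files` value does not require holding every URL in memory at once.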
64 | 65 | So to be perfectly clear, the above command will check these URLs: 66 | 67 | ``` 68 | https://canvas.example.edu/api/v1/courses/123/files/456000 69 | https://canvas.example.edu/api/v1/courses/123/files/455999 70 | https://canvas.example.edu/api/v1/courses/123/files/455998 71 | ... 72 | https://canvas.example.edu/api/v1/courses/123/files/446000 73 | ``` 74 | 75 | If you do not pass in the `--use-api` flag, it will resort to checking the frontend URLs instead. The URLs will look like this: 76 | 77 | ``` 78 | https://canvas.example.edu/courses/123/files/456000 79 | https://canvas.example.edu/courses/123/files/455999 80 | https://canvas.example.edu/courses/123/files/455998 81 | ... 82 | https://canvas.example.edu/courses/123/files/446000 83 | ``` 84 | 85 | **Important:** The script assumes that the URL matches this ending pattern: `/files/{file_id}` where `{file_id}` is an integer. 86 | 87 | ### Full Command Line Options 88 | 89 | To see all the command line options, you can run the following command: 90 | 91 | ``` 92 | python3 canvas.py -h 93 | ``` 94 | 95 | This will show you all the available options and their descriptions. Here is a summary of the options: 96 | 97 | ``` 98 | usage: canvas.py [-h] -u URL [-f NUM_FILES] [-s CANVAS_SESSION] [-w NUM_WORKERS] 99 | [-l LOG_EVERY] [--use-api] 100 | 101 | Canvas file sweeper 102 | 103 | options: 104 | -h, --help show this help message and exit 105 | -u, --url URL The URL of the file to start from, e.g. 106 | https://canvas.example.edu/courses/123/files/456 107 | -f, --num-files NUM_FILES 108 | Number of files to scan (default 10000, min 1) 109 | -s, --canvas-session CANVAS_SESSION 110 | The Canvas API canvas_session, provided as an environment 111 | variable or command line argument. If not provided, the 112 | script will use the CANVAS_SESSION environment variable. 
113 | -w, --num-workers NUM_WORKERS 114 | Number of threads to use (default 16, max 32) 115 | -l, --log-every LOG_EVERY 116 | Log every (X) files found (default 1000, min 1) 117 | --use-api Experimental: Use the Canvas API instead of the frontend 118 | (default: False) - this will be faster but may not 119 | necessarily find all files. See README for more details. 120 | ``` 121 | 122 | ## Canvas session 123 | 124 | The `canvas_session` is what allows this Python script to check for your Canvas files. It's like a password (hence why you **shouldn't share it with anyone**). Whenever you log into Canvas, this `canvas_session` is created and stored securely in your browser. Applications use tokens to authenticate users and authorize access to resources. 125 | 126 | In short, it gives the script permissions to act on your behalf. Don't worry, all this script is doing is checking if your professor's files are available or not. See the code in `canvas.py` for more details, and feel free to ask ChatGPT if you think any lines are suspicious. 127 | 128 | ### Obtaining 129 | 130 | To obtain a `canvas_session`, first make sure you're logged into your school's Canvas. Then, visit any page in Canvas. For example, go to 131 | 132 | ``` 133 | https://canvas.example.edu/courses/123/files/456 134 | ``` 135 | 136 | Then right-click on the page and select "Inspect". This will open the developer tools. In the developer tools, select the "Application" tab. 137 | 138 | ![Application Tab](assets/application-tab.png) 139 | 140 | On the left side, you should see a list of items. Click on "Cookies" and then select your school's Canvas domain. You should see a list of cookies. Look for a cookie called `canvas_session`. Copy the value of this cookie. 
141 | 142 | ![Obtaining Token](assets/obtaining-token.png) 143 | 144 | ### Using an environment variable 145 | 146 | If you prefer not to enter the `canvas_session` with the `--canvas-session` flag every time, you can export it as an environment variable. This is done by running the following command in your terminal: 147 | 148 | ```bash 149 | export CANVAS_SESSION=Ggx-OQY... 150 | ``` 151 | 152 | Now, you can run the script without the `--canvas-session` flag. The script will automatically use the `canvas_session` from the environment variable each time you run it. 153 | 154 | ```bash 155 | python3 canvas.py --url https://canvas.example.edu/courses/123/files/456000 --num-files 10000 --use-api 156 | ``` 157 | 158 | ## Canvas API 159 | 160 | ### Why use the API? 161 | 162 | Great question! You're asking what's the difference between visiting this URL: 163 | 164 | `https://canvas.example.edu/api/v1/courses/123/files/456000` versus this URL `https://canvas.example.edu/courses/123/files/456000` 165 | 166 | Let's take a look at the API URL first. If you enter `https://canvas.example.edu/api/v1/courses/123/files/456000` into your browser (just an example, but do try it with your courses!), you should see something like this (truncated for brevity): 167 | 168 | ```json 169 | { 170 | "id": 123, 171 | "uuid": "tRuCMqv9QumS9OqBhXe2gPs0SdtUl4RHFcY5hdmo", 172 | ... 173 | } 174 | ``` 175 | 176 | Now, let's take a look at the frontend URL. Simply delete the `api/v1/` part of the URL and enter it into your browser. 177 | So the URL would look like this: `https://canvas.example.edu/courses/123/files/456000`, and the response would be something like this (truncated for brevity): 178 | 179 | ```html 180 | 181 | 182 | 183 | ... 184 | 185 | 186 | ``` 187 | 188 | The main difference is the **size** of the responses. In a typical case, the API response is about `300B` while the frontend response is about `13kB`. 
This means the server will take longer to send the frontend response. 189 | 190 | ### Why does the API "miss" some files (not necessarily a bad thing)? 191 | 192 | I observed this "missing" behavior empirically, so this explanation is slightly speculative. Please let me know if you have a better explanation. 193 | 194 | If you go to the [actual code of how Canvas is implemented](https://github.com/instructure/canvas-lms/blob/ddaaa0089cb3e83783056404d44106527dfe5ef1/app/models/attachment.rb#L1756), 195 | 196 | you may see something like this: 197 | 198 | ```ruby 199 | def destroy_content_and_replace(deleted_by_user = nil) 200 | ``` 201 | 202 | This means that when a professor re-initializes a course, the old files may start like this: 203 | 204 | ``` 205 | https://canvas.example.edu/courses/123/files/1 // Homework 10 Solutions 206 | https://canvas.example.edu/courses/123/files/2 // Midterm Exam 207 | https://canvas.example.edu/courses/123/files/3 // Final Exam Solutions 208 | ``` 209 | 210 | However, now let's say the professor makes a correction to Homework 10 Solutions before making them available to everyone. Because the old content is destroyed and its replacement is given a new ID, and because the IDs are sequential, the new file may look like this: 211 | 212 | ``` 213 | https://canvas.example.edu/courses/123/files/4 // Updated Homework 10 Solutions 214 | ``` 215 | 216 | So the API no longer reports a file at ID `1`; it reports the file at ID `4` instead. 217 | 218 | However, the frontend will still show the old file ID `1` as a valid file, and redirect to the new file ID `4` via the API. 219 | 220 | This is why using this script to search via the API may "miss" some files: the frontend relies on the API to report the correct data. Note that the frontend search will always "hit" every ID that exists. 221 | 222 | This is not necessarily a bad thing, because our entire goal is to get the "old files" that aren't currently available to the public. 
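The explanation above can be modeled with a toy in-memory store. This is only a sketch of the speculative behavior described in this section, not how Canvas is actually implemented; `ToyFileStore` and all of its methods are hypothetical names, and the two lookup methods encode the assumed difference between the API and the frontend.

```python
class ToyFileStore:
    """Hypothetical file table with auto-incrementing IDs."""

    def __init__(self):
        self.next_id = 1
        self.active = {}       # file_id -> display name
        self.replaced_by = {}  # old file_id -> new file_id

    def upload(self, name: str) -> int:
        """Store a file under the next sequential ID."""
        file_id = self.next_id
        self.next_id += 1
        self.active[file_id] = name
        return file_id

    def replace(self, old_id: int, name: str) -> int:
        """Destroy the old content and store the replacement under a new ID."""
        new_id = self.upload(name)
        del self.active[old_id]
        self.replaced_by[old_id] = new_id
        return new_id

    def api_lookup(self, file_id: int):
        """Assumed API behavior: only IDs that are still active resolve."""
        return self.active.get(file_id)

    def frontend_lookup(self, file_id: int):
        """Assumed frontend behavior: follow replacements, then resolve."""
        while file_id in self.replaced_by:
            file_id = self.replaced_by[file_id]
        return self.active.get(file_id)


if __name__ == '__main__':
    store = ToyFileStore()
    homework = store.upload('Homework 10 Solutions')  # gets ID 1
    store.upload('Midterm Exam')                      # gets ID 2
    store.upload('Final Exam Solutions')              # gets ID 3
    store.replace(homework, 'Updated Homework 10 Solutions')  # gets ID 4

    print(store.api_lookup(homework))       # the API "misses" the old ID
    print(store.frontend_lookup(homework))  # the frontend follows the redirect
```

In this model, a sweep through the API counts ID `1` as a miss, while a sweep through the frontend counts it as a hit that resolves to the file stored at ID `4`.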
223 | 224 | ## Using the `--forward` flag 225 | 226 | The `--forward` flag allows you to search for files in the forward direction. This means that instead of starting at a given file ID and decrementing, it will start at a given file ID and increment. Using the above example, if you start at `456000`, it will check the following URLs: 227 | 228 | ``` 229 | https://canvas.example.edu/api/v1/courses/123/files/456001 230 | https://canvas.example.edu/api/v1/courses/123/files/456002 231 | ... 232 | https://canvas.example.edu/api/v1/courses/123/files/466000 // Potential Final Exam that hasn't been released yet 233 | ``` 234 | -------------------------------------------------------------------------------- /canvas.py: -------------------------------------------------------------------------------- 1 | import argparse 2 | import json 3 | import logging 4 | import os 5 | import re 6 | import time 7 | import urllib.request 8 | import urllib.error 9 | 10 | from http.cookiejar import CookieJar 11 | from queue import Queue 12 | from threading import Thread, Event 13 | from urllib.parse import urlparse 14 | 15 | logging.basicConfig(level=logging.INFO, format='[%(asctime)s %(levelname)s] %(message)s', datefmt='%I:%M%p') 16 | 17 | def parse_response(response: urllib.response): 18 | """ 19 | Helpers to parse the response from the Canvas API. May be html or json. 
20 | """ 21 | if 'application/json' in response.headers.get('Content-Type'): 22 | return json.loads(response.read().decode('utf-8')) 23 | elif 'text/html' in response.headers.get('Content-Type'): # need to parse the html, hence why it's slower than api (bigger payloads) 24 | ret = response.read().decode('utf-8') 25 | display_name = re.search(r'Download (.*?)', ret) 26 | if not display_name: 27 | raise Exception('Error parsing display name from response') 28 | display_name = display_name.group(1) 29 | 30 | path = re.search(r'> sweep_files(start=10, increment=2, url='https://example.com/files', stop=0) 60 | would create the following threads: 61 | Thread 1: will explore the odd file_ids: 62 | https://example.com/files/9 63 | https://example.com/files/7 64 | ... 65 | Thread 2: will explore the even file_ids: 66 | https://example.com/files/8 67 | https://example.com/files/6 68 | ... 69 | Both will stop when one of them reaches file_id = 0. That is, 70 | Thread 1 will stop at file_id = 1 71 | Thread 2 will stop at file_id = 0 72 | 73 | By decrementing by the number of threads, we guarantee that 74 | we can explore all file_ids in the range (stop, start]. 75 | 76 | If the `forward` argument is set to True, the function will 77 | explore the file_ids in the range (start, stop], incrementing by the number of threads. 78 | """ 79 | if kwargs.get('forward'): 80 | assert start < stop, f"Starting id ({start}) must be less than stopping id ({stop})." 81 | else: 82 | assert start > stop, f"Starting id ({start}) must be greater than stopping id ({stop})." 83 | assert stop >= 0, f"Cannot have negative file ids (ending id: {stop}). 
Try decreasing --num-files" 84 | assert increment > 0, "increment must be greater than 0" 85 | 86 | stop_signal = Event() 87 | queue = Queue() 88 | 89 | def process(start: int, increment: int, stop: int, url: str): 90 | start_time = time.time() 91 | for i in range(start, stop, increment): 92 | if stop_signal.is_set(): return 93 | 94 | try: 95 | request = urllib.request.Request(f'{url}/{i}') 96 | response = urllib.request.urlopen(request) 97 | 98 | response_data = parse_response(response) 99 | 100 | logging.info(f'FOUND: {url}/{i}') 101 | 102 | queue.put({ 103 | 'id': i, 104 | 'url': f"{url.replace('/api/v1', '')}/{i}", 105 | 'display_name': response_data.get('display_name'), 106 | 'created_at': response_data.get('created_at'), 107 | 'download_url': response_data.get('url') 108 | }) 109 | 110 | except urllib.error.HTTPError as e: 111 | if e.code != 404: # we expect majority of errors to be 404 112 | if e.code == 401: logging.error('Unauthorized access. Please get a new token.') 113 | if e.code == 403 and '(Rate Limit Exceeded)' in e.read().decode('utf-8'): logging.error('Rate limit exceeded. 
Please reduce number of threads.') 114 | logging.error(f'Status Code: {e.code} for {url}/{i}') 115 | # print(f'Error body: {e.read().decode("utf-8")}') 116 | stop_signal.set() 117 | return 118 | if i % kwargs.get('log_every') == 0: 119 | if kwargs.get('forward'): 120 | time_per_item = (time.time() - start_time) / max(i - start, 1) 121 | logging.info(f'Status: {e.code} -- File Id: {i} -- TMR: {time_per_item * (stop - i) / 60:.2f} min') 122 | else: 123 | time_per_item = (time.time() - start_time) / max(start - i, 1) 124 | logging.info(f'Status: {e.code} -- File Id: {i} -- TMR: {time_per_item * (i - stop) / 60:.2f} min') 125 | continue 126 | except Exception as e: 127 | logging.error(f'Unexpected error for File Id: {i} --> {str(e)}') 128 | stop_signal.set() 129 | return 130 | 131 | if kwargs.get('forward'): 132 | threads = [ 133 | Thread(target=process, 134 | args=( 135 | start + (i * 1) + 1, # skip start 136 | increment * 1, 137 | stop + 1, 138 | url 139 | )) for i in range(increment) ] 140 | else: 141 | threads = [ 142 | Thread(target=process, 143 | args=( 144 | start + (i * -1) - 1, # skip start 145 | increment * -1, 146 | stop - 1, 147 | url 148 | )) for i in range(increment) 149 | ] 150 | 151 | for thread in threads: 152 | thread.start() 153 | for thread in threads: 154 | thread.join() 155 | 156 | found_files = [] 157 | while not queue.empty(): 158 | found_files.append(queue.get()) 159 | return found_files 160 | 161 | def try_frontend(url: str): 162 | try: 163 | request = urllib.request.Request(url) 164 | response = urllib.request.urlopen(request) 165 | response_data = response.read().decode('utf-8') 166 | if "Log into Canvas" in response_data: 167 | logging.error('Unauthorized access. 
Please check your canvas session token.') 168 | exit(1) 169 | # print(f'Frontend URL is valid, status code: {response.status}') 170 | logging.info(f'Frontend URL is valid, status code: {response.status}') 171 | except urllib.error.HTTPError as e: 172 | if e.code == 401: 173 | logging.error('Unauthorized access. Please check your canvas session token.') 174 | exit(1) 175 | if e.code == 404: 176 | logging.info(f'Frontend URL is valid, status code: {e.code}') # TODO: don't let 404s pass 177 | 178 | 179 | 180 | def main(): 181 | def validate_url(url): 182 | try: 183 | parsed_url = urlparse(url) 184 | if not all([parsed_url.scheme, parsed_url.netloc]): 185 | raise argparse.ArgumentTypeError("Invalid URL format") 186 | except ValueError: 187 | raise argparse.ArgumentTypeError("Invalid URL format") 188 | 189 | pattern = re.compile(r'files/(\d+)') 190 | match = pattern.search(url) 191 | if not match: 192 | raise argparse.ArgumentTypeError("URL does not contain a valid file ID (e.g. /files/123)") 193 | return url 194 | 195 | env_canvas_session = os.environ.get('CANVAS_SESSION') 196 | 197 | parser = argparse.ArgumentParser(description='Canvas file sweeper') 198 | parser.add_argument('-u', '--url', 199 | type=validate_url, 200 | required=True, 201 | help='The URL of the file to start from, e.g. https://canvas.example.edu/courses/123/files/456') 202 | parser.add_argument('-f', '--num-files', 203 | type=lambda x: max(int(x), 1), # limit to 1 or more 204 | default=10000, 205 | help='Number of files to scan (default 10000, min 1)') 206 | parser.add_argument('-s', '--canvas-session', 207 | type=str, 208 | required=not bool(env_canvas_session), 209 | default=env_canvas_session, 210 | help='The Canvas API canvas session, provided as an environment variable or command line argument. 
If not provided, the script will use the CANVAS_SESSION environment variable.') 211 | parser.add_argument('-w', '--num-workers', 212 | type=lambda x: min(max(int(x), 1), 32), # limit to 1-32 workers 213 | default=16, 214 | help='Number of threads to use (default 16, max 32)') 215 | parser.add_argument('-l', '--log-every', 216 | type=lambda x: max(int(x), 1), # limit to 1 or more 217 | default=1000, 218 | help='Log every (X) files found (default 1000, min 1)') 219 | parser.add_argument('--use-api', 220 | action='store_true', 221 | help='Experimental: Use the Canvas API instead of the frontend (default: False) - this will be faster but may not necessarily find all files. See README for more details.') 222 | parser.add_argument('--forward', 223 | action='store_true', 224 | help='use forward search instead of reverse search (default: False). See README for more details.') 225 | args = parser.parse_args() 226 | 227 | # check if canvas session is provided 228 | if not args.canvas_session: 229 | parser.error("Canvas session token must be provided either via --canvas-session or the CANVAS_SESSION environment variable") 230 | 231 | 232 | # processing the URL 233 | url = args.url.rstrip('/') # remove trailing slash if present 234 | if args.use_api: 235 | logging.info('Using API, this may be faster but may not find all files.') 236 | parsed_url = urlparse(url) 237 | url = f'https://{parsed_url.netloc}/api/v1{parsed_url.path}' 238 | 239 | start = int(re.search(r'files/(\d+)', url).group(1)) 240 | url = re.sub(r'files/\d+', '', url) # remove the file id from the URL 241 | url = url.rstrip('/') 242 | url = f'{url}/files' 243 | 244 | 245 | # inject the canvas session 246 | opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(CookieJar())) 247 | opener.addheaders = [ 248 | ('Cookie', f'canvas_session={args.canvas_session}'), 249 | ('User-Agent', 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/134.0.0.0 
Safari/537.36'), 250 | ] 251 | urllib.request.install_opener(opener) 252 | 253 | # print( 254 | # f'Using URL: {url}\n', 255 | # f'Using canvas session: {args.canvas_session}\n', 256 | # f'Using num files: {args.num_files}\n', 257 | # f'Using log every: {args.log_every}\n', 258 | # f'Using num workers: {args.num_workers}\n', 259 | # f'Using use_api: {args.use_api}\n' 260 | # ) 261 | 262 | try_frontend(args.url) 263 | 264 | results = sweep_files( 265 | start=start, 266 | increment=args.num_workers, 267 | stop=start + args.num_files if args.forward else start - args.num_files, 268 | url=url, 269 | log_every=args.log_every, 270 | forward=args.forward 271 | ) 272 | 273 | results = sorted(results, key=lambda x: x.get('id'), reverse=True) 274 | os.makedirs('output', exist_ok=True) 275 | output_file = os.path.join('output', f'{time.strftime("%Y%m%d-%H%M%S")}-canvas-files.json') 276 | with open(output_file, 'w') as f: 277 | json.dump(results, f, indent=4) 278 | 279 | 280 | 281 | if __name__ == "__main__": 282 | main() --------------------------------------------------------------------------------