├── .gitignore ├── LICENSE.txt ├── README.md ├── WPJsonScraper.py ├── doc ├── Interactive.md └── WPJsonScraperCapture.png ├── lib ├── __init__.py ├── console.py ├── exceptions.py ├── exporter.py ├── infodisplayer.py ├── interactive.py ├── plugins │ └── plugin_list.csv ├── requestsession.py ├── utils.py └── wpapi.py └── requirements.txt /.gitignore: -------------------------------------------------------------------------------- 1 | */__pycache__/* 2 | .venv/* -------------------------------------------------------------------------------- /LICENSE.txt: -------------------------------------------------------------------------------- 1 | Copyright (c) 2018-2020 Mickaël "Kilawyn" Walter 2 | 3 | Permission is hereby granted, free of charge, to any person obtaining a copy 4 | of this software and associated documentation files (the "Software"), to deal 5 | in the Software without restriction, including without limitation the rights 6 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 7 | copies of the Software, and to permit persons to whom the Software is 8 | furnished to do so, subject to the following conditions: 9 | 10 | The above copyright notice and this permission notice shall be included in all 11 | copies or substantial portions of the Software. 12 | 13 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 14 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 15 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 16 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 17 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 18 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 19 | SOFTWARE. 
20 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # WPJsonScraper 2 | 3 | ## Introduction 4 | 5 | ![WPJsonScraper capture](doc/WPJsonScraperCapture.png) 6 | 7 | WPJsonScraper is a tool for dumping as much as possible of the content available on a 8 | WordPress installation. It uses the wp-json API to retrieve all important 9 | information and enumerate every user, post, comment, media object and more. 10 | 11 | This makes it possible to find sensitive files or pages which may not be 12 | sufficiently protected from external access. 13 | 14 | WPJsonScraper has two operation modes: command line arguments and interactive. 15 | The latter offers a command prompt for more complex operations on 16 | the WP-JSON API. 17 | 18 | ## Prerequisites 19 | 20 | WPJsonScraper is written in Python and should work with any Python 3 21 | environment, provided that the following are installed: 22 | 23 | * Python 3 24 | * requests 25 | 26 | ## Installation 27 | 28 | Just clone the repository with git and run `pip install -r requirements.txt`. 29 | 30 | You may want to use a virtualenv to keep your dependencies consistent across 31 | Python projects. 32 | 33 | ## Usage 34 | 35 | ### Interactive mode 36 | 37 | See [Interactive mode](doc/Interactive.md) for more details. 38 | 39 | ### Command line arguments mode 40 | 41 | The tool needs a target WordPress installation and a flag 42 | specifying which action to perform. 43 | 44 | To get all available information, use the -a flag. 
This can be 45 | quite verbose, so you can instead select the categories of information you need 46 | from the following: 47 | 48 | * -h, --help: display the help and exit 49 | * -v, --version: display the version number and exit 50 | * -a, --all: display all data available 51 | * -i, --info: dump basic information about the target 52 | * -e, --endpoints: dump full endpoint documentation 53 | * -p, --posts: list all published posts 54 | * -u, --users: list all users 55 | * -t, --tags: list all tags 56 | * -c, --categories: list all categories 57 | * -m, --media: list all public media objects 58 | * --download-media MEDIA_FOLDER: download media to the designated folder 59 | * -g, --pages: list all public pages 60 | * -o, --comments: list comments 61 | * -S, --search SEARCH_TERMS: perform a search on SEARCH_TERMS 62 | * -r, --crawl-ns: crawl plugin namespaces for collections. Set it to `all` to 63 | crawl all namespaces 64 | * --proxy PROXY_URL: force the data to pass through the specified proxy server 65 | * --auth CREDENTIALS: use the specified credentials as basic HTTP auth for the 66 | server 67 | * --cookies COOKIES: add the specified cookies to the requests 68 | * --no-color: remove color (for example to redirect the output to a file) 69 | * --interactive: start an interactive session 70 | 71 | You can also export the contents of pages, posts and comments to a folder as separate 72 | files: 73 | 74 | * --export-pages PAGE_EXPORT_FOLDER 75 | * --export-posts POST_EXPORT_FOLDER 76 | * --export-comments COMMENT_EXPORT_FOLDER 77 | 78 | You can set the proxy server with the --proxy flag. It can be an HTTP or HTTPS 79 | proxy, as described in the Python requests documentation. By default, the proxy servers of 80 | the system are used. 81 | 82 | Example: 83 | 84 | http://user:password@example.com:8080/ 85 | 86 | Using the -r option, you can crawl collections of the specified namespace. 
This 87 | lets you retrieve a set of objects from the API, possibly including confidential data ;) 88 | 89 | #### Search feature 90 | 91 | The WordPress WP-JSON API can search posts, pages, media objects, tags, 92 | categories, comments and users. 93 | 94 | The -S (--search) option makes this functionality available in 95 | wp-json-scraper. 96 | 97 | It can be used on a specific item type or on several at once. 98 | 99 | Examples: 100 | 101 | # Search for "lorem" in all item types 102 | ./WPJsonScraper.py -S lorem https://demo.wp-api.org/ 103 | # Search for "hello world" in posts, users and pages only 104 | ./WPJsonScraper.py -S "hello world" -p -u -g https://demo.wp-api.org/ 105 | 106 | ## Features to implement 107 | 108 | WPJsonScraper is not a mature project yet and its features are still fairly basic. 109 | Some of the features that could be implemented in the future are: 110 | 111 | * Post revision retrieval 112 | * Plugin support 113 | * Authentication support with NTLM 114 | * Saving a WordPress instance as JSON (limited to the accessible scope) and restoring it? 115 | * Password-protected content handling 116 | * Support for the new endpoints added in version 5.0: autosaves, block type, blocks, block_renderer, themes (authenticated access required but WTF?) 117 | * Write tests duh! 
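As a possible starting point for the last item, a first test could cover the clean-up that `main()` in `WPJsonScraper.py` applies to the target argument (prepending `http://` when no scheme is given and appending a trailing slash). The sketch below is an assumption, not existing project code: `normalize_target` is a hypothetical helper that copies the two regular-expression checks from `main()` so they can be exercised in isolation.

```python
import re


def normalize_target(target):
    """Hypothetical helper (not in the repository): mirrors the inline
    target clean-up performed in main()."""
    # Prepend a scheme when none is given; main() falls back to plain HTTP
    if re.match(r'^https?://.*$', target) is None:
        target = "http://" + target
    # Make sure the base URL ends with a slash
    if re.match(r'^.+/$', target) is None:
        target += "/"
    return target


def test_normalize_target():
    # Bare host name: scheme and trailing slash are both added
    assert normalize_target("example.com") == "http://example.com/"
    # Explicit HTTPS scheme is kept, only the slash is added
    assert normalize_target("https://example.com") == "https://example.com/"
    # Already normalized input is returned unchanged
    assert normalize_target("http://example.com/blog/") == "http://example.com/blog/"


if __name__ == "__main__":
    test_normalize_target()
    print("all checks passed")
```

Extracting the logic into a function first is a design choice made here only so that it can be tested without invoking argparse or opening a network connection.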
-------------------------------------------------------------------------------- /WPJsonScraper.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | 3 | """ 4 | Copyright (c) 2018-2020 Mickaël "Kilawyn" Walter 5 | 6 | Permission is hereby granted, free of charge, to any person obtaining a copy 7 | of this software and associated documentation files (the "Software"), to deal 8 | in the Software without restriction, including without limitation the rights 9 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 10 | copies of the Software, and to permit persons to whom the Software is 11 | furnished to do so, subject to the following conditions: 12 | 13 | The above copyright notice and this permission notice shall be included in all 14 | copies or substantial portions of the Software. 15 | 16 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 17 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 18 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 19 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 20 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 21 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 22 | SOFTWARE. 
23 | """ 24 | 25 | import argparse 26 | import requests 27 | import re 28 | import os 29 | 30 | from lib.console import Console 31 | from lib.wpapi import WPApi 32 | from lib.infodisplayer import InfoDisplayer 33 | from lib.exceptions import NoWordpressApi, WordPressApiNotV2, \ 34 | NSNotFoundException 35 | from lib.exporter import Exporter 36 | from lib.requestsession import RequestSession 37 | from lib.interactive import start_interactive 38 | 39 | version = '0.5' 40 | 41 | def main(): 42 | parser = argparse.ArgumentParser(description= 43 | """Reads a WP-JSON API on a WordPress installation to retrieve as much 44 | publicly available information as possible. This information includes, but is not limited to, 45 | posts, comments, pages, media and users. As this tool could give access to 46 | confidential (but poorly protected) data, it is recommended that you first 47 | get written permission from the site owner. The author accepts no 48 | liability for misuse of this software""", 49 | epilog= 50 | """(c) 2018-2020 Mickaël "Kilawyn" Walter. 
This program is licensed under the MIT 51 | license, check LICENSE.txt for more information""") 52 | parser.add_argument('-v', 53 | '--version', 54 | action='version', 55 | version='%(prog)s ' + version) 56 | parser.add_argument('target', 57 | type=str, 58 | help='the base path of the WordPress installation to ' 59 | 'examine') 60 | parser.add_argument('-i', 61 | '--info', 62 | dest='info', 63 | action='store_true', 64 | help='dumps basic information about the WordPress ' 65 | 'installation') 66 | parser.add_argument('-e', 67 | '--endpoints', 68 | dest='endpoints', 69 | action='store_true', 70 | help='dumps full endpoint documentation') 71 | parser.add_argument('-p', 72 | '--posts', 73 | dest='posts', 74 | action='store_true', 75 | help='lists published posts') 76 | parser.add_argument('--export-posts', 77 | dest='post_export_folder', 78 | action='store', 79 | help='export posts to a specified destination folder') 80 | parser.add_argument('-u', 81 | '--users', 82 | dest='users', 83 | action='store_true', 84 | help='lists users') 85 | parser.add_argument('-t', 86 | '--tags', 87 | dest='tags', 88 | action='store_true', 89 | help='lists tags') 90 | parser.add_argument('-c', 91 | '--categories', 92 | dest='categories', 93 | action='store_true', 94 | help='lists categories') 95 | parser.add_argument('-m', 96 | '--media', 97 | dest='media', 98 | action='store_true', 99 | help='lists media objects') 100 | parser.add_argument('-g', 101 | '--pages', 102 | dest='pages', 103 | action='store_true', 104 | help='lists pages') 105 | parser.add_argument('-o', 106 | '--comments', 107 | dest='comments', 108 | action='store_true', 109 | help="lists comments") 110 | parser.add_argument('--export-pages', 111 | dest='page_export_folder', 112 | action='store', 113 | help='export pages to a specified destination folder') 114 | parser.add_argument('--export-comments', 115 | dest='comment_export_folder', 116 | action='store', 117 | help='export comments to a specified destination folder') 
118 | parser.add_argument('--download-media', 119 | dest='media_folder', 120 | action='store', 121 | help='download media to the designated folder') 122 | parser.add_argument('-r', 123 | '--crawl-ns', 124 | dest='crawl_ns', 125 | action='store', 126 | help='crawl all GET routes of the specified namespace ' 127 | 'or all namespaces if all is specified') 128 | parser.add_argument('-a', 129 | '--all', 130 | dest='all', 131 | action='store_true', 132 | help='dumps all available information from the ' 133 | 'target API') 134 | parser.add_argument('-S', 135 | '--search', 136 | dest='search', 137 | action='store', 138 | help='search for a string on the WordPress instance. ' 139 | 'If one or several flag in agpmctu are set, search ' 140 | 'only on these') 141 | parser.add_argument('--proxy', 142 | dest='proxy_server', 143 | action='store', 144 | help='define a proxy server to use, e.g. for ' 145 | 'enterprise network or debugging') 146 | parser.add_argument('--auth', 147 | dest='credentials', 148 | action='store', 149 | help='define a username and a password separated by ' 150 | 'a colon to use them as basic authentication') 151 | parser.add_argument('--cookies', 152 | dest='cookies', 153 | action='store', 154 | help='define specific cookies to send with the request ' 155 | 'in the format cookie1=foo; cookie2=bar') 156 | parser.add_argument('--no-color', 157 | dest='nocolor', 158 | action='store_true', 159 | help='remove color in the output (e.g. to pipe it)') 160 | parser.add_argument('--interactive', 161 | dest='interactive', 162 | action='store_true', 163 | help='start an interactive session') 164 | 165 | 166 | args = parser.parse_args() 167 | 168 | motd = """ 169 | _ _______ ___ _____ 170 | | | | | ___ \\|_ | / ___| 171 | | | | | |_/ / | | ___ ___ _ __ \\ `--. ___ _ __ __ _ _ __ ___ _ __ 172 | | |/\\| | __/ | |/ __|/ _ \\| '_ \\ `--. 
\\/ __| '__/ _` | '_ \\ / _ \\ '__| 173 | \\ /\\ / | /\\__/ /\\__ \\ (_) | | | /\\__/ / (__| | | (_| | |_) | __/ | 174 | \\/ \\/\\_| \\____/ |___/\\___/|_| |_\\____/ \\___|_| \\__,_| .__/ \\___|_| 175 | | | 176 | |_| 177 | WPJsonScraper v%s 178 | By Mickaël \"Kilawyn\" Walter 179 | 180 | Make sure you use this tool with the approval of the site owner. Even if 181 | this information is public or available with proper authentication, this 182 | could be considered an intrusion. 183 | 184 | Target: %s 185 | 186 | """ % (version, args.target) 187 | 188 | print(motd) 189 | 190 | if args.nocolor: 191 | Console.wipe_color() 192 | 193 | Console.log_info("Testing connectivity with the server") 194 | 195 | target = args.target 196 | if re.match(r'^https?://.*$', target) is None: 197 | target = "http://" + target 198 | if re.match(r'^.+/$', target) is None: 199 | target += "/" 200 | 201 | proxy = None 202 | if args.proxy_server is not None: 203 | proxy = args.proxy_server 204 | cookies = None 205 | if args.cookies is not None: 206 | cookies = args.cookies 207 | authorization = None 208 | if args.credentials is not None: 209 | authorization_list = args.credentials.split(':') 210 | if len(authorization_list) == 1: 211 | authorization = (authorization_list[0], '') 212 | elif len(authorization_list) >= 2: 213 | authorization = (authorization_list[0], 214 | ':'.join(authorization_list[1:])) 215 | session = RequestSession(proxy=proxy, cookies=cookies, 216 | authorization=authorization) 217 | try: 218 | session.get(target) 219 | Console.log_success("Connection OK") 220 | except Exception as e: 221 | Console.log_error("Failed to connect to the server") 222 | exit(1) 223 | 224 | # Quite an ugly check to launch a search on all eligible parameters 225 | # Should find something better (maybe in argparse doc?) 
226 | if args.search is not None and not (args.all | args.posts | args.pages | 227 | args.users | args.categories | args.tags | args.media): 228 | Console.log_info("Searching on all available sources") 229 | args.posts = True 230 | args.pages = True 231 | args.users = True 232 | args.categories = True 233 | args.tags = True 234 | args.media = True 235 | 236 | if args.interactive: 237 | start_interactive(target, session, version) 238 | return 239 | 240 | scanner = WPApi(target, session=session, search_terms=args.search) 241 | if args.info or args.all: 242 | try: 243 | basic_info = scanner.get_basic_info() 244 | Console.log_info("General information on the target") 245 | InfoDisplayer.display_basic_info(basic_info) 246 | except NoWordpressApi: 247 | Console.log_error("No WordPress API available at the given URL " 248 | "(too old WordPress or not WordPress?)") 249 | exit() 250 | 251 | if args.posts or args.all: 252 | try: 253 | if args.comments: 254 | Console.log_info("Post list with comments") 255 | else: 256 | Console.log_info("Post list") 257 | posts_list = scanner.get_posts(args.comments) 258 | InfoDisplayer.display_posts(posts_list, scanner.get_orphans_comments()) 259 | except WordPressApiNotV2: 260 | Console.log_error("The API does not support WP V2") 261 | 262 | if args.pages or args.all: 263 | try: 264 | Console.log_info("Page list") 265 | pages_list = scanner.get_pages() 266 | InfoDisplayer.display_pages(pages_list) 267 | except WordPressApiNotV2: 268 | Console.log_error("The API does not support WP V2") 269 | 270 | if args.users or args.all: 271 | try: 272 | Console.log_info("User list") 273 | users_list = scanner.get_users() 274 | InfoDisplayer.display_users(users_list) 275 | except WordPressApiNotV2: 276 | Console.log_error("The API does not support WP V2") 277 | 278 | if args.endpoints or args.all: 279 | try: 280 | Console.log_info("API endpoints") 281 | basic_info = scanner.get_basic_info() 282 | InfoDisplayer.display_endpoints(basic_info) 283 | except 
NoWordpressApi: 284 | Console.log_error("No WordPress API available at the given URL " 285 | "(too old WordPress or not WordPress?)") 286 | exit() 287 | 288 | if args.categories or args.all: 289 | try: 290 | Console.log_info("Category list") 291 | categories_list = scanner.get_categories() 292 | InfoDisplayer.display_categories(categories_list) 293 | except WordPressApiNotV2: 294 | Console.log_error("The API does not support WP V2") 295 | 296 | if args.tags or args.all: 297 | try: 298 | Console.log_info("Tags list") 299 | tags_list = scanner.get_tags() 300 | InfoDisplayer.display_tags(tags_list) 301 | except WordPressApiNotV2: 302 | Console.log_error("The API does not support WP V2") 303 | 304 | media_list = None 305 | if args.media or args.all: 306 | try: 307 | Console.log_info("Media list") 308 | media_list = scanner.get_media() 309 | InfoDisplayer.display_media(media_list) 310 | except WordPressApiNotV2: 311 | Console.log_error("The API does not support WP V2") 312 | 313 | if args.crawl_ns is None and args.all: 314 | args.crawl_ns = "all" 315 | 316 | if args.crawl_ns is not None: 317 | try: 318 | if args.crawl_ns == "all": 319 | Console.log_info("Crawling all namespaces") 320 | else: 321 | Console.log_info("Crawling %s namespace" % args.crawl_ns) 322 | ns_data = scanner.crawl_namespaces(args.crawl_ns) 323 | InfoDisplayer.display_crawled_ns(ns_data) 324 | except NSNotFoundException: 325 | Console.log_error("The specified namespace was not found") 326 | except Exception as e: 327 | print(e) 328 | 329 | if args.post_export_folder is not None: 330 | try: 331 | posts_list = scanner.get_posts() 332 | tags_list = scanner.get_tags() 333 | categories_list = scanner.get_categories() 334 | users_list = scanner.get_users() 335 | print() 336 | post_number = Exporter.export_posts_html(posts_list, 337 | args.post_export_folder, 338 | tags_list, 339 | categories_list, 340 | users_list) 341 | if post_number> 0: 342 | Console.log_success("Exported %d posts to %s" % 343 | 
(post_number, args.post_export_folder)) 344 | except WordPressApiNotV2: 345 | Console.log_error("The API does not support WP V2") 346 | 347 | if args.page_export_folder is not None: 348 | try: 349 | pages_list = scanner.get_pages() 350 | users_list = scanner.get_users() 351 | print() 352 | page_number = Exporter.export_posts_html(pages_list, 353 | args.page_export_folder, 354 | None, 355 | None, 356 | users_list) 357 | if page_number> 0: 358 | Console.log_success("Exported %d pages to %s" % 359 | (page_number, args.page_export_folder)) 360 | except WordPressApiNotV2: 361 | Console.log_error("The API does not support WP V2") 362 | 363 | if args.comment_export_folder is not None: 364 | try: 365 | post_list = scanner.get_posts(True) 366 | orphan_list = scanner.get_orphans_comments() 367 | print() 368 | page_number = Exporter.export_comments(post_list, orphan_list, args.comment_export_folder) 369 | if page_number > 0: 370 | Console.log_success("Exported %d comments to %s" % 371 | (page_number, args.comment_export_folder)) 372 | except WordPressApiNotV2: 373 | Console.log_error("The API does not support WP V2") 374 | 375 | if args.media_folder is not None: 376 | Console.log_info("Downloading media files") 377 | if not os.path.isdir(args.media_folder): 378 | Console.log_error("The destination is not a folder or does not exist") 379 | else: 380 | print("Pulling the media URLs") 381 | 382 | media, _ = scanner.get_media_urls('all', True) 383 | if len(media) == 0: 384 | Console.log_error("No media found") 385 | return 386 | print("%d media URLs found" % len(media)) 387 | 388 | print("Note: Only files over 10MB are logged here") 389 | number_downloaded = Exporter.download_media(media, args.media_folder) 390 | Console.log_success('Downloaded %d media to %s' % (number_downloaded, args.media_folder)) 391 | 392 | 393 | if __name__ == "__main__": 394 | main() 395 | -------------------------------------------------------------------------------- /doc/Interactive.md: 
-------------------------------------------------------------------------------- 1 | # Interactive mode 2 | 3 | To help with more complex interactions with the WP-JSON API, WPJsonScraper implements an interactive mode. 4 | 5 | In interactive mode, the same session is reused between requests, so cookies set by the server and other parameters are kept 6 | from one request to another. 7 | 8 | Typing `command -h` or `command --help` displays a detailed help message for that command. 9 | 10 | Tab autocompletes the command name; the up and down arrow keys browse the command history. 11 | 12 | ## Commands 13 | 14 | ### help 15 | 16 | Lists commands and displays a brief help message about specified commands. 17 | 18 | Example 1: display the command list 19 | 20 | help 21 | 22 | Example 2: display a brief help message about the show command 23 | 24 | help show 25 | 26 | ### exit 27 | 28 | Exits the interactive mode and goes back to the user's shell. 29 | 30 | ### show 31 | 32 | Shows details about global parameters stored in WPJsonScraper memory. 33 | 34 | Example: show all parameters 35 | 36 | show all 37 | 38 | ### set 39 | 40 | Sets a specific global parameter. 41 | 42 | Note that for proxies and cookies, the command updates the existing entries (adding or modifying them). 43 | Check the resulting parameter with show if in doubt. 44 | 45 | **Note:** changing the target resets the cache but keeps proxies, cookies and authorization headers. Be aware 46 | of data leakage risks. If you need to keep things apart between targets, relaunch WPJsonScraper or make sure 47 | everything is correctly set up with the `show all` command. 48 | 49 | Example 1: change the target 50 | 51 | set target http://example.com 52 | 53 | Example 2: add or modify the cookies PHPSESSID and JSESSIONID (because why not?) 54 | 55 | set cookie "PHPSESSID=deadbeef; JSESSIONID=badc0ffee" 56 | 57 | ### list 58 | 59 | Lists specified data from the server. 
60 | 61 | This command gets data from the server and displays it as a simple list (with no details). 62 | 63 | It can also export the full scraped data (with all available details) to a specified JSON or CSV file 64 | (see the --csv and --json options). If a file extension is not specified, WPJsonScraper will append one. 65 | The export options will try to join data with other API endpoint data (e.g. users with posts). CSV export 66 | drops most of the data to keep the file human-readable. Use the CSV option only to export a list of 67 | posts. 68 | 69 | **Note:** to avoid making too much noise on the target, WPJsonScraper won't automatically fetch any other 70 | endpoint to complete the exported data. If you want all information to be gathered, you have to build the 71 | cache first by requesting the data beforehand (for example, getting the user list before exporting the posts). 72 | 73 | By default, WPJsonScraper caches data to avoid requesting the server too often. To get the latest updates, 74 | run this command with the --no-cache option. 75 | 76 | Use the --limit and --start options to retrieve a subset of the selected data. 77 | 78 | In the case of media files, the files themselves **are not downloaded**. 79 | 80 | Example 1: get all posts 81 | 82 | list posts 83 | 84 | Example 2: get at most 10 pages starting at page 15 85 | 86 | list pages --start 15 --limit 10 87 | 88 | Example 3: export all listable content to JSON files (including for example all-data-posts.json) 89 | 90 | list all --json all-data 91 | 92 | Example 4: list namespaces 93 | 94 | list namespaces 95 | 96 | ### fetch 97 | 98 | Fetches a specific piece of data from the server given its type and its ID. By default, if the data is cached, 99 | it is returned from the cache. Use the --no-cache argument to force its retrieval from the server. 100 | 101 | The data displayed is more complete than the data displayed by the list command, but some metadata is still not 102 | displayed. 
Only the JSON export is a full data dump (with additional mapping when relevant). 103 | 104 | **Note:** as with the list command, the data that could complete the displayed information is not automatically 105 | fetched. You have to get it into the cache first or fetch it separately based on its ID. Moreover, the data 106 | retrieved by ID is not yet pushed into the cache. It may be in a later version. 107 | 108 | Example 1: display the post with the ID 1 109 | 110 | fetch post 1 111 | 112 | Example 2: display the page with the ID 42 and export it to a JSON file, without using the cache 113 | 114 | fetch page 42 --no-cache 115 | 116 | ### search 117 | 118 | Looks for data based on the specified keywords. This command doesn't use the cache and always queries the 119 | WordPress API to perform searches. One or several object types may be provided to narrow the search scope. 120 | 121 | Example 1: look for keyword test in all object types 122 | 123 | search test 124 | 125 | Example 2: look for keyword foo in posts and pages 126 | 127 | search --type post --type page foo 128 | 129 | Example 3: --limit and --start also work for search results 130 | 131 | search --limit 5 --start 4 bar 132 | 133 | ### dl 134 | 135 | Downloads media based on the provided ID. The ID can be specified as an integer (or list of integers), `all` or 136 | `cache`. In the first case, only media with the specified IDs will be downloaded. `all` will trigger a fetch from 137 | the API to list all media, then a download session for each file. `cache` will get media URLs from the cache and 138 | then download the files. 139 | 140 | Note that if all the IDs specified are in the cache, no lookup will be made on the API. If you want to override 141 | this behaviour, set the `--no-cache` flag. 142 | 143 | Example 1: download the media with the IDs 42 and 63 to the current folder 144 | 145 | dl 42,63 . 
146 | 147 | Example 2: download all media to user's home folder 148 | 149 | dl all /home/user 150 | 151 | Example 3: only media present in the cache (e.g. previously requested with list or fetch) are downloaded 152 | 153 | dl cache . -------------------------------------------------------------------------------- /doc/WPJsonScraperCapture.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/MickaelWalter/wp-json-scraper/677ddeea6437f24302855652756e11c89ebeaf84/doc/WPJsonScraperCapture.png -------------------------------------------------------------------------------- /lib/__init__.py: -------------------------------------------------------------------------------- 1 | pass 2 | -------------------------------------------------------------------------------- /lib/console.py: -------------------------------------------------------------------------------- 1 | """ 2 | Copyright (c) 2018-2020 Mickaël "Kilawyn" Walter 3 | 4 | Permission is hereby granted, free of charge, to any person obtaining a copy 5 | of this software and associated documentation files (the "Software"), to deal 6 | in the Software without restriction, including without limitation the rights 7 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 8 | copies of the Software, and to permit persons to whom the Software is 9 | furnished to do so, subject to the following conditions: 10 | 11 | The above copyright notice and this permission notice shall be included in all 12 | copies or substantial portions of the Software. 13 | 14 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 15 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 16 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. 
IN NO EVENT SHALL THE 17 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 18 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 19 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 20 | SOFTWARE. 21 | """ 22 | 23 | 24 | class Console: 25 | """ 26 | A little helper class to allow console management (like color) 27 | """ 28 | normal = "\033[0m" 29 | blue = "\033[94m" 30 | green = "\033[92m" 31 | red = "\033[31m" 32 | 33 | @staticmethod 34 | def wipe_color(): 35 | """ 36 | Deactivates color in terminal 37 | """ 38 | Console.normal = "" 39 | Console.blue = "" 40 | Console.green = "" 41 | Console.red = "" 42 | 43 | @staticmethod 44 | def log_info(text): 45 | """ 46 | Prints information log to the console 47 | param text: the text to display 48 | """ 49 | print() 50 | print(Console.blue + "[*] " + text + Console.normal) 51 | 52 | @staticmethod 53 | def log_error(text): 54 | """ 55 | Prints error log to the console 56 | param text: the text to display 57 | """ 58 | print() 59 | print(Console.red + "[!] 
" + text + Console.normal) 60 | 61 | @staticmethod 62 | def log_success(text): 63 | """ 64 | Prints error log to the console 65 | param text: the text to display 66 | """ 67 | print(Console.green + "[+] " + text + Console.normal) 68 | -------------------------------------------------------------------------------- /lib/exceptions.py: -------------------------------------------------------------------------------- 1 | """ 2 | Copyright (c) 2018-2020 Mickaël "Kilawyn" Walter 3 | 4 | Permission is hereby granted, free of charge, to any person obtaining a copy 5 | of this software and associated documentation files (the "Software"), to deal 6 | in the Software without restriction, including without limitation the rights 7 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 8 | copies of the Software, and to permit persons to whom the Software is 9 | furnished to do so, subject to the following conditions: 10 | 11 | The above copyright notice and this permission notice shall be included in all 12 | copies or substantial portions of the Software. 13 | 14 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 15 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 16 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 17 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 18 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 19 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 20 | SOFTWARE. 
21 | """ 22 | 23 | class NoWordpressApi (Exception): 24 | """ 25 | No API is available at the given URL 26 | """ 27 | pass 28 | 29 | class WordPressApiNotV2 (Exception): 30 | """ 31 | The WordPress V2 API is not available 32 | """ 33 | pass 34 | 35 | class NSNotFoundException (Exception): 36 | """ 37 | The specified namespace does not exist 38 | """ 39 | pass 40 | -------------------------------------------------------------------------------- /lib/exporter.py: -------------------------------------------------------------------------------- 1 | """ 2 | Copyright (c) 2018-2020 Mickaël "Kilawyn" Walter 3 | 4 | Permission is hereby granted, free of charge, to any person obtaining a copy 5 | of this software and associated documentation files (the "Software"), to deal 6 | in the Software without restriction, including without limitation the rights 7 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 8 | copies of the Software, and to permit persons to whom the Software is 9 | furnished to do so, subject to the following conditions: 10 | 11 | The above copyright notice and this permission notice shall be included in all 12 | copies or substantial portions of the Software. 13 | 14 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 15 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 16 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 17 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 18 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 19 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 20 | SOFTWARE. 
21 | """ 22 | 23 | import os 24 | import copy 25 | import html 26 | import json 27 | import csv 28 | from datetime import datetime 29 | from urllib import parse as urlparse 30 | import mimetypes 31 | import requests 32 | 33 | from lib.console import Console 34 | from lib.utils import get_by_id, print_progress_bar 35 | 36 | class Exporter: 37 | """ 38 | Utility functions to export data 39 | """ 40 | JSON = 1 41 | """ 42 | Represents the JSON format for format choice 43 | """ 44 | CSV = 2 45 | """ 46 | Represents the CSV format for format choice 47 | """ 48 | CHUNK_SIZE = 2048 49 | """ 50 | The size of chunks to download large files 51 | """ 52 | 53 | @staticmethod 54 | def download_media(media, output_folder, slugs=None): 55 | """ 56 | Downloads the media files based on the given URLs 57 | 58 | :param media: the URLs as a list 59 | :param output_folder: the path to the folder where the files are being saved, it is assumed as existing 60 | :param slugs: list of slugs to associate with media. 
The list must be ordered the same as media and should be the same size 61 | :return: the number of files written 62 | """ 63 | files_number = 0 64 | media_length = len(media) 65 | progress = 0 66 | for m in media: 67 | r = requests.get(m, stream=True) 68 | if r.status_code == 200: 69 | http_path = urlparse.urlparse(m).path.split("/") 70 | local_path = output_folder 71 | if len(http_path) > 1: 72 | for el in http_path[:-1]: 73 | local_path = os.path.join(local_path, el) 74 | if not os.path.isdir(local_path): 75 | os.mkdir(local_path) 76 | if slugs is None: 77 | local_path = os.path.join(local_path, http_path[-1]) 78 | else: 79 | ext = mimetypes.guess_extension(r.headers['Content-Type']) 80 | local_path = os.path.join(local_path, slugs[progress]) 81 | if ext is not None: 82 | local_path += ext 83 | with open(local_path, "wb") as f: 84 | i = 0 85 | content_size = int(r.headers['Content-Length']) 86 | for chunk in r.iter_content(Exporter.CHUNK_SIZE): 87 | if content_size > 10485760: # 10 MiB 88 | print_progress_bar(i*Exporter.CHUNK_SIZE, content_size, prefix=http_path[-1], length=70) 89 | f.write(chunk) 90 | i += 1 91 | if content_size > 10485760: # 10 MiB 92 | print_progress_bar(content_size, content_size, prefix=http_path[-1], length=70) 93 | files_number += 1 94 | progress += 1 95 | if progress % 10 == 1: 96 | print("Downloaded file %d of %d" % (progress, media_length)) 97 | return files_number 98 | 99 | @staticmethod 100 | def map_params(el, parameters_to_map): 101 | """ 102 | Maps params to ids recursively. 103 | 104 | This method automatically maps IDs with the corresponding objects given in parameters_to_map. 105 | The mapping is made in place as el is passed as a reference.
106 | 107 | :param el: the element that has ID references 108 | :param parameters_to_map: a dict containing lists of elements to map by ids with el 109 | """ 110 | for key, value in el.items(): 111 | if key in parameters_to_map.keys() and parameters_to_map[key] is not None: 112 | if type(value) is int: # Only one ID to map 113 | obj = get_by_id(parameters_to_map[key], value) 114 | if obj is not None: 115 | el[key] = { 116 | 'id': value, 117 | 'details': obj 118 | } 119 | elif type(value) is list: # The object is a list of IDs, we map each one 120 | vlist = [] 121 | for v in value: 122 | obj = get_by_id(parameters_to_map[key], v) 123 | vlist.append(obj) 124 | el[key] = { 125 | 'ids': value, 126 | 'details': vlist 127 | } 128 | elif type(value) is dict: # Nested object, recurse ("value is dict" was never true) 129 | Exporter.map_params(value, parameters_to_map) 130 | 131 | @staticmethod 132 | def setup_export(vlist, parameters_to_unescape, parameters_to_map): 133 | """ 134 | Sets up the right values for a list export. 135 | 136 | This function flattens a list of objects before its serialization in the expected format. 137 | It also makes a deepcopy to ensure that the original vlist is not altered. 138 | 139 | :param vlist: the list to prepare for exporting 140 | :param parameters_to_unescape: parameters to unescape (ex. ["param1", ["param2", "rendered"]]) 141 | :param parameters_to_map: parameters to map to another (ex.
{"param_to_map": param_values_list}) 142 | """ 143 | exported_list = [] 144 | 145 | for el in vlist: 146 | if el is not None: 147 | # First copy the object 148 | exported_el = copy.deepcopy(el) 149 | # Look for parameters to HTML unescape 150 | for key in parameters_to_unescape: 151 | if type(key) is str: # If the parameter is at the root 152 | exported_el[key] = html.unescape(exported_el[key]) 153 | elif type(key) is list: # If the parameter is nested 154 | selected = exported_el 155 | siblings = [] 156 | fullpath = {} 157 | # We look for the leaf first, not forgetting sibling branches for rebuilding the tree later 158 | for k in key: 159 | if type(selected) is dict and k in selected.keys(): 160 | sib = {} 161 | for e in selected.keys(): 162 | if e != k: 163 | sib[e] = selected[e] 164 | selected = selected[k] 165 | siblings.append(sib) 166 | else: 167 | selected = None 168 | break 169 | # If we can unescape the parameter, we do it and rebuild the tree starting from the leaf 170 | if selected is not None and type(selected) is str: 171 | selected = html.unescape(selected) 172 | key.reverse() 173 | fullpath[key[0]] = selected 174 | s = len(siblings) - 1 175 | for e in siblings[s].keys(): 176 | fullpath[e] = siblings[s][e] 177 | for k in key[1:]: 178 | fullpath = {k: fullpath} 179 | s -= 1 180 | for e in siblings[s].keys(): 181 | fullpath[e] = siblings[s][e] 182 | key.reverse() 183 | exported_el[key[0]] = fullpath[key[0]] 184 | # If there is any parameter to map, we do it here 185 | Exporter.map_params(exported_el, parameters_to_map) 186 | # The resulting element is appended to the list of exported elements 187 | exported_list.append(exported_el) 188 | 189 | return exported_list 190 | 191 | @staticmethod 192 | def prepare_filename(filename, fmt): 193 | """ 194 | Returns a filename with the proper extension according to the given format 195 | 196 | :param filename: the filename to clean 197 | :param fmt: the file format 198 | :return: the cleaned filename 199 | """ 200 
| if filename[-5:] != ".json" and fmt == Exporter.JSON: 201 | filename += ".json" 202 | elif filename[-4:] != ".csv" and fmt == Exporter.CSV: 203 | filename += ".csv" 204 | return filename 205 | 206 | @staticmethod 207 | def write_file(filename, fmt, csv_keys, data, details=None): 208 | """ 209 | Writes content to the given file using the given format. 210 | 211 | The key mapping must be a dict of keys or lists of keys to ensure proper mapping. 212 | 213 | :param filename: the path of the file 214 | :param fmt: the format of the file 215 | :param csv_keys: the key mapping 216 | :param data: the actual data to export 217 | :param details: the details keys to look for 218 | """ 219 | with open(filename, "w", encoding="utf-8") as f: 220 | if fmt == Exporter.JSON: 221 | # The JSON format is straightforward, we dump the flattened objects to JSON 222 | json.dump(data, f, ensure_ascii=False, indent=4) 223 | else: 224 | # The CSV format requires some work, to select the most relevant information 225 | fieldnames = csv_keys.keys() 226 | w = csv.DictWriter(f, fieldnames=fieldnames) 227 | w.writeheader() 228 | for el in data: 229 | el_csv = {} 230 | for key in csv_keys: 231 | # First we look for the key specified by csv_keys and select the corresponding leaf 232 | k = csv_keys[key] 233 | selected = None 234 | last_key = None 235 | if type(k) is str: 236 | last_key = k 237 | k = [k] 238 | if k[0] in el.keys(): 239 | selected = el[k[0]] 240 | else: 241 | el_csv[key] = "" 242 | continue 243 | if len(k) > 1: 244 | for subkey in k[1:]: 245 | if subkey in selected.keys(): 246 | selected = selected[subkey] 247 | last_key = subkey 248 | # Once the leaf is selected, we verify if there is any kind of ID mapping and act accordingly 249 | if type(selected) is dict and 'id' in selected.keys() and 'details' in selected.keys() and last_key in details.keys(): 250 | el_csv[key] = "%s (%d)" % (selected["details"][details[last_key]], selected["id"]) 251 | elif type(selected) is not dict and 
type(selected) is not list: 252 | el_csv[key] = selected 253 | else: 254 | el_csv[key] = "unknown" 255 | # And we write the row 256 | w.writerow(el_csv) 257 | 258 | @staticmethod 259 | def export_posts(posts, fmt, filename, tags_list=None, categories_list=None, users_list=None): 260 | """ 261 | Exports posts in specified format to specified file 262 | 263 | :param posts: the posts to export 264 | :param fmt: the export format (JSON or CSV) 265 | :param tags_list: a list of tags to associate them with tag ids 266 | :param categories_list: a list of categories to associate them with 267 | category ids 268 | :param user_list: a list of users to associate them with author id 269 | :return: the length of the list written to the file 270 | """ 271 | exported_posts = Exporter.setup_export(posts, 272 | [['title', 'rendered'], ['content', 'rendered'], ['excerpt', 'rendered']], 273 | { 274 | 'author': users_list, 275 | 'categories': categories_list, 276 | 'tags': tags_list, 277 | }) 278 | 279 | filename = Exporter.prepare_filename(filename, fmt) 280 | csv_keys = { 281 | 'id': 'id', 282 | 'date': 'date', 283 | 'modified': 'modified', 284 | 'status': 'status', 285 | 'link': 'link', 286 | 'title': ['title', 'rendered'], 287 | 'author': 'author' 288 | } 289 | details = { 290 | 'author': 'name', 291 | } 292 | Exporter.write_file(filename, fmt, csv_keys, exported_posts, details) 293 | return len(exported_posts) 294 | 295 | @staticmethod 296 | def export_categories(categories, fmt, filename, category_list=None): 297 | """ 298 | Exports categories in specified format to specified file. 
299 | 300 | :param categories: the categories to export 301 | :param fmt: the export format (JSON or CSV) 302 | :param filename: the path to the file to write 303 | :param category_list: the list of categories to be used as parents 304 | :return: the length of the list written to the file 305 | """ 306 | exported_categories = Exporter.setup_export(categories, # TODO 307 | [], 308 | { 309 | 'parent': category_list, 310 | }) 311 | 312 | filename = Exporter.prepare_filename(filename, fmt) 313 | 314 | csv_keys = { 315 | 'id': 'id', 316 | 'name': 'name', 317 | 'post_count': 'count', 318 | 'description': 'description', 319 | 'parent': 'parent' 320 | } 321 | details = { 322 | 'parent': 'name' 323 | } 324 | Exporter.write_file(filename, fmt, csv_keys, exported_categories, details) 325 | return len(exported_categories) 326 | 327 | @staticmethod 328 | def export_tags(tags, fmt, filename): 329 | """ 330 | Exports tags in specified format to specified file 331 | 332 | :param tags: the tags to export 333 | :param fmt: the export format (JSON or CSV) 334 | :param filename: the path to the file to write 335 | :return: the length of the list written to the file 336 | """ 337 | filename = Exporter.prepare_filename(filename, fmt) 338 | 339 | exported_tags = tags # It seems that no modification will be done for this one, so no deepcopy 340 | csv_keys = { 341 | 'id': 'id', 342 | 'name': 'name', 343 | 'post_count': 'count', # same source field as in export_categories 344 | 'description': 'description' 345 | } 346 | Exporter.write_file(filename, fmt, csv_keys, exported_tags) 347 | return len(exported_tags) 348 | 349 | @staticmethod 350 | def export_users(users, fmt, filename): 351 | """ 352 | Exports users in specified format to specified file.
353 | 354 | :param users: the users to export 355 | :param fmt: the export format (JSON or CSV) 356 | :param filename: the path to the file to write 357 | :return: the length of the list written to the file 358 | """ 359 | filename = Exporter.prepare_filename(filename, fmt) 360 | 361 | exported_users = users # It seems that no modification will be done for this one, so no deepcopy 362 | csv_keys = { 363 | 'id': 'id', 364 | 'name': 'name', 365 | 'link': 'link', 366 | 'description': 'description' 367 | } 368 | Exporter.write_file(filename, fmt, csv_keys, exported_users) 369 | return len(exported_users) 370 | 371 | @staticmethod 372 | def export_pages(pages, fmt, filename, parent_pages=None, users=None): 373 | """ 374 | Exports pages in specified format to specified file. 375 | 376 | :param pages: the pages to export 377 | :param fmt: the export format (JSON or CSV) 378 | :param filename: the path to the file to write 379 | :param parent_pages: the list of all cached pages, to get parents 380 | :param users: the list of all cached users, to get users 381 | :return: the length of the list written to the file 382 | """ 383 | exported_pages = Exporter.setup_export(pages, 384 | [["guid", "rendered"], ["title", "rendered"], ["content", "rendered"], ["excerpt", "rendered"]], 385 | { 386 | 'parent': parent_pages, 387 | 'author': users, 388 | }) 389 | 390 | filename = Exporter.prepare_filename(filename, fmt) 391 | csv_keys = { 392 | 'id': 'id', 393 | 'title': ['title', 'rendered'], 394 | 'date': 'date', 395 | 'modified': 'modified', 396 | 'status': 'status', 397 | 'link': 'link', 398 | 'author': 'author', 399 | 'protected': ['content', 'protected'] 400 | } 401 | details = { 402 | 'author': 'name' 403 | } 404 | Exporter.write_file(filename, fmt, csv_keys, exported_pages, details) 405 | return len(exported_pages) 406 | 407 | @staticmethod 408 | def export_media(media, fmt, filename, users=None): 409 | """ 410 | Exports media in specified format to specified file. 
411 | 412 | :param media: the media to export 413 | :param fmt: the export format (JSON or CSV) 414 | :param users: a list of users to associate them with author ids 415 | :return: the length of the list written to the file 416 | """ 417 | exported_media = Exporter.setup_export(media, 418 | [ 419 | ['guid', 'rendered'], 420 | ['title', 'rendered'], 421 | ['description', 'rendered'], 422 | ['caption', 'rendered'], 423 | ], 424 | { 425 | 'author': users, 426 | }) 427 | 428 | filename = Exporter.prepare_filename(filename, fmt) 429 | csv_keys = { 430 | 'id': 'id', 431 | 'title': ['title', 'rendered'], 432 | 'date': 'date', 433 | 'modified': 'modified', 434 | 'status': 'status', 435 | 'link': 'link', 436 | 'author': 'author', 437 | 'media_type': 'media_type' 438 | } 439 | details = { 440 | 'author': 'name' 441 | } 442 | Exporter.write_file(filename, fmt, csv_keys, exported_media, details) 443 | return len(exported_media) 444 | 445 | @staticmethod 446 | def export_namespaces(namespaces, fmt, filename): 447 | """ 448 | **NOT IMPLEMENTED** Exports namespaces in specified format to specified file. 449 | 450 | :param namespaces: the namespaces to export 451 | :param fmt: the export format (JSON or CSV) 452 | :return: the length of the list written to the file 453 | """ 454 | Console.log_info("Namespaces export not available yet") 455 | return 0 456 | 457 | # FIXME to be refactored 458 | @staticmethod 459 | def export_comments_interactive(comments, fmt, filename, parent_posts=None, users=None): 460 | """ 461 | Exports comments in specified format to specified file. 
462 | 463 | :param comments: the comments to export 464 | :param fmt: the export format (JSON or CSV) 465 | :param filename: the path to the file to write 466 | :param parent_posts: the list of all cached posts, to get parent posts (not used yet because this could be too verbose) 467 | :param users: the list of all cached users, to get users 468 | :return: the length of the list written to the file 469 | """ 470 | exported_comments = Exporter.setup_export(comments, 471 | [["content", "rendered"]], 472 | { 473 | 'post': parent_posts, 474 | 'author': users, 475 | }) 476 | 477 | # FIXME replacing the post ID by the post title in CSV mode doesn't work yet (nested keys) 478 | filename = Exporter.prepare_filename(filename, fmt) 479 | csv_keys = { 480 | 'id': 'id', 481 | 'post': 'post', 482 | 'date': 'date', 483 | 'status': 'status', 484 | 'link': 'link', 485 | 'author': 'author_name', 486 | } 487 | details = { 488 | 'post': ['title', 'rendered'] 489 | } 490 | Exporter.write_file(filename, fmt, csv_keys, exported_comments, details) 491 | return len(exported_comments) 492 | 493 | # TODO deprecated, to be moved to export_posts when HTML will be supported 494 | @staticmethod 495 | def export_posts_html(posts, folder, tags_list=None, categories_list=None, 496 | users_list=None): 497 | """ 498 | Exports posts as HTML to specified export folder. 
499 | 500 | :param posts: the posts to export 501 | :param folder: the export folder 502 | :param tags_list: a list of tags to associate them with tag ids 503 | :param categories_list: a list of categories to associate them with category ids 504 | :param user_list: a list of users to associate them with author id 505 | :return: the length of the list written to the file 506 | """ 507 | exported_posts = 0 508 | 509 | date_format = "%Y-%m-%dT%H:%M:%S-%Z" 510 | 511 | if not os.path.isdir(folder): 512 | os.makedirs(folder) 513 | for post in posts: 514 | post_file = None 515 | if 'slug' in post.keys(): 516 | post_file = open(os.path.join(folder, post['slug'])+".html", 517 | "wt", encoding="utf-8") 518 | else: 519 | post_file = open(os.path.join(folder, str(post['id']))+".html", 520 | "wt", encoding="utf-8") 521 | 522 | title = "Unknown" 523 | if 'title' in post.keys() and 'rendered' in post['title'].keys(): 524 | title = post['title']['rendered'] 525 | 526 | date_gmt = "Unknown" 527 | if 'date_gmt' in post.keys(): 528 | date_gmt = datetime.strptime(post['date_gmt'] + 529 | "-GMT", date_format) 530 | modified_gmt = "Unknown" 531 | if 'modified_gmt' in post.keys(): 532 | modified_gmt = datetime.strptime(post['modified_gmt'] + 533 | "-GMT", date_format) 534 | status = "Unknown" 535 | if 'status' in post.keys(): 536 | status = post['status'] 537 | 538 | post_type = "Unknown" 539 | if 'type' in post.keys(): 540 | post_type = post['type'] 541 | 542 | link = "Unknown" 543 | if 'link' in post.keys(): 544 | link = html.escape(post['link']) 545 | 546 | comments = "Unknown" 547 | if 'comment_status' in post.keys(): 548 | comments = html.escape(post['comment_status']) 549 | 550 | content = "Unknown" 551 | if 'content' in post.keys() and 'rendered' in \ 552 | post['content'].keys(): 553 | content = post['content']['rendered'] 554 | 555 | excerpt = "Unknown" 556 | if 'excerpt' in post.keys() and 'rendered' in \ 557 | post['excerpt'].keys(): 558 | excerpt = post['excerpt']['rendered'] 
559 | 560 | author = "Unknown" 561 | if 'author' in post.keys() and users_list is not None: 562 | author_obj = get_by_id(users_list, post['author']) 563 | author = "%d: " % post['author'] 564 | if author_obj is not None: 565 | if 'name' in author_obj.keys(): 566 | author += author_obj['name'] 567 | if 'slug' in author_obj.keys(): 568 | author += "(%s)" % author_obj['slug'] 569 | if 'link' in author_obj.keys(): 570 | author += " - <a href=\"%s\">%s</a>" % \ 571 | (author_obj['link'], author_obj['link']) 572 | elif 'author' in post.keys(): 573 | author = str(post['author']) 574 | 575 | categories = "<li>Unknown</li>" 576 | if 'categories' in post.keys() and categories_list is not None: 577 | categories = "" 578 | for cat in post['categories']: 579 | cat_obj = get_by_id(categories_list, cat) 580 | categories += "<li>%d: " % cat 581 | if cat_obj is not None: 582 | if 'name' in cat_obj.keys(): 583 | categories += cat_obj['name'] 584 | if 'link' in cat_obj.keys(): 585 | categories += " - <a href=\"%s\">%s</a>" % \ 586 | (html.escape(cat_obj['link']), 587 | html.escape(cat_obj['link'])) 588 | categories += "</li>" 589 | elif 'categories' in post.keys(): 590 | categories = "" 591 | for cat in post['categories']: 592 | categories += "<li>" + str(cat) + "</li>" 593 | 594 | tags = "<li>Unknown</li>" 595 | if 'tags' in post.keys() and tags_list is not None: 596 | tags = "" 597 | for tag in post['tags']: 598 | tag_obj = get_by_id(tags_list, tag) 599 | tags += "<li>%d: " % tag 600 | if tag_obj is not None: 601 | if 'name' in tag_obj.keys(): 602 | tags += tag_obj['name'] 603 | if 'link' in tag_obj.keys(): 604 | tags += " - <a href=\"%s\">%s</a>" % \ 605 | (html.escape(tag_obj['link']), 606 | html.escape(tag_obj['link'])) 607 | tags += "</li>" 608 | elif 'tags' in post.keys(): 609 | tags = "" 610 | for cat in post['tags']: 611 | tags += "<li>" + str(cat) + "</li>" 612 | 613 | buffer = \ 614 | """ 615 | 616 | 617 | {title} 618 | 619 | 620 |
    621 |

    Metadata

    622 | 643 |
    644 |
    645 |

    Excerpt

    646 | {excerpt} 647 |
    648 |
    649 |

    {title}

    650 | {content} 651 |
    652 | 653 | 654 | """ 655 | buffer = buffer.format( 656 | title=title, 657 | date_gmt=date_gmt.strftime("%d/%m/%Y %H:%M:%S"), 658 | modified_gmt=modified_gmt.strftime("%d/%m/%Y %H:%M:%S"), 659 | status=status, 660 | post_type=post_type, 661 | link=link, 662 | author=author, 663 | comments=comments, 664 | categories=categories, 665 | tags=tags, 666 | excerpt=excerpt, 667 | content=content 668 | ) 669 | 670 | post_file.write(buffer) 671 | post_file.close() 672 | exported_posts += 1 673 | 674 | return exported_posts 675 | 676 | @staticmethod 677 | def export_comments(posts, orphan_comments, export_folder): 678 | """ 679 | Exports comments from posts and from orphans list 680 | """ 681 | exported_comments = 0 682 | for post in posts: 683 | if 'comments' in post.keys() and len(post['comments']) > 0: 684 | for comment in post['comments']: 685 | if 'slug' in post.keys() and len(post['slug']) > 0: 686 | Exporter.export_comments_helper(comment, post['slug'], export_folder) 687 | else: 688 | Exporter.export_comments_helper(comment, post['id'], export_folder) 689 | exported_comments += 1 690 | for comment in orphan_comments: 691 | Exporter.export_comments_helper(comment, '__orphan_comments', export_folder) 692 | exported_comments += 1 693 | return exported_comments 694 | 695 | @staticmethod 696 | def export_comments_helper(comment, post, export_folder): 697 | date_format = "%Y-%m-%dT%H:%M:%S-%Z" 698 | if not os.path.isdir(export_folder): 699 | os.mkdir(export_folder) 700 | if not os.path.isdir(os.path.join(export_folder, post)): 701 | os.mkdir(os.path.join(export_folder, post)) 702 | out_file = open(os.path.join(export_folder, post, "%04d.html" % comment['id']), "wt", encoding="utf-8") 703 | date_gmt = "Unknown" 704 | if 'date_gmt' in comment.keys(): 705 | date_gmt = datetime.strptime(comment['date_gmt'] + 706 | "-GMT", date_format) 707 | post_link = "None" 708 | if '_links' in comment.keys() and 'up' in comment['_links'].keys() and len(comment['_links'].keys()) > 0 and 
'href' in comment['_links']['up'][0].keys(): 709 | post_link = html.escape(comment['_links']['up'][0]['href']) 710 | buffer = """ 711 | 712 | 713 | 714 | {author} 715 | 716 | 717 |
    718 |

    Metadata

    719 | 728 |
    729 |
    730 |

    {author} on {post_title}

    731 | {content} 732 |
    733 | 734 | 735 | """ 736 | buffer = buffer.format( 737 | author=html.escape(comment["author_name"]), 738 | author_url=html.escape(comment['author_url']), 739 | date_gmt=date_gmt.strftime("%d/%m/%Y %H:%M:%S"), 740 | status=html.escape(comment['status']), 741 | link=html.escape(comment['link']), 742 | content=html.escape(comment['content']['rendered']), 743 | post_title=html.escape(post), 744 | post_id=int(comment['post']), 745 | post_link=post_link 746 | ) 747 | out_file.write(buffer) 748 | out_file.close() 749 | -------------------------------------------------------------------------------- /lib/infodisplayer.py: -------------------------------------------------------------------------------- 1 | """ 2 | Copyright (c) 2018-2020 Mickaël "Kilawyn" Walter 3 | 4 | Permission is hereby granted, free of charge, to any person obtaining a copy 5 | of this software and associated documentation files (the "Software"), to deal 6 | in the Software without restriction, including without limitation the rights 7 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 8 | copies of the Software, and to permit persons to whom the Software is 9 | furnished to do so, subject to the following conditions: 10 | 11 | The above copyright notice and this permission notice shall be included in all 12 | copies or substantial portions of the Software. 13 | 14 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 15 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 16 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 17 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 18 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 19 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 20 | SOFTWARE. 
21 | """ 22 | 23 | import html 24 | import csv 25 | from datetime import datetime 26 | 27 | from lib.console import Console 28 | 29 | class InfoDisplayer: 30 | """ 31 | Static class to display information for different categories 32 | """ 33 | 34 | @staticmethod 35 | def display_basic_info(information): 36 | """ 37 | Displays basic information about the WordPress instance 38 | :param information: information as a JSON object 39 | """ 40 | print() 41 | 42 | if 'name' in information.keys(): 43 | print("Site name: %s" % html.unescape(information['name'])) 44 | 45 | if 'description' in information.keys(): 46 | print("Site description: %s" % 47 | html.unescape(information['description'])) 48 | 49 | if 'home' in information.keys(): 50 | print("Site home: %s" % html.unescape(information['home'])) 51 | 52 | if 'gmt_offset' in information.keys(): 53 | timezone_string = "" 54 | gmt_offset = str(information['gmt_offset']) 55 | if '-' not in gmt_offset: 56 | gmt_offset = '+' + gmt_offset 57 | if 'timezone_string' in information.keys(): 58 | timezone_string = information['timezone_string'] 59 | print("Site Timezone: %s (GMT%s)" % (timezone_string, gmt_offset)) 60 | 61 | if 'namespaces' in information.keys(): 62 | print('Namespaces (API provided by addons):') 63 | ns_ref = {} 64 | try: 65 | ns_ref_file = open("lib/plugins/plugin_list.csv", "rt") 66 | ns_ref_reader = csv.reader(ns_ref_file) 67 | for row in ns_ref_reader: 68 | desc = None 69 | url = None 70 | if len(row) > 1 and len(row[1]) > 0: 71 | desc = row[1] 72 | if len(row) > 2 and len(row[2]) > 0: 73 | url = row[2] 74 | ns_ref[row[0]] = {"desc": desc, "url": url} 75 | ns_ref_file.close() 76 | except Exception: 77 | Console.log_error("Could not load namespaces reference file") 78 | for ns in information['namespaces']: 79 | tip = "" 80 | if ns in ns_ref.keys(): 81 | if ns_ref[ns]['desc'] is not None: 82 | if tip == "": 83 | tip += " - " 84 | tip += ns_ref[ns]['desc'] 85 | if ns_ref[ns]['url'] is not None: 86 | # A single separator goes before the URL, with or without a description 87 | tip += " - " 88 | tip += ns_ref[ns]['url'] 89 | print(' %s%s' % (ns, tip)) 90 | 91 | # TODO, dive into authentication 92 | print() 93 | 94 | @staticmethod 95 | def display_namespaces(information, details=False): 96 | """ 97 | Displays namespace list of the WordPress API 98 | 99 | :param information: information as a JSON object 100 | :param details: unused, available for compatibility purposes 101 | """ 102 | print() 103 | if information is not None: 104 | for ns in information: 105 | print("* %s" % ns) 106 | print() 107 | 108 | @staticmethod 109 | def display_endpoints(information): 110 | """ 111 | Displays endpoint documentation of the WordPress API 112 | :param information: information as a JSON object 113 | """ 114 | print() 115 | 116 | if 'routes' not in information.keys(): 117 | Console.log_error("Did not find the routes for endpoint discovery") 118 | return None 119 | 120 | for url, route in information['routes'].items(): 121 | print("%s (Namespace: %s)" % (url, route['namespace'])) 122 | for endpoint in route['endpoints']: 123 | methods = " " 124 | first = True 125 | for method in endpoint['methods']: 126 | if first: 127 | methods += method 128 | first = False 129 | else: 130 | methods += ", " + method 131 | print(methods) 132 | if len(endpoint['args']) > 0: 133 | for arg, props in endpoint['args'].items(): 134 | required = "" 135 | if props['required']: 136 | required = " (required)" 137 | print(" " + arg + required) 138 | if 'type' in props.keys(): 139 | print(" type: " + str(props['type'])) 140 | if 'default' in props.keys(): 141 | print(" default: " + 142 | str(props['default'])) 143 | if 'enum' in props.keys(): 144 | allowed = " allowed values: " 145 | first = True 146 | for val in props['enum']: 147 | if first: 148 | allowed += str(val) 149 | first = False 150 | else: 151 | allowed += ", " + str(val) 152 | print(allowed) 153 | if 'description' in props.keys(): 154 | print(" " + str(props['description'])) 155 | print() 156 | 157 | @staticmethod 158 | def
display_posts(information, orphan_comments=[], details=False): 159 | """ 160 | Displays posts published on the WordPress instance 161 | param information: information as a JSON object 162 | """ 163 | print() 164 | date_format = "%Y-%m-%dT%H:%M:%S-%Z" 165 | for post in information: 166 | if post is not None: 167 | line = "" 168 | if 'id' in post.keys(): 169 | line += "ID: %d" %post['id'] 170 | if 'title' in post.keys(): 171 | line += " - " + html.unescape(post['title']['rendered']) 172 | if 'date_gmt' in post.keys(): 173 | date_gmt = datetime.strptime(post['date_gmt'] + 174 | "-GMT", date_format) 175 | line += " on %s" % \ 176 | date_gmt.strftime("%d/%m/%Y at %H:%M:%S") 177 | if 'link' in post.keys(): 178 | line += " - " + post['link'] 179 | if details: 180 | if 'slug' in post.keys(): 181 | line += "\nSlug: " + post['slug'] 182 | if 'status' in post.keys(): 183 | line += "\nStatus: " + post['status'] 184 | if 'author' in post.keys(): 185 | line += "\nAuthor ID: %d" % post['author'] 186 | if 'comment_status' in post.keys(): 187 | line += "\nComment status: " + post['comment_status'] 188 | if 'template' in post.keys() and len(post['template']) > 0: 189 | line += "\nTemplate: " + post['template'] 190 | if 'categories' in post.keys() and len(post['categories']) > 0: 191 | line += "\nCategory IDs: " 192 | for cat in post['categories']: 193 | line += "%d, " % cat 194 | line = line[:-2] 195 | if 'excerpt' in post.keys(): 196 | line += "\nExcerpt: " 197 | if 'protected' in post['excerpt'].keys() and post['excerpt']['protected']: 198 | line += "" 199 | elif 'rendered' in post['excerpt'].keys(): 200 | line += "\n" + html.unescape(post['excerpt']['rendered']) 201 | if 'content' in post.keys(): 202 | line += "\nContent: " 203 | if 'protected' in post['content'].keys() and post['content']['protected']: 204 | line += "" 205 | elif 'rendered' in post['content'].keys(): 206 | line += "\n" + html.unescape(post['content']['rendered']) 207 | if 'comments' in post.keys(): 208 | for 
comment in post['comments']: 209 | line += "\n\t * Comment by %s from (%s) - %s" % (comment['author_name'], comment['author_url'], comment['link']) 210 | print(line) 211 | 212 | if len(orphan_comments) > 0: 213 | # TODO: Untested code, may never be executed, I don't know how the REST API and WordPress handle post/comment link in back-end 214 | print() 215 | print("Found orphan comments! Check them right below:") 216 | for comment in orphan_comments: 217 | print("\t * Comment by %s from (%s) on post ID %d - %s" % (comment['author_name'], comment['author_url'], comment['post'], comment['link'])) 218 | print() 219 | 220 | @staticmethod 221 | def display_comments(information, details=False): 222 | """ 223 | Displays comments published on the WordPress instance. 224 | 225 | :param information: information as a JSON object 226 | :param details: if the details should be displayed 227 | """ 228 | print() 229 | date_format = "%Y-%m-%dT%H:%M:%S-%Z" 230 | for comment in information: 231 | if comment is not None: 232 | line = "" 233 | if 'id' in comment.keys(): 234 | line += "ID: %d" % comment['id'] 235 | if 'post' in comment.keys(): 236 | line += " - Post ID: %d" % comment['post'] #html.unescape(post['title']['rendered']) 237 | if 'author_name' in comment.keys(): 238 | line += " - By %s" % comment['author_name'] 239 | if 'date_gmt' in comment.keys(): 240 | date_gmt = datetime.strptime(comment['date_gmt'] + 241 | "-GMT", date_format) 242 | line += " on %s" % \ 243 | date_gmt.strftime("%d/%m/%Y at %H:%M:%S") 244 | if details: 245 | if 'parent' in comment.keys() and comment['parent'] != 0: 246 | line += "\nParent ID: %d" % comment['parent'] 247 | if 'link' in comment.keys(): 248 | line += "\nLink: " + comment['link'] 249 | if 'status' in comment.keys(): 250 | line += "\nStatus: " + comment['status'] 251 | if 'author_url' in comment.keys() and len(comment['author_url']) > 0: 252 | line += "\nAuthor URL: " + comment['author_url'] 253 | if 'content' in comment.keys(): 254 | line +=
"\nContent: \n" + html.unescape(comment['content']['rendered']) 255 | print(line) 256 | print() 257 | 258 | @staticmethod 259 | def display_users(information, details=False): 260 | """ 261 | Displays users on the WordPress instance 262 | 263 | :param information: information as a JSON object 264 | :param details: display more details about the user 265 | """ 266 | print() 267 | for user in information: 268 | if user is not None: 269 | line = "" 270 | if 'id' in user.keys(): 271 | line += "User ID: %d\n" % user['id'] 272 | if 'name' in user.keys(): 273 | line += " Display name: %s\n" % user['name'] 274 | if 'slug' in user.keys(): 275 | line += " User name (probable): %s\n" % user['slug'] 276 | if 'description' in user.keys(): 277 | line += " User description: %s\n" % user['description'] 278 | if 'url' in user.keys(): 279 | line += " User website: %s\n" % user['url'] 280 | if 'link' in user.keys(): 281 | line += " User personal page: %s\n" % user['link'] 282 | if details: 283 | if "avatar_urls" in user.keys() and type(user["avatar_urls"]) is dict and len(user["avatar_urls"].keys()) > 0: 284 | line += " Avatars: \n" 285 | for key, value in user["avatar_urls"].items(): 286 | line += " * %s: %s\n" % (key, value) 287 | print(line) 288 | print() 289 | 290 | @staticmethod 291 | def display_categories(information, details=False): 292 | """ 293 | Displays categories of the WordPress instance 294 | param information: information as a JSON object 295 | """ 296 | print() 297 | for category in information: 298 | if category is not None: 299 | line = "" 300 | if 'id' in category.keys(): 301 | line += "Category ID: %d\n" % category['id'] 302 | if 'name' in category.keys(): 303 | line += " Name: %s\n" % category['name'] 304 | if 'description' in category.keys(): 305 | line += " Description: %s\n" % category['description'] 306 | if 'count' in category.keys(): 307 | line += " Number of posts: %d\n" % category['count'] 308 | if 'link' in category.keys(): 309 | line += " Page: %s\n" % 
category['link'] 310 | if details: 311 | if 'slug' in category.keys(): 312 | line += " Slug: %s\n" % category['slug'] 313 | if 'taxonomy' in category.keys(): 314 | line += " Taxonomy: %s\n" % category['taxonomy'] 315 | if 'parent' in category.keys(): 316 | line += " Parent category: " 317 | if type(category['parent']) is str: 318 | line += category['parent'] 319 | elif type(category['parent']) is int: 320 | line += "%d" % category['parent'] 321 | else: 322 | line += "Unknown" 323 | line += "\n" 324 | print(line) 325 | print() 326 | 327 | @staticmethod 328 | def display_tags(information, details=False): 329 | """ 330 | Displays tags of the WordPress instance 331 | :param information: information as a JSON object 332 | """ 333 | print() 334 | for tag in information: 335 | if tag is not None: 336 | line = "" 337 | if 'id' in tag.keys(): 338 | line += "Tag ID: %d\n" % tag['id'] 339 | if 'name' in tag.keys(): 340 | line += " Name: %s\n" % tag['name'] 341 | if 'description' in tag.keys(): 342 | line += " Description: %s\n" % tag['description'] 343 | if 'count' in tag.keys(): 344 | line += " Number of posts: %d\n" % tag['count'] 345 | if 'link' in tag.keys(): 346 | line += " Page: %s\n" % tag['link'] 347 | if details: 348 | if 'slug' in tag.keys(): 349 | line += " Slug: %s\n" % tag['slug'] 350 | if 'taxonomy' in tag.keys(): 351 | line += " Taxonomy: %s\n" % tag['taxonomy'] 352 | print(line) 353 | print() 354 | 355 | @staticmethod 356 | def display_media(information, details=False): 357 | """ 358 | Displays media objects of the WordPress instance 359 | 360 | :param information: information as a JSON object 361 | :param details: if the details should be displayed 362 | """ 363 | print() 364 | date_format = "%Y-%m-%dT%H:%M:%S-%Z" 365 | for media in information: 366 | if media is not None: 367 | line = "" 368 | if 'id' in media.keys(): 369 | line += "Media ID: %d\n" % media['id'] 370 | if 'title' in media.keys() and 'rendered' in media['title']: 371 | line += " Media title: %s\n" % \ 
372 | html.unescape(media['title']['rendered']) 373 | if 'date_gmt' in media.keys(): 374 | date_gmt = datetime.strptime(media['date_gmt'] + 375 | "-GMT", date_format) 376 | line += " Upload date (GMT): %s\n" % \ 377 | date_gmt.strftime("%d/%m/%Y %H:%M:%S") 378 | if 'media_type' in media.keys(): 379 | line += " Media type: %s\n" % media['media_type'] 380 | if 'mime_type' in media.keys(): 381 | line += " Mime type: %s\n" % media['mime_type'] 382 | if 'link' in media.keys(): 383 | line += " Page: %s\n" % media['link'] 384 | if 'source_url' in media.keys(): 385 | line += " Source URL: %s\n" % media['source_url'] 386 | if details: 387 | if 'slug' in media.keys(): 388 | line += "Slug: " + media['slug'] + "\n" 389 | if 'status' in media.keys(): 390 | line += "Status: " + media['status'] + "\n" 391 | if 'type' in media.keys(): 392 | line += "Type: " + media['type'] + "\n" 393 | if 'author' in media.keys(): 394 | line += "Author ID: %d\n" % media['author'] 395 | if 'alt_text' in media.keys(): 396 | line += "Alt text: " + media['alt_text'] + "\n" 397 | if 'comment_status' in media.keys(): 398 | line += "Comment status: " + media['comment_status'] + "\n" 399 | if 'post' in media.keys(): 400 | line += "Post or page ID: %d\n" % media['post'] 401 | if 'description' in media.keys() and media['description']['rendered']: 402 | line += "Description: \n" + html.unescape(media['description']['rendered']) + "\n" 403 | if 'caption' in media.keys() and media['caption']['rendered']: 404 | line += "Caption: \n" + html.unescape(media['caption']['rendered']) + "\n" 405 | print(line) 406 | print() 407 | 408 | @staticmethod 409 | def display_pages(information, details=False): 410 | """ 411 | Displays pages published on the WordPress instance 412 | 413 | :param information: information as a JSON object 414 | :param details: if the details should be displayed 415 | """ 416 | print() 417 | for page in information: 418 | if page is not None: 419 | line = "" 420 | if 'id' in page.keys(): 421 | line 
+= "ID: %d" % page['id'] 422 | if 'title' in page.keys() and 'rendered' in page['title']: 423 | line += " - " + html.unescape(page['title']['rendered']) 424 | if 'link' in page.keys(): 425 | line += " - " + page['link'] 426 | if details: 427 | if 'slug' in page.keys(): 428 | line += "\nSlug: " + page['slug'] 429 | if 'status' in page.keys(): 430 | line += "\nStatus: " + page['status'] 431 | if 'author' in page.keys(): 432 | line += "\nAuthor ID: %d" % page['author'] 433 | if 'comment_status' in page.keys(): 434 | line += "\nComment status: " + page['comment_status'] 435 | if 'template' in page.keys() and len(page['template']) > 0: 436 | line += "\nTemplate: " + page['template'] 437 | if 'parent' in page.keys(): 438 | if page['parent'] == 0: 439 | line += "\nParent: none" 440 | else: 441 | line += "\nParent ID: %d" % page['parent'] 442 | if 'excerpt' in page.keys(): 443 | line += "\nExcerpt: " 444 | if 'protected' in page['excerpt'].keys() and page['excerpt']['protected']: 445 | line += "" 446 | elif 'rendered' in page['excerpt'].keys(): 447 | line += "\n" + html.unescape(page['excerpt']['rendered']) 448 | if 'content' in page.keys(): 449 | line += "\nContent: " 450 | if 'protected' in page['content'].keys() and page['content']['protected']: 451 | line += "" 452 | elif 'rendered' in page['content'].keys(): 453 | line += "\n" + html.unescape(page['content']['rendered']) 454 | print(line) 455 | print() 456 | 457 | @staticmethod 458 | def recurse_list_or_dict(data, tab): 459 | """ 460 | Helper function to generate recursive display of API data 461 | """ 462 | if type(data) is not dict and type(data) is not list: 463 | return tab + str(data) 464 | 465 | line = "" 466 | if type(data) is list: 467 | i = 0 468 | length = len(data) 469 | for value in data: 470 | do_jmp = True 471 | if type(value) is dict or type(value) is list: 472 | line += InfoDisplayer.recurse_list_or_dict(value, tab+"\t") 473 | elif type(value) is str: 474 | if "\n" in value: 475 | line += "\n" + tab + 
"\t" 476 | line += value.replace("\n", "\n"+tab+"\t") 477 | else: 478 | line += " " 479 | line += value.replace("\n", "\n"+tab) 480 | do_jmp = False 481 | else: 482 | line += " " + str(value) 483 | if i < length and do_jmp: 484 | line += "\n" 485 | i += 1 486 | else: 487 | for key,value in data.items(): 488 | line += "\n" + tab + key 489 | if type(value) is dict or type(value) is list: 490 | line += InfoDisplayer.recurse_list_or_dict(value, tab+"\t") 491 | elif type(value) is str: 492 | if "\n" in value: 493 | line += "\n" + tab + "\t" 494 | line += value.replace("\n", "\n"+tab+"\t") 495 | else: 496 | line += " " 497 | line += value.replace("\n", "\n"+tab) 498 | else: 499 | line += " " + str(value) 500 | return line 501 | 502 | @staticmethod 503 | def display_crawled_ns(information): 504 | """ 505 | Displays endpoints details published on the WordPress instance 506 | param information: information as a JSON object 507 | """ 508 | print() 509 | for url,data in information.items(): 510 | line = "\n" 511 | line += url 512 | tab = "\t" 513 | line += InfoDisplayer.recurse_list_or_dict(data, tab) 514 | print(line) 515 | print() 516 | -------------------------------------------------------------------------------- /lib/interactive.py: -------------------------------------------------------------------------------- 1 | """ 2 | Copyright (c) 2018-2020 Mickaël "Kilawyn" Walter 3 | 4 | Permission is hereby granted, free of charge, to any person obtaining a copy 5 | of this software and associated documentation files (the "Software"), to deal 6 | in the Software without restriction, including without limitation the rights 7 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 8 | copies of the Software, and to permit persons to whom the Software is 9 | furnished to do so, subject to the following conditions: 10 | 11 | The above copyright notice and this permission notice shall be included in all 12 | copies or substantial portions of the Software. 
13 | 14 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 15 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 16 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 17 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 18 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 19 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 20 | SOFTWARE. 21 | """ 22 | 23 | import cmd 24 | import argparse 25 | import shlex 26 | import sys 27 | import re 28 | import copy 29 | import os 30 | 31 | from lib.wpapi import WPApi, WordPressApiNotV2 32 | from lib.requestsession import RequestSession 33 | from lib.console import Console 34 | from lib.infodisplayer import InfoDisplayer 35 | from lib.exporter import Exporter 36 | from lib.utils import get_by_id 37 | 38 | class ArgumentParser(argparse.ArgumentParser): 39 | """ 40 | Wrapper for argparse.ArgumentParser (especially the help function that quits the application after display) 41 | """ 42 | def __init__(self, prog="", description=""): 43 | argparse.ArgumentParser.__init__(self, prog=prog, add_help=False, description=description) 44 | self.add_argument("--help", "-h", help="print this help", action="store_true") 45 | self.should_help = True 46 | 47 | def custom_parse_args(self, args): 48 | args = self.parse_args(shlex.split(args)) 49 | if args.help: 50 | if self.should_help: 51 | self.print_help(sys.stdout) 52 | print() 53 | self.should_help = False 54 | return None 55 | if self.should_help: 56 | return args 57 | else: 58 | return None 59 | 60 | def error(self, message): 61 | if self.should_help: 62 | self.print_help(sys.stdout) 63 | print() 64 | self.should_help = False 65 | 66 | class InteractiveShell(cmd.Cmd): 67 | """ 68 | The interactive shell for the application 69 | """ 70 | intro = """ 71 | Entering interactive session 72 | Use the 'help' command to get a list of 
available commands and parameters, 'exit' to quit 73 | `command -h` gives more details about a command 74 | """ 75 | prompt = "> " 76 | 77 | def __init__(self, target, session, version): 78 | cmd.Cmd.__init__(self) 79 | self.target = target 80 | InteractiveShell.prompt = Console.red + target + Console.normal + " > " 81 | self.session = session 82 | self.version = version 83 | self.scanner = WPApi(self.target, session=session) 84 | 85 | @staticmethod 86 | def export_decorator(export_func, is_all, export_str, json, csv, values, kwargs = {}): 87 | if json is not None: 88 | json_file = json 89 | if is_all: 90 | json_file = json + "-" + export_str 91 | args = [values] 92 | args.append(Exporter.JSON) 93 | args.append(json_file) 94 | export_func(*args, **kwargs) 95 | if csv is not None: 96 | csv_file = csv 97 | if is_all: 98 | csv_file = csv + "-" + export_str 99 | args = [values] 100 | args.append(Exporter.CSV) 101 | args.append(csv_file) 102 | export_func(*args, **kwargs) 103 | 104 | def get_fetch_or_list_type(self, obj_type, plural=False): 105 | """ 106 | Returns a dict containing all necessary metadata 107 | about the obj_type to list and fetch data 108 | 109 | :param obj_type: the type of the object 110 | :param plural: whether the name must be plural or not 111 | """ 112 | display_func = None 113 | export_func = None 114 | additional_info = {} 115 | obj_name = "" 116 | if obj_type == WPApi.USER: 117 | display_func = InfoDisplayer.display_users 118 | export_func = Exporter.export_users 119 | additional_info = {} 120 | obj_name = "Users" if plural else "User" 121 | elif obj_type == WPApi.TAG: 122 | display_func = InfoDisplayer.display_tags 123 | export_func = Exporter.export_tags 124 | additional_info = {} 125 | obj_name = "Tags" if plural else "Tag" 126 | elif obj_type == WPApi.CATEGORY: 127 | display_func = InfoDisplayer.display_categories 128 | export_func = Exporter.export_categories 129 | additional_info = { 130 | 'category_list': self.scanner.categories 131 | } 
132 | obj_name = "Categories" if plural else "Category" 133 | elif obj_type == WPApi.POST: 134 | display_func = InfoDisplayer.display_posts 135 | export_func = Exporter.export_posts 136 | additional_info = { 137 | 'tags_list': self.scanner.tags, 138 | 'categories_list': self.scanner.categories, 139 | 'users_list': self.scanner.users 140 | } 141 | obj_name = "Posts" if plural else "Post" 142 | elif obj_type == WPApi.PAGE: 143 | display_func = InfoDisplayer.display_pages 144 | export_func = Exporter.export_pages 145 | additional_info = { 146 | 'parent_pages': self.scanner.pages, 147 | 'users': self.scanner.users 148 | } 149 | obj_name = "Pages" if plural else "Page" 150 | elif obj_type == WPApi.COMMENT: 151 | display_func = InfoDisplayer.display_comments 152 | export_func = Exporter.export_comments_interactive 153 | additional_info = { 154 | #'parent_posts': self.scanner.posts, # May be too verbose 155 | 'users': self.scanner.users 156 | } 157 | obj_name = "Comments" if plural else "Comment" 158 | elif obj_type == WPApi.MEDIA: 159 | display_func = InfoDisplayer.display_media 160 | export_func = Exporter.export_media 161 | additional_info = {'users': self.scanner.users} 162 | obj_name = "Media" 163 | elif obj_type == WPApi.NAMESPACE: 164 | display_func = InfoDisplayer.display_namespaces 165 | export_func = Exporter.export_media 166 | additional_info = {} 167 | obj_name = "Namespaces" if plural else "Namespace" 168 | 169 | return { 170 | "display_func": display_func, 171 | "export_func": export_func, 172 | "additional_info": additional_info, 173 | "obj_name": obj_name 174 | } 175 | 176 | def fetch_obj(self, obj_type, obj_id, cache=True, json=None, csv=None): 177 | """ 178 | Displays and exports (if relevant) the object fetched by ID 179 | 180 | :param obj_type: the type of the object 181 | :param obj_id: the ID of the object 182 | :param cache: whether to use the cache or not 183 | :param json: json export filename 184 | :param csv: csv export filename 185 | """ 186 | 
prop = self.get_fetch_or_list_type(obj_type) 187 | print(prop["obj_name"] + " details") 188 | try: 189 | obj = self.scanner.get_obj_by_id(obj_type, obj_id, use_cache=cache) 190 | if len(obj) == 0: 191 | Console.log_info(prop["obj_name"] + " not found\n") 192 | else: 193 | prop["display_func"](obj, details=True) 194 | if len(prop["additional_info"].keys()) > 0: 195 | InteractiveShell.export_decorator(prop["export_func"], False, "", json, csv, obj, prop["additional_info"]) 196 | else: 197 | InteractiveShell.export_decorator(prop["export_func"], False, "", json, csv, obj) 198 | except WordPressApiNotV2: 199 | Console.log_error("The API does not support WP V2") 200 | except IOError as e: 201 | Console.log_error("Could not open %s for writing" % e.filename) 202 | print() 203 | 204 | def list_obj(self, obj_type, start, limit, is_all=False, cache=True, json=None, csv=None): 205 | """ 206 | Displays and exports (if relevant) the object list 207 | 208 | :param obj_type: the type of the object 209 | :param start: the offset of the first object 210 | :param limit: the maximum number of objects to list 211 | :param is_all: are all object types requested? 
212 | :param cache: whether to use the cache or not 213 | :param json: json export filename 214 | :param csv: csv export filename 215 | """ 216 | prop = self.get_fetch_or_list_type(obj_type, plural=True) 217 | print(prop["obj_name"] + " details") 218 | try: 219 | kwargs = {} 220 | if obj_type == WPApi.POST: 221 | kwargs = {"comments": False} 222 | obj_list = self.scanner.get_obj_list(obj_type, start, limit, cache, kwargs=kwargs) 223 | prop["display_func"](obj_list) 224 | InteractiveShell.export_decorator(prop["export_func"], is_all, prop["obj_name"].lower(), json, csv, obj_list) 225 | except WordPressApiNotV2: 226 | Console.log_error("The API does not support WP V2") 227 | except IOError as e: 228 | Console.log_error("Could not open %s for writing" % e.filename) 229 | print() 230 | 231 | def do_exit(self, arg): 232 | 'Exit wp-json-scraper' 233 | return True 234 | 235 | def do_show(self, arg): 236 | 'Shows information about parameters in memory' 237 | parser = ArgumentParser(prog='show', description='show information about global parameters') 238 | parser.add_argument("what", nargs='?', choices=['all', 'target', 'proxy', 'cookies', 'credentials', 'version'], 239 | help='choose the information to be displayed', default='all') 240 | args = parser.custom_parse_args(arg) 241 | if args is None: 242 | return 243 | if args.what == 'all' or args.what == 'target': 244 | print("Target: %s" % self.target) 245 | if args.what == 'all' or args.what == 'proxy': 246 | proxies = self.session.get_proxies() 247 | if proxies is not None and len(proxies) > 0: 248 | print("Proxies:") 249 | for key, value in proxies.items(): 250 | print("\t%s: %s" % (key, value)) 251 | else: 252 | print("Proxy: none") 253 | if args.what == 'all' or args.what == 'cookies': 254 | cookies = self.session.get_cookies() 255 | if len(cookies) > 0: 256 | print("Cookies:") 257 | for key, value in cookies.items(): 258 | print("\t%s: %s" % (key, value)) 259 | else: 260 | print("Cookies: none") 261 | if args.what == 'all' or 
args.what == 'credentials': 262 | credentials = self.session.get_creds() 263 | if credentials is not None: 264 | creds_str = "Credentials: " 265 | for el in credentials: 266 | creds_str += el + ":" 267 | print(creds_str[:-1]) 268 | else: 269 | print("Credentials: none") 270 | if args.what == 'all' or args.what == 'version': 271 | print("WPJsonScraper version: %s" % self.version) 272 | print() 273 | 274 | def do_set(self, arg): 275 | 'Sets a global parameter of WPJsonScanner' 276 | parser = ArgumentParser(prog='set', description='sets global parameters for WPJsonScanner') 277 | parser.add_argument("what", choices=['target', 'proxy', 'cookies', 'credentials'], 278 | help='the parameter to set') 279 | parser.add_argument("value", type=str, help='the new value of the parameter (for cookies, set as cookie string: "n1=v1; n2=v2")') 280 | args = parser.custom_parse_args(arg) 281 | if args is None: 282 | return 283 | if args.what == 'target': 284 | self.target = args.value 285 | if re.match(r'^https?://.*$', self.target) is None: 286 | self.target = "http://" + self.target 287 | if re.match(r'^.+/$', self.target) is None: 288 | self.target += "/" 289 | InteractiveShell.prompt = Console.red + self.target + Console.normal + " > " 290 | print("target = %s" % args.value) 291 | self.scanner = WPApi(self.target, session=self.session) 292 | Console.log_info("Cache is erased but session stays the same (with cookies and authorization)") 293 | elif args.what == 'proxy': 294 | self.session.set_proxy(args.value) 295 | print("proxy = %s" % args.value) 296 | elif args.what == 'cookies': 297 | self.session.set_cookies(args.value) 298 | print("Cookies set!") 299 | elif args.what == "credentials": 300 | authorization_list = args.value.split(':') 301 | if len(authorization_list) == 1: 302 | authorization = (authorization_list[0], '') 303 | elif len(authorization_list) >= 2: 304 | authorization = (authorization_list[0], 305 | ':'.join(authorization_list[1:])) 306 | 
self.session.set_creds(authorization) 307 | print("Credentials set!") 308 | print() 309 | 310 | def do_list(self, arg): 311 | 'Gets the list of something from the server' 312 | parser = ArgumentParser(prog='list', description='gets a list of something from the server') 313 | parser.add_argument("what", choices=[ 314 | 'posts', 315 | #'post-revisions', 316 | #'wp-blocks', 317 | 'categories', 318 | 'tags', 319 | 'pages', 320 | 'comments', 321 | 'media', 322 | 'users', 323 | #'themes', 324 | #'search-results', 325 | 'namespaces', 326 | 'all', 327 | ], 328 | help='what to list') 329 | parser.add_argument("--json", "-j", help="list and store as json to the specified file") 330 | parser.add_argument("--csv", "-c", help="list and store as csv to the specified file") 331 | parser.add_argument("--limit", "-l", type=int, help="limit the number of results") 332 | parser.add_argument("--start", "-s", type=int, help="start at the given index") 333 | parser.add_argument("--no-cache", dest="cache", action="store_false", help="don't lookup in cache and ask the server") 334 | args = parser.custom_parse_args(arg) 335 | if args is None: 336 | return 337 | # The checks must be ordered by dependencies 338 | kwargs = { 339 | "start": args.start, 340 | "limit": args.limit, 341 | "is_all": args.what == "all", 342 | "cache": args.cache, 343 | "json": args.json, 344 | "csv": args.csv 345 | } 346 | if args.what == "all" or args.what == "users": 347 | self.list_obj(WPApi.USER, **kwargs) 348 | if args.what == "all" or args.what == "tags": 349 | self.list_obj(WPApi.TAG, **kwargs) 350 | if args.what == "all" or args.what == "categories": 351 | self.list_obj(WPApi.CATEGORY, **kwargs) 352 | if args.what == "all" or args.what == "posts": 353 | self.list_obj(WPApi.POST, **kwargs) 354 | if args.what == "all" or args.what == "pages": 355 | self.list_obj(WPApi.PAGE, **kwargs) 356 | if args.what == "all" or args.what == "comments": 357 | self.list_obj(WPApi.COMMENT, **kwargs) 358 | if args.what == "all" 
or args.what == "media": 359 | self.list_obj(WPApi.MEDIA, **kwargs) 360 | if args.what == "all" or args.what == "namespaces": 361 | self.list_obj(WPApi.NAMESPACE, **kwargs) 362 | 363 | def do_fetch(self, arg): 364 | 'Fetches a specific content specified by ID' 365 | parser = ArgumentParser(prog='fetch', description='fetches something from the server or the cache by ID') 366 | parser.add_argument("what", choices=[ 367 | 'post', 368 | #'post-revision', 369 | #'wp-block', 370 | 'category', 371 | 'tag', 372 | 'page', 373 | 'comment', 374 | 'media', 375 | 'user', 376 | #'theme', 377 | #'search-result', 378 | ], 379 | help='what to fetch') 380 | parser.add_argument("id", type=int, help='the ID of the content to fetch') 381 | parser.add_argument("--json", "-j", help="list and store as json to the specified file") 382 | parser.add_argument("--csv", "-c", help="list and store as csv to the specified file") 383 | parser.add_argument("--no-cache", dest="cache", action="store_false", help="don't lookup in cache and ask the server") 384 | args = parser.custom_parse_args(arg) 385 | what_type = None 386 | if args is None: 387 | return 388 | what_type = WPApi.str_type_to_native(args.what) 389 | 390 | if what_type is not None: 391 | self.fetch_obj(what_type, args.id, cache=args.cache, json=args.json, csv=args.csv) 392 | else: 393 | print("Not implemented") 394 | print() 395 | 396 | def do_search(self, arg): 397 | 'Looks for specific keywords in the WordPress API' 398 | parser = ArgumentParser(prog='search', description='searches something from the server') 399 | parser.add_argument("--type", "-t", action="append", choices=[ 400 | 'all', 401 | 'post', 402 | #'post-revision', 403 | #'wp-block', 404 | 'category', 405 | 'tag', 406 | 'page', 407 | 'comment', 408 | 'media', 409 | 'user', 410 | #'theme', 411 | #'search-result', 412 | ], 413 | help='the types to look for (default all)', 414 | dest='what' 415 | ) 416 | parser.add_argument("keywords", help='the keywords to look for') 417 | 
parser.add_argument("--json", "-j", help="list and store as json to the specified file(s)") 418 | parser.add_argument("--csv", "-c", help="list and store as csv to the specified file(s)") 419 | parser.add_argument("--limit", "-l", type=int, help="limit the number of results") 420 | parser.add_argument("--start", "-s", type=int, help="start at the given index") 421 | args = parser.custom_parse_args(arg) 422 | if args is None: 423 | return 424 | what_types = WPApi.convert_obj_types_to_list(args.what) 425 | results = self.scanner.search(what_types, args.keywords, args.start, args.limit) 426 | print() 427 | for k, v in results.items(): 428 | prop = self.get_fetch_or_list_type(k, plural=True) 429 | print(prop["obj_name"] + " details") 430 | if len(v) == 0: 431 | Console.log_info("No result") 432 | else: 433 | try: 434 | prop["display_func"](v) 435 | InteractiveShell.export_decorator( 436 | prop["export_func"], 437 | len(what_types) > 1 or WPApi.ALL_TYPES in what_types, 438 | prop["obj_name"].lower(), 439 | args.json, 440 | args.csv, 441 | v 442 | ) 443 | except WordPressApiNotV2: 444 | Console.log_error("The API does not support WP V2") 445 | except IOError as e: 446 | Console.log_error("Could not open %s for writing" % e.filename) 447 | print() 448 | 449 | def do_dl(self, arg): 450 | 'Downloads a media file (e.g. 
from /wp-content/uploads/) based on its ID' 451 | 452 | parser = ArgumentParser(prog='dl', description='downloads a media from the server') 453 | parser.add_argument("ids", help='ids to look for (comma separated), "all" or "cache"') 454 | parser.add_argument("dest", help='destination folder') 455 | parser.add_argument("--no-cache", dest="cache", action="store_false", help="don't lookup in cache and ask the server") 456 | parser.add_argument("--use-slug", dest="slug", action="store_true", help="use the slug as filename and not the source URL name") 457 | args = parser.custom_parse_args(arg) 458 | if args is None: 459 | return 460 | 461 | if not os.path.isdir(args.dest): 462 | Console.log_error("The destination is not a folder or does not exist") 463 | return 464 | 465 | print("Pulling the media URLs") 466 | media, slugs = self.scanner.get_media_urls(args.ids, args.cache) 467 | if len(media) == 0: 468 | Console.log_error("No media found corresponding to the criteria") 469 | return 470 | print("%d media URLs found" % len(media)) 471 | answer = input("Do you wish to proceed to download? 
(y/N)") 472 | if answer.lower() != "y": 473 | return 474 | print("Note: Only files over 10MB are logged here") 475 | 476 | number_downloaded = 0 477 | if args.slug: 478 | number_downloaded = Exporter.download_media(media, args.dest, slugs) 479 | else: 480 | number_downloaded = Exporter.download_media(media, args.dest) 481 | print('Downloaded %d media to %s' % (number_downloaded, args.dest)) 482 | 483 | def start_interactive(target, session, version): 484 | """ 485 | Starts a new interactive session 486 | """ 487 | InteractiveShell(target, session, version).cmdloop() -------------------------------------------------------------------------------- /lib/plugins/plugin_list.csv: -------------------------------------------------------------------------------- 1 | oembed/1.0,Allows embedded representation of a URL, 2 | contact-form-7/v1,Manages multiple contact forms,https://wordpress.org/plugins/contact-form-7/ 3 | wc/v1,WooCommerce is a free eCommerce plugin that allows to sell anything,https://wordpress.org/plugins/woocommerce/ 4 | wc/v2,WooCommerce is a free eCommerce plugin that allows to sell anything,https://wordpress.org/plugins/woocommerce/ 5 | facebook/v1,, 6 | regenerate-thumbnails/v1,Regenerate Thumbnails allows to regenerate all thumbnail sizes for one or more images,https://wordpress.org/plugins/regenerate-thumbnails/ 7 | wp/v2,The default API integrated since WordPress 4.7,https://developer.wordpress.org/rest-api/ 8 | akismet/v1,Akismet checks comments and contact form submissions against a global database of spam,https://wordpress.org/plugins/akismet/ 9 | yoast/v1,Yoast SEO is a WordPress SEO plugin,https://wordpress.org/plugins/wordpress-seo/ 10 | wp-super-cache/v1,This plugin generates static html files from your dynamic WordPress blog,https://wordpress.org/plugins/wp-super-cache/ 11 | script-manager/v1,, 12 | jetpack/v4,Hassle-free design and marketing,https://wordpress.org/plugins/jetpack/ 13 | redirection/v1,Redirection is the most popular redirect 
manager for WordPress,https://wordpress.org/plugins/redirection/ 14 | tribe/events/v1,Create and manage an events calendar,https://wordpress.org/plugins/the-events-calendar/ 15 | 2fa/v1,, 16 | wpsc/v1,, 17 | v1/products/,, 18 | v1/cart/,, 19 | v1/,, 20 | post-views-counter,Counts views of posts of the website,https://wordpress.org/plugins/post-views-counter/ 21 | frm-admin/v1,, 22 | listo/v1,Listo is a simple plugin that supplies other plugins and themes with commonly used lists,https://wordpress.org/plugins/listo/ 23 | themeisle-sdk/v1,, 24 | bogo/v1,Bogo is a straight-forward multilingual plugin for WordPress,https://wordpress.org/plugins/bogo/ 25 | envira/v1,Responsive Image Gallery for WordPress,https://wordpress.org/plugins/envira-gallery-lite/ 26 | disqus/v1,Disqus is the web’s most popular commenting system,https://wordpress.org/plugins/disqus-comment-system/ 27 | invitations-for-slack/v1,Invitations for Slack allows to show “Join us on Slack.” buttons,https://wordpress.org/plugins/invitations-for-slack/ 28 | rop/v1,Revive Old Posts helps to keep the old posts alive by automatically sharing them on Social Networks,https://wordpress.org/plugins/tweet-old-post/ 29 | cf-api/v2,, 30 | thrive,, 31 | om-cc,, 32 | om/fiw,, 33 | tatsu/v1,, 34 | semplice/v1/editor,, 35 | semplice/v1/admin,, 36 | semplice/v1/frontend,, 37 | jwt-auth/v1,, 38 | pum/v1,, 39 | deliciousbrains/v1,, 40 | sportspress/v2,Creates a professional sports website,https://wordpress.org/plugins/sportspress/ 41 | content-forms/v1,, 42 | wp_live_chat_support/v1,Fully functional Live Chat plugin,https://wordpress.org/plugins/wp-live-chat-support/ 43 | if-menu/v1,Control what menu items visitors see based on visibility rules,https://wordpress.org/plugins/if-menu/ 44 | iowd/v1,, 45 | save,, 46 | facetwp/v1/,, 47 | slimstat/v1,A web analytics plugin for WordPress,https://wordpress.org/plugins/wp-slimstat/ 48 | social-share/v1,, 49 | social-counts/v1,, 50 | swp_api,, 51 | app/v2,, 52 | alids/v1/,, 53 | 
template-directory,, 54 | customify/v1,With Customify developers can easily create advanced theme-specific options inside the WordPress Customizer,https://wordpress.org/plugins/customify/ 55 | pixcare/v1,, 56 | codepinch/v1,A website error correcter?,https://wordpress.org/plugins/wp-error-fix/ 57 | blc/v1,Broken Link Checker?,https://wordpress.org/plugins/broken-link-checker/ 58 | visualizer/v1,, 59 | td-composer,, 60 | tdw,, 61 | mpp/v1,, 62 | wooketing/v1,, 63 | gf/v2,, 64 | wpcsp/v1,Set the CSP settings and will add them to the page the visitor requested,https://wordpress.org/plugins/wp-content-security-policy/ 65 | instant-images,One click uploads of Unsplash photos,https://wordpress.org/plugins/instant-images/ 66 | api,, 67 | templates-directory,, 68 | rollbar/v1,Rollbar collects errors and allows to analyze them,https://wordpress.org/plugins/rollbar/ 69 | liveblog/v1,Quick and simple blogging for following fast-paced events,https://wordpress.org/plugins/liveblog/ 70 | integrity-checker/v1,Verifies that all installed code is identical to its original version and more,https://wordpress.org/plugins/integrity-checker/ 71 | pll/v1,, 72 | wp-post-modal/v1,, 73 | quiz-survey-master/v1,Creates surveys for the users,https://wordpress.org/plugins/quiz-master-next/ 74 | rp-wapi/v1,, 75 | wc-product-add-ons/v1,WooCommerce PPOM (Personalized Product Option Manager) Plugin adds input fields on product page,https://wordpress.org/plugins/search/wc+products/ 76 | wpglib/v1,, 77 | tcm/v1,, 78 | affwp/v1,, 79 | custom-api/v1,, 80 | wplr/v1,"Synchronizes photos, collections, keywords and metadata between Lightroom and WordPress",https://wordpress.org/plugins/search/wplr/ 81 | acf/v3,Exposes Advanced Custom Fields Endpoints in the WordPress REST API,https://wordpress.org/plugins/acf-to-rest-api/ 82 | pp/v1,, 83 | dooplay,, 84 | dbmovies,, 85 | pciextranet/v2,, 86 | cloozi/rest,, 87 | store-locator-plus/v1,Maps locations on Google 
Maps,https://wordpress.org/plugins/store-locator-le/ 88 | store-locator-plus/v2,Maps locations on Google Maps,https://wordpress.org/plugins/store-locator-le/ 89 | joinzee-wp/v1,, 90 | ccf/v1,Custom Contact Forms?,https://wordpress.org/plugins/custom-contact-forms/ 91 | keremiya,, 92 | pageviews/1.0,A simple and lightweight pageviews counter,https://wordpress.org/plugins/pageviews/ 93 | watchful/v1,, 94 | shortcode-change,, 95 | shortcode-insert,, 96 | upload/,, 97 | sync/,, 98 | download/,, 99 | agroopwoo,, 100 | rest-routes/v2,Building custom endpoints for WP REST API made easy,https://wordpress.org/plugins/rest-routes/ 101 | pvc/v1,, 102 | ee/v4.8.29,, 103 | ee/v4.8.33,, 104 | ee/v4.8.34,, 105 | ee/v4.8.36,, 106 | vegashero/v1,, 107 | ml-api/v2,, 108 | mwl/v1,, 109 | envira-background/v1,, 110 | api/v1,, 111 | rta,, 112 | stec/v2,, 113 | erp/v1,, 114 | autofill/v1,, 115 | /autofill/v1,, 116 | rest/v1,, 117 | wp/v2/acf,, 118 | ms/api,, 119 | siso/v1,, 120 | dp/v1,, 121 | indieauth/1.0,IndieAuth is a way for doing Web sign-in,https://wordpress.org/plugins/indieauth/ 122 | sloc_geo/1.0,, 123 | link-preview/1.0,Display a preview for a URL similar to sharing a link on Facebook,https://wordpress.org/plugins/wp-link-preview/ 124 | webmention/1.0,Enable conversation across the web,https://wordpress.org/plugins/webmention/ 125 | bballs,, 126 | logbook/v1,This plugin is for logging users' activities,https://wordpress.org/plugins/search/logbook/ 127 | child-themify/v1,Create child themes with the click of a button,https://wordpress.org/plugins/child-themify/ 128 | versionpress,, 129 | keliron/api/v3,, 130 | bablic,Translate WP with this multilingual plugin,https://wordpress.org/plugins/bablic/ 131 | eum/v1,, 132 | tvo/v1,, 133 | frm/v2,, 134 | app-mobile,, 135 | ap3/v1,, 136 | diets/v1,, 137 | manage-customers/v1,, 138 | leads,WordPress Leads?,https://wordpress.org/plugins/leads/ 139 | commentcava/v1.0,CommentCaVa disables the comment field for a certain amount of 
time,https://wordpress.org/plugins/commentcava/ 140 | lscf_rest,Advanced WordPress Filter Plugin,https://wordpress.org/plugins/live-search-custom-fields-lite/ 141 | wpv/v1,, 142 | tho/v1,, 143 | aghigh/v1,, 144 | spnl/v1,A Newsletter Plugin for WordPress,https://wordpress.org/plugins/search/spnl/ 145 | task_manager/v1,Task manager,https://wordpress.org/plugins/task-manager/ 146 | customfiy/v1,Theme Customizer Booster,https://wordpress.org/plugins/customify/ 147 | CHifcoRegCardPluginV2/v1,, 148 | CHifcoFireBasePlugin/v1,, 149 | CHifcoFireBaseVII/v2,, 150 | wp-api-menus/v2,, 151 | envira-lightroom/v3,Envira Gallery allows you to create photo galleries and video galleries,https://wordpress.org/plugins/envira-gallery-lite/ 152 | comments/v1,, 153 | addcomment/v1,, 154 | pf/v1,, 155 | postmatic/v1,, 156 | ivole/v1,Customer Reviews for WooCommerce?,https://wordpress.org/plugins/customer-reviews-woocommerce/ 157 | shwcp/v1,, 158 | wp-rest-api-log,WordPress plugin to log REST API requests and responses,https://wordpress.org/plugins/wp-rest-api-log/ 159 | wk/v1,, 160 | sfp-live-search/v1,, 161 | csco/v1,, 162 | caos/v1,A plugin that inserts the Analytics tracking code,https://wordpress.org/plugins/host-analyticsjs-local/ 163 | rest/events,, 164 | obfx-google-analytics,, 165 | shariff/v1,Shariff provides share buttons that respect the privacy of visitors,https://wordpress.org/plugins/shariff/ 166 | wp-discourse/v1,This plugin allows to use Discourse as a community engine,https://wordpress.org/plugins/wp-discourse/ 167 | dbmvs,, 168 | wp-crm/v1/form,This plugin is intended to significantly improve user management,https://wordpress.org/plugins/wp-crm/ 169 | gutenberg/v1,A new editing experience for WordPress,https://wordpress.org/plugins/gutenberg/ 170 | tribe_events/v2,, 171 | rnet/v1,, 172 | eklo/v2,, 173 | menus/v1,, 174 | sow/v1,, 175 | wpbooklist/v1,Used to sell books, record and catalog a library,https://wordpress.org/plugins/wpbooklist/ 176 | tabulate,This plugin 
provides a simple user-friendly interface to tables in the database,https://wordpress.org/plugins/tabulate/ 177 | geoblog/v1,, 178 | acf/v2,, 179 | mobilegate/v2,, 180 | jamtrap/v1,, 181 | paf,, 182 | in-cron/v1,, 183 | awb/v1,AWB allows to use parallax backgrounds with images, videos, youtube and vimeo,https://wordpress.org/plugins/advanced-backgrounds/ 184 | wctofb/v1,WooCommerce to facebook shop,https://wordpress.org/plugins/woo-to-facebook-shop/ 185 | weekly-class/v1,Generate a weekly schedule of classes,https://wordpress.org/plugins/weekly-class-schedule/ 186 | be-to-tatsu/v1,, 187 | braintree-gateway/v1/,A payment gateway, 188 | bfwc/settings/kount/,, 189 | gembloong/,, 190 | -------------------------------------------------------------------------------- /lib/requestsession.py: -------------------------------------------------------------------------------- 1 | """ 2 | Copyright (c) 2018-2020 Mickaël "Kilawyn" Walter 3 | 4 | Permission is hereby granted, free of charge, to any person obtaining a copy 5 | of this software and associated documentation files (the "Software"), to deal 6 | in the Software without restriction, including without limitation the rights 7 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 8 | copies of the Software, and to permit persons to whom the Software is 9 | furnished to do so, subject to the following conditions: 10 | 11 | The above copyright notice and this permission notice shall be included in all 12 | copies or substantial portions of the Software. 13 | 14 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 15 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 16 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. 
IN NO EVENT SHALL THE 17 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 18 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 19 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 20 | SOFTWARE. 21 | """ 22 | 23 | from http.cookies import SimpleCookie 24 | import requests 25 | 26 | from lib.console import Console 27 | 28 | class ConnectionCouldNotResolve(Exception): 29 | pass 30 | 31 | class ConnectionReset(Exception): 32 | pass 33 | 34 | class ConnectionRefused(Exception): 35 | pass 36 | 37 | class ConnectionTimeout(Exception): 38 | pass 39 | 40 | class HTTPError400(Exception): 41 | pass 42 | 43 | class HTTPError401(Exception): 44 | pass 45 | 46 | class HTTPError403(Exception): 47 | pass 48 | 49 | class HTTPError404(Exception): 50 | pass 51 | 52 | class HTTPError500(Exception): 53 | pass 54 | 55 | class HTTPError502(Exception): 56 | pass 57 | 58 | class HTTPError(Exception): 59 | pass 60 | 61 | class RequestSession: 62 | """ 63 | Wrapper to handle the requests library with session support 64 | """ 65 | 66 | def __init__(self, proxy=None, cookies=None, authorization=None): 67 | """ 68 | Creates a new RequestSession instance 69 | param proxy: a dict containing a proxy server string for HTTP and/or 70 | HTTPS connection 71 | param cookies: a string in the format of the Cookie header 72 | param authorization: a tuple containing login and password or 73 | requests.auth.HTTPBasicAuth for basic authentication or 74 | requests.auth.HTTPDigestAuth for NTLM-like authentication 75 | """ 76 | self.s = requests.Session() 77 | if proxy is not None: 78 | self.set_proxy(proxy) 79 | if cookies is not None: 80 | self.set_cookies(cookies) 81 | if authorization is not None and ( 82 | type(authorization) is tuple and len(authorization) == 2 or 83 | type(authorization) is requests.auth.HTTPBasicAuth or 84 | type(authorization) is requests.auth.HTTPDigestAuth): 85 | self.s.auth = authorization 86 | 87 | 
def get(self, url): 88 | """ 89 | Calls the get function from requests and raises a context-specific 90 | exception on error 91 | """ 92 | return self.do_request("get", url) 93 | 94 | 95 | def post(self, url, data=None): 96 | """ 97 | Calls the post function from requests and raises a context-specific 98 | exception on error 99 | """ 100 | return self.do_request("post", url, data) 101 | 102 | def do_request(self, method, url, data=None): 103 | """ 104 | Helper method to regroup requests and handle exceptions at the same 105 | location 106 | """ 107 | response = None 108 | try: 109 | if method == "post": 110 | response = self.s.post(url, data) 111 | else: 112 | response = self.s.get(url) 113 | except requests.ConnectionError as e: 114 | if "Errno -5" in str(e) or "Errno -2" in str(e)\ 115 | or "Errno -3" in str(e): 116 | Console.log_error("Could not resolve host %s" % url) 117 | raise ConnectionCouldNotResolve 118 | elif "Errno 111" in str(e): 119 | Console.log_error("Connection refused by %s" % url) 120 | raise ConnectionRefused 121 | elif "RemoteDisconnected" in str(e): 122 | Console.log_error("Connection reset by %s" % url) 123 | raise ConnectionReset 124 | else: 125 | print(e) 126 | raise e 127 | except Exception as e: 128 | raise e 129 | 130 | if response.status_code == 400: 131 | raise HTTPError400 132 | elif response.status_code == 401: 133 | Console.log_error("Error 401 (Unauthorized) while trying to fetch" 134 | " the API") 135 | raise HTTPError401 136 | elif response.status_code == 403: 137 | Console.log_error("Error 403 (Forbidden) while trying" 138 | " to fetch the API") 139 | raise HTTPError403 140 | elif response.status_code == 404: 141 | raise HTTPError404 142 | elif response.status_code == 500: 143 | Console.log_error("Error 500 (Internal Server Error) while trying" 144 | " to fetch the API") 145 | raise HTTPError500 146 | elif response.status_code == 502: 147 | Console.log_error("Error 502 (Bad 
Gateway) while trying" 148 | " to fetch the API") 149 | raise HTTPError502 150 | elif response.status_code > 400: 151 | Console.log_error("Error %d while trying to fetch the API" % 152 | response.status_code) 153 | raise HTTPError 154 | 155 | return response 156 | 157 | def set_cookies(self, cookies): 158 | """ 159 | Sets new cookies from a string 160 | """ 161 | c = SimpleCookie() 162 | c.load(cookies) 163 | for key, m in c.items(): 164 | self.s.cookies.set(key, m.value) 165 | 166 | def get_cookies(self): 167 | return self.s.cookies.get_dict() 168 | 169 | def set_proxy(self, proxy): 170 | prot = 'http' 171 | if proxy[:5].lower() == 'https': 172 | prot = 'https' 173 | self.s.proxies = {prot: proxy} 174 | 175 | def get_proxies(self): 176 | return self.s.proxies 177 | 178 | def set_creds(self, credentials): 179 | self.s.auth = credentials 180 | 181 | def get_creds(self): 182 | return self.s.auth -------------------------------------------------------------------------------- /lib/utils.py: -------------------------------------------------------------------------------- 1 | """ 2 | Copyright (c) 2018-2020 Mickaël "Kilawyn" Walter 3 | 4 | Permission is hereby granted, free of charge, to any person obtaining a copy 5 | of this software and associated documentation files (the "Software"), to deal 6 | in the Software without restriction, including without limitation the rights 7 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 8 | copies of the Software, and to permit persons to whom the Software is 9 | furnished to do so, subject to the following conditions: 10 | 11 | The above copyright notice and this permission notice shall be included in all 12 | copies or substantial portions of the Software. 13 | 14 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 15 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 16 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. 
IN NO EVENT SHALL THE 17 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 18 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 19 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 20 | SOFTWARE. 21 | """ 22 | 23 | import json 24 | 25 | from urllib.parse import urlsplit, urlunsplit 26 | 27 | def get_by_id(value, id): 28 | """ 29 | Utility function to retrieve a value by its ID in a list of dicts; returns 30 | None if no match is found 31 | param value: the list of dicts to process (None entries are skipped) 32 | param id: the id to get 33 | """ 34 | if value is None: 35 | return None 36 | for val in value: 37 | if val is not None and 'id' in val and val['id'] == id: 38 | return val 39 | return None 40 | 41 | # Neat code part from https://codereview.stackexchange.com/questions/13027/joini 42 | # ng-url-path-components-intelligently 43 | def url_path_join(*parts): 44 | """Normalize url parts and join them with a slash.""" 45 | schemes, netlocs, paths, queries, fragments = \ 46 | zip(*(urlsplit(part) for part in parts)) 47 | scheme = first(schemes) 48 | netloc = first(netlocs) 49 | path = '/'.join(x.strip('/') for x in paths if x) 50 | query = first(queries) 51 | fragment = first(fragments) 52 | return urlunsplit((scheme, netloc, path, query, fragment)) 53 | 54 | def first(sequence, default=''): 55 | return next((x for x in sequence if x), default) 56 | 57 | # Code from https://stackoverflow.com/questions/3173320/text-progress-bar-in-th 58 | # e-console 59 | 60 | def print_progress_bar (iteration, total, prefix = '', suffix = '', decimals = 1,\ 61 | length = 100, fill = '█'): 62 | """ 63 | Call in a loop to create terminal progress bar 64 | @params: 65 | iteration - Required : current iteration (Int) 66 | total - Required : total iterations (Int) 67 | prefix - Optional : prefix string (Str) 68 | suffix - Optional : suffix string (Str) 69 | decimals - Optional : positive number of decimals in percent \ 70 | complete 
(Int) 71 | length - Optional : character length of bar (Int) 72 | fill - Optional : bar fill character (Str) 73 | """ 74 | try: 75 | percent = ("{0:." + str(decimals) + "f}").format(100 * (iteration / \ 76 | float(total))) 77 | filledLength = int(length * iteration // total) 78 | except: 79 | percent = 0 80 | filledLength = 0 81 | 82 | bar = fill * filledLength + '-' * (length - filledLength) 83 | print('\r%s |%s| %s%% %s' % (prefix, bar, percent, suffix), end = '\r') 84 | # Print New Line on Complete 85 | if iteration == total: 86 | print() 87 | 88 | def get_content_as_json (response_obj): 89 | """ 90 | When a BOM is present (see issue #2), UTF-8 is not properly decoded by 91 | Response.json() method. This is a helper function that returns a json value 92 | even if a BOM is present in UTF-8 text 93 | @params: 94 | response_obj: a requests Response instance 95 | @returns: a decoded json object (list or dict) 96 | """ 97 | if response_obj.content[:3]== b'\xef\xbb\xbf': # UTF-8 BOM 98 | content = response_obj.content.decode("utf-8-sig") 99 | return json.loads(content) 100 | else: 101 | try: 102 | return response_obj.json() 103 | except: 104 | return {} 105 | -------------------------------------------------------------------------------- /lib/wpapi.py: -------------------------------------------------------------------------------- 1 | """ 2 | Copyright (c) 2018-2020 Mickaël "Kilawyn" Walter 3 | 4 | Permission is hereby granted, free of charge, to any person obtaining a copy 5 | of this software and associated documentation files (the "Software"), to deal 6 | in the Software without restriction, including without limitation the rights 7 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 8 | copies of the Software, and to permit persons to whom the Software is 9 | furnished to do so, subject to the following conditions: 10 | 11 | The above copyright notice and this permission notice shall be included in all 12 | copies or substantial portions 
of the Software. 13 | 14 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 15 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 16 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 17 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 18 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 19 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 20 | SOFTWARE. 21 | """ 22 | 23 | import math 24 | import copy 25 | 26 | import requests 27 | from urllib.parse import urlencode 28 | 29 | from json.decoder import JSONDecodeError 30 | 31 | from lib.exceptions import NoWordpressApi, WordPressApiNotV2, \ 32 | NSNotFoundException 33 | from lib.requestsession import RequestSession, HTTPError400, HTTPError404 34 | from lib.utils import url_path_join, print_progress_bar, get_content_as_json, get_by_id 35 | 36 | class WPApi: 37 | """ 38 | Queries the WordPress API to retrieve information 39 | """ 40 | 41 | # Object types 42 | POST = 0 43 | """ 44 | The post type 45 | """ 46 | POST_REVISION = 1 47 | """ 48 | The post revision type 49 | """ 50 | WP_BLOCK = 2 51 | """ 52 | The Gutenberg block type 53 | """ 54 | CATEGORY = 3 55 | """ 56 | The category type 57 | """ 58 | TAG = 4 59 | """ 60 | The tag type 61 | """ 62 | PAGE = 5 63 | """ 64 | The page type 65 | """ 66 | COMMENT = 6 67 | """ 68 | The comment type 69 | """ 70 | MEDIA = 7 71 | """ 72 | The media type 73 | """ 74 | USER = 8 75 | """ 76 | The user type 77 | """ 78 | THEME = 9 79 | """ 80 | The theme type 81 | """ 82 | NAMESPACE = 10 83 | """ 84 | The namespace type 85 | """ 86 | #SEARCH_RESULT = 10 87 | ALL_TYPES = 20 88 | """ 89 | Constant representing all types 90 | """ 91 | 92 | def __init__(self, target, api_path="wp-json/", session=None, 93 | search_terms=None): 94 | """ 95 | Creates a new instance of WPApi 96 | param target: the target of the scan 97 | param 
api_path: the api path, if non-default 98 | param session: the requests session object to use for HTTP requests 99 | param search_terms : the terms of the keyword search, if any 100 | """ 101 | self.api_path = api_path 102 | self.search_terms = search_terms 103 | self.has_v2 = None 104 | self.name = None 105 | self.description = None 106 | self.url = target 107 | self.basic_info = None 108 | self.posts = None 109 | self.tags = None 110 | self.categories = None 111 | self.users = None 112 | self.media = None 113 | self.pages = None 114 | self.s = None 115 | self.comments_loaded = False 116 | self.orphan_comments = [] 117 | self.comments = None 118 | 119 | if session is not None: 120 | self.s = session 121 | else: 122 | self.s = RequestSession() 123 | 124 | @staticmethod 125 | def str_type_to_native(str_type): 126 | """ 127 | Converts a single object type as str to its corresponding native type. 128 | If the object type is unknown, this returns None as a fallback. 129 | This may have to be modified in cases of bugs. 
130 | 131 | :param str_type: the object type as string 132 | :return: the object type as native constant 133 | 134 | ``` 135 | str_type_to_native("post") # returns WPApi.POST 136 | ``` 137 | """ 138 | if str_type == "user": 139 | return WPApi.USER 140 | elif str_type == "tag": 141 | return WPApi.TAG 142 | elif str_type == "category": 143 | return WPApi.CATEGORY 144 | elif str_type == "post": 145 | return WPApi.POST 146 | elif str_type == "page": 147 | return WPApi.PAGE 148 | elif str_type == "comment": 149 | return WPApi.COMMENT 150 | elif str_type == "media": 151 | return WPApi.MEDIA 152 | elif str_type == "post_revision": 153 | return WPApi.POST_REVISION 154 | elif str_type == "block": 155 | return WPApi.WP_BLOCK 156 | elif str_type == "theme": 157 | return WPApi.THEME 158 | elif str_type == "namespace": 159 | return WPApi.NAMESPACE 160 | return None 161 | 162 | @staticmethod 163 | def convert_obj_types_to_list(str_types): 164 | """ 165 | Converts a list of object types given as strings to a list of native constants 166 | representing the object types. 
167 | """ 168 | out = [] 169 | if str_types is None or len(str_types) == 0 or 'all' in str_types: 170 | return [WPApi.ALL_TYPES] 171 | for el in str_types: 172 | current = WPApi.str_type_to_native(el) 173 | if current is not None: 174 | out.append(current) 175 | return out 176 | 177 | def get_orphans_comments(self): 178 | """ 179 | Returns the list of comments for which a post hasn't been found 180 | """ 181 | return self.orphan_comments 182 | 183 | def get_basic_info(self): 184 | """ 185 | Collects and stores basic information about the target 186 | """ 187 | rest_url = url_path_join(self.url, self.api_path) 188 | if self.basic_info is not None: 189 | return self.basic_info 190 | 191 | try: 192 | req = self.s.get(rest_url) 193 | except Exception: 194 | raise NoWordpressApi 195 | if req.status_code >= 400: 196 | raise NoWordpressApi 197 | self.basic_info = get_content_as_json(req) 198 | 199 | if 'name' in self.basic_info.keys(): 200 | self.name = self.basic_info['name'] 201 | 202 | if 'description' in self.basic_info.keys(): 203 | self.description = self.basic_info['description'] 204 | 205 | if 'namespaces' in self.basic_info.keys() and 'wp/v2' in \ 206 | self.basic_info['namespaces']: 207 | self.has_v2 = True 208 | 209 | return self.basic_info 210 | 211 | def crawl_pages(self, url, start=None, num=None, search_terms=None, display_progress=True): 212 | """ 213 | Crawls all pages while there is at least one result for the given 214 | endpoint or tries to get pages from start to end 215 | """ 216 | if search_terms is None: 217 | search_terms = self.search_terms 218 | page = 1 219 | total_entries = 0 220 | total_pages = 0 221 | more_entries = True 222 | entries = [] 223 | base_url = url 224 | entries_left = 1 225 | per_page = 10 226 | if search_terms is not None: 227 | if '?' in base_url: 228 | base_url += '&' + urlencode({'search': search_terms}) 229 | else: 230 | base_url += '?' 
+ urlencode({'search': search_terms}) 231 | if start is not None: 232 | page = math.floor(start/per_page) + 1 233 | if num is not None: 234 | entries_left = num 235 | while more_entries and entries_left > 0: 236 | rest_url = url_path_join(self.url, self.api_path, (base_url % page)) 237 | if start is not None: 238 | rest_url += "&per_page=%d" % per_page 239 | try: 240 | req = self.s.get(rest_url) 241 | if (page == 1 or start is not None and page == math.floor(start/per_page) + 1) and 'X-WP-Total' in req.headers: 242 | total_entries = int(req.headers['X-WP-Total']) 243 | total_pages = int(req.headers['X-WP-TotalPages']) 244 | print("Total number of entries: %d" % total_entries) 245 | if start is not None and total_entries < start: 246 | start = total_entries - 1 247 | except HTTPError400: 248 | break 249 | except Exception: 250 | raise WordPressApiNotV2 251 | try: 252 | json_content = get_content_as_json(req) 253 | if type(json_content) is list and len(json_content) > 0: 254 | if (start is None or start is not None and page > math.floor(start/per_page) + 1) and num is None: 255 | entries += json_content 256 | if start is not None: 257 | entries_left -= len(json_content) 258 | elif start is not None and page == math.floor(start/per_page) + 1: 259 | if num is None or num is not None and len(json_content[start % per_page:]) < num: 260 | entries += json_content[start % per_page:] 261 | if num is not None: 262 | entries_left -= len(json_content[start % per_page:]) 263 | else: 264 | entries += json_content[start % per_page:(start % per_page) + num] 265 | entries_left = 0 266 | else: 267 | if num is not None and entries_left > len(json_content): 268 | entries += json_content 269 | entries_left -= len(json_content) 270 | else: 271 | entries += json_content[:entries_left] 272 | entries_left = 0 273 | 274 | if display_progress: 275 | if num is None and start is None and total_entries >= 0: 276 | print_progress_bar(page, total_pages, 277 | length=70) 278 | elif num is None and 
start is not None and total_entries >= 0: 279 | print_progress_bar(total_entries-start-entries_left, total_entries-start, 280 | length=70) 281 | elif num is not None and total_entries > 0: 282 | print_progress_bar(num-entries_left, num, 283 | length=70) 284 | else: 285 | more_entries = False 286 | except JSONDecodeError: 287 | more_entries = False 288 | 289 | page += 1 290 | 291 | return (entries, total_entries) 292 | 293 | def crawl_single_page(self, url): 294 | """ 295 | Crawls a single URL 296 | """ 297 | content = None 298 | rest_url = url_path_join(self.url, self.api_path, url) 299 | try: 300 | req = self.s.get(rest_url) 301 | except HTTPError400: 302 | return None 303 | except HTTPError404: 304 | return None 305 | except Exception: 306 | raise WordPressApiNotV2 307 | try: 308 | content = get_content_as_json(req) 309 | except JSONDecodeError: 310 | pass 311 | 312 | return content 313 | 314 | def get_from_cache(self, cache, start=None, num=None, force=False): 315 | """ 316 | Tries to fetch data from the given cache, also verifies first if WP-JSON is supported 317 | """ 318 | if self.has_v2 is None: 319 | self.get_basic_info() 320 | if not self.has_v2: 321 | raise WordPressApiNotV2 322 | if cache is not None and start is not None and len(cache) <= start: 323 | start = len(cache) - 1 324 | if cache is not None and not force: 325 | if start is not None and num is None and len(cache) > start and None not in cache[start:]: 326 | # If start is specified and not num, we want to return the posts in cache only if they were already cached 327 | return cache[start:] 328 | elif start is None and num is not None and len(cache) > num and None not in cache[:num]: 329 | # If num is specified and not start, we want to do something similar to the above 330 | return cache[:num] 331 | elif start is not None and num is not None and len(cache) > start + num and None not in cache[start:start+num]: 332 | return cache[start:start+num] 333 | elif (start is None and (num is None or num > 
len(cache))) and None not in cache: 334 | return cache 335 | 336 | return None 337 | 338 | def update_cache(self, cache, values, total_entries, start=None, num=None): 339 | if cache is None: 340 | cache = values 341 | elif len(values) > 0: 342 | s = start 343 | if start is None: 344 | s = 0 345 | elif start >= total_entries: 346 | s = total_entries - 1 347 | n = num 348 | if n is not None and s + n > total_entries: 349 | n = total_entries - s 350 | if num is None: 351 | n = total_entries 352 | if n > len(cache): 353 | cache += [None] * (n - len(cache)) 354 | for el in values: 355 | cache[s] = el 356 | s += 1 357 | if s == n: 358 | break 359 | if len(cache) != total_entries: 360 | if start is not None and start < total_entries: 361 | cache = [None] * start + cache 362 | if num is not None: 363 | cache += [None] * (total_entries - len(cache)) 364 | return cache 365 | 366 | def get_comments(self, start=None, num=None, force=False): 367 | """ 368 | Retrieves all comments 369 | """ 370 | comments = self.get_from_cache(self.comments, start, num, force) 371 | if comments is not None: 372 | return comments 373 | 374 | comments, total_entries = self.crawl_pages('wp/v2/comments?page=%d', start, num) 375 | self.comments = self.update_cache(self.comments, comments, total_entries, start, num) 376 | return comments 377 | 378 | def get_posts(self, comments=False, start=None, num=None, force=False): 379 | """ 380 | Retrieves all posts or the specified ones 381 | """ 382 | if self.has_v2 is None: 383 | self.get_basic_info() 384 | if not self.has_v2: 385 | raise WordPressApiNotV2 386 | if self.posts is not None and start is not None and len(self.posts) < start: 387 | start = len(self.posts) - 1 388 | if self.posts is not None and (self.comments_loaded and comments or not comments) and not force: 389 | posts = self.get_from_cache(self.posts, start, num) 390 | if posts is not None: 391 | return posts 392 | posts, total_entries = self.crawl_pages('wp/v2/posts?page=%d', start=start, 
num=num) 393 | 394 | self.posts = self.update_cache(self.posts, posts, total_entries, start, num) 395 | 396 | if not self.comments_loaded and comments: 397 | # Load comments 398 | comment_list = self.crawl_pages('wp/v2/comments?page=%d')[0] 399 | for comment in comment_list: 400 | found_post = False 401 | for i in range(0, len(self.posts)): 402 | if self.posts[i]['id'] == comment['post']: 403 | if "comments" not in self.posts[i]: 404 | self.posts[i]['comments'] = [] 405 | self.posts[i]["comments"].append(comment) 406 | found_post = True 407 | break 408 | if not found_post: 409 | self.orphan_comments.append(comment) 410 | self.comments_loaded = True 411 | 412 | return_posts = self.posts 413 | if start is not None and start < len(return_posts): 414 | return_posts = return_posts[start:] 415 | if num is not None and num < len(return_posts): 416 | return_posts = return_posts[:num] 417 | return return_posts 418 | 419 | def get_tags(self, start=None, num=None, force=False): 420 | """ 421 | Retrieves all tags 422 | """ 423 | tags = self.get_from_cache(self.tags, start, num, force) 424 | if tags is not None: 425 | return tags 426 | 427 | tags, total_entries = self.crawl_pages('wp/v2/tags?page=%d', start, num) 428 | self.tags = self.update_cache(self.tags, tags, total_entries, start, num) 429 | return tags 430 | 431 | def get_categories(self, start=None, num=None, force=False): 432 | """ 433 | Retrieves all categories or the specified ones 434 | """ 435 | categories = self.get_from_cache(self.categories, start, num, force) 436 | if categories is not None: 437 | return categories 438 | 439 | categories, total_entries = self.crawl_pages('wp/v2/categories?page=%d', start=start, num=num) 440 | self.categories = self.update_cache(self.categories, categories, total_entries, start, num) 441 | return categories 442 | 443 | def get_users(self, start=None, num=None, force=False): 444 | """ 445 | Retrieves all users or the specified ones 446 | """ 447 | users = 
self.get_from_cache(self.users, start, num, force) 448 | if users is not None: 449 | return users 450 | 451 | users, total_entries = self.crawl_pages('wp/v2/users?page=%d', start=start, num=num) 452 | self.users = self.update_cache(self.users, users, total_entries, start, num) 453 | return users 454 | 455 | def get_media(self, start=None, num=None, force=False): 456 | """ 457 | Retrieves all media objects 458 | """ 459 | media = self.get_from_cache(self.media, start, num, force) 460 | if media is not None: 461 | return media 462 | 463 | media, total_entries = self.crawl_pages('wp/v2/media?page=%d', start=start, num=num) 464 | self.media = self.update_cache(self.media, media, total_entries, start, num) 465 | return media 466 | 467 | def get_media_urls(self, ids, cache=True): 468 | """ 469 | Retrieves the media download URLs and slugs for the specified IDs, for all media, or from the cache 470 | """ 471 | media = [] 472 | if ids == 'all': 473 | media = self.get_media(force=(not cache)) 474 | elif ids == 'cache': 475 | media = self.get_from_cache(self.media, force=(not cache)) 476 | else: 477 | id_list = ids.split(',') 478 | media = [] 479 | for i in id_list: 480 | try: 481 | if int(i) > 0: 482 | m = self.get_obj_by_id(WPApi.MEDIA, int(i), cache) 483 | if m is not None and len(m) > 0 and type(m[0]) is dict: 484 | media.append(m[0]) 485 | except ValueError: 486 | pass 487 | urls = [] 488 | slugs = [] 489 | if media is None: 490 | return urls, slugs 491 | for m in media: 492 | if m is not None and type(m) is dict and "source_url" in m.keys() and 'slug' in m.keys(): 493 | urls.append(m["source_url"]) 494 | slugs.append(m['slug']) 495 | return urls, slugs 496 | 497 | 498 | def get_pages(self, start=None, num=None, force=False): 499 | """ 500 | Retrieves all pages 501 | """ 502 | pages = self.get_from_cache(self.pages, start, num, force) 503 | if pages is not None: 504 | return pages 505 | 506 | pages, total_entries = self.crawl_pages('wp/v2/pages?page=%d', start=start, num=num) 507 | self.pages = 
self.update_cache(self.pages, pages, total_entries, start, num) 508 | return pages 509 | 510 | def get_namespaces(self, start=None, num=None, force=False): 511 | """ 512 | Retrieves an array of namespaces 513 | """ 514 | if self.has_v2 is None or force: 515 | self.get_basic_info() 516 | if 'namespaces' in self.basic_info.keys(): 517 | if start is None and num is None: 518 | return self.basic_info['namespaces'] 519 | namespaces = copy.deepcopy(self.basic_info['namespaces']) 520 | if start is not None and start < len(namespaces): 521 | namespaces = namespaces[start:] 522 | if num is not None and num <= len(namespaces): 523 | namespaces = namespaces[:num] 524 | return namespaces 525 | return [] 526 | 527 | def get_routes(self): 528 | """ 529 | Retrieves an array of routes 530 | """ 531 | if self.has_v2 is None: 532 | self.get_basic_info() 533 | if 'routes' in self.basic_info.keys(): 534 | return self.basic_info['routes'] 535 | return [] 536 | 537 | def crawl_namespaces(self, ns): 538 | """ 539 | Crawls all accessible GET routes defined for the specified namespace. 
540 | """ 541 | namespaces = self.get_namespaces() 542 | routes = self.get_routes() 543 | ns_data = {} 544 | if ns != "all" and ns not in namespaces: 545 | raise NSNotFoundException 546 | for url, route in routes.items(): 547 | if 'namespace' not in route.keys() \ 548 | or 'endpoints' not in route.keys(): 549 | continue 550 | url_as_ns = url.lstrip('/') 551 | if '(?P<' in url or url_as_ns in namespaces: 552 | continue 553 | if ns != 'all' and route['namespace'] != ns or \ 554 | route['namespace'] in ['wp/v2', '']: 555 | continue 556 | for endpoint in route['endpoints']: 557 | if 'GET' not in endpoint['methods']: 558 | continue 559 | keep = True 560 | if len(endpoint['args']) > 0 and type(endpoint['args']) is dict: 561 | for name,arg in endpoint['args'].items(): 562 | if arg['required']: 563 | keep = False 564 | if keep: 565 | rest_url = url_path_join(self.url, self.api_path, url) 566 | try: 567 | ns_request = self.s.get(rest_url) 568 | ns_data[url] = get_content_as_json(ns_request) 569 | except Exception: 570 | continue 571 | return ns_data 572 | 573 | def get_obj_by_id_helper(self, cache, obj_id, url, use_cache=True): 574 | if use_cache and cache is not None: 575 | obj = get_by_id(cache, obj_id) 576 | if obj is not None: 577 | return [obj] 578 | obj = self.crawl_single_page(url % obj_id) 579 | if type(obj) is dict: 580 | return [obj] 581 | return [] 582 | 583 | def get_obj_by_id(self, obj_type, obj_id, use_cache=True): 584 | """ 585 | Returns a list of maximum one object specified by its type and ID. 586 | 587 | Also returns an empty list if the ID does not exist. 588 | 589 | :param obj_type: the type of the object (ex. 
POST) 590 | :param obj_id: the ID of the object to fetch 591 | :param use_cache: if the cache should be used to avoid useless requests 592 | """ 593 | if obj_type == WPApi.USER: 594 | return self.get_obj_by_id_helper(self.users, obj_id, 'wp/v2/users/%d', use_cache) 595 | if obj_type == WPApi.TAG: 596 | return self.get_obj_by_id_helper(self.tags, obj_id, 'wp/v2/tags/%d', use_cache) 597 | if obj_type == WPApi.CATEGORY: 598 | return self.get_obj_by_id_helper(self.categories, obj_id, 'wp/v2/categories/%d', use_cache) 599 | if obj_type == WPApi.POST: 600 | return self.get_obj_by_id_helper(self.posts, obj_id, 'wp/v2/posts/%d', use_cache) 601 | if obj_type == WPApi.PAGE: 602 | return self.get_obj_by_id_helper(self.pages, obj_id, 'wp/v2/pages/%d', use_cache) 603 | if obj_type == WPApi.COMMENT: 604 | return self.get_obj_by_id_helper(self.comments, obj_id, 'wp/v2/comments/%d', use_cache) 605 | if obj_type == WPApi.MEDIA: 606 | return self.get_obj_by_id_helper(self.media, obj_id, 'wp/v2/media/%d', use_cache) 607 | return [] 608 | 609 | def get_obj_list(self, obj_type, start, limit, cache, kwargs={}): 610 | """ 611 | Returns a list of at most limit objects, starting at the specified object offset. 612 | 613 | :param obj_type: the type of the object (ex. 
POST) 614 | :param start: the offset of the first object to return 615 | :param limit: the maximum number of objects to return 616 | :param cache: if the cache should be used to avoid useless requests 617 | :param kwargs: additional parameters to pass to the function (for POST only) 618 | """ 619 | get_func = None 620 | if obj_type == WPApi.USER: 621 | get_func = self.get_users 622 | elif obj_type == WPApi.TAG: 623 | get_func = self.get_tags 624 | elif obj_type == WPApi.CATEGORY: 625 | get_func = self.get_categories 626 | elif obj_type == WPApi.PAGE: 627 | get_func = self.get_pages 628 | elif obj_type == WPApi.COMMENT: 629 | get_func = self.get_comments 630 | elif obj_type == WPApi.MEDIA: 631 | get_func = self.get_media 632 | elif obj_type == WPApi.NAMESPACE: 633 | get_func = self.get_namespaces 634 | 635 | if get_func is not None: 636 | return get_func(start=start, num=limit, force=not cache) 637 | elif obj_type == WPApi.POST: 638 | return self.get_posts(start=start, num=limit, force=not cache, **kwargs) 639 | return [] 640 | 641 | def search(self, obj_types, keywords, start, limit): 642 | """ 643 | Looks for data with the specified keywords of the given types. 
644 | 645 | :param obj_types: a list of the desired object types to look for 646 | :param keywords: the keywords to look for 647 | :param start: a start index 648 | :param limit: the max number to return 649 | :return: a dict of lists of objects sorted by types 650 | """ 651 | out = {} 652 | if WPApi.ALL_TYPES in obj_types or len(obj_types) == 0: 653 | obj_types = [ 654 | WPApi.POST, WPApi.CATEGORY, WPApi.TAG, WPApi.PAGE, 655 | WPApi.COMMENT, WPApi.MEDIA, WPApi.USER 656 | ] # All supported types for search 657 | for t in obj_types: 658 | if t == WPApi.POST: 659 | out[t] = self.crawl_pages('wp/v2/posts?page=%d', start=start, num=limit, search_terms=keywords, display_progress=False)[0] 660 | elif t == WPApi.CATEGORY: 661 | out[t] = self.crawl_pages('wp/v2/categories?page=%d', start=start, num=limit, search_terms=keywords, display_progress=False)[0] 662 | elif t == WPApi.TAG: 663 | out[t] = self.crawl_pages('wp/v2/tags?page=%d', start=start, num=limit, search_terms=keywords, display_progress=False)[0] 664 | elif t == WPApi.PAGE: 665 | out[t] = self.crawl_pages('wp/v2/pages?page=%d', start=start, num=limit, search_terms=keywords, display_progress=False)[0] 666 | elif t == WPApi.COMMENT: 667 | out[t] = self.crawl_pages('wp/v2/comments?page=%d', start=start, num=limit, search_terms=keywords, display_progress=False)[0] 668 | elif t == WPApi.MEDIA: 669 | out[t] = self.crawl_pages('wp/v2/media?page=%d', start=start, num=limit, search_terms=keywords, display_progress=False)[0] 670 | elif t == WPApi.USER: 671 | out[t] = self.crawl_pages('wp/v2/users?page=%d', start=start, num=limit, search_terms=keywords, display_progress=False)[0] 672 | return out -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | certifi==2022.12.7 2 | chardet==3.0.4 3 | idna==2.9 4 | requests==2.23.0 5 | urllib3==1.26.5 6 | 
--------------------------------------------------------------------------------