├── .gitignore ├── CONTACT.md ├── LICENSE.txt ├── README.md ├── requirements.txt ├── timesearch.py ├── timesearch_logo.svg ├── timesearch_modules ├── breakdown.py ├── common.py ├── exceptions.py ├── get_comments.py ├── get_styles.py ├── get_submissions.py ├── get_wiki.py ├── index.py ├── ingest_jsonfile.py ├── livestream.py ├── merge_db.py ├── offline_reading.py ├── pushshift.py └── tsdb.py └── utilities └── database_upgrader.py /.gitignore: -------------------------------------------------------------------------------- 1 | databases/* 2 | @hangman.md 3 | hangman.py 4 | -------------------------------------------------------------------------------- /CONTACT.md: -------------------------------------------------------------------------------- 1 | Contact 2 | ======= 3 | 4 | Please do not open pull requests without talking to me first. For serious issues and bugs, open a GitHub issue. If you just have a question, please send an email to `contact@voussoir.net`. For other contact options, see [voussoir.net/#contact](https://voussoir.net/#contact). 5 | 6 | I also mirror my work to other git services: 7 | 8 | - https://github.com/voussoir 9 | 10 | - https://gitlab.com/voussoir 11 | 12 | - https://codeberg.org/voussoir 13 | -------------------------------------------------------------------------------- /LICENSE.txt: -------------------------------------------------------------------------------- 1 | BSD 3-Clause License 2 | 3 | Copyright (c) 2021, Ethan Dalool aka voussoir 4 | All rights reserved. 5 | 6 | Redistribution and use in source and binary forms, with or without 7 | modification, are permitted provided that the following conditions are met: 8 | 9 | 1. Redistributions of source code must retain the above copyright notice, this 10 | list of conditions and the following disclaimer. 11 | 12 | 2. Redistributions in binary form must reproduce the above copyright notice, 13 | this list of conditions and the following disclaimer in the documentation 14 | and/or other materials provided with the distribution. 15 | 16 | 3. Neither the name of the copyright holder nor the names of its 17 | contributors may be used to endorse or promote products derived from 18 | this software without specific prior written permission. 19 | 20 | THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" 21 | AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE 22 | IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE 23 | DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE 24 | FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL 25 | DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR 26 | SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER 27 | CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, 28 | OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE 29 | OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 30 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | timesearch 2 | ========== 3 | 4 | ## NEWS (2023 06 25): 5 | 6 | Pushshift's API is currently offline. Without the timestamp search parameter or Pushshift access, timesearch is not able to get historical data. You can continue to use the `livestream` module to collect new posts and comments as they are made. 
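For example, to keep collecting new posts and comments from a subreddit as they are posted:

`python timesearch.py livestream -r subredditname`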
7 | 8 | You can still download the Pushshift archives, though. https://the-eye.eu/redarcs/ is one source. 9 | 10 | I have added a module for ingesting these json files into a timesearch database so that you can continue to use `offline_reading`, or if you just prefer the sqlite format. You need to extract the zst file with an archive tool like [7-Zip](https://www.7-zip.org/) before giving it to timesearch. 11 | 12 | `python timesearch.py ingest_jsonfile subredditname_submissions -r subredditname` 13 | 14 | `python timesearch.py ingest_jsonfile subredditname_comments -r subredditname` 15 | 16 | ## NEWS (2023 05 01): 17 | 18 | [Reddit has revoked Pushshift's API access](https://old.reddit.com/r/modnews/comments/134tjpe/reddit_data_api_update_changes_to_pushshift_access/), so [pushshift.io](https://pushshift.io) may not be able to continue ingesting reddit content. 19 | 20 | ## NEWS (2018 04 09): 21 | 22 | [Reddit has removed the timestamp search feature which timesearch was built off of](https://voussoir.github.io/t3_7tus5f.html#t1_dtfcdn0) ([original](https://old.reddit.com/r/changelog/comments/7tus5f/update_to_search_api/dtfcdn0/)). Please message the admins by [sending a PM to /r/reddit.com](https://old.reddit.com/message/compose?to=%2Fr%2Freddit.com&subject=Timestamp+search). Let them know that this feature is important to you, and you would like them to restore it on the new search stack. 23 | 24 | Thankfully, Jason Baumgartner aka [/u/Stuck_in_the_Matrix](https://old.reddit.com/u/Stuck_in_the_Matrix/overview), owner of [Pushshift.io](https://github.com/pushshift/api), has made it easy to interact with his dataset. Timesearch now queries his API to get post data, and then uses reddit's /api/info to get up-to-date information about those posts (scores, edited text bodies, ...). While we're at it, this also gives us the ability to speed up `get_comments`. In addition, we can get all of a user's comments which was not possible through reddit alone. 25 | 26 | NOTE: Because Pushshift is an independent dataset run by a regular person, it does not contain posts from private subreddits. Without the timestamp search parameter, scanning private subreddits is now impossible. I urge once again that you contact ~~your senator~~ the admins to have this feature restored. 27 | 28 | --- 29 | 30 | I don't have a test suite. You're my test suite! Messages go to [/u/GoldenSights](https://old.reddit.com/u/GoldenSights). 31 | 32 | Timesearch is a collection of utilities for archiving subreddits. 33 | 34 | ## Make sure you have: 35 | - Downloaded this project using the green "Clone or Download" button in the upper right. 36 | - Installed [Python](https://www.python.org/download). I use Python 3.7. 37 | - Installed PRAW >= 4, as well as the other modules in `requirements.txt`. Try `pip install -r requirements.txt` to get them all. 38 | - Created an OAuth app at https://old.reddit.com/prefs/apps. Make it `script` type, and set the redirect URI to `http://localhost:8080`. The title and description can be anything you want, and the about URL is not required. 39 | - Used [this PRAW script](https://praw.readthedocs.io/en/latest/tutorials/refresh_token.html) to generate a refresh token. Just save it as a .py file somewhere and run it through your terminal / command line. For simplicity's sake, I just choose `all` for the scopes. 40 | - The instructions mention `export praw_client_id=...`. This creates environment variables on Linux. 
If you are on Windows, or simply don't want to create environment variables, you can alternatively add `client_id='...'` and `client_secret='...'` to the `praw.Reddit` instance on line 40, alongside the `redirect_uri` and `user_agent` arguments. 41 | - Downloaded a copy of [this file](https://github.com/voussoir/reddit/blob/master/bot4.py) and saved it as `bot.py`. Fill out the variables using your OAuth information, and read the instructions to see where to put it. The most simple way is to save it in the same folder as this README file. 42 | - The `USERAGENT` is a description of your API usage. Typically "/u/username's praw client" is sufficient. 43 | - The `CONTACT_INFO` is sent when downloading from Pushshift, [as encouraged by Stuck_in_the_Matrix](https://old.reddit.com/r/pushshift/comments/c5yr9l/i_had_to_ban_a_couple_ips_that_were_making/). It could just be your email address or reddit username. 44 | 45 | ## This package consists of: 46 | 47 | - **get_submissions**: If you try to page through `/new` on a subreddit, you'll hit a limit at or before 1,000 posts. Timesearch uses the pushshift.io dataset to get information about very old posts, and then queries the reddit api to update their information. Previously, we used the `timestamp` cloudsearch query parameter on reddit's own API, but reddit has removed that feature and pushshift is now the only viable source for initial data. 48 | `python timesearch.py get_submissions -r subredditname ` 49 | `python timesearch.py get_submissions -u username ` 50 | 51 | - **get_comments**: Similar to `get_submissions`, this tool queries pushshift for comment data and updates it from reddit. 52 | `python timesearch.py get_comments -r subredditname ` 53 | `python timesearch.py get_comments -u username ` 54 | 55 | - **livestream**: get_submissions+get_comments is great for starting your database and getting the historical posts, but it's not the best for staying up-to-date. Instead, livestream monitors `/new` and `/comments` to continuously ingest data. 56 | `python timesearch.py livestream -r subredditname ` 57 | `python timesearch.py livestream -u username ` 58 | 59 | - **get_styles**: Downloads the stylesheet and CSS images. 60 | `python timesearch.py get_styles -r subredditname` 61 | 62 | - **get_wiki**: Downloads the wiki pages, sidebar, etc. from /wiki/pages. 63 | `python timesearch.py get_wiki -r subredditname` 64 | 65 | - **offline_reading**: Renders comment threads into HTML via markdown. 66 | Note: I'm currently using the [markdown library from pypi](https://pypi.python.org/pypi/Markdown), and it doesn't do reddit's custom markdown like `/r/` or `/u/`, obviously. So far I don't think anybody really uses o_r so I haven't invested much time into improving it. 67 | `python timesearch.py offline_reading -r subredditname ` 68 | `python timesearch.py offline_reading -u username ` 69 | 70 | - **index**: Generates plaintext or HTML lists of submissions, sorted by a property of your choosing. You can order by date, author, flair, etc. With the `--offline` parameter, you can make all the links point to the files you generated with `offline_reading`. 71 | `python timesearch.py index -r subredditname ` 72 | `python timesearch.py index -u username ` 73 | 74 | - **breakdown**: Produces a JSON file indicating which users make the most posts in a subreddit, or which subreddits a user posts in. 
75 | `python timesearch.py breakdown -r subredditname` 76 | `python timesearch.py breakdown -u username` 77 | 78 | - **merge_db**: Copy all new data from one timesearch database into another. Useful for syncing or merging two scans of the same subreddit. 79 | `python timesearch.py merge_db --from filepath/database1.db --to filepath/database2.db` 80 | 81 | ### To use it 82 | 83 | When you download this project, the main file that you will execute is `timesearch.py` here in the root directory. It will load the appropriate module to run your command from the modules folder. 84 | 85 | You can view a summarized version of all the help text by running `timesearch.py`, and you can view a specific help text by running a command with no arguments, like `timesearch.py livestream`, etc. 86 | 87 | I recommend [sqlitebrowser](https://github.com/sqlitebrowser/sqlitebrowser/releases) if you want to inspect the database yourself. 88 | 89 | ## Changelog 90 | - 2020 01 27 91 | - When I first created Timesearch, it was simply a collection of all the random scripts I had written to archive various things. And they tended to have wacky names like `commentaugment` and `redmash`. Well, since the timesearch toolkit is meant to be a singular cohesive package now I decided to finally rename everything. I believe I have aliased everything properly so the old names still work for backwards compat, except for the fact the modules folder is now called `timesearch_modules` which may break your import statements if you ever imported that on your own. 92 | 93 | - 2018 04 09 94 | - Integrated with Pushshift to restore timesearch functionality, speed up commentaugment, and get user comments. 95 | 96 | - 2017 11 13 97 | - Gave timesearch its own Github repository so that (1) it will be easier for people to download it and (2) it has a cleaner, more independent URL. [voussoir/timesearch](https://github.com/voussoir/timesearch) 98 | 99 | - 2017 11 05 100 | - Added a try-except inside livestream helper to prevent generator from terminating. 101 | 102 | - 2017 11 04 103 | - For timesearch, I switched from using my custom cloudsearch iterator to the one that comes with PRAW4+. 104 | 105 | - 2017 10 12 106 | - Added the `mergedb` utility for combining databases. 107 | 108 | - 2017 06 02 109 | - You can use `commentaugment -s abcdef` to get a particular thread even if you haven't scraped anything else from that subreddit. Previously `-s` only worked if the database already existed and you specified it via `-r`. Now it is inferred from the submission itself. 110 | 111 | - 2017 04 28 112 | - Complete restructure into package, started using PRAW4. 113 | 114 | - 2016 08 10 115 | - Started merging redmash and wrote its argparser 116 | 117 | - 2016 07 03 118 | - Improved docstring clarity. 119 | 120 | - 2016 07 02 121 | - Added `livestream` argparse 122 | 123 | - 2016 06 07 124 | - Offline_reading has been merged with the main timesearch file 125 | - `get_all_posts` renamed to `timesearch` 126 | - Timesearch parameter `usermode` renamed to `username`; `maxupper` renamed to `upper`. 127 | - Everything now accessible via commandline arguments. Read the docstring at the top of the file. 128 | 129 | - 2016 06 05 130 | - NEW DATABASE SCHEME. Submissions and comments now live in different tables like they should have all along. Submission table has two new columns for a little bit of commentaugment metadata. This allows commentaugment to only scan threads that are new. 
131 | - You can use the `migrate_20160605.py` script to convert old databases into new ones. 132 | 133 | - 2015 11 11 134 | - created `offline_reading.py` which converts a timesearch database into a comment tree that can be rendered into HTML 135 | 136 | - 2015 09 07 137 | - fixed bug which allowed `livestream` to crash because `bot.refresh()` was outside of the try-catch. 138 | 139 | - 2015 08 19 140 | - fixed bug in which updatescores stopped iterating early if you had more than 100 comments in a row in the db 141 | - commentaugment has been completely merged into the timesearch.py file. you can use commentaugment_prompt() to input the parameters, or use the commentaugment() function directly. 142 | 143 | 144 | ____ 145 | 146 | 147 | I want to live in a future where everyone uses UTC and agrees on daylight savings. 148 | 149 |

150 | [Timesearch logo: timesearch_logo.svg] 151 |

152 | 153 | ## Mirrors 154 | 155 | https://git.voussoir.net/voussoir/timesearch 156 | 157 | https://github.com/voussoir/timesearch 158 | 159 | https://gitlab.com/voussoir/timesearch 160 | 161 | https://codeberg.org/voussoir/timesearch 162 | -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | markdown 2 | praw 3 | voussoirkit 4 | -------------------------------------------------------------------------------- /timesearch.py: -------------------------------------------------------------------------------- 1 | ''' 2 | This is the main launch file for Timesearch. 3 | 4 | When you run `python timesearch.py get_submissions -r subredditname` or any 5 | other command, your arguments will go to the timesearch_modules file as 6 | appropriate for your command. 7 | ''' 8 | import argparse 9 | import sys 10 | 11 | from voussoirkit import betterhelp 12 | from voussoirkit import vlogging 13 | 14 | from timesearch_modules import exceptions 15 | 16 | # NOTE: Originally I wanted the docstring for each module to be within their 17 | # file. However, this means that composing the global helptext would require 18 | # importing those modules, which will subsequently import PRAW and a whole lot 19 | # of other things. This made TS very slow to load which is okay when you're 20 | # actually using it but really terrible when you're just viewing the help text. 21 | 22 | def breakdown_gateway(args): 23 | from timesearch_modules import breakdown 24 | breakdown.breakdown_argparse(args) 25 | 26 | def get_comments_gateway(args): 27 | from timesearch_modules import get_comments 28 | get_comments.get_comments_argparse(args) 29 | 30 | def get_styles_gateway(args): 31 | from timesearch_modules import get_styles 32 | get_styles.get_styles_argparse(args) 33 | 34 | def get_wiki_gateway(args): 35 | from timesearch_modules import get_wiki 36 | get_wiki.get_wiki_argparse(args) 37 | 38 | def ingest_jsonfile_gateway(args): 39 | from timesearch_modules import ingest_jsonfile 40 | ingest_jsonfile.ingest_jsonfile_argparse(args) 41 | 42 | def livestream_gateway(args): 43 | from timesearch_modules import livestream 44 | livestream.livestream_argparse(args) 45 | 46 | def merge_db_gateway(args): 47 | from timesearch_modules import merge_db 48 | merge_db.merge_db_argparse(args) 49 | 50 | def offline_reading_gateway(args): 51 | from timesearch_modules import offline_reading 52 | offline_reading.offline_reading_argparse(args) 53 | 54 | def index_gateway(args): 55 | from timesearch_modules import index 56 | index.index_argparse(args) 57 | 58 | def get_submissions_gateway(args): 59 | from timesearch_modules import get_submissions 60 | get_submissions.get_submissions_argparse(args) 61 | 62 | @vlogging.main_decorator 63 | def main(argv): 64 | parser = argparse.ArgumentParser( 65 | description=''' 66 | The subreddit archiver 67 | 68 | The basics: 69 | 1. Collect a subreddit's submissions 70 | timesearch get_submissions -r subredditname 71 | 72 | 2. Collect the comments for those submissions 73 | timesearch get_comments -r subredditname 74 | 75 | 3. Stay up to date 76 | timesearch livestream -r subredditname 77 | ''', 78 | ) 79 | subparsers = parser.add_subparsers() 80 | 81 | # BREAKDOWN 82 | p_breakdown = subparsers.add_parser( 83 | 'breakdown', 84 | description=''' 85 | Generate the comment / submission counts for users in a subreddit, or 86 | the subreddits that a user posts to. 
87 | 88 | Automatically dumps into a _breakdown.json file 89 | in the same directory as the database. 90 | ''', 91 | ) 92 | p_breakdown.add_argument( 93 | '--sort', 94 | dest='sort', 95 | type=str, 96 | default=None, 97 | help=''' 98 | Sort the output by one property. 99 | Should be one of "name", "submissions", "comments", "total_posts". 100 | ''', 101 | ) 102 | p_breakdown.add_argument( 103 | '-r', 104 | '--subreddit', 105 | dest='subreddit', 106 | default=None, 107 | help=''' 108 | The subreddit database to break down. 109 | ''', 110 | ) 111 | p_breakdown.add_argument( 112 | '-u', 113 | '--user', 114 | dest='username', 115 | default=None, 116 | help=''' 117 | The username database to break down. 118 | ''', 119 | ) 120 | p_breakdown.set_defaults(func=breakdown_gateway) 121 | 122 | # GET_COMMENTS 123 | p_get_comments = subparsers.add_parser( 124 | 'get_comments', 125 | aliases=['get-comments', 'commentaugment'], 126 | description=''' 127 | Collect comments on a subreddit or comments made by a user. 128 | ''', 129 | ) 130 | p_get_comments.add_argument( 131 | '-r', 132 | '--subreddit', 133 | dest='subreddit', 134 | default=None, 135 | ) 136 | p_get_comments.add_argument( 137 | '-s', 138 | '--specific', 139 | dest='specific_submission', 140 | default=None, 141 | help=''' 142 | Given a submission ID like t3_xxxxxx, scan only that submission. 143 | ''', 144 | ) 145 | p_get_comments.add_argument( 146 | '-u', 147 | '--user', 148 | dest='username', 149 | default=None, 150 | ) 151 | p_get_comments.add_argument( 152 | '--dont_supplement', 153 | '--dont-supplement', 154 | dest='do_supplement', 155 | action='store_false', 156 | help=''' 157 | If provided, trust the pushshift data and do not fetch live copies 158 | from reddit. 159 | ''', 160 | ) 161 | p_get_comments.add_argument( 162 | '--lower', 163 | dest='lower', 164 | default='update', 165 | help=''' 166 | If a number - the unix timestamp to start at. 167 | If "update" - continue from latest comment in db. 168 | WARNING: If at some point you collected comments for a particular 169 | submission which was ahead of the rest of your comments, using "update" 170 | will start from that later submission, and you will miss the stuff in 171 | between that specific post and the past. 172 | ''', 173 | ) 174 | p_get_comments.add_argument( 175 | '--upper', 176 | dest='upper', 177 | default=None, 178 | help=''' 179 | If a number - the unix timestamp to stop at. 180 | If not provided - stop at current time. 181 | ''', 182 | ) 183 | p_get_comments.set_defaults(func=get_comments_gateway) 184 | 185 | # GET_STYLES 186 | p_get_styles = subparsers.add_parser( 187 | 'get_styles', 188 | aliases=['get-styles', 'getstyles'], 189 | help=''' 190 | Collect the stylesheet, and css images. 191 | ''', 192 | ) 193 | p_get_styles.add_argument( 194 | '-r', 195 | '--subreddit', 196 | dest='subreddit', 197 | ) 198 | p_get_styles.set_defaults(func=get_styles_gateway) 199 | 200 | # GET_WIKI 201 | p_get_wiki = subparsers.add_parser( 202 | 'get_wiki', 203 | aliases=['get-wiki', 'getwiki'], 204 | description=''' 205 | Collect all available wiki pages. 
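
        Example:
        timesearch.py get_wiki -r subredditname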
206 | ''', 207 | ) 208 | p_get_wiki.add_argument( 209 | '-r', 210 | '--subreddit', 211 | dest='subreddit', 212 | ) 213 | p_get_wiki.set_defaults(func=get_wiki_gateway) 214 | 215 | # INGEST_JSONFILE 216 | p_ingest_jsonfile = subparsers.add_parser( 217 | 'ingest_jsonfile', 218 | description=''' 219 | This module was added after reddit's June 2023 API changes which 220 | resulted in pushshift losing API access, and pushshift's own API was 221 | disabled. The community has made archive files available for download. 222 | These archive files contain 1 object (a submission or a comment) per 223 | line in a JSON format. 224 | 225 | You can ingest these into timesearch so that you can continue to use 226 | timesearch's offline_reading or index features. 227 | ''', 228 | ) 229 | p_ingest_jsonfile.add_argument( 230 | 'json_file', 231 | help=''' 232 | Path to a file containing 1 json object per line. Each object must be 233 | either a submission or a comment. 234 | ''', 235 | ) 236 | p_ingest_jsonfile.add_argument( 237 | '-r', 238 | '--subreddit', 239 | dest='subreddit', 240 | default=None, 241 | ) 242 | p_ingest_jsonfile.add_argument( 243 | '-u', 244 | '--user', 245 | dest='username', 246 | default=None, 247 | ) 248 | p_ingest_jsonfile.set_defaults(func=ingest_jsonfile_gateway) 249 | 250 | # LIVESTREAM 251 | p_livestream = subparsers.add_parser( 252 | 'livestream', 253 | description=''' 254 | Continously collect submissions and/or comments. 255 | ''', 256 | ) 257 | p_livestream.add_argument( 258 | '--once', 259 | dest='once', 260 | action='store_true', 261 | help=''' 262 | If provided, only do a single loop. Otherwise go forever. 263 | ''', 264 | ) 265 | p_livestream.add_argument( 266 | '-c', 267 | '--comments', 268 | dest='comments', 269 | action='store_true', 270 | help=''' 271 | If provided, do collect comments. Otherwise don't. 272 | 273 | If submissions and comments are BOTH left unspecified, then they will 274 | BOTH be collected. 275 | ''', 276 | ) 277 | p_livestream.add_argument( 278 | '--limit', 279 | dest='limit', 280 | type=int, 281 | default=None, 282 | help=''' 283 | Number of items to fetch per request. 284 | ''', 285 | ) 286 | p_livestream.add_argument( 287 | '-r', 288 | '--subreddit', 289 | dest='subreddit', 290 | default=None, 291 | help=''' 292 | The subreddit to collect from. 293 | ''', 294 | ) 295 | p_livestream.add_argument( 296 | '-s', 297 | '--submissions', 298 | dest='submissions', 299 | action='store_true', 300 | help=''' 301 | If provided, do collect submissions. Otherwise don't. 302 | 303 | If submissions and comments are BOTH left unspecified, then they will 304 | BOTH be collected. 305 | ''', 306 | ) 307 | p_livestream.add_argument( 308 | '-u', 309 | '--user', 310 | dest='username', 311 | default=None, 312 | help=''' 313 | The redditor to collect from. 314 | ''', 315 | ) 316 | p_livestream.add_argument( 317 | '-w', 318 | '--wait', 319 | dest='sleepy', 320 | default=30, 321 | help=''' 322 | The number of seconds to wait between cycles. 323 | ''', 324 | ) 325 | p_livestream.set_defaults(func=livestream_gateway) 326 | 327 | # MERGEDB' 328 | p_merge_db = subparsers.add_parser( 329 | 'merge_db', 330 | aliases=['merge-db', 'mergedb'], 331 | description=''' 332 | Copy all new posts from one timesearch database into another. 
333 | ''', 334 | ) 335 | p_merge_db.examples = [ 336 | '--from redditdev1.db --to redditdev2.db', 337 | ] 338 | p_merge_db.add_argument( 339 | '--from', 340 | dest='from_db_path', 341 | required=True, 342 | help=''' 343 | The database file containing the posts you wish to copy. 344 | ''', 345 | ) 346 | p_merge_db.add_argument( 347 | '--to', 348 | dest='to_db_path', 349 | required=True, 350 | help=''' 351 | The database file to which you will copy the posts. 352 | The database is modified in-place. 353 | Existing posts will be ignored and not updated. 354 | ''', 355 | ) 356 | p_merge_db.set_defaults(func=merge_db_gateway) 357 | 358 | # OFFLINE_READING 359 | p_offline_reading = subparsers.add_parser( 360 | 'offline_reading', 361 | aliases=['offline-reading'], 362 | description=''' 363 | Render submissions and comment threads to HTML via Markdown. 364 | ''', 365 | ) 366 | p_offline_reading.add_argument( 367 | '-r', 368 | '--subreddit', 369 | dest='subreddit', 370 | default=None, 371 | ) 372 | p_offline_reading.add_argument( 373 | '-s', 374 | '--specific', 375 | dest='specific_submission', 376 | default=None, 377 | type=str, 378 | help=''' 379 | Given a submission ID like t3_xxxxxx, render only that submission. 380 | Otherwise render every submission in the database. 381 | ''', 382 | ) 383 | p_offline_reading.add_argument( 384 | '-u', 385 | '--user', 386 | dest='username', 387 | default=None, 388 | ) 389 | p_offline_reading.set_defaults(func=offline_reading_gateway) 390 | 391 | # INDEX 392 | p_index = subparsers.add_parser( 393 | 'index', 394 | aliases=['redmash'], 395 | description=''' 396 | Dump submission listings to a plaintext or HTML file. 397 | ''', 398 | ) 399 | p_index.examples = [ 400 | { 401 | 'args': '-r botwatch --date', 402 | 'comment': 'Does only the date file.' 403 | }, 404 | { 405 | 'args': '-r botwatch --score --title', 406 | 'comment': 'Does both the score and title files.' 407 | }, 408 | { 409 | 'args': '-r botwatch --score --score_threshold 50', 410 | 'comment': 'Only shows submissions with >= 50 points.' 411 | }, 412 | { 413 | 'args': '-r botwatch --all', 414 | 'comment': 'Performs all of the different mashes.' 415 | }, 416 | ] 417 | p_index.add_argument( 418 | '-r', 419 | '--subreddit', 420 | dest='subreddit', 421 | default=None, 422 | help=''' 423 | The subreddit database to dump. 424 | ''', 425 | ) 426 | p_index.add_argument( 427 | '-u', 428 | '--user', 429 | dest='username', 430 | default=None, 431 | help=''' 432 | The username database to dump. 433 | ''', 434 | ) 435 | p_index.add_argument( 436 | '--all', 437 | dest='do_all', 438 | action='store_true', 439 | help=''' 440 | Perform all of the indexes listed below. 441 | ''', 442 | ) 443 | p_index.add_argument( 444 | '--author', 445 | dest='do_author', 446 | action='store_true', 447 | help=''' 448 | For subreddit databases only. 449 | Perform an index sorted by author. 450 | ''', 451 | ) 452 | p_index.add_argument( 453 | '--date', 454 | dest='do_date', 455 | action='store_true', 456 | help=''' 457 | Perform an index sorted by date. 458 | ''', 459 | ) 460 | p_index.add_argument( 461 | '--flair', 462 | dest='do_flair', 463 | action='store_true', 464 | help=''' 465 | Perform an index sorted by flair. 466 | ''', 467 | ) 468 | p_index.add_argument( 469 | '--html', 470 | dest='html', 471 | action='store_true', 472 | help=''' 473 | Write HTML files instead of plain text. 
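
        For example, -r botwatch --date --html writes botwatch_date.html
        instead of botwatch_date.txt.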
474 | ''', 475 | ) 476 | p_index.add_argument( 477 | '--score', 478 | dest='do_score', 479 | action='store_true', 480 | help=''' 481 | Perform an index sorted by score. 482 | ''', 483 | ) 484 | p_index.add_argument( 485 | '--sub', 486 | dest='do_subreddit', 487 | action='store_true', 488 | help=''' 489 | For username databases only. 490 | Perform an index sorted by subreddit. 491 | ''', 492 | ) 493 | p_index.add_argument( 494 | '--title', 495 | dest='do_title', 496 | action='store_true', 497 | help=''' 498 | Perform an index sorted by title. 499 | ''', 500 | ) 501 | p_index.add_argument( 502 | '--offline', 503 | dest='offline', 504 | action='store_true', 505 | help=''' 506 | The links in the index will point to the files generated by 507 | offline_reading. That is, `../offline_reading/fullname.html` instead 508 | of `http://redd.it/id`. This will NOT trigger offline_reading to 509 | generate the files now, so you must run that tool separately. 510 | ''', 511 | ) 512 | p_index.add_argument( 513 | '--score_threshold', 514 | '--score-threshold', 515 | dest='score_threshold', 516 | type=int, 517 | default=0, 518 | help=''' 519 | Only index posts with at least this many points. 520 | Applies to ALL indexes! 521 | ''', 522 | ) 523 | p_index.set_defaults(func=index_gateway) 524 | 525 | # GET_SUBMISSIONS 526 | p_get_submissions = subparsers.add_parser( 527 | 'get_submissions', 528 | aliases=['get-submissions', 'timesearch'], 529 | description=''' 530 | Collect submissions from the subreddit across all of history, or 531 | Collect submissions by a user (as many as possible). 532 | ''', 533 | ) 534 | p_get_submissions.add_argument( 535 | '--lower', 536 | dest='lower', 537 | default='update', 538 | help=''' 539 | If a number - the unix timestamp to start at. 540 | If "update" - continue from latest submission in db. 541 | ''', 542 | ) 543 | p_get_submissions.add_argument( 544 | '-r', 545 | '--subreddit', 546 | dest='subreddit', 547 | type=str, 548 | default=None, 549 | help=''' 550 | The subreddit to scan. Mutually exclusive with username. 551 | ''', 552 | ) 553 | p_get_submissions.add_argument( 554 | '-u', 555 | '--user', 556 | dest='username', 557 | type=str, 558 | default=None, 559 | help=''' 560 | The user to scan. Mutually exclusive with subreddit. 561 | ''', 562 | ) 563 | p_get_submissions.add_argument( 564 | '--upper', 565 | dest='upper', 566 | default=None, 567 | help=''' 568 | If a number - the unix timestamp to stop at. 569 | If not provided - stop at current time. 570 | ''', 571 | ) 572 | p_get_submissions.add_argument( 573 | '--dont_supplement', 574 | '--dont-supplement', 575 | dest='do_supplement', 576 | action='store_false', 577 | help=''' 578 | If provided, trust the pushshift data and do not fetch live copies 579 | from reddit. 580 | ''', 581 | ) 582 | p_get_submissions.set_defaults(func=get_submissions_gateway) 583 | 584 | try: 585 | return betterhelp.go(parser, argv) 586 | except exceptions.DatabaseNotFound as exc: 587 | message = str(exc) 588 | message += '\nHave you used any of the other utilities to collect data?' 
589 | print(message) 590 | return 1 591 | 592 | if __name__ == '__main__': 593 | raise SystemExit(main(sys.argv[1:])) 594 | -------------------------------------------------------------------------------- /timesearch_logo.svg: -------------------------------------------------------------------------------- 1 | 2 | 13 | 15 | 17 | 18 | 20 | image/svg+xml 21 | 23 | 24 | 25 | 26 | 27 | 30 | 34 | 41 | 48 | 52 | 56 | 63 | 67 | 71 | 75 | 76 | 77 | -------------------------------------------------------------------------------- /timesearch_modules/breakdown.py: -------------------------------------------------------------------------------- 1 | import os 2 | import json 3 | 4 | from . import common 5 | from . import tsdb 6 | 7 | 8 | def breakdown_database(subreddit=None, username=None): 9 | ''' 10 | Given a database, return a json dict breaking down the submission / comment count for 11 | users (if a subreddit database) or subreddits (if a user database). 12 | ''' 13 | if (subreddit is None) == (username is None): 14 | raise Exception('Enter subreddit or username but not both') 15 | 16 | breakdown_results = {} 17 | def _ingest(names, subkey): 18 | for name in names: 19 | breakdown_results.setdefault(name, {}) 20 | breakdown_results[name].setdefault(subkey, 0) 21 | breakdown_results[name][subkey] += 1 22 | 23 | if subreddit: 24 | database = tsdb.TSDB.for_subreddit(subreddit, do_create=False) 25 | else: 26 | database = tsdb.TSDB.for_user(username, do_create=False) 27 | cur = database.sql.cursor() 28 | 29 | for table in ['submissions', 'comments']: 30 | if subreddit: 31 | cur.execute('SELECT author FROM %s' % table) 32 | elif username: 33 | cur.execute('SELECT subreddit FROM %s' % table) 34 | 35 | names = (row[0] for row in common.fetchgenerator(cur)) 36 | _ingest(names, table) 37 | 38 | for name in breakdown_results: 39 | breakdown_results[name].setdefault('submissions', 0) 40 | breakdown_results[name].setdefault('comments', 0) 41 | 42 | return breakdown_results 43 | 44 | def breakdown_argparse(args): 45 | if args.subreddit: 46 | database = tsdb.TSDB.for_subreddit(args.subreddit, do_create=False) 47 | else: 48 | database = tsdb.TSDB.for_user(args.username, do_create=False) 49 | 50 | breakdown_results = breakdown_database( 51 | subreddit=args.subreddit, 52 | username=args.username, 53 | ) 54 | 55 | def sort_name(name): 56 | return name.lower() 57 | def sort_submissions(name): 58 | invert_score = -1 * breakdown_results[name]['submissions'] 59 | return (invert_score, name.lower()) 60 | def sort_comments(name): 61 | invert_score = -1 * breakdown_results[name]['comments'] 62 | return (invert_score, name.lower()) 63 | def sort_total_posts(name): 64 | invert_score = breakdown_results[name]['submissions'] + breakdown_results[name]['comments'] 65 | invert_score = -1 * invert_score 66 | return (invert_score, name.lower()) 67 | breakdown_sorters = { 68 | 'name': sort_name, 69 | 'submissions': sort_submissions, 70 | 'comments': sort_comments, 71 | 'total_posts': sort_total_posts, 72 | } 73 | 74 | breakdown_names = list(breakdown_results.keys()) 75 | if args.sort is not None: 76 | try: 77 | sorter = breakdown_sorters[args.sort.lower()] 78 | except KeyError: 79 | message = '{sorter} is not a sorter. 
Choose from {options}' 80 | message = message.format(sorter=args.sort, options=list(breakdown_sorters.keys())) 81 | raise KeyError(message) 82 | breakdown_names.sort(key=sorter) 83 | dump = ' "{name}": {{"submissions": {submissions}, "comments": {comments}}}' 84 | dump = [dump.format(name=name, **breakdown_results[name]) for name in breakdown_names] 85 | dump = ',\n'.join(dump) 86 | dump = '{\n' + dump + '\n}\n' 87 | else: 88 | dump = json.dumps(breakdown_results) 89 | 90 | if args.sort is None: 91 | breakdown_basename = '%s_breakdown.json' 92 | else: 93 | breakdown_basename = '%%s_breakdown_%s.json' % args.sort 94 | 95 | breakdown_basename = breakdown_basename % database.filepath.replace_extension('').basename 96 | breakdown_filepath = database.breakdown_dir.with_child(breakdown_basename) 97 | breakdown_filepath.parent.makedirs(exist_ok=True) 98 | breakdown_file = breakdown_filepath.open('w') 99 | with breakdown_file: 100 | breakdown_file.write(dump) 101 | print('Wrote', breakdown_filepath.relative_path) 102 | 103 | return breakdown_results 104 | -------------------------------------------------------------------------------- /timesearch_modules/common.py: -------------------------------------------------------------------------------- 1 | import datetime 2 | import logging 3 | import os 4 | import time 5 | import traceback 6 | 7 | from voussoirkit import vlogging 8 | 9 | VERSION = '2020.09.06.0' 10 | 11 | try: 12 | import praw 13 | except ImportError: 14 | praw = None 15 | if praw is None or praw.__version__.startswith('3.'): 16 | import praw4 17 | praw = praw4 18 | 19 | try: 20 | import bot 21 | except ImportError: 22 | bot = None 23 | if bot is None or bot.praw != praw: 24 | try: 25 | import bot4 26 | bot = bot4 27 | except ImportError: 28 | message = '\n'.join([ 29 | 'Could not find your PRAW4 bot file as either `bot.py` or `bot4.py`.', 30 | 'Please see the README.md file for instructions on how to prepare it.' 31 | ]) 32 | raise ImportError(message) 33 | 34 | 35 | log = vlogging.get_logger(__name__) 36 | 37 | r = bot.anonymous() 38 | 39 | def assert_file_exists(filepath): 40 | if not os.path.exists(filepath): 41 | raise FileNotFoundError(filepath) 42 | 43 | def b36(i): 44 | if isinstance(i, int): 45 | return base36encode(i) 46 | return base36decode(i) 47 | 48 | def base36decode(number): 49 | return int(number, 36) 50 | 51 | def base36encode(number, alphabet='0123456789abcdefghijklmnopqrstuvwxyz'): 52 | """Converts an integer to a base36 string.""" 53 | if not isinstance(number, (int)): 54 | raise TypeError('number must be an integer') 55 | base36 = '' 56 | sign = '' 57 | if number < 0: 58 | sign = '-' 59 | number = -number 60 | if 0 <= number < len(alphabet): 61 | return sign + alphabet[number] 62 | while number != 0: 63 | number, i = divmod(number, len(alphabet)) 64 | base36 = alphabet[i] + base36 65 | return sign + base36 66 | 67 | def fetchgenerator(cursor): 68 | while True: 69 | item = cursor.fetchone() 70 | if item is None: 71 | break 72 | yield item 73 | 74 | def generator_chunker(generator, chunk_size): 75 | ''' 76 | Given an item generator, yield lists of length chunk_size, except maybe 77 | the last one. 
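
    Example (illustrative):
        list(generator_chunker(range(5), 2)) == [[0, 1], [2, 3], [4]]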
78 | ''' 79 | chunk = [] 80 | for item in generator: 81 | chunk.append(item) 82 | if len(chunk) == chunk_size: 83 | yield chunk 84 | chunk = [] 85 | if len(chunk) != 0: 86 | yield chunk 87 | 88 | def get_now(stamp=True): 89 | now = datetime.datetime.now(datetime.timezone.utc) 90 | if stamp: 91 | return int(now.timestamp()) 92 | return now 93 | 94 | def human(timestamp): 95 | x = datetime.datetime.utcfromtimestamp(timestamp) 96 | x = datetime.datetime.strftime(x, "%b %d %Y %H:%M:%S") 97 | return x 98 | 99 | def int_none(x): 100 | if x is None: 101 | return None 102 | return int(x) 103 | 104 | def is_xor(*args): 105 | ''' 106 | Return True if and only if one arg is truthy. 107 | ''' 108 | return [bool(a) for a in args].count(True) == 1 109 | 110 | def login(): 111 | global r 112 | log.debug('Logging in to reddit.') 113 | r = bot.login(r) 114 | 115 | def nofailrequest(function): 116 | ''' 117 | Creates a function that will retry until it succeeds. 118 | This function accepts 1 parameter, a function, and returns a modified 119 | version of that function that will try-catch, sleep, and loop until it 120 | finally returns. 121 | ''' 122 | def a(*args, **kwargs): 123 | while True: 124 | try: 125 | result = function(*args, **kwargs) 126 | return result 127 | except KeyboardInterrupt: 128 | raise 129 | except Exception: 130 | traceback.print_exc() 131 | print('Retrying in 2...') 132 | time.sleep(2) 133 | return a 134 | 135 | def split_any(text, delimiters): 136 | delimiters = list(delimiters) 137 | (splitter, replacers) = (delimiters[0], delimiters[1:]) 138 | for replacer in replacers: 139 | text = text.replace(replacer, splitter) 140 | return text.split(splitter) 141 | 142 | def subreddit_for_submission(submission_id): 143 | submission_id = t3_prefix(submission_id)[3:] 144 | submission = r.submission(submission_id) 145 | return submission.subreddit 146 | 147 | def t3_prefix(submission_id): 148 | if not submission_id.startswith('t3_'): 149 | submission_id = 't3_' + submission_id 150 | return submission_id 151 | -------------------------------------------------------------------------------- /timesearch_modules/exceptions.py: -------------------------------------------------------------------------------- 1 | class TimesearchException(Exception): 2 | ''' 3 | Base type for all of the Timesearch exceptions. 4 | Subtypes should have a class attribute `error_message`. The error message 5 | may contain {format} strings which will be formatted using the 6 | Exception's constructor arguments. 7 | ''' 8 | error_message = '' 9 | def __init__(self, *args, **kwargs): 10 | self.given_args = args 11 | self.given_kwargs = kwargs 12 | self.error_message = self.error_message.format(*args, **kwargs) 13 | self.args = (self.error_message, args, kwargs) 14 | 15 | def __str__(self): 16 | return self.error_message 17 | 18 | OUTOFDATE = ''' 19 | Database is out of date. {current} should be {new}. 20 | Please run utilities\\database_upgrader.py "{filepath.absolute_path}" 21 | '''.strip() 22 | class DatabaseOutOfDate(TimesearchException): 23 | ''' 24 | Raised by TSDB __init__ if the user's database is behind. 25 | ''' 26 | error_message = OUTOFDATE 27 | 28 | class DatabaseNotFound(TimesearchException, FileNotFoundError): 29 | error_message = 'Database file not found: "{}"' 30 | 31 | class NotExclusive(TimesearchException): 32 | ''' 33 | For when two or more mutually exclusive actions have been requested. 34 | ''' 35 | error_message = 'One and only one of {} must be passed.' 
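# Illustrative note (not part of the original module): each exception builds its
# message by calling error_message.format() on the constructor arguments, so
#     raise NotExclusive(['subreddit', 'username'])
# produces the message
#     One and only one of ['subreddit', 'username'] must be passed.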
36 | -------------------------------------------------------------------------------- /timesearch_modules/get_comments.py: -------------------------------------------------------------------------------- 1 | import traceback 2 | 3 | from . import common 4 | from . import exceptions 5 | from . import pushshift 6 | from . import tsdb 7 | 8 | def get_comments( 9 | subreddit=None, 10 | username=None, 11 | specific_submission=None, 12 | do_supplement=True, 13 | lower=None, 14 | upper=None, 15 | ): 16 | if not specific_submission and not common.is_xor(subreddit, username): 17 | raise exceptions.NotExclusive(['subreddit', 'username']) 18 | if username and specific_submission: 19 | raise exceptions.NotExclusive(['username', 'specific_submission']) 20 | 21 | common.login() 22 | 23 | if specific_submission: 24 | (database, subreddit) = tsdb.TSDB.for_submission(specific_submission, do_create=True, fix_name=True) 25 | specific_submission = common.t3_prefix(specific_submission)[3:] 26 | specific_submission = common.r.submission(specific_submission) 27 | database.insert(specific_submission) 28 | 29 | elif subreddit: 30 | (database, subreddit) = tsdb.TSDB.for_subreddit(subreddit, do_create=True, fix_name=True) 31 | 32 | else: 33 | (database, username) = tsdb.TSDB.for_user(username, do_create=True, fix_name=True) 34 | 35 | cur = database.sql.cursor() 36 | 37 | if lower is None: 38 | lower = 0 39 | if lower == 'update': 40 | query_latest = 'SELECT created FROM comments ORDER BY created DESC LIMIT 1' 41 | if subreddit: 42 | # Instead of blindly taking the highest timestamp currently in the db, 43 | # we must consider the case that the user has previously done a 44 | # specific_submission scan and now wants to do a general scan, which 45 | # would trick the latest timestamp into missing anything before that 46 | # specific submission. 47 | query = ''' 48 | SELECT created FROM comments WHERE NOT EXISTS ( 49 | SELECT 1 FROM submissions 50 | WHERE submissions.idstr == comments.submission 51 | AND submissions.augmented_at IS NOT NULL 52 | ) 53 | ORDER BY created DESC LIMIT 1 54 | ''' 55 | unaugmented = cur.execute(query).fetchone() 56 | if unaugmented: 57 | lower = unaugmented[0] - 1 58 | else: 59 | latest = cur.execute(query_latest).fetchone() 60 | if latest: 61 | lower = latest[0] - 1 62 | if username: 63 | latest = cur.execute(query_latest).fetchone() 64 | if latest: 65 | lower = latest[0] - 1 66 | if lower == 'update': 67 | lower = 0 68 | 69 | if specific_submission: 70 | comments = pushshift.get_comments_from_submission(specific_submission) 71 | elif subreddit: 72 | comments = pushshift.get_comments_from_subreddit(subreddit, lower=lower, upper=upper) 73 | elif username: 74 | comments = pushshift.get_comments_from_user(username, lower=lower, upper=upper) 75 | 76 | if do_supplement: 77 | comments = pushshift.supplement_reddit_data(comments, chunk_size=100) 78 | comments = common.generator_chunker(comments, 500) 79 | 80 | form = '{lower} ({lower_unix}) - {upper} ({upper_unix}) +{gain}' 81 | for chunk in comments: 82 | step = database.insert(chunk) 83 | message = form.format( 84 | lower=common.human(chunk[0].created_utc), 85 | upper=common.human(chunk[-1].created_utc), 86 | lower_unix=int(chunk[0].created_utc), 87 | upper_unix=int(chunk[-1].created_utc), 88 | gain=step['new_comments'], 89 | ) 90 | print(message) 91 | 92 | if specific_submission: 93 | query = ''' 94 | UPDATE submissions 95 | set augmented_at = ? 96 | WHERE idstr == ? 
97 | ''' 98 | bindings = [common.get_now(), specific_submission.fullname] 99 | cur.execute(query, bindings) 100 | database.sql.commit() 101 | 102 | def get_comments_argparse(args): 103 | return get_comments( 104 | subreddit=args.subreddit, 105 | username=args.username, 106 | #limit=common.int_none(args.limit), 107 | #threshold=common.int_none(args.threshold), 108 | #num_thresh=common.int_none(args.num_thresh), 109 | specific_submission=args.specific_submission, 110 | do_supplement=args.do_supplement, 111 | lower=args.lower, 112 | upper=args.upper, 113 | ) 114 | -------------------------------------------------------------------------------- /timesearch_modules/get_styles.py: -------------------------------------------------------------------------------- 1 | import os 2 | import requests 3 | 4 | from . import common 5 | from . import tsdb 6 | 7 | session = requests.Session() 8 | 9 | def get_styles(subreddit): 10 | (database, subreddit) = tsdb.TSDB.for_subreddit(subreddit, fix_name=True) 11 | 12 | print('Getting styles for /r/%s' % subreddit) 13 | subreddit = common.r.subreddit(subreddit) 14 | styles = subreddit.stylesheet() 15 | 16 | database.styles_dir.makedirs(exist_ok=True) 17 | 18 | stylesheet_filepath = database.styles_dir.with_child('stylesheet.css') 19 | print('Downloading %s' % stylesheet_filepath.relative_path) 20 | with stylesheet_filepath.open('w', encoding='utf-8') as stylesheet: 21 | stylesheet.write(styles.stylesheet) 22 | 23 | for image in styles.images: 24 | image_basename = image['name'] + '.' + image['url'].split('.')[-1] 25 | image_filepath = database.styles_dir.with_child(image_basename) 26 | print('Downloading %s' % image_filepath.relative_path) 27 | with image_filepath.open('wb') as image_file: 28 | response = session.get(image['url']) 29 | image_file.write(response.content) 30 | 31 | def get_styles_argparse(args): 32 | return get_styles(args.subreddit) 33 | -------------------------------------------------------------------------------- /timesearch_modules/get_submissions.py: -------------------------------------------------------------------------------- 1 | import time 2 | import traceback 3 | 4 | from . import common 5 | from . import exceptions 6 | from . import pushshift 7 | from . import tsdb 8 | 9 | def _normalize_subreddit(subreddit): 10 | if subreddit is None: 11 | pass 12 | elif isinstance(subreddit, str): 13 | subreddit = common.r.subreddit(subreddit) 14 | elif not isinstance(subreddit, common.praw.models.Subreddit): 15 | raise TypeError(type(subreddit)) 16 | return subreddit 17 | 18 | def _normalize_user(user): 19 | if user is None: 20 | pass 21 | elif isinstance(user, str): 22 | user = common.r.redditor(user) 23 | elif not isinstance(user, common.praw.models.Redditor): 24 | raise TypeError(type(user)) 25 | return user 26 | 27 | def get_submissions( 28 | subreddit=None, 29 | username=None, 30 | lower=None, 31 | upper=None, 32 | do_supplement=True, 33 | ): 34 | ''' 35 | Collect submissions across time. 36 | Please see the global DOCSTRING variable. 
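
    Illustrative call (subreddit name is a placeholder):
        get_submissions(subreddit='subredditname', lower='update')
    resumes the scan from the newest submission already stored in the database.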
37 | ''' 38 | if not common.is_xor(subreddit, username): 39 | raise exceptions.NotExclusive(['subreddit', 'username']) 40 | 41 | common.login() 42 | 43 | if subreddit: 44 | (database, subreddit) = tsdb.TSDB.for_subreddit(subreddit, fix_name=True) 45 | elif username: 46 | (database, username) = tsdb.TSDB.for_user(username, fix_name=True) 47 | cur = database.sql.cursor() 48 | 49 | subreddit = _normalize_subreddit(subreddit) 50 | user = _normalize_user(username) 51 | 52 | if lower == 'update': 53 | # Start from the latest submission 54 | cur.execute('SELECT created FROM submissions ORDER BY created DESC LIMIT 1') 55 | fetch = cur.fetchone() 56 | if fetch is not None: 57 | lower = fetch[0] 58 | else: 59 | lower = None 60 | if lower is None: 61 | lower = 0 62 | 63 | if username: 64 | submissions = pushshift.get_submissions_from_user(username, lower=lower, upper=upper) 65 | else: 66 | submissions = pushshift.get_submissions_from_subreddit(subreddit, lower=lower, upper=upper) 67 | 68 | if do_supplement: 69 | submissions = pushshift.supplement_reddit_data(submissions, chunk_size=100) 70 | submissions = common.generator_chunker(submissions, 200) 71 | 72 | form = '{lower} ({lower_unix}) - {upper} ({upper_unix}) +{gain}' 73 | for chunk in submissions: 74 | chunk.sort(key=lambda x: x.created_utc) 75 | step = database.insert(chunk) 76 | message = form.format( 77 | lower=common.human(chunk[0].created_utc), 78 | upper=common.human(chunk[-1].created_utc), 79 | lower_unix=int(chunk[0].created_utc), 80 | upper_unix=int(chunk[-1].created_utc), 81 | gain=step['new_submissions'], 82 | ) 83 | print(message) 84 | 85 | cur.execute('SELECT COUNT(idint) FROM submissions') 86 | itemcount = cur.fetchone()[0] 87 | 88 | print('Ended with %d items in %s' % (itemcount, database.filepath.basename)) 89 | 90 | def get_submissions_argparse(args): 91 | if args.lower == 'update': 92 | lower = 'update' 93 | else: 94 | lower = common.int_none(args.lower) 95 | 96 | return get_submissions( 97 | subreddit=args.subreddit, 98 | username=args.username, 99 | lower=lower, 100 | upper=common.int_none(args.upper), 101 | do_supplement=args.do_supplement, 102 | ) 103 | -------------------------------------------------------------------------------- /timesearch_modules/get_wiki.py: -------------------------------------------------------------------------------- 1 | import os 2 | import markdown 3 | 4 | from . import common 5 | from . 
import tsdb 6 | 7 | 8 | def get_wiki(subreddit): 9 | (database, subreddit) = tsdb.TSDB.for_subreddit(subreddit, fix_name=True) 10 | 11 | print('Getting wiki pages for /r/%s' % subreddit) 12 | subreddit = common.r.subreddit(subreddit) 13 | 14 | for wikipage in subreddit.wiki: 15 | if wikipage.name == 'config/stylesheet': 16 | continue 17 | 18 | wikipage_path = database.wiki_dir.join(wikipage.name).add_extension('md') 19 | wikipage_path.parent.makedirs(exist_ok=True) 20 | wikipage_path.write('w', wikipage.content_md, encoding='utf-8') 21 | print('Wrote', wikipage_path.relative_path) 22 | 23 | html_path = wikipage_path.replace_extension('html') 24 | escaped = wikipage.content_md.replace('<', '<').replace('>', '&rt;') 25 | html_path.write('w', markdown.markdown(escaped, output_format='html5'), encoding='utf-8') 26 | print('Wrote', html_path.relative_path) 27 | 28 | def get_wiki_argparse(args): 29 | return get_wiki(args.subreddit) 30 | -------------------------------------------------------------------------------- /timesearch_modules/index.py: -------------------------------------------------------------------------------- 1 | import datetime 2 | import os 3 | 4 | from . import common 5 | from . import exceptions 6 | from . import tsdb 7 | 8 | 9 | LINE_FORMAT_TXT = ''' 10 | {timestamp}: [{title}]({link}) - /u/{author} (+{score}) 11 | '''.replace('\n', '') 12 | 13 | LINE_FORMAT_HTML = ''' 14 |
{timestamp}: [{flairtext}] {title} - {author} (+{score})
15 | '''.replace('\n', '') 16 | 17 | TIMESTAMP_FORMAT = '%Y %b %d' 18 | # The time format. 19 | # "%Y %b %d" = "2016 August 10" 20 | # See http://strftime.org/ 21 | 22 | HTML_HEADER = ''' 23 | 24 | 25 | 26 | 32 | 33 | 34 | 35 | ''' 36 | 37 | HTML_FOOTER = ''' 38 | 39 | 40 | ''' 41 | 42 | 43 | def index( 44 | subreddit=None, 45 | username=None, 46 | do_all=False, 47 | do_date=False, 48 | do_title=False, 49 | do_score=False, 50 | do_author=False, 51 | do_subreddit=False, 52 | do_flair=False, 53 | html=False, 54 | offline=False, 55 | score_threshold=0, 56 | ): 57 | if not common.is_xor(subreddit, username): 58 | raise exceptions.NotExclusive(['subreddit', 'username']) 59 | 60 | if subreddit: 61 | database = tsdb.TSDB.for_subreddit(subreddit, do_create=False) 62 | else: 63 | database = tsdb.TSDB.for_user(username, do_create=False) 64 | 65 | kwargs = {'html': html, 'offline': offline, 'score_threshold': score_threshold} 66 | wrote = None 67 | 68 | if do_all or do_date: 69 | print('Writing time file') 70 | wrote = index_worker(database, suffix='_date', orderby='created ASC', **kwargs) 71 | 72 | if do_all or do_title: 73 | print('Writing title file') 74 | wrote = index_worker(database, suffix='_title', orderby='title ASC', **kwargs) 75 | 76 | if do_all or do_score: 77 | print('Writing score file') 78 | wrote = index_worker(database, suffix='_score', orderby='score DESC', **kwargs) 79 | 80 | if not username and (do_all or do_author): 81 | print('Writing author file') 82 | wrote = index_worker(database, suffix='_author', orderby='author ASC', **kwargs) 83 | 84 | if username and (do_all or do_subreddit): 85 | print('Writing subreddit file') 86 | wrote = index_worker(database, suffix='_subreddit', orderby='subreddit ASC', **kwargs) 87 | 88 | if do_all or do_flair: 89 | print('Writing flair file') 90 | # Items with flair come before items without. Each group is sorted by time separately. 91 | orderby = 'flair_text IS NULL ASC, created ASC' 92 | wrote = index_worker(database, suffix='_flair', orderby=orderby, **kwargs) 93 | 94 | if not wrote: 95 | raise Exception('No sorts selected! 
Read the docstring') 96 | print('Done.') 97 | 98 | def index_worker( 99 | database, 100 | suffix, 101 | orderby, 102 | score_threshold=0, 103 | html=False, 104 | offline=False, 105 | ): 106 | cur = database.sql.cursor() 107 | statement = 'SELECT * FROM submissions WHERE score >= {threshold} ORDER BY {order}' 108 | statement = statement.format(threshold=score_threshold, order=orderby) 109 | cur.execute(statement) 110 | 111 | database.index_dir.makedirs(exist_ok=True) 112 | 113 | extension = '.html' if html else '.txt' 114 | mash_basename = database.filepath.replace_extension('').basename 115 | mash_basename += suffix + extension 116 | mash_filepath = database.index_dir.with_child(mash_basename) 117 | 118 | mash_handle = mash_filepath.open('w', encoding='UTF-8') 119 | if html: 120 | mash_handle.write(HTML_HEADER) 121 | line_format = LINE_FORMAT_HTML 122 | else: 123 | line_format = LINE_FORMAT_TXT 124 | 125 | do_timestamp = '{timestamp}' in line_format 126 | 127 | for submission in common.fetchgenerator(cur): 128 | submission = tsdb.DBEntry(submission) 129 | 130 | if do_timestamp: 131 | timestamp = int(submission.created) 132 | timestamp = datetime.datetime.utcfromtimestamp(timestamp) 133 | timestamp = timestamp.strftime(TIMESTAMP_FORMAT) 134 | else: 135 | timestamp = '' 136 | 137 | if offline: 138 | link = f'../offline_reading/{submission.idstr}.html' 139 | else: 140 | link = f'https://redd.it/{submission.idstr[3:]}' 141 | 142 | author = submission.author 143 | if author.lower() == '[deleted]': 144 | author_link = '#' 145 | else: 146 | author_link = 'https://reddit.com/u/%s' % author 147 | 148 | line = line_format.format( 149 | author=author, 150 | authorlink=author_link, 151 | flaircss=submission.flair_css_class or '', 152 | flairtext=submission.flair_text or '', 153 | id=submission.idstr, 154 | numcomments=submission.num_comments, 155 | score=submission.score, 156 | link=link, 157 | subreddit=submission.subreddit, 158 | timestamp=timestamp, 159 | title=submission.title.replace('\n', ' '), 160 | url=submission.url or link, 161 | ) 162 | line += '\n' 163 | mash_handle.write(line) 164 | 165 | if html: 166 | mash_handle.write(HTML_FOOTER) 167 | mash_handle.close() 168 | print('Wrote', mash_filepath.relative_path) 169 | return mash_filepath 170 | 171 | def index_argparse(args): 172 | return index( 173 | subreddit=args.subreddit, 174 | username=args.username, 175 | do_all=args.do_all, 176 | do_date=args.do_date, 177 | do_title=args.do_title, 178 | do_score=args.do_score, 179 | do_author=args.do_author, 180 | do_subreddit=args.do_subreddit, 181 | do_flair=args.do_flair, 182 | html=args.html, 183 | offline=args.offline, 184 | score_threshold=common.int_none(args.score_threshold), 185 | ) 186 | -------------------------------------------------------------------------------- /timesearch_modules/ingest_jsonfile.py: -------------------------------------------------------------------------------- 1 | import json 2 | import time 3 | import traceback 4 | 5 | from voussoirkit import pathclass 6 | 7 | from . import common 8 | from . import exceptions 9 | from . import pushshift 10 | from . 
import tsdb 11 | 12 | def is_submission(obj): 13 | return ( 14 | obj.get('name', '').startswith('t3_') 15 | or obj.get('over_18') is not None 16 | ) 17 | 18 | def is_comment(obj): 19 | return ( 20 | obj.get('name', '').startswith('t1_') 21 | or obj.get('parent_id', '').startswith('t3_') 22 | or obj.get('link_id', '').startswith('t3_') 23 | ) 24 | 25 | def jsonfile_to_objects(filepath): 26 | filepath = pathclass.Path(filepath) 27 | filepath.assert_is_file() 28 | 29 | with filepath.open('r', encoding='utf-8') as handle: 30 | for line in handle: 31 | line = line.strip() 32 | if not line: 33 | break 34 | obj = json.loads(line) 35 | if is_submission(obj): 36 | yield pushshift.DummySubmission(**obj) 37 | elif is_comment(obj): 38 | yield pushshift.DummyComment(**obj) 39 | else: 40 | raise ValueError(f'Could not recognize object type {obj}.') 41 | 42 | def ingest_jsonfile( 43 | filepath, 44 | subreddit=None, 45 | username=None, 46 | ): 47 | if not common.is_xor(subreddit, username): 48 | raise exceptions.NotExclusive(['subreddit', 'username']) 49 | 50 | if subreddit: 51 | (database, subreddit) = tsdb.TSDB.for_subreddit(subreddit, fix_name=True) 52 | elif username: 53 | (database, username) = tsdb.TSDB.for_user(username, fix_name=True) 54 | cur = database.sql.cursor() 55 | 56 | objects = jsonfile_to_objects(filepath) 57 | database.insert(objects) 58 | 59 | cur.execute('SELECT COUNT(idint) FROM submissions') 60 | submissioncount = cur.fetchone()[0] 61 | cur.execute('SELECT COUNT(idint) FROM comments') 62 | commentcount = cur.fetchone()[0] 63 | 64 | print('Ended with %d submissions and %d comments in %s' % (submissioncount, commentcount, database.filepath.basename)) 65 | 66 | def ingest_jsonfile_argparse(args): 67 | return ingest_jsonfile( 68 | subreddit=args.subreddit, 69 | username=args.username, 70 | filepath=args.json_file, 71 | ) 72 | -------------------------------------------------------------------------------- /timesearch_modules/livestream.py: -------------------------------------------------------------------------------- 1 | import copy 2 | import prawcore 3 | import time 4 | import traceback 5 | 6 | from . import common 7 | from . import exceptions 8 | from . import tsdb 9 | 10 | from voussoirkit import vlogging 11 | 12 | log = vlogging.get_logger(__name__) 13 | 14 | def _listify(x): 15 | ''' 16 | The user may have given us a string containing multiple subreddits / users. 17 | Try to split that up into a list of names. 18 | ''' 19 | if not x: 20 | return [] 21 | if isinstance(x, str): 22 | return common.split_any(x, ['+', ' ', ',']) 23 | return x 24 | 25 | def generator_printer(generator): 26 | ''' 27 | Given a generator that produces livestream update steps, print them out. 28 | This yields None because print returns None. 
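
    Each status line looks roughly like this (illustrative values):
        Jun 25 2023 12:00:00 subredditname.db: +2s, 14c
    meaning two new submissions and fourteen new comments were inserted.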
29 | ''' 30 | prev_message_length = 0 31 | for step in generator: 32 | newtext = '%s: +%ds, %dc' % (step['tsdb'].filepath.basename, step['new_submissions'], step['new_comments']) 33 | totalnew = step['new_submissions'] + step['new_comments'] 34 | status = '{now} {new}'.format(now=common.human(common.get_now()), new=newtext) 35 | clear_prev = (' ' * prev_message_length) + '\r' 36 | print(clear_prev + status, end='') 37 | prev_message_length = len(status) 38 | if totalnew == 0 and log.level == 0 or log.level > vlogging.DEBUG: 39 | # Since there were no news, allow the next line to overwrite status 40 | print('\r', end='', flush=True) 41 | else: 42 | print() 43 | yield None 44 | 45 | def cycle_generators(generators, only_once, sleepy): 46 | ''' 47 | Given multiple generators, yield an item from each one, cycling through 48 | them in a round-robin fashion. 49 | 50 | This is useful if you want to convert multiple livestream generators into a 51 | single generator that take turns updating each of them and yields all of 52 | their items. 53 | ''' 54 | while True: 55 | for generator in generators: 56 | yield next(generator) 57 | if only_once: 58 | break 59 | time.sleep(sleepy) 60 | 61 | def livestream( 62 | subreddit=None, 63 | username=None, 64 | as_a_generator=False, 65 | do_submissions=True, 66 | do_comments=True, 67 | limit=100, 68 | only_once=False, 69 | sleepy=30, 70 | ): 71 | ''' 72 | Continuously get posts from this source and insert them into the database. 73 | 74 | as_a_generator: 75 | Return a generator where every iteration does a single livestream loop 76 | and yields the return value of TSDB.insert (A summary of new 77 | submission & comment count). 78 | This is useful if you want to manage the generator yourself. 79 | Otherwise, this function will run the generator forever. 
80 | ''' 81 | subreddits = _listify(subreddit) 82 | usernames = _listify(username) 83 | kwargs = { 84 | 'do_submissions': do_submissions, 85 | 'do_comments': do_comments, 86 | 'limit': limit, 87 | 'params': {'show': 'all'}, 88 | } 89 | 90 | subreddit_generators = [ 91 | _livestream_as_a_generator(subreddit=subreddit, username=None, **kwargs) for subreddit in subreddits 92 | ] 93 | user_generators = [ 94 | _livestream_as_a_generator(subreddit=None, username=username, **kwargs) for username in usernames 95 | ] 96 | generators = subreddit_generators + user_generators 97 | 98 | if as_a_generator: 99 | if len(generators) == 1: 100 | return generators[0] 101 | return generators 102 | 103 | generator = cycle_generators(generators, only_once=only_once, sleepy=sleepy) 104 | generator = generator_printer(generator) 105 | 106 | try: 107 | for step in generator: 108 | pass 109 | except KeyboardInterrupt: 110 | print() 111 | return 112 | 113 | hangman = lambda: livestream( 114 | username='gallowboob', 115 | do_submissions=True, 116 | do_comments=True, 117 | sleepy=60, 118 | ) 119 | 120 | def _livestream_as_a_generator( 121 | subreddit, 122 | username, 123 | do_submissions, 124 | do_comments, 125 | limit, 126 | params, 127 | ): 128 | 129 | if not common.is_xor(subreddit, username): 130 | raise exceptions.NotExclusive(['subreddit', 'username']) 131 | 132 | if not any([do_submissions, do_comments]): 133 | raise TypeError('Required do_submissions and/or do_comments parameter') 134 | common.login() 135 | 136 | if subreddit: 137 | log.debug('Getting subreddit %s', subreddit) 138 | (database, subreddit) = tsdb.TSDB.for_subreddit(subreddit, fix_name=True) 139 | subreddit = common.r.subreddit(subreddit) 140 | submission_function = subreddit.new if do_submissions else None 141 | comment_function = subreddit.comments if do_comments else None 142 | else: 143 | log.debug('Getting redditor %s', username) 144 | (database, username) = tsdb.TSDB.for_user(username, fix_name=True) 145 | user = common.r.redditor(username) 146 | submission_function = user.submissions.new if do_submissions else None 147 | comment_function = user.comments.new if do_comments else None 148 | 149 | while True: 150 | try: 151 | items = _livestream_helper( 152 | submission_function=submission_function, 153 | comment_function=comment_function, 154 | limit=limit, 155 | params=params, 156 | ) 157 | newitems = database.insert(items) 158 | yield newitems 159 | except prawcore.exceptions.NotFound: 160 | print(database.filepath.basename, '404 not found') 161 | step = {'tsdb': database, 'new_comments': 0, 'new_submissions': 0} 162 | yield step 163 | except Exception: 164 | traceback.print_exc() 165 | print('Retrying...') 166 | step = {'tsdb': database, 'new_comments': 0, 'new_submissions': 0} 167 | yield step 168 | 169 | def _livestream_helper( 170 | submission_function=None, 171 | comment_function=None, 172 | *args, 173 | **kwargs, 174 | ): 175 | ''' 176 | Given a submission-retrieving function and/or a comment-retrieving function, 177 | collect submissions and comments in a list together and return that. 178 | 179 | args and kwargs go into the collecting functions. 
180 | ''' 181 | if not any([submission_function, comment_function]): 182 | raise TypeError('Required submissions and/or comments parameter') 183 | results = [] 184 | 185 | if submission_function: 186 | log.debug('Getting submissions %s %s', args, kwargs) 187 | this_kwargs = copy.deepcopy(kwargs) 188 | submission_batch = submission_function(*args, **this_kwargs) 189 | results.extend(submission_batch) 190 | if comment_function: 191 | log.debug('Getting comments %s %s', args, kwargs) 192 | this_kwargs = copy.deepcopy(kwargs) 193 | comment_batch = comment_function(*args, **this_kwargs) 194 | results.extend(comment_batch) 195 | log.debug('Got %d posts', len(results)) 196 | return results 197 | 198 | def livestream_argparse(args): 199 | if args.submissions is args.comments is False: 200 | args.submissions = True 201 | args.comments = True 202 | if args.limit is None: 203 | limit = 100 204 | else: 205 | limit = int(args.limit) 206 | 207 | return livestream( 208 | subreddit=args.subreddit, 209 | username=args.username, 210 | do_comments=args.comments, 211 | do_submissions=args.submissions, 212 | limit=limit, 213 | only_once=args.once, 214 | sleepy=int(args.sleepy), 215 | ) 216 | -------------------------------------------------------------------------------- /timesearch_modules/merge_db.py: -------------------------------------------------------------------------------- 1 | import os 2 | 3 | from . import common 4 | from . import tsdb 5 | 6 | 7 | MIGRATE_QUERY = ''' 8 | INSERT INTO {tablename} 9 | SELECT othertable.* FROM other.{tablename} othertable 10 | LEFT JOIN {tablename} mytable ON mytable.idint == othertable.idint 11 | WHERE mytable.idint IS NULL; 12 | ''' 13 | 14 | def _migrate_helper(db, tablename): 15 | query = MIGRATE_QUERY.format(tablename=tablename) 16 | print(query) 17 | 18 | oldcount = db.cur.execute('SELECT count(*) FROM %s' % tablename).fetchone()[0] 19 | db.cur.execute(query) 20 | db.sql.commit() 21 | 22 | newcount = db.cur.execute('SELECT count(*) FROM %s' % tablename).fetchone()[0] 23 | print('Gained %d items.' % (newcount - oldcount)) 24 | 25 | def merge_db(from_db_path, to_db_path): 26 | to_db = tsdb.TSDB(to_db_path) 27 | from_db = tsdb.TSDB(from_db_path) 28 | 29 | to_db.cur.execute('ATTACH DATABASE "%s" AS other' % from_db_path) 30 | _migrate_helper(to_db, 'submissions') 31 | _migrate_helper(to_db, 'comments') 32 | 33 | def merge_db_argparse(args): 34 | return merge_db(args.from_db_path, args.to_db_path) 35 | -------------------------------------------------------------------------------- /timesearch_modules/offline_reading.py: -------------------------------------------------------------------------------- 1 | import os 2 | import markdown 3 | 4 | from . import common 5 | from . import exceptions 6 | from . import tsdb 7 | 8 | 9 | HTML_HEADER = ''' 10 | 11 | 12 | {title} 13 | 14 | 15 | 16 | 37 | 38 | 39 | '''.strip() 40 | 41 | HTML_FOOTER = ''' 42 | 43 | 44 | 61 | 62 | '''.strip() 63 | 64 | HTML_COMMENT = ''' 65 |
66 | 67 | [-] 71 | 72 | {usernamelink} 73 | | 74 | {score} points 75 | | 76 | {human} 78 | 79 | {body} 80 | {{children}} 81 | 82 | 83 | '''.strip() 84 | 85 | HTML_SUBMISSION = ''' 86 | 87 | 88 | {usernamelink} 89 | | 90 | {score} points 91 | | 92 | {human} 94 | {title} 95 | {url_or_text} 96 |
97 | {{children}} 98 | '''.strip() 99 | 100 | 101 | class TreeNode: 102 | def __init__(self, identifier, data, parent=None): 103 | assert isinstance(identifier, str) 104 | assert '\\' not in identifier 105 | self.identifier = identifier 106 | self.data = data 107 | self.parent = parent 108 | self.children = {} 109 | 110 | def __getitem__(self, key): 111 | return self.children[key] 112 | 113 | def __repr__(self): 114 | return 'TreeNode %s' % self.abspath() 115 | 116 | def abspath(self): 117 | node = self 118 | nodes = [node] 119 | while node.parent is not None: 120 | node = node.parent 121 | nodes.append(node) 122 | nodes.reverse() 123 | nodes = [node.identifier for node in nodes] 124 | return '\\'.join(nodes) 125 | 126 | def add_child(self, other_node, overwrite_parent=False): 127 | self.check_child_availability(other_node.identifier) 128 | if other_node.parent is not None and not overwrite_parent: 129 | raise ValueError('That node already has a parent. Try `overwrite_parent=True`') 130 | 131 | other_node.parent = self 132 | self.children[other_node.identifier] = other_node 133 | return other_node 134 | 135 | def check_child_availability(self, identifier): 136 | if ':' in identifier: 137 | raise Exception('Only roots may have a colon') 138 | if identifier in self.children: 139 | raise Exception('Node %s already has child %s' % (self.identifier, identifier)) 140 | 141 | def detach(self): 142 | del self.parent.children[self.identifier] 143 | self.parent = None 144 | 145 | def listnodes(self, customsort=None): 146 | items = list(self.children.items()) 147 | if customsort is None: 148 | items.sort(key=lambda x: x[0].lower()) 149 | else: 150 | items.sort(key=customsort) 151 | return [item[1] for item in items] 152 | 153 | def merge_other(self, othertree, otherroot=None): 154 | newroot = None 155 | if ':' in othertree.identifier: 156 | if otherroot is None: 157 | raise Exception('Must specify a new name for the other tree\'s root') 158 | else: 159 | newroot = otherroot 160 | else: 161 | newroot = othertree.identifier 162 | othertree.identifier = newroot 163 | othertree.parent = self 164 | self.check_child_availability(newroot) 165 | self.children[newroot] = othertree 166 | 167 | def printtree(self, customsort=None): 168 | for node in self.walk(customsort): 169 | print(node.abspath()) 170 | 171 | def walk(self, customsort=None): 172 | yield self 173 | for child in self.listnodes(customsort=customsort): 174 | #print(child) 175 | #print(child.listnodes()) 176 | yield from child.walk(customsort=customsort) 177 | 178 | def html_format_comment(comment): 179 | text = HTML_COMMENT.format( 180 | id=comment.idstr, 181 | body=sanitize_braces(render_markdown(comment.body)), 182 | usernamelink=html_helper_userlink(comment), 183 | score=comment.score, 184 | human=common.human(comment.created), 185 | permalink=html_helper_permalink(comment), 186 | ) 187 | return text 188 | 189 | def html_format_submission(submission): 190 | text = HTML_SUBMISSION.format( 191 | id=submission.idstr, 192 | title=sanitize_braces(submission.title), 193 | usernamelink=html_helper_userlink(submission), 194 | score=submission.score, 195 | human=common.human(submission.created), 196 | permalink=html_helper_permalink(submission), 197 | url_or_text=html_helper_urlortext(submission), 198 | ) 199 | return text 200 | 201 | def html_from_database(database, specific_submission=None): 202 | ''' 203 | Given a timesearch database, produce html pages for each 204 | of the submissions it contains (or one particular submission fullname) 205 | ''' 
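    # One (identifier, html) pair is yielded per stored submission; the caller
    # (offline_reading, further down) writes each pair out as
    # offline_reading/<submission id>.html next to the database.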
206 | if markdown is None: 207 | raise ImportError('Page cannot be rendered without the markdown module') 208 | 209 | submission_trees = trees_from_database(database, specific_submission) 210 | for submission_tree in submission_trees: 211 | page = html_from_tree(submission_tree, sort=lambda x: x.data.score * -1) 212 | database.offline_reading_dir.makedirs(exist_ok=True) 213 | 214 | html = '' 215 | 216 | header = HTML_HEADER.format(title=submission_tree.data.title) 217 | html += header 218 | 219 | html += page 220 | 221 | html += HTML_FOOTER 222 | yield (submission_tree.identifier, html) 223 | 224 | def html_from_tree(tree, sort=None): 225 | ''' 226 | Given a tree *whose root is the submission*, return 227 | HTML-formatted text representing each submission's comment page. 228 | ''' 229 | if tree.data.object_type == 'submission': 230 | page = html_format_submission(tree.data) 231 | elif tree.data.object_type == 'comment': 232 | page = html_format_comment(tree.data) 233 | children = tree.listnodes() 234 | if sort is not None: 235 | children.sort(key=sort) 236 | children = [html_from_tree(child, sort) for child in children] 237 | if len(children) == 0: 238 | children = '' 239 | else: 240 | children = '\n\n'.join(children) 241 | try: 242 | page = page.format(children=children) 243 | except IndexError: 244 | print(page) 245 | raise 246 | return page 247 | 248 | def html_helper_permalink(item): 249 | ''' 250 | Given a submission or a comment, return the URL for its permalink. 251 | ''' 252 | link = 'https://old.reddit.com/r/%s/comments/' % item.subreddit 253 | if item.object_type == 'submission': 254 | link += item.idstr[3:] 255 | elif item.object_type == 'comment': 256 | link += '%s/_/%s' % (item.submission[3:], item.idstr[3:]) 257 | return link 258 | 259 | def html_helper_urlortext(submission): 260 | ''' 261 | Given a submission, return either an tag for its url, or its 262 | markdown-rendered selftext. 263 | ''' 264 | if submission.url: 265 | text = '{url}'.format(url=submission.url) 266 | elif submission.selftext: 267 | text = render_markdown(submission.selftext) 268 | else: 269 | text = '' 270 | text = sanitize_braces(text) 271 | return text 272 | 273 | def html_helper_userlink(item): 274 | ''' 275 | Given a submission or comment, return an tag for its author, or [deleted]. 276 | ''' 277 | name = item.author 278 | if name.lower() == '[deleted]': 279 | return '[deleted]' 280 | link = 'https://old.reddit.com/u/{name}' 281 | link = '{name}' % link 282 | link = link.format(name=name) 283 | return link 284 | 285 | def render_markdown(text): 286 | # I was going to use html.escape, but then it turns html entities like 287 | #   into &nbsp; which doesn't work. 288 | # So I only want to escape the brackets. 289 | escaped = text.replace('<', '<').replace('>', '&rt;') 290 | text = markdown.markdown(escaped, output_format='html5') 291 | return text 292 | 293 | def sanitize_braces(text): 294 | text = text.replace('{', '{{') 295 | text = text.replace('}', '}}') 296 | return text 297 | 298 | def trees_from_database(database, specific_submission=None): 299 | ''' 300 | Given a timesearch database, take all of the submission 301 | ids, take all of the comments for each submission id, and run them 302 | through `tree_from_submission`. 303 | 304 | Yield each submission's tree as it is generated. 
305 | ''' 306 | cur1 = database.sql.cursor() 307 | cur2 = database.sql.cursor() 308 | 309 | if specific_submission is None: 310 | cur1.execute('SELECT idstr FROM submissions ORDER BY created ASC') 311 | submission_ids = common.fetchgenerator(cur1) 312 | # sql always returns rows as tuples, even when selecting one column. 313 | submission_ids = (x[0] for x in submission_ids) 314 | else: 315 | specific_submission = common.t3_prefix(specific_submission) 316 | submission_ids = [specific_submission] 317 | 318 | found_some_posts = False 319 | for submission_id in submission_ids: 320 | found_some_posts = True 321 | cur2.execute('SELECT * FROM submissions WHERE idstr == ?', [submission_id]) 322 | submission = cur2.fetchone() 323 | cur2.execute('SELECT * FROM comments WHERE submission == ?', [submission_id]) 324 | fetched_comments = cur2.fetchall() 325 | submission_tree = tree_from_submission(submission, fetched_comments) 326 | yield submission_tree 327 | 328 | if not found_some_posts: 329 | raise Exception('Found no submissions!') 330 | 331 | def tree_from_submission(submission_dbrow, comments_dbrows): 332 | ''' 333 | Given the sqlite data for a submission and all of its comments, 334 | return a tree with the submission id as the root 335 | ''' 336 | submission = tsdb.DBEntry(submission_dbrow) 337 | comments = [tsdb.DBEntry(c) for c in comments_dbrows] 338 | comments.sort(key=lambda x: x.created) 339 | 340 | print('Building tree for %s (%d comments)' % (submission.idstr, len(comments))) 341 | # Thanks Martin Schmidt for the algorithm 342 | # http://stackoverflow.com/a/29942118/5430534 343 | tree = TreeNode(identifier=submission.idstr, data=submission) 344 | node_map = {} 345 | 346 | for comment in comments: 347 | # Ensure this comment is in a node of its own 348 | this_node = node_map.get(comment.idstr, None) 349 | if this_node: 350 | # This ID was detected as a parent of a previous iteration 351 | # Now we're actually filling it in. 352 | this_node.data = comment 353 | else: 354 | this_node = TreeNode(comment.idstr, comment) 355 | node_map[comment.idstr] = this_node 356 | 357 | # Attach this node to the parent. 
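        # Top-level comments have the submission itself (a t3_ id) as their
        # parent and attach directly to the root node. Replies point at a t1_
        # comment id; if that parent has not been seen yet (or was never
        # archived), a placeholder TreeNode with data=None is created so the
        # reply still has somewhere to hang, and the placeholder's data is
        # filled in if the parent comment shows up later in this loop.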
358 | if comment.parent.startswith('t3_'): 359 | tree.add_child(this_node) 360 | else: 361 | parent_node = node_map.get(comment.parent, None) 362 | if not parent_node: 363 | parent_node = TreeNode(comment.parent, data=None) 364 | node_map[comment.parent] = parent_node 365 | parent_node.add_child(this_node) 366 | this_node.parent = parent_node 367 | return tree 368 | 369 | def offline_reading(subreddit=None, username=None, specific_submission=None): 370 | if not specific_submission and not common.is_xor(subreddit, username): 371 | raise exceptions.NotExclusive(['subreddit', 'username']) 372 | 373 | if specific_submission and not username and not subreddit: 374 | database = tsdb.TSDB.for_submission(specific_submission, do_create=False) 375 | 376 | elif subreddit: 377 | database = tsdb.TSDB.for_subreddit(subreddit, do_create=False) 378 | 379 | else: 380 | database = tsdb.TSDB.for_user(username, do_create=False) 381 | 382 | htmls = html_from_database(database, specific_submission=specific_submission) 383 | 384 | for (id, html) in htmls: 385 | html_basename = '%s.html' % id 386 | html_filepath = database.offline_reading_dir.with_child(html_basename) 387 | html_handle = html_filepath.open('w', encoding='utf-8') 388 | html_handle.write(html) 389 | html_handle.close() 390 | print('Wrote', html_filepath.relative_path) 391 | 392 | def offline_reading_argparse(args): 393 | return offline_reading( 394 | subreddit=args.subreddit, 395 | username=args.username, 396 | specific_submission=args.specific_submission, 397 | ) 398 | -------------------------------------------------------------------------------- /timesearch_modules/pushshift.py: -------------------------------------------------------------------------------- 1 | ''' 2 | On January 29, 2018, reddit announced the death of the ?timestamp cloudsearch 3 | parameter for submissions. RIP. 4 | https://www.reddit.com/r/changelog/comments/7tus5f/update_to_search_api/dtfcdn0 5 | 6 | This module interfaces with api.pushshift.io to restore this functionality. 7 | It also provides new features previously impossible through reddit alone, such 8 | as scanning all of a user's comments. 9 | ''' 10 | import html 11 | import requests 12 | import time 13 | import traceback 14 | 15 | from . import common 16 | 17 | from voussoirkit import ratelimiter 18 | from voussoirkit import vlogging 19 | 20 | log = vlogging.get_logger(__name__) 21 | 22 | print('Thank you Jason Baumgartner of Pushshift.io!') 23 | 24 | USERAGENT = 'Timesearch ({version}) ({contact})' 25 | API_URL = 'https://api.pushshift.io/reddit/' 26 | 27 | DEFAULT_PARAMS = { 28 | 'size': 1000, 29 | 'order': 'asc', 30 | 'sort': 'created_utc', 31 | } 32 | 33 | # Pushshift does not supply attributes that are null. So we fill them back in. 34 | FALLBACK_ATTRIBUTES = { 35 | 'distinguished': None, 36 | 'edited': False, 37 | 'link_flair_css_class': None, 38 | 'link_flair_text': None, 39 | 'score': 0, 40 | 'selftext': '', 41 | } 42 | 43 | contact_info_message = ''' 44 | Please add a CONTACT_INFO string variable to your bot.py file. 45 | This will be added to your pushshift useragent. 
46 | '''.strip() 47 | if not getattr(common.bot, 'CONTACT_INFO', ''): 48 | raise ValueError(contact_info_message) 49 | 50 | useragent = USERAGENT.format(version=common.VERSION, contact=common.bot.CONTACT_INFO) 51 | ratelimit = None 52 | session = requests.Session() 53 | session.headers.update({'User-Agent': useragent}) 54 | ratelimit = ratelimiter.Ratelimiter(allowance=120, period=60) 55 | 56 | class DummyObject: 57 | ''' 58 | These classes are used to convert the JSON data we get from pushshift into 59 | objects so that the rest of timesearch can operate transparently. 60 | This requires a bit of whack-a-mole including: 61 | - Fleshing out the attributes which PS did not include because they were 62 | null (we use FALLBACK_ATTRIBUTES to replace them). 63 | - Providing the convenience methods and @properties that PRAW provides. 64 | - Mimicking the rich attributes like author and subreddit. 65 | ''' 66 | def __init__(self, **attributes): 67 | for (key, val) in attributes.items(): 68 | if key == 'author': 69 | val = DummyObject(name=val) 70 | elif key == 'subreddit': 71 | val = DummyObject(display_name=val) 72 | elif key in ['body', 'selftext']: 73 | val = html.unescape(val) 74 | elif key == 'parent_id': 75 | if val is None: 76 | val = attributes['link_id'] 77 | elif isinstance(val, int): 78 | val = 't1_' + common.b36(val) 79 | 80 | setattr(self, key, val) 81 | 82 | for (key, val) in FALLBACK_ATTRIBUTES.items(): 83 | if not hasattr(self, key): 84 | setattr(self, key, val) 85 | 86 | # In rare cases, things sometimes don't have a subreddit. 87 | # Promo posts seem to be one example. 88 | FALLBACK_ATTRIBUTES['subreddit'] = DummyObject(display_name=None) 89 | 90 | class DummySubmission(DummyObject): 91 | @property 92 | def fullname(self): 93 | return 't3_' + self.id 94 | 95 | class DummyComment(DummyObject): 96 | @property 97 | def fullname(self): 98 | return 't1_' + self.id 99 | 100 | 101 | def _normalize_subreddit(subreddit): 102 | if isinstance(subreddit, str): 103 | return subreddit 104 | else: 105 | return subreddit.display_name 106 | 107 | def _normalize_user(user): 108 | if isinstance(user, str): 109 | return user 110 | else: 111 | return user.name 112 | 113 | def _pagination_core(url, params, dummy_type, lower=None, upper=None): 114 | if upper is not None: 115 | params['before'] = upper 116 | if lower is not None: 117 | params['after'] = lower 118 | 119 | setify = lambda items: set(item['id'] for item in items) 120 | prev_batch_ids = set() 121 | 122 | while True: 123 | for retry in range(5): 124 | try: 125 | batch = get(url, params) 126 | except requests.exceptions.HTTPError as exc: 127 | traceback.print_exc() 128 | print('Retrying in 5...') 129 | time.sleep(5) 130 | else: 131 | break 132 | 133 | log.debug('Got batch of %d items.', len(batch)) 134 | batch_ids = setify(batch) 135 | if len(batch_ids) == 0 or batch_ids.issubset(prev_batch_ids): 136 | break 137 | submissions = [dummy_type(**x) for x in batch if x['id'] not in prev_batch_ids] 138 | submissions.sort(key=lambda x: x.created_utc) 139 | # Take the latest-1 to avoid the lightning strike chance that two posts 140 | # have the same timestamp and this occurs at a page boundary. 141 | # Since ?after=latest would cause us to miss that second one. 
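        # For example: if the newest item in this batch has
        # created_utc == 1600000000, the next request is made with
        # after=1599999999, so another item sharing that exact timestamp on the
        # other side of the page boundary is fetched again instead of skipped;
        # prev_batch_ids then filters the repeats back out.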
142 | params['after'] = submissions[-1].created_utc - 1 143 | yield from submissions 144 | 145 | prev_batch_ids = batch_ids 146 | ratelimit.limit() 147 | 148 | def get(url, params=None): 149 | if not url.startswith('https://'): 150 | url = API_URL + url.lstrip('/') 151 | 152 | if params is None: 153 | params = {} 154 | 155 | for (key, val) in DEFAULT_PARAMS.items(): 156 | params.setdefault(key, val) 157 | 158 | log.debug('Requesting %s with %s', url, params) 159 | ratelimit.limit() 160 | response = session.get(url, params=params) 161 | response.raise_for_status() 162 | response = response.json() 163 | data = response['data'] 164 | return data 165 | 166 | def get_comments_from_submission(submission): 167 | if isinstance(submission, str): 168 | submission_id = common.t3_prefix(submission)[3:] 169 | else: 170 | submission_id = submission.id 171 | 172 | params = {'link_id': submission_id} 173 | comments = _pagination_core( 174 | url='comment/search/', 175 | params=params, 176 | dummy_type=DummyComment, 177 | ) 178 | yield from comments 179 | 180 | def get_comments_from_subreddit(subreddit, **kwargs): 181 | subreddit = _normalize_subreddit(subreddit) 182 | params = {'subreddit': subreddit} 183 | comments = _pagination_core( 184 | url='comment/search/', 185 | params=params, 186 | dummy_type=DummyComment, 187 | **kwargs 188 | ) 189 | yield from comments 190 | 191 | def get_comments_from_user(user, **kwargs): 192 | user = _normalize_user(user) 193 | params = {'author': user} 194 | comments = _pagination_core( 195 | url='comment/search/', 196 | params=params, 197 | dummy_type=DummyComment, 198 | **kwargs 199 | ) 200 | yield from comments 201 | 202 | def get_submissions_from_subreddit(subreddit, **kwargs): 203 | subreddit = _normalize_subreddit(subreddit) 204 | params = {'subreddit': subreddit} 205 | submissions = _pagination_core( 206 | url='submission/search/', 207 | params=params, 208 | dummy_type=DummySubmission, 209 | **kwargs 210 | ) 211 | yield from submissions 212 | 213 | def get_submissions_from_user(user, **kwargs): 214 | user = _normalize_user(user) 215 | params = {'author': user} 216 | submissions = _pagination_core( 217 | url='submission/search/', 218 | params=params, 219 | dummy_type=DummySubmission, 220 | **kwargs 221 | ) 222 | yield from submissions 223 | 224 | def supplement_reddit_data(dummies, chunk_size=100): 225 | ''' 226 | Given an iterable of the Dummy Pushshift objects, yield them back and also 227 | yield the live Reddit objects they refer to according to reddit's /api/info. 228 | The live object will always come after the corresponding dummy object. 229 | By doing this, we enjoy the strengths of both data sources: Pushshift 230 | will give us deleted or removed objects that reddit would not, and reddit 231 | gives us up-to-date scores and text bodies. 
232 | ''' 233 | chunks = common.generator_chunker(dummies, chunk_size) 234 | for chunk in chunks: 235 | log.debug('Supplementing %d items with live reddit data.', len(chunk)) 236 | ids = [item.fullname for item in chunk] 237 | live_copies = list(common.r.info(ids)) 238 | live_copies = {item.fullname: item for item in live_copies} 239 | for item in chunk: 240 | yield item 241 | live_item = live_copies.get(item.fullname, None) 242 | if live_item: 243 | yield live_item 244 | -------------------------------------------------------------------------------- /timesearch_modules/tsdb.py: -------------------------------------------------------------------------------- 1 | import os 2 | import sqlite3 3 | import time 4 | import types 5 | 6 | from . import common 7 | from . import exceptions 8 | from . import pushshift 9 | 10 | from voussoirkit import pathclass 11 | from voussoirkit import sqlhelpers 12 | from voussoirkit import vlogging 13 | 14 | log = vlogging.get_logger(__name__) 15 | 16 | # For backwards compatibility reasons, this list of format strings will help 17 | # timesearch find databases that are using the old filename style. 18 | # The final element will be used if none of the previous ones were found. 19 | DB_FORMATS_SUBREDDIT = [ 20 | '.\\{name}.db', 21 | '.\\subreddits\\{name}\\{name}.db', 22 | '.\\{name}\\{name}.db', 23 | '.\\databases\\{name}.db', 24 | '.\\subreddits\\{name}\\{name}.db', 25 | ] 26 | DB_FORMATS_USER = [ 27 | '.\\@{name}.db', 28 | '.\\users\\@{name}\\@{name}.db', 29 | '.\\@{name}\\@{name}.db', 30 | '.\\databases\\@{name}.db', 31 | '.\\users\\@{name}\\@{name}.db', 32 | ] 33 | 34 | DATABASE_VERSION = 2 35 | DB_VERSION_PRAGMA = f''' 36 | PRAGMA user_version = {DATABASE_VERSION}; 37 | ''' 38 | 39 | DB_PRAGMAS = f''' 40 | ''' 41 | 42 | DB_INIT = f''' 43 | {DB_PRAGMAS} 44 | {DB_VERSION_PRAGMA} 45 | ---------------------------------------------------------------------------------------------------- 46 | CREATE TABLE IF NOT EXISTS config( 47 | key TEXT, 48 | value TEXT 49 | ); 50 | ---------------------------------------------------------------------------------------------------- 51 | CREATE TABLE IF NOT EXISTS submissions( 52 | idint INT, 53 | idstr TEXT, 54 | created INT, 55 | self INT, 56 | nsfw INT, 57 | author TEXT, 58 | title TEXT, 59 | url TEXT, 60 | selftext TEXT, 61 | score INT, 62 | subreddit TEXT, 63 | distinguish INT, 64 | textlen INT, 65 | num_comments INT, 66 | flair_text TEXT, 67 | flair_css_class TEXT, 68 | augmented_at INT, 69 | augmented_count INT 70 | ); 71 | CREATE INDEX IF NOT EXISTS submission_index ON submissions(idstr); 72 | ---------------------------------------------------------------------------------------------------- 73 | CREATE TABLE IF NOT EXISTS comments( 74 | idint INT, 75 | idstr TEXT, 76 | created INT, 77 | author TEXT, 78 | parent TEXT, 79 | submission TEXT, 80 | body TEXT, 81 | score INT, 82 | subreddit TEXT, 83 | distinguish TEXT, 84 | textlen INT 85 | ); 86 | CREATE INDEX IF NOT EXISTS comment_index ON comments(idstr); 87 | ---------------------------------------------------------------------------------------------------- 88 | CREATE TABLE IF NOT EXISTS submission_edits( 89 | idstr TEXT, 90 | previous_selftext TEXT, 91 | replaced_at INT 92 | ); 93 | CREATE INDEX IF NOT EXISTS submission_edits_index ON submission_edits(idstr); 94 | ---------------------------------------------------------------------------------------------------- 95 | CREATE TABLE IF NOT EXISTS comment_edits( 96 | idstr TEXT, 97 | previous_body TEXT, 98 | 
replaced_at INT 99 | ); 100 | CREATE INDEX IF NOT EXISTS comment_edits_index ON comment_edits(idstr); 101 | ''' 102 | 103 | DEFAULT_CONFIG = { 104 | 'store_edits': True, 105 | } 106 | 107 | SQL_SUBMISSION_COLUMNS = [ 108 | 'idint', 109 | 'idstr', 110 | 'created', 111 | 'self', 112 | 'nsfw', 113 | 'author', 114 | 'title', 115 | 'url', 116 | 'selftext', 117 | 'score', 118 | 'subreddit', 119 | 'distinguish', 120 | 'textlen', 121 | 'num_comments', 122 | 'flair_text', 123 | 'flair_css_class', 124 | 'augmented_at', 125 | 'augmented_count', 126 | ] 127 | 128 | SQL_COMMENT_COLUMNS = [ 129 | 'idint', 130 | 'idstr', 131 | 'created', 132 | 'author', 133 | 'parent', 134 | 'submission', 135 | 'body', 136 | 'score', 137 | 'subreddit', 138 | 'distinguish', 139 | 'textlen', 140 | ] 141 | 142 | SQL_EDITS_COLUMNS = [ 143 | 'idstr', 144 | 'text', 145 | 'replaced_at', 146 | ] 147 | 148 | SQL_SUBMISSION = {key:index for (index, key) in enumerate(SQL_SUBMISSION_COLUMNS)} 149 | SQL_COMMENT = {key:index for (index, key) in enumerate(SQL_COMMENT_COLUMNS)} 150 | 151 | SUBMISSION_TYPES = (common.praw.models.Submission, pushshift.DummySubmission) 152 | COMMENT_TYPES = (common.praw.models.Comment, pushshift.DummyComment) 153 | 154 | 155 | class DBEntry: 156 | ''' 157 | This class converts a tuple row from the database into an object so that 158 | you can access the attributes with dot notation. 159 | ''' 160 | def __init__(self, dbrow): 161 | if dbrow[1].startswith('t3_'): 162 | columns = SQL_SUBMISSION_COLUMNS 163 | self.object_type = 'submission' 164 | else: 165 | columns = SQL_COMMENT_COLUMNS 166 | self.object_type = 'comment' 167 | 168 | self.id = None 169 | self.idstr = None 170 | for (index, attribute) in enumerate(columns): 171 | setattr(self, attribute, dbrow[index]) 172 | 173 | def __repr__(self): 174 | return 'DBEntry(\'%s\')' % self.id 175 | 176 | 177 | class TSDB: 178 | def __init__(self, filepath, *, do_create=True, skip_version_check=False): 179 | self.filepath = pathclass.Path(filepath) 180 | if not self.filepath.is_file: 181 | if not do_create: 182 | raise exceptions.DatabaseNotFound(self.filepath) 183 | print('New database', self.filepath.relative_path) 184 | 185 | self.filepath.parent.makedirs(exist_ok=True) 186 | 187 | self.breakdown_dir = self.filepath.parent.with_child('breakdown') 188 | self.offline_reading_dir = self.filepath.parent.with_child('offline_reading') 189 | self.index_dir = self.filepath.parent.with_child('index') 190 | self.styles_dir = self.filepath.parent.with_child('styles') 191 | self.wiki_dir = self.filepath.parent.with_child('wiki') 192 | 193 | existing_database = self.filepath.exists 194 | self.sql = sqlite3.connect(self.filepath.absolute_path) 195 | self.cur = self.sql.cursor() 196 | 197 | if existing_database: 198 | if not skip_version_check: 199 | self._check_version() 200 | self._load_pragmas() 201 | else: 202 | self._first_time_setup() 203 | 204 | self.config = {} 205 | for (key, default_value) in DEFAULT_CONFIG.items(): 206 | self.cur.execute('SELECT value FROM config WHERE key == ?', [key]) 207 | existing_value = self.cur.fetchone() 208 | if existing_value is None: 209 | self.cur.execute('INSERT INTO config VALUES(?, ?)', [key, default_value]) 210 | self.config[key] = default_value 211 | else: 212 | existing_value = existing_value[0] 213 | if isinstance(default_value, int): 214 | existing_value = int(existing_value) 215 | self.config[key] = existing_value 216 | 217 | def _check_version(self): 218 | ''' 219 | Compare database's user_version against DATABASE_VERSION, 220 
| raising exceptions.DatabaseOutOfDate if not correct. 221 | ''' 222 | existing = self.cur.execute('PRAGMA user_version').fetchone()[0] 223 | if existing != DATABASE_VERSION: 224 | raise exceptions.DatabaseOutOfDate( 225 | current=existing, 226 | new=DATABASE_VERSION, 227 | filepath=self.filepath, 228 | ) 229 | 230 | def _first_time_setup(self): 231 | self.sql.executescript(DB_INIT) 232 | self.sql.commit() 233 | 234 | def _load_pragmas(self): 235 | self.sql.executescript(DB_PRAGMAS) 236 | self.sql.commit() 237 | 238 | def __repr__(self): 239 | return 'TSDB(%s)' % self.filepath 240 | 241 | @staticmethod 242 | def _pick_filepath(formats, name): 243 | ''' 244 | Starting with the most specific and preferred filename format, check 245 | if there is an existing database that matches the name we're looking 246 | for, and return that path. If none of them exist, then use the most 247 | preferred filepath. 248 | ''' 249 | for form in formats: 250 | path = form.format(name=name) 251 | if os.path.isfile(path): 252 | break 253 | return pathclass.Path(path) 254 | 255 | @classmethod 256 | def _for_object_helper(cls, name, path_formats, do_create=True, fix_name=False): 257 | if name != os.path.basename(name): 258 | filepath = pathclass.Path(name) 259 | 260 | else: 261 | filepath = cls._pick_filepath(formats=path_formats, name=name) 262 | 263 | database = cls(filepath=filepath, do_create=do_create) 264 | if fix_name: 265 | return (database, name_from_path(name)) 266 | return database 267 | 268 | @classmethod 269 | def for_submission(cls, submission_id, fix_name=False, *args, **kwargs): 270 | subreddit = common.subreddit_for_submission(submission_id) 271 | database = cls.for_subreddit(subreddit, *args, **kwargs) 272 | if fix_name: 273 | return (database, subreddit.display_name) 274 | return database 275 | 276 | @classmethod 277 | def for_subreddit(cls, name, do_create=True, fix_name=False): 278 | if isinstance(name, common.praw.models.Subreddit): 279 | name = name.display_name 280 | elif not isinstance(name, str): 281 | raise TypeError(name, 'should be str or Subreddit.') 282 | return cls._for_object_helper( 283 | name, 284 | do_create=do_create, 285 | fix_name=fix_name, 286 | path_formats=DB_FORMATS_SUBREDDIT, 287 | ) 288 | 289 | @classmethod 290 | def for_user(cls, name, do_create=True, fix_name=False): 291 | if isinstance(name, common.praw.models.Redditor): 292 | name = name.name 293 | elif not isinstance(name, str): 294 | raise TypeError(name, 'should be str or Redditor.') 295 | 296 | return cls._for_object_helper( 297 | name, 298 | do_create=do_create, 299 | fix_name=fix_name, 300 | path_formats=DB_FORMATS_USER, 301 | ) 302 | 303 | def check_for_edits(self, obj, existing_entry): 304 | ''' 305 | If the item's current text doesn't match the stored text, decide what 306 | to do. 307 | 308 | Firstly, make sure to ignore deleted comments. 309 | Then, if the database is configured to store edited text, do so. 310 | Finally, return the body that we want to store in the main table. 
311 | ''' 312 | if isinstance(obj, SUBMISSION_TYPES): 313 | existing_body = existing_entry[SQL_SUBMISSION['selftext']] 314 | body = obj.selftext 315 | else: 316 | existing_body = existing_entry[SQL_COMMENT['body']] 317 | body = obj.body 318 | 319 | if body != existing_body: 320 | if should_keep_existing_text(obj): 321 | body = existing_body 322 | elif self.config['store_edits']: 323 | self.insert_edited(obj, old_text=existing_body) 324 | return body 325 | 326 | def insert(self, objects, commit=True): 327 | if not isinstance(objects, (list, tuple, types.GeneratorType)): 328 | objects = [objects] 329 | 330 | if isinstance(objects, types.GeneratorType): 331 | log.debug('Trying to insert a generator of objects.') 332 | else: 333 | log.debug('Trying to insert %d objects.', len(objects)) 334 | 335 | new_values = { 336 | 'tsdb': self, 337 | 'new_submissions': 0, 338 | 'new_comments': 0, 339 | } 340 | methods = { 341 | common.praw.models.Submission: (self.insert_submission, 'new_submissions'), 342 | common.praw.models.Comment: (self.insert_comment, 'new_comments'), 343 | } 344 | methods[pushshift.DummySubmission] = methods[common.praw.models.Submission] 345 | methods[pushshift.DummyComment] = methods[common.praw.models.Comment] 346 | 347 | for obj in objects: 348 | (method, key) = methods.get(type(obj), (None, None)) 349 | if method is None: 350 | raise TypeError('Unsupported', type(obj), obj) 351 | status = method(obj) 352 | new_values[key] += status 353 | 354 | if commit: 355 | log.debug('Committing insert.') 356 | self.sql.commit() 357 | 358 | log.debug('Done inserting.') 359 | return new_values 360 | 361 | def insert_edited(self, obj, old_text): 362 | ''' 363 | Having already detected that the item has been edited, add a record to 364 | the appropriate *_edits table containing the text that is being 365 | replaced. 366 | ''' 367 | if isinstance(obj, SUBMISSION_TYPES): 368 | table = 'submission_edits' 369 | key = 'previous_selftext' 370 | else: 371 | table = 'comment_edits' 372 | key = 'previous_body' 373 | 374 | if obj.edited is False: 375 | replaced_at = int(time.time()) 376 | else: 377 | replaced_at = int(obj.edited) 378 | 379 | postdata = { 380 | 'idstr': obj.fullname, 381 | key: old_text, 382 | 'replaced_at': replaced_at, 383 | } 384 | cur = self.sql.cursor() 385 | (qmarks, bindings) = sqlhelpers.insert_filler(postdata) 386 | query = f'INSERT INTO {table} {qmarks}' 387 | cur.execute(query, bindings) 388 | 389 | def insert_submission(self, submission): 390 | cur = self.sql.cursor() 391 | cur.execute('SELECT * FROM submissions WHERE idstr == ?', [submission.fullname]) 392 | existing_entry = cur.fetchone() 393 | 394 | if submission.author is None: 395 | author = '[DELETED]' 396 | else: 397 | author = submission.author.name 398 | 399 | if not existing_entry: 400 | if submission.is_self: 401 | # Selfpost's URL leads back to itself, so just ignore it. 
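                # (Crossposts store the parent's permalink from
                # crosspost_parent_list instead, and relative /r/... permalinks
                # are made absolute a few lines further down.)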
402 | url = None 403 | elif hasattr(submission, 'crosspost_parent') and getattr(submission, 'crosspost_parent_list'): 404 | url = submission.crosspost_parent_list[0]['permalink'] 405 | else: 406 | url = getattr(submission, 'url', None) 407 | 408 | if url and url.startswith('/r/'): 409 | url = 'https://reddit.com' + url 410 | 411 | postdata = { 412 | 'idint': common.b36(submission.id), 413 | 'idstr': submission.fullname, 414 | 'created': submission.created_utc, 415 | 'self': submission.is_self, 416 | 'nsfw': submission.over_18, 417 | 'author': author, 418 | 'title': submission.title, 419 | 'url': url, 420 | 'selftext': submission.selftext, 421 | 'score': submission.score, 422 | 'subreddit': submission.subreddit.display_name, 423 | 'distinguish': submission.distinguished, 424 | 'textlen': len(submission.selftext), 425 | 'num_comments': submission.num_comments, 426 | 'flair_text': submission.link_flair_text, 427 | 'flair_css_class': submission.link_flair_css_class, 428 | 'augmented_at': None, 429 | 'augmented_count': None, 430 | } 431 | (qmarks, bindings) = sqlhelpers.insert_filler(postdata) 432 | query = f'INSERT INTO submissions {qmarks}' 433 | cur.execute(query, bindings) 434 | 435 | else: 436 | selftext = self.check_for_edits(submission, existing_entry=existing_entry) 437 | 438 | query = ''' 439 | UPDATE submissions SET 440 | nsfw = coalesce(?, nsfw), 441 | score = coalesce(?, score), 442 | selftext = coalesce(?, selftext), 443 | distinguish = coalesce(?, distinguish), 444 | num_comments = coalesce(?, num_comments), 445 | flair_text = coalesce(?, flair_text), 446 | flair_css_class = coalesce(?, flair_css_class) 447 | WHERE idstr == ? 448 | ''' 449 | bindings = [ 450 | submission.over_18, 451 | submission.score, 452 | selftext, 453 | submission.distinguished, 454 | submission.num_comments, 455 | submission.link_flair_text, 456 | submission.link_flair_css_class, 457 | submission.fullname 458 | ] 459 | cur.execute(query, bindings) 460 | 461 | return existing_entry is None 462 | 463 | def insert_comment(self, comment): 464 | cur = self.sql.cursor() 465 | cur.execute('SELECT * FROM comments WHERE idstr == ?', [comment.fullname]) 466 | existing_entry = cur.fetchone() 467 | 468 | if comment.author is None: 469 | author = '[DELETED]' 470 | else: 471 | author = comment.author.name 472 | 473 | if not existing_entry: 474 | postdata = { 475 | 'idint': common.b36(comment.id), 476 | 'idstr': comment.fullname, 477 | 'created': comment.created_utc, 478 | 'author': author, 479 | 'parent': comment.parent_id, 480 | 'submission': comment.link_id, 481 | 'body': comment.body, 482 | 'score': comment.score, 483 | 'subreddit': comment.subreddit.display_name, 484 | 'distinguish': comment.distinguished, 485 | 'textlen': len(comment.body), 486 | } 487 | (qmarks, bindings) = sqlhelpers.insert_filler(postdata) 488 | query = f'INSERT INTO comments {qmarks}' 489 | cur.execute(query, bindings) 490 | 491 | else: 492 | body = self.check_for_edits(comment, existing_entry=existing_entry) 493 | 494 | query = ''' 495 | UPDATE comments SET 496 | score = coalesce(?, score), 497 | body = coalesce(?, body), 498 | distinguish = coalesce(?, distinguish) 499 | WHERE idstr == ? 
500 | ''' 501 | bindings = [ 502 | comment.score, 503 | body, 504 | comment.distinguished, 505 | comment.fullname 506 | ] 507 | cur.execute(query, bindings) 508 | 509 | return existing_entry is None 510 | 511 | 512 | def name_from_path(filepath): 513 | ''' 514 | In order to support usage like 515 | > timesearch livestream -r D:\\some\\other\\filepath\\learnpython.db 516 | this function extracts the subreddit name / username based on the given 517 | path, so that we can pass it into `r.subreddit` / `r.redditor` properly. 518 | ''' 519 | if isinstance(filepath, pathclass.Path): 520 | filepath = filepath.basename 521 | else: 522 | filepath = os.path.basename(filepath) 523 | name = os.path.splitext(filepath)[0] 524 | name = name.strip('@') 525 | return name 526 | 527 | def should_keep_existing_text(obj): 528 | ''' 529 | Under certain conditions we do not want to update the entry in the db 530 | with the most recent copy of the text. For example, if the post has 531 | been deleted and the text now shows '[deleted]' we would prefer to 532 | keep whatever we already have. 533 | 534 | This function puts away the work I would otherwise have to duplicate 535 | for both submissions and comments. 536 | ''' 537 | body = obj.selftext if isinstance(obj, SUBMISSION_TYPES) else obj.body 538 | if obj.author is None and body in ['[removed]', '[deleted]']: 539 | return True 540 | 541 | greasy = ['has been overwritten', 'pastebin.com/64GuVi2F'] 542 | if any(grease in body for grease in greasy): 543 | return True 544 | 545 | return False 546 | -------------------------------------------------------------------------------- /utilities/database_upgrader.py: -------------------------------------------------------------------------------- 1 | import argparse 2 | import os 3 | import sqlite3 4 | import sys 5 | 6 | sys.path.append(os.path.dirname(os.path.dirname(__file__))) 7 | 8 | from timesearch_modules import tsdb 9 | 10 | 11 | def upgrade_1_to_2(db): 12 | ''' 13 | In this version, many of the timesearch modules were renamed, including 14 | redmash -> index. This update will rename the existing `redmash` folder 15 | to `index`. 16 | ''' 17 | cur = db.sql.cursor() 18 | redmash_dir = db.index_dir.parent.with_child('redmash') 19 | if redmash_dir.exists: 20 | redmash_dir.assert_is_directory() 21 | print('Renaming redmash folder to index.') 22 | os.rename(redmash_dir, db.index_dir) 23 | 24 | def upgrade_all(database_filename): 25 | ''' 26 | Given the filename of a database, apply all of the needed 27 | upgrade_x_to_y functions in order. 28 | ''' 29 | db = tsdb.TSDB(database_filename, do_create=False, skip_version_check=True) 30 | 31 | cur = db.sql.cursor() 32 | 33 | cur.execute('PRAGMA user_version') 34 | current_version = cur.fetchone()[0] 35 | needed_version = tsdb.DATABASE_VERSION 36 | 37 | if current_version == needed_version: 38 | print('Already up to date with version %d.' 
% needed_version) 39 | return 40 | 41 | for version_number in range(current_version + 1, needed_version + 1): 42 | print('Upgrading from %d to %d' % (current_version, version_number)) 43 | upgrade_function = 'upgrade_%d_to_%d' % (current_version, version_number) 44 | upgrade_function = eval(upgrade_function) 45 | upgrade_function(db) 46 | db.sql.cursor().execute('PRAGMA user_version = %d' % version_number) 47 | db.sql.commit() 48 | current_version = version_number 49 | print('Upgrades finished.') 50 | 51 | 52 | def upgrade_all_argparse(args): 53 | return upgrade_all(database_filename=args.database_filename) 54 | 55 | def main(argv): 56 | parser = argparse.ArgumentParser() 57 | 58 | parser.add_argument('database_filename') 59 | parser.set_defaults(func=upgrade_all_argparse) 60 | 61 | args = parser.parse_args(argv) 62 | return args.func(args) 63 | 64 | if __name__ == '__main__': 65 | raise SystemExit(main(sys.argv[1:])) 66 | --------------------------------------------------------------------------------
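The modules above are normally driven through timesearch.py's argument parser, but each *_argparse wrapper only forwards to a plain function, so they can also be called from Python. The sketch below is not part of the repository; it assumes the repository root is on sys.path, that a bot.py with reddit credentials and a CONTACT_INFO string is present (common.login() and the pushshift module both expect it), and that learnpython_dump.jsonl is a hypothetical newline-delimited JSON file of the kind ingest_jsonfile reads.

from timesearch_modules import ingest_jsonfile, livestream, offline_reading

# Backfill the r/learnpython database from a newline-delimited JSON file,
# one submission or comment object per line.
ingest_jsonfile.ingest_jsonfile('learnpython_dump.jsonl', subreddit='learnpython')

# Poll reddit once for the newest submissions and comments, then stop.
livestream.livestream(subreddit='learnpython', only_once=True)

# Render every stored submission and its comment tree to
# offline_reading/<submission id>.html next to the database.
offline_reading.offline_reading(subreddit='learnpython')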