├── .gitignore
├── CONTACT.md
├── LICENSE.txt
├── README.md
├── requirements.txt
├── timesearch.py
├── timesearch_logo.svg
├── timesearch_modules
    ├── breakdown.py
    ├── common.py
    ├── exceptions.py
    ├── get_comments.py
    ├── get_styles.py
    ├── get_submissions.py
    ├── get_wiki.py
    ├── index.py
    ├── ingest_jsonfile.py
    ├── livestream.py
    ├── merge_db.py
    ├── offline_reading.py
    ├── pushshift.py
    └── tsdb.py
└── utilities
    └── database_upgrader.py


/.gitignore:
--------------------------------------------------------------------------------
1 | databases/*
2 | @hangman.md
3 | hangman.py
4 | 


--------------------------------------------------------------------------------
/CONTACT.md:
--------------------------------------------------------------------------------
 1 | Contact
 2 | =======
 3 | 
 4 | Please do not open pull requests without talking to me first. For serious issues and bugs, open a GitHub issue. If you just have a question, please send an email to `contact@voussoir.net`. For other contact options, see [voussoir.net/#contact](https://voussoir.net/#contact).
 5 | 
 6 | I also mirror my work to other git services:
 7 | 
 8 | - https://github.com/voussoir
 9 | 
10 | - https://gitlab.com/voussoir
11 | 
12 | - https://codeberg.org/voussoir
13 | 


--------------------------------------------------------------------------------
/LICENSE.txt:
--------------------------------------------------------------------------------
 1 | BSD 3-Clause License
 2 | 
 3 | Copyright (c) 2021, Ethan Dalool aka voussoir
 4 | All rights reserved.
 5 | 
 6 | Redistribution and use in source and binary forms, with or without
 7 | modification, are permitted provided that the following conditions are met:
 8 | 
 9 | 1. Redistributions of source code must retain the above copyright notice, this
10 |    list of conditions and the following disclaimer.
11 | 
12 | 2. Redistributions in binary form must reproduce the above copyright notice,
13 |    this list of conditions and the following disclaimer in the documentation
14 |    and/or other materials provided with the distribution.
15 | 
16 | 3. Neither the name of the copyright holder nor the names of its
17 |    contributors may be used to endorse or promote products derived from
18 |    this software without specific prior written permission.
19 | 
20 | THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
21 | AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
22 | IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
23 | DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
24 | FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
25 | DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
26 | SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
27 | CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
28 | OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
29 | OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
30 | 


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
  1 | timesearch
  2 | ==========
  3 | 
  4 | ## NEWS (2023 06 25):
  5 | 
  6 | Pushshift's API is currently offline. Without the timestamp search parameter or Pushshift access, timesearch is not able to get historical data. You can continue to use the `livestream` module to collect new posts and comments as they are made.
  7 | 
  8 | You can still download the Pushshift archives, though. https://the-eye.eu/redarcs/ is one source.
  9 | 
 10 | I have added a module for ingesting these json files into a timesearch database so that you can continue to use `offline_reading`, or if you just prefer the sqlite format. You need to extract the zst file with an archive tool like [7-Zip](https://www.7-zip.org/) before giving it to timesearch.
 11 | 
 12 | `python timesearch.py ingest_jsonfile subredditname_submissions -r subredditname`
 13 | 
 14 | `python timesearch.py ingest_jsonfile subredditname_comments -r subredditname`
 15 | 
 16 | ## NEWS (2023 05 01):
 17 | 
 18 | [Reddit has revoked Pushshift's API access](https://old.reddit.com/r/modnews/comments/134tjpe/reddit_data_api_update_changes_to_pushshift_access/), so [pushshift.io](https://pushshift.io) may not be able to continue ingesting reddit content.
 19 | 
 20 | ## NEWS (2018 04 09):
 21 | 
 22 | [Reddit has removed the timestamp search feature which timesearch was built off of](https://voussoir.github.io/t3_7tus5f.html#t1_dtfcdn0) ([original](https://old.reddit.com/r/changelog/comments/7tus5f/update_to_search_api/dtfcdn0/)). Please message the admins by [sending a PM to /r/reddit.com](https://old.reddit.com/message/compose?to=%2Fr%2Freddit.com&subject=Timestamp+search). Let them know that this feature is important to you, and you would like them to restore it on the new search stack.
 23 | 
 24 | Thankfully, Jason Baumgartner aka [/u/Stuck_in_the_Matrix](https://old.reddit.com/u/Stuck_in_the_Matrix/overview), owner of [Pushshift.io](https://github.com/pushshift/api), has made it easy to interact with his dataset. Timesearch now queries his API to get post data, and then uses reddit's /api/info to get up-to-date information about those posts (scores, edited text bodies, ...). While we're at it, this also gives us the ability to speed up `get_comments`. In addition, we can get all of a user's comments which was not possible through reddit alone.
 25 | 
 26 | NOTE: Because Pushshift is an independent dataset run by a regular person, it does not contain posts from private subreddits. Without the timestamp search parameter, scanning private subreddits is now impossible. I urge once again that you contact ~~your senator~~ the admins to have this feature restored.
 27 | 
 28 | ---
 29 | 
 30 | I don't have a test suite. You're my test suite! Messages go to [/u/GoldenSights](https://old.reddit.com/u/GoldenSights).
 31 | 
 32 | Timesearch is a collection of utilities for archiving subreddits.
 33 | 
 34 | ## Make sure you have:
 35 | - Downloaded this project using the green "Clone or Download" button in the upper right.
 36 | - Installed [Python](https://www.python.org/download). I use Python 3.7.
 37 | - Installed PRAW >= 4, as well as the other modules in `requirements.txt`. Try `pip install -r requirements.txt` to get them all.
 38 | - Created an OAuth app at https://old.reddit.com/prefs/apps. Make it `script` type, and set the redirect URI to `http://localhost:8080`. The title and description can be anything you want, and the about URL is not required.
 39 | - Used [this PRAW script](https://praw.readthedocs.io/en/latest/tutorials/refresh_token.html) to generate a refresh token. Just save it as a .py file somewhere and run it through your terminal / command line. For simplicity's sake, I just choose `all` for the scopes.
 40 |   - The instructions mention `export praw_client_id=...`. This creates environment variables on Linux. If you are on Windows, or simply don't want to create environment variables, you can alternatively add `client_id='...'` and `client_secret='...'` to the `praw.Reddit` instance on line 40, alongside the `redirect_uri` and `user_agent` arguments.
 41 | - Downloaded a copy of [this file](https://github.com/voussoir/reddit/blob/master/bot4.py) and saved it as `bot.py`. Fill out the variables using your OAuth information, and read the instructions to see where to put it. The most simple way is to save it in the same folder as this README file.
 42 |   - The `USERAGENT` is a description of your API usage. Typically "/u/username's praw client" is sufficient.
 43 |   - The `CONTACT_INFO` is sent when downloading from Pushshift, [as encouraged by Stuck_in_the_Matrix](https://old.reddit.com/r/pushshift/comments/c5yr9l/i_had_to_ban_a_couple_ips_that_were_making/). It could just be your email address or reddit username.
 44 | 
 45 | ## This package consists of:
 46 | 
 47 | - **get_submissions**: If you try to page through `/new` on a subreddit, you'll hit a limit at or before 1,000 posts. Timesearch uses the pushshift.io dataset to get information about very old posts, and then queries the reddit api to update their information. Previously, we used the `timestamp` cloudsearch query parameter on reddit's own API, but reddit has removed that feature and pushshift is now the only viable source for initial data.  
 48 |     `python timesearch.py get_submissions -r subredditname <flags>`  
 49 |     `python timesearch.py get_submissions -u username <flags>`
 50 | 
 51 | - **get_comments**: Similar to `get_submissions`, this tool queries pushshift for comment data and updates it from reddit.  
 52 |     `python timesearch.py get_comments -r subredditname <flags>`  
 53 |     `python timesearch.py get_comments -u username <flags>`
 54 | 
 55 | - **livestream**: get_submissions+get_comments is great for starting your database and getting the historical posts, but it's not the best for staying up-to-date. Instead, livestream monitors `/new` and `/comments` to continuously ingest data.  
 56 |     `python timesearch.py livestream -r subredditname <flags>`  
 57 |     `python timesearch.py livestream -u username <flags>`
 58 | 
 59 | - **get_styles**: Downloads the stylesheet and CSS images.  
 60 |     `python timesearch.py get_styles -r subredditname`
 61 | 
 62 | - **get_wiki**: Downloads the wiki pages, sidebar, etc. from /wiki/pages.  
 63 |     `python timesearch.py get_wiki -r subredditname`
 64 | 
 65 | - **offline_reading**: Renders comment threads into HTML via markdown.  
 66 |     Note: I'm currently using the [markdown library from pypi](https://pypi.python.org/pypi/Markdown), and it doesn't do reddit's custom markdown like `/r/` or `/u/`, obviously. So far I don't think anybody really uses o_r so I haven't invested much time into improving it.  
 67 |     `python timesearch.py offline_reading -r subredditname <flags>`  
 68 |     `python timesearch.py offline_reading -u username <flags>`
 69 | 
 70 | - **index**: Generates plaintext or HTML lists of submissions, sorted by a property of your choosing. You can order by date, author, flair, etc. With the `--offline` parameter, you can make all the links point to the files you generated with `offline_reading`.  
 71 |     `python timesearch.py index -r subredditname <flags>`  
 72 |     `python timesearch.py index -u username <flags>`
 73 | 
 74 | - **breakdown**: Produces a JSON file indicating which users make the most posts in a subreddit, or which subreddits a user posts in.  
 75 |     `python timesearch.py breakdown -r subredditname` <flags>  
 76 |     `python timesearch.py breakdown -u username` <flags>
 77 | 
 78 | - **merge_db**: Copy all new data from one timesearch database into another. Useful for syncing or merging two scans of the same subreddit.  
 79 |     `python timesearch.py merge_db --from filepath/database1.db --to filepath/database2.db`
 80 | 
 81 | ### To use it
 82 | 
 83 | When you download this project, the main file that you will execute is `timesearch.py` here in the root directory. It will load the appropriate module to run your command from the modules folder.
 84 | 
 85 | You can view a summarized version of all the help text by running `timesearch.py`, and you can view a specific help text by running a command with no arguments, like `timesearch.py livestream`, etc.
 86 | 
 87 | I recommend [sqlitebrowser](https://github.com/sqlitebrowser/sqlitebrowser/releases) if you want to inspect the database yourself.
 88 | 
 89 | ## Changelog
 90 | - 2020 01 27
 91 |     - When I first created Timesearch, it was simply a collection of all the random scripts I had written to archive various things. And they tended to have wacky names like `commentaugment` and `redmash`. Well, since the timesearch toolkit is meant to be a singular cohesive package now I decided to finally rename everything. I believe I have aliased everything properly so the old names still work for backwards compat, except for the fact the modules folder is now called `timesearch_modules` which may break your import statements if you ever imported that on your own.
 92 | 
 93 | - 2018 04 09
 94 |     - Integrated with Pushshift to restore timesearch functionality, speed up commentaugment, and get user comments.
 95 | 
 96 | - 2017 11 13
 97 |     - Gave timesearch its own Github repository so that (1) it will be easier for people to download it and (2) it has a cleaner, more independent URL. [voussoir/timesearch](https://github.com/voussoir/timesearch)
 98 | 
 99 | - 2017 11 05
100 |     - Added a try-except inside livestream helper to prevent generator from terminating.
101 | 
102 | - 2017 11 04
103 |     - For timesearch, I switched from using my custom cloudsearch iterator to the one that comes with PRAW4+.
104 | 
105 | - 2017 10 12
106 |     - Added the `mergedb` utility for combining databases.
107 | 
108 | - 2017 06 02
109 |     - You can use `commentaugment -s abcdef` to get a particular thread even if you haven't scraped anything else from that subreddit. Previously `-s` only worked if the database already existed and you specified it via `-r`. Now it is inferred from the submission itself.
110 | 
111 | - 2017 04 28
112 |     - Complete restructure into package, started using PRAW4.
113 | 
114 | - 2016 08 10
115 |     - Started merging redmash and wrote its argparser
116 | 
117 | - 2016 07 03
118 |     - Improved docstring clarity.
119 | 
120 | - 2016 07 02
121 |     - Added `livestream` argparse
122 | 
123 | - 2016 06 07
124 |     - Offline_reading has been merged with the main timesearch file
125 |     - `get_all_posts` renamed to `timesearch`
126 |     - Timesearch parameter `usermode` renamed to `username`; `maxupper` renamed to `upper`.
127 |     - Everything now accessible via commandline arguments. Read the docstring at the top of the file.
128 | 
129 | - 2016 06 05
130 |     - NEW DATABASE SCHEME. Submissions and comments now live in different tables like they should have all along. Submission table has two new columns for a little bit of commentaugment metadata. This allows commentaugment to only scan threads that are new.
131 |     - You can use the `migrate_20160605.py` script to convert old databases into new ones.
132 | 
133 | - 2015 11 11
134 |     - created `offline_reading.py` which converts a timesearch database into a comment tree that can be rendered into HTML
135 | 
136 | - 2015 09 07
137 |     - fixed bug which allowed `livestream` to crash because `bot.refresh()` was outside of the try-catch.
138 | 
139 | - 2015 08 19
140 |     - fixed bug in which updatescores stopped iterating early if you had more than 100 comments in a row in the db
141 |     - commentaugment has been completely merged into the timesearch.py file. you can use commentaugment_prompt() to input the parameters, or use the commentaugment() function directly.
142 | 
143 | 
144 | ____
145 | 
146 | 
147 | I want to live in a future where everyone uses UTC and agrees on daylight savings.
148 | 
149 | <p align="center">
150 |     <img height=256 src="https://github.com/voussoir/timesearch/blob/master/timesearch_logo.svg?raw=true&sanitize=true" alt="Timesearch"/>
151 | </p>
152 | 
153 | ## Mirrors
154 | 
155 | https://git.voussoir.net/voussoir/timesearch
156 | 
157 | https://github.com/voussoir/timesearch
158 | 
159 | https://gitlab.com/voussoir/timesearch
160 | 
161 | https://codeberg.org/voussoir/timesearch
162 | 


--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
1 | markdown
2 | praw
3 | voussoirkit
4 | 


--------------------------------------------------------------------------------
/timesearch.py:
--------------------------------------------------------------------------------
  1 | '''
  2 | This is the main launch file for Timesearch.
  3 | 
  4 | When you run `python timesearch.py get_submissions -r subredditname` or any
  5 | other command, your arguments will go to the timesearch_modules file as
  6 | appropriate for your command.
  7 | '''
  8 | import argparse
  9 | import sys
 10 | 
 11 | from voussoirkit import betterhelp
 12 | from voussoirkit import vlogging
 13 | 
 14 | from timesearch_modules import exceptions
 15 | 
 16 | # NOTE: Originally I wanted the docstring for each module to be within their
 17 | # file. However, this means that composing the global helptext would require
 18 | # importing those modules, which will subsequently import PRAW and a whole lot
 19 | # of other things. This made TS very slow to load which is okay when you're
 20 | # actually using it but really terrible when you're just viewing the help text.
 21 | 
 22 | def breakdown_gateway(args):
 23 |     from timesearch_modules import breakdown
 24 |     breakdown.breakdown_argparse(args)
 25 | 
 26 | def get_comments_gateway(args):
 27 |     from timesearch_modules import get_comments
 28 |     get_comments.get_comments_argparse(args)
 29 | 
 30 | def get_styles_gateway(args):
 31 |     from timesearch_modules import get_styles
 32 |     get_styles.get_styles_argparse(args)
 33 | 
 34 | def get_wiki_gateway(args):
 35 |     from timesearch_modules import get_wiki
 36 |     get_wiki.get_wiki_argparse(args)
 37 | 
 38 | def ingest_jsonfile_gateway(args):
 39 |     from timesearch_modules import ingest_jsonfile
 40 |     ingest_jsonfile.ingest_jsonfile_argparse(args)
 41 | 
 42 | def livestream_gateway(args):
 43 |     from timesearch_modules import livestream
 44 |     livestream.livestream_argparse(args)
 45 | 
 46 | def merge_db_gateway(args):
 47 |     from timesearch_modules import merge_db
 48 |     merge_db.merge_db_argparse(args)
 49 | 
 50 | def offline_reading_gateway(args):
 51 |     from timesearch_modules import offline_reading
 52 |     offline_reading.offline_reading_argparse(args)
 53 | 
 54 | def index_gateway(args):
 55 |     from timesearch_modules import index
 56 |     index.index_argparse(args)
 57 | 
 58 | def get_submissions_gateway(args):
 59 |     from timesearch_modules import get_submissions
 60 |     get_submissions.get_submissions_argparse(args)
 61 | 
 62 | @vlogging.main_decorator
 63 | def main(argv):
 64 |     parser = argparse.ArgumentParser(
 65 |         description='''
 66 |         The subreddit archiver
 67 | 
 68 |         The basics:
 69 |         1. Collect a subreddit's submissions
 70 |             timesearch get_submissions -r subredditname
 71 | 
 72 |         2. Collect the comments for those submissions
 73 |             timesearch get_comments -r subredditname
 74 | 
 75 |         3. Stay up to date
 76 |             timesearch livestream -r subredditname
 77 |         ''',
 78 |     )
 79 |     subparsers = parser.add_subparsers()
 80 | 
 81 |     # BREAKDOWN
 82 |     p_breakdown = subparsers.add_parser(
 83 |         'breakdown',
 84 |         description='''
 85 |         Generate the comment / submission counts for users in a subreddit, or
 86 |         the subreddits that a user posts to.
 87 | 
 88 |         Automatically dumps into a <database>_breakdown.json file
 89 |         in the same directory as the database.
 90 |         ''',
 91 |     )
 92 |     p_breakdown.add_argument(
 93 |         '--sort',
 94 |         dest='sort',
 95 |         type=str,
 96 |         default=None,
 97 |         help='''
 98 |         Sort the output by one property.
 99 |         Should be one of "name", "submissions", "comments", "total_posts".
100 |         ''',
101 |     )
102 |     p_breakdown.add_argument(
103 |         '-r',
104 |         '--subreddit',
105 |         dest='subreddit',
106 |         default=None,
107 |         help='''
108 |         The subreddit database to break down.
109 |         ''',
110 |     )
111 |     p_breakdown.add_argument(
112 |         '-u',
113 |         '--user',
114 |         dest='username',
115 |         default=None,
116 |         help='''
117 |         The username database to break down.
118 |         ''',
119 |     )
120 |     p_breakdown.set_defaults(func=breakdown_gateway)
121 | 
122 |     # GET_COMMENTS
123 |     p_get_comments = subparsers.add_parser(
124 |         'get_comments',
125 |         aliases=['get-comments', 'commentaugment'],
126 |         description='''
127 |         Collect comments on a subreddit or comments made by a user.
128 |         ''',
129 |     )
130 |     p_get_comments.add_argument(
131 |         '-r',
132 |         '--subreddit',
133 |         dest='subreddit',
134 |         default=None,
135 |     )
136 |     p_get_comments.add_argument(
137 |         '-s',
138 |         '--specific',
139 |         dest='specific_submission',
140 |         default=None,
141 |         help='''
142 |         Given a submission ID like t3_xxxxxx, scan only that submission.
143 |         ''',
144 |     )
145 |     p_get_comments.add_argument(
146 |         '-u',
147 |         '--user',
148 |         dest='username',
149 |         default=None,
150 |     )
151 |     p_get_comments.add_argument(
152 |         '--dont_supplement',
153 |         '--dont-supplement',
154 |         dest='do_supplement',
155 |         action='store_false',
156 |         help='''
157 |         If provided, trust the pushshift data and do not fetch live copies
158 |         from reddit.
159 |         ''',
160 |     )
161 |     p_get_comments.add_argument(
162 |         '--lower',
163 |         dest='lower',
164 |         default='update',
165 |         help='''
166 |         If a number - the unix timestamp to start at.
167 |         If "update" - continue from latest comment in db.
168 |         WARNING: If at some point you collected comments for a particular
169 |         submission which was ahead of the rest of your comments, using "update"
170 |         will start from that later submission, and you will miss the stuff in
171 |         between that specific post and the past.
172 |         ''',
173 |     )
174 |     p_get_comments.add_argument(
175 |         '--upper',
176 |         dest='upper',
177 |         default=None,
178 |         help='''
179 |         If a number - the unix timestamp to stop at.
180 |         If not provided - stop at current time.
181 |         ''',
182 |     )
183 |     p_get_comments.set_defaults(func=get_comments_gateway)
184 | 
185 |     # GET_STYLES
186 |     p_get_styles = subparsers.add_parser(
187 |         'get_styles',
188 |         aliases=['get-styles', 'getstyles'],
189 |         help='''
190 |         Collect the stylesheet, and css images.
191 |         ''',
192 |     )
193 |     p_get_styles.add_argument(
194 |         '-r',
195 |         '--subreddit',
196 |         dest='subreddit',
197 |     )
198 |     p_get_styles.set_defaults(func=get_styles_gateway)
199 | 
200 |     # GET_WIKI
201 |     p_get_wiki = subparsers.add_parser(
202 |         'get_wiki',
203 |         aliases=['get-wiki', 'getwiki'],
204 |         description='''
205 |         Collect all available wiki pages.
206 |         ''',
207 |     )
208 |     p_get_wiki.add_argument(
209 |         '-r',
210 |         '--subreddit',
211 |         dest='subreddit',
212 |     )
213 |     p_get_wiki.set_defaults(func=get_wiki_gateway)
214 | 
215 |     # INGEST_JSONFILE
216 |     p_ingest_jsonfile = subparsers.add_parser(
217 |         'ingest_jsonfile',
218 |         description='''
219 |         This module was added after reddit's June 2023 API changes which
220 |         resulted in pushshift losing API access, and pushshift's own API was
221 |         disabled. The community has made archive files available for download.
222 |         These archive files contain 1 object (a submission or a comment) per
223 |         line in a JSON format.
224 | 
225 |         You can ingest these into timesearch so that you can continue to use
226 |         timesearch's offline_reading or index features.
227 |         ''',
228 |     )
229 |     p_ingest_jsonfile.add_argument(
230 |         'json_file',
231 |         help='''
232 |         Path to a file containing 1 json object per line. Each object must be
233 |         either a submission or a comment.
234 |         ''',
235 |     )
236 |     p_ingest_jsonfile.add_argument(
237 |         '-r',
238 |         '--subreddit',
239 |         dest='subreddit',
240 |         default=None,
241 |     )
242 |     p_ingest_jsonfile.add_argument(
243 |         '-u',
244 |         '--user',
245 |         dest='username',
246 |         default=None,
247 |     )
248 |     p_ingest_jsonfile.set_defaults(func=ingest_jsonfile_gateway)
249 | 
250 |     # LIVESTREAM
251 |     p_livestream = subparsers.add_parser(
252 |         'livestream',
253 |         description='''
254 |         Continously collect submissions and/or comments.
255 |         ''',
256 |     )
257 |     p_livestream.add_argument(
258 |         '--once',
259 |         dest='once',
260 |         action='store_true',
261 |         help='''
262 |         If provided, only do a single loop. Otherwise go forever.
263 |         ''',
264 |     )
265 |     p_livestream.add_argument(
266 |         '-c',
267 |         '--comments',
268 |         dest='comments',
269 |         action='store_true',
270 |         help='''
271 |         If provided, do collect comments. Otherwise don't.
272 | 
273 |         If submissions and comments are BOTH left unspecified, then they will
274 |         BOTH be collected.
275 |         ''',
276 |     )
277 |     p_livestream.add_argument(
278 |         '--limit',
279 |         dest='limit',
280 |         type=int,
281 |         default=None,
282 |         help='''
283 |         Number of items to fetch per request.
284 |         ''',
285 |     )
286 |     p_livestream.add_argument(
287 |         '-r',
288 |         '--subreddit',
289 |         dest='subreddit',
290 |         default=None,
291 |         help='''
292 |         The subreddit to collect from.
293 |         ''',
294 |     )
295 |     p_livestream.add_argument(
296 |         '-s',
297 |         '--submissions',
298 |         dest='submissions',
299 |         action='store_true',
300 |         help='''
301 |         If provided, do collect submissions. Otherwise don't.
302 | 
303 |         If submissions and comments are BOTH left unspecified, then they will
304 |         BOTH be collected.
305 |         ''',
306 |     )
307 |     p_livestream.add_argument(
308 |         '-u',
309 |         '--user',
310 |         dest='username',
311 |         default=None,
312 |         help='''
313 |         The redditor to collect from.
314 |         ''',
315 |     )
316 |     p_livestream.add_argument(
317 |         '-w',
318 |         '--wait',
319 |         dest='sleepy',
320 |         default=30,
321 |         help='''
322 |         The number of seconds to wait between cycles.
323 |         ''',
324 |     )
325 |     p_livestream.set_defaults(func=livestream_gateway)
326 | 
327 |     # MERGEDB'
328 |     p_merge_db = subparsers.add_parser(
329 |         'merge_db',
330 |         aliases=['merge-db', 'mergedb'],
331 |         description='''
332 |         Copy all new posts from one timesearch database into another.
333 |         ''',
334 |     )
335 |     p_merge_db.examples = [
336 |         '--from redditdev1.db --to redditdev2.db',
337 |     ]
338 |     p_merge_db.add_argument(
339 |         '--from',
340 |         dest='from_db_path',
341 |         required=True,
342 |         help='''
343 |         The database file containing the posts you wish to copy.
344 |         ''',
345 |     )
346 |     p_merge_db.add_argument(
347 |         '--to',
348 |         dest='to_db_path',
349 |         required=True,
350 |         help='''
351 |         The database file to which you will copy the posts.
352 |         The database is modified in-place.
353 |         Existing posts will be ignored and not updated.
354 |         ''',
355 |     )
356 |     p_merge_db.set_defaults(func=merge_db_gateway)
357 | 
358 |     # OFFLINE_READING
359 |     p_offline_reading = subparsers.add_parser(
360 |         'offline_reading',
361 |         aliases=['offline-reading'],
362 |         description='''
363 |         Render submissions and comment threads to HTML via Markdown.
364 |         ''',
365 |     )
366 |     p_offline_reading.add_argument(
367 |         '-r',
368 |         '--subreddit',
369 |         dest='subreddit',
370 |         default=None,
371 |     )
372 |     p_offline_reading.add_argument(
373 |         '-s',
374 |         '--specific',
375 |         dest='specific_submission',
376 |         default=None,
377 |         type=str,
378 |         help='''
379 |         Given a submission ID like t3_xxxxxx, render only that submission.
380 |         Otherwise render every submission in the database.
381 |         ''',
382 |     )
383 |     p_offline_reading.add_argument(
384 |         '-u',
385 |         '--user',
386 |         dest='username',
387 |         default=None,
388 |     )
389 |     p_offline_reading.set_defaults(func=offline_reading_gateway)
390 | 
391 |     # INDEX
392 |     p_index = subparsers.add_parser(
393 |         'index',
394 |         aliases=['redmash'],
395 |         description='''
396 |         Dump submission listings to a plaintext or HTML file.
397 |         ''',
398 |     )
399 |     p_index.examples = [
400 |         {
401 |             'args': '-r botwatch --date',
402 |             'comment': 'Does only the date file.'
403 |         },
404 |         {
405 |             'args': '-r botwatch --score --title',
406 |             'comment': 'Does both the score and title files.'
407 |         },
408 |         {
409 |             'args': '-r botwatch --score --score_threshold 50',
410 |             'comment': 'Only shows submissions with >= 50 points.'
411 |         },
412 |         {
413 |             'args': '-r botwatch --all',
414 |             'comment': 'Performs all of the different mashes.'
415 |         },
416 |     ]
417 |     p_index.add_argument(
418 |         '-r',
419 |         '--subreddit',
420 |         dest='subreddit',
421 |         default=None,
422 |         help='''
423 |         The subreddit database to dump.
424 |         ''',
425 |     )
426 |     p_index.add_argument(
427 |         '-u',
428 |         '--user',
429 |         dest='username',
430 |         default=None,
431 |         help='''
432 |         The username database to dump.
433 |         ''',
434 |     )
435 |     p_index.add_argument(
436 |         '--all',
437 |         dest='do_all',
438 |         action='store_true',
439 |         help='''
440 |         Perform all of the indexes listed below.
441 |         ''',
442 |     )
443 |     p_index.add_argument(
444 |         '--author',
445 |         dest='do_author',
446 |         action='store_true',
447 |         help='''
448 |         For subreddit databases only.
449 |         Perform an index sorted by author.
450 |         ''',
451 |     )
452 |     p_index.add_argument(
453 |         '--date',
454 |         dest='do_date',
455 |         action='store_true',
456 |         help='''
457 |         Perform an index sorted by date.
458 |         ''',
459 |     )
460 |     p_index.add_argument(
461 |         '--flair',
462 |         dest='do_flair',
463 |         action='store_true',
464 |         help='''
465 |         Perform an index sorted by flair.
466 |         ''',
467 |     )
468 |     p_index.add_argument(
469 |         '--html',
470 |         dest='html',
471 |         action='store_true',
472 |         help='''
473 |         Write HTML files instead of plain text.
474 |         ''',
475 |     )
476 |     p_index.add_argument(
477 |         '--score',
478 |         dest='do_score',
479 |         action='store_true',
480 |         help='''
481 |         Perform an index sorted by score.
482 |         ''',
483 |     )
484 |     p_index.add_argument(
485 |         '--sub',
486 |         dest='do_subreddit',
487 |         action='store_true',
488 |         help='''
489 |         For username databases only.
490 |         Perform an index sorted by subreddit.
491 |         ''',
492 |     )
493 |     p_index.add_argument(
494 |         '--title',
495 |         dest='do_title',
496 |         action='store_true',
497 |         help='''
498 |         Perform an index sorted by title.
499 |         ''',
500 |     )
501 |     p_index.add_argument(
502 |         '--offline',
503 |         dest='offline',
504 |         action='store_true',
505 |         help='''
506 |         The links in the index will point to the files generated by
507 |         offline_reading. That is, `../offline_reading/fullname.html` instead
508 |         of `http://redd.it/id`. This will NOT trigger offline_reading to
509 |         generate the files now, so you must run that tool separately.
510 |         ''',
511 |     )
512 |     p_index.add_argument(
513 |         '--score_threshold',
514 |         '--score-threshold',
515 |         dest='score_threshold',
516 |         type=int,
517 |         default=0,
518 |         help='''
519 |         Only index posts with at least this many points.
520 |         Applies to ALL indexes!
521 |         ''',
522 |     )
523 |     p_index.set_defaults(func=index_gateway)
524 | 
525 |     # GET_SUBMISSIONS
526 |     p_get_submissions = subparsers.add_parser(
527 |         'get_submissions',
528 |         aliases=['get-submissions', 'timesearch'],
529 |         description='''
530 |         Collect submissions from the subreddit across all of history, or
531 |         Collect submissions by a user (as many as possible).
532 |         ''',
533 |     )
534 |     p_get_submissions.add_argument(
535 |         '--lower',
536 |         dest='lower',
537 |         default='update',
538 |         help='''
539 |         If a number - the unix timestamp to start at.
540 |         If "update" - continue from latest submission in db.
541 |         ''',
542 |     )
543 |     p_get_submissions.add_argument(
544 |         '-r',
545 |         '--subreddit',
546 |         dest='subreddit',
547 |         type=str,
548 |         default=None,
549 |         help='''
550 |         The subreddit to scan. Mutually exclusive with username.
551 |         ''',
552 |     )
553 |     p_get_submissions.add_argument(
554 |         '-u',
555 |         '--user',
556 |         dest='username',
557 |         type=str,
558 |         default=None,
559 |         help='''
560 |         The user to scan. Mutually exclusive with subreddit.
561 |         ''',
562 |     )
563 |     p_get_submissions.add_argument(
564 |         '--upper',
565 |         dest='upper',
566 |         default=None,
567 |         help='''
568 |         If a number - the unix timestamp to stop at.
569 |         If not provided - stop at current time.
570 |         ''',
571 |     )
572 |     p_get_submissions.add_argument(
573 |         '--dont_supplement',
574 |         '--dont-supplement',
575 |         dest='do_supplement',
576 |         action='store_false',
577 |         help='''
578 |         If provided, trust the pushshift data and do not fetch live copies
579 |         from reddit.
580 |         ''',
581 |     )
582 |     p_get_submissions.set_defaults(func=get_submissions_gateway)
583 | 
584 |     try:
585 |         return betterhelp.go(parser, argv)
586 |     except exceptions.DatabaseNotFound as exc:
587 |         message = str(exc)
588 |         message += '\nHave you used any of the other utilities to collect data?'
589 |         print(message)
590 |         return 1
591 | 
592 | if __name__ == '__main__':
593 |     raise SystemExit(main(sys.argv[1:]))
594 | 


--------------------------------------------------------------------------------
/timesearch_logo.svg:
--------------------------------------------------------------------------------
 1 | <?xml version="1.0" encoding="UTF-8" standalone="no"?>
 2 | <svg
 3 |    xmlns:dc="http://purl.org/dc/elements/1.1/"
 4 |    xmlns:cc="http://creativecommons.org/ns#"
 5 |    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
 6 |    xmlns:svg="http://www.w3.org/2000/svg"
 7 |    xmlns="http://www.w3.org/2000/svg"
 8 |    version="1.1"
 9 |    id="svg2"
10 |    viewBox="0 0 599.99999 600.00001"
11 |    height="640"
12 |    width="640">
13 |   <defs
14 |      id="defs4" />
15 |   <metadata
16 |      id="metadata7">
17 |     <rdf:RDF>
18 |       <cc:Work
19 |          rdf:about="">
20 |         <dc:format>image/svg+xml</dc:format>
21 |         <dc:type
22 |            rdf:resource="http://purl.org/dc/dcmitype/StillImage" />
23 |         <dc:title></dc:title>
24 |       </cc:Work>
25 |     </rdf:RDF>
26 |   </metadata>
27 |   <g
28 |      transform="translate(0,-452.36215)"
29 |      id="layer1">
30 |     <path
31 |        style="fill:#c6c6c6;fill-opacity:1;stroke:#000000;stroke-width:9.37499905;stroke-linecap:round;stroke-linejoin:round;stroke-miterlimit:4;stroke-dasharray:none;stroke-opacity:1"
32 |        d="m 485.55308,612.53737 c -18.25596,-7.92065 -33.85084,-1.73869 -48.55914,11.66825 -10.65568,9.71287 -12.59113,26.60632 -13.89476,40.96756 -0.75758,8.34566 3.96897,16.49848 3.65567,24.8726 -0.45786,12.23755 -12.65731,24.50347 -8.04329,35.84661 4.71423,11.58949 19.20868,23.06811 31.44721,20.48177 12.23887,-2.5864 8.19473,-29.09025 20.47817,-31.4558 7.90317,-1.522 6.05477,20.595 16.82079,18.28796 10.76602,-2.30705 31.83849,-11.00314 35.10545,-26.33457 3.26697,-15.33144 2.95885,-42.6636 -2.19352,-59.25723 -4.32357,-13.92437 -21.44382,-29.27515 -34.81658,-35.07715 z"
33 |        id="path873" />
34 |     <ellipse
35 |        ry="152.60158"
36 |        rx="200.65501"
37 |        cy="704.4032"
38 |        cx="300.50983"
39 |        id="path851"
40 |        style="opacity:1;fill:#ffffff;fill-opacity:1;stroke:#000000;stroke-width:9.37499905;stroke-linecap:butt;stroke-linejoin:bevel;stroke-miterlimit:4;stroke-dasharray:none;stroke-dashoffset:0;stroke-opacity:1" />
41 |     <ellipse
42 |        ry="26.345205"
43 |        rx="26.338001"
44 |        cy="668.6427"
45 |        cx="285.36905"
46 |        id="path853"
47 |        style="opacity:1;fill:#ff6600;fill-opacity:1;stroke:none;stroke-width:14.99999714;stroke-linecap:butt;stroke-linejoin:bevel;stroke-miterlimit:4;stroke-dasharray:none;stroke-dashoffset:0;stroke-opacity:1" />
48 |     <path
49 |        id="path857"
50 |        d="m 299.35589,766.22783 49.00136,-7.31563 h 37.29954 l 34.37409,6.58407"
51 |        style="fill:none;stroke:#000000;stroke-width:7.49999952;stroke-linecap:round;stroke-linejoin:round;stroke-miterlimit:4;stroke-dasharray:none;stroke-opacity:1" />
52 |     <path
53 |        id="path855"
54 |        d="m 149.85725,605.9533 c -18.25595,-7.92065 -43.35856,0.45601 -58.066868,13.86294 -10.655684,9.71288 -12.591129,26.60632 -13.894767,40.96756 -0.757574,8.34566 3.968981,16.49848 3.655675,24.87259 -0.457851,12.23757 -12.65731,24.50348 -8.043286,35.84662 4.714229,11.58949 19.208688,23.06811 31.447206,20.48177 12.23887,-2.5864 8.19473,-29.09025 20.47818,-31.4558 7.90317,-1.522 14.63776,7.67764 19.74624,13.89857 16.28496,19.83127 11.21185,54.31101 31.22324,70.37181 3.31854,2.6634 9.30464,-0.26959 12.51158,2.52727 14.71144,12.83023 9.23391,38.57078 19.30193,55.29619 5.80202,9.63857 14.36322,17.47748 22.95236,24.74178 9.36566,7.92104 22.56629,11.06909 30.75652,20.20089 9.78726,10.91244 9.23999,29.22375 20.01785,39.15846 15.55242,14.33575 37.50872,24.25812 58.65759,24.18927 17.7486,-0.0578 33.37815,-12.36931 49.0936,-20.62024 13.52723,-7.10207 29.25407,-12.62348 38.60699,-24.70654 9.20345,-11.88998 6.13205,-30.07518 14.01793,-42.87753 6.70819,-10.89042 21.74217,-15.49045 27.20759,-27.05499 7.73151,-16.35945 11.88378,-37.25118 5.13954,-54.04198 -9.68292,-24.1071 -33.4048,-46.47032 -58.99476,-50.91539 -18.49372,-3.21243 -33.77277,20.53496 -52.54132,20.26374 -17.53096,-0.25333 -30.99311,-21.09169 -48.5097,-20.33775 -29.94136,1.28871 -49.85532,38.32075 -79.62653,41.75975 -6.49388,0.75013 -13.91775,-0.78082 -18.98708,-4.9093 -23.49236,-19.13227 -32.00942,-52.37572 -40.11006,-81.57471 -4.64494,-16.74283 0.76477,-35.34795 -4.38761,-51.94159 -4.32357,-13.92437 -8.27928,-32.20139 -21.65204,-38.00339 z"
55 |        style="fill:#c6c6c6;fill-opacity:1;stroke:#000000;stroke-width:9.375;stroke-linecap:round;stroke-linejoin:round;stroke-miterlimit:4;stroke-dasharray:none;stroke-opacity:1" />
56 |     <ellipse
57 |        ry="26.345205"
58 |        rx="26.338001"
59 |        style="opacity:1;fill:#ff6600;fill-opacity:1;stroke:none;stroke-width:14.99999714;stroke-linecap:butt;stroke-linejoin:bevel;stroke-miterlimit:4;stroke-dasharray:none;stroke-dashoffset:0;stroke-opacity:1"
60 |        id="circle859"
61 |        cx="424.26465"
62 |        cy="666.44806" />
63 |     <path
64 |        id="path870"
65 |        d="m 140.92985,817.7847 c 30.21097,-0.51514 47.46661,-10.24065 80.17373,-13.77331 13.02945,-1.4073 26.36342,-0.33457 39.2046,-2.95252 19.47728,-3.97086 36.62609,-20.25256 56.47368,-19.1582 15.91097,0.8773 26.66811,7.66444 42.57931,8.53763 20.53971,1.12719 32.99689,-12.84471 53.3124,-9.61434 10.33795,1.64385 21.23853,16.47786 30.98696,20.2923 6.66501,2.60793 16.42598,-2.85812 23.55077,-2.18182 26.23303,2.4901 37.60899,11.11223 76.56469,11.20136 -26.24916,-31.99593 -55.49444,-34.28319 -79.91265,-55.68074 -19.04451,-16.68862 -26.59225,-50.28468 -51.23748,-56.08833 -18.271,-4.30259 -33.77277,20.53496 -52.54132,20.26375 -17.53096,-0.25334 -31.05093,-21.94759 -48.5097,-20.33775 -31.20217,2.87708 -52.1718,34.73706 -78.59222,51.58832 -36.23553,23.11146 -62.79616,28.21421 -92.05277,67.90365 z"
66 |        style="fill:#c6c6c6;fill-opacity:1;stroke:#000000;stroke-width:9.375;stroke-linecap:round;stroke-linejoin:round;stroke-miterlimit:4;stroke-dasharray:none;stroke-opacity:1" />
67 |     <path
68 |        id="path875"
69 |        d="m 215.12441,636.42929 c 21.19895,-41.58885 63.39085,-51.07646 111.1433,-42.67115 l -0.96263,31.05274 c -0.28,9.03237 -114.48493,22.20873 -110.18067,11.61841 z"
70 |        style="fill:#c6c6c6;fill-opacity:1;stroke:#000000;stroke-width:9.37499905;stroke-linecap:round;stroke-linejoin:round;stroke-miterlimit:4;stroke-dasharray:none;stroke-opacity:1" />
71 |     <path
72 |        style="fill:#c6c6c6;fill-opacity:1;stroke:#000000;stroke-width:9.37499905;stroke-linecap:round;stroke-linejoin:round;stroke-miterlimit:4;stroke-dasharray:none;stroke-opacity:1"
73 |        d="m 491.23431,634.13913 c -19.99524,-38.10789 -59.79143,-46.80137 -104.83243,-39.0996 l 0.90796,28.45365 c 0.2641,8.27634 107.98433,20.34984 103.92447,10.64595 z"
74 |        id="path879" />
75 |   </g>
76 | </svg>
77 | 


--------------------------------------------------------------------------------
/timesearch_modules/breakdown.py:
--------------------------------------------------------------------------------
  1 | import os
  2 | import json
  3 | 
  4 | from . import common
  5 | from . import tsdb
  6 | 
  7 | 
  8 | def breakdown_database(subreddit=None, username=None):
  9 |     '''
 10 |     Given a database, return a json dict breaking down the submission / comment count for
 11 |     users (if a subreddit database) or subreddits (if a user database).
 12 |     '''
 13 |     if (subreddit is None) == (username is None):
 14 |         raise Exception('Enter subreddit or username but not both')
 15 | 
 16 |     breakdown_results = {}
 17 |     def _ingest(names, subkey):
 18 |         for name in names:
 19 |             breakdown_results.setdefault(name, {})
 20 |             breakdown_results[name].setdefault(subkey, 0)
 21 |             breakdown_results[name][subkey] += 1
 22 | 
 23 |     if subreddit:
 24 |         database = tsdb.TSDB.for_subreddit(subreddit, do_create=False)
 25 |     else:
 26 |         database = tsdb.TSDB.for_user(username, do_create=False)
 27 |     cur = database.sql.cursor()
 28 | 
 29 |     for table in ['submissions', 'comments']:
 30 |         if subreddit:
 31 |             cur.execute('SELECT author FROM %s' % table)
 32 |         elif username:
 33 |             cur.execute('SELECT subreddit FROM %s' % table)
 34 | 
 35 |         names = (row[0] for row in common.fetchgenerator(cur))
 36 |         _ingest(names, table)
 37 | 
 38 |     for name in breakdown_results:
 39 |         breakdown_results[name].setdefault('submissions', 0)
 40 |         breakdown_results[name].setdefault('comments', 0)
 41 | 
 42 |     return breakdown_results
 43 | 
 44 | def breakdown_argparse(args):
 45 |     if args.subreddit:
 46 |         database = tsdb.TSDB.for_subreddit(args.subreddit, do_create=False)
 47 |     else:
 48 |         database = tsdb.TSDB.for_user(args.username, do_create=False)
 49 | 
 50 |     breakdown_results = breakdown_database(
 51 |         subreddit=args.subreddit,
 52 |         username=args.username,
 53 |     )
 54 | 
 55 |     def sort_name(name):
 56 |         return name.lower()
 57 |     def sort_submissions(name):
 58 |         invert_score = -1 * breakdown_results[name]['submissions']
 59 |         return (invert_score, name.lower())
 60 |     def sort_comments(name):
 61 |         invert_score = -1 * breakdown_results[name]['comments']
 62 |         return (invert_score, name.lower())
 63 |     def sort_total_posts(name):
 64 |         invert_score = breakdown_results[name]['submissions'] + breakdown_results[name]['comments']
 65 |         invert_score = -1 * invert_score
 66 |         return (invert_score, name.lower())
 67 |     breakdown_sorters = {
 68 |         'name': sort_name,
 69 |         'submissions': sort_submissions,
 70 |         'comments': sort_comments,
 71 |         'total_posts': sort_total_posts,
 72 |     }
 73 | 
 74 |     breakdown_names = list(breakdown_results.keys())
 75 |     if args.sort is not None:
 76 |         try:
 77 |             sorter = breakdown_sorters[args.sort.lower()]
 78 |         except KeyError:
 79 |             message = '{sorter} is not a sorter. Choose from {options}'
 80 |             message = message.format(sorter=args.sort, options=list(breakdown_sorters.keys()))
 81 |             raise KeyError(message)
 82 |         breakdown_names.sort(key=sorter)
 83 |         dump = '    "{name}": {{"submissions": {submissions}, "comments": {comments}}}'
 84 |         dump = [dump.format(name=name, **breakdown_results[name]) for name in breakdown_names]
 85 |         dump = ',\n'.join(dump)
 86 |         dump = '{\n' + dump + '\n}\n'
 87 |     else:
 88 |         dump = json.dumps(breakdown_results)
 89 | 
 90 |     if args.sort is None:
 91 |         breakdown_basename = '%s_breakdown.json'
 92 |     else:
 93 |         breakdown_basename = '%%s_breakdown_%s.json' % args.sort
 94 | 
 95 |     breakdown_basename = breakdown_basename % database.filepath.replace_extension('').basename
 96 |     breakdown_filepath = database.breakdown_dir.with_child(breakdown_basename)
 97 |     breakdown_filepath.parent.makedirs(exist_ok=True)
 98 |     breakdown_file = breakdown_filepath.open('w')
 99 |     with breakdown_file:
100 |         breakdown_file.write(dump)
101 |     print('Wrote', breakdown_filepath.relative_path)
102 | 
103 |     return breakdown_results
104 | 


--------------------------------------------------------------------------------
/timesearch_modules/common.py:
--------------------------------------------------------------------------------
  1 | import datetime
  2 | import logging
  3 | import os
  4 | import time
  5 | import traceback
  6 | 
  7 | from voussoirkit import vlogging
  8 | 
  9 | VERSION = '2020.09.06.0'
 10 | 
 11 | try:
 12 |     import praw
 13 | except ImportError:
 14 |     praw = None
 15 | if praw is None or praw.__version__.startswith('3.'):
 16 |     import praw4
 17 |     praw = praw4
 18 | 
 19 | try:
 20 |     import bot
 21 | except ImportError:
 22 |     bot = None
 23 | if bot is None or bot.praw != praw:
 24 |     try:
 25 |         import bot4
 26 |         bot = bot4
 27 |     except ImportError:
 28 |         message = '\n'.join([
 29 |         'Could not find your PRAW4 bot file as either `bot.py` or `bot4.py`.',
 30 |         'Please see the README.md file for instructions on how to prepare it.'
 31 |         ])
 32 |         raise ImportError(message)
 33 | 
 34 | 
 35 | log = vlogging.get_logger(__name__)
 36 | 
 37 | r = bot.anonymous()
 38 | 
 39 | def assert_file_exists(filepath):
 40 |     if not os.path.exists(filepath):
 41 |         raise FileNotFoundError(filepath)
 42 | 
 43 | def b36(i):
 44 |     if isinstance(i, int):
 45 |         return base36encode(i)
 46 |     return base36decode(i)
 47 | 
 48 | def base36decode(number):
 49 |     return int(number, 36)
 50 | 
 51 | def base36encode(number, alphabet='0123456789abcdefghijklmnopqrstuvwxyz'):
 52 |     """Converts an integer to a base36 string."""
 53 |     if not isinstance(number, (int)):
 54 |         raise TypeError('number must be an integer')
 55 |     base36 = ''
 56 |     sign = ''
 57 |     if number < 0:
 58 |         sign = '-'
 59 |         number = -number
 60 |     if 0 <= number < len(alphabet):
 61 |         return sign + alphabet[number]
 62 |     while number != 0:
 63 |         number, i = divmod(number, len(alphabet))
 64 |         base36 = alphabet[i] + base36
 65 |     return sign + base36
 66 | 
 67 | def fetchgenerator(cursor):
 68 |     while True:
 69 |         item = cursor.fetchone()
 70 |         if item is None:
 71 |             break
 72 |         yield item
 73 | 
 74 | def generator_chunker(generator, chunk_size):
 75 |     '''
 76 |     Given an item generator, yield lists of length chunk_size, except maybe
 77 |     the last one.
 78 |     '''
 79 |     chunk = []
 80 |     for item in generator:
 81 |         chunk.append(item)
 82 |         if len(chunk) == chunk_size:
 83 |             yield chunk
 84 |             chunk = []
 85 |     if len(chunk) != 0:
 86 |         yield chunk
 87 | 
 88 | def get_now(stamp=True):
 89 |     now = datetime.datetime.now(datetime.timezone.utc)
 90 |     if stamp:
 91 |         return int(now.timestamp())
 92 |     return now
 93 | 
 94 | def human(timestamp):
 95 |     x = datetime.datetime.utcfromtimestamp(timestamp)
 96 |     x = datetime.datetime.strftime(x, "%b %d %Y %H:%M:%S")
 97 |     return x
 98 | 
 99 | def int_none(x):
100 |     if x is None:
101 |         return None
102 |     return int(x)
103 | 
104 | def is_xor(*args):
105 |     '''
106 |     Return True if and only if one arg is truthy.
107 |     '''
108 |     return [bool(a) for a in args].count(True) == 1
109 | 
110 | def login():
111 |     global r
112 |     log.debug('Logging in to reddit.')
113 |     r = bot.login(r)
114 | 
115 | def nofailrequest(function):
116 |     '''
117 |     Creates a function that will retry until it succeeds.
118 |     This function accepts 1 parameter, a function, and returns a modified
119 |     version of that function that will try-catch, sleep, and loop until it
120 |     finally returns.
121 |     '''
122 |     def a(*args, **kwargs):
123 |         while True:
124 |             try:
125 |                 result = function(*args, **kwargs)
126 |                 return result
127 |             except KeyboardInterrupt:
128 |                 raise
129 |             except Exception:
130 |                 traceback.print_exc()
131 |                 print('Retrying in 2...')
132 |                 time.sleep(2)
133 |     return a
134 | 
135 | def split_any(text, delimiters):
136 |     delimiters = list(delimiters)
137 |     (splitter, replacers) = (delimiters[0], delimiters[1:])
138 |     for replacer in replacers:
139 |         text = text.replace(replacer, splitter)
140 |     return text.split(splitter)
141 | 
142 | def subreddit_for_submission(submission_id):
143 |     submission_id = t3_prefix(submission_id)[3:]
144 |     submission = r.submission(submission_id)
145 |     return submission.subreddit
146 | 
147 | def t3_prefix(submission_id):
148 |     if not submission_id.startswith('t3_'):
149 |         submission_id = 't3_' + submission_id
150 |     return submission_id
151 | 


--------------------------------------------------------------------------------
/timesearch_modules/exceptions.py:
--------------------------------------------------------------------------------
 1 | class TimesearchException(Exception):
 2 |     '''
 3 |     Base type for all of the Timesearch exceptions.
 4 |     Subtypes should have a class attribute `error_message`. The error message
 5 |     may contain {format} strings which will be formatted using the
 6 |     Exception's constructor arguments.
 7 |     '''
 8 |     error_message = ''
 9 |     def __init__(self, *args, **kwargs):
10 |         self.given_args = args
11 |         self.given_kwargs = kwargs
12 |         self.error_message = self.error_message.format(*args, **kwargs)
13 |         self.args = (self.error_message, args, kwargs)
14 | 
15 |     def __str__(self):
16 |         return self.error_message
17 | 
18 | OUTOFDATE = '''
19 | Database is out of date. {current} should be {new}.
20 | Please run utilities\\database_upgrader.py "{filepath.absolute_path}"
21 | '''.strip()
22 | class DatabaseOutOfDate(TimesearchException):
23 |     '''
24 |     Raised by TSDB __init__ if the user's database is behind.
25 |     '''
26 |     error_message = OUTOFDATE
27 | 
28 | class DatabaseNotFound(TimesearchException, FileNotFoundError):
29 |     error_message = 'Database file not found: "{}"'
30 | 
31 | class NotExclusive(TimesearchException):
32 |     '''
33 |     For when two or more mutually exclusive actions have been requested.
34 |     '''
35 |     error_message = 'One and only one of {} must be passed.'
36 | 


--------------------------------------------------------------------------------
/timesearch_modules/get_comments.py:
--------------------------------------------------------------------------------
  1 | import traceback
  2 | 
  3 | from . import common
  4 | from . import exceptions
  5 | from . import pushshift
  6 | from . import tsdb
  7 | 
  8 | def get_comments(
  9 |         subreddit=None,
 10 |         username=None,
 11 |         specific_submission=None,
 12 |         do_supplement=True,
 13 |         lower=None,
 14 |         upper=None,
 15 |     ):
 16 |     if not specific_submission and not common.is_xor(subreddit, username):
 17 |         raise exceptions.NotExclusive(['subreddit', 'username'])
 18 |     if username and specific_submission:
 19 |         raise exceptions.NotExclusive(['username', 'specific_submission'])
 20 | 
 21 |     common.login()
 22 | 
 23 |     if specific_submission:
 24 |         (database, subreddit) = tsdb.TSDB.for_submission(specific_submission, do_create=True, fix_name=True)
 25 |         specific_submission = common.t3_prefix(specific_submission)[3:]
 26 |         specific_submission = common.r.submission(specific_submission)
 27 |         database.insert(specific_submission)
 28 | 
 29 |     elif subreddit:
 30 |         (database, subreddit) = tsdb.TSDB.for_subreddit(subreddit, do_create=True, fix_name=True)
 31 | 
 32 |     else:
 33 |         (database, username) = tsdb.TSDB.for_user(username, do_create=True, fix_name=True)
 34 | 
 35 |     cur = database.sql.cursor()
 36 | 
 37 |     if lower is None:
 38 |         lower = 0
 39 |     if lower == 'update':
 40 |         query_latest = 'SELECT created FROM comments ORDER BY created DESC LIMIT 1'
 41 |         if subreddit:
 42 |             # Instead of blindly taking the highest timestamp currently in the db,
 43 |             # we must consider the case that the user has previously done a
 44 |             # specific_submission scan and now wants to do a general scan, which
 45 |             # would trick the latest timestamp into missing anything before that
 46 |             # specific submission.
 47 |             query = '''
 48 |             SELECT created FROM comments WHERE NOT EXISTS (
 49 |                 SELECT 1 FROM submissions
 50 |                 WHERE submissions.idstr == comments.submission
 51 |                 AND submissions.augmented_at IS NOT NULL
 52 |             )
 53 |             ORDER BY created DESC LIMIT 1
 54 |             '''
 55 |             unaugmented = cur.execute(query).fetchone()
 56 |             if unaugmented:
 57 |                 lower = unaugmented[0] - 1
 58 |             else:
 59 |                 latest = cur.execute(query_latest).fetchone()
 60 |                 if latest:
 61 |                     lower = latest[0] - 1
 62 |         if username:
 63 |             latest = cur.execute(query_latest).fetchone()
 64 |             if latest:
 65 |                 lower = latest[0] - 1
 66 |     if lower == 'update':
 67 |         lower = 0
 68 | 
 69 |     if specific_submission:
 70 |         comments = pushshift.get_comments_from_submission(specific_submission)
 71 |     elif subreddit:
 72 |         comments = pushshift.get_comments_from_subreddit(subreddit, lower=lower, upper=upper)
 73 |     elif username:
 74 |         comments = pushshift.get_comments_from_user(username, lower=lower, upper=upper)
 75 | 
 76 |     if do_supplement:
 77 |         comments = pushshift.supplement_reddit_data(comments, chunk_size=100)
 78 |     comments = common.generator_chunker(comments, 500)
 79 | 
 80 |     form = '{lower} ({lower_unix}) - {upper} ({upper_unix}) +{gain}'
 81 |     for chunk in comments:
 82 |         step = database.insert(chunk)
 83 |         message = form.format(
 84 |             lower=common.human(chunk[0].created_utc),
 85 |             upper=common.human(chunk[-1].created_utc),
 86 |             lower_unix=int(chunk[0].created_utc),
 87 |             upper_unix=int(chunk[-1].created_utc),
 88 |             gain=step['new_comments'],
 89 |         )
 90 |         print(message)
 91 | 
 92 |     if specific_submission:
 93 |         query = '''
 94 |             UPDATE submissions
 95 |             set augmented_at = ?
 96 |             WHERE idstr == ?
 97 |         '''
 98 |         bindings = [common.get_now(), specific_submission.fullname]
 99 |         cur.execute(query, bindings)
100 |         database.sql.commit()
101 | 
102 | def get_comments_argparse(args):
103 |     return get_comments(
104 |         subreddit=args.subreddit,
105 |         username=args.username,
106 |         #limit=common.int_none(args.limit),
107 |         #threshold=common.int_none(args.threshold),
108 |         #num_thresh=common.int_none(args.num_thresh),
109 |         specific_submission=args.specific_submission,
110 |         do_supplement=args.do_supplement,
111 |         lower=args.lower,
112 |         upper=args.upper,
113 |     )
114 | 


--------------------------------------------------------------------------------
/timesearch_modules/get_styles.py:
--------------------------------------------------------------------------------
 1 | import os
 2 | import requests
 3 | 
 4 | from . import common
 5 | from . import tsdb
 6 | 
 7 | session = requests.Session()
 8 | 
 9 | def get_styles(subreddit):
10 |     (database, subreddit) = tsdb.TSDB.for_subreddit(subreddit, fix_name=True)
11 | 
12 |     print('Getting styles for /r/%s' % subreddit)
13 |     subreddit = common.r.subreddit(subreddit)
14 |     styles = subreddit.stylesheet()
15 | 
16 |     database.styles_dir.makedirs(exist_ok=True)
17 | 
18 |     stylesheet_filepath = database.styles_dir.with_child('stylesheet.css')
19 |     print('Downloading %s' % stylesheet_filepath.relative_path)
20 |     with stylesheet_filepath.open('w', encoding='utf-8') as stylesheet:
21 |         stylesheet.write(styles.stylesheet)
22 | 
23 |     for image in styles.images:
24 |         image_basename = image['name'] + '.' + image['url'].split('.')[-1]
25 |         image_filepath = database.styles_dir.with_child(image_basename)
26 |         print('Downloading %s' % image_filepath.relative_path)
27 |         with image_filepath.open('wb') as image_file:
28 |             response = session.get(image['url'])
29 |             image_file.write(response.content)
30 | 
31 | def get_styles_argparse(args):
32 |     return get_styles(args.subreddit)
33 | 


--------------------------------------------------------------------------------
/timesearch_modules/get_submissions.py:
--------------------------------------------------------------------------------
  1 | import time
  2 | import traceback
  3 | 
  4 | from . import common
  5 | from . import exceptions
  6 | from . import pushshift
  7 | from . import tsdb
  8 | 
  9 | def _normalize_subreddit(subreddit):
 10 |     if subreddit is None:
 11 |         pass
 12 |     elif isinstance(subreddit, str):
 13 |         subreddit = common.r.subreddit(subreddit)
 14 |     elif not isinstance(subreddit, common.praw.models.Subreddit):
 15 |         raise TypeError(type(subreddit))
 16 |     return subreddit
 17 | 
 18 | def _normalize_user(user):
 19 |     if user is None:
 20 |         pass
 21 |     elif isinstance(user, str):
 22 |         user = common.r.redditor(user)
 23 |     elif not isinstance(user, common.praw.models.Redditor):
 24 |         raise TypeError(type(user))
 25 |     return user
 26 | 
 27 | def get_submissions(
 28 |         subreddit=None,
 29 |         username=None,
 30 |         lower=None,
 31 |         upper=None,
 32 |         do_supplement=True,
 33 |     ):
 34 |     '''
 35 |     Collect submissions across time.
 36 |     Please see the global DOCSTRING variable.
 37 |     '''
 38 |     if not common.is_xor(subreddit, username):
 39 |         raise exceptions.NotExclusive(['subreddit', 'username'])
 40 | 
 41 |     common.login()
 42 | 
 43 |     if subreddit:
 44 |         (database, subreddit) = tsdb.TSDB.for_subreddit(subreddit, fix_name=True)
 45 |     elif username:
 46 |         (database, username) = tsdb.TSDB.for_user(username, fix_name=True)
 47 |     cur = database.sql.cursor()
 48 | 
 49 |     subreddit = _normalize_subreddit(subreddit)
 50 |     user = _normalize_user(username)
 51 | 
 52 |     if lower == 'update':
 53 |         # Start from the latest submission
 54 |         cur.execute('SELECT created FROM submissions ORDER BY created DESC LIMIT 1')
 55 |         fetch = cur.fetchone()
 56 |         if fetch is not None:
 57 |             lower = fetch[0]
 58 |         else:
 59 |             lower = None
 60 |     if lower is None:
 61 |         lower = 0
 62 | 
 63 |     if username:
 64 |         submissions = pushshift.get_submissions_from_user(username, lower=lower, upper=upper)
 65 |     else:
 66 |         submissions = pushshift.get_submissions_from_subreddit(subreddit, lower=lower, upper=upper)
 67 | 
 68 |     if do_supplement:
 69 |         submissions = pushshift.supplement_reddit_data(submissions, chunk_size=100)
 70 |     submissions = common.generator_chunker(submissions, 200)
 71 | 
 72 |     form = '{lower} ({lower_unix}) - {upper} ({upper_unix}) +{gain}'
 73 |     for chunk in submissions:
 74 |         chunk.sort(key=lambda x: x.created_utc)
 75 |         step = database.insert(chunk)
 76 |         message = form.format(
 77 |             lower=common.human(chunk[0].created_utc),
 78 |             upper=common.human(chunk[-1].created_utc),
 79 |             lower_unix=int(chunk[0].created_utc),
 80 |             upper_unix=int(chunk[-1].created_utc),
 81 |             gain=step['new_submissions'],
 82 |         )
 83 |         print(message)
 84 | 
 85 |     cur.execute('SELECT COUNT(idint) FROM submissions')
 86 |     itemcount = cur.fetchone()[0]
 87 | 
 88 |     print('Ended with %d items in %s' % (itemcount, database.filepath.basename))
 89 | 
 90 | def get_submissions_argparse(args):
 91 |     if args.lower == 'update':
 92 |         lower = 'update'
 93 |     else:
 94 |         lower = common.int_none(args.lower)
 95 | 
 96 |     return get_submissions(
 97 |         subreddit=args.subreddit,
 98 |         username=args.username,
 99 |         lower=lower,
100 |         upper=common.int_none(args.upper),
101 |         do_supplement=args.do_supplement,
102 |     )
103 | 


--------------------------------------------------------------------------------
/timesearch_modules/get_wiki.py:
--------------------------------------------------------------------------------
 1 | import os
 2 | import markdown
 3 | 
 4 | from . import common
 5 | from . import tsdb
 6 | 
 7 | 
 8 | def get_wiki(subreddit):
 9 |     (database, subreddit) = tsdb.TSDB.for_subreddit(subreddit, fix_name=True)
10 | 
11 |     print('Getting wiki pages for /r/%s' % subreddit)
12 |     subreddit = common.r.subreddit(subreddit)
13 | 
14 |     for wikipage in subreddit.wiki:
15 |         if wikipage.name == 'config/stylesheet':
16 |             continue
17 | 
18 |         wikipage_path = database.wiki_dir.join(wikipage.name).add_extension('md')
19 |         wikipage_path.parent.makedirs(exist_ok=True)
20 |         wikipage_path.write('w', wikipage.content_md, encoding='utf-8')
21 |         print('Wrote', wikipage_path.relative_path)
22 | 
23 |         html_path = wikipage_path.replace_extension('html')
24 |         escaped = wikipage.content_md.replace('<', '&lt;').replace('>', '&rt;')
25 |         html_path.write('w', markdown.markdown(escaped, output_format='html5'), encoding='utf-8')
26 |         print('Wrote', html_path.relative_path)
27 | 
28 | def get_wiki_argparse(args):
29 |     return get_wiki(args.subreddit)
30 | 


--------------------------------------------------------------------------------
/timesearch_modules/index.py:
--------------------------------------------------------------------------------
  1 | import datetime
  2 | import os
  3 | 
  4 | from . import common
  5 | from . import exceptions
  6 | from . import tsdb
  7 | 
  8 | 
  9 | LINE_FORMAT_TXT = '''
 10 | {timestamp}: [{title}]({link}) - /u/{author} (+{score})
 11 | '''.replace('\n', '')
 12 | 
 13 | LINE_FORMAT_HTML = '''
 14 | <div>{timestamp}: <a href="{link}">[{flairtext}] {title}</a> - <a href="{authorlink}">{author}</a> (+{score})</div>
 15 | '''.replace('\n', '')
 16 | 
 17 | TIMESTAMP_FORMAT = '%Y %b %d'
 18 | # The time format.
 19 | # "%Y %b %d" = "2016 August 10"
 20 | # See http://strftime.org/
 21 | 
 22 | HTML_HEADER = '''
 23 | <html>
 24 | <head>
 25 | <meta charset="UTF-8">
 26 | <style>
 27 |     *
 28 |     {
 29 |         font-family: Consolas;
 30 |     }
 31 | </style>
 32 | </head>
 33 | 
 34 | <body>
 35 | '''
 36 | 
 37 | HTML_FOOTER = '''
 38 | </body>
 39 | </html>
 40 | '''
 41 | 
 42 | 
 43 | def index(
 44 |         subreddit=None,
 45 |         username=None,
 46 |         do_all=False,
 47 |         do_date=False,
 48 |         do_title=False,
 49 |         do_score=False,
 50 |         do_author=False,
 51 |         do_subreddit=False,
 52 |         do_flair=False,
 53 |         html=False,
 54 |         offline=False,
 55 |         score_threshold=0,
 56 |     ):
 57 |     if not common.is_xor(subreddit, username):
 58 |         raise exceptions.NotExclusive(['subreddit', 'username'])
 59 | 
 60 |     if subreddit:
 61 |         database = tsdb.TSDB.for_subreddit(subreddit, do_create=False)
 62 |     else:
 63 |         database = tsdb.TSDB.for_user(username, do_create=False)
 64 | 
 65 |     kwargs = {'html': html, 'offline': offline, 'score_threshold': score_threshold}
 66 |     wrote = None
 67 | 
 68 |     if do_all or do_date:
 69 |         print('Writing time file')
 70 |         wrote = index_worker(database, suffix='_date', orderby='created ASC', **kwargs)
 71 | 
 72 |     if do_all or do_title:
 73 |         print('Writing title file')
 74 |         wrote = index_worker(database, suffix='_title', orderby='title ASC', **kwargs)
 75 | 
 76 |     if do_all or do_score:
 77 |         print('Writing score file')
 78 |         wrote = index_worker(database, suffix='_score', orderby='score DESC', **kwargs)
 79 | 
 80 |     if not username and (do_all or do_author):
 81 |         print('Writing author file')
 82 |         wrote = index_worker(database, suffix='_author', orderby='author ASC', **kwargs)
 83 | 
 84 |     if username and (do_all or do_subreddit):
 85 |         print('Writing subreddit file')
 86 |         wrote = index_worker(database, suffix='_subreddit', orderby='subreddit ASC', **kwargs)
 87 | 
 88 |     if do_all or do_flair:
 89 |         print('Writing flair file')
 90 |         # Items with flair come before items without. Each group is sorted by time separately.
 91 |         orderby = 'flair_text IS NULL ASC, created ASC'
 92 |         wrote = index_worker(database, suffix='_flair', orderby=orderby, **kwargs)
 93 | 
 94 |     if not wrote:
 95 |         raise Exception('No sorts selected! Read the docstring')
 96 |     print('Done.')
 97 | 
 98 | def index_worker(
 99 |         database,
100 |         suffix,
101 |         orderby,
102 |         score_threshold=0,
103 |         html=False,
104 |         offline=False,
105 |     ):
106 |     cur = database.sql.cursor()
107 |     statement = 'SELECT * FROM submissions WHERE score >= {threshold} ORDER BY {order}'
108 |     statement = statement.format(threshold=score_threshold, order=orderby)
109 |     cur.execute(statement)
110 | 
111 |     database.index_dir.makedirs(exist_ok=True)
112 | 
113 |     extension = '.html' if html else '.txt'
114 |     mash_basename = database.filepath.replace_extension('').basename
115 |     mash_basename += suffix + extension
116 |     mash_filepath = database.index_dir.with_child(mash_basename)
117 | 
118 |     mash_handle = mash_filepath.open('w', encoding='UTF-8')
119 |     if html:
120 |         mash_handle.write(HTML_HEADER)
121 |         line_format = LINE_FORMAT_HTML
122 |     else:
123 |         line_format = LINE_FORMAT_TXT
124 | 
125 |     do_timestamp = '{timestamp}' in line_format
126 | 
127 |     for submission in common.fetchgenerator(cur):
128 |         submission = tsdb.DBEntry(submission)
129 | 
130 |         if do_timestamp:
131 |             timestamp = int(submission.created)
132 |             timestamp = datetime.datetime.utcfromtimestamp(timestamp)
133 |             timestamp = timestamp.strftime(TIMESTAMP_FORMAT)
134 |         else:
135 |             timestamp = ''
136 | 
137 |         if offline:
138 |             link = f'../offline_reading/{submission.idstr}.html'
139 |         else:
140 |             link = f'https://redd.it/{submission.idstr[3:]}'
141 | 
142 |         author = submission.author
143 |         if author.lower() == '[deleted]':
144 |             author_link = '#'
145 |         else:
146 |             author_link = 'https://reddit.com/u/%s' % author
147 | 
148 |         line = line_format.format(
149 |             author=author,
150 |             authorlink=author_link,
151 |             flaircss=submission.flair_css_class or '',
152 |             flairtext=submission.flair_text or '',
153 |             id=submission.idstr,
154 |             numcomments=submission.num_comments,
155 |             score=submission.score,
156 |             link=link,
157 |             subreddit=submission.subreddit,
158 |             timestamp=timestamp,
159 |             title=submission.title.replace('\n', ' '),
160 |             url=submission.url or link,
161 |         )
162 |         line += '\n'
163 |         mash_handle.write(line)
164 | 
165 |     if html:
166 |         mash_handle.write(HTML_FOOTER)
167 |     mash_handle.close()
168 |     print('Wrote', mash_filepath.relative_path)
169 |     return mash_filepath
170 | 
171 | def index_argparse(args):
172 |     return index(
173 |         subreddit=args.subreddit,
174 |         username=args.username,
175 |         do_all=args.do_all,
176 |         do_date=args.do_date,
177 |         do_title=args.do_title,
178 |         do_score=args.do_score,
179 |         do_author=args.do_author,
180 |         do_subreddit=args.do_subreddit,
181 |         do_flair=args.do_flair,
182 |         html=args.html,
183 |         offline=args.offline,
184 |         score_threshold=common.int_none(args.score_threshold),
185 |     )
186 | 


--------------------------------------------------------------------------------
/timesearch_modules/ingest_jsonfile.py:
--------------------------------------------------------------------------------
 1 | import json
 2 | import time
 3 | import traceback
 4 | 
 5 | from voussoirkit import pathclass
 6 | 
 7 | from . import common
 8 | from . import exceptions
 9 | from . import pushshift
10 | from . import tsdb
11 | 
12 | def is_submission(obj):
13 |     return (
14 |         obj.get('name', '').startswith('t3_')
15 |         or obj.get('over_18') is not None
16 |     )
17 | 
18 | def is_comment(obj):
19 |     return (
20 |         obj.get('name', '').startswith('t1_')
21 |         or obj.get('parent_id', '').startswith('t3_')
22 |         or obj.get('link_id', '').startswith('t3_')
23 |     )
24 | 
25 | def jsonfile_to_objects(filepath):
26 |     filepath = pathclass.Path(filepath)
27 |     filepath.assert_is_file()
28 | 
29 |     with filepath.open('r', encoding='utf-8') as handle:
30 |         for line in handle:
31 |             line = line.strip()
32 |             if not line:
33 |                 break
34 |             obj = json.loads(line)
35 |             if is_submission(obj):
36 |                 yield pushshift.DummySubmission(**obj)
37 |             elif is_comment(obj):
38 |                 yield pushshift.DummyComment(**obj)
39 |             else:
40 |                 raise ValueError(f'Could not recognize object type {obj}.')
41 | 
42 | def ingest_jsonfile(
43 |         filepath,
44 |         subreddit=None,
45 |         username=None,
46 |     ):
47 |     if not common.is_xor(subreddit, username):
48 |         raise exceptions.NotExclusive(['subreddit', 'username'])
49 | 
50 |     if subreddit:
51 |         (database, subreddit) = tsdb.TSDB.for_subreddit(subreddit, fix_name=True)
52 |     elif username:
53 |         (database, username) = tsdb.TSDB.for_user(username, fix_name=True)
54 |     cur = database.sql.cursor()
55 | 
56 |     objects = jsonfile_to_objects(filepath)
57 |     database.insert(objects)
58 | 
59 |     cur.execute('SELECT COUNT(idint) FROM submissions')
60 |     submissioncount = cur.fetchone()[0]
61 |     cur.execute('SELECT COUNT(idint) FROM comments')
62 |     commentcount = cur.fetchone()[0]
63 | 
64 |     print('Ended with %d submissions and %d comments in %s' % (submissioncount, commentcount, database.filepath.basename))
65 | 
66 | def ingest_jsonfile_argparse(args):
67 |     return ingest_jsonfile(
68 |         subreddit=args.subreddit,
69 |         username=args.username,
70 |         filepath=args.json_file,
71 |     )
72 | 


--------------------------------------------------------------------------------
/timesearch_modules/livestream.py:
--------------------------------------------------------------------------------
  1 | import copy
  2 | import prawcore
  3 | import time
  4 | import traceback
  5 | 
  6 | from . import common
  7 | from . import exceptions
  8 | from . import tsdb
  9 | 
 10 | from voussoirkit import vlogging
 11 | 
 12 | log = vlogging.get_logger(__name__)
 13 | 
 14 | def _listify(x):
 15 |     '''
 16 |     The user may have given us a string containing multiple subreddits / users.
 17 |     Try to split that up into a list of names.
 18 |     '''
 19 |     if not x:
 20 |         return []
 21 |     if isinstance(x, str):
 22 |         return common.split_any(x, ['+', ' ', ','])
 23 |     return x
 24 | 
 25 | def generator_printer(generator):
 26 |     '''
 27 |     Given a generator that produces livestream update steps, print them out.
 28 |     This yields None because print returns None.
 29 |     '''
 30 |     prev_message_length = 0
 31 |     for step in generator:
 32 |         newtext = '%s: +%ds, %dc' % (step['tsdb'].filepath.basename, step['new_submissions'], step['new_comments'])
 33 |         totalnew = step['new_submissions'] + step['new_comments']
 34 |         status = '{now} {new}'.format(now=common.human(common.get_now()), new=newtext)
 35 |         clear_prev = (' ' * prev_message_length) + '\r'
 36 |         print(clear_prev + status, end='')
 37 |         prev_message_length = len(status)
 38 |         if totalnew == 0 and log.level == 0 or log.level > vlogging.DEBUG:
 39 |             # Since there were no news, allow the next line to overwrite status
 40 |             print('\r', end='', flush=True)
 41 |         else:
 42 |             print()
 43 |         yield None
 44 | 
 45 | def cycle_generators(generators, only_once, sleepy):
 46 |     '''
 47 |     Given multiple generators, yield an item from each one, cycling through
 48 |     them in a round-robin fashion.
 49 | 
 50 |     This is useful if you want to convert multiple livestream generators into a
 51 |     single generator that take turns updating each of them and yields all of
 52 |     their items.
 53 |     '''
 54 |     while True:
 55 |         for generator in generators:
 56 |             yield next(generator)
 57 |         if only_once:
 58 |             break
 59 |         time.sleep(sleepy)
 60 | 
 61 | def livestream(
 62 |         subreddit=None,
 63 |         username=None,
 64 |         as_a_generator=False,
 65 |         do_submissions=True,
 66 |         do_comments=True,
 67 |         limit=100,
 68 |         only_once=False,
 69 |         sleepy=30,
 70 |     ):
 71 |     '''
 72 |     Continuously get posts from this source and insert them into the database.
 73 | 
 74 |     as_a_generator:
 75 |         Return a generator where every iteration does a single livestream loop
 76 |         and yields the return value of TSDB.insert (A summary of new
 77 |         submission & comment count).
 78 |         This is useful if you want to manage the generator yourself.
 79 |         Otherwise, this function will run the generator forever.
 80 |     '''
 81 |     subreddits = _listify(subreddit)
 82 |     usernames = _listify(username)
 83 |     kwargs = {
 84 |         'do_submissions': do_submissions,
 85 |         'do_comments': do_comments,
 86 |         'limit': limit,
 87 |         'params': {'show': 'all'},
 88 |     }
 89 | 
 90 |     subreddit_generators = [
 91 |         _livestream_as_a_generator(subreddit=subreddit, username=None, **kwargs) for subreddit in subreddits
 92 |     ]
 93 |     user_generators = [
 94 |         _livestream_as_a_generator(subreddit=None, username=username, **kwargs) for username in usernames
 95 |     ]
 96 |     generators = subreddit_generators + user_generators
 97 | 
 98 |     if as_a_generator:
 99 |         if len(generators) == 1:
100 |             return generators[0]
101 |         return generators
102 | 
103 |     generator = cycle_generators(generators, only_once=only_once, sleepy=sleepy)
104 |     generator = generator_printer(generator)
105 | 
106 |     try:
107 |         for step in generator:
108 |             pass
109 |     except KeyboardInterrupt:
110 |         print()
111 |         return
112 | 
113 | hangman = lambda: livestream(
114 |     username='gallowboob',
115 |     do_submissions=True,
116 |     do_comments=True,
117 |     sleepy=60,
118 | )
119 | 
120 | def _livestream_as_a_generator(
121 |         subreddit,
122 |         username,
123 |         do_submissions,
124 |         do_comments,
125 |         limit,
126 |         params,
127 |     ):
128 | 
129 |     if not common.is_xor(subreddit, username):
130 |         raise exceptions.NotExclusive(['subreddit', 'username'])
131 | 
132 |     if not any([do_submissions, do_comments]):
133 |         raise TypeError('Required do_submissions and/or do_comments parameter')
134 |     common.login()
135 | 
136 |     if subreddit:
137 |         log.debug('Getting subreddit %s', subreddit)
138 |         (database, subreddit) = tsdb.TSDB.for_subreddit(subreddit, fix_name=True)
139 |         subreddit = common.r.subreddit(subreddit)
140 |         submission_function = subreddit.new if do_submissions else None
141 |         comment_function = subreddit.comments if do_comments else None
142 |     else:
143 |         log.debug('Getting redditor %s', username)
144 |         (database, username) = tsdb.TSDB.for_user(username, fix_name=True)
145 |         user = common.r.redditor(username)
146 |         submission_function = user.submissions.new if do_submissions else None
147 |         comment_function = user.comments.new if do_comments else None
148 | 
149 |     while True:
150 |         try:
151 |             items = _livestream_helper(
152 |                 submission_function=submission_function,
153 |                 comment_function=comment_function,
154 |                 limit=limit,
155 |                 params=params,
156 |             )
157 |             newitems = database.insert(items)
158 |             yield newitems
159 |         except prawcore.exceptions.NotFound:
160 |             print(database.filepath.basename, '404 not found')
161 |             step = {'tsdb': database, 'new_comments': 0, 'new_submissions': 0}
162 |             yield step
163 |         except Exception:
164 |             traceback.print_exc()
165 |             print('Retrying...')
166 |             step = {'tsdb': database, 'new_comments': 0, 'new_submissions': 0}
167 |             yield step
168 | 
169 | def _livestream_helper(
170 |         submission_function=None,
171 |         comment_function=None,
172 |         *args,
173 |         **kwargs,
174 |     ):
175 |     '''
176 |     Given a submission-retrieving function and/or a comment-retrieving function,
177 |     collect submissions and comments in a list together and return that.
178 | 
179 |     args and kwargs go into the collecting functions.
180 |     '''
181 |     if not any([submission_function, comment_function]):
182 |         raise TypeError('Required submissions and/or comments parameter')
183 |     results = []
184 | 
185 |     if submission_function:
186 |         log.debug('Getting submissions %s %s', args, kwargs)
187 |         this_kwargs = copy.deepcopy(kwargs)
188 |         submission_batch = submission_function(*args, **this_kwargs)
189 |         results.extend(submission_batch)
190 |     if comment_function:
191 |         log.debug('Getting comments %s %s', args, kwargs)
192 |         this_kwargs = copy.deepcopy(kwargs)
193 |         comment_batch = comment_function(*args, **this_kwargs)
194 |         results.extend(comment_batch)
195 |     log.debug('Got %d posts', len(results))
196 |     return results
197 | 
198 | def livestream_argparse(args):
199 |     if args.submissions is args.comments is False:
200 |         args.submissions = True
201 |         args.comments = True
202 |     if args.limit is None:
203 |         limit = 100
204 |     else:
205 |         limit = int(args.limit)
206 | 
207 |     return livestream(
208 |         subreddit=args.subreddit,
209 |         username=args.username,
210 |         do_comments=args.comments,
211 |         do_submissions=args.submissions,
212 |         limit=limit,
213 |         only_once=args.once,
214 |         sleepy=int(args.sleepy),
215 |     )
216 | 


--------------------------------------------------------------------------------
/timesearch_modules/merge_db.py:
--------------------------------------------------------------------------------
 1 | import os
 2 | 
 3 | from . import common
 4 | from . import tsdb
 5 | 
 6 | 
 7 | MIGRATE_QUERY = '''
 8 | INSERT INTO {tablename}
 9 | SELECT othertable.* FROM other.{tablename} othertable
10 | LEFT JOIN {tablename} mytable ON mytable.idint == othertable.idint
11 | WHERE mytable.idint IS NULL;
12 | '''
13 | 
14 | def _migrate_helper(db, tablename):
15 |     query = MIGRATE_QUERY.format(tablename=tablename)
16 |     print(query)
17 | 
18 |     oldcount = db.cur.execute('SELECT count(*) FROM %s' % tablename).fetchone()[0]
19 |     db.cur.execute(query)
20 |     db.sql.commit()
21 | 
22 |     newcount = db.cur.execute('SELECT count(*) FROM %s' % tablename).fetchone()[0]
23 |     print('Gained %d items.' % (newcount - oldcount))
24 | 
25 | def merge_db(from_db_path, to_db_path):
26 |     to_db = tsdb.TSDB(to_db_path)
27 |     from_db = tsdb.TSDB(from_db_path)
28 | 
29 |     to_db.cur.execute('ATTACH DATABASE "%s" AS other' % from_db_path)
30 |     _migrate_helper(to_db, 'submissions')
31 |     _migrate_helper(to_db, 'comments')
32 | 
33 | def merge_db_argparse(args):
34 |     return merge_db(args.from_db_path, args.to_db_path)
35 | 


--------------------------------------------------------------------------------
/timesearch_modules/offline_reading.py:
--------------------------------------------------------------------------------
  1 | import os
  2 | import markdown
  3 | 
  4 | from . import common
  5 | from . import exceptions
  6 | from . import tsdb
  7 | 
  8 | 
  9 | HTML_HEADER = '''
 10 | <html>
 11 | <head>
 12 | <title>{title}</title>
 13 | <meta charset="UTF-8">
 14 | <meta name="viewport" content="width=device-width, initial-scale=1.0"/>
 15 | 
 16 | <style>
 17 | .submission, .comment
 18 | {{
 19 |     padding-left: 20px;
 20 |     padding-right: 4px;
 21 | }}
 22 | .comment
 23 | {{
 24 |     margin-top: 4px;
 25 |     margin-bottom: 4px;
 26 |     border: 1px solid black;
 27 | }}
 28 | .submission
 29 | {{
 30 |     border: 2px solid blue;
 31 | }}
 32 | .hidden
 33 | {{
 34 |     display: none;
 35 | }}
 36 | </style>
 37 | </head>
 38 | <body>
 39 | '''.strip()
 40 | 
 41 | HTML_FOOTER = '''
 42 | </body>
 43 | 
 44 | <script>
 45 | function toggle_collapse(comment_div)
 46 | {
 47 |     var button = comment_div.getElementsByClassName("toggle_hide_button")[0];
 48 |     var collapsible = comment_div.getElementsByClassName("collapsible")[0];
 49 |     if (collapsible.classList.contains("hidden"))
 50 |     {
 51 |         collapsible.classList.remove("hidden");
 52 |         button.innerText = "[-]";
 53 |     }
 54 |     else
 55 |     {
 56 |         collapsible.classList.add("hidden");
 57 |         button.innerText = "[+]";
 58 |     }
 59 | }
 60 | </script>
 61 | </html>
 62 | '''.strip()
 63 | 
 64 | HTML_COMMENT = '''
 65 | <div class="comment" id="{id}">
 66 |     <p class="userinfo">
 67 |         <a
 68 |         class="toggle_hide_button"
 69 |         href="javascript:void(0)"
 70 |         onclick="toggle_collapse(this.parentElement.parentElement)">[-]
 71 |         </a>
 72 |         {usernamelink}
 73 |         |
 74 |         <span class="score">{score} points</span>
 75 |         |
 76 |         <a class="timestamp" href="{permalink}">{human}</a>
 77 |     </p>
 78 |     <div class="collapsible">
 79 |         {body}
 80 |         {{children}}
 81 |     </div>
 82 | </div>
 83 | '''.strip()
 84 | 
 85 | HTML_SUBMISSION = '''
 86 | <div class="submission" id="{id}">
 87 |     <p class="userinfo">
 88 |         {usernamelink}
 89 |         |
 90 |         <span class="score">{score} points</span>
 91 |         |
 92 |         <a class="timestamp" href="{permalink}">{human}</a>
 93 |     </p>
 94 |     <strong>{title}</strong>
 95 |     <p>{url_or_text}</p>
 96 | </div>
 97 | {{children}}
 98 | '''.strip()
 99 | 
100 | 
101 | class TreeNode:
102 |     def __init__(self, identifier, data, parent=None):
103 |         assert isinstance(identifier, str)
104 |         assert '\\' not in identifier
105 |         self.identifier = identifier
106 |         self.data = data
107 |         self.parent = parent
108 |         self.children = {}
109 | 
110 |     def __getitem__(self, key):
111 |         return self.children[key]
112 | 
113 |     def __repr__(self):
114 |         return 'TreeNode %s' % self.abspath()
115 | 
116 |     def abspath(self):
117 |         node = self
118 |         nodes = [node]
119 |         while node.parent is not None:
120 |             node = node.parent
121 |             nodes.append(node)
122 |         nodes.reverse()
123 |         nodes = [node.identifier for node in nodes]
124 |         return '\\'.join(nodes)
125 | 
126 |     def add_child(self, other_node, overwrite_parent=False):
127 |         self.check_child_availability(other_node.identifier)
128 |         if other_node.parent is not None and not overwrite_parent:
129 |             raise ValueError('That node already has a parent. Try `overwrite_parent=True`')
130 | 
131 |         other_node.parent = self
132 |         self.children[other_node.identifier] = other_node
133 |         return other_node
134 | 
135 |     def check_child_availability(self, identifier):
136 |         if ':' in identifier:
137 |             raise Exception('Only roots may have a colon')
138 |         if identifier in self.children:
139 |             raise Exception('Node %s already has child %s' % (self.identifier, identifier))
140 | 
141 |     def detach(self):
142 |         del self.parent.children[self.identifier]
143 |         self.parent = None
144 | 
145 |     def listnodes(self, customsort=None):
146 |         items = list(self.children.items())
147 |         if customsort is None:
148 |             items.sort(key=lambda x: x[0].lower())
149 |         else:
150 |             items.sort(key=customsort)
151 |         return [item[1] for item in items]
152 | 
153 |     def merge_other(self, othertree, otherroot=None):
154 |         newroot = None
155 |         if ':' in othertree.identifier:
156 |             if otherroot is None:
157 |                 raise Exception('Must specify a new name for the other tree\'s root')
158 |             else:
159 |                 newroot = otherroot
160 |         else:
161 |             newroot = othertree.identifier
162 |         othertree.identifier = newroot
163 |         othertree.parent = self
164 |         self.check_child_availability(newroot)
165 |         self.children[newroot] = othertree
166 | 
167 |     def printtree(self, customsort=None):
168 |         for node in self.walk(customsort):
169 |             print(node.abspath())
170 | 
171 |     def walk(self, customsort=None):
172 |         yield self
173 |         for child in self.listnodes(customsort=customsort):
174 |             #print(child)
175 |             #print(child.listnodes())
176 |             yield from child.walk(customsort=customsort)
177 | 
178 | def html_format_comment(comment):
179 |     text = HTML_COMMENT.format(
180 |         id=comment.idstr,
181 |         body=sanitize_braces(render_markdown(comment.body)),
182 |         usernamelink=html_helper_userlink(comment),
183 |         score=comment.score,
184 |         human=common.human(comment.created),
185 |         permalink=html_helper_permalink(comment),
186 |     )
187 |     return text
188 | 
189 | def html_format_submission(submission):
190 |     text = HTML_SUBMISSION.format(
191 |         id=submission.idstr,
192 |         title=sanitize_braces(submission.title),
193 |         usernamelink=html_helper_userlink(submission),
194 |         score=submission.score,
195 |         human=common.human(submission.created),
196 |         permalink=html_helper_permalink(submission),
197 |         url_or_text=html_helper_urlortext(submission),
198 |     )
199 |     return text
200 | 
201 | def html_from_database(database, specific_submission=None):
202 |     '''
203 |     Given a timesearch database, produce html pages for each
204 |     of the submissions it contains (or one particular submission fullname)
205 |     '''
206 |     if markdown is None:
207 |         raise ImportError('Page cannot be rendered without the markdown module')
208 | 
209 |     submission_trees = trees_from_database(database, specific_submission)
210 |     for submission_tree in submission_trees:
211 |         page = html_from_tree(submission_tree, sort=lambda x: x.data.score * -1)
212 |         database.offline_reading_dir.makedirs(exist_ok=True)
213 | 
214 |         html = ''
215 | 
216 |         header = HTML_HEADER.format(title=submission_tree.data.title)
217 |         html += header
218 | 
219 |         html += page
220 | 
221 |         html += HTML_FOOTER
222 |         yield (submission_tree.identifier, html)
223 | 
224 | def html_from_tree(tree, sort=None):
225 |     '''
226 |     Given a tree *whose root is the submission*, return
227 |     HTML-formatted text representing each submission's comment page.
228 |     '''
229 |     if tree.data.object_type == 'submission':
230 |         page = html_format_submission(tree.data)
231 |     elif tree.data.object_type == 'comment':
232 |         page = html_format_comment(tree.data)
233 |     children = tree.listnodes()
234 |     if sort is not None:
235 |         children.sort(key=sort)
236 |     children = [html_from_tree(child, sort) for child in children]
237 |     if len(children) == 0:
238 |         children = ''
239 |     else:
240 |         children = '\n\n'.join(children)
241 |     try:
242 |         page = page.format(children=children)
243 |     except IndexError:
244 |         print(page)
245 |         raise
246 |     return page
247 | 
248 | def html_helper_permalink(item):
249 |     '''
250 |     Given a submission or a comment, return the URL for its permalink.
251 |     '''
252 |     link = 'https://old.reddit.com/r/%s/comments/' % item.subreddit
253 |     if item.object_type == 'submission':
254 |         link += item.idstr[3:]
255 |     elif item.object_type == 'comment':
256 |         link += '%s/_/%s' % (item.submission[3:], item.idstr[3:])
257 |     return link
258 | 
259 | def html_helper_urlortext(submission):
260 |     '''
261 |     Given a submission, return either an <a> tag for its url, or its
262 |     markdown-rendered selftext.
263 |     '''
264 |     if submission.url:
265 |         text = '<a href="{url}">{url}</a>'.format(url=submission.url)
266 |     elif submission.selftext:
267 |         text = render_markdown(submission.selftext)
268 |     else:
269 |         text = ''
270 |     text = sanitize_braces(text)
271 |     return text
272 | 
273 | def html_helper_userlink(item):
274 |     '''
275 |     Given a submission or comment, return an <a> tag for its author, or [deleted].
276 |     '''
277 |     name = item.author
278 |     if name.lower() == '[deleted]':
279 |         return '[deleted]'
280 |     link = 'https://old.reddit.com/u/{name}'
281 |     link = '<a href="%s">{name}</a>' % link
282 |     link = link.format(name=name)
283 |     return link
284 | 
285 | def render_markdown(text):
286 |     # I was going to use html.escape, but then it turns html entities like
287 |     # &nbsp; into &amp;nbsp; which doesn't work.
288 |     # So I only want to escape the brackets.
289 |     escaped = text.replace('<', '&lt;').replace('>', '&rt;')
290 |     text = markdown.markdown(escaped, output_format='html5')
291 |     return text
292 | 
293 | def sanitize_braces(text):
294 |     text = text.replace('{', '{{')
295 |     text = text.replace('}', '}}')
296 |     return text
297 | 
298 | def trees_from_database(database, specific_submission=None):
299 |     '''
300 |     Given a timesearch database, take all of the submission
301 |     ids, take all of the comments for each submission id, and run them
302 |     through `tree_from_submission`.
303 | 
304 |     Yield each submission's tree as it is generated.
305 |     '''
306 |     cur1 = database.sql.cursor()
307 |     cur2 = database.sql.cursor()
308 | 
309 |     if specific_submission is None:
310 |         cur1.execute('SELECT idstr FROM submissions ORDER BY created ASC')
311 |         submission_ids = common.fetchgenerator(cur1)
312 |         # sql always returns rows as tuples, even when selecting one column.
313 |         submission_ids = (x[0] for x in submission_ids)
314 |     else:
315 |         specific_submission = common.t3_prefix(specific_submission)
316 |         submission_ids = [specific_submission]
317 | 
318 |     found_some_posts = False
319 |     for submission_id in submission_ids:
320 |         found_some_posts = True
321 |         cur2.execute('SELECT * FROM submissions WHERE idstr == ?', [submission_id])
322 |         submission = cur2.fetchone()
323 |         cur2.execute('SELECT * FROM comments WHERE submission == ?', [submission_id])
324 |         fetched_comments = cur2.fetchall()
325 |         submission_tree = tree_from_submission(submission, fetched_comments)
326 |         yield submission_tree
327 | 
328 |     if not found_some_posts:
329 |         raise Exception('Found no submissions!')
330 | 
331 | def tree_from_submission(submission_dbrow, comments_dbrows):
332 |     '''
333 |     Given the sqlite data for a submission and all of its comments,
334 |     return a tree with the submission id as the root
335 |     '''
336 |     submission = tsdb.DBEntry(submission_dbrow)
337 |     comments = [tsdb.DBEntry(c) for c in comments_dbrows]
338 |     comments.sort(key=lambda x: x.created)
339 | 
340 |     print('Building tree for %s (%d comments)' % (submission.idstr, len(comments)))
341 |     # Thanks Martin Schmidt for the algorithm
342 |     # http://stackoverflow.com/a/29942118/5430534
343 |     tree = TreeNode(identifier=submission.idstr, data=submission)
344 |     node_map = {}
345 | 
346 |     for comment in comments:
347 |         # Ensure this comment is in a node of its own
348 |         this_node = node_map.get(comment.idstr, None)
349 |         if this_node:
350 |             # This ID was detected as a parent of a previous iteration
351 |             # Now we're actually filling it in.
352 |             this_node.data = comment
353 |         else:
354 |             this_node = TreeNode(comment.idstr, comment)
355 |             node_map[comment.idstr] = this_node
356 | 
357 |         # Attach this node to the parent.
358 |         if comment.parent.startswith('t3_'):
359 |             tree.add_child(this_node)
360 |         else:
361 |             parent_node = node_map.get(comment.parent, None)
362 |             if not parent_node:
363 |                 parent_node = TreeNode(comment.parent, data=None)
364 |                 node_map[comment.parent] = parent_node
365 |             parent_node.add_child(this_node)
366 |             this_node.parent = parent_node
367 |     return tree
368 | 
369 | def offline_reading(subreddit=None, username=None, specific_submission=None):
370 |     if not specific_submission and not common.is_xor(subreddit, username):
371 |         raise exceptions.NotExclusive(['subreddit', 'username'])
372 | 
373 |     if specific_submission and not username and not subreddit:
374 |         database = tsdb.TSDB.for_submission(specific_submission, do_create=False)
375 | 
376 |     elif subreddit:
377 |         database = tsdb.TSDB.for_subreddit(subreddit, do_create=False)
378 | 
379 |     else:
380 |         database = tsdb.TSDB.for_user(username, do_create=False)
381 | 
382 |     htmls = html_from_database(database, specific_submission=specific_submission)
383 | 
384 |     for (id, html) in htmls:
385 |         html_basename = '%s.html' % id
386 |         html_filepath = database.offline_reading_dir.with_child(html_basename)
387 |         html_handle = html_filepath.open('w', encoding='utf-8')
388 |         html_handle.write(html)
389 |         html_handle.close()
390 |         print('Wrote', html_filepath.relative_path)
391 | 
392 | def offline_reading_argparse(args):
393 |     return offline_reading(
394 |         subreddit=args.subreddit,
395 |         username=args.username,
396 |         specific_submission=args.specific_submission,
397 |     )
398 | 


--------------------------------------------------------------------------------
/timesearch_modules/pushshift.py:
--------------------------------------------------------------------------------
  1 | '''
  2 | On January 29, 2018, reddit announced the death of the ?timestamp cloudsearch
  3 | parameter for submissions. RIP.
  4 | https://www.reddit.com/r/changelog/comments/7tus5f/update_to_search_api/dtfcdn0
  5 | 
  6 | This module interfaces with api.pushshift.io to restore this functionality.
  7 | It also provides new features previously impossible through reddit alone, such
  8 | as scanning all of a user's comments.
  9 | '''
 10 | import html
 11 | import requests
 12 | import time
 13 | import traceback
 14 | 
 15 | from . import common
 16 | 
 17 | from voussoirkit import ratelimiter
 18 | from voussoirkit import vlogging
 19 | 
 20 | log = vlogging.get_logger(__name__)
 21 | 
 22 | print('Thank you Jason Baumgartner of Pushshift.io!')
 23 | 
 24 | USERAGENT = 'Timesearch ({version}) ({contact})'
 25 | API_URL = 'https://api.pushshift.io/reddit/'
 26 | 
 27 | DEFAULT_PARAMS = {
 28 |     'size': 1000,
 29 |     'order': 'asc',
 30 |     'sort': 'created_utc',
 31 | }
 32 | 
 33 | # Pushshift does not supply attributes that are null. So we fill them back in.
 34 | FALLBACK_ATTRIBUTES = {
 35 |     'distinguished': None,
 36 |     'edited': False,
 37 |     'link_flair_css_class': None,
 38 |     'link_flair_text': None,
 39 |     'score': 0,
 40 |     'selftext': '',
 41 | }
 42 | 
 43 | contact_info_message = '''
 44 | Please add a CONTACT_INFO string variable to your bot.py file.
 45 | This will be added to your pushshift useragent.
 46 | '''.strip()
 47 | if not getattr(common.bot, 'CONTACT_INFO', ''):
 48 |     raise ValueError(contact_info_message)
 49 | 
 50 | useragent = USERAGENT.format(version=common.VERSION, contact=common.bot.CONTACT_INFO)
 51 | ratelimit = None
 52 | session = requests.Session()
 53 | session.headers.update({'User-Agent': useragent})
 54 | ratelimit = ratelimiter.Ratelimiter(allowance=120, period=60)
 55 | 
 56 | class DummyObject:
 57 |     '''
 58 |     These classes are used to convert the JSON data we get from pushshift into
 59 |     objects so that the rest of timesearch can operate transparently.
 60 |     This requires a bit of whack-a-mole including:
 61 |     - Fleshing out the attributes which PS did not include because they were
 62 |         null (we use FALLBACK_ATTRIBUTES to replace them).
 63 |     - Providing the convenience methods and @properties that PRAW provides.
 64 |     - Mimicking the rich attributes like author and subreddit.
 65 |     '''
 66 |     def __init__(self, **attributes):
 67 |         for (key, val) in attributes.items():
 68 |             if key == 'author':
 69 |                 val = DummyObject(name=val)
 70 |             elif key == 'subreddit':
 71 |                 val = DummyObject(display_name=val)
 72 |             elif key in ['body', 'selftext']:
 73 |                 val = html.unescape(val)
 74 |             elif key == 'parent_id':
 75 |                 if val is None:
 76 |                     val = attributes['link_id']
 77 |                 elif isinstance(val, int):
 78 |                     val = 't1_' + common.b36(val)
 79 | 
 80 |             setattr(self, key, val)
 81 | 
 82 |         for (key, val) in FALLBACK_ATTRIBUTES.items():
 83 |             if not hasattr(self, key):
 84 |                 setattr(self, key, val)
 85 | 
 86 | # In rare cases, things sometimes don't have a subreddit.
 87 | # Promo posts seem to be one example.
 88 | FALLBACK_ATTRIBUTES['subreddit'] = DummyObject(display_name=None)
 89 | 
 90 | class DummySubmission(DummyObject):
 91 |     @property
 92 |     def fullname(self):
 93 |         return 't3_' + self.id
 94 | 
 95 | class DummyComment(DummyObject):
 96 |     @property
 97 |     def fullname(self):
 98 |         return 't1_' + self.id
 99 | 
100 | 
101 | def _normalize_subreddit(subreddit):
102 |     if isinstance(subreddit, str):
103 |         return subreddit
104 |     else:
105 |         return subreddit.display_name
106 | 
107 | def _normalize_user(user):
108 |     if isinstance(user, str):
109 |         return user
110 |     else:
111 |         return user.name
112 | 
113 | def _pagination_core(url, params, dummy_type, lower=None, upper=None):
114 |     if upper is not None:
115 |         params['before'] = upper
116 |     if lower is not None:
117 |         params['after'] = lower
118 | 
119 |     setify = lambda items: set(item['id'] for item in items)
120 |     prev_batch_ids = set()
121 | 
122 |     while True:
123 |         for retry in range(5):
124 |             try:
125 |                 batch = get(url, params)
126 |             except requests.exceptions.HTTPError as exc:
127 |                 traceback.print_exc()
128 |                 print('Retrying in 5...')
129 |                 time.sleep(5)
130 |             else:
131 |                 break
132 | 
133 |         log.debug('Got batch of %d items.', len(batch))
134 |         batch_ids = setify(batch)
135 |         if len(batch_ids) == 0 or batch_ids.issubset(prev_batch_ids):
136 |             break
137 |         submissions = [dummy_type(**x) for x in batch if x['id'] not in prev_batch_ids]
138 |         submissions.sort(key=lambda x: x.created_utc)
139 |         # Take the latest-1 to avoid the lightning strike chance that two posts
140 |         # have the same timestamp and this occurs at a page boundary.
141 |         # Since ?after=latest would cause us to miss that second one.
142 |         params['after'] = submissions[-1].created_utc - 1
143 |         yield from submissions
144 | 
145 |         prev_batch_ids = batch_ids
146 |         ratelimit.limit()
147 | 
148 | def get(url, params=None):
149 |     if not url.startswith('https://'):
150 |         url = API_URL + url.lstrip('/')
151 | 
152 |     if params is None:
153 |         params = {}
154 | 
155 |     for (key, val) in DEFAULT_PARAMS.items():
156 |         params.setdefault(key, val)
157 | 
158 |     log.debug('Requesting %s with %s', url, params)
159 |     ratelimit.limit()
160 |     response = session.get(url, params=params)
161 |     response.raise_for_status()
162 |     response = response.json()
163 |     data = response['data']
164 |     return data
165 | 
166 | def get_comments_from_submission(submission):
167 |     if isinstance(submission, str):
168 |         submission_id = common.t3_prefix(submission)[3:]
169 |     else:
170 |         submission_id = submission.id
171 | 
172 |     params = {'link_id': submission_id}
173 |     comments = _pagination_core(
174 |         url='comment/search/',
175 |         params=params,
176 |         dummy_type=DummyComment,
177 |     )
178 |     yield from comments
179 | 
180 | def get_comments_from_subreddit(subreddit, **kwargs):
181 |     subreddit = _normalize_subreddit(subreddit)
182 |     params = {'subreddit': subreddit}
183 |     comments = _pagination_core(
184 |         url='comment/search/',
185 |         params=params,
186 |         dummy_type=DummyComment,
187 |         **kwargs
188 |     )
189 |     yield from comments
190 | 
191 | def get_comments_from_user(user, **kwargs):
192 |     user = _normalize_user(user)
193 |     params = {'author': user}
194 |     comments = _pagination_core(
195 |         url='comment/search/',
196 |         params=params,
197 |         dummy_type=DummyComment,
198 |         **kwargs
199 |     )
200 |     yield from comments
201 | 
202 | def get_submissions_from_subreddit(subreddit, **kwargs):
203 |     subreddit = _normalize_subreddit(subreddit)
204 |     params = {'subreddit': subreddit}
205 |     submissions = _pagination_core(
206 |         url='submission/search/',
207 |         params=params,
208 |         dummy_type=DummySubmission,
209 |         **kwargs
210 |     )
211 |     yield from submissions
212 | 
213 | def get_submissions_from_user(user, **kwargs):
214 |     user = _normalize_user(user)
215 |     params = {'author': user}
216 |     submissions = _pagination_core(
217 |         url='submission/search/',
218 |         params=params,
219 |         dummy_type=DummySubmission,
220 |         **kwargs
221 |     )
222 |     yield from submissions
223 | 
224 | def supplement_reddit_data(dummies, chunk_size=100):
225 |     '''
226 |     Given an iterable of the Dummy Pushshift objects, yield them back and also
227 |     yield the live Reddit objects they refer to according to reddit's /api/info.
228 |     The live object will always come after the corresponding dummy object.
229 |     By doing this, we enjoy the strengths of both data sources: Pushshift
230 |     will give us deleted or removed objects that reddit would not, and reddit
231 |     gives us up-to-date scores and text bodies.
232 |     '''
233 |     chunks = common.generator_chunker(dummies, chunk_size)
234 |     for chunk in chunks:
235 |         log.debug('Supplementing %d items with live reddit data.', len(chunk))
236 |         ids = [item.fullname for item in chunk]
237 |         live_copies = list(common.r.info(ids))
238 |         live_copies = {item.fullname: item for item in live_copies}
239 |         for item in chunk:
240 |             yield item
241 |             live_item = live_copies.get(item.fullname, None)
242 |             if live_item:
243 |                 yield live_item
244 | 


--------------------------------------------------------------------------------
/timesearch_modules/tsdb.py:
--------------------------------------------------------------------------------
  1 | import os
  2 | import sqlite3
  3 | import time
  4 | import types
  5 | 
  6 | from . import common
  7 | from . import exceptions
  8 | from . import pushshift
  9 | 
 10 | from voussoirkit import pathclass
 11 | from voussoirkit import sqlhelpers
 12 | from voussoirkit import vlogging
 13 | 
 14 | log = vlogging.get_logger(__name__)
 15 | 
 16 | # For backwards compatibility reasons, this list of format strings will help
 17 | # timesearch find databases that are using the old filename style.
 18 | # The final element will be used if none of the previous ones were found.
 19 | DB_FORMATS_SUBREDDIT = [
 20 |     '.\\{name}.db',
 21 |     '.\\subreddits\\{name}\\{name}.db',
 22 |     '.\\{name}\\{name}.db',
 23 |     '.\\databases\\{name}.db',
 24 |     '.\\subreddits\\{name}\\{name}.db',
 25 | ]
 26 | DB_FORMATS_USER = [
 27 |     '.\\@{name}.db',
 28 |     '.\\users\\@{name}\\@{name}.db',
 29 |     '.\\@{name}\\@{name}.db',
 30 |     '.\\databases\\@{name}.db',
 31 |     '.\\users\\@{name}\\@{name}.db',
 32 | ]
 33 | 
 34 | DATABASE_VERSION = 2
 35 | DB_VERSION_PRAGMA = f'''
 36 | PRAGMA user_version = {DATABASE_VERSION};
 37 | '''
 38 | 
 39 | DB_PRAGMAS = f'''
 40 | '''
 41 | 
 42 | DB_INIT = f'''
 43 | {DB_PRAGMAS}
 44 | {DB_VERSION_PRAGMA}
 45 | ----------------------------------------------------------------------------------------------------
 46 | CREATE TABLE IF NOT EXISTS config(
 47 |     key TEXT,
 48 |     value TEXT
 49 | );
 50 | ----------------------------------------------------------------------------------------------------
 51 | CREATE TABLE IF NOT EXISTS submissions(
 52 |     idint INT,
 53 |     idstr TEXT,
 54 |     created INT,
 55 |     self INT,
 56 |     nsfw INT,
 57 |     author TEXT,
 58 |     title TEXT,
 59 |     url TEXT,
 60 |     selftext TEXT,
 61 |     score INT,
 62 |     subreddit TEXT,
 63 |     distinguish INT,
 64 |     textlen INT,
 65 |     num_comments INT,
 66 |     flair_text TEXT,
 67 |     flair_css_class TEXT,
 68 |     augmented_at INT,
 69 |     augmented_count INT
 70 | );
 71 | CREATE INDEX IF NOT EXISTS submission_index ON submissions(idstr);
 72 | ----------------------------------------------------------------------------------------------------
 73 | CREATE TABLE IF NOT EXISTS comments(
 74 |     idint INT,
 75 |     idstr TEXT,
 76 |     created INT,
 77 |     author TEXT,
 78 |     parent TEXT,
 79 |     submission TEXT,
 80 |     body TEXT,
 81 |     score INT,
 82 |     subreddit TEXT,
 83 |     distinguish TEXT,
 84 |     textlen INT
 85 | );
 86 | CREATE INDEX IF NOT EXISTS comment_index ON comments(idstr);
 87 | ----------------------------------------------------------------------------------------------------
 88 | CREATE TABLE IF NOT EXISTS submission_edits(
 89 |     idstr TEXT,
 90 |     previous_selftext TEXT,
 91 |     replaced_at INT
 92 | );
 93 | CREATE INDEX IF NOT EXISTS submission_edits_index ON submission_edits(idstr);
 94 | ----------------------------------------------------------------------------------------------------
 95 | CREATE TABLE IF NOT EXISTS comment_edits(
 96 |     idstr TEXT,
 97 |     previous_body TEXT,
 98 |     replaced_at INT
 99 | );
100 | CREATE INDEX IF NOT EXISTS comment_edits_index ON comment_edits(idstr);
101 | '''
102 | 
103 | DEFAULT_CONFIG = {
104 |     'store_edits': True,
105 | }
106 | 
107 | SQL_SUBMISSION_COLUMNS = [
108 |     'idint',
109 |     'idstr',
110 |     'created',
111 |     'self',
112 |     'nsfw',
113 |     'author',
114 |     'title',
115 |     'url',
116 |     'selftext',
117 |     'score',
118 |     'subreddit',
119 |     'distinguish',
120 |     'textlen',
121 |     'num_comments',
122 |     'flair_text',
123 |     'flair_css_class',
124 |     'augmented_at',
125 |     'augmented_count',
126 | ]
127 | 
128 | SQL_COMMENT_COLUMNS = [
129 |     'idint',
130 |     'idstr',
131 |     'created',
132 |     'author',
133 |     'parent',
134 |     'submission',
135 |     'body',
136 |     'score',
137 |     'subreddit',
138 |     'distinguish',
139 |     'textlen',
140 | ]
141 | 
142 | SQL_EDITS_COLUMNS = [
143 |     'idstr',
144 |     'text',
145 |     'replaced_at',
146 | ]
147 | 
148 | SQL_SUBMISSION = {key:index for (index, key) in enumerate(SQL_SUBMISSION_COLUMNS)}
149 | SQL_COMMENT = {key:index for (index, key) in enumerate(SQL_COMMENT_COLUMNS)}
150 | 
151 | SUBMISSION_TYPES = (common.praw.models.Submission, pushshift.DummySubmission)
152 | COMMENT_TYPES = (common.praw.models.Comment, pushshift.DummyComment)
153 | 
154 | 
155 | class DBEntry:
156 |     '''
157 |     This class converts a tuple row from the database into an object so that
158 |     you can access the attributes with dot notation.
159 |     '''
160 |     def __init__(self, dbrow):
161 |         if dbrow[1].startswith('t3_'):
162 |             columns = SQL_SUBMISSION_COLUMNS
163 |             self.object_type = 'submission'
164 |         else:
165 |             columns = SQL_COMMENT_COLUMNS
166 |             self.object_type = 'comment'
167 | 
168 |         self.id = None
169 |         self.idstr = None
170 |         for (index, attribute) in enumerate(columns):
171 |             setattr(self, attribute, dbrow[index])
172 | 
173 |     def __repr__(self):
174 |         return 'DBEntry(\'%s\')' % self.id
175 | 
176 | 
177 | class TSDB:
178 |     def __init__(self, filepath, *, do_create=True, skip_version_check=False):
179 |         self.filepath = pathclass.Path(filepath)
180 |         if not self.filepath.is_file:
181 |             if not do_create:
182 |                 raise exceptions.DatabaseNotFound(self.filepath)
183 |             print('New database', self.filepath.relative_path)
184 | 
185 |         self.filepath.parent.makedirs(exist_ok=True)
186 | 
187 |         self.breakdown_dir = self.filepath.parent.with_child('breakdown')
188 |         self.offline_reading_dir = self.filepath.parent.with_child('offline_reading')
189 |         self.index_dir = self.filepath.parent.with_child('index')
190 |         self.styles_dir = self.filepath.parent.with_child('styles')
191 |         self.wiki_dir = self.filepath.parent.with_child('wiki')
192 | 
193 |         existing_database = self.filepath.exists
194 |         self.sql = sqlite3.connect(self.filepath.absolute_path)
195 |         self.cur = self.sql.cursor()
196 | 
197 |         if existing_database:
198 |             if not skip_version_check:
199 |                 self._check_version()
200 |             self._load_pragmas()
201 |         else:
202 |             self._first_time_setup()
203 | 
204 |         self.config = {}
205 |         for (key, default_value) in DEFAULT_CONFIG.items():
206 |             self.cur.execute('SELECT value FROM config WHERE key == ?', [key])
207 |             existing_value = self.cur.fetchone()
208 |             if existing_value is None:
209 |                 self.cur.execute('INSERT INTO config VALUES(?, ?)', [key, default_value])
210 |                 self.config[key] = default_value
211 |             else:
212 |                 existing_value = existing_value[0]
213 |                 if isinstance(default_value, int):
214 |                     existing_value = int(existing_value)
215 |                 self.config[key] = existing_value
216 | 
217 |     def _check_version(self):
218 |         '''
219 |         Compare database's user_version against DATABASE_VERSION,
220 |         raising exceptions.DatabaseOutOfDate if not correct.
221 |         '''
222 |         existing = self.cur.execute('PRAGMA user_version').fetchone()[0]
223 |         if existing != DATABASE_VERSION:
224 |             raise exceptions.DatabaseOutOfDate(
225 |                 current=existing,
226 |                 new=DATABASE_VERSION,
227 |                 filepath=self.filepath,
228 |             )
229 | 
230 |     def _first_time_setup(self):
231 |         self.sql.executescript(DB_INIT)
232 |         self.sql.commit()
233 | 
234 |     def _load_pragmas(self):
235 |         self.sql.executescript(DB_PRAGMAS)
236 |         self.sql.commit()
237 | 
238 |     def __repr__(self):
239 |         return 'TSDB(%s)' % self.filepath
240 | 
241 |     @staticmethod
242 |     def _pick_filepath(formats, name):
243 |         '''
244 |         Starting with the most specific and preferred filename format, check
245 |         if there is an existing database that matches the name we're looking
246 |         for, and return that path. If none of them exist, then use the most
247 |         preferred filepath.
248 |         '''
249 |         for form in formats:
250 |             path = form.format(name=name)
251 |             if os.path.isfile(path):
252 |                 break
253 |         return pathclass.Path(path)
254 | 
255 |     @classmethod
256 |     def _for_object_helper(cls, name, path_formats, do_create=True, fix_name=False):
257 |         if name != os.path.basename(name):
258 |             filepath = pathclass.Path(name)
259 | 
260 |         else:
261 |             filepath = cls._pick_filepath(formats=path_formats, name=name)
262 | 
263 |         database = cls(filepath=filepath, do_create=do_create)
264 |         if fix_name:
265 |             return (database, name_from_path(name))
266 |         return database
267 | 
268 |     @classmethod
269 |     def for_submission(cls, submission_id, fix_name=False, *args, **kwargs):
270 |         subreddit = common.subreddit_for_submission(submission_id)
271 |         database = cls.for_subreddit(subreddit, *args, **kwargs)
272 |         if fix_name:
273 |             return (database, subreddit.display_name)
274 |         return database
275 | 
276 |     @classmethod
277 |     def for_subreddit(cls, name, do_create=True, fix_name=False):
278 |         if isinstance(name, common.praw.models.Subreddit):
279 |             name = name.display_name
280 |         elif not isinstance(name, str):
281 |             raise TypeError(name, 'should be str or Subreddit.')
282 |         return cls._for_object_helper(
283 |             name,
284 |             do_create=do_create,
285 |             fix_name=fix_name,
286 |             path_formats=DB_FORMATS_SUBREDDIT,
287 |         )
288 | 
289 |     @classmethod
290 |     def for_user(cls, name, do_create=True, fix_name=False):
291 |         if isinstance(name, common.praw.models.Redditor):
292 |             name = name.name
293 |         elif not isinstance(name, str):
294 |             raise TypeError(name, 'should be str or Redditor.')
295 | 
296 |         return cls._for_object_helper(
297 |             name,
298 |             do_create=do_create,
299 |             fix_name=fix_name,
300 |             path_formats=DB_FORMATS_USER,
301 |         )
302 | 
303 |     def check_for_edits(self, obj, existing_entry):
304 |         '''
305 |         If the item's current text doesn't match the stored text, decide what
306 |         to do.
307 | 
308 |         Firstly, make sure to ignore deleted comments.
309 |         Then, if the database is configured to store edited text, do so.
310 |         Finally, return the body that we want to store in the main table.
311 |         '''
312 |         if isinstance(obj, SUBMISSION_TYPES):
313 |             existing_body = existing_entry[SQL_SUBMISSION['selftext']]
314 |             body = obj.selftext
315 |         else:
316 |             existing_body = existing_entry[SQL_COMMENT['body']]
317 |             body = obj.body
318 | 
319 |         if body != existing_body:
320 |             if should_keep_existing_text(obj):
321 |                 body = existing_body
322 |             elif self.config['store_edits']:
323 |                 self.insert_edited(obj, old_text=existing_body)
324 |         return body
325 | 
326 |     def insert(self, objects, commit=True):
327 |         if not isinstance(objects, (list, tuple, types.GeneratorType)):
328 |             objects = [objects]
329 | 
330 |         if isinstance(objects, types.GeneratorType):
331 |             log.debug('Trying to insert a generator of objects.')
332 |         else:
333 |             log.debug('Trying to insert %d objects.', len(objects))
334 | 
335 |         new_values = {
336 |             'tsdb': self,
337 |             'new_submissions': 0,
338 |             'new_comments': 0,
339 |         }
340 |         methods = {
341 |             common.praw.models.Submission: (self.insert_submission, 'new_submissions'),
342 |             common.praw.models.Comment: (self.insert_comment, 'new_comments'),
343 |         }
344 |         methods[pushshift.DummySubmission] = methods[common.praw.models.Submission]
345 |         methods[pushshift.DummyComment] = methods[common.praw.models.Comment]
346 | 
347 |         for obj in objects:
348 |             (method, key) = methods.get(type(obj), (None, None))
349 |             if method is None:
350 |                 raise TypeError('Unsupported', type(obj), obj)
351 |             status = method(obj)
352 |             new_values[key] += status
353 | 
354 |         if commit:
355 |             log.debug('Committing insert.')
356 |             self.sql.commit()
357 | 
358 |         log.debug('Done inserting.')
359 |         return new_values
360 | 
361 |     def insert_edited(self, obj, old_text):
362 |         '''
363 |         Having already detected that the item has been edited, add a record to
364 |         the appropriate *_edits table containing the text that is being
365 |         replaced.
366 |         '''
367 |         if isinstance(obj, SUBMISSION_TYPES):
368 |             table = 'submission_edits'
369 |             key = 'previous_selftext'
370 |         else:
371 |             table = 'comment_edits'
372 |             key = 'previous_body'
373 | 
374 |         if obj.edited is False:
375 |             replaced_at = int(time.time())
376 |         else:
377 |             replaced_at = int(obj.edited)
378 | 
379 |         postdata = {
380 |             'idstr': obj.fullname,
381 |             key: old_text,
382 |             'replaced_at': replaced_at,
383 |         }
384 |         cur = self.sql.cursor()
385 |         (qmarks, bindings) = sqlhelpers.insert_filler(postdata)
386 |         query = f'INSERT INTO {table} {qmarks}'
387 |         cur.execute(query, bindings)
388 | 
389 |     def insert_submission(self, submission):
390 |         cur = self.sql.cursor()
391 |         cur.execute('SELECT * FROM submissions WHERE idstr == ?', [submission.fullname])
392 |         existing_entry = cur.fetchone()
393 | 
394 |         if submission.author is None:
395 |             author = '[DELETED]'
396 |         else:
397 |             author = submission.author.name
398 | 
399 |         if not existing_entry:
400 |             if submission.is_self:
401 |                 # Selfpost's URL leads back to itself, so just ignore it.
402 |                 url = None
403 |             elif hasattr(submission, 'crosspost_parent') and getattr(submission, 'crosspost_parent_list'):
404 |                 url = submission.crosspost_parent_list[0]['permalink']
405 |             else:
406 |                 url = getattr(submission, 'url', None)
407 | 
408 |             if url and url.startswith('/r/'):
409 |                 url = 'https://reddit.com' + url
410 | 
411 |             postdata = {
412 |                 'idint': common.b36(submission.id),
413 |                 'idstr': submission.fullname,
414 |                 'created': submission.created_utc,
415 |                 'self': submission.is_self,
416 |                 'nsfw': submission.over_18,
417 |                 'author': author,
418 |                 'title': submission.title,
419 |                 'url': url,
420 |                 'selftext': submission.selftext,
421 |                 'score': submission.score,
422 |                 'subreddit': submission.subreddit.display_name,
423 |                 'distinguish': submission.distinguished,
424 |                 'textlen': len(submission.selftext),
425 |                 'num_comments': submission.num_comments,
426 |                 'flair_text': submission.link_flair_text,
427 |                 'flair_css_class': submission.link_flair_css_class,
428 |                 'augmented_at': None,
429 |                 'augmented_count': None,
430 |             }
431 |             (qmarks, bindings) = sqlhelpers.insert_filler(postdata)
432 |             query = f'INSERT INTO submissions {qmarks}'
433 |             cur.execute(query, bindings)
434 | 
435 |         else:
436 |             selftext = self.check_for_edits(submission, existing_entry=existing_entry)
437 | 
438 |             query = '''
439 |                 UPDATE submissions SET
440 |                 nsfw = coalesce(?, nsfw),
441 |                 score = coalesce(?, score),
442 |                 selftext = coalesce(?, selftext),
443 |                 distinguish = coalesce(?, distinguish),
444 |                 num_comments = coalesce(?, num_comments),
445 |                 flair_text = coalesce(?, flair_text),
446 |                 flair_css_class = coalesce(?, flair_css_class)
447 |                 WHERE idstr == ?
448 |             '''
449 |             bindings = [
450 |                 submission.over_18,
451 |                 submission.score,
452 |                 selftext,
453 |                 submission.distinguished,
454 |                 submission.num_comments,
455 |                 submission.link_flair_text,
456 |                 submission.link_flair_css_class,
457 |                 submission.fullname
458 |             ]
459 |             cur.execute(query, bindings)
460 | 
461 |         return existing_entry is None
462 | 
463 |     def insert_comment(self, comment):
464 |         cur = self.sql.cursor()
465 |         cur.execute('SELECT * FROM comments WHERE idstr == ?', [comment.fullname])
466 |         existing_entry = cur.fetchone()
467 | 
468 |         if comment.author is None:
469 |             author = '[DELETED]'
470 |         else:
471 |             author = comment.author.name
472 | 
473 |         if not existing_entry:
474 |             postdata = {
475 |                 'idint': common.b36(comment.id),
476 |                 'idstr': comment.fullname,
477 |                 'created': comment.created_utc,
478 |                 'author': author,
479 |                 'parent': comment.parent_id,
480 |                 'submission': comment.link_id,
481 |                 'body': comment.body,
482 |                 'score': comment.score,
483 |                 'subreddit': comment.subreddit.display_name,
484 |                 'distinguish': comment.distinguished,
485 |                 'textlen': len(comment.body),
486 |             }
487 |             (qmarks, bindings) = sqlhelpers.insert_filler(postdata)
488 |             query = f'INSERT INTO comments {qmarks}'
489 |             cur.execute(query, bindings)
490 | 
491 |         else:
492 |             body = self.check_for_edits(comment, existing_entry=existing_entry)
493 | 
494 |             query = '''
495 |                 UPDATE comments SET
496 |                 score = coalesce(?, score),
497 |                 body = coalesce(?, body),
498 |                 distinguish = coalesce(?, distinguish)
499 |                 WHERE idstr == ?
500 |             '''
501 |             bindings = [
502 |                 comment.score,
503 |                 body,
504 |                 comment.distinguished,
505 |                 comment.fullname
506 |             ]
507 |             cur.execute(query, bindings)
508 | 
509 |         return existing_entry is None
510 | 
511 | 
512 | def name_from_path(filepath):
513 |     '''
514 |     In order to support usage like
515 |     > timesearch livestream -r D:\\some\\other\\filepath\\learnpython.db
516 |     this function extracts the subreddit name / username based on the given
517 |     path, so that we can pass it into `r.subreddit` / `r.redditor` properly.
518 |     '''
519 |     if isinstance(filepath, pathclass.Path):
520 |         filepath = filepath.basename
521 |     else:
522 |         filepath = os.path.basename(filepath)
523 |     name = os.path.splitext(filepath)[0]
524 |     name = name.strip('@')
525 |     return name
526 | 
527 | def should_keep_existing_text(obj):
528 |     '''
529 |     Under certain conditions we do not want to update the entry in the db
530 |     with the most recent copy of the text. For example, if the post has
531 |     been deleted and the text now shows '[deleted]' we would prefer to
532 |     keep whatever we already have.
533 | 
534 |     This function puts away the work I would otherwise have to duplicate
535 |     for both submissions and comments.
536 |     '''
537 |     body = obj.selftext if isinstance(obj, SUBMISSION_TYPES) else obj.body
538 |     if obj.author is None and body in ['[removed]', '[deleted]']:
539 |         return True
540 | 
541 |     greasy = ['has been overwritten', 'pastebin.com/64GuVi2F']
542 |     if any(grease in body for grease in greasy):
543 |         return True
544 | 
545 |     return False
546 | 


--------------------------------------------------------------------------------
/utilities/database_upgrader.py:
--------------------------------------------------------------------------------
 1 | import argparse
 2 | import os
 3 | import sqlite3
 4 | import sys
 5 | 
 6 | sys.path.append(os.path.dirname(os.path.dirname(__file__)))
 7 | 
 8 | from timesearch_modules import tsdb
 9 | 
10 | 
11 | def upgrade_1_to_2(db):
12 |     '''
13 |     In this version, many of the timesearch modules were renamed, including
14 |     redmash -> index. This update will rename the existing `redmash` folder
15 |     to `index`.
16 |     '''
17 |     cur = db.sql.cursor()
18 |     redmash_dir = db.index_dir.parent.with_child('redmash')
19 |     if redmash_dir.exists:
20 |         redmash_dir.assert_is_directory()
21 |         print('Renaming redmash folder to index.')
22 |         os.rename(redmash_dir, db.index_dir)
23 | 
24 | def upgrade_all(database_filename):
25 |     '''
26 |     Given the filename of a database, apply all of the needed
27 |     upgrade_x_to_y functions in order.
28 |     '''
29 |     db = tsdb.TSDB(database_filename, do_create=False, skip_version_check=True)
30 | 
31 |     cur = db.sql.cursor()
32 | 
33 |     cur.execute('PRAGMA user_version')
34 |     current_version = cur.fetchone()[0]
35 |     needed_version = tsdb.DATABASE_VERSION
36 | 
37 |     if current_version == needed_version:
38 |         print('Already up to date with version %d.' % needed_version)
39 |         return
40 | 
41 |     for version_number in range(current_version + 1, needed_version + 1):
42 |         print('Upgrading from %d to %d' % (current_version, version_number))
43 |         upgrade_function = 'upgrade_%d_to_%d' % (current_version, version_number)
44 |         upgrade_function = eval(upgrade_function)
45 |         upgrade_function(db)
46 |         db.sql.cursor().execute('PRAGMA user_version = %d' % version_number)
47 |         db.sql.commit()
48 |         current_version = version_number
49 |     print('Upgrades finished.')
50 | 
51 | 
52 | def upgrade_all_argparse(args):
53 |     return upgrade_all(database_filename=args.database_filename)
54 | 
55 | def main(argv):
56 |     parser = argparse.ArgumentParser()
57 | 
58 |     parser.add_argument('database_filename')
59 |     parser.set_defaults(func=upgrade_all_argparse)
60 | 
61 |     args = parser.parse_args(argv)
62 |     return args.func(args)
63 | 
64 | if __name__ == '__main__':
65 |     raise SystemExit(main(sys.argv[1:]))
66 | 


--------------------------------------------------------------------------------