├── requirements.txt
├── config.cfg
├── LICENSE
├── .gitignore
├── README.md
└── ebooks.py

/requirements.txt:
--------------------------------------------------------------------------------
markovify==0.7.1
Mastodon.py==1.2.2
beautifulsoup4==4.6.0
ananas>=1.0.0b1
--------------------------------------------------------------------------------
/config.cfg:
--------------------------------------------------------------------------------
[EBOOKS]
class = ebooks.ebooksBot
domain =
client_id =
client_secret =
access_token =
exclude_replies = True

--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
The MIT License

Copyright (c) 2018 Erin Pinheiro

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in
all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
THE SOFTWARE.
--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class

# C extensions
*.so

# Distribution / packaging
.Python
env/
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
*.egg-info/
.installed.cfg
*.egg

# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec

# Installer logs
pip-log.txt
pip-delete-this-directory.txt

# Unit test / coverage reports
htmlcov/
.tox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
.hypothesis/

# Translations
*.mo
*.pot

# Django stuff:
*.log
local_settings.py

# Flask stuff:
instance/
.webassets-cache

# Scrapy stuff:
.scrapy

# Sphinx documentation
docs/_build/

# PyBuilder
target/

# Jupyter Notebook
.ipynb_checkpoints

# pyenv
.python-version

# celery beat schedule file
celerybeat-schedule

# SageMath parsed files
*.sage.py

# dotenv
.env

# virtualenv
.venv
venv/
ENV/

# Spyder project settings
.spyderproject
.spyproject

# Rope project settings
.ropeproject

# mkdocs documentation
/site

# mypy
.mypy_cache/
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# ananas-ebooks

Based on [Maple](https://computerfairi.es/@squirrel)'s [mastodon-ebooks.py]().

This simply takes Maple's work and adapts it for the ananas framework, letting you skip the cron jobs and manual posting.

Usage:

1. In the directory where ananas-ebooks is installed, run `pip3 install -r requirements.txt` to make sure you have all the requirements for the bot.
2. Create a Mastodon account to be the ebooks bot.
3. In Mastodon, use your bot account to follow each account you want to scrape. **DO NOT** follow any accounts you don't want to scrape.
4. In Mastodon, under Preferences > Development, create a new app for the bot. You can just call it "Ebooks" and leave "Application Website" blank. Do not change the Redirect URI.
5. On the next screen, click on the name you gave your bot ("Ebooks" or something else). You'll get a client key, client secret, and access token.
   (Alternately, you can use a script [like this](https://gist.github.com/Lana-chan/b0d937968d22eca6dcd79a0524449f1d) to generate user secrets to be used by the ebooks script.)
6. Open `config.cfg`. Make the following changes (a filled-in example appears below):
   1. Change `domain` to whatever your bot's instance is.
   2. Paste your client key after `client_id`, your client secret after `client_secret`, and your access token after `access_token`.
   3. By default, your bot will not scrape toots that are replies to other users. Set `exclude_replies = False` if you want your bot to scrape replies in addition to "raw" toots.
7. You're done! Type `ananas config.cfg` into your command line and watch your bot spring into action. It will start by scraping each account you've followed, then toot once each hour and look for replies so it can toot back.

If you want to re-scrape toots from your followed accounts, just stop the bot (with Ctrl-C) and start it again; it will check for new toots every time it starts up.

## Why do I have to follow everyone I want to scrape? Can't I just give it a list of usernames?

It's an anti-harassment measure. This way, if you want to make an ebooks bot of someone's toots, they're notified that the bot exists.
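
## What should a finished `config.cfg` look like?

Something like this. Every value below is a made-up placeholder, not a real domain or credential; use your bot's own instance and the keys from step 5:

    [EBOOKS]
    class = ebooks.ebooksBot
    domain = example.social
    client_id = 01a2b3c4d5e6f7...
    client_secret = 98z7y6x5w4v3u2...
    access_token = f00ba4quux1234...
    exclude_replies = True

If you'd rather generate the credentials from Python than click through the web UI, Mastodon.py (already in `requirements.txt`) can register the app and log in for you. This is just a sketch of a one-off helper, not a file in this repo; the instance URL, e-mail address, and password are placeholders:

    # get_secrets.py -- hypothetical one-off helper, not part of this repo
    from mastodon import Mastodon

    INSTANCE = "https://example.social"  # your bot's instance

    # Register the application once; returns (client_id, client_secret)
    client_id, client_secret = Mastodon.create_app("Ebooks", api_base_url=INSTANCE)

    # Log in as the bot account to obtain an access token
    # (plain password login won't work if the account has 2FA enabled)
    api = Mastodon(client_id=client_id, client_secret=client_secret,
                   api_base_url=INSTANCE)
    access_token = api.log_in("bot@example.com", "correct horse battery staple")

    print("client_id =", client_id)
    print("client_secret =", client_secret)
    print("access_token =", access_token)

Run it once, then paste the three printed values into `config.cfg`.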

## What if I already have an ananas bot running and I'd like to run them in tandem?

Just copy `ebooks.py` and `requirements.txt` into that bot's directory, and modify that `config.cfg` to have an `[EBOOKS]` section as described in step 6 above. ananas bots play very nicely with each other.

If you've already scraped some accounts and want to keep that data, copy over `accts.json`, `model.json`, and the `corpus` directory too. All that data will stay intact the next time you launch ananas.

## This has been running for a long time, and I've tooted a lot since then. How do I get my new toots into the corpus?

The bot automatically scrapes new toots from its followed accounts once a day at 2:15 AM (to minimize collision with other tasks).

## What if I want my bot to toot at a different time?

I'd love to be able to include this in the `config.cfg` file, but for various technical reasons it's not possible right now. Instead, you'll have to actually open up `ebooks.py`.

In `ebooks.py`, find the `toot` method near the bottom of the file. It's decorated like this:

    @hourly(minute=0)

If you want your bot to post once an hour, but not **on** the hour, change the value of `minute`. For example, if you want your bot to post at 12:15, 1:15, 2:15, etc., change it to `minute=15`.

If you want your bot to post **more** than once an hour, just add another line. For example:

    @hourly(minute=12)
    @hourly(minute=42)
    def toot(self):
        ...

will post twice an hour, at XX:12 and at XX:42.

If you want your bot to post **less** than once an hour, change `@hourly` to `@daily`. By default, this will make it post once a day, but like `@hourly`, you can stack them up. To post every four hours, you might do:

    @daily(hour=0, minute=15)
    @daily(hour=4, minute=15)
    @daily(hour=8, minute=15)
    @daily(hour=12, minute=15)
    @daily(hour=16, minute=15)
    @daily(hour=20, minute=15)
    def toot(self):
        ...

Note that `hour` uses 24-hour time.

You can see more details at the [ananas repo](https://github.com/chr-1x/ananas).
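
## Can I change when the daily scrape happens, too?

Same idea. In `ebooks.py`, the `scrape` method is decorated like this:

    @daily(hour=2, minute=15)

Change `hour` and `minute` (24-hour time again) to move the daily scrape to whatever time suits you.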
--------------------------------------------------------------------------------
/ebooks.py:
--------------------------------------------------------------------------------
#!/usr/bin/env python3
from bs4 import BeautifulSoup
import markovify
import html
import json
import os
import sys
from ananas import PineappleBot, daily, hourly, reply, html_strip_tags

class ebooksBot(PineappleBot):
    exclude_replies = True

    def start(self):
        # config values arrive as strings, and bool("False") is True,
        # so compare against the text instead of casting
        try:
            self.exclude_replies = str(self.config.exclude_replies).lower() != "false"
        except Exception:
            self.exclude_replies = True
        self.scrape()

    # strip html tags for text alone
    # (currently unused: the bot calls ananas's html_strip_tags instead)
    def strip_tags(self, content):
        soup = BeautifulSoup(html.unescape(content), 'html.parser')
        # remove mentions
        tags = soup.select('.mention')
        for i in tags:
            i.extract()
        # clear shortened link captions
        tags = soup.select('.invisible, .ellipsis')
        for i in tags:
            i.unwrap()
        # replace link text to avoid caption breaking
        tags = soup.select('a')
        for i in tags:
            i.replace_with(i.get_text())
        # strip html tags, chr(31) joins text in different html tags
        return soup.get_text(chr(31)).strip()

    # scrapes the accounts the bot is following to build corpus
    @daily(hour=2, minute=15)
    def scrape(self):
        me = self.mastodon.account_verify_credentials()
        following = self.mastodon.account_following(me['id'])
        acctfile = 'accts.json'
        # acctfile contains info on last scraped toot id
        try:
            with open(acctfile, 'r') as f:
                acctjson = json.load(f)
        except (OSError, ValueError):
            acctjson = {}

        print(acctjson)
        for acc in following:
            acct_id = str(acc['id'])
            # pass the last seen toot id (None on the first run) so we
            # only fetch what's new
            acctjson[acct_id] = self.scrape_id(acct_id, since=acctjson.get(acct_id))

        with open(acctfile, 'w') as f:
            json.dump(acctjson, f)

        # generate the whole corpus after scraping so we don't do it at every runtime
        combined_model = None
        for (dirpath, _, filenames) in os.walk("corpus"):
            for filename in filenames:
                with open(os.path.join(dirpath, filename)) as f:
                    model = markovify.NewlineText(f, retain_original=False)
                if combined_model:
                    combined_model = markovify.combine(models=[combined_model, model])
                else:
                    combined_model = model
        if combined_model is None:
            print('no corpus found -- is the bot following anyone?')
            return
        with open('model.json', 'w') as f:
            f.write(combined_model.to_json())

    def scrape_id(self, acct_id, since=None):
        # replies are excluded by default; set exclude_replies = False in
        # config.cfg to scrape them too
        toots = self.mastodon.account_statuses(acct_id, since_id=since,
                                               exclude_replies=self.exclude_replies)
        # if this fails, there are no new toots and we just return the old pointer
        try:
            new_since_id = toots[0]['id']
        except IndexError:
            return since
        bufferfile = 'buffer.txt'
        corpusfile = 'corpus/%s.txt' % acct_id
        i = 0
        with open(bufferfile, 'w') as output:
            # fetch_next returns None once we run out of pages
            while toots:
                # writes current amount of scraped toots without breaking line
                i = i + len(toots)
                sys.stdout.write('\r%d' % i)
                sys.stdout.flush()
                # keep only public/unlisted toots that aren't CWed and aren't boosts
                filtered_toots = list(filter(lambda x: x['spoiler_text'] == "" and x['reblog'] is None and x['visibility'] in ["public", "unlisted"], toots))
                for toot in filtered_toots:
                    output.write(html_strip_tags(toot['content']) + '\n')
                toots = self.mastodon.fetch_next(toots)
            # buffer is appended to the top of old corpus
            if os.path.exists(corpusfile):
                with open(corpusfile, 'r') as old_corpus:
                    output.write(old_corpus.read())
        directory = os.path.dirname(corpusfile)
        if not os.path.exists(directory):
            os.makedirs(directory)
        os.rename(bufferfile, corpusfile)
        sys.stdout.write('\n')
        sys.stdout.flush()
        return new_since_id

    # returns a markov generated toot
    def generate(self, length=None):
        modelfile = 'model.json'
        if not os.path.exists(modelfile):
            sys.exit('no model -- please scrape first')
        with open(modelfile, 'r') as f:
            reconstituted_model = markovify.Text.from_json(f.read())
        if length:
            msg = reconstituted_model.make_short_sentence(length)
        else:
            msg = reconstituted_model.make_sentence()
        # markovify returns None when it can't build a sentence that passes
        # its originality checks, so pass that through for callers to handle
        if msg is None:
            return None
        return msg.replace(chr(31), "\n")

    # perform a generated toot to mastodon
    @hourly(minute=0)
    def toot(self):
        msg = self.generate(500)
        if msg is None:
            print('failed to generate a toot this time around')
            return
        self.mastodon.toot(msg)
        print('Tooted: %s' % msg)

    # scan all notifications for mentions and reply to them
    @reply
    def post_reply(self, mention, user):
        msg = html_strip_tags(mention["content"])
        rsp = self.generate(400)
        if rsp is None:
            return
        tgt = user["acct"]
        irt = mention["id"]
        vis = mention["visibility"]
        print("Received toot from {}: {}".format(tgt, msg))
        print("Responding with {} visibility: {}".format(vis, rsp))
        final_rsp = "@{} {}".format(tgt, rsp)
        final_rsp = final_rsp[:500]
        self.mastodon.status_post(final_rsp,
                                  in_reply_to_id=irt,
                                  visibility=vis)
--------------------------------------------------------------------------------