├── requirements.txt
├── config.cfg
├── LICENSE
├── .gitignore
├── README.md
└── ebooks.py

/requirements.txt:
--------------------------------------------------------------------------------
markovify==0.7.1
Mastodon.py==1.2.2
beautifulsoup4==4.6.0
ananas>=1.0.0b1
--------------------------------------------------------------------------------
/config.cfg:
--------------------------------------------------------------------------------
[EBOOKS]
class = ebooks.ebooksBot
domain =
client_id =
client_secret =
access_token =
exclude_replies = True

--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
The MIT License

Copyright (c) 2018 Erin Pinheiro

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in
all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
THE SOFTWARE.
--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class

# C extensions
*.so

# Distribution / packaging
.Python
env/
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
*.egg-info/
.installed.cfg
*.egg

# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec

# Installer logs
pip-log.txt
pip-delete-this-directory.txt

# Unit test / coverage reports
htmlcov/
.tox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
.hypothesis/

# Translations
*.mo
*.pot

# Django stuff:
*.log
local_settings.py

# Flask stuff:
instance/
.webassets-cache

# Scrapy stuff:
.scrapy

# Sphinx documentation
docs/_build/

# PyBuilder
target/

# Jupyter Notebook
.ipynb_checkpoints

# pyenv
.python-version

# celery beat schedule file
celerybeat-schedule

# SageMath parsed files
*.sage.py

# dotenv
.env

# virtualenv
.venv
venv/
ENV/

# Spyder project settings
.spyderproject
.spyproject

# Rope project settings
.ropeproject

# mkdocs documentation
/site

# mypy
.mypy_cache/
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# ananas-ebooks

Based on [Maple](https://computerfairi.es/@squirrel)'s [mastodon-ebooks.py]().

This simply takes Maple's work and adapts it for the ananas framework, letting you skip the cron jobs and manual posting.

Usage:

1. In the directory where ananas-ebooks is installed, run `pip3 install -r requirements.txt` to make sure you have all the requirements for the bot.
2. Create a Mastodon account to be the ebooks bot.
3. In Mastodon, use your bot account to follow each account you want to scrape. **DO NOT** follow any accounts you don't want to scrape.
4. In Mastodon, under Preferences > Development, create a new app for the bot. You can just call it "Ebooks" and leave "Application Website" blank. Do not change the Redirect URI.
5. On the next screen, click on the name you gave your bot ("Ebooks" or something else). You'll get a client key, client secret, and access token.
   (Alternately, you can use a script [like this](https://gist.github.com/Lana-chan/b0d937968d22eca6dcd79a0524449f1d) to generate user secrets to be used by the ebooks script.)
6. Open `config.cfg`. Make the following changes (a filled-in example appears below):
   1. Change `domain` to whatever your bot's instance is.
   2. Paste your client key after `client_id`, your client secret after `client_secret`, and your access token after `access_token`.
   3. By default, your bot will not scrape toots that are replies to other users. Set `exclude_replies = False` if you want your bot to scrape replies in addition to "raw" toots.
7. You're done! Type `ananas config.cfg` into your command line and watch your bot spring into action. It will start by scraping each account you've followed, then toot once each hour and look for replies so it can toot back.

If you want to re-scrape toots from your followed accounts, just stop the bot (with Ctrl-C) and start it again; it will check for new toots every time it starts up.

## Why do I have to follow everyone I want to scrape? Can't I just give it a list of usernames?

It's an anti-harassment measure. This way, if you want to make an ebooks bot of someone's toots, they're notified that the bot exists.
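
## What should a finished `config.cfg` look like?

Something like this. Every value below is a made-up placeholder, not a real domain or credential; use your bot's own instance and the keys from step 5:

    [EBOOKS]
    class = ebooks.ebooksBot
    domain = example.social
    client_id = 01a2b3c4d5e6f7...
    client_secret = 98z7y6x5w4v3u2...
    access_token = f00ba4quux1234...
    exclude_replies = True

If you'd rather generate the credentials from Python than click through the web UI, Mastodon.py (already in `requirements.txt`) can register the app and log in for you. This is just a sketch of a one-off helper, not a file in this repo; the instance URL, e-mail address, and password are placeholders:

    # get_secrets.py -- hypothetical one-off helper, not part of this repo
    from mastodon import Mastodon

    INSTANCE = "https://example.social"  # your bot's instance

    # Register the application once; returns (client_id, client_secret)
    client_id, client_secret = Mastodon.create_app("Ebooks", api_base_url=INSTANCE)

    # Log in as the bot account to obtain an access token
    # (plain password login won't work if the account has 2FA enabled)
    api = Mastodon(client_id=client_id, client_secret=client_secret,
                   api_base_url=INSTANCE)
    access_token = api.log_in("bot@example.com", "correct horse battery staple")

    print("client_id =", client_id)
    print("client_secret =", client_secret)
    print("access_token =", access_token)

Run it once, then paste the three printed values into `config.cfg`.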

## What if I already have an ananas bot running and I'd like to run them in tandem?

Just copy `ebooks.py` and `requirements.txt` into that bot's directory, and modify that `config.cfg` to have an `[EBOOKS]` section as described in step 6 above. ananas bots play very nicely with each other.

If you've already scraped some accounts and want to keep that data, copy over `accts.json`, `model.json`, and the `corpus` directory too. All that data will stay intact the next time you launch ananas.

## This has been running for a long time, and I've tooted a lot since then. How do I get my new toots into the corpus?

The bot automatically scrapes new toots from its followed accounts once a day at 2:15 AM (to minimize collision with other tasks).

## What if I want my bot to toot at a different time?

I'd love to be able to include this in the `config.cfg` file, but for various technical reasons it's not possible right now. Instead, you'll have to actually open up `ebooks.py`.

In `ebooks.py`, find the `toot` method near the bottom of the file. It's decorated like this:

    @hourly(minute=0)

If you want your bot to post once an hour, but not **on** the hour, change the value of `minute`. For example, if you want your bot to post at 12:15, 1:15, 2:15, etc., change it to `minute=15`.

If you want your bot to post **more** than once an hour, just add another line. For example:

    @hourly(minute=12)
    @hourly(minute=42)
    def toot(self):
        ...

will post twice an hour, at XX:12 and at XX:42.

If you want your bot to post **less** than once an hour, change `@hourly` to `@daily`. By default, this will make it post once a day, but like `@hourly`, you can stack them up. To post every four hours, you might do:

    @daily(hour=0, minute=15)
    @daily(hour=4, minute=15)
    @daily(hour=8, minute=15)
    @daily(hour=12, minute=15)
    @daily(hour=16, minute=15)
    @daily(hour=20, minute=15)
    def toot(self):
        ...

Note that `hour` uses 24-hour time.

You can see more details at the [ananas repo](https://github.com/chr-1x/ananas).
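
## Can I change when the daily scrape happens, too?

Same idea. In `ebooks.py`, the `scrape` method is decorated like this:

    @daily(hour=2, minute=15)

Change `hour` and `minute` (24-hour time again) to move the daily scrape to whatever time suits you.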
--------------------------------------------------------------------------------
/ebooks.py:
--------------------------------------------------------------------------------
#!/usr/bin/env python3
from bs4 import BeautifulSoup
import markovify
import html
import json
import os
import sys
from ananas import PineappleBot, daily, hourly, reply, html_strip_tags

class ebooksBot(PineappleBot):
    exclude_replies = True

    def start(self):
        # config values arrive as strings, and bool("False") is True,
        # so compare against the text instead of casting
        try:
            self.exclude_replies = str(self.config.exclude_replies).lower() != "false"
        except Exception:
            self.exclude_replies = True
        self.scrape()

    # strip html tags for text alone
    # (currently unused: the bot calls ananas's html_strip_tags instead)
    def strip_tags(self, content):
        soup = BeautifulSoup(html.unescape(content), 'html.parser')
        # remove mentions
        tags = soup.select('.mention')
        for i in tags:
            i.extract()
        # clear shortened link captions
        tags = soup.select('.invisible, .ellipsis')
        for i in tags:
            i.unwrap()
        # replace link text to avoid caption breaking
        tags = soup.select('a')
        for i in tags:
            i.replace_with(i.get_text())
        # strip html tags, chr(31) joins text in different html tags
        return soup.get_text(chr(31)).strip()

    # scrapes the accounts the bot is following to build corpus
    @daily(hour=2, minute=15)
    def scrape(self):
        me = self.mastodon.account_verify_credentials()
        following = self.mastodon.account_following(me['id'])
        acctfile = 'accts.json'
        # acctfile contains info on last scraped toot id
        try:
            with open(acctfile, 'r') as f:
                acctjson = json.load(f)
        except (OSError, ValueError):
            acctjson = {}

        print(acctjson)
        for acc in following:
            acct_id = str(acc['id'])
            # pass the last seen toot id (None on the first run) so we
            # only fetch what's new
            acctjson[acct_id] = self.scrape_id(acct_id, since=acctjson.get(acct_id))

        with open(acctfile, 'w') as f:
            json.dump(acctjson, f)

        # generate the whole corpus after scraping so we don't do it at every runtime
        combined_model = None
        for (dirpath, _, filenames) in os.walk("corpus"):
            for filename in filenames:
                with open(os.path.join(dirpath, filename)) as f:
                    model = markovify.NewlineText(f, retain_original=False)
                if combined_model:
                    combined_model = markovify.combine(models=[combined_model, model])
                else:
                    combined_model = model
        if combined_model is None:
            print('no corpus found -- is the bot following anyone?')
            return
        with open('model.json', 'w') as f:
            f.write(combined_model.to_json())

    def scrape_id(self, acct_id, since=None):
        # replies are excluded by default; set exclude_replies = False in
        # config.cfg to scrape them too
        toots = self.mastodon.account_statuses(acct_id, since_id=since,
                                               exclude_replies=self.exclude_replies)
        # if this fails, there are no new toots and we just return the old pointer
        try:
            new_since_id = toots[0]['id']
        except IndexError:
            return since
        bufferfile = 'buffer.txt'
        corpusfile = 'corpus/%s.txt' % acct_id
        i = 0
        with open(bufferfile, 'w') as output:
            # fetch_next returns None once we run out of pages
            while toots:
                # writes current amount of scraped toots without breaking line
                i = i + len(toots)
                sys.stdout.write('\r%d' % i)
                sys.stdout.flush()
                # keep only public/unlisted toots that aren't CWed and aren't boosts
                filtered_toots = list(filter(lambda x: x['spoiler_text'] == "" and x['reblog'] is None and x['visibility'] in ["public", "unlisted"], toots))
                for toot in filtered_toots:
                    output.write(html_strip_tags(toot['content']) + '\n')
                toots = self.mastodon.fetch_next(toots)
            # buffer is appended to the top of old corpus
            if os.path.exists(corpusfile):
                with open(corpusfile, 'r') as old_corpus:
                    output.write(old_corpus.read())
        directory = os.path.dirname(corpusfile)
        if not os.path.exists(directory):
            os.makedirs(directory)
        os.rename(bufferfile, corpusfile)
        sys.stdout.write('\n')
        sys.stdout.flush()
        return new_since_id

    # returns a markov generated toot
    def generate(self, length=None):
        modelfile = 'model.json'
        if not os.path.exists(modelfile):
            sys.exit('no model -- please scrape first')
        with open(modelfile, 'r') as f:
            reconstituted_model = markovify.Text.from_json(f.read())
        if length:
            msg = reconstituted_model.make_short_sentence(length)
        else:
            msg = reconstituted_model.make_sentence()
        # markovify returns None when it can't build a sentence that passes
        # its originality checks, so pass that through for callers to handle
        if msg is None:
            return None
        return msg.replace(chr(31), "\n")

    # perform a generated toot to mastodon
    @hourly(minute=0)
    def toot(self):
        msg = self.generate(500)
        if msg is None:
            print('failed to generate a toot this time around')
            return
        self.mastodon.toot(msg)
        print('Tooted: %s' % msg)

    # scan all notifications for mentions and reply to them
    @reply
    def post_reply(self, mention, user):
        msg = html_strip_tags(mention["content"])
        rsp = self.generate(400)
        if rsp is None:
            return
        tgt = user["acct"]
        irt = mention["id"]
        vis = mention["visibility"]
        print("Received toot from {}: {}".format(tgt, msg))
        print("Responding with {} visibility: {}".format(vis, rsp))
        final_rsp = "@{} {}".format(tgt, rsp)
        final_rsp = final_rsp[:500]
        self.mastodon.status_post(final_rsp,
                                  in_reply_to_id=irt,
                                  visibility=vis)
--------------------------------------------------------------------------------