├── .gitignore
├── Dockerfile
├── README.md
├── Singularity
├── __init__.py
├── learn.py
├── raw_tweets.txt
├── requirements.txt
├── training_checkpoints
    ├── checkpoint
    ├── ckpt_50.data-00000-of-00001
    └── ckpt_50.index
├── trumpbot.py
└── tweets.txt


/.gitignore:
--------------------------------------------------------------------------------
  1 | # Byte-compiled / optimized / DLL files
  2 | __pycache__/
  3 | *.py[cod]
  4 | *$py.class
  5 | 
  6 | # C extensions
  7 | *.so
  8 | 
  9 | # Containers
 10 | *.sif
 11 | 
 12 | # Distribution / packaging
 13 | .Python
 14 | build/
 15 | develop-eggs/
 16 | dist/
 17 | downloads/
 18 | eggs/
 19 | .eggs/
 20 | lib/
 21 | lib64/
 22 | parts/
 23 | sdist/
 24 | var/
 25 | wheels/
 26 | *.egg-info/
 27 | .installed.cfg
 28 | *.egg
 29 | MANIFEST
 30 | 
 31 | # PyInstaller
 32 | #  Usually these files are written by a python script from a template
 33 | #  before PyInstaller builds the exe, so as to inject date/other infos into it.
 34 | *.manifest
 35 | *.spec
 36 | 
 37 | # Installer logs
 38 | pip-log.txt
 39 | pip-delete-this-directory.txt
 40 | 
 41 | # Unit test / coverage reports
 42 | htmlcov/
 43 | .tox/
 44 | .coverage
 45 | .coverage.*
 46 | .cache
 47 | nosetests.xml
 48 | coverage.xml
 49 | *.cover
 50 | .hypothesis/
 51 | .pytest_cache/
 52 | 
 53 | # Translations
 54 | *.mo
 55 | *.pot
 56 | 
 57 | # Django stuff:
 58 | *.log
 59 | local_settings.py
 60 | db.sqlite3
 61 | 
 62 | # Flask stuff:
 63 | instance/
 64 | .webassets-cache
 65 | 
 66 | # Scrapy stuff:
 67 | .scrapy
 68 | 
 69 | # Sphinx documentation
 70 | docs/_build/
 71 | 
 72 | # PyBuilder
 73 | target/
 74 | 
 75 | # Jupyter Notebook
 76 | .ipynb_checkpoints
 77 | 
 78 | # pyenv
 79 | .python-version
 80 | 
 81 | # celery beat schedule file
 82 | celerybeat-schedule
 83 | 
 84 | # SageMath parsed files
 85 | *.sage.py
 86 | 
 87 | # Environments
 88 | .env
 89 | .venv
 90 | env/
 91 | venv/
 92 | ENV/
 93 | env.bak/
 94 | venv.bak/
 95 | 
 96 | # Spyder project settings
 97 | .spyderproject
 98 | .spyproject
 99 | 
100 | # Rope project settings
101 | .ropeproject
102 | 
103 | # mkdocs documentation
104 | /site
105 | 
106 | # mypy
107 | .mypy_cache/
108 | 


--------------------------------------------------------------------------------
/Dockerfile:
--------------------------------------------------------------------------------
 1 | FROM python:3.7
 2 | 
 3 | # docker build -t trumpbot .
 4 | 
 5 | RUN apt-get update && \
 6 |     apt-get install -y python-numpy && \
 7 |     mkdir -p /code
 8 | 
 9 | ADD . /code
10 | WORKDIR /code
11 | RUN pip install -r requirements.txt
12 | 
13 | ENTRYPOINT ["python"]
14 | CMD ["/code/trumpbot.py"]
15 | 


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
  1 | # Trumpbot v0.1
  2 | Trumpbot was my attempt at creating a RNN trained on Donald Trumps(DT) tweets. I used this as a sort of practice project for learning a bit about RNN's and Tensorflow 2. The result was a chaos and a learning experience so let's dive in.
  3 | 
  4 | ## Run with Containers
  5 | 
  6 | ### Docker
  7 | 
  8 | If you don't want to install dependencies to your host, you can build a Docker container
  9 | with the included Dockerfile:
 10 | 
 11 | ```bash
 12 | $ docker build -t trumpbot .
 13 | ```
 14 | 
 15 | The entrypoint is the script to generate the tweets:
 16 | 
 17 | ```bash
 18 | $ docker run trumpbot
 19 | ...
 20 |  obamas Top and France at 900 PM on FoxNews. Anderson Congratulations to the House vote for MittRomney o
 21 | 
 22 |  hillary Clinton has been a total disaster. I have an idea for her great speech on CNN in the world  a great honor for me and his partisan hotel and every spor
 23 | 
 24 |  friends support Trump International Golf Club on the Paris About that Right School is started by the DNC and Clinton and the DNC that will be a great show with t
 25 | ```
 26 | 
 27 | If you want to interact with the container (perhaps training first) you can shell inside instead:
 28 | 
 29 | ```bash
 30 | $ docker run -it --entrypoint bash trumpbot
 31 | root@b53b98f12c34:/code# ls
 32 | Dockerfile  README.md  __init__.py  learn.py  raw_tweets.txt  requirements.txt	training_checkpoints  trumpbot.py  tweets.txt
 33 | ```
 34 | 
 35 | You'll be in the `/code` directory that contains the source code. 
 36 | 
 37 | ### Singularity
 38 | 
 39 | For users that want to perhaps use GPU (or better leverage the host) the recommendation is to
 40 | use a [Singularity](https://www.sylabs.io/guides/3.2/user-guide/) container, and a recipe file [Singularity](Singularity) is provided
 41 | to build the container.
 42 | 
 43 | ```bash
 44 | $ sudo singularity build trumpbot.sif Singularity
 45 | ```
 46 | 
 47 | And then to run (add the --nv flag if you want to leverage any host libraries).
 48 | 
 49 | ```bash
 50 | $ singularity run trumpbot.sif
 51 | ```
 52 | 
 53 | If you need to change the way that tensorflow or numpy are installed, you can edit the Singularity or Docker recipes.
 54 | 
 55 | ## Setup
 56 | Setup is pretty straightforward. It only needs numpy and tensorflow 2 alpha just run the start pip install:
 57 | 
 58 |     pip3 install -r requirements.txt
 59 | 
 60 | 
 61 | ## Dataset
 62 | 
 63 | The entire dataset was just tweets scraped from the DT twitter account. I used Jefferson Henrique's library [GetOldTweets-python](https://github.com/Jefferson-Henrique/GetOldTweets-python) that I modified a little bit. All the raw tweets can be found in the raw_tweets.txt file FYI all the links in any tweet have been removed.
 64 | 
 65 | The first thing about using Tweets as a dataset for training is that they are filled with garbage that wreaks havoc when training. Heres what I did:
 66 | 
 67 | - Removed any links or urls to photos
 68 | - Simplified all the puncuation, with Trump this is a big thing, his tweets are a clown fiesta of periods and exclemation marks.
 69 | - Cleaned out any invisible or non-english characters, any foreign characters just casuases trouble.
 70 | - Removed the '@' symbol, I'll explain why later.
 71 | - Removed the first couple of months of tweets, they were mostly about the celebrity apprentice and not really core to what I was trying to capture.
 72 | - Removed any retweets or super short @replies
 73 | 
 74 | The final training text is in tweets.txt which altogether is about 20,000 tweets.
 75 | 
 76 | ## Training
 77 | I trained the model twice, the first time for 30 epochs which took around 6 hours. The result was absolute garbage, at the time I hadn't removed hidden or foreign characters so it took 6 hours to spit out complete nonsense. So after I cleaned out the tweets again, I ran the training overnight for 50 epochs this time.
 78 | 
 79 | Just run the learn.py file to train it again if you want, the model check points are stored in the 'training_checkpoints' folder
 80 | 
 81 |     python3 learn.py
 82 | 
 83 | 
 84 | ## Generating Tweets
 85 | So now the fun part, you can run the command:
 86 | 
 87 |     python3 trumpbot.py
 88 | 
 89 | This will generate 10 tweets from a random group of topics. If you open the trumpbot.py file theres a few things you can play with:
 90 | 
 91 |     tweets - Number of messages you want generated
 92 | 
 93 |     temperature - This controls how predictable the tweet will be, by 
 94 |         default its random from 0.1 -> 0.4, anything above about 0.7 generates
 95 |         garbage.
 96 | 
 97 |     talking_points - Is a list of inputs to feed the network, try out 
 98 |         differnt words and see what works.
 99 | 
100 |     num_generate - This controls the length of the message you want to
101 |          get generated.
102 | 
103 | ## Result
104 | For my first crack at text generation Im happy with the results. Here are some sample tweets:
105 | 
106 |     hillary Clinton has been a total disaster. If you cant admit that 
107 |     the U.S. more than has been treated big baster I am a g
108 | 
109 |     Donald Trump is 45% Iran
110 | 
111 |     healthe lobbyist now wants to raise taxes for our country in the 
112 |     first place! If only one thing is clea
113 | 
114 |     friends support Trump Rally Anger Golf Club of Caporate legislation 
115 |     at the WhiteHouse today! #MakeAmericaGreatAgain Thank you for your
116 |      support! #Trump2016 
117 | 
118 |     koreau like you it was great being in the last election then will be
119 |      a great show. I have a fan o
120 | 
121 |     koreau lies and losers and losers will be a great show with the U.S.
122 |      The President has a various past c
123 | 
124 | 
125 | ## What I learned
126 | 
127 | - Tweets make for a tough training set. Things like @ mentions just pollute the hell out of the text so unless you want your bot to be constantly @ing everything I need to find a better way to deal with that.
128 | 
129 | - Things I thought the bot would love talking about stuff like #MAGA, Russia, China, and collusion just generate garbage strings.
130 | 
131 | - Text generation is really hard, and takes a ton of training time. 
132 | 
133 | - I could probably get a bit better results if I let it train a bit longer but for any drastic improvements I probably need to try another method or spend alot more time tuning the training set.
134 | 
135 | - Pick a subject that doesn't tweet like hes a dad yelling at a little league game. I think because his tweets are short little outbursts its hard to generate a predictable pattern across them.
136 | 
137 | - The words it groups together for differnt topics is probably worth looking at, like whenever you use 'hillary' as a input it usually has the words 'liar' or 'disaster' in the sentence. or how it loves telling you when its gonna be on @Foxandfriends
138 | 
139 | - With the method I used spelling its like to add random 'u' infront of words.
140 | 
141 | I feel like this is good starting point, and with some work we might have a digital orange man bot in our future.
142 | 
143 | 
144 | ## :postbox: Contact & Support
145 | 
146 | Created by [Wyatt Ferguson](@wyattxdev@mastodon.social)
147 | 
148 | For any comments or questions message me on [Mastodon](@wyattxdev@mastodon.social)
149 | 
150 | [:coffee: Buy Me A Coffee](https://www.buymeacoffee.com/wyattferguson)
151 | 
152 | 


--------------------------------------------------------------------------------
/Singularity:
--------------------------------------------------------------------------------
 1 | Bootstrap: docker
 2 | From: python:3.7
 3 | 
 4 | # sudo singularity build trumpbot.sif Singularity
 5 | 
 6 | %setup
 7 |     echo "Adding code to container."
 8 |     mkdir -p "${SINGULARITY_ROOTFS}/code"
 9 |     cp -R . "${SINGULARITY_ROOTFS}/code"
10 | 
11 | %post
12 | apt-get update && \
13 |     apt-get install -y python-numpy && \
14 |     mkdir -p /code
15 | 
16 | cd /code
17 | pip install -r requirements.txt
18 | 
19 | %runscript
20 |     exec python /code/trumpbot.py "$@"
21 | 


--------------------------------------------------------------------------------
/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/wyattferguson/trumpbot-rnn/8cdab019b2712b6745b63b30f5bdbe6226ee8d46/__init__.py


--------------------------------------------------------------------------------
/learn.py:
--------------------------------------------------------------------------------
 1 | from __future__ import absolute_import, division, print_function, unicode_literals
 2 | import tensorflow as tf
 3 | import numpy as np
 4 | import os
 5 | 
 6 | 
 7 | def build_model(vocab_size, embedding_dim, rnn_units, batch_size):
 8 |     model = tf.keras.Sequential([
 9 |         tf.keras.layers.Embedding(vocab_size, embedding_dim, batch_input_shape=[batch_size, None]),
10 |         tf.keras.layers.LSTM(rnn_units,
11 |                         return_sequences=True,
12 |                         stateful=True,
13 |                         recurrent_initializer='glorot_uniform'),
14 |         tf.keras.layers.Dense(vocab_size)
15 |     ])
16 |     return model
17 | 
18 | 
19 | def split_input_target(chunk):
20 |     input_text = chunk[:-1]
21 |     target_text = chunk[1:]
22 |     return input_text, target_text
23 | 
24 | 
25 | def loss(labels, logits):
26 |     return tf.keras.losses.sparse_categorical_crossentropy(labels, logits, from_logits=True)
27 | 
28 | 
29 | def parse_training_file():
30 |     training_file = './tweets.txt'
31 | 
32 |     text = open(training_file, 'rb').read().decode(encoding='utf-8')
33 |     vocab = sorted(set(text))
34 |     char2idx = {u:i for i, u in enumerate(vocab)}
35 |     idx2char = np.array(vocab)
36 |     text_as_int = np.array([char2idx[c] for c in text])
37 |     return [len(vocab), char2idx, idx2char, text_as_int]
38 | 
39 | 
40 | def model_train():
41 |     EPOCHS = 50
42 |     BATCH_SIZE = 64
43 |     BUFFER_SIZE = 10000
44 |     EMBEDDING_DIM = 256
45 |     RNN_UNITS = 1024
46 |     SEQ_LENGTH = 140
47 | 
48 |     vocab_len, __, __, text_as_int = parse_training_file()
49 | 
50 |     char_dataset = tf.data.Dataset.from_tensor_slices(text_as_int)
51 |     sequences = char_dataset.batch(SEQ_LENGTH+1, drop_remainder=True)
52 | 
53 |     dataset = sequences.map(split_input_target)
54 |     dataset = dataset.shuffle(BUFFER_SIZE).batch(BATCH_SIZE, drop_remainder=True)
55 | 
56 |     model = build_model(
57 |         vocab_size = vocab_len,
58 |         embedding_dim=EMBEDDING_DIM,
59 |         rnn_units=RNN_UNITS,
60 |         batch_size=BATCH_SIZE)
61 | 
62 |     model.compile(optimizer='adam', loss=loss)
63 |     
64 |     checkpoint_prefix = os.path.join('./training_checkpoints', "ckpt_{epoch}")
65 | 
66 |     checkpoint_callback=tf.keras.callbacks.ModelCheckpoint(
67 |         filepath=checkpoint_prefix,
68 |         save_weights_only=True)
69 | 
70 |     history = model.fit(dataset, epochs=EPOCHS, callbacks=[checkpoint_callback])
71 |     return history
72 | 
73 | 
74 | if __name__ == "__main__":
75 |     model_train()


--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
1 | tensorflow==2.0.0a0
2 | numpy==1.16.2
3 | 


--------------------------------------------------------------------------------
/training_checkpoints/checkpoint:
--------------------------------------------------------------------------------
1 | model_checkpoint_path: "ckpt_50"
2 | all_model_checkpoint_paths: "ckpt_50"
3 | 


--------------------------------------------------------------------------------
/training_checkpoints/ckpt_50.data-00000-of-00001:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/wyattferguson/trumpbot-rnn/8cdab019b2712b6745b63b30f5bdbe6226ee8d46/training_checkpoints/ckpt_50.data-00000-of-00001


--------------------------------------------------------------------------------
/training_checkpoints/ckpt_50.index:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/wyattferguson/trumpbot-rnn/8cdab019b2712b6745b63b30f5bdbe6226ee8d46/training_checkpoints/ckpt_50.index


--------------------------------------------------------------------------------
/trumpbot.py:
--------------------------------------------------------------------------------
 1 | from __future__ import absolute_import, division, print_function, unicode_literals
 2 | import tensorflow as tf
 3 | import numpy as np
 4 | import random
 5 | import learn
 6 | 
 7 | def generate_text(model, start_string):
 8 |     vocab_len, char2idx, idx2char, __ = learn.parse_training_file()
 9 |     # Number of characters to generate
10 |     num_generate = random.randint(70,160)
11 | 
12 |     # Converting our start string to numbers (vectorizing)
13 |     input_eval = [char2idx[s] for s in start_string]
14 |     input_eval = tf.expand_dims(input_eval, 0)
15 | 
16 |     text_generated = []
17 | 
18 |     # Low temperatures results in more predictable text.
19 |     # Higher temperatures results in more surprising text.
20 |     temperature = random.randint(5,30) / 100
21 | 
22 |     # Here batch size == 1
23 |     model.reset_states()
24 |     for i in range(num_generate):
25 |         predictions = model(input_eval)
26 |         # remove the batch dimension
27 |         predictions = tf.squeeze(predictions, 0)
28 | 
29 |         # using a categorical distribution to predict the word returned by the model
30 |         predictions = predictions / temperature
31 |         predicted_id = tf.random.categorical(predictions, num_samples=1)[-1,0].numpy()
32 | 
33 |         # We pass the predicted word as the next input to the model
34 |         # along with the previous hidden state
35 |         input_eval = tf.expand_dims([predicted_id], 0)
36 | 
37 |         text_generated.append(idx2char[predicted_id])
38 | 
39 |     return (start_string + ''.join(text_generated))
40 | 
41 | 
42 | def load_model():
43 |     vocab_len, __, __, __ = learn.parse_training_file()
44 |     embedding_dim = 256
45 |     rnn_units = 1024
46 | 
47 |     model = learn.build_model(vocab_len, embedding_dim, rnn_units, batch_size=1)
48 |     model.load_weights(tf.train.latest_checkpoint('./training_checkpoints'))
49 |     model.build(tf.TensorShape([1, None]))
50 |     return model
51 | 
52 | 
53 | def verbal_magic(tweets=1):
54 |     model = load_model()
55 |     # Give the RNN a jumping off point
56 |     talking_points = ['hillary', 'health', 'obama', 'news', 'friends', 'korea', 'election', 'russia', 'loser']
57 |     
58 |     for i in range(0,tweets):
59 |         start_string = np.random.choice(talking_points)
60 |         tweet = generate_text(model, start_string=start_string)
61 |         print('\n', tweet)
62 |         
63 | 
64 | 
65 | if __name__ == "__main__":
66 |     tweets = 10
67 |     verbal_magic(tweets)
68 | 


--------------------------------------------------------------------------------