├── .gitignore
├── Dockerfile
├── README.md
├── Singularity
├── __init__.py
├── learn.py
├── raw_tweets.txt
├── requirements.txt
├── training_checkpoints
│   ├── checkpoint
│   ├── ckpt_50.data-00000-of-00001
│   └── ckpt_50.index
├── trumpbot.py
└── tweets.txt

/.gitignore:
--------------------------------------------------------------------------------
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class

# C extensions
*.so

# Containers
*.sif

# Distribution / packaging
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST

# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec

# Installer logs
pip-log.txt
pip-delete-this-directory.txt

# Unit test / coverage reports
htmlcov/
.tox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
.hypothesis/
.pytest_cache/

# Translations
*.mo
*.pot

# Django stuff:
*.log
local_settings.py
db.sqlite3

# Flask stuff:
instance/
.webassets-cache

# Scrapy stuff:
.scrapy

# Sphinx documentation
docs/_build/

# PyBuilder
target/

# Jupyter Notebook
.ipynb_checkpoints

# pyenv
.python-version

# celery beat schedule file
celerybeat-schedule

# SageMath parsed files
*.sage.py

# Environments
.env
.venv
env/
venv/
ENV/
env.bak/
venv.bak/

# Spyder project settings
.spyderproject
.spyproject

# Rope project settings
.ropeproject

# mkdocs documentation
/site

# mypy
.mypy_cache/

--------------------------------------------------------------------------------
/Dockerfile:
--------------------------------------------------------------------------------
FROM python:3.7

# docker build -t trumpbot .

RUN apt-get update && \
    apt-get install -y python-numpy && \
    mkdir -p /code

ADD . /code
WORKDIR /code
RUN pip install -r requirements.txt

ENTRYPOINT ["python"]
CMD ["/code/trumpbot.py"]

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# Trumpbot v0.1
Trumpbot was my attempt at creating an RNN trained on Donald Trump's (DT) tweets. I used this as a practice project for learning a bit about RNNs and TensorFlow 2. The result was chaos and a learning experience, so let's dive in.

## Run with Containers

### Docker

If you don't want to install dependencies on your host, you can build a Docker container
with the included Dockerfile:

```bash
$ docker build -t trumpbot .
```

The entrypoint is the script that generates the tweets:

```bash
$ docker run trumpbot
...
obamas Top and France at 900 PM on FoxNews. Anderson Congratulations to the House vote for MittRomney o

hillary Clinton has been a total disaster. I have an idea for her great speech on CNN in the world a great honor for me and his partisan hotel and every spor

friends support Trump International Golf Club on the Paris About that Right School is started by the DNC and Clinton and the DNC that will be a great show with t
```

If you want to interact with the container (perhaps training first) you can shell inside instead:

```bash
$ docker run -it --entrypoint bash trumpbot
root@b53b98f12c34:/code# ls
Dockerfile  README.md  __init__.py  learn.py  raw_tweets.txt  requirements.txt  training_checkpoints  trumpbot.py  tweets.txt
```

You'll be in the `/code` directory that contains the source code.

### Singularity

For users who want to use a GPU (or otherwise better leverage the host), the recommendation is to
use a [Singularity](https://www.sylabs.io/guides/3.2/user-guide/) container, and a recipe file [Singularity](Singularity) is provided
to build the container:

```bash
$ sudo singularity build trumpbot.sif Singularity
```

Then run it (add the `--nv` flag if you want to leverage any host GPU libraries):

```bash
$ singularity run trumpbot.sif
```

If you need to change the way that TensorFlow or NumPy are installed, you can edit the Singularity or Docker recipes.

## Setup
Setup is pretty straightforward. It only needs NumPy and the TensorFlow 2 alpha; just run the pip install:

    pip3 install -r requirements.txt


## Dataset

The entire dataset was just tweets scraped from the DT Twitter account. I used Jefferson Henrique's library [GetOldTweets-python](https://github.com/Jefferson-Henrique/GetOldTweets-python), which I modified a little bit. All the raw tweets can be found in the raw_tweets.txt file; FYI, all the links in any tweet have been removed.

The first thing about using tweets as a dataset for training is that they are filled with garbage that wreaks havoc when training. Here's what I did (a rough sketch of this kind of cleanup follows the list):

- Removed any links or URLs to photos
- Simplified all the punctuation; with Trump this is a big thing, his tweets are a clown fiesta of periods and exclamation marks.
- Cleaned out any invisible or non-English characters; foreign characters just cause trouble.
- Removed the '@' symbol, I'll explain why later.
- Removed the first couple of months of tweets; they were mostly about Celebrity Apprentice and not really core to what I was trying to capture.
- Removed any retweets or super short @replies

The final training text is in tweets.txt, which altogether is about 20,000 tweets.
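None of this cleanup shipped as a script in the repo, but a minimal sketch of the idea looks something like this (the regexes, the 40-character cutoff, and the RT check are illustrative guesses, not the exact steps I ran):

```python
import re

def clean_tweet(tweet):
    tweet = re.sub(r"https?://\S+|pic\.twitter\.com/\S+", "", tweet)  # strip links and photo URLs
    tweet = tweet.encode("ascii", "ignore").decode()                  # drop invisible/non-English characters
    tweet = tweet.replace("@", "")                                    # remove the '@' symbol
    tweet = re.sub(r"!{2,}", "!", tweet)                              # simplify runs of exclamation marks
    tweet = re.sub(r"\.{3,}", ".", tweet)                             # and runs of periods
    return tweet.strip()

with open("raw_tweets.txt", encoding="utf-8") as raw:
    tweets = [clean_tweet(line) for line in raw]

# Drop retweets and super short @replies
tweets = [t for t in tweets if t and not t.startswith("RT") and len(t) > 40]

with open("tweets.txt", "w", encoding="utf-8") as out:
    out.write("\n".join(tweets))
```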
## Training
I trained the model twice, the first time for 30 epochs, which took around 6 hours. The result was absolute garbage; at the time I hadn't removed hidden or foreign characters, so it took 6 hours to spit out complete nonsense. After I cleaned out the tweets again, I ran the training overnight, for 50 epochs this time.

Just run the learn.py file to train it again if you want; the model checkpoints are stored in the 'training_checkpoints' folder:

    python3 learn.py


## Generating Tweets
So now the fun part, you can run the command:

    python3 trumpbot.py

This will generate 10 tweets from a random group of topics. If you open the trumpbot.py file there are a few things you can play with:

    tweets - Number of messages you want generated

    temperature - This controls how predictable the tweet will be; by
    default it's random from 0.05 -> 0.3, and anything above about 0.7
    generates garbage.

    talking_points - A list of inputs to feed the network; try out
    different words and see what works.

    num_generate - This controls the length of the message you want
    generated.
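To see what temperature actually does, here's a small self-contained demo of the sampling idea trumpbot.py uses (the logits are made up for illustration): dividing the logits by a low temperature sharpens the softmax toward the most likely character, while a high temperature flattens it toward random.

```python
import tensorflow as tf

logits = tf.constant([[2.0, 1.0, 0.1]])  # fake scores over a 3-character vocabulary

for temperature in (0.1, 0.7, 2.0):
    probs = tf.nn.softmax(logits / temperature)
    print(temperature, probs.numpy().round(3))

# 0.1 -> [[1.    0.    0.   ]]   practically always picks the top character
# 2.0 -> [[0.502 0.304 0.194]]   much closer to uniform, more surprising text
```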
## Result
For my first crack at text generation I'm happy with the results. Here are some sample tweets:

    hillary Clinton has been a total disaster. If you cant admit that
    the U.S. more than has been treated big baster I am a g

    Donald Trump is 45% Iran

    healthe lobbyist now wants to raise taxes for our country in the
    first place! If only one thing is clea

    friends support Trump Rally Anger Golf Club of Caporate legislation
    at the WhiteHouse today! #MakeAmericaGreatAgain Thank you for your
    support! #Trump2016

    koreau like you it was great being in the last election then will be
    a great show. I have a fan o

    koreau lies and losers and losers will be a great show with the U.S.
    The President has a various past c


## What I learned

- Tweets make for a tough training set. Things like @ mentions just pollute the hell out of the text, so unless you want your bot to be constantly @ing everything, you need a better way to deal with them.

- Things I thought the bot would love talking about, like #MAGA, Russia, China, and collusion, just generate garbage strings.

- Text generation is really hard, and takes a ton of training time.

- I could probably get slightly better results if I let it train a bit longer, but for any drastic improvement I'd probably need to try another method or spend a lot more time tuning the training set.

- Pick a subject that doesn't tweet like he's a dad yelling at a little league game. I think because his tweets are short little outbursts it's hard to generate a predictable pattern across them.

- The words it groups together for different topics are probably worth looking at; whenever you use 'hillary' as an input it usually puts the words 'liar' or 'disaster' in the sentence, and it loves telling you when it's gonna be on @Foxandfriends.

- With the method I used, the spelling suffers; it likes to add a random 'u' in front of words.

I feel like this is a good starting point, and with some work we might have a digital orange man bot in our future.


## :postbox: Contact & Support

Created by [Wyatt Ferguson](@wyattxdev@mastodon.social)

For any comments or questions message me on [Mastodon](@wyattxdev@mastodon.social)

[:coffee: Buy Me A Coffee](https://www.buymeacoffee.com/wyattferguson)

--------------------------------------------------------------------------------
"${SINGULARITY_ROOTFS}/code" 10 | 11 | %post 12 | apt-get update && \ 13 | apt-get install -y python-numpy && \ 14 | mkdir -p /code 15 | 16 | cd /code 17 | pip install -r requirements.txt 18 | 19 | %runscript 20 | exec python /code/trumpbot.py "$@" 21 | -------------------------------------------------------------------------------- /__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/wyattferguson/trumpbot-rnn/8cdab019b2712b6745b63b30f5bdbe6226ee8d46/__init__.py -------------------------------------------------------------------------------- /learn.py: -------------------------------------------------------------------------------- 1 | from __future__ import absolute_import, division, print_function, unicode_literals 2 | import tensorflow as tf 3 | import numpy as np 4 | import os 5 | 6 | 7 | def build_model(vocab_size, embedding_dim, rnn_units, batch_size): 8 | model = tf.keras.Sequential([ 9 | tf.keras.layers.Embedding(vocab_size, embedding_dim, batch_input_shape=[batch_size, None]), 10 | tf.keras.layers.LSTM(rnn_units, 11 | return_sequences=True, 12 | stateful=True, 13 | recurrent_initializer='glorot_uniform'), 14 | tf.keras.layers.Dense(vocab_size) 15 | ]) 16 | return model 17 | 18 | 19 | def split_input_target(chunk): 20 | input_text = chunk[:-1] 21 | target_text = chunk[1:] 22 | return input_text, target_text 23 | 24 | 25 | def loss(labels, logits): 26 | return tf.keras.losses.sparse_categorical_crossentropy(labels, logits, from_logits=True) 27 | 28 | 29 | def parse_training_file(): 30 | training_file = './tweets.txt' 31 | 32 | text = open(training_file, 'rb').read().decode(encoding='utf-8') 33 | vocab = sorted(set(text)) 34 | char2idx = {u:i for i, u in enumerate(vocab)} 35 | idx2char = np.array(vocab) 36 | text_as_int = np.array([char2idx[c] for c in text]) 37 | return [len(vocab), char2idx, idx2char, text_as_int] 38 | 39 | 40 | def model_train(): 41 | EPOCHS = 50 42 | BATCH_SIZE = 64 43 | BUFFER_SIZE = 10000 44 | EMBEDDING_DIM = 256 45 | RNN_UNITS = 1024 46 | SEQ_LENGTH = 140 47 | 48 | vocab_len, __, __, text_as_int = parse_training_file() 49 | 50 | char_dataset = tf.data.Dataset.from_tensor_slices(text_as_int) 51 | sequences = char_dataset.batch(SEQ_LENGTH+1, drop_remainder=True) 52 | 53 | dataset = sequences.map(split_input_target) 54 | dataset = dataset.shuffle(BUFFER_SIZE).batch(BATCH_SIZE, drop_remainder=True) 55 | 56 | model = build_model( 57 | vocab_size = vocab_len, 58 | embedding_dim=EMBEDDING_DIM, 59 | rnn_units=RNN_UNITS, 60 | batch_size=BATCH_SIZE) 61 | 62 | model.compile(optimizer='adam', loss=loss) 63 | 64 | checkpoint_prefix = os.path.join('./training_checkpoints', "ckpt_{epoch}") 65 | 66 | checkpoint_callback=tf.keras.callbacks.ModelCheckpoint( 67 | filepath=checkpoint_prefix, 68 | save_weights_only=True) 69 | 70 | history = model.fit(dataset, epochs=EPOCHS, callbacks=[checkpoint_callback]) 71 | return history 72 | 73 | 74 | if __name__ == "__main__": 75 | model_train() -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | tensorflow==2.0.0a0 2 | numpy==1.16.2 3 | -------------------------------------------------------------------------------- /training_checkpoints/checkpoint: -------------------------------------------------------------------------------- 1 | model_checkpoint_path: "ckpt_50" 2 | all_model_checkpoint_paths: "ckpt_50" 3 | 
/requirements.txt:
--------------------------------------------------------------------------------
tensorflow==2.0.0a0
numpy==1.16.2

--------------------------------------------------------------------------------
/training_checkpoints/checkpoint:
--------------------------------------------------------------------------------
model_checkpoint_path: "ckpt_50"
all_model_checkpoint_paths: "ckpt_50"

--------------------------------------------------------------------------------
/training_checkpoints/ckpt_50.data-00000-of-00001:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/wyattferguson/trumpbot-rnn/8cdab019b2712b6745b63b30f5bdbe6226ee8d46/training_checkpoints/ckpt_50.data-00000-of-00001

--------------------------------------------------------------------------------
/training_checkpoints/ckpt_50.index:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/wyattferguson/trumpbot-rnn/8cdab019b2712b6745b63b30f5bdbe6226ee8d46/training_checkpoints/ckpt_50.index

--------------------------------------------------------------------------------
/trumpbot.py:
--------------------------------------------------------------------------------
from __future__ import absolute_import, division, print_function, unicode_literals
import tensorflow as tf
import numpy as np
import random
import learn


def generate_text(model, start_string):
    vocab_len, char2idx, idx2char, __ = learn.parse_training_file()
    # Number of characters to generate
    num_generate = random.randint(70, 160)

    # Converting our start string to numbers (vectorizing)
    input_eval = [char2idx[s] for s in start_string]
    input_eval = tf.expand_dims(input_eval, 0)

    text_generated = []

    # Low temperatures result in more predictable text.
    # Higher temperatures result in more surprising text.
    temperature = random.randint(5, 30) / 100

    # Here batch size == 1
    model.reset_states()
    for i in range(num_generate):
        predictions = model(input_eval)
        # remove the batch dimension
        predictions = tf.squeeze(predictions, 0)

        # using a categorical distribution to predict the next character returned by the model
        predictions = predictions / temperature
        predicted_id = tf.random.categorical(predictions, num_samples=1)[-1, 0].numpy()

        # We pass the predicted character as the next input to the model
        # along with the previous hidden state
        input_eval = tf.expand_dims([predicted_id], 0)

        text_generated.append(idx2char[predicted_id])

    return (start_string + ''.join(text_generated))


def load_model():
    vocab_len, __, __, __ = learn.parse_training_file()
    embedding_dim = 256
    rnn_units = 1024

    # Rebuild the model with batch size 1 and restore the trained weights
    model = learn.build_model(vocab_len, embedding_dim, rnn_units, batch_size=1)
    model.load_weights(tf.train.latest_checkpoint('./training_checkpoints'))
    model.build(tf.TensorShape([1, None]))
    return model


def verbal_magic(tweets=1):
    model = load_model()
    # Give the RNN a jumping off point
    talking_points = ['hillary', 'health', 'obama', 'news', 'friends', 'korea', 'election', 'russia', 'loser']

    for i in range(tweets):
        start_string = np.random.choice(talking_points)
        tweet = generate_text(model, start_string=start_string)
        print('\n', tweet)


if __name__ == "__main__":
    tweets = 10
    verbal_magic(tweets)

--------------------------------------------------------------------------------
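If you want to drive the generator from your own script instead of running trumpbot.py directly, the two functions above compose directly (the seed word here is just an example):

```python
import trumpbot

model = trumpbot.load_model()
print(trumpbot.generate_text(model, start_string='russia'))
```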