├── README.md
├── autoresponder.py
├── blacklist
│   ├── thread_blacklist
│   └── user_blacklist
├── saved_policies
│   └── generic.t7
├── train-from-checkpoint.sh
└── util
    ├── fix_vocab.py
    └── vocab_cleaner.py

/README.md:
--------------------------------------------------------------------------------
# neural-chatbot
Train a neural chatbot to imitate your personality using your messaging history! Never be obligated to respond to your friends on Messenger again; your bot will do it for you and say exactly what you would say! Responses are generated from a char-level language model, implemented with a recurrent neural network policy.

## Getting neural-chatbot

Clone neural-chatbot using:
```bash
$ git clone https://github.com/socketteer/neural-chatbot.git
```

## Requirements

Currently, this code has only been tested on Ubuntu. It is written in Python 3.

Training chatbots requires [char-rnn](https://github.com/karpathy/char-rnn). Install the requirements for char-rnn as directed, then clone char-rnn inside the neural-chatbot folder:

```bash
$ cd neural-chatbot
$ git clone https://github.com/karpathy/char-rnn.git
```

Running the autoresponder script requires the Python module [fbchat](http://fbchat.readthedocs.io/en/master/install.html). You can install it using:

```bash
$ pip install fbchat
```

## Usage

### Preprocessing data

First, you will need to train a bot using your Facebook or Skype data (you can skip this section if you want to use the generic pre-trained bot). For this, you will need [botfood-parser](https://github.com/socketteer/botfood-parser). You do not need to clone it inside the neural-chatbot directory.

#### Getting Facebook data

On Facebook desktop, click the down arrow at the upper right corner of the screen and go to Settings -> General.
Click the link that says "Download a copy of your Facebook data" and then click "Start my archive." It will take anywhere from a few minutes to a few hours for Facebook to compile your data. You should get an email and a notification when it is ready for download.

Download the zip file called facebook-<username>.zip. The "messages" folder in the zip file is the only part you will need.

Create a new folder in botfood-parser/facebook (you may call it your name or your Facebook username). Copy all the html files in "messages" to this folder. Run the fb_parser.py script using:

```bash
$ python fb_parser.py <folder name> <your name>
```

Make sure the name argument is exactly the same as your display name on Facebook, otherwise the script will not parse the data correctly! You will get a warning if it seems like you used the wrong name.

You can use the -g or --groupchat flag to enable parsing data from groupchats. This is not recommended unless you are very active in most of the groupchats you're in.

Running the above command will generate a text file in botfood-parser/corpus called facebook-<name>.txt. This is prepared botfood. Follow the instructions under "Training with char-rnn" to train a bot using your data!

#### Getting Skype data

You will need Skype classic on desktop to export your Skype history. Select Tools -> Options and select Privacy, then Export Chat History. Copy the csv file to botfood-parser/skype.

Run the skype_parser.py script using:

```bash
$ python skype_parser.py <your name>
```

Running the above command will generate a text file in botfood-parser/corpus called skype-<name>.txt. This is prepared botfood. Follow the instructions under "Training with char-rnn" to train a bot using your data!

### Training with char-rnn

Compile all the data you want to feed the bot into a single file and name it input.txt (it is fine to combine parsed Facebook and Skype data).
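For example, a minimal sketch of this step. The folder name mybot and the two demo corpus files are made up for illustration; in practice you would cat the real files that botfood-parser wrote to its corpus directory:

```bash
# Sketch: combine parsed corpora into a single input.txt for char-rnn.
# "mybot" is an arbitrary folder name; the printf lines merely simulate
# parsed corpora -- point the glob at botfood-parser/corpus in practice.
mkdir -p corpus char-rnn/data/mybot
printf 'hi\n>hello\n' > corpus/facebook-demo.txt
printf 'yo\n>hey\n' > corpus/skype-demo.txt
cat corpus/*.txt > char-rnn/data/mybot/input.txt
wc -c < char-rnn/data/mybot/input.txt   # → 18
```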
Create a new folder in neural-chatbot/char-rnn/data and copy input.txt into the folder.

#### Method 1 -- training from scratch

This method works better if you have at least 2MB of data, preferably more. Navigate to the char-rnn directory and run

```bash
$ th train.lua -data_dir data/<folder name>
```

You can play around with flags such as -rnn_size (defaults to 128; if you have more than a couple MB of data I would recommend a size of 300 or more), -num_layers (defaults to 2; you can try 3 or more) and -dropout (try 0.5).

If you are running on CPU, training will take at least a few hours. After training terminates, go to char-rnn/cv and copy the checkpoint with the lowest loss to neural-chatbot/saved_policies. The naming scheme for the files is lm_lstm_epoch<epoch>_<loss>.t7. You can delete the rest of the checkpoints, as they take up a lot of space.

##### "Loss is exploding!"

Do you have a relatively small dataset (< 2 MB)? Try the train-from-generic-checkpoint method. Otherwise, you may be using too large of an rnn size. If training goes on for a few hours before the loss explodes, the saved checkpoints may be good enough to use. Generally, if the checkpoint loss is under 1.3 or so, the policy is likely decent.

#### Method 2 -- training from generic checkpoint

If you have a small amount of data, there may not be enough material for the bot to develop internal representations of English syntax without becoming overfitted. A more suitable method for small datasets is to train a generic bot on a large dataset first, and then initialize a second network with the same weights and train it on the small dataset. This way, the bot can retain the representations it developed from the large dataset while honing its behavior to match the small dataset.

I have provided a generic bot trained on a 50 MB conversational dataset, located at neural-chatbot/saved_policies/generic.t7, and a bash script to automate the training process. Run the script like so:

```bash
$ chmod +x train-from-checkpoint.sh
$ ./train-from-checkpoint.sh <folder name>
```

After training terminates, go to char-rnn/cv and copy the checkpoint with the lowest loss to neural-chatbot/saved_policies. The loss is the number in the filename following the underscore. You can delete the rest of the checkpoints, as they take up a lot of space.

##### "Loss is exploding!"

Try altering the learning rate. Lowering it often helps prevent loss explosion, although sometimes raising it can solve the problem as well.

Note: The train-from-checkpoint method is a lot faster than training from scratch, and it is pretty normal for training to terminate with the loss exploding. As long as the best checkpoint's loss is under 1.3 or so, the policy should be decent.

### Running bots on messenger

Once you've copied your .t7 file to neural-chatbot/saved_policies, you are ready to put your bot online! (I would suggest renaming the .t7 file to something meaningful, especially if you are going to train multiple bots.)

Run the autoresponder script:

```bash
$ python autoresponder.py <.t7 file name> --delay
```

For example, to use the generic bot:

```bash
$ python autoresponder.py generic.t7 --delay
```

You will be prompted for your Facebook username/email and password. When the prompt displays "Listening...", the chatbot is online!
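If you are curious what the bot does with char-rnn's raw sample before sending anything, here is a simplified, standalone sketch of the post-processing (mirroring processOutput in autoresponder.py; the sample string is made up, and this version strips the '>' self-marker from every line rather than only trailing ones):

```python
# Simplified sketch of how raw char-rnn output becomes discrete messages.
def process_output(raw, primetext, maxmessages=3):
    raw = raw.replace(primetext, "", 1)   # drop the echoed conversation context
    raw = raw.split('|')[0]               # keep only the first generated "turn"
    # one message per non-empty line, with the '>' self-marker stripped
    messages = [m.lstrip('>') for m in raw.split('\n') if m]
    return messages[:maxmessages]

sample = "some context\n>hello there\n>how are you?\nnot the bot|garbage"
print(process_output(sample, "some context\n>"))
# → ['hello there', 'how are you?', 'not the bot']
```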

##### Optional arguments:

--verbose: enables verbose logging during chatbot operation

--maxlength [int]: maximum character length of generated responses

--maxmessages [int]: maximum number of generated messages per incoming message

--threadlength [int]: maximum number of messages from each thread saved in memory

--delay: insert a realistic delay before sending messages (recommended)

#### Blacklists

##### Thread blacklist:

Are there some group chats you're in that won't welcome bot spam every time anyone says anything? To prevent inevitable social ostracisation, add the IDs of any serious-business group chats you're in to neural-chatbot/blacklist/thread_blacklist. You can get the ID of any conversation by going to https://www.facebook.com/messages on desktop and opening the conversation. The format of the url is

https://www.facebook.com/messages/t/<thread id>

##### User blacklist:

Unfortunately, even though neural chatbots are trained on your data, they will not necessarily inherit your discretion. So it might be a good idea to blacklist employers, parents, romantic interests, and anyone else you don't want to risk unleashing a predictive text duplicate of yourself upon.

The user blacklist also comes in handy when there are multiple bots in a group chat. By blacklisting the IDs of all bots, you can prevent the bots from responding to each other and creating infinite spam.

Put user IDs you want to blacklist in neural-chatbot/blacklist/user_blacklist.

You can get the ID of any user by going to https://www.facebook.com/messages on desktop and opening your conversation with them.
The format of the url will be

https://www.facebook.com/messages/t/<user id>

If you want to blacklist someone but don't have a conversation with them, go to their page, right click on their profile picture and select "copy link location." The format of the link you will have copied will be

https://www.facebook.com/profile/picture/view/?profile_id=<user id>

--------------------------------------------------------------------------------
/autoresponder.py:
--------------------------------------------------------------------------------
# -*- coding: UTF-8 -*-
from fbchat import Client
from fbchat.models import Message
from getpass import getpass
import re
import subprocess
import sys
import time
import random
import argparse
sys.path.insert(0, './util')
import vocab_cleaner


class FBMessage:
    def __init__(self, text, author_id):
        self.text = text
        self.id = author_id


class FBThread:
    def __init__(self, thread_id, thread_type):
        self.thread_id = thread_id
        self.thread_type = thread_type
        # messages must be an instance attribute; a class-level list
        # would be shared across all threads
        self.messages = []
        self.messages.append(FBMessage("\n", "None"))

    def add_message(self, message, author_id, self_id):
        #let's just take 1 message at a time
        #pop oldest message if more than maxthread messages in thread
        if(len(self.messages) >= maxthread):
            self.messages.pop(0)
        # "\n>" is also added before generating text, but not saved
        if(not self.messages[-1].id == author_id):
            self.messages[-1].text = self.messages[-1].text + "\n"
        if(author_id == self_id):
            message = '>' + message

        #remove non ascii characters
        message = vocab_cleaner.filter_non_ascii(message)
        message = message + '\n'
        self.messages.append(FBMessage(message, author_id))

    def get_messages(self):
        return ''.join(message.text for message in self.messages) #this is what the nn sees

    def print_messages(self):
        print(self.get_messages())


class AutoResponder(Client):
    threads = []

    def onMessage(self, author_id, message_object, thread_id, thread_type, **kwargs): #TODO: thread queue and/or interrupt
        if(thread_id not in thread_blacklist):
            self.markAsDelivered(author_id, thread_id)
            self.markAsRead(author_id)
            sender = self.fetchUserInfo(author_id)[author_id]
            sender_name = sender.name.split(' ')[0]
            #update thread
            current_thread = self.get_thread(thread_id, thread_type)
            #only respond to other people's messages, never our own
            if(author_id != self.uid):
                current_thread.add_message(message_object.text, author_id, self.uid)
                if(verbose):
                    print("Sender name: %s" % sender_name)
                    print("Current thread: %s" % thread_id)
                    print("Thread content:")
                    current_thread.print_messages()
                if(author_id not in user_blacklist):
                    self.send_message(current_thread)

    def send_message(self, thread):
        thread_id = thread.thread_id
        thread_type = thread.thread_type
        messages_input = thread.get_messages() + "\n>"
        response = subprocess.check_output(['th', \
                                            'sample.lua', \
                                            cv, \
                                            '-length', str(maxlength), \
                                            '-verbose', '0', \
                                            '-seed', str(random.randint(0, 300)), \
                                            '-primetext', messages_input], cwd="./char-rnn").decode("utf-8")

        response_list = self.processOutput(response, messages_input, thread)
        #send predicted responses
        if(len(response_list) > maxmessages):
            response_list = response_list[:maxmessages]
        for resp in response_list:
            if(delay):
                time.sleep(0.5 + len(resp)/3)
            thread.add_message(resp, self.uid, self.uid)
            self.send(Message(text=resp), thread_id=thread_id, thread_type=thread_type)

    def get_thread(self, thread_id, thread_type):
        for thread in self.threads:
            #if thread is in list
            if(thread.thread_id == thread_id):
                return thread
        #if thread not found, add new thread
        self.threads.append(FBThread(thread_id, thread_type))
        return self.threads[-1]

    def processOutput(self, response, messages_input, thread):
        #delete redundant thread information from generated response
        response = response.replace(messages_input, "", 1)
        #truncate at the first literal '|' (note: re.split('|', ...) would be
        #wrong here, since an unescaped '|' makes re split on every character)
        response = response.split('|')[0]
        response_list = response.split('\n')

        #removing '>'
        response_list = list(filter(None, response_list))
        for i, msg in enumerate(response_list[1:]):
            if(msg[0] == '>'):
                response_list[i+1] = msg[1:]
        return response_list


parser = argparse.ArgumentParser()
parser.add_argument("policy", metavar='P', help="chatbot policy (.t7 file)")
parser.add_argument("-v", "--verbose", help="display verbose logging during chatbot operation",
                    action="store_true")
parser.add_argument("-l", "--maxlength", type=int, default=400, help="maximum character length of generated responses")
parser.add_argument("-m", "--maxmessages", type=int, default=3, help="maximum number of generated messages per incoming message")
parser.add_argument("-t", "--threadlength", type=int, default=15, help="maximum number of messages from each thread saved in memory")
parser.add_argument("-d", "--delay",
                    help="insert realistic delay before sending messages",
                    action="store_true")
args = parser.parse_args()

cv = '../saved_policies/%s' % args.policy
verbose = args.verbose
maxlength = args.maxlength
maxmessages = args.maxmessages
delay = args.delay
maxthread = args.threadlength

with open('blacklist/user_blacklist', 'r') as f:
    user_blacklist = list(filter(None, f.read().split('\n')))
with open('blacklist/thread_blacklist', 'r') as f:
    thread_blacklist = list(filter(None, f.read().split('\n')))

username = input("Username/email: ")
client = AutoResponder(username, getpass())
client.listen()
--------------------------------------------------------------------------------
/blacklist/thread_blacklist:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/socketteer/neural-chatbot/7fc72908e20df7b2c12583e7109abd7252d02b9e/blacklist/thread_blacklist
--------------------------------------------------------------------------------
/blacklist/user_blacklist:
--------------------------------------------------------------------------------
4
100025264592561
--------------------------------------------------------------------------------
/saved_policies/generic.t7:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/socketteer/neural-chatbot/7fc72908e20df7b2c12583e7109abd7252d02b9e/saved_policies/generic.t7
--------------------------------------------------------------------------------
/train-from-checkpoint.sh:
--------------------------------------------------------------------------------
#!/bin/bash

# argument: name of the data folder under char-rnn/data

# fix vocab size
cd util
python fix_vocab.py "$1"

# train from checkpoint
cd ../char-rnn
th train.lua -data_dir data/"$1" -init_from
../saved_policies/generic.t7 -rnn_size 600 -num_layers 3
--------------------------------------------------------------------------------
/util/fix_vocab.py:
--------------------------------------------------------------------------------
import sys

# Append every character in the generic checkpoint's vocabulary to input.txt,
# so that the new dataset's vocab matches the vocab of the checkpoint that
# training will be initialized from.
filedir = '../char-rnn/data/%s/input.txt' % sys.argv[1]
with open(filedir, 'a') as outfile:
    print('fixing %s/input.txt vocab size...' % sys.argv[1])
    charset = {'र', 'I', 'i', '\n', '[', 'é', 'N', '1', '4', 'W', '♪', 'ö', 'n', '3', 'Ã', 'U', '0', 'k', 'F', 'क', 'ं', 'ã', '∂', 'f', 'd', 'S', '^', 'P', 'न', 'V', 'म', 'å', 'य', ']', 'T', 'E', "'", '(', '£', 'z', 'ñ', 'B', '"', 'í', '™', 'u', '~', 'L', 's', 'Ω', '˜', '{', ' ', 'ज', '\t', 'R', 'p', '/', '\x0c', '\xa0', '—', '“', 'Y', '¿', '-', 'K', '9', 'O', 'j', 'à', 'ग', 'स', 'त', 'g', 'á', 'ß', 'भ', '…', 'H', '‘', ')', 'Ô', '$', '*', 'v', 't', 'ø', 'ो', 'घ', 'ó', '@', '˛', 'D', 'b', 'm', '}', 'y', 'ी', '¬', 'â', 'े', 'छ', 'ƒ', '_', 'ा', 'l', '|', 'x', '’', 'ä', '्', ',', '–', '#', 'c', 'r', '˝', '>', '\\', '+', 'ु', '%', '<', '😈', 'A', '”', '6', 'e', 'C', 'h', '=', '8', '`', '§', ':', 'q', 'आ', 'ú', '!', 'Ÿ', 'o', 'w', 'X', 'ह', 'a', '2', '?', 'G', 'æ', '½', 'Z', 'M', 'ü', 'Q', 'è', 'ê', 'Ç', 'अ', 'च', 'ç', 'व', '.', '5', 'î', ';', 'ई', '∞', 'É', '&', '�', '7', 'J'}
    for char in charset:
        outfile.write(char)
--------------------------------------------------------------------------------
/util/vocab_cleaner.py:
--------------------------------------------------------------------------------
import re

def filter_non_ascii(string):
    string = re.sub('‘', '\'', string)
    string = re.sub('’', '\'', string)
    string = re.sub('“', '\"', string)
    string = re.sub('”', '\"', string)
    string = re.sub('…', '...', string)
    string = re.sub('–', '-', string)
    string = re.sub('☻', ':)', string)
    string = re.sub('😞', ':(', string)
    string = re.sub('😊', ':)', string)
    string = re.sub('❤', '<3', string)
    string = re.sub('😢', ':\'(', string)
    string = re.sub(r'[^\x00-\x7F]+', ' ', string) #replaces any other non-ascii characters with space
    return string
--------------------------------------------------------------------------------
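As a quick, standalone sanity check of what this cleaning does, here is a minimal re-declaration of the filter (only the smart-quote substitutions plus the catch-all, so the snippet runs on its own):

```python
import re

# Minimal sketch of vocab_cleaner.filter_non_ascii: smart punctuation is
# mapped to ASCII equivalents, any remaining non-ASCII run becomes a space.
def filter_non_ascii(string):
    string = re.sub('\u2018', "'", string)   # left single smart quote
    string = re.sub('\u2019', "'", string)   # right single smart quote
    string = re.sub('\u201c', '"', string)   # left double smart quote
    string = re.sub('\u201d', '"', string)   # right double smart quote
    string = re.sub(r'[^\x00-\x7F]+', ' ', string)
    return string

print(filter_non_ascii('“smart quotes”'))   # → "smart quotes"
print(filter_non_ascii('café'))             # → caf (trailing space: é became ' ')
```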