├── .gitignore
├── LICENSE.md
├── README.md
├── corpora
│   ├── rhyme_corpus_1000_en.txt
│   ├── rhyme_corpus_en.txt
│   ├── rhyme_en.pickle
│   └── top_5000_en.csv
├── deep_rhyme_detection
│   ├── corpus.py
│   ├── network.py
│   └── rhyme.py
├── models
│   └── rhyme_en.h5
├── requirements.txt
└── test_files
    ├── deck_thyself.txt
    ├── deck_thyself_r.txt
    ├── love_unknown.txt
    └── love_unknown_r.txt

/.gitignore:
--------------------------------------------------------------------------------
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class

# C extensions
*.so

# Distribution / packaging
.Python
env/
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
*.egg-info/
.installed.cfg
*.egg

# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec

# Installer logs
pip-log.txt
pip-delete-this-directory.txt

# Unit test / coverage reports
htmlcov/
.tox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover

# Translations
*.mo
*.pot

# Django stuff:
*.log

# Sphinx documentation
docs/_build/

# PyBuilder
target/

# Project-specific
!models
models/*
!models/rhyme_en.h5
--------------------------------------------------------------------------------
/LICENSE.md:
--------------------------------------------------------------------------------
The MIT License (MIT)

Copyright (c) 2018

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# deep-rhyme-detection

This software detects rhymes and rhyme schemes, especially in poetry or lyrics. It trains and uses a recurrent neural network to predict which parts of a text rhyme with each other. This repository also contains scripts for assembling rhyming corpora and for rendering text with its rhyme scheme marked up.

#### Example

| Input | Output |
| --- | --- |
| `My song is love unknown,` | `/My/ {song} (is) [love] ` |
| `my Savior's love to me;` | `/my/ #Savior's# [love to me;]` |
| `love to the loveless shown,` | `[love to the loveless] ` |
| `that they might lovely be.` | `!that! *might* [be.]` |
| `O who am I, that for my sake,` | `[O who] *am I,* !that! {for} /my/ !sake,!` |
| `my Lord should take frail flesh and die?` | `/my/ {Lord} [should] !take! #frail flesh# /and die?/` |

#### Supported languages

* English
* French _(coming soon)_

## Installation

To install, first clone this repo:

```
cd path/to/desired/install/location
git clone https://github.com/a-coles/deep-rhyme-detection.git
```

Then install the dependencies by running:

```
pip install -r requirements.txt
```

## Use

This repo is designed to be easy to integrate with other projects, with functions and classes sorted into modules. To see how it works, though, you can run the top-level script in the main block of `rhyme.py` like this (a programmatic sketch follows the argument list):

```
cd deep_rhyme_detection/
python rhyme.py [language] [input_file] [output_dir]
```

where the arguments are:

* `language`, the language of the text you want to analyze. This can currently only take the value `english` (one day there will hopefully be support for other languages).
* `input_file`, the path to the file containing the text to analyze. This should be a plain text file where lines in a stanza are separated by line breaks and stanzas are separated by a blank line. Punctuation and the like are fine to include.
* `output_dir`, the path to the directory where the analyzed text will be written to a new text file.
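You can also drive the same pipeline from your own code instead of the command line. Here is a minimal sketch mirroring the main block of `rhyme.py` (paths assume you are running from inside `deep_rhyme_detection/`):

```python
from keras.models import load_model

from network import Corpus
from rhyme import Poem

model = load_model('../models/rhyme_en.h5')        # pretrained English RNN
corpus = Corpus('../corpora/rhyme_corpus_en.txt')  # supplies the character mapping

with open('../test_files/love_unknown.txt') as fp:
    text_lines = fp.readlines()

poem = Poem(text_lines, corpus, model)  # one Stanza per blank-line-separated block
poem.get_rhyme_scheme_text()
print('\n'.join(poem.rhyme_scheme))     # delimited text, one string per stanza
```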
## Other scripts

### Corpus building

This repo comes with a pre-built corpus of English rhyming data, constructed from [WordFrequency's list of the most frequent English words](https://www.wordfrequency.info/free.asp) and the [Datamuse Python API](https://github.com/gmarmstrong/python-datamuse/). The corpus is built by querying Datamuse for words that rhyme with each frequent English word and generating pairs of rhyming and non-rhyming words from the results (a sample of the output format follows the argument list).

If you would like to create a corpus yourself, you can do so like this:

```
cd deep_rhyme_detection/
python corpus.py [language] [top_num]
```

where the arguments are:

* `language`, the language for which you want to build a corpus. This can currently only take the value `english` (one day there will hopefully be support for other languages).
* `top_num`, an integer representing the number of frequent words to use, with a maximum of 5000. A too-large value may introduce too much noise into the data.
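The script appends one comma-separated line per word pair to `corpora/rhyme_corpus_[top_num]_en.txt`, with a final field of `1` for rhyming pairs and `0` for non-rhyming ones. Some illustrative rows (not actual corpus contents):

```
able,cable,1
able,stable,1
able,wandering,0
```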
### Network training

This repo comes with a pretrained RNN that predicts whether two English words rhyme with each other, `rhyme_en.h5` under the `models/` directory. The English network could always use more tuning (pull requests welcome!), but for now, it consists of a 6-layer bidirectional character-level LSTM, each layer with 16 units, using an Adam optimizer and a cross-entropy loss.

If you would like to re-train the RNN with different hyperparameters, open `network.py` in a text editor and adjust their values in its main block as you see fit:

```
# Set network parameters - change these if retraining needed
num_lstm_units = 16
num_epochs = 10
learning_rate = 0.001
batch_size = 4096
```

Then, from the command line, run:

```
python network.py [language] [--preprocessed]
```

where the arguments are:

* `language`, the language of the training corpus. This can currently only take the value `english` (one day there will hopefully be support for other languages).
* `--preprocessed`, a switch you should only set if you have an offline `pickle` of your training corpus in the one-hot format created by the `Corpus` class (sketched below).
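For reference, the `Corpus` class encodes each training pair by padding both words to the length of the longest corpus word, joining them with an `&` tag, integer-coding the result character by character, and finally one-hot encoding it. A toy sketch of the same transformation (the character set and lengths here are illustrative, not the real corpus values):

```python
# Illustrative stand-ins for the values Corpus derives from the real corpus:
chars = sorted(set('unknownshown '))   # distinct characters, space included
char_to_int = {c: i for i, c in enumerate(chars)}
char_to_int['&'] = len(chars)          # '&' is the between-word tag
word_length = 7                        # longest word in this toy corpus

def pad(word):                         # mirrors Corpus.pad_to_length
    return word.ljust(word_length)

seq = pad('unknown') + '&' + pad('shown')   # seq_length = 2 * 7 + 1 = 15
int_seq = [char_to_int[c] for c in seq]
# Corpus.get_onehot then expands int_seq to shape (seq_length, num_chars + 1)
```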
--------------------------------------------------------------------------------
/corpora/rhyme_en.pickle:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/a-coles/deep-rhyme-detection/057938c468c76bd7770d9b026a015b6ac51f1868/corpora/rhyme_en.pickle
--------------------------------------------------------------------------------
/corpora/top_5000_en.csv:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/a-coles/deep-rhyme-detection/057938c468c76bd7770d9b026a015b6ac51f1868/corpora/top_5000_en.csv
--------------------------------------------------------------------------------
/deep_rhyme_detection/corpus.py:
--------------------------------------------------------------------------------
'''
Functions for building a rhyme corpus.
The idea:
- Assemble the (5000) most frequent English words.
  https://www.wordfrequency.info/
- Query a rhyming API to get lists of words that rhyme with these.
- Parse through to create pairs of words that rhyme.
- Create corresponding 'negative lists' to create pairs of words that do not rhyme.
'''

import random
import os
import argparse

from datamuse import datamuse
from tqdm import tqdm


def get_top_words(top_file, top_num=5000):
    with open(top_file, 'r') as fp:
        top_lines = fp.readlines()[:top_num]
    top_words = [x.strip() for x in top_lines]
    return top_words

def get_rhyme_dict(top_words, api=None):
    print('Getting rhyming dictionaries...')
    rhyme_dict = {}
    if not api:
        api = datamuse.Datamuse()
    for word in tqdm(top_words):
        rhymes = api.words(rel_rhy=word, max=20)
        rhymes = [x['word'] for x in rhymes]
        near_rhymes = api.words(rel_nry=word, max=20)
        near_rhymes = [x['word'] for x in near_rhymes]
        all_rhymes = rhymes + near_rhymes
        rhyme_dict[word] = all_rhymes
    return rhyme_dict

def get_neg_rhyme_dict(rhyme_dict):
    print('Getting negative rhyming dictionaries...')
    neg_rhyme_dict = {}
    for word, rhymes in tqdm(rhyme_dict.items()):
        # Sample random rhyme lists until we find one this word does not belong to
        found_non_rhymes = False
        while not found_non_rhymes:
            non_rhymes = random.choice(list(rhyme_dict.values()))
            if word not in non_rhymes:
                neg_rhyme_dict[word] = non_rhymes
                found_non_rhymes = True
    return neg_rhyme_dict

def get_txt(rhyme_dict, output_path, neg=False):
    print('Getting text file...')
    if neg:
        does_rhyme = 0
    else:
        does_rhyme = 1
    if os.path.exists(output_path):
        append_write = 'a'  # Append if already exists
    else:
        append_write = 'w'  # Create file if not
    with open(output_path, append_write) as fp:
        for word, rhymes in tqdm(rhyme_dict.items()):
            for rhyme in rhymes:
                line = '{},{},{}\n'.format(word, rhyme, does_rhyme)
                fp.write(line)


if __name__ == '__main__':
    parser = argparse.ArgumentParser(description='Assemble a rhyming corpus.')
    parser.add_argument('language', help='The language to build corpora for. Can be english.')
    parser.add_argument('top_num', type=int, help='The number of top words to build the corpus with (max 5000).')
    args = parser.parse_args()

    # Set up
    if args.language == 'english':
        api = datamuse.Datamuse()
        top_file = os.path.join('..', 'corpora', 'top_5000_en.csv')
        output_file = os.path.join('..', 'corpora', 'rhyme_corpus_{}_en.txt'.format(args.top_num))
    else:
        raise ValueError('Invalid corpus name: {}. Can be one of: english.'.format(args.language))
    if args.top_num > 5000:
        raise ValueError('Maximum number of top words exceeded: {}. Pick a number between 1 and 5000.'.format(args.top_num))

    # Get rhyming dictionaries
    top_words = get_top_words(top_file, args.top_num)
    rhyme_dict = get_rhyme_dict(top_words, api=api)
    neg_rhyme_dict = get_neg_rhyme_dict(rhyme_dict)

    # Write to text file
    get_txt(rhyme_dict, output_file)
    get_txt(neg_rhyme_dict, output_file, neg=True)
--------------------------------------------------------------------------------
/deep_rhyme_detection/network.py:
--------------------------------------------------------------------------------
'''
Neural network functions
'''
import os
import numpy as np
import pickle
import argparse

from keras.utils import to_categorical
from keras.optimizers import Adam
from keras.models import Sequential
from keras.layers import Dense, LSTM, Bidirectional
from keras.preprocessing.sequence import pad_sequences
from keras.callbacks import EarlyStopping, ModelCheckpoint

from sklearn.model_selection import train_test_split

from tqdm import tqdm

class Corpus:
    def __init__(self, corpus_file):
        self.corpus_file = corpus_file
        self.corpus_dir = os.path.dirname(corpus_file)
        # The language code is the last chunk of the filename, e.g. 'en' in 'rhyme_corpus_en.txt'
        self.language = os.path.splitext(os.path.basename(corpus_file))[0][-2:]  # 'en', 'fr', ...
        self.read_corpus()

    def read_corpus(self):
        with open(self.corpus_file, 'r') as fp:
            corpus_lines = fp.readlines()
        corpus_lines = [x.split(',') for x in corpus_lines]
        self.words1 = [x[0].lower() for x in corpus_lines]
        self.words2 = [x[1].lower() for x in corpus_lines]
        self.labels = [int(x[2]) for x in corpus_lines]  # 1 = rhymes, 0 = does not

        self.words = list(set(self.words1 + self.words2))
        self.concat_words = ' '.join(self.words)
        self.word_length = max(len(max(self.words1, key=len)), len(max(self.words2, key=len)))
        self.seq_length = (self.word_length * 2) + 1
        self.get_char_mapping()

    def get_char_mapping(self):
        '''
        Gets a dictionary mapping each character to a unique integer.
        '''
        chars = sorted(list(set(self.concat_words)))
        char_to_int = dict((c, i) for i, c in enumerate(chars))
        char_to_int['&'] = len(chars)  # Let & be the 'between-word' tag
        self.char_to_int = char_to_int
        self.num_chars = len(char_to_int)

    def get_char_to_int(self, word):
        int_list = []
        for char in word:
            int_list.append(self.char_to_int[char])
        return int_list

    def pad_to_length(self, word):
        while len(word) < self.word_length:
            word += ' '
        return word

    def prepare_pairs(self, verbose=False):
        '''
        Creates pairs of integer-coded padded words by character,
        separated by a '&' integer code.
        '''
        dataX = []
        if verbose:
            loop = tqdm(enumerate(self.words1))
        else:
            loop = enumerate(self.words1)
        for i, word in loop:
            word2 = self.words2[i]
            seq = self.pad_to_length(word) + '&' + self.pad_to_length(word2)
            int_seq = self.get_char_to_int(seq)
            dataX.append(int_seq)
        return dataX

    def get_onehot(self, dataX):
        '''
        Converts the integer-coded representation into a one-hot encoding.
        '''
        dataX = np.array([to_categorical(pad_sequences((data,), self.seq_length), self.num_chars + 1) for data in dataX])
        dataX = np.array([data[0] for data in dataX])
        return dataX

    def prepare_data(self, preprocessed=False):
        if not preprocessed:
            pairs = self.prepare_pairs()
            onehot = self.get_onehot(pairs)
        else:
            pickle_path = os.path.join(self.corpus_dir, 'rhyme_onehot_{}.pickle'.format(self.language))
            with open(pickle_path, 'rb') as jar:  # Pickles must be opened in binary mode
                onehot = pickle.load(jar)
        self.dataX = onehot
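# For example, if the longest corpus word has 14 characters and there are
# 30 distinct character codes (including the '&' tag), prepare_data() yields
# dataX of shape (num_pairs, 29, 31): seq_length = 14 * 2 + 1 = 29 positions,
# each one-hot encoded over num_chars + 1 = 31 classes.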

class Network:
    def __init__(self, corpus, num_lstm_units=8, num_epochs=10, learning_rate=0.01, batch_size=2048):
        self.corpus = corpus
        self.model_dir = os.path.join(corpus.corpus_dir, '..', 'models')

        self.num_lstm_units = num_lstm_units
        self.num_epochs = num_epochs
        self.learning_rate = learning_rate
        self.batch_size = batch_size

        self.train_test_split()

    def train_test_split(self):
        self.X_train, self.X_test, y_train, y_test = train_test_split(
            self.corpus.dataX, self.corpus.labels, test_size=0.2, random_state=42)
        self.y_train = to_categorical(y_train)
        self.y_test = to_categorical(y_test)
        self.input_shape = self.X_train[0].shape

    def build_network_en(self):
        model = Sequential()
        model.add(Bidirectional(LSTM(self.num_lstm_units, return_sequences=True), input_shape=self.input_shape))
        model.add(Bidirectional(LSTM(self.num_lstm_units, return_sequences=True)))
        model.add(Bidirectional(LSTM(self.num_lstm_units, return_sequences=True)))
        model.add(Bidirectional(LSTM(self.num_lstm_units, return_sequences=True)))
        model.add(Bidirectional(LSTM(self.num_lstm_units, return_sequences=True)))
        model.add(Bidirectional(LSTM(self.num_lstm_units)))
        model.add(Dense(2, activation='softmax'))
        adam = Adam(lr=self.learning_rate)
        model.compile(loss='binary_crossentropy', optimizer=adam, metrics=['accuracy'])
        print(model.summary())
        self.model = model

    def train_network_en(self):
        callbacks = [EarlyStopping(monitor='val_loss', patience=2),
                     ModelCheckpoint(filepath=os.path.join(self.model_dir, 'rhyme_en.h5'),
                                     monitor='val_loss', save_best_only=True)]
        self.model.fit(self.X_train, self.y_train,
                       batch_size=self.batch_size, epochs=self.num_epochs,
                       validation_split=0.10, shuffle=True, callbacks=callbacks)


if __name__ == '__main__':
    parser = argparse.ArgumentParser(description='Train a neural network on a rhyming corpus.')
    parser.add_argument('language', help='The language of the rhyming corpus. Can be english.')
    parser.add_argument('--preprocessed', help='Loads a prestored pickle of the data.', action="store_true")
    args = parser.parse_args()

    # Set up the corpus object
    if args.language == 'english':
        corpus_file = os.path.join('..', 'corpora', 'rhyme_corpus_1000_en.txt')
    else:
        raise ValueError('Invalid language: {}. Can be one of: english.'.format(args.language))
    corpus = Corpus(corpus_file)
    corpus.prepare_data(preprocessed=args.preprocessed)

    # Set network parameters - change these if retraining needed
    num_lstm_units = 16
    num_epochs = 10
    learning_rate = 0.001
    batch_size = 4096

    # Build and train network
    network = Network(corpus,
                      num_lstm_units=num_lstm_units, num_epochs=num_epochs,
                      learning_rate=learning_rate, batch_size=batch_size)
    network.build_network_en()
    network.train_network_en()
--------------------------------------------------------------------------------
/deep_rhyme_detection/rhyme.py:
--------------------------------------------------------------------------------
'''Functions for getting the rhyme scheme'''

import argparse
import os
import numpy as np
import string
import itertools

from keras.models import load_model
from keras.utils import to_categorical
from keras.preprocessing.sequence import pad_sequences

from network import Corpus
from tqdm import tqdm


class Stanza:
    def __init__(self, text_lines, corpus, model):
        self.text_lines = text_lines
        self.corpus = corpus
        self.model = model
        self.load_text()

    def load_text(self):
        self.orig_words = [word for line in self.text_lines for word in line.split(' ')]
        self.clean_words = [word.strip().strip(string.punctuation).lower() for word in self.orig_words]

    def get_char_to_int(self, word):
        int_list = [self.corpus.char_to_int[char] for char in word]
        return int_list

    def pad_to_length(self, word):
        while len(word) < self.corpus.word_length:
            word += ' '
        return word

    def prepare_pair(self, word1, word2):
        seq = self.pad_to_length(word1) + '&' + self.pad_to_length(word2)
        int_seq = self.get_char_to_int(seq)
        return int_seq

    def get_onehot(self, seq):
        seq = [seq]
        onehot = np.array([to_categorical(pad_sequences((data,), self.corpus.seq_length), self.corpus.num_chars + 1) for data in seq])
        onehot = np.array([data[0] for data in onehot])
        return onehot

    def get_line_endings(self):
        line_endings = [line.split(' ')[-1] for line in self.text_lines]
        return line_endings

    def does_rhyme(self, word1, word2):
        '''
        Predicts whether a pair of words rhymes.
        '''
        pair = self.prepare_pair(word1, word2)
        onehot = self.get_onehot(pair)
        probs = self.model.predict(onehot)[0]
        if probs[1] >= probs[0]:
            pair_rhymes = True
        else:
            pair_rhymes = False

        return pair_rhymes

    def get_rhyme_scheme(self):
        '''
        Heuristically determines the rhyme scheme this way:
        - For each word, look back at every word preceding it.
        - If the word and a preceding word rhyme, code them together.
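
        For example, for the words [unknown, me, shown, be]: 'unknown' and
        'me' do not rhyme with anything before them, so both keep code 0;
        'shown' rhymes with 'unknown', so both are assigned code 1; 'be'
        rhymes with 'me', so both are assigned code 2.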
        '''
        rhyme_scheme = {}
        for i, word in enumerate(tqdm(self.clean_words)):
            # If we already found the rhyme code for the word, skip it
            if word in rhyme_scheme.keys():
                continue

            rhyme_scheme[word] = 0
            for j in range(i):
                preceding_word = self.clean_words[j]
                if self.does_rhyme(preceding_word, word):
                    if rhyme_scheme[preceding_word] == 0:
                        # Assign the next rhyme code not yet in use
                        num = 1
                        while num in rhyme_scheme.values():
                            num += 1
                        rhyme_scheme[preceding_word] = num
                    rhyme_scheme[word] = rhyme_scheme[preceding_word]

        self.rhyme_scheme = rhyme_scheme

    def get_rhyming_blocks(self):
        '''
        Consolidate the rhyme-coded text into blocks, where each word
        in the block shares the rhyme code. Account for line breaks.
        '''
        self.get_rhyme_scheme()
        coded_words = [(word, self.rhyme_scheme[word]) for word in self.clean_words]
        line_endings = self.get_line_endings()
        rhyme_blocks = []

        def in_ending(word, line_endings):
            word_in_ending = False
            for ending in line_endings:
                if word in ending and '\n' in ending:
                    word_in_ending = True
                    break
            return word_in_ending

        def combine_code(group_list, code):
            combined_words = ' '.join([item[0] for item in group_list])
            combined_code = (combined_words, code)
            return combined_code

        for key, group in itertools.groupby(coded_words, lambda x: x[1]):  # Group by rhyme code
            group_list = list(group)
            if len(group_list) == 1:
                rhyme_blocks.append(group_list[0])
            elif len(group_list) > 1:
                ending_found = False
                for i, word in enumerate(group_list):
                    # This may break if words in endings are repeated throughout - come back
                    if in_ending(word[0], line_endings):
                        before_ending = group_list[:i+1]
                        after_ending = group_list[i+1:]
                        if before_ending:
                            rhyme_blocks.append(combine_code(before_ending, key))
                        if after_ending:
                            rhyme_blocks.append(combine_code(after_ending, key))
                        ending_found = True
                        break
                if not ending_found:
                    rhyme_blocks.append(combine_code(group_list, key))

        return rhyme_blocks

    def scheme_to_orig(self, rhyme_blocks):
        '''
        Back-maps the rhyme scheme to the original text, including formatting
        with punctuation and line breaks.
        '''
        rhyme_block_words = [item[0] for item in rhyme_blocks]
        formatted_blocks = []
        counter = 0
        for i, block in enumerate(rhyme_block_words):
            length = len(block.split(' '))
            formatted_list = self.orig_words[counter:counter+length]
            formatted_string = ' '.join(formatted_list)
            formatted_block = (formatted_string, rhyme_blocks[i][1])
            formatted_blocks.append(formatted_block)
            counter += length
        return formatted_blocks
    def scheme_to_text(self, rhyme_blocks=None, stanza_num=None):
        if stanza_num is not None:  # Stanza numbering starts at 0
            print('Calculating rhyme scheme for stanza {}...'.format(stanza_num + 1))

        delimiters = [['{','}'], ['[',']'], ['(',')'], ['<','>'], ['/','/'],
                      ['`','`'], ['!','!'], ['#','#'], ['$','$'], ['%','%'],
                      ['^','^'], ['&','&'], ['*','*'], ['~','~'], ['?','?']]
        if not rhyme_blocks:
            rhyme_blocks = self.get_rhyming_blocks()

        # Back-map to the formatted (with punctuation and line breaks) original text
        formatted_blocks = self.scheme_to_orig(rhyme_blocks)

        delimited_text = []
        for block in formatted_blocks:
            delimiter = delimiters[block[1]]
            delimited = delimiter[0] + block[0] + delimiter[1]
            # Push newlines after the closing delimiter
            if '\n' in delimited:
                delimited = delimited.replace('\n', '') + '\n'
            delimited_text.append(delimited)
        delimited_string = ' '.join(delimited_text)
        return delimited_string

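# A single stanza can also be rendered directly, without the Poem wrapper.
# A sketch, assuming a corpus and model loaded as in the main block below:
#
#   stanza = Stanza(text_lines, corpus, model)
#   print(stanza.scheme_to_text())
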
class Poem:
    '''
    Poems have multiple stanzas. This class gives functions for handling text with many
    independent stanzas or verses (rhyme schemes start over at each stanza).
    '''
    def __init__(self, full_text_lines, corpus, model):
        self.full_text_lines = full_text_lines
        self.corpus = corpus
        self.model = model
        self.get_stanzas()
        self.rhyme_blocks = None  # To be filled by the functions below as needed

    def get_stanzas(self):
        '''
        Split the original text on empty lines, into stanzas/verses.
        '''
        self.stanzas_list = [list(group) for key, group in itertools.groupby(self.full_text_lines, lambda x: x == '\n') if not key]
        self.stanzas = [Stanza(stanza, self.corpus, self.model) for stanza in self.stanzas_list]

    def get_rhyme_blocks(self):
        self.rhyme_blocks = [stanza.get_rhyming_blocks() for stanza in self.stanzas]

    def get_rhyme_scheme_text(self):
        self.rhyme_scheme = [stanza.scheme_to_text(stanza_num=i) for i, stanza in enumerate(self.stanzas)]

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('language', help='Can be english.')
    parser.add_argument('input_file', help='Path to the input file.')
    parser.add_argument('output_path', help='Path to the output directory.')
    args = parser.parse_args()

    if args.language == 'english':
        model_file = os.path.join('..', 'models', 'rhyme_en.h5')
        corpus_file = os.path.join('..', 'corpora', 'rhyme_corpus_en.txt')
    else:
        raise ValueError('Invalid language: {}. Can be one of: english.'.format(args.language))

    # Import the model
    print('Loading model...')
    model = load_model(model_file)

    # Set up the corpus object
    print('Setting up corpus...')
    corpus = Corpus(corpus_file)

    # Load in the test file
    with open(args.input_file) as fp:
        text_lines = fp.readlines()

    # Get the rhyme scheme
    print('Getting rhyme scheme.')
    poem = Poem(text_lines, corpus, model)
    poem.get_rhyme_scheme_text()
    for stanza in poem.rhyme_scheme:
        print(stanza)

    # Save to the output file
    input_filename = os.path.basename(args.input_file)
    input_split = input_filename.split('.')
    output_filename = input_split[0] + '_r.' + input_split[1]
    output_file = os.path.join(args.output_path, output_filename)
    with open(output_file, 'w') as fp:
        for stanza in poem.rhyme_scheme:
            fp.write(stanza)
            fp.write('\n')
--------------------------------------------------------------------------------
/models/rhyme_en.h5:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/a-coles/deep-rhyme-detection/057938c468c76bd7770d9b026a015b6ac51f1868/models/rhyme_en.h5
--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
absl-py==0.6.1
astor==0.7.1
bleach==3.1.1
certifi==2018.11.29
chardet==3.0.4
cycler==0.10.0
gast==0.2.0
grpcio==1.16.1
h5py==2.8.0
html5lib==0.9999999
idna==2.8
Keras==2.2.4
Keras-Applications==1.0.6
Keras-Preprocessing==1.0.5
kiwisolver==1.0.1
Markdown==3.0.1
matplotlib==3.0.2
numpy==1.15.4
protobuf==3.6.1
pyparsing==2.3.0
python-datamuse==1.2.1
python-dateutil==2.7.5
PyYAML==5.1
requests==2.21.0
scikit-learn==0.20.2
scipy==1.1.0
singledispatch==3.4.0.3
six==1.11.0
sklearn==0.0
tensorboard==1.6.0
tensorflow==1.6.0
termcolor==1.1.0
tqdm==4.28.1
urllib3==1.24.2
Werkzeug==0.14.1
wincertstore==0.2
--------------------------------------------------------------------------------
/test_files/deck_thyself.txt:
--------------------------------------------------------------------------------
Deck thyself, my soul, with gladness,
leave the gloomy haunts of sadness;
come into the daylight's splendour,
there with joy thy praises render
unto him whose grace unbounded
hath this wondrous banquet founded:
high o'er all the heavens he reigneth,
yet to dwell with thee he deigneth.
--------------------------------------------------------------------------------
/test_files/deck_thyself_r.txt:
--------------------------------------------------------------------------------
[Deck thyself,] (my) [soul,] (with gladness,)
(leave the gloomy haunts) [of] (sadness;)
(come into the) [daylight's splendour,]
(there with joy thy praises render)
(unto him whose grace unbounded hath this wondrous banquet founded: high)
[o'er all] (the heavens he reigneth,)
(yet to) [dwell] (with thee he deigneth.)
--------------------------------------------------------------------------------
/test_files/love_unknown.txt:
--------------------------------------------------------------------------------
My song is love unknown,
my Savior's love to me;
love to the loveless shown,
that they might lovely be.
O who am I, that for my sake,
my Lord should take frail flesh and die?

He came from His blest throne
salvation to bestow;
but man made strange, and none
the longed-for Christ would know.
But oh, my Friend, my Friend indeed,
who at my need his life did spend!

Sometimes they strew His way
and His sweet praises sing;
resounding all the day
hosannas to their King.
Then "Crucify!" is all their breath,
and for His death they thirst and cry.

Why, what hath my Lord done?
What makes this rage and spite?
He made the lame to run,
He gave the blind their sight.
Sweet injuries! Yet they at these
themselves displease, and 'gainst him rise.
--------------------------------------------------------------------------------
/test_files/love_unknown_r.txt:
--------------------------------------------------------------------------------
/My/ {song} (is) [love]
/my/ #Savior's# [love to me;]
[love to the loveless]
!that! `might` [be.]
[O who] `am I,` !that! {for} /my/ !sake,!
/my/ {Lord} [should] !take! #frail flesh# /and die?/

(He) [came from His] /throne/
{salvation} (to) (bestow;)
/man/ [made] /strange, and none/
{longed-for}
/my Friend, my Friend/
/my/ [his life did] /spend!/

[Sometimes] (they) /strew/ [His] (way)
(and) [His sweet praises]
{all} `the` (day)
[hosannas] `to` !their!
("Crucify!") `is` {all} !their! `breath,`
(and) !for! [His] `death` (they) `thirst` (and cry.)

[Why,] (what) [my] {Lord} `done?`
(What makes) /this/ [and] /spite?/
/made/ /lame/ `run,`
/gave/ /blind/ {their} /sight./
/Sweet injuries! Yet they at these/
/displease,/ [and] /'gainst/ [him rise.]
--------------------------------------------------------------------------------