├── .gitignore ├── build.sh ├── deps.edn ├── enable.txt ├── group1-shard1of1.bin ├── index.html ├── model.edn ├── model.json ├── oldmodel.edn ├── out └── main.js ├── src └── blabrecs │ ├── app.cljs │ ├── markov.cljc │ ├── markov_train.clj │ └── neural.cljs ├── train_cnn.py └── words.txt /.gitignore: -------------------------------------------------------------------------------- 1 | .DS_Store 2 | -------------------------------------------------------------------------------- /build.sh: -------------------------------------------------------------------------------- 1 | #!/bin/sh 2 | clj --main cljs.main --optimizations simple --compile blabrecs.app 3 | -------------------------------------------------------------------------------- /deps.edn: -------------------------------------------------------------------------------- 1 | {:deps {org.clojure/clojurescript {:mvn/version "1.10.758"}}} 2 | -------------------------------------------------------------------------------- /group1-shard1of1.bin: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/mkremins/blabrecs/c011ecc442cba3d5991fba35c2e58894256186bd/group1-shard1of1.bin -------------------------------------------------------------------------------- /index.html: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | 5 | 6 | Blabrecs 7 | 124 | 125 | 126 |
127 | As part of the NeurIPS 2023 Creative AI exhibition, we're anonymously logging the words and definitions that BLABRECS players create. Our favorites will be showcased. If you want, you can opt out of logging or dismiss this notice. 128 |
129 |
130 |

BLABRECS

131 |

it's like scrabble but worse

132 |
133 |
134 | 135 |

can't play that, it's a real word!

136 |
137 | 138 |
139 |
140 | 141 | 142 |
143 | 144 | 145 | 146 | 147 | 148 |
Word | Meaning
149 | 150 |
151 |

What's all this, then?

152 |

BLABRECS is a rules modification for the wordgame SCRABBLE that swaps out the dictionary of real-if-obscure English words for a capricious artificial intelligence. In BLABRECS, real English words aren't allowed! Instead, you have to play nonsense words that sound like English to the AI. These nonsense words are called – you guessed it – BLABRECS.

153 |

How do I play?

154 |

Get together your regular SCRABBLE supplies and pull up this page in a web browser. Then play SCRABBLE as normal, but before you play a word, use the "test a word..." box at the top of the page to check whether the AI will let you play it. Remember, you can only play words that the AI approves!

155 |

When you find a legal word to play, hit the "Play It" button to add this word to the lexicon. As you play, you can write in definitions for all the words you've invented in the "Meaning" textbox next to each word.

156 |

I found a real word that the AI thinks is playable! What should I do?

157 |

This happens often with proper nouns (which SCRABBLE generally disallows) and with inflected forms of base words that are disallowed. I think you should play by the spirit of BLABRECS and not the letter, but what this means is ultimately up to you. Is the game primarily about exploring the vast "shadow English" implied by the statistical distribution of letter sequences, or is it about the inherent absurdity of an external authority presuming to dictate your language to you? Either interpretation seems valid to me.

158 |

How does the AI work?

159 |

The current version of BLABRECS has two AI judges that you can switch between. Source code for both is available if you want to learn more. You can also check out our NeurIPS 2023 paper for a detailed writeup.

160 |

The original judge uses a Markov chain trained on the ENABLE word list used in a number of wordgames. It looks at the statistical patterns of letter sequences in English words and uses this information to determine how likely a sequence of letters is to be a real English word. Then it rejects both real dictionary words and fake words that it deems insufficiently plausible.

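In rough outline (a simplified sketch in the spirit of src/blabrecs/markov.cljc, not a verbatim copy of it), the judge multiplies together the trained transition probability of every letter trigram in a candidate word, then compares that product against a baseline for words of the same length:

(require '[clojure.string :as str])

;; Simplified sketch of the Markov judge. `model` maps two-character
;; prefixes (with ^ and $ marking the start and end of a word) to maps of
;; next-character -> probability, as built by blabrecs.markov/build-model.
(defn word-probability [model word]
  (let [padded   (str "^" (str/lower-case word) "$")
        trigrams (map str/join (partition 3 1 padded))]
    (->> trigrams
         ;; this sketch treats unseen trigrams as probability 0.0
         (map (fn [ngram] (get-in model [(subs ngram 0 2) (subs ngram 2)] 0.0)))
         (apply *))))

;; A candidate passes this judge when its probability beats the baseline
;; for its length and it isn't in the ENABLE word list.

In the shipped code, blabrecs.markov/gen-baseline-probs derives that baseline by raising the model's average transition probability to the power of the word's length.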
161 |

The second judge, contributed by Isaac Karth, uses a convolutional neural network trained on a substantially larger word list. The features that this judge uses to evaluate words are a bit more opaque, but still ultimately statistical in nature.

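Concretely (a minimal ClojureScript sketch of how the app consults this judge; the real code lives in src/blabrecs/neural.cljs and src/blabrecs/app.cljs), a word is mapped to a vector of letter indices, zero-padded to the model's input length of 24, run through the TensorFlow.js model, and accepted when its score clears the 0.82 cutoff used by the app:

;; Sketch only: assumes TensorFlow.js is loaded as js/tf and `model` comes
;; from (js/tf.loadLayersModel "model.json"). The index map is abbreviated
;; from the full tokenizer in blabrecs.neural/vectorize-word.
(def letter->index
  {"e" 2 "i" 3 "s" 4 "a" 5 "n" 6 "o" 7 "r" 8 "t" 9 "l" 10 "c" 11 "u" 12
   "p" 13 "d" 14 "m" 15 "h" 16 "g" 17 "y" 18 "b" 19 "f" 20 "v" 21 "k" 22
   "w" 23 "z" 24 "x" 25 "q" 26 "j" 27})

(defn word->vector [word]
  ;; one index per character, padded with zeros out to 24 entries
  (take 24 (concat (map #(get letter->index % 0) word) (repeat 0))))

(defn neural-accepts? [model word]
  (let [tensor (js/tf.tensor (clj->js [(word->vector word)]) #js [1 24])
        score  (aget (.dataSync (.predict model tensor)) 0)]
    (> score 0.82)))

Switching judges in the interface only flips a :mode flag in the app state: the Markov judge compares a word's probability to its length baseline, while this judge compares the network's score to that fixed cutoff.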
162 |

I might update BLABRECS to provide a wider range of AI "opponents" using a variety of different technologies in the future. Stay tuned!

163 |

Why is it called BLABRECS?

164 |

I generated every possible permutation of the word SCRABBLE and asked an early version of the AI to sort them from least to most statistically likely. BRABLECS won out by a significant margin, beating not only the real word SCRABBLE but also my hand-designed previous title BESCRALB. Then I switched the L and the R because BLABRECS sounds better and I'm not about to let a computer tell me what to do.

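For the curious, the experiment amounted to something like the following hypothetical sketch (none of this code ships with the game; it assumes clojure.math.combinatorics is on the classpath and that a trained model has been read back in from model.edn):

(require '[clojure.math.combinatorics :as combo]
         '[clojure.string :as str])

;; Score a candidate the same way the Markov judge does, treating any
;; unseen trigram as probability 0.0.
(defn score [model word]
  (->> (partition 3 1 (str "^" word "$"))
       (map str/join)
       (map #(get-in model [(subs % 0 2) (subs % 2)] 0.0))
       (apply *)))

;; Rank every distinct rearrangement of a word from most to least plausible.
(defn rank-anagrams [model word]
  (->> (combo/permutations (seq word))
       (map str/join)
       distinct
       (sort-by #(score model %) >)))

;; e.g. (take 5 (rank-anagrams model "scrabble"))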
165 |

Who are you?

166 |

I'm Max Kreminski, an artificial intelligence researcher and game designer. I make a lot of weird stuff with AI; if you want to keep tabs on my work, you can follow me on Twitter. If you enjoyed BLABRECS, you can also leave me a tip via the "Download Now" button on itch.io.

167 |
168 |
169 | 170 | 171 | 172 | 202 | 203 | -------------------------------------------------------------------------------- /model.json: -------------------------------------------------------------------------------- 1 | {"format": "layers-model", "generatedBy": "keras v2.4.0", "convertedBy": "TensorFlow.js Converter v2.8.0", "modelTopology": {"keras_version": "2.4.0", "backend": "tensorflow", "model_config": {"class_name": "Sequential", "config": {"name": "sequential", "layers": [{"class_name": "InputLayer", "config": {"batch_input_shape": [null, 24], "dtype": "float32", "sparse": false, "ragged": false, "name": "embedding_input"}}, {"class_name": "Embedding", "config": {"name": "embedding", "trainable": true, "batch_input_shape": [null, 24], "dtype": "float32", "input_dim": 41, "output_dim": 200, "embeddings_initializer": {"class_name": "RandomUniform", "config": {"minval": -0.05, "maxval": 0.05, "seed": null}}, "embeddings_regularizer": null, "activity_regularizer": null, "embeddings_constraint": null, "mask_zero": false, "input_length": 24}}, {"class_name": "Dropout", "config": {"name": "dropout", "trainable": true, "dtype": "float32", "rate": 0.3, "noise_shape": null, "seed": null}}, {"class_name": "Conv1D", "config": {"name": "conv1d", "trainable": true, "dtype": "float32", "filters": 64, "kernel_size": [3], "strides": [1], "padding": "same", "data_format": "channels_last", "dilation_rate": [1], "groups": 1, "activation": "relu", "use_bias": true, "kernel_initializer": {"class_name": "GlorotUniform", "config": {"seed": null}}, "bias_initializer": {"class_name": "RandomUniform", "config": {"minval": -0.05, "maxval": 0.05, "seed": null}}, "kernel_regularizer": null, "bias_regularizer": null, "activity_regularizer": null, "kernel_constraint": null, "bias_constraint": null}}, {"class_name": "Conv1D", "config": {"name": "conv1d_1", "trainable": true, "dtype": "float32", "filters": 64, "kernel_size": [3], "strides": [1], "padding": "same", "data_format": "channels_last", "dilation_rate": [1], "groups": 1, "activation": "relu", "use_bias": true, "kernel_initializer": {"class_name": "GlorotUniform", "config": {"seed": null}}, "bias_initializer": {"class_name": "RandomUniform", "config": {"minval": -0.05, "maxval": 0.05, "seed": null}}, "kernel_regularizer": null, "bias_regularizer": null, "activity_regularizer": null, "kernel_constraint": null, "bias_constraint": null}}, {"class_name": "MaxPooling1D", "config": {"name": "max_pooling1d", "trainable": true, "dtype": "float32", "strides": [3], "pool_size": [3], "padding": "valid", "data_format": "channels_last"}}, {"class_name": "Dropout", "config": {"name": "dropout_1", "trainable": true, "dtype": "float32", "rate": 0.3, "noise_shape": null, "seed": null}}, {"class_name": "Conv1D", "config": {"name": "conv1d_2", "trainable": true, "dtype": "float32", "filters": 64, "kernel_size": [3], "strides": [1], "padding": "same", "data_format": "channels_last", "dilation_rate": [1], "groups": 1, "activation": "relu", "use_bias": true, "kernel_initializer": {"class_name": "GlorotUniform", "config": {"seed": null}}, "bias_initializer": {"class_name": "RandomUniform", "config": {"minval": -0.05, "maxval": 0.05, "seed": null}}, "kernel_regularizer": null, "bias_regularizer": null, "activity_regularizer": null, "kernel_constraint": null, "bias_constraint": null}}, {"class_name": "Conv1D", "config": {"name": "conv1d_3", "trainable": true, "dtype": "float32", "filters": 64, "kernel_size": [3], "strides": [1], "padding": "same", "data_format": "channels_last", 
"dilation_rate": [1], "groups": 1, "activation": "relu", "use_bias": true, "kernel_initializer": {"class_name": "GlorotUniform", "config": {"seed": null}}, "bias_initializer": {"class_name": "RandomUniform", "config": {"minval": -0.05, "maxval": 0.05, "seed": null}}, "kernel_regularizer": null, "bias_regularizer": null, "activity_regularizer": null, "kernel_constraint": null, "bias_constraint": null}}, {"class_name": "MaxPooling1D", "config": {"name": "max_pooling1d_1", "trainable": true, "dtype": "float32", "strides": [3], "pool_size": [3], "padding": "valid", "data_format": "channels_last"}}, {"class_name": "Conv1D", "config": {"name": "conv1d_4", "trainable": true, "dtype": "float32", "filters": 128, "kernel_size": [3], "strides": [1], "padding": "same", "data_format": "channels_last", "dilation_rate": [1], "groups": 1, "activation": "relu", "use_bias": true, "kernel_initializer": {"class_name": "GlorotUniform", "config": {"seed": null}}, "bias_initializer": {"class_name": "RandomUniform", "config": {"minval": -0.05, "maxval": 0.05, "seed": null}}, "kernel_regularizer": null, "bias_regularizer": null, "activity_regularizer": null, "kernel_constraint": null, "bias_constraint": null}}, {"class_name": "Conv1D", "config": {"name": "conv1d_5", "trainable": true, "dtype": "float32", "filters": 128, "kernel_size": [3], "strides": [1], "padding": "same", "data_format": "channels_last", "dilation_rate": [1], "groups": 1, "activation": "relu", "use_bias": true, "kernel_initializer": {"class_name": "GlorotUniform", "config": {"seed": null}}, "bias_initializer": {"class_name": "RandomUniform", "config": {"minval": -0.05, "maxval": 0.05, "seed": null}}, "kernel_regularizer": null, "bias_regularizer": null, "activity_regularizer": null, "kernel_constraint": null, "bias_constraint": null}}, {"class_name": "GlobalAveragePooling1D", "config": {"name": "global_average_pooling1d", "trainable": true, "dtype": "float32", "data_format": "channels_last"}}, {"class_name": "Dropout", "config": {"name": "dropout_2", "trainable": true, "dtype": "float32", "rate": 0.3, "noise_shape": null, "seed": null}}, {"class_name": "Dense", "config": {"name": "dense", "trainable": true, "dtype": "float32", "units": 1, "activation": "sigmoid", "use_bias": true, "kernel_initializer": {"class_name": "GlorotUniform", "config": {"seed": null}}, "bias_initializer": {"class_name": "Zeros", "config": {}}, "kernel_regularizer": null, "bias_regularizer": null, "activity_regularizer": null, "kernel_constraint": null, "bias_constraint": null}}]}}, "training_config": {"loss": "binary_crossentropy", "metrics": ["acc"], "weighted_metrics": null, "loss_weights": null, "optimizer_config": {"class_name": "Adam", "config": {"name": "Adam", "learning_rate": 0.0010000000474974513, "decay": 0.0, "beta_1": 0.8999999761581421, "beta_2": 0.9990000128746033, "epsilon": 1e-07, "amsgrad": false}}}}, "weightsManifest": [{"paths": ["blabrecs/group1-shard1of1.bin"], "weights": [{"name": "conv1d/kernel", "shape": [3, 200, 64], "dtype": "float32"}, {"name": "conv1d/bias", "shape": [64], "dtype": "float32"}, {"name": "conv1d_1/kernel", "shape": [3, 64, 64], "dtype": "float32"}, {"name": "conv1d_1/bias", "shape": [64], "dtype": "float32"}, {"name": "conv1d_2/kernel", "shape": [3, 64, 64], "dtype": "float32"}, {"name": "conv1d_2/bias", "shape": [64], "dtype": "float32"}, {"name": "conv1d_3/kernel", "shape": [3, 64, 64], "dtype": "float32"}, {"name": "conv1d_3/bias", "shape": [64], "dtype": "float32"}, {"name": "conv1d_4/kernel", "shape": [3, 64, 128], "dtype": 
"float32"}, {"name": "conv1d_4/bias", "shape": [128], "dtype": "float32"}, {"name": "conv1d_5/kernel", "shape": [3, 128, 128], "dtype": "float32"}, {"name": "conv1d_5/bias", "shape": [128], "dtype": "float32"}, {"name": "dense/kernel", "shape": [128, 1], "dtype": "float32"}, {"name": "dense/bias", "shape": [1], "dtype": "float32"}, {"name": "embedding/embeddings", "shape": [41, 200], "dtype": "float32"}]}]} 2 | -------------------------------------------------------------------------------- /src/blabrecs/app.cljs: -------------------------------------------------------------------------------- 1 | (ns blabrecs.app 2 | (:require [blabrecs.markov :as markov] 3 | [blabrecs.neural :as neural] 4 | [clojure.edn :as edn] 5 | [clojure.string :as str])) 6 | 7 | ;;; util 8 | 9 | (defn load-file! [path cb] 10 | (let [req (js/XMLHttpRequest.)] 11 | (.addEventListener req "load" #(this-as this (cb this))) 12 | (.open req "GET" path) 13 | (.send req))) 14 | 15 | ;;; app-specific 16 | 17 | (def app-state 18 | (atom {:mode :markov})) 19 | 20 | (defn sufficiently-probable? [word] 21 | (let [state @app-state] 22 | (cond 23 | (= (:mode state) :markov) 24 | (> (markov/probability (:model state) word) 25 | (get (:baselines state) (count word))) 26 | (= (:mode state) :neural) 27 | (> (neural/probability (:cnn state) word) 0.82) 28 | :else 29 | false))) 30 | 31 | (defn test-word [word] 32 | (let [word (str/trim (str/lower-case word)) 33 | state @app-state] 34 | (cond 35 | (or (and (= (:mode state) :markov) (not (:model state))) 36 | (and (= (:mode state) :neural) (not (:cnn state))) 37 | (not (:words state))) 38 | {:status :empty :msg "hang on a sec, still loading…"} 39 | (= word "") 40 | {:status :empty :msg "type in a word!"} 41 | (re-find #"[^a-z]" word) 42 | {:status :err :msg "hey! letters only!"} 43 | (< (count word) 3) 44 | {:status :err :msg "that's too short to be a word!"} 45 | (> (count word) 15) 46 | {:status :err :msg "that's too long, it won't fit on the board!"} 47 | (contains? (:words state) word) 48 | {:status :err :msg "can't play that, it's in the dictionary!"} 49 | (not (sufficiently-probable? word)) 50 | {:status :err :msg "no way that's a word!"} 51 | (some #(str/includes? word %) (:badwords state)) 52 | {:status :err :msg "no way that's a word!"} 53 | :else 54 | {:status :ok :msg "looks good to me!"}))) 55 | 56 | (defn test-word! [] 57 | (let [word-tester (js/document.getElementById "wordtester") 58 | status-line (js/document.getElementById "wordinfo") 59 | submit-button (js/document.getElementById "playit") 60 | result (test-word (.-value word-tester))] 61 | (set! (.-className status-line) (name (:status result))) 62 | (set! (.-innerText status-line) (:msg result)) 63 | (set! (.-disabled submit-button) (not (= (:status result) :ok))))) 64 | 65 | (defn try-submit-word! [] 66 | (let [word-tester (js/document.getElementById "wordtester") 67 | lexicon-table (js/document.getElementById "lexicon") 68 | word (str/trim (str/lower-case (.-value word-tester))) 69 | result (test-word word)] 70 | (when (= (:status result) :ok) 71 | (set! (.-className lexicon-table) "") ; remove disabled state 72 | (let [row (.insertRow lexicon-table 1) 73 | word-cell (.insertCell row 0) 74 | def-cell (.insertCell row 1) 75 | textarea (js/document.createElement "textarea")] 76 | (set! (.-innerText word-cell) word) 77 | (.appendChild def-cell textarea)) 78 | (set! (.-value word-tester) "") 79 | (test-word!)))) 80 | 81 | ;;; init 82 | 83 | (load-file! 
"model.edn" 84 | (fn [res] 85 | (js/console.log "loaded markov model!") 86 | (let [model (edn/read-string (.-responseText res)) 87 | baselines (markov/gen-baseline-probs model)] 88 | (swap! app-state assoc :model model :baselines baselines) 89 | (test-word!)))) 90 | 91 | (load-file! "enable.txt" 92 | (fn [res] 93 | (js/console.log "loaded words!") 94 | (swap! app-state assoc :words (set (str/split-lines (.-responseText res)))) 95 | (test-word!))) 96 | 97 | (load-file! "https://raw.githubusercontent.com/dariusk/wordfilter/master/lib/badwords.json" 98 | (fn [res] 99 | (js/console.log "loaded badwords!") 100 | (swap! app-state assoc :badwords (js->clj (js/JSON.parse (.-responseText res)))))) 101 | 102 | (let [cnn-promise (js/tf.loadLayersModel "model.json")] 103 | (.then cnn-promise 104 | #(do (js/console.log "loaded tf cnn model!") 105 | (swap! app-state assoc :cnn %)) 106 | #(js/console.log "failed to load tf cnn model!"))) 107 | 108 | (.addEventListener (js/document.getElementById "wordtester") "input" test-word!) 109 | 110 | (.addEventListener (js/document.getElementById "wordtester") "keypress" 111 | #(when (= (.-key %) "Enter") (try-submit-word!))) 112 | 113 | (.addEventListener (js/document.getElementById "playit") "click" try-submit-word!) 114 | 115 | (let [usemarkov (js/document.getElementById "usemarkov") 116 | useneural (js/document.getElementById "useneural")] 117 | (.addEventListener usemarkov "click" 118 | #(do (js/console.log "using markov classifier!") 119 | (set! (.-disabled usemarkov) true) 120 | (set! (.-disabled useneural) false) 121 | (swap! app-state assoc :mode :markov) 122 | (test-word!))) 123 | (.addEventListener useneural "click" 124 | #(do (js/console.log "using neural classifier!") 125 | (set! (.-disabled usemarkov) false) 126 | (set! (.-disabled useneural) true) 127 | (swap! app-state assoc :mode :neural) 128 | (test-word!)))) 129 | 130 | (test-word!) 131 | -------------------------------------------------------------------------------- /src/blabrecs/markov.cljc: -------------------------------------------------------------------------------- 1 | (ns blabrecs.markov 2 | (:require [clojure.string :as str])) 3 | 4 | ;;; common utility functions 5 | 6 | (def ngram-size 3) 7 | 8 | (defn normalize [word] 9 | (str "^" (str/lower-case word) "$")) 10 | 11 | (defn word->ngrams [ngram-size word] 12 | (->> (normalize word) 13 | (partition ngram-size 1) 14 | (map str/join))) 15 | 16 | (defn ngram->path 17 | "Convert an `ngram` to a path into the Markov model." 18 | [ngram] 19 | (let [prefix (subs ngram 0 (dec (count ngram))) 20 | next-char (subs ngram (dec (count ngram)))] 21 | [prefix next-char])) 22 | 23 | ;;; Markov model training 24 | 25 | (defn counts->probs 26 | "Convert a map where keys are items and vals are item counts 27 | to a map where keys are items and vals are item probabilities." 28 | [counts] 29 | (let [total (apply + (vals counts))] 30 | (reduce-kv (fn [probs char char-count] 31 | (assoc probs char (double (/ char-count total)))) 32 | {} counts))) 33 | 34 | (defn process-word 35 | "Add the given `word` to the given Markov `model`." 36 | [model word] 37 | (->> (word->ngrams ngram-size word) 38 | (reduce (fn [model ngram] 39 | (update-in model (ngram->path ngram) (fnil inc 0))) 40 | model))) 41 | 42 | (defn build-model 43 | "Given a seq of `words`, build a Markov model." 
44 | [words] 45 | (->> words 46 | (reduce process-word {}) 47 | (reduce-kv (fn [model ngram counts] 48 | (assoc model ngram (counts->probs counts))) 49 | {}))) 50 | 51 | ;;; set probability baselines for words of different lengths 52 | 53 | (defn avg-transition-prob [model] 54 | (let [all-probs (mapcat vals (vals model))] 55 | (/ (apply + all-probs) (count all-probs)))) 56 | 57 | (defn gen-baseline-probs [model] 58 | (let [avg-prob (avg-transition-prob model)] 59 | (reduce (fn [baselines n] 60 | (assoc baselines n (apply * (repeat n avg-prob)))) 61 | {} (range 25)))) 62 | 63 | ;;; probability calculation via trained Markov model 64 | 65 | (defn probability* 66 | "Given a Markov `model` and a `word`, return a vector whose first item is 67 | the model's total probability for this word and whose second item is 68 | a seq of the model's individual subprobabilities for this word." 69 | [model word] 70 | (let [subprobs (->> (word->ngrams ngram-size word) 71 | (map #(get-in model (ngram->path %)))) 72 | prob (apply * subprobs)] 73 | [prob subprobs])) 74 | 75 | (defn probability 76 | "Given a Markov `model` and a `word`, return the model's total probability 77 | for this word." 78 | [model word] 79 | (first (probability* model word))) 80 | -------------------------------------------------------------------------------- /src/blabrecs/markov_train.clj: -------------------------------------------------------------------------------- 1 | (ns blabrecs.markov-train 2 | (:require [blabrecs.markov :as markov] 3 | [clojure.string :as str])) 4 | 5 | (-> (slurp "enable.txt") 6 | (str/split-lines) 7 | (markov/build-model) 8 | (pr-str) 9 | (#(spit "model.edn" %))) 10 | -------------------------------------------------------------------------------- /src/blabrecs/neural.cljs: -------------------------------------------------------------------------------- 1 | (ns blabrecs.neural 2 | "Functions for obtaining predictions from the TensorFlow.js convolutional 3 | neural network (CNN) classifier, provided by Isaac Karth.") 4 | 5 | (defn vectorize-word 6 | "Convert a string into the tokenized vector that the tensorflow 7 | model can understand." 8 | [word] 9 | (let [tokenizer 10 | {"@" 1,"e" 2,"i" 3,"s" 4,"a" 5,"n" 6,"o" 7,"r" 8,"t" 9, 11 | "l" 10, "c" 11,"u" 12,"p" 13,"d" 14,"m" 15,"h" 16,"g" 17, 12 | "y" 18,"b" 19, "f" 20,"v" 21,"k" 22,"w" 23,"z" 24,"x" 25, 13 | "q" 26,"j" 27, "'" 28, "/" 29, "\"" 30, "1" 31,"0" 32, 14 | "8" 33,"5" 34,"7" 35,"6" 36,"9" 37,"2" 38,"3" 39,"4" 40} 15 | result (partition 24 24 (repeat 0) (map #(get tokenizer % 0) word))] 16 | (first result))) 17 | 18 | (defn probability* 19 | "Given a tensorflow `model` and `words` (as a vector of strings), return the 20 | model's prediction as to whether the strings are English words or not." 21 | [model words] 22 | (let [word-vectors (map vectorize-word words) 23 | word-tensor (js/tf.tensor (clj->js word-vectors) 24 | (apply array [(count words) 24]))] 25 | (.dataSync (.predict model word-tensor {:verbose true})))) 26 | 27 | (defn probability 28 | "Given a tensorflow `model` and a string `word` return the probability of 29 | whether the string is an English word."
30 | [model word] 31 | (first (probability* model [word]))) 32 | -------------------------------------------------------------------------------- /train_cnn.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | """ 3 | CNN training script for BLABRECS 4 | 5 | Training this will require the following public-domain word lists: 6 | YAWL Word list: https://github.com/elasticdog/yawl/blob/master/yawl-0.3.2.03/word.list 7 | Letterpress wordlist: https://github.com/lorenbrichter/Words/blob/master/Words/en.txt 8 | Moby Word list: https://www.gutenberg.org/files/3201/files/SINGLE.TXT 9 | 10 | """ 11 | 12 | # Commented out IPython magic to ensure Python compatibility. 13 | import tensorflow as tf 14 | import numpy as np 15 | import random 16 | import string 17 | import math 18 | 19 | # %load_ext tensorboard 20 | import datetime, os 21 | 22 | from pathlib import Path 23 | 24 | 25 | """For a classification, let's use Sep CNN because that's a reasonable one I found enough information about to reimplement.""" 26 | 27 | # Based on https://developers.google.com/machine-learning/guides/text-classification/step-4 28 | 29 | from tensorflow.python.keras import models 30 | from tensorflow.python.keras import initializers 31 | from tensorflow.python.keras import regularizers 32 | 33 | from tensorflow.python.keras.layers import Dense 34 | from tensorflow.python.keras.layers import Dropout 35 | from tensorflow.python.keras.layers import Embedding 36 | from tensorflow.python.keras.layers import Conv1D 37 | from tensorflow.python.keras.layers import SeparableConv1D 38 | from tensorflow.python.keras.layers import MaxPooling1D 39 | from tensorflow.python.keras.layers import GlobalAveragePooling1D 40 | 41 | def sepcnn_model(blocks, 42 | filters, 43 | kernel_size, 44 | embedding_dim, 45 | dropout_rate, 46 | pool_size, 47 | input_shape, 48 | num_classes, 49 | num_features, 50 | use_pretrained_embedding=False, 51 | is_embedding_trainable=False, 52 | embedding_matrix=None): 53 | """Creates an instance of a separable CNN model. 54 | 55 | # Arguments 56 | blocks: int, number of pairs of sepCNN and pooling blocks in the model. 57 | filters: int, output dimension of the layers. 58 | kernel_size: int, length of the convolution window. 59 | embedding_dim: int, dimension of the embedding vectors. 60 | dropout_rate: float, percentage of input to drop at Dropout layers. 61 | pool_size: int, factor by which to downscale input at MaxPooling layer. 62 | input_shape: tuple, shape of input to the model. 63 | num_classes: int, number of output classes. 64 | num_features: int, number of words (embedding input dimension). 65 | use_pretrained_embedding: bool, true if pre-trained embedding is on. 66 | is_embedding_trainable: bool, true if embedding layer is trainable. 67 | embedding_matrix: dict, dictionary with embedding coefficients. 68 | 69 | # Returns 70 | A sepCNN model instance. 71 | """ 72 | # op_units, op_activation = _get_last_layer_units_and_activation(num_classes) 73 | op_units = 1 74 | op_activation = 'sigmoid' 75 | activation_func = 'relu' 76 | 77 | #op_units = num_classes 78 | #op_activation = 'softmax' 79 | 80 | 81 | model = models.Sequential() 82 | 83 | # Add embedding layer. If pre-trained embedding is used add weights to the 84 | # embeddings layer and set trainable to input is_embedding_trainable flag. 
85 | if use_pretrained_embedding: 86 | model.add(Embedding(input_dim=num_features, 87 | output_dim=embedding_dim, 88 | input_length=input_shape[0], 89 | weights=[embedding_matrix], 90 | trainable=is_embedding_trainable)) 91 | else: 92 | model.add(Embedding(input_dim=num_features, 93 | output_dim=embedding_dim, 94 | input_length=input_shape[0])) 95 | 96 | for _ in range(blocks-1): 97 | model.add(Dropout(rate=dropout_rate)) 98 | model.add(SeparableConv1D(filters=filters, 99 | kernel_size=kernel_size, 100 | activation=activation_func, 101 | bias_initializer='random_uniform', 102 | depthwise_initializer='random_uniform', 103 | padding='same')) 104 | model.add(SeparableConv1D(filters=filters, 105 | kernel_size=kernel_size, 106 | activation=activation_func, 107 | bias_initializer='random_uniform', 108 | depthwise_initializer='random_uniform', 109 | padding='same')) 110 | model.add(MaxPooling1D(pool_size=pool_size)) 111 | 112 | model.add(SeparableConv1D(filters=filters * 2, 113 | kernel_size=kernel_size, 114 | activation=activation_func, 115 | bias_initializer='random_uniform', 116 | depthwise_initializer='random_uniform', 117 | padding='same')) 118 | model.add(SeparableConv1D(filters=filters * 2, 119 | kernel_size=kernel_size, 120 | activation=activation_func, 121 | bias_initializer='random_uniform', 122 | depthwise_initializer='random_uniform', 123 | padding='same')) 124 | model.add(GlobalAveragePooling1D()) 125 | model.add(Dropout(rate=dropout_rate)) 126 | model.add(Dense(op_units, activation=op_activation)) 127 | return model 128 | 129 | 130 | # Tensorflow JS doesn't support SeparableConv1D layers yet, 131 | # so we'll just turn it into a CNN instead of a SepCNN 132 | def non_sepcnn_model(blocks, 133 | filters, 134 | kernel_size, 135 | embedding_dim, 136 | dropout_rate, 137 | pool_size, 138 | input_shape, 139 | num_classes, 140 | num_features, 141 | use_pretrained_embedding=False, 142 | is_embedding_trainable=False, 143 | embedding_matrix=None): 144 | """Creates an instance of a non-separable CNN model. 145 | 146 | # Arguments 147 | blocks: int, number of pairs of sepCNN and pooling blocks in the model. 148 | filters: int, output dimension of the layers. 149 | kernel_size: int, length of the convolution window. 150 | embedding_dim: int, dimension of the embedding vectors. 151 | dropout_rate: float, percentage of input to drop at Dropout layers. 152 | pool_size: int, factor by which to downscale input at MaxPooling layer. 153 | input_shape: tuple, shape of input to the model. 154 | num_classes: int, number of output classes. 155 | num_features: int, number of words (embedding input dimension). 156 | use_pretrained_embedding: bool, true if pre-trained embedding is on. 157 | is_embedding_trainable: bool, true if embedding layer is trainable. 158 | embedding_matrix: dict, dictionary with embedding coefficients. 159 | 160 | # Returns 161 | A sepCNN model instance. 162 | """ 163 | # op_units, op_activation = _get_last_layer_units_and_activation(num_classes) 164 | op_units = 1 165 | op_activation = 'sigmoid' 166 | activation_func = 'relu' 167 | 168 | #op_units = num_classes 169 | #op_activation = 'softmax' 170 | 171 | 172 | model = models.Sequential() 173 | 174 | # Add embedding layer. If pre-trained embedding is used add weights to the 175 | # embeddings layer and set trainable to input is_embedding_trainable flag. 
176 | if use_pretrained_embedding: 177 | model.add(Embedding(input_dim=num_features, 178 | output_dim=embedding_dim, 179 | input_length=input_shape[0], 180 | weights=[embedding_matrix], 181 | trainable=is_embedding_trainable)) 182 | else: 183 | model.add(Embedding(input_dim=num_features, 184 | output_dim=embedding_dim, 185 | input_length=input_shape[0])) 186 | 187 | for _ in range(blocks-1): 188 | model.add(Dropout(rate=dropout_rate)) 189 | model.add(Conv1D(filters=filters, 190 | kernel_size=kernel_size, 191 | activation=activation_func, 192 | bias_initializer='random_uniform', 193 | padding='same')) 194 | model.add(Conv1D(filters=filters, 195 | kernel_size=kernel_size, 196 | activation=activation_func, 197 | bias_initializer='random_uniform', 198 | padding='same')) 199 | model.add(MaxPooling1D(pool_size=pool_size)) 200 | 201 | model.add(Conv1D(filters=filters * 2, 202 | kernel_size=kernel_size, 203 | activation=activation_func, 204 | bias_initializer='random_uniform', 205 | padding='same')) 206 | model.add(Conv1D(filters=filters * 2, 207 | kernel_size=kernel_size, 208 | activation=activation_func, 209 | bias_initializer='random_uniform', 210 | padding='same')) 211 | model.add(GlobalAveragePooling1D()) 212 | model.add(Dropout(rate=dropout_rate)) 213 | model.add(Dense(op_units, activation=op_activation)) 214 | return model 215 | 216 | seed = 6890; 217 | random.seed(seed); 218 | 219 | def loadData(filename): 220 | data = "" 221 | with open(filename, 'r') as f: 222 | data = f.read() 223 | data = data.split("\n") 224 | return [d.lower() for d in data if ((len(d) >= 3) and (len(d) <= 24))] 225 | 226 | 227 | def loadPrelimData(filename): 228 | data = "" 229 | Path(filename).touch() 230 | with open(filename, 'r') as f: 231 | data = f.read() 232 | data = data.split("\n") 233 | data_list = list(filter(None, data)) 234 | assert(len(data_list) > 0) 235 | return data_list 236 | 237 | 238 | 239 | english_length_frequency = {9: 61602, 8: 59066, 10: 57133, 7: 47814, 11: 47480, 12: 36960, 6: 33362, 13: 26716, 14: 18445, 5: 17785, 15: 11902, 4: 7724, 16: 6954, 17: 3996, 3: 2244, 18: 2120, 19: 1109, 20: 532, 21: 236, 22: 104, 23: 46, 24: 25} 240 | elf_probability = [english_length_frequency[n] for n in sorted(english_length_frequency)] 241 | 242 | english_letters = {'e': 467768, 'i': 383297, 's': 357658, 'a': 340401, 'n': 303655, 'o': 303480, 'r': 298272, 't': 282172, 'l': 229025, 'c': 179481, 'u': 153098, 'p': 136546, 'd': 135605, 'm': 126065, 'h': 110164, 'g': 105274, 'y': 78283, 'b': 77576, 'f': 48176, 'v': 39965, 'k': 34437, 'w': 28496, 'z': 18893, 'x': 12318, 'q': 7062, 'j': 6461, "'": 3866, '/': 21, '"': 6, '1': 3, '0': 3, '8': 2, '3': 1, '7': 1, '4': 1, '5': 1, '6': 1, '2': 1, '9': 1} 243 | 244 | elet_chars = list(english_letters.keys()) 245 | elet_frequency = [english_letters[n] for n in english_letters.keys()] 246 | 247 | """Generate a random string of lowercase letters that is between 3 and 24 characters long. There's a slight chance this will still generate an actual dictionary word, so include an optional way to filter those out. 
(Which is slow, so the actual function call below uses sets instead.)""" 248 | 249 | def generateWord(forbid_list, depth=0, random_dist='english_table', letter_dist='random'): 250 | #letters = "abcdefghijklmnopqrstuvwxyz" 251 | word_length = 12 252 | word_length_max = 24 253 | if random_dist == 'biased': 254 | word_length = 3 + math.floor(abs(random.normalvariate(0, 21))) 255 | if random_dist == 'triangle': 256 | word_length = 3 + math.floor(21.0 * abs(random.triangular(0,1,0))) 257 | if random_dist == 'uniform': 258 | word_length = random.randint(3,24) 259 | if random_dist == 'gauss': 260 | word_length = 3 + math.floor(21.0 * abs(random.gauss(0,0.2))) 261 | if random_dist == 'beta': 262 | word_length = 3 + math.floor(21.0 * abs(random.betavariate(1,3))) 263 | if random_dist == 'english_table': 264 | word_length = random.choices(list(sorted(english_length_frequency)), weights=elf_probability)[0] 265 | 266 | gen_word = '' 267 | if letter_dist == 'random': 268 | gen_word = ''.join(random.choice(string.ascii_lowercase) for _ in range(word_length)) 269 | if letter_dist == 'english': 270 | gen_word = ''.join(random.choices(elet_chars, elet_frequency)[0] for _ in range(word_length)) 271 | if None != forbid_list: 272 | if gen_word in forbid_list: 273 | if depth > 4: 274 | print(depth) 275 | gen_word = generateWord(forbid_list, depth+1, random_dist=random_dist, letter_dist=letter_dist) 276 | return gen_word 277 | 278 | """You'd think that generating random pronouncable words would be useful, but this is actually a late addition, so the only thing it's being used for right now is testing the final model.""" 279 | 280 | #!pip install pronounceable 281 | from pronounceable import PronounceableWord, generate_word 282 | 283 | def generatePronounceableWord(forbid_list, depth=0, just_gen = False): 284 | gen_word = PronounceableWord().length(3, 24) 285 | if just_gen: 286 | gen_word = generate_word() 287 | if None != forbid_list: 288 | if gen_word in forbid_list: 289 | if depth > 4: 290 | print(depth) 291 | gen_word = generatePronounceableWord(forbid_list, depth+1) 292 | return gen_word 293 | 294 | [print(generatePronounceableWord(None)) for i in range(10)] 295 | 296 | """OK, here's the big data pre-processing step. Load our word lists, generate some fake words, label them both, etc. 297 | 298 | Later on this should probably get changed to use cross-validation or something. 299 | """ 300 | 301 | def saveTextData(tdata, fname): 302 | with open(fname, "w") as txt_file: 303 | for line in tdata: 304 | txt_file.write(line + "\n") 305 | #txt_file.write(" ".join(line) + "\n") 306 | 307 | from collections import Counter 308 | def wordStats(wlist): 309 | w_lens = [len(a) for a in wlist] 310 | print("Word Lengths:") 311 | print(Counter(w_lens)) 312 | wchars = sum([Counter(a) for a in wlist], Counter()) 313 | print("Character Frequency:") 314 | print(wchars) 315 | 316 | def makeUpSomeWords(random_dist='english_table', char_list='random'): 317 | 318 | seed = 26890 319 | 320 | data_size = 336000 # size for training 321 | validation_size = 84000 # size for validation 322 | test_data_size = 20000 # size for testing afterwards 323 | fake_words_multiplier = 6 # I'm not sure that it's a good idea to have so much more false examples compared to real examples, but it is more data... 
324 | 325 | # YAWL Word list: yawl-0.3.2.03/word.list 326 | wordlist_1 = loadData("word.list") 327 | # Letterpress wordlist: Words/en.txt 328 | wordlist_2 = loadData("letterpress_en.txt") 329 | # Moby Word list: https://www.gutenberg.org/files/3201/files/SINGLE.TXT 330 | wordlist_3 = loadData("SINGLE.TXT") 331 | 332 | print("Loaded Words") 333 | 334 | wordlist = list(set(wordlist_1 + wordlist_2 + wordlist_3)) 335 | 336 | print("Unique-ify Words") 337 | 338 | random.seed(seed) 339 | random.shuffle(wordlist) 340 | print("Wordlist shuffled: " + str(len(wordlist))) 341 | print(f"Using {(data_size + validation_size + test_data_size)} words.") 342 | print("Data ratio: " + str((data_size + validation_size + test_data_size) / len(wordlist))) 343 | print(wordlist[:100]) 344 | 345 | wordStats(wordlist) 346 | 347 | print("Making up some words...") 348 | fakewords = [generateWord(None, random_dist=random_dist, letter_dist=char_list) for n in range(data_size * fake_words_multiplier)] 349 | print("Fake words!") 350 | morefakewords = [generateWord(None, random_dist=random_dist, letter_dist=char_list) for n in range(validation_size * fake_words_multiplier)] 351 | print("More fake words!") 352 | evenmorefakewords = [generateWord(None, random_dist=random_dist, letter_dist='english') for n in range(test_data_size)] 353 | print("Even more fake words!") 354 | print("Words generated: " + str(len(fakewords) + len(morefakewords) + len(evenmorefakewords))) 355 | 356 | fake_lengths = [len(fakewords), len(morefakewords), len(evenmorefakewords)] 357 | print(fake_lengths) 358 | print("uniquify generated words...") 359 | fakewords = list(set(fakewords) - set(wordlist)) 360 | morefakewords = list(set(morefakewords) - set(wordlist)) 361 | evenmorefakewords = list(set(evenmorefakewords) - set(wordlist)) 362 | print("...done. 
Removed words:") 363 | print(f"1: {fake_lengths[0] - len(fakewords)}") 364 | print(f"2: {fake_lengths[1] - len(morefakewords)}") 365 | print(f"3: {fake_lengths[2] - len(evenmorefakewords)}") 366 | print([len(fakewords), len(morefakewords), len(evenmorefakewords)]) 367 | 368 | train_data = wordlist[:data_size] + fakewords 369 | train_labels = [True for n in range(data_size)] + [False for n in fakewords] 370 | valid_data = wordlist[data_size:data_size + validation_size] + morefakewords 371 | valid_labels = [True for n in range(validation_size)] + [False for n in morefakewords] 372 | test_data = wordlist[data_size + validation_size:data_size + validation_size + test_data_size] + evenmorefakewords 373 | test_labels = [True for n in range(test_data_size)] + [False for n in evenmorefakewords] 374 | 375 | print("Labels made") 376 | 377 | seed = 26890 378 | random.seed(seed) 379 | random.shuffle(train_data) 380 | random.seed(seed) 381 | random.shuffle(train_labels) 382 | random.seed(seed) 383 | random.shuffle(test_data) 384 | random.seed(seed) 385 | random.shuffle(test_labels) 386 | random.seed(seed) 387 | random.shuffle(valid_data) 388 | random.seed(seed) 389 | random.shuffle(valid_labels) 390 | 391 | print("Datasets shuffled") 392 | 393 | train_dataset = [train_data, np.array(train_labels, dtype=bool)] 394 | valid_dataset = [valid_data, np.array(valid_labels, dtype=bool)] 395 | test_dataset = [test_data, np.array(test_labels, dtype=bool)] 396 | 397 | 398 | 399 | saveTextData(train_data, f"data_training_{random_dist}_{char_list}.txt") 400 | saveTextData(valid_data, f"data_validation_{random_dist}_{char_list}.txt") 401 | saveTextData(test_data, f"data_testing_{random_dist}_{char_list}.txt") 402 | np.savetxt(f"data_labels_train_{random_dist}_{char_list}.txt", train_dataset[1]) 403 | np.savetxt(f"data_labels_valid_{random_dist}_{char_list}.txt", valid_dataset[1]) 404 | np.savetxt(f"data_labels_test_{random_dist}_{char_list}.txt", test_dataset[1]) 405 | 406 | print("Datsets written") 407 | 408 | 409 | print(f"Loading data_training_{random_dist}_{char_list}.txt") 410 | l_train_data = loadPrelimData(f"data_training_{random_dist}_{char_list}.txt") 411 | print([len(l_train_data), len(train_data)]) 412 | match_sum = sum([l_train_data[i] == train_data[i] for i in range(len(l_train_data))]) 413 | print(match_sum) 414 | print(len(train_data)) 415 | assert(match_sum == len(train_data)) 416 | 417 | dist_type = 'english_table' 418 | letter_dist = 'english' 419 | 420 | #dist_type = 'triangle' 421 | #letter_dist = 'random' 422 | 423 | generate_new_words = True 424 | if generate_new_words: 425 | makeUpSomeWords(random_dist = dist_type, char_list = letter_dist) 426 | 427 | """Because the pre-processing can take a while, we save it to disk above and then reload it here. 
(It's better to have the save-and-load process run all of the time so we can make sure it behaves identically in either case.)""" 428 | 429 | l_train_data = loadPrelimData(f"data_training_{dist_type}_{letter_dist}.txt") 430 | l_train_labels = [(i[0] == '1') for i in loadPrelimData(f"data_labels_train_{dist_type}_{letter_dist}.txt")] 431 | l_valid_data = loadPrelimData(f"data_validation_{dist_type}_{letter_dist}.txt") 432 | l_valid_labels = [(i[0] == '1') for i in loadPrelimData(f"data_labels_valid_{dist_type}_{letter_dist}.txt")] 433 | l_test_data = loadPrelimData(f"data_testing_{dist_type}_{letter_dist}.txt") 434 | l_test_labels = [(i[0] == '1') for i in loadPrelimData(f"data_labels_test_{dist_type}_{letter_dist}.txt")] 435 | 436 | train_dataset = [l_train_data, np.array(l_train_labels, dtype=bool)] 437 | valid_dataset = [l_valid_data, np.array(l_valid_labels, dtype=bool)] 438 | test_dataset = [l_test_data, np.array(l_test_labels, dtype=bool)] 439 | 440 | from tensorflow.python.keras.preprocessing import sequence 441 | from tensorflow.python.keras.preprocessing import text 442 | 443 | TOKEN_MODE = 'char' 444 | TOP_K = 36 445 | MAX_WORD_LENGTH = 24 446 | 447 | def vectorize_data(training_text, validation_text, test_text): 448 | glyphs = " abcdefghijklmnopqrstuvwxyz" 449 | #trn = [' '.join([j for j in i]) for i in training_text] 450 | #val = [' '.join([j for j in i]) for i in validation_text] 451 | 452 | tokenizer = text.Tokenizer(lower=True, char_level=True, oov_token='@') 453 | tokenizer.fit_on_texts(training_text + validation_text + test_text) 454 | 455 | train = tokenizer.texts_to_sequences(training_text) 456 | validate = tokenizer.texts_to_sequences(validation_text) 457 | testing = tokenizer.texts_to_sequences(test_text) 458 | glyph_dictionary = tokenizer.word_index 459 | train = sequence.pad_sequences(train, maxlen=MAX_WORD_LENGTH, padding='post') 460 | validate = sequence.pad_sequences(validate, maxlen=MAX_WORD_LENGTH, padding='post') 461 | testing = sequence.pad_sequences(testing, maxlen=MAX_WORD_LENGTH, padding='post') 462 | return train, validate, testing, glyph_dictionary, tokenizer 463 | 464 | #[' '.join([j for j in i]) for i in ["test", "strings to process"]] 465 | 466 | #vectorize_data(["twenty one", "thirty two", "three"], ["able alpha", "baker beta", "charlie gamma"], ["test"]) 467 | 468 | train, valid, test, character_index, character_tokenizer = vectorize_data(train_dataset[0], valid_dataset[0], test_dataset[0]) 469 | print(character_index) 470 | print(len(character_index)) 471 | 472 | with open(f"tokenizer_{dist_type}_{letter_dist}.txt", "w") as f: 473 | f.write(str(character_index)) 474 | 475 | def train_model(model_name = "spell_words", 476 | blocks = 3, 477 | filters = 64, 478 | dropout_rate = 0.3, 479 | embedding_dim = 200, 480 | kernel_size = 3, 481 | pool_size = 3, 482 | epochs = 250, 483 | batch_size = 512, 484 | patience=15, 485 | loss = 'binary_crossentropy', 486 | learning_rate = 1e-3): 487 | num_classes = 2, # binary classification 488 | num_features = len(character_index) + 1 # maximum number of letters 489 | batch_size = batch_size# * (64) 490 | 491 | model = non_sepcnn_model(blocks=blocks, 492 | filters=filters, 493 | kernel_size=kernel_size, 494 | embedding_dim=embedding_dim, 495 | dropout_rate=dropout_rate, 496 | pool_size=pool_size, 497 | input_shape=train.shape[1:], 498 | num_classes=num_classes, 499 | num_features=num_features) 500 | 501 | 502 | 503 | optimizer = tf.keras.optimizers.Adam(learning_rate=learning_rate) 504 | 
model.compile(optimizer=optimizer, loss=loss, metrics=['acc']) 505 | 506 | try: 507 | os.mkdir("logs") 508 | except FileExistsError: 509 | pass 510 | logdir = os.path.join("logs", datetime.datetime.now().strftime("%Y%m%d-%H%M%S")) 511 | try: 512 | os.mkdir(logdir) 513 | except FileExistsError: 514 | pass 515 | 516 | try: 517 | os.mkdir("training") 518 | except FileExistsError: 519 | pass 520 | checkpoint_path = "training/model." + model_name + "-{epoch:02d}-{val_loss:.4f}.h5" 521 | checkpoint_dir = os.path.dirname(checkpoint_path) 522 | #!ls {checkpoint_dir} 523 | 524 | callbacks = [tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=patience), 525 | #tf.keras.callbacks.TensorBoard(logdir, histogram_freq=1), 526 | tf.keras.callbacks.ModelCheckpoint(checkpoint_path, monitor='val_acc', mode='max', verbose=1, save_best_only=True)] 527 | 528 | 529 | 530 | # Train and validate model. 531 | history = model.fit( 532 | train, 533 | train_dataset[1], 534 | epochs=epochs, 535 | callbacks=callbacks, 536 | validation_data=(valid, valid_dataset[1]), 537 | verbose=2, # Logs once per epoch. 538 | batch_size=batch_size) 539 | 540 | # Print results. 541 | history = history.history 542 | print('Validation accuracy: {acc}, loss: {loss}'.format( 543 | acc=history['val_acc'][-1], loss=history['val_loss'][-1])) 544 | 545 | # Save model. 546 | model.save(f'{model_name}_{datetime.datetime.now().strftime("%Y%m%d-%H%M%S")}_nonsepcnn_model.h5') 547 | print(history['val_acc'][-1], history['val_loss'][-1]) 548 | 549 | test_loss, test_acc = model.evaluate(test, test_dataset[1], verbose=2) 550 | print(f"test loss: {test_loss}, test accuracy: {test_acc} ") 551 | 552 | return model 553 | 554 | def saveTextData(tdata, fname): 555 | with open(fname, "w") as txt_file: 556 | for line in tdata: 557 | txt_file.write(line + "\n") 558 | 559 | def chunks(lst, n): 560 | """Yield successive n-sized chunks from lst.""" 561 | for i in range(0, len(lst), n): 562 | yield lst[i:i + n] 563 | 564 | wordlist_1 = loadData("word.list") 565 | wordlist_2 = loadData("letterpress_en.txt") 566 | wordlist_3 = loadData("SINGLE.TXT") 567 | 568 | wordlist = list(set(wordlist_1 + wordlist_2 + wordlist_3)) 569 | 570 | def isInDictionary(word): 571 | return (word in wordlist) 572 | 573 | def theseAreTotallyRealWords(run_count=1000, cutoff=0.9): 574 | totally_real_words = [] 575 | is_in_dictionary = [] 576 | for i in range(run_count): 577 | real_words = [generateWord(None)] 578 | tokenized_real_words = character_tokenizer.texts_to_sequences(real_words) 579 | padded_real_words = sequence.pad_sequences(tokenized_real_words, maxlen=MAX_WORD_LENGTH, padding='post') 580 | real_words_result = model.predict(padded_real_words) 581 | if real_words_result[0] > cutoff: 582 | print(f"{i}\t{real_words[0]}") 583 | totally_real_words.append(real_words[0]) 584 | if isInDictionary(real_words[0]): 585 | is_in_dictionary.append(real_words[0]) 586 | return totally_real_words, is_in_dictionary 587 | 588 | def theseAreTotallyRealWordsOneshot(model, run_count=100, cutoff=0.9, random_dist='uniform', letter_dist='random'): 589 | totally_real_words = [] 590 | almost_real_words = [] 591 | is_in_dictionary = [] 592 | real_words = [generateWord(None, random_dist=random_dist, letter_dist=letter_dist) for i in range(run_count)] 593 | tokenized_real_words = character_tokenizer.texts_to_sequences(real_words) 594 | padded_real_words = sequence.pad_sequences(tokenized_real_words, maxlen=MAX_WORD_LENGTH, padding='post') 595 | real_words_result = model.predict(padded_real_words) 
596 | rwr = real_words_result.tolist() 597 | for idx in range(len(rwr)): 598 | predict = real_words_result[idx] 599 | if predict[0] > cutoff: 600 | totally_real_words.append(real_words[idx]) 601 | else: 602 | if predict[0] > 0.5: 603 | almost_real_words.append(real_words[idx]) 604 | if isInDictionary(real_words[idx]): 605 | is_in_dictionary.append(real_words[idx]) 606 | return totally_real_words, is_in_dictionary, almost_real_words 607 | 608 | def check_model(model, model_name): 609 | #totally_real_words = ["test", "weyhws", "agglution", "glyph", "tyro", "pfxx"] 610 | #tokenized_real_words = character_tokenizer.texts_to_sequences(totally_real_words) 611 | #padded_real_words = sequence.pad_sequences(tokenized_real_words, maxlen=MAX_WORD_LENGTH, padding='post') 612 | #real_words_result = model.predict(padded_real_words) 613 | #[int(i * 100) for i in real_words_result] 614 | totally_real, in_dic, almost_words = theseAreTotallyRealWordsOneshot(model, run_count=10000, cutoff=0.8) 615 | print("\nTotally Real Words\n============") 616 | [print(i) for i in set(totally_real)] 617 | print("\nSuper Fake Words\n============") 618 | [print(i) for i in set(in_dic)] 619 | print("\nDictionary Words Found\n============") 620 | [print(i) for i in (set(totally_real) & set(in_dic))] 621 | print("\nDictionary Words Not Found (False Negatives)\n============") 622 | [print(i) for i in (set(in_dic) - set(totally_real))] 623 | print("\nAlmost Words\n============") 624 | [print(i) for i in set(almost_words)] 625 | 626 | print("\n") 627 | 628 | totally_real, in_dic, almost_words = theseAreTotallyRealWordsOneshot(model, run_count=10000, cutoff=0.8, random_dist='english_table', letter_dist='english') 629 | print("\nTotally Real Words\n============") 630 | [print(i) for i in set(totally_real)] 631 | print("\nSuper Fake Words\n============") 632 | [print(i) for i in set(in_dic)] 633 | print("\nDictionary Words Found\n============") 634 | [print(i) for i in (set(totally_real) & set(in_dic))] 635 | print("\nDictionary Words Not Found (False Negatives)\n============") 636 | [print(i) for i in (set(in_dic) - set(totally_real))] 637 | print("\nAlmost Words\n============") 638 | [print(i) for i in set(almost_words)] 639 | 640 | print("\n") 641 | 642 | 643 | 644 | wordlist_chunks = chunks(wordlist, 1000) 645 | wordlist_predict = [] 646 | for cnk in wordlist_chunks: 647 | print(cnk[0], end=' ') 648 | tokenized_real_words = character_tokenizer.texts_to_sequences(cnk) 649 | padded_real_words = sequence.pad_sequences(tokenized_real_words, maxlen=MAX_WORD_LENGTH, padding='post') 650 | cnk_predictions = model.predict(padded_real_words) 651 | print(cnk_predictions[0]) 652 | wordlist_predict = wordlist_predict + cnk_predictions.tolist() 653 | 654 | saveTextData(wordlist, "all_english_words.txt") 655 | np.savetxt("all_english_words_predictions.txt", wordlist_predict) 656 | 657 | sorted_wordlist = [[i,j] for i,j in sorted(zip(wordlist_predict, wordlist))] 658 | 659 | wlp = np.sort(np.array(sorted(wordlist_predict))) 660 | print(f"Average: {np.average(wlp)}, Median: {np.median(wlp)}") 661 | print(wlp[:10]) 662 | print(wlp[-10:]) 663 | import matplotlib.pyplot as plt 664 | plt.figure(figsize=(16,7)) 665 | plt.plot(range(len(wlp)), wlp, label="model 1") 666 | plt.plot() 667 | plt.ylabel("prediction") 668 | plt.title(f"{model_name} Prediction of All English Words") 669 | plt.legend() 670 | plt.show() 671 | 672 | [print(f"{j} {i[0]:03.2f}") for i,j in sorted_wordlist[:1000]] 673 | print() 674 | 675 | pwords = [generatePronounceableWord(None, 
just_gen = True) for i in range(10000)] 676 | p_tokenized_real_words = character_tokenizer.texts_to_sequences(pwords) 677 | p_padded_real_words = sequence.pad_sequences(p_tokenized_real_words, maxlen=MAX_WORD_LENGTH, padding='post') 678 | p_predictions = model.predict(p_padded_real_words) 679 | 680 | p_sorted_wordlist = [[i,j] for i,j in sorted(zip(p_predictions, pwords))] 681 | [print(i) for i in p_sorted_wordlist[:10]] 682 | [print(i) for i in p_sorted_wordlist[-10:]] 683 | 684 | p_wlp = np.sort(np.array(sorted(p_predictions))) 685 | print(f"Average: {np.average(p_wlp)}, Median: {np.median(p_wlp)}") 686 | print(p_wlp[:10]) 687 | print(p_wlp[-10:]) 688 | 689 | plt.figure(figsize=(16,7)) 690 | plt.plot(range(len(p_wlp)), p_wlp, label="generated") 691 | plt.plot() 692 | plt.ylabel("prediction") 693 | plt.title(f"{model_name} Prediction of Pronounceable Words") 694 | plt.legend() 695 | plt.show() 696 | 697 | p2words = [generateWord(None, random_dist='english_table', letter_dist='english') for i in range(10000)] 698 | p2_tokenized_real_words = character_tokenizer.texts_to_sequences(p2words) 699 | p2_padded_real_words = sequence.pad_sequences(p2_tokenized_real_words, maxlen=MAX_WORD_LENGTH, padding='post') 700 | p2_predictions = model.predict(p2_padded_real_words) 701 | 702 | p2_sorted_wordlist = [[i,j] for i,j in sorted(zip(p2_predictions, pwords))] 703 | [print(i) for i in p2_sorted_wordlist[:10]] 704 | [print(i) for i in p2_sorted_wordlist[-10:]] 705 | 706 | p2_wlp = np.sort(np.array(sorted(p2_predictions))) 707 | print(f"Average: {np.average(p2_wlp)}, Median: {np.median(p2_wlp)}") 708 | print(p2_wlp[:10]) 709 | print(p2_wlp[-10:]) 710 | 711 | plt.figure(figsize=(16,7)) 712 | plt.plot(range(len(p2_wlp)), p2_wlp, label="generated") 713 | plt.plot() 714 | plt.ylabel("prediction") 715 | plt.title(f"{model_name} Prediction of Random English-Distribution Words") 716 | plt.legend() 717 | plt.show() 718 | 719 | p3words = [generateWord(None, random_dist='english_table', letter_dist='random') for i in range(10000)] 720 | p3_tokenized_real_words = character_tokenizer.texts_to_sequences(p3words) 721 | p3_padded_real_words = sequence.pad_sequences(p3_tokenized_real_words, maxlen=MAX_WORD_LENGTH, padding='post') 722 | p3_predictions = model.predict(p3_padded_real_words) 723 | 724 | p3_sorted_wordlist = [[i,j] for i,j in sorted(zip(p3_predictions, pwords))] 725 | [print(i) for i in p3_sorted_wordlist[:10]] 726 | [print(i) for i in p3_sorted_wordlist[-10:]] 727 | 728 | p3_wlp = np.sort(np.array(sorted(p3_predictions))) 729 | print(f"Average: {np.average(p3_wlp)}, Median: {np.median(p3_wlp)}") 730 | print(p3_wlp[:10]) 731 | print(p3_wlp[-10:]) 732 | 733 | plt.figure(figsize=(16,7)) 734 | plt.plot(range(len(p3_wlp)), p3_wlp, label="generated") 735 | plt.plot() 736 | plt.ylabel("prediction") 737 | plt.title(f"{model_name} Prediction of Random-Random Words") 738 | plt.legend() 739 | plt.show() 740 | 741 | p4words = [generateWord(None, random_dist='uniform', letter_dist='random') for i in range(10000)] 742 | p4_tokenized_real_words = character_tokenizer.texts_to_sequences(p4words) 743 | p4_padded_real_words = sequence.pad_sequences(p4_tokenized_real_words, maxlen=MAX_WORD_LENGTH, padding='post') 744 | p4_predictions = model.predict(p4_padded_real_words) 745 | 746 | p4_sorted_wordlist = [[i,j] for i,j in sorted(zip(p4_predictions, pwords))] 747 | [print(i) for i in p4_sorted_wordlist[:10]] 748 | [print(i) for i in p4_sorted_wordlist[-10:]] 749 | 750 | p4_wlp = np.sort(np.array(sorted(p4_predictions))) 751 | 
print(f"Average: {np.average(p4_wlp)}, Median: {np.median(p4_wlp)}") 752 | print(p4_wlp[:10]) 753 | print(p4_wlp[-10:]) 754 | 755 | plt.figure(figsize=(16,7)) 756 | plt.plot(range(len(p4_wlp)), p4_wlp, label="generated") 757 | plt.plot() 758 | plt.ylabel("prediction") 759 | plt.title(f"{model_name} Prediction of Uniform-Random Words") 760 | plt.legend() 761 | plt.show() 762 | 763 | #!pip install wandb 764 | #import wandb 765 | #wandb.init() 766 | 767 | 768 | 769 | # Commented out IPython magic to ensure Python compatibility. 770 | # %tensorboard --logdir logs --port=6006 771 | 772 | base_model = train_model(model_name = f"model_{dist_type}_{letter_dist}") 773 | 774 | check_model(base_model, f"model_{dist_type}_{letter_dist}") 775 | 776 | cnk_words = ["egg", "eggbeater", "seas"] 777 | tokenized_real_words = character_tokenizer.texts_to_sequences(cnk_words) 778 | padded_real_words = sequence.pad_sequences(tokenized_real_words, maxlen=MAX_WORD_LENGTH, padding='post') 779 | cnk_predictions = base_model.predict(padded_real_words) 780 | [print(f"{a} = {int(b[0]*100)}%") for a,b in zip(cnk_words, cnk_predictions)] 781 | --------------------------------------------------------------------------------