├── LICENSE ├── Readme.md ├── cleanwebst.py └── reverse_dictionary.ipynb /LICENSE: -------------------------------------------------------------------------------- 1 | BSD 3-Clause License 2 | 3 | Copyright (c) 2022, Lorenz Köhl 4 | All rights reserved. 5 | 6 | Redistribution and use in source and binary forms, with or without 7 | modification, are permitted provided that the following conditions are met: 8 | 9 | 1. Redistributions of source code must retain the above copyright notice, this 10 | list of conditions and the following disclaimer. 11 | 12 | 2. Redistributions in binary form must reproduce the above copyright notice, 13 | this list of conditions and the following disclaimer in the documentation 14 | and/or other materials provided with the distribution. 15 | 16 | 3. Neither the name of the copyright holder nor the names of its 17 | contributors may be used to endorse or promote products derived from 18 | this software without specific prior written permission. 19 | 20 | THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" 21 | AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE 22 | IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE 23 | DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE 24 | FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL 25 | DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR 26 | SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER 27 | CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, 28 | OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE 29 | OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
30 | -------------------------------------------------------------------------------- /Readme.md: -------------------------------------------------------------------------------- 1 | # Reverse Dictionary 2 | 3 | Writing well is laborious. A good dictionary helps, but it's only usable in one direction. 4 | You have to think of a word, look it up, chase references, and so on. Even more work! 5 | 6 | It may be helpful to look up words by meaning, by what we as writers want to express. 7 | For example, if we had a function `find_words("I'm lost for words")` 8 | that would present us with a choice of words: 9 | 10 | ``` 11 | astoundment, bewildered, blank, confus, distraught, perplexly, stagger, stound, unyielded 12 | ``` 13 | 14 | then we might find the right word, without all the gyrations of traditional dictionary use. 15 | 16 | Can we implement this function? The answer is yes, and it's not hard (*if you know your way around Python*)! 17 | 18 | See `reverse_dictionary.ipynb`. 19 | -------------------------------------------------------------------------------- /cleanwebst.py: -------------------------------------------------------------------------------- 1 | import json, sys 2 | from html5_parser import parse 3 | 4 | # given a parsed node p and a class name c, collect the text of every div.c element in the document 5 | classdivs = lambda p, c: [''.join(e.itertext()) for e in p.xpath(f'//div[@class="{c}"]')] 6 | 7 | # flatten the texts found for each class in cs into a single list 8 | nodetexts = lambda n, cs: [text.strip() for elem in (classdivs(n, c) for c in cs) for text in elem] 9 | 10 | # get a list of definition texts for the relevant parts of a dictionary entry 11 | divdefs = lambda n: nodetexts(n, ('def', 'q', 'ety', 'cs', 'note')) 12 | 13 | 14 | if __name__ == '__main__': 15 | webst = json.load(sys.stdin) 16 | 17 | webst_clean = {entry.lower(): defs 18 | for entry in webst 19 | if (defs := divdefs(parse(webst[entry]))) # skip empty defs 20 | } 21 | 22 | json.dump(webst_clean, sys.stdout, sort_keys=True, 
indent=4) 23 | -------------------------------------------------------------------------------- /reverse_dictionary.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "id": "5a1ba44d-e15c-4645-9cc2-f06f343e2986", 6 | "metadata": {}, 7 | "source": [ 8 | "
Lorenz Köhl\n", 9 | "
\n", 10 | "September 2022
\n", 11 | "\n", 12 | "# Reverse Dictionary\n", 13 | "\n", 14 | "Writing well is laborious. A good dictionary helps, but it's only usable in one direction.\n", 15 | "You have to think of a word, look it up, chase references, and so on. Even more work!\n", 16 | "\n", 17 | "It may be helpful to look up words by meaning, by what we as writers want to express.\n", 18 | "For example, if we had a function `find_words(\"I'm lost for words\")`\n", 19 | "that would present us with a choice of words:\n", 20 | "\n", 21 | "```\n", 22 | "astoundment, bewildered, blank, confus, distraught, perplexly, stagger, stound, unyielded\n", 23 | "```\n", 24 | "\n", 25 | "then we might find the right word, without all the gyrations of traditional dictionary use.\n", 26 | "\n", 27 | "Can we implement this function? The answer is yes, and it's not hard (*if you know your way around Python*)!\n", 28 | "\n", 29 | "*Dependencies for execution:*\n", 30 | "\n", 31 | "- an environment with the following packages and a computer with enough resources (i.e. an NVIDIA GPU and plenty of RAM)\n", 32 | "- pytorch
`conda install pytorch torchvision torchaudio cudatoolkit=11.6 -c pytorch -c conda-forge`\n", 33 | "- [sentence-transformers](https://github.com/UKPLab/sentence-transformers):
`conda install -c conda-forge sentence-transformers`\n", 34 | "- [ScaNN](https://github.com/google-research/google-research/tree/master/scann):
`pip install scann`\n", 35 | "\n", 36 | "You'll also need a cleaned up version of the webster1913 dictionary\n", 37 | "[json file](https://www.dropbox.com/s/w62l6pdfl8dtw2z/webst.json?dl=0). \n", 38 | "Please find a cleaning script in the [repo](https://github.com/mye/simple-vector-search) which depends on\n", 39 | "[html5-parser](https://html5-parser.readthedocs.io/en/latest/):\n", 40 | "
\n", 41 | "`pip install --no-binary lxml html5-parser`\n", 42 | "\n", 43 | "`python cleanwebst.py < webst.json > cleanwebst.json`" 44 | ] 45 | }, 46 | { 47 | "cell_type": "code", 48 | "execution_count": null, 49 | "id": "789b6932-1f7f-4522-8b7d-38a50b51dd4c", 50 | "metadata": {}, 51 | "outputs": [], 52 | "source": [ 53 | "import torch, scann, numpy as np\n", 54 | "from sentence_transformers import SentenceTransformer\n", 55 | "import json" 56 | ] 57 | }, 58 | { 59 | "cell_type": "code", 60 | "execution_count": 13, 61 | "id": "6a0d8746-6349-42fb-b441-8af257724dfa", 62 | "metadata": {}, 63 | "outputs": [], 64 | "source": [ 65 | "assert torch.cuda.is_available()" 66 | ] 67 | }, 68 | { 69 | "cell_type": "markdown", 70 | "id": "992f11f8-bff9-42a0-bb1e-4d2c3ecb18d2", 71 | "metadata": {}, 72 | "source": [ 73 | "We start off by loading the dictionary, embedding its definitions into vectors (sentence embeddings), and indexing those vectors for approximate nearest neighbor search." 74 | ] 75 | }, 76 | { 77 | "cell_type": "code", 78 | "execution_count": 5, 79 | "id": "6b756745-62aa-45d1-891a-545b90def45b", 80 | "metadata": {}, 81 | "outputs": [ 82 | { 83 | "data": { 84 | "text/plain": [ 85 | "['The brain and spinal cord; the cerebro-spinal axis; myelencephalon.',\n", 86 | " '[NL., from Gr. 
νεῦρον nerve.]']" 87 | ] 88 | }, 89 | "execution_count": 5, 90 | "metadata": {}, 91 | "output_type": "execute_result" 92 | } 93 | ], 94 | "source": [ 95 | "webst = json.load(open('cleanwebst.json'))\n", 96 | "webst['neuron']" 97 | ] 98 | }, 99 | { 100 | "cell_type": "code", 101 | "execution_count": 19, 102 | "id": "89c0be48-e783-4c0c-abb4-8201299e09b2", 103 | "metadata": {}, 104 | "outputs": [], 105 | "source": [ 106 | "mpnet = SentenceTransformer('all-mpnet-base-v2') # could also use all-MiniLM-L6-v2 for a lighter-weight model" 107 | ] 108 | }, 109 | { 110 | "cell_type": "code", 111 | "execution_count": 7, 112 | "id": "cf7273ff-4be4-4c7b-88cf-1e06ccedc726", 113 | "metadata": {}, 114 | "outputs": [], 115 | "source": [ 116 | "# this takes a while (about 30 minutes on my RTX 3060 Ti)\n", 117 | "webst_embs = {word: mpnet.encode(defs) for word, defs in webst.items()}" 118 | ] 119 | }, 120 | { 121 | "cell_type": "code", 122 | "execution_count": 18, 123 | "id": "ab4ad1f6-6dc8-4734-8358-38348714d14c", 124 | "metadata": {}, 125 | "outputs": [], 126 | "source": [ 127 | "dataset = np.concatenate([webst_embs[w] for w in webst_embs])\n", 128 | "dataset_words = np.array([w for w in webst_embs for e in webst_embs[w]])\n", 129 | "assert len(dataset) == len(dataset_words)\n", 130 | "np.save('embs.npy', dataset) # save so we don't have to recompute if something goes wrong\n", 131 | "np.save('words.npy', dataset_words)" 132 | ] 133 | }, 134 | { 135 | "cell_type": "code", 136 | "execution_count": 20, 137 | "id": "32231b91-a06f-4cdc-b257-44b3142f4b4b", 138 | "metadata": {}, 139 | "outputs": [], 140 | "source": [ 141 | "normalized_dataset = dataset / np.linalg.norm(dataset, axis=1)[:, np.newaxis]" 142 | ] 143 | }, 144 | { 145 | "cell_type": "code", 146 | "execution_count": 22, 147 | "id": "eb1da317-0fe1-40a6-9cac-85808886a76b", 148 | "metadata": {}, 149 | "outputs": [], 150 | "source": [ 151 | "searcher = scann.scann_ops_pybind.builder(normalized_dataset, 10, 
\"dot_product\").tree(\n", 152 | " num_leaves=2000, num_leaves_to_search=100, training_sample_size=250000).score_ah(\n", 153 | " 2, anisotropic_quantization_threshold=0.2).reorder(100).build()" 154 | ] 155 | }, 156 | { 157 | "cell_type": "markdown", 158 | "id": "4889292f-49d6-41e4-89f4-e88d93e5eb01", 159 | "metadata": {}, 160 | "source": [ 161 | "These few cells did a lot, even if it doesn't look like much!\n", 162 | "We loaded a pretrained neural network and encoded the whole dictionary,\n", 163 | "which gives us around 270,000 vectors to search through.\n", 164 | "\n", 165 | "We now have everything we need to implement our word-finding function.\n", 166 | "We simply encode the description (the meaning) into a vector and search for its neighbors!" 167 | ] 168 | }, 169 | { 170 | "cell_type": "code", 171 | "execution_count": 23, 172 | "id": "701435b3-edd3-429f-8856-dd52e606dec2", 173 | "metadata": {}, 174 | "outputs": [], 175 | "source": [ 176 | "def find_words(description: str):\n", 177 | " emb = mpnet.encode(description)\n", 178 | " neighbors, distances = searcher.search(emb, final_num_neighbors=10)\n", 179 | " return set(dataset_words[neighbors])" 180 | ] 181 | }, 182 | { 183 | "cell_type": "code", 184 | "execution_count": 24, 185 | "id": "f10e036e-cd41-4f84-a2ff-17117aa98422", 186 | "metadata": {}, 187 | "outputs": [ 188 | { 189 | "data": { 190 | "text/plain": [ 191 | "{'amazeful',\n", 192 | " 'astoundment',\n", 193 | " 'bewildered',\n", 194 | " 'blank',\n", 195 | " 'confus',\n", 196 | " 'distraught',\n", 197 | " 'perplexly',\n", 198 | " 'stagger',\n", 199 | " 'stound',\n", 200 | " 'unyielded'}" 201 | ] 202 | }, 203 | "execution_count": 24, 204 | "metadata": {}, 205 | "output_type": "execute_result" 206 | } 207 | ], 208 | "source": [ 209 | "find_words(\"I'm lost for words\")" 210 | ] 211 | }, 212 | { 213 | "cell_type": "markdown", 214 | "id": "bf0c7540-92f5-46e1-97ec-f729867a9202", 215 | "metadata": {}, 216 | "source": [ 217 | "Of course, what we really want is a more 
nicely formatted list with definitions." 218 | ] 219 | }, 220 | { 221 | "cell_type": "code", 222 | "execution_count": 26, 223 | "id": "e44afc01-0d26-4abf-b5a3-7241a5972e9a", 224 | "metadata": {}, 225 | "outputs": [], 226 | "source": [ 227 | "from IPython.display import HTML" 228 | ] 229 | }, 230 | { 231 | "cell_type": "code", 232 | "execution_count": 72, 233 | "id": "ed6974a4-4661-49ff-a30a-f9cfdb387aa9", 234 | "metadata": {}, 235 | "outputs": [], 236 | "source": [ 237 | "def word_html(word, ndefs=5):\n", 238 | " defs = [f'<em>{d}</em>' for d in webst[word][:ndefs]]\n", 239 | " html = f'<li>{word}<br>{\" // \".join(defs)}</li>'  # markup reconstructed; the original tags were stripped from this copy\n", 240 | " return html\n", 241 | "\n", 242 | "def display_words(desc):\n", 243 | " words = find_words(desc)\n", 244 | " htmls = [word_html(word) for word in words]\n", 245 | " return HTML(f'<ul>{\"\".join(htmls)}</ul>')" 246 | ] 247 | }, 248 | { 249 | "cell_type": "code", 250 | "execution_count": 76, 251 | "id": "f653c710-d961-42d9-93cd-040be7f56c8c", 252 | "metadata": {}, 253 | "outputs": [ 254 | { 255 | "data": { 256 | "text/html": [ 257 | "" 258 | ], 259 | "text/plain": [ 260 | "" 261 | ] 262 | }, 263 | "execution_count": 76, 264 | "metadata": {}, 265 | "output_type": "execute_result" 266 | } 267 | ], 268 | "source": [ 269 | "query = \"I'm lost for words\"\n", 270 | "display_words(query)" 271 | ] 272 | }, 273 | { 274 | "cell_type": "markdown", 275 | "id": "d5447062-3e92-4d0d-b9c8-eab48011dc68", 276 | "metadata": {}, 277 | "source": [ 278 | "That's a decent result for the wee bit of code we had to write.\n", 279 | "The quality of the words isn't always perfect (false positives happen).\n", 280 | "Some words have a lot of definitions and appear too often (e.g. unyielded).\n", 281 | "We could, for example, think about how to improve the embeddings,\n", 282 | "or we could increase the size of our dataset and balance the number of\n", 283 | "definitions used for training. Then we could think about deploying it as a service to others.\n", 284 | "\n", 285 | "But before we do all that, let's gather some real-world experience of how\n", 286 | "useful our model is in practice and get some writing done. Have fun!" 
287 | ] 288 | } 289 | ], 290 | "metadata": { 291 | "kernelspec": { 292 | "display_name": "Python 3 (ipykernel)", 293 | "language": "python", 294 | "name": "python3" 295 | }, 296 | "language_info": { 297 | "codemirror_mode": { 298 | "name": "ipython", 299 | "version": 3 300 | }, 301 | "file_extension": ".py", 302 | "mimetype": "text/x-python", 303 | "name": "python", 304 | "nbconvert_exporter": "python", 305 | "pygments_lexer": "ipython3", 306 | "version": "3.10.4" 307 | } 308 | }, 309 | "nbformat": 4, 310 | "nbformat_minor": 5 311 | } 312 | --------------------------------------------------------------------------------