├── LICENSE ├── Readme.md ├── cleanwebst.py └── reverse_dictionary.ipynb /LICENSE: -------------------------------------------------------------------------------- 1 | BSD 3-Clause License 2 | 3 | Copyright (c) 2022, Lorenz Köhl 4 | All rights reserved. 5 | 6 | Redistribution and use in source and binary forms, with or without 7 | modification, are permitted provided that the following conditions are met: 8 | 9 | 1. Redistributions of source code must retain the above copyright notice, this 10 | list of conditions and the following disclaimer. 11 | 12 | 2. Redistributions in binary form must reproduce the above copyright notice, 13 | this list of conditions and the following disclaimer in the documentation 14 | and/or other materials provided with the distribution. 15 | 16 | 3. Neither the name of the copyright holder nor the names of its 17 | contributors may be used to endorse or promote products derived from 18 | this software without specific prior written permission. 19 | 20 | THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" 21 | AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE 22 | IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE 23 | DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE 24 | FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL 25 | DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR 26 | SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER 27 | CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, 28 | OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE 29 | OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
30 | -------------------------------------------------------------------------------- /Readme.md: -------------------------------------------------------------------------------- 1 | # Reverse Dictionary 2 | 3 | Writing well is laborious. A good dictionary helps, but it's only usable in one direction. 4 | You have to think of a word, look it up, chase references, and so on. Even more work! 5 | 6 | It may be helpful to look up words by meaning, by what we as writers want to express. 7 | For example, if we had a function `find_words("I'm lost for words")` 8 | that would present us with a choice of words: 9 | 10 | ``` 11 | astoundment, bewildered, blank, confus, distraught, perplexly, stagger, stound, unyielded 12 | ``` 13 | 14 | then we might find the right word, without all the gyrations of traditional dictionary use. 15 | 16 | Can we implement this function? The answer is yes, and it's not hard (*if you know your way around Python*)! 17 | 18 | See `reverse_dictionary.ipynb`. 19 | -------------------------------------------------------------------------------- /cleanwebst.py: -------------------------------------------------------------------------------- 1 | import json, sys 2 | from html5_parser import parse 3 | 4 | # given a parsed node p and a class name c, collect the text of every div.c element in the document 5 | classdivs = lambda p, c: [''.join(e.itertext()) for e in p.xpath(f'//div[@class="{c}"]')] 6 | 7 | # flatten the texts found for each class in cs into a single list 8 | nodetexts = lambda n, cs: [text.strip() for elem in (classdivs(n, c) for c in cs) for text in elem] 9 | 10 | # get a list of definition texts for the relevant parts of a dictionary entry 11 | divdefs = lambda n: nodetexts(n, ('def', 'q', 'ety', 'cs', 'note')) 12 | 13 | 14 | if __name__ == '__main__': 15 | webst = json.load(sys.stdin) 16 | 17 | webst_clean = {entry.lower(): defs 18 | for entry in webst 19 | if (defs := divdefs(parse(webst[entry]))) # skip empty defs 20 | } 21 | 22 | json.dump(webst_clean, sys.stdout, sort_keys=True, 
indent=4) 23 | -------------------------------------------------------------------------------- /reverse_dictionary.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "id": "5a1ba44d-e15c-4645-9cc2-f06f343e2986", 6 | "metadata": {}, 7 | "source": [ 8 | "
Lorenz Köhl\n", 9 | "
\n", 10 | "September 2022
\n", 11 | "\n", 12 | "# Reverse Dictionary\n", 13 | "\n", 14 | "Writing well is laborious. A good dictionary helps, but it's only usable in one direction.\n", 15 | "You have to think of a word, look it up, chase references, and so on. Even more work!\n", 16 | "\n", 17 | "It may be helpful to look up words by meaning, by what we as writers want to express.\n", 18 | "For example, if we had a function `find_words(\"I'm lost for words\")`\n", 19 | "that would present us with a choice of words:\n", 20 | "\n", 21 | "```\n", 22 | "astoundment, bewildered, blank, confus, distraught, perplexly, stagger, stound, unyielded\n", 23 | "```\n", 24 | "\n", 25 | "then we might find the right word, without all the gyrations of traditional dictionary use.\n", 26 | "\n", 27 | "Can we implement this function? The answer is yes, and it's not hard (*if you know your way around Python*)!\n", 28 | "\n", 29 | "*Dependencies for execution:*\n", 30 | "\n", 31 | "- an environment with the following packages and a computer with enough resources (i.e. an NVIDIA GPU and plenty of RAM)\n", 32 | "- pytorch
`conda install pytorch torchvision torchaudio cudatoolkit=11.6 -c pytorch -c conda-forge`\n", 33 | "- [sentence-transformers](https://github.com/UKPLab/sentence-transformers):
`conda install -c conda-forge sentence-transformers`\n", 34 | "- [ScaNN](https://github.com/google-research/google-research/tree/master/scann):
`pip install scann`\n", 35 | "\n", 36 | "You'll also need a cleaned up version of the webster1913 dictionary\n", 37 | "[json file](https://www.dropbox.com/s/w62l6pdfl8dtw2z/webst.json?dl=0). \n", 38 | "Please find a cleaning script in the [repo](https://github.com/mye/simple-vector-search) which depends on\n", 39 | "[html5-parser](https://html5-parser.readthedocs.io/en/latest/):\n", 40 | "
\n", 41 | "`pip install --no-binary lxml html5-parser`\n", 42 | "\n", 43 | "`python cleanwebst.py < webst.json > cleanwebst.json`" 44 | ] 45 | }, 46 | { 47 | "cell_type": "code", 48 | "execution_count": null, 49 | "id": "789b6932-1f7f-4522-8b7d-38a50b51dd4c", 50 | "metadata": {}, 51 | "outputs": [], 52 | "source": [ 53 | "import torch, scann, numpy as np\n", 54 | "from sentence_transformers import SentenceTransformer\n", 55 | "import json" 56 | ] 57 | }, 58 | { 59 | "cell_type": "code", 60 | "execution_count": 13, 61 | "id": "6a0d8746-6349-42fb-b441-8af257724dfa", 62 | "metadata": {}, 63 | "outputs": [], 64 | "source": [ 65 | "assert torch.cuda.is_available()" 66 | ] 67 | }, 68 | { 69 | "cell_type": "markdown", 70 | "id": "992f11f8-bff9-42a0-bb1e-4d2c3ecb18d2", 71 | "metadata": {}, 72 | "source": [ 73 | "We start off by loading the dictionary, embedding its definitions into vectors (sentence embeddings), and indexing those vectors for approximate nearest neighbor search." 74 | ] 75 | }, 76 | { 77 | "cell_type": "code", 78 | "execution_count": 5, 79 | "id": "6b756745-62aa-45d1-891a-545b90def45b", 80 | "metadata": {}, 81 | "outputs": [ 82 | { 83 | "data": { 84 | "text/plain": [ 85 | "['The brain and spinal cord; the cerebro-spinal axis; myelencephalon.',\n", 86 | " '[NL., from Gr. 
νεῦρον nerve.]']" 87 | ] 88 | }, 89 | "execution_count": 5, 90 | "metadata": {}, 91 | "output_type": "execute_result" 92 | } 93 | ], 94 | "source": [ 95 | "webst = json.load(open('cleanwebst.json'))\n", 96 | "webst['neuron']" 97 | ] 98 | }, 99 | { 100 | "cell_type": "code", 101 | "execution_count": 19, 102 | "id": "89c0be48-e783-4c0c-abb4-8201299e09b2", 103 | "metadata": {}, 104 | "outputs": [], 105 | "source": [ 106 | "mpnet = SentenceTransformer('all-mpnet-base-v2') # could also use all-MiniLM-L6-v2 for a lighter-weight model" 107 | ] 108 | }, 109 | { 110 | "cell_type": "code", 111 | "execution_count": 7, 112 | "id": "cf7273ff-4be4-4c7b-88cf-1e06ccedc726", 113 | "metadata": {}, 114 | "outputs": [], 115 | "source": [ 116 | "# this takes a while (about 30 minutes on my RTX 3060 Ti)\n", 117 | "webst_embs = {word: mpnet.encode(defs) for word, defs in webst.items()}" 118 | ] 119 | }, 120 | { 121 | "cell_type": "code", 122 | "execution_count": 18, 123 | "id": "ab4ad1f6-6dc8-4734-8358-38348714d14c", 124 | "metadata": {}, 125 | "outputs": [], 126 | "source": [ 127 | "dataset = np.concatenate([webst_embs[w] for w in webst_embs])\n", 128 | "dataset_words = np.array([w for w in webst_embs for e in webst_embs[w]])\n", 129 | "assert len(dataset) == len(dataset_words)\n", 130 | "np.save('embs.npy', dataset) # save so we don't have to recompute if something goes wrong\n", 131 | "np.save('words.npy', dataset_words)" 132 | ] 133 | }, 134 | { 135 | "cell_type": "code", 136 | "execution_count": 20, 137 | "id": "32231b91-a06f-4cdc-b257-44b3142f4b4b", 138 | "metadata": {}, 139 | "outputs": [], 140 | "source": [ 141 | "normalized_dataset = dataset / np.linalg.norm(dataset, axis=1)[:, np.newaxis]" 142 | ] 143 | }, 144 | { 145 | "cell_type": "code", 146 | "execution_count": 22, 147 | "id": "eb1da317-0fe1-40a6-9cac-85808886a76b", 148 | "metadata": {}, 149 | "outputs": [], 150 | "source": [ 151 | "searcher = scann.scann_ops_pybind.builder(normalized_dataset, 10, 
\"dot_product\").tree(\n", 152 | " num_leaves=2000, num_leaves_to_search=100, training_sample_size=250000).score_ah(\n", 153 | " 2, anisotropic_quantization_threshold=0.2).reorder(100).build()" 154 | ] 155 | }, 156 | { 157 | "cell_type": "markdown", 158 | "id": "4889292f-49d6-41e4-89f4-e88d93e5eb01", 159 | "metadata": {}, 160 | "source": [ 161 | "These few cells did a lot, even if it doesn't look like much!\n", 162 | "We loaded a pretrained neural network and encoded the whole dictionary,\n", 163 | "which gives us around 270,000 vectors to search through.\n", 164 | "\n", 165 | "We now have everything we need to implement our word-finding function.\n", 166 | "We simply encode the description (the meaning) into a vector and search for its neighbors!" 167 | ] 168 | }, 169 | { 170 | "cell_type": "code", 171 | "execution_count": 23, 172 | "id": "701435b3-edd3-429f-8856-dd52e606dec2", 173 | "metadata": {}, 174 | "outputs": [], 175 | "source": [ 176 | "def find_words(description: str):\n", 177 | " emb = mpnet.encode(description)\n", 178 | " neighbors, distances = searcher.search(emb, final_num_neighbors=10)\n", 179 | " return set(dataset_words[neighbors])" 180 | ] 181 | }, 182 | { 183 | "cell_type": "code", 184 | "execution_count": 24, 185 | "id": "f10e036e-cd41-4f84-a2ff-17117aa98422", 186 | "metadata": {}, 187 | "outputs": [ 188 | { 189 | "data": { 190 | "text/plain": [ 191 | "{'amazeful',\n", 192 | " 'astoundment',\n", 193 | " 'bewildered',\n", 194 | " 'blank',\n", 195 | " 'confus',\n", 196 | " 'distraught',\n", 197 | " 'perplexly',\n", 198 | " 'stagger',\n", 199 | " 'stound',\n", 200 | " 'unyielded'}" 201 | ] 202 | }, 203 | "execution_count": 24, 204 | "metadata": {}, 205 | "output_type": "execute_result" 206 | } 207 | ], 208 | "source": [ 209 | "find_words(\"I'm lost for words\")" 210 | ] 211 | }, 212 | { 213 | "cell_type": "markdown", 214 | "id": "bf0c7540-92f5-46e1-97ec-f729867a9202", 215 | "metadata": {}, 216 | "source": [ 217 | "Of course, what we really want is a more 
nicely formatted list with definitions." 218 | ] 219 | }, 220 | { 221 | "cell_type": "code", 222 | "execution_count": 26, 223 | "id": "e44afc01-0d26-4abf-b5a3-7241a5972e9a", 224 | "metadata": {}, 225 | "outputs": [], 226 | "source": [ 227 | "from IPython.display import HTML" 228 | ] 229 | }, 230 | { 231 | "cell_type": "code", 232 | "execution_count": 72, 233 | "id": "ed6974a4-4661-49ff-a30a-f9cfdb387aa9", 234 | "metadata": {}, 235 | "outputs": [], 236 | "source": [ 237 | "def word_html(word, ndefs=5):\n", 238 | " defs = [f'<em>{d}</em>' for d in webst[word][:ndefs]]\n", 239 | " html = f'<li>{word}<br>{\" // \".join(defs)}</li>'  # markup reconstructed; the original tags were stripped from this copy\n", 240 | " return html\n", 241 | "\n", 242 | "def display_words(desc):\n", 243 | " words = find_words(desc)\n", 244 | " htmls = [word_html(word) for word in words]\n", 245 | " return HTML(f'<ul>{\"\".join(htmls)}</ul>')" 246 | ] 247 | }, 248 | { 249 | "cell_type": "code", 250 | "execution_count": 76, 251 | "id": "f653c710-d961-42d9-93cd-040be7f56c8c", 252 | "metadata": {}, 253 | "outputs": [ 254 | { 255 | "data": { 256 | "text/html": [ 257 | "" 258 | ], 259 | "text/plain": [ 260 | "" 261 | ] 262 | }, 263 | "execution_count": 76, 264 | "metadata": {}, 265 | "output_type": "execute_result" 266 | } 267 | ], 268 | "source": [ 269 | "query = \"I'm lost for words\"\n", 270 | "display_words(query)" 271 | ] 272 | }, 273 | { 274 | "cell_type": "markdown", 275 | "id": "d5447062-3e92-4d0d-b9c8-eab48011dc68", 276 | "metadata": {}, 277 | "source": [ 278 | "That's a decent result for the wee bit of code we had to write.\n", 279 | "The quality of the words isn't always perfect (false positives happen).\n", 280 | "Some words have a lot of definitions and appear too often (e.g. unyielded).\n", 281 | "We could, for example, think about how to improve the embeddings,\n", 282 | "or we could increase the size of our dataset and balance the number of\n", 283 | "definitions used for training. Then we could think about deploying it as a service to others.\n", 284 | "\n", 285 | "But before we do all that, let's gather some real-world experience of how\n", 286 | "useful our model is in practice and get some writing done. Have fun!" 
287 | ] 288 | } 289 | ], 290 | "metadata": { 291 | "kernelspec": { 292 | "display_name": "Python 3 (ipykernel)", 293 | "language": "python", 294 | "name": "python3" 295 | }, 296 | "language_info": { 297 | "codemirror_mode": { 298 | "name": "ipython", 299 | "version": 3 300 | }, 301 | "file_extension": ".py", 302 | "mimetype": "text/x-python", 303 | "name": "python", 304 | "nbconvert_exporter": "python", 305 | "pygments_lexer": "ipython3", 306 | "version": "3.10.4" 307 | } 308 | }, 309 | "nbformat": 4, 310 | "nbformat_minor": 5 311 | } 312 | --------------------------------------------------------------------------------