├── LICENSE
├── Readme.md
├── cleanwebst.py
└── reverse_dictionary.ipynb
/LICENSE:
--------------------------------------------------------------------------------
1 | BSD 3-Clause License
2 |
3 | Copyright (c) 2022, Lorenz Köhl
4 | All rights reserved.
5 |
6 | Redistribution and use in source and binary forms, with or without
7 | modification, are permitted provided that the following conditions are met:
8 |
9 | 1. Redistributions of source code must retain the above copyright notice, this
10 | list of conditions and the following disclaimer.
11 |
12 | 2. Redistributions in binary form must reproduce the above copyright notice,
13 | this list of conditions and the following disclaimer in the documentation
14 | and/or other materials provided with the distribution.
15 |
16 | 3. Neither the name of the copyright holder nor the names of its
17 | contributors may be used to endorse or promote products derived from
18 | this software without specific prior written permission.
19 |
20 | THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
21 | AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
22 | IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
23 | DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
24 | FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
25 | DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
26 | SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
27 | CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
28 | OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
29 | OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
30 |
--------------------------------------------------------------------------------
/Readme.md:
--------------------------------------------------------------------------------
1 | # Reverse Dictionary
2 |
3 | Writing well is laborious. A good dictionary helps, but it's only usable in one direction:
4 | you have to think of a word, look it up, chase references, and so on. Even more work!
5 |
6 | It may be helpful to look up words by meaning, by what we as writers want to express.
7 | For example, suppose we had a function `find_words("I'm lost for words")`
8 | that presented us with a choice of words:
9 |
10 | ```
11 | astoundment, bewildered, blank, confus, distraught, perplexly, stagger, stound, unyielded
12 | ```
13 |
14 | then we may find the right word we want, without all the gyrations of traditional dictionary use.
15 |
16 | Can we implement this function? The answer is yes, and it's not hard (*if you know your way around Python*)!
17 |
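The core recipe: embed every definition with a sentence encoder, then return the words whose definition vectors lie nearest to the embedding of the query. Here is a minimal sketch (the file and model names follow the notebook; the brute-force search here stands in for the ScaNN index used there):

```
import json
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-mpnet-base-v2')

# one embedding per definition, plus a parallel list of the words they define
webst = json.load(open('cleanwebst.json'))
words, defs = zip(*[(w, d) for w, ds in webst.items() for d in ds])
embs = model.encode(list(defs), normalize_embeddings=True)

def find_words(description, k=10):
    # cosine similarity via dot product on normalized vectors
    q = model.encode(description, normalize_embeddings=True)
    return {words[i] for i in np.argsort(embs @ q)[-k:]}
```
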
18 | See `reverse_dictionary.ipynb` for the full implementation.
19 |
--------------------------------------------------------------------------------
/cleanwebst.py:
--------------------------------------------------------------------------------
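# usage: python cleanwebst.py < webst.json > cleanwebst.json
# reads the dictionary JSON (entry -> raw HTML) on stdin, writes cleaned JSON to stdout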
1 | import json, sys
2 | from html5_parser import parse
3 |
4 | # given a node p and a class name c, list the text content of every div.c under p
5 | classdivs = lambda p, c: [''.join(e.itertext()) for e in p.xpath(f'.//div[@class="{c}"]')]
6 |
7 | # flatten the stripped texts for all class names cs found under node n into one list
8 | nodetexts = lambda n, cs: [text.strip() for texts in (classdivs(n, c) for c in cs) for text in texts]
9 |
10 | # get the list of definition texts for the relevant parts of a dictionary entry
11 | divdefs = lambda n: nodetexts(n, ('def', 'q', 'ety', 'cs', 'note'))
12 |
13 |
14 | if __name__ == '__main__':
15 | webst = json.load(sys.stdin)
16 |
17 | webst_clean = {entry.lower(): defs
18 | for entry in webst
19 | if (defs := divdefs(parse(webst[entry]))) # skip empty defs
20 | }
21 |
22 | json.dump(webst_clean, sys.stdout, sort_keys=True, indent=4)
23 |
--------------------------------------------------------------------------------
/reverse_dictionary.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "id": "5a1ba44d-e15c-4645-9cc2-f06f343e2986",
6 | "metadata": {},
7 | "source": [
8 | "
Lorenz Köhl\n",
9 | "
\n",
10 | "September 2022
\n",
11 | "\n",
12 | "# Reverse Dictionary\n",
13 | "\n",
14 | "Writing well is laborsome. A good dictionary helps but it's only usable in one direction.\n",
15 | "You have to think of a word, look it up, chase references and so on. Even more work!\n",
16 | "\n",
17 | "It may be helpful to look up words by meaning, by what we as writers want to express.\n",
18 | "For example, if we had a function `find_words(\"I'm lost for words\")`\n",
19 | "and it would present us with a choice of words:\n",
20 | "\n",
21 | "```\n",
22 | "astoundment, bewildered, blank, confus, distraught, perplexly, stagger, stound, unyielded\n",
23 | "```\n",
24 | "\n",
25 | "then we may find the right word we want, without all the gyrations of traditional dictionary use.\n",
26 | "\n",
27 | "Can we implement this function? The answer is yes and it's not hard (*if you know your way around python*)!\n",
28 | "\n",
29 | "*Dependencies for execution:*\n",
30 | "\n",
31 | "- an environment with the following and a computer with enough resources (ie. nvidia gpu and lots of RAM)\n",
32 | "- pytorch
`conda install pytorch torchvision torchaudio cudatoolkit=11.6 -c pytorch -c conda-forge`\n",
33 | "- [sentence-transformers](https://github.com/UKPLab/sentence-transformers):
`conda install -c conda-forge sentence-transformers`\n",
34 | "- [ScaNN](https://github.com/google-research/google-research/tree/master/scann):
`pip install scann`\n",
35 | "\n",
36 | "You'll also need a cleaned up version of the webster1913 dictionary\n",
37 | "[json file](https://www.dropbox.com/s/w62l6pdfl8dtw2z/webst.json?dl=0). \n",
38 | "Please find a cleaning script in the [repo](https://github.com/mye/simple-vector-search) which depends on\n",
39 | "[html5-parser](https://html5-parser.readthedocs.io/en/latest/):\n",
40 | "
\n",
41 | "`pip install --no-binary lxml html5-parser`\n",
42 | "\n",
43 | "`python cleanwebst.py cleanwebst.json`"
44 | ]
45 | },
46 | {
47 | "cell_type": "code",
48 | "execution_count": null,
49 | "id": "789b6932-1f7f-4522-8b7d-38a50b51dd4c",
50 | "metadata": {},
51 | "outputs": [],
52 | "source": [
53 | "import torch, scann, numpy as np\n",
54 | "from sentence_transformers import SentenceTransformer\n",
55 | "import json"
56 | ]
57 | },
58 | {
59 | "cell_type": "code",
60 | "execution_count": 13,
61 | "id": "6a0d8746-6349-42fb-b441-8af257724dfa",
62 | "metadata": {},
63 | "outputs": [],
64 | "source": [
65 | "assert torch.cuda.is_available()"
66 | ]
67 | },
68 | {
69 | "cell_type": "markdown",
70 | "id": "992f11f8-bff9-42a0-bb1e-4d2c3ecb18d2",
71 | "metadata": {},
72 | "source": [
73 | "We start of by loading the dictionary, embedding definitions into vectors (sentence embeddings) and indexing those vectors for approximate nearest neighbor search"
74 | ]
75 | },
76 | {
77 | "cell_type": "code",
78 | "execution_count": 5,
79 | "id": "6b756745-62aa-45d1-891a-545b90def45b",
80 | "metadata": {},
81 | "outputs": [
82 | {
83 | "data": {
84 | "text/plain": [
85 | "['The brain and spinal cord; the cerebro-spinal axis; myelencephalon.',\n",
86 | " '[NL., from Gr. νεῦρον nerve.]']"
87 | ]
88 | },
89 | "execution_count": 5,
90 | "metadata": {},
91 | "output_type": "execute_result"
92 | }
93 | ],
94 | "source": [
95 | "webst = json.load(open('cleanwebst.json'))\n",
96 | "webst['neuron']"
97 | ]
98 | },
99 | {
100 | "cell_type": "code",
101 | "execution_count": 19,
102 | "id": "89c0be48-e783-4c0c-abb4-8201299e09b2",
103 | "metadata": {},
104 | "outputs": [],
105 | "source": [
106 | "mpnet = SentenceTransformer('all-mpnet-base-v2') # could also use all-MiniLM-L6-v2 for lighter weight model"
107 | ]
108 | },
109 | {
110 | "cell_type": "code",
111 | "execution_count": 7,
112 | "id": "cf7273ff-4be4-4c7b-88cf-1e06ccedc726",
113 | "metadata": {},
114 | "outputs": [],
115 | "source": [
116 | "# this takes a while (about 30 minutes on my RTX 3060 TI)\n",
117 | "webst_embs = {word: mpnet.encode(defs) for word, defs in webst.items()} "
118 | ]
119 | },
120 | {
121 | "cell_type": "code",
122 | "execution_count": 18,
123 | "id": "ab4ad1f6-6dc8-4734-8358-38348714d14c",
124 | "metadata": {},
125 | "outputs": [],
126 | "source": [
127 | "dataset = np.concatenate([webst_embs[w] for w in webst_embs])\n",
128 | "dataset_words = np.array([w for w in webst_embs for e in webst_embs[w]])\n",
129 | "assert len(dataset) == len(dataset_words)\n",
130 | "np.save('embs.npy', dataset) # save data so we don't have to recompute when something bad happens\n",
131 | "np.save('words.npy', dataset_words)"
132 | ]
133 | },
134 | {
135 | "cell_type": "code",
136 | "execution_count": 20,
137 | "id": "32231b91-a06f-4cdc-b257-44b3142f4b4b",
138 | "metadata": {},
139 | "outputs": [],
140 | "source": [
141 | "normalized_dataset = dataset / np.linalg.norm(dataset, axis=1)[:, np.newaxis]"
142 | ]
143 | },
144 | {
145 | "cell_type": "code",
146 | "execution_count": 22,
147 | "id": "eb1da317-0fe1-40a6-9cac-85808886a76b",
148 | "metadata": {},
149 | "outputs": [],
150 | "source": [
151 | "searcher = scann.scann_ops_pybind.builder(normalized_dataset, 10, \"dot_product\").tree(\n",
152 | " num_leaves=2000, num_leaves_to_search=100, training_sample_size=250000).score_ah(\n",
153 | " 2, anisotropic_quantization_threshold=0.2).reorder(100).build()"
154 | ]
155 | },
156 | {
157 | "cell_type": "markdown",
158 | "id": "4889292f-49d6-41e4-89f4-e88d93e5eb01",
159 | "metadata": {},
160 | "source": [
161 | "This did alot in a few cells, even if it doesn't look like much!\n",
162 | "We loaded a pretrained neural network and encoded the whole dictionary,\n",
163 | "which gives us around 270000 vectors to search through.\n",
164 | "\n",
165 | "We now have everything to implement our word finding function.\n",
166 | "We simply encode the description (the meaning) into a vector and search for its neighbors!"
167 | ]
168 | },
169 | {
170 | "cell_type": "code",
171 | "execution_count": 23,
172 | "id": "701435b3-edd3-429f-8856-dd52e606dec2",
173 | "metadata": {},
174 | "outputs": [],
175 | "source": [
176 | "def find_words(description: str):\n",
177 | " emb = mpnet.encode(description)\n",
178 | " neighbors, distances = searcher.search(emb, final_num_neighbors=10)\n",
179 | " return set(dataset_words[neighbors])"
180 | ]
181 | },
182 | {
183 | "cell_type": "code",
184 | "execution_count": 24,
185 | "id": "f10e036e-cd41-4f84-a2ff-17117aa98422",
186 | "metadata": {},
187 | "outputs": [
188 | {
189 | "data": {
190 | "text/plain": [
191 | "{'amazeful',\n",
192 | " 'astoundment',\n",
193 | " 'bewildered',\n",
194 | " 'blank',\n",
195 | " 'confus',\n",
196 | " 'distraught',\n",
197 | " 'perplexly',\n",
198 | " 'stagger',\n",
199 | " 'stound',\n",
200 | " 'unyielded'}"
201 | ]
202 | },
203 | "execution_count": 24,
204 | "metadata": {},
205 | "output_type": "execute_result"
206 | }
207 | ],
208 | "source": [
209 | "find_words(\"I'm lost for words\")"
210 | ]
211 | },
212 | {
213 | "cell_type": "markdown",
214 | "id": "bf0c7540-92f5-46e1-97ec-f729867a9202",
215 | "metadata": {},
216 | "source": [
217 | "Of course what we really want is more nicely formatted list with definitions"
218 | ]
219 | },
220 | {
221 | "cell_type": "code",
222 | "execution_count": 26,
223 | "id": "e44afc01-0d26-4abf-b5a3-7241a5972e9a",
224 | "metadata": {},
225 | "outputs": [],
226 | "source": [
227 | "from IPython.display import HTML"
228 | ]
229 | },
230 | {
231 | "cell_type": "code",
232 | "execution_count": 72,
233 | "id": "ed6974a4-4661-49ff-a30a-f9cfdb387aa9",
234 | "metadata": {},
235 | "outputs": [],
236 | "source": [
237 | "def word_html(word, ndefs=5):\n",
238 | " defs = [f'{d}' for d in webst[word][:ndefs]]\n",
239 | " html = f'{word}
{\" // \".join(defs)}'\n",
240 | " return html\n",
241 | "\n",
242 | "def display_words(desc):\n",
243 | " words = find_words(desc)\n",
244 | " htmls = [word_html(word) for word in words]\n",
245 | " return HTML('')"
246 | ]
247 | },
248 | {
249 | "cell_type": "code",
250 | "execution_count": 76,
251 | "id": "f653c710-d961-42d9-93cd-040be7f56c8c",
252 | "metadata": {},
253 | "outputs": [
254 | {
255 | "data": {
256 | "text/html": [
257 | "- confus
Confused, disturbed. // [F. See Confuse, adjective] - unyielded
To past particles, or to adjectives formed after the analogy of past particles, to indicate the absence of the condition or state expressed by them // See abased. // See abashed. // See abated. // See abolished. - blank
Of a white or pale color; without color. // Free from writing, printing, or marks; having an empty space to be filled in with some special writing; – said of checks, official documents, etc.; as, blank paper; a blank check; a blank ballot. // Utterly confounded or discomfited. // Empty; void; without result; fruitless; as, a blank space; a blank day. // Lacking characteristics which give variety; as, a blank desert; a blank wall; destitute of interests, affections, hopes, etc.; as, to live a blank existence; destitute of sensations; as, blank unconsciousness. - stound
To be in pain or sorrow. // Stunned. // A sudden, severe pain or grief; peril; alarm. // Astonishment; amazement. // Hour; time; season. - stagger
To move to one side and the other, as if about to fall, in standing or walking; not to stand or walk with steadiness; to sway; to reel or totter. // To cease to stand firm; to begin to give way; to fail. // To begin to doubt and waver in purpose; to become less confident or determined; to hesitate. // To cause to reel or totter. // To cause to doubt and waver; to make to hesitate; to make less steady or confident; to shock. - astoundment
Amazement. - bewildered
Greatly perplexed; as, a bewildered mind. - amazeful
Full of amazement. - perplexly
Perplexedly. - distraught
Torn asunder; separated. // Distracted; perplexed. // As if thou wert distraught and mad with terror. Shak. // To doubt betwixt our senses and our souls Which are the most distraught and full of pain. Mrs. Browning. // [OE. distract, distrauht. See Distract, adjective]
"
258 | ],
259 | "text/plain": [
260 | ""
261 | ]
262 | },
263 | "execution_count": 76,
264 | "metadata": {},
265 | "output_type": "execute_result"
266 | }
267 | ],
268 | "source": [
269 | "query = \"I'm lost for words\"\n",
270 | "display_words(query)"
271 | ]
272 | },
273 | {
274 | "cell_type": "markdown",
275 | "id": "d5447062-3e92-4d0d-b9c8-eab48011dc68",
276 | "metadata": {},
277 | "source": [
278 | "That's a decent result for the wee bit of code we had to write.\n",
279 | "The quality of words isn't always perfect (false positives happen).\n",
280 | "Some words have a lot definitions and appear too often (eg. unyielded).\n",
281 | "We could for example think about how improve the embeddings,\n",
282 | "or we could increase the size of our dataset, and balance the number of\n",
283 | "definitions used for training. Then we could think about deploying it as a service to others.\n",
284 | "\n",
285 | "But before we do all that, let's gather some real world experience on how\n",
286 | "useful our model is in practice and get some writing done. Have fun!"
287 | ]
288 | }
289 | ],
290 | "metadata": {
291 | "kernelspec": {
292 | "display_name": "Python 3 (ipykernel)",
293 | "language": "python",
294 | "name": "python3"
295 | },
296 | "language_info": {
297 | "codemirror_mode": {
298 | "name": "ipython",
299 | "version": 3
300 | },
301 | "file_extension": ".py",
302 | "mimetype": "text/x-python",
303 | "name": "python",
304 | "nbconvert_exporter": "python",
305 | "pygments_lexer": "ipython3",
306 | "version": "3.10.4"
307 | }
308 | },
309 | "nbformat": 4,
310 | "nbformat_minor": 5
311 | }
312 |
--------------------------------------------------------------------------------