├── JupyterNotebooks
│   ├── 1-content_extraction.ipynb
│   ├── 2-content_indexing.ipynb
│   ├── 3-azure_search_query.ipynb
│   ├── AugmentingSearch_CreatingASynonymMap.ipynb
│   ├── AugmentingSearch_UploadingSynonymMapToAzureSearch.ipynb
│   ├── SmartStoplist.txt
│   ├── SmartStoplist_extended.txt
│   ├── rake.py
│   └── sample_page.png
├── LICENSE
├── Python
│   ├── azsearch_mgmt.py
│   ├── azsearch_query.py
│   ├── azsearch_queryall.py
│   └── keyphrase_extract.py
├── README.md
└── sample
    ├── html
    │   ├── 1.1.1.1.1.1.html
    │   ├── 1.1.1.1.1.2.html
    │   ├── 1.1.1.1.1.3.html
    │   ├── 1.1.1.1.2.1.html
    │   ├── 1.1.1.1.4.1.1.html
    │   ├── 1.1.1.1.4.1.2.html
    │   ├── 1.1.1.1.4.1.3.html
    │   ├── 1.1.1.1.4.1.4.html
    │   ├── 1.1.1.1.4.1.5.html
    │   ├── 1.1.1.1.4.1.6.html
    │   ├── 1.1.1.1.4.1.7.html
    │   ├── 1.1.1.1.4.1.8.html
    │   ├── 1.1.1.11.2.1.2.html
    │   └── styles
    │       ├── css_Q4z0-iME7xTpui0Tzf4MEFv02rRuJ1dHZbo9kP_JLBg.css
    │       ├── css_XgGKW_fNRFCK5BruHWlbChY4U8WE0xT4CWGilKSjSXA.css
    │       ├── css_dolo-SIAwemLdrlTs99Lrug9kFXMYlMG3OlznBv4Kho.css
    │       ├── css_kShW4RPmRstZ3SpIC-ZvVGNFVAi0WEMuCnI0ZkYIaFw.css
    │       ├── css_rJ3pqftttKVzxtjsOG18hAid4RqqjfFMw3d1C89lWd4.css
    │       └── css_tuqeOBz1ozigHOvScJR2wasCmXBizZ9rfd58u6_20EE.css
    ├── parsed_content.xlsx
    ├── parsed_content_sample.xlsx
    ├── raw_text_enriched_with_keywords_sample.xlsx
    ├── sample_page.png
    ├── sample_queries.txt
    ├── sample_queries.xlsx
    ├── sample_query_answers.txt
    └── sample_query_answers.xlsx
/JupyterNotebooks/2-content_indexing.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "## Building a Custom Search Engine\n",
8 | "### Step 2 - Create Azure Search Index\n",
9 | "- Define new index structure\n",
10 | "- Create Azure Search index\n",
11 | "- Upload and index parsed content from step 1\n",
12 | "- Optional: Simple management of Azure Search index\n",
13 | "\n",
14 | "Dependencies: Please install pyexcel, pyexcel-xls and pyexcel-xlsx
\n",
15 | "To install dependencies: pip install pyexcel pyexcel-xls pyexcel-xlsx"
16 | ]
17 | },
18 | {
19 | "cell_type": "code",
20 | "execution_count": 1,
21 | "metadata": {
22 | "collapsed": true
23 | },
24 | "outputs": [],
25 | "source": [
26 | "# Import base packages\n",
27 | "import requests\n",
28 | "import json\n",
29 | "import csv\n",
30 | "import os\n",
31 | "import pyexcel as pe"
32 | ]
33 | },
34 | {
35 | "cell_type": "markdown",
36 | "metadata": {},
37 | "source": [
38 | "First, initialize Azure Search configuration parameters to be used for index creation"
39 | ]
40 | },
41 | {
42 | "cell_type": "code",
43 | "execution_count": 2,
44 | "metadata": {
45 | "collapsed": true
46 | },
47 | "outputs": [],
48 | "source": [
49 | "# This is the service you've already created in Azure Portal\n",
50 | "serviceName = 'your_azure_search_service_name'\n",
51 | "\n",
52 | "# Index to be created\n",
53 | "indexName = 'name_of_index_to_create'\n",
54 | "\n",
55 | "# Set your service API key, either via an environment variable or enter it below\n",
56 | "#apiKey = os.getenv('SEARCH_KEY_DEV', '')\n",
57 | "apiKey = 'your_azure_search_service_api_key'\n",
58 | "apiVersion = '2016-09-01'"
59 | ]
60 | },
61 | {
62 | "cell_type": "markdown",
63 | "metadata": {},
64 | "source": [
65 | "Set the path to the parsed content file from step 1, and define a basic mapping of the input fields to the desired target field names in the new index. Input and output field names do not need to be the same. However, the target names should match the index definition in getIndexDefinition()."
66 | ]
67 | },
68 | {
69 | "cell_type": "code",
70 | "execution_count": null,
71 | "metadata": {
72 | "collapsed": true
73 | },
74 | "outputs": [],
75 | "source": [
76 | "# Input parsed content Excel file, e.g., output of step #1 in\n",
77 | "# https://github.com/CatalystCode/CustomSearch/tree/master/JupyterNotebooks/1-content_extraction.ipynb\n",
78 | "inputfile = os.path.join(os.getcwd(), '../sample/parsed_content.xlsx')\n",
79 | "\n",
80 | "# Define fields mapping from Excel file column names to search index field names (except Index)\n",
81 | "# Change this mapping to match your content fields and rename output fields as desired\n",
82 | "# Search field names should match their definition in getIndexDefinition()\n",
83 | "fields_map = [ ('File' , 'File'),\n",
84 | " ('ChapterTitle' , 'ChapterTitle'),\n",
85 | " ('SectionTitle' , 'SectionTitle'),\n",
86 | " ('SubsectionTitle' , 'SubsectionTitle'),\n",
87 | " ('SubsectionText' , 'SubsectionText'),\n",
88 | " ('Keywords' , 'Keywords') ]"
89 | ]
90 | },
91 | {
92 | "cell_type": "markdown",
93 | "metadata": {},
94 | "source": [
95 | "Now, let's define the structure of the new index to be created. In this example, all titles, content text and keywords fields are full-text searchable. Queries will use all searchable fields by default to retrieve a ranked list of results.\n",
96 | "\n",
97 | "For more details, refer to [Create an Azure Search Index](https://docs.microsoft.com/en-us/azure/search/search-what-is-an-index)."
98 | ]
99 | },
100 | {
101 | "cell_type": "code",
102 | "execution_count": null,
103 | "metadata": {
104 | "collapsed": true
105 | },
106 | "outputs": [],
107 | "source": [
108 | "# Fields: Index\tFile\tChapterTitle\tSectionTitle\tSubsectionTitle\t\tSubsectionText\tKeywords\n",
109 | "def getIndexDefinition():\n",
110 | " return {\n",
111 | " \"name\": indexName, \n",
112 | " \"fields\": [\n",
113 | " {\"name\": \"Index\", \"type\": \"Edm.String\", \"key\": True, \"retrievable\": True, \"searchable\": False, \"filterable\": False, \"sortable\": True, \"facetable\": False},\n",
114 | "\n",
115 | " {\"name\": \"File\", \"type\": \"Edm.String\", \"retrievable\": True, \"searchable\": False, \"filterable\": True, \"sortable\": True, \"facetable\": False},\n",
116 | "\n",
117 | " {\"name\": \"ChapterTitle\", \"type\": \"Edm.String\", \"retrievable\": True, \"searchable\": True, \"filterable\": True, \"sortable\": True, \"facetable\": True},\n",
118 | "\n",
119 | " {\"name\": \"SectionTitle\", \"type\": \"Edm.String\", \"retrievable\": True, \"searchable\": True, \"filterable\": True, \"sortable\": False, \"facetable\": True},\n",
120 | "\n",
121 | " {\"name\": \"SubsectionTitle\", \"type\": \"Edm.String\", \"retrievable\": True, \"searchable\": True, \"filterable\": True, \"sortable\": True, \"facetable\": False},\n",
122 | "\n",
123 | " {\"name\": \"SubsectionText\", \"type\": \"Edm.String\", \"retrievable\": True, \"searchable\": True, \"filterable\": False, \"sortable\": False, \"facetable\": False, \"analyzer\": \"en.microsoft\"},\n",
124 | "\n",
125 | " {\"name\": \"Keywords\", \"type\": \"Edm.String\", \"retrievable\": True, \"searchable\": True, \"filterable\": False, \"sortable\": False, \"facetable\": False, \"analyzer\": \"en.microsoft\"}\n",
126 | " ]\n",
127 | " }"
128 | ]
129 | },
130 | {
131 | "cell_type": "markdown",
132 | "metadata": {},
133 | "source": [
134 | "#### Helper functions for basic REST API operations"
135 | ]
136 | },
137 | {
138 | "cell_type": "code",
139 | "execution_count": null,
140 | "metadata": {
141 | "collapsed": true
142 | },
143 | "outputs": [],
144 | "source": [
145 | "def getServiceUrl():\n",
146 | " return 'https://' + serviceName + '.search.windows.net'\n",
147 | "\n",
148 | "def getMethod(servicePath):\n",
149 | " headers = {'Content-type': 'application/json', 'api-key': apiKey}\n",
150 | " r = requests.get(getServiceUrl() + servicePath, headers=headers)\n",
151 | " #print(r.text)\n",
152 | " return r\n",
153 | "\n",
154 | "def postMethod(servicePath, body):\n",
155 | " headers = {'Content-type': 'application/json', 'api-key': apiKey}\n",
156 | " r = requests.post(getServiceUrl() + servicePath, headers=headers, data=body)\n",
157 | " #print(r, r.text)\n",
158 | " return r"
159 | ]
160 | },
161 | {
162 | "cell_type": "markdown",
163 | "metadata": {},
164 | "source": [
165 | "#### Simple index management functions\n",
166 | "- Create a new index\n",
167 | "- Delete an existing index\n",
168 | "- Check if index exists"
169 | ]
170 | },
171 | {
172 | "cell_type": "code",
173 | "execution_count": null,
174 | "metadata": {
175 | "collapsed": true
176 | },
177 | "outputs": [],
178 | "source": [
179 | "def createIndex():\n",
180 | " indexDefinition = json.dumps(getIndexDefinition()) \n",
181 | " servicePath = '/indexes/?api-version=%s' % apiVersion\n",
182 | " r = postMethod(servicePath, indexDefinition)\n",
183 | " #print r.text\n",
184 | " if r.status_code == 201:\n",
185 | " print('Index %s created' % indexName) \n",
186 | " else:\n",
187 | " print('Failed to create index %s' % indexName)\n",
188 | " exit(1)\n",
189 | "\n",
190 | "def deleteIndex():\n",
191 | " servicePath = '/indexes/%s?api-version=%s&delete' % (indexName, apiVersion)\n",
192 | " headers = {'Content-type': 'application/json', 'api-key': apiKey}\n",
193 | " r = requests.delete(getServiceUrl() + servicePath, headers=headers)\n",
194 | " #print(r.text)\n",
195 | "\n",
196 | "def getIndex():\n",
197 | " servicePath = '/indexes/%s?api-version=%s' % (indexName, apiVersion)\n",
198 | " r = getMethod(servicePath)\n",
199 | " if r.status_code == 200: \n",
200 | " return True\n",
201 | " else:\n",
202 | " return False"
203 | ]
204 | },
205 | {
206 | "cell_type": "markdown",
207 | "metadata": {},
208 | "source": [
209 | "#### Helper functions to fetch one or more documents from the parsed content file\n",
210 | "\n",
211 | "Note: In this exercise, a *document* corresponds to one row from the parsed content Excel file."
212 | ]
213 | },
214 | {
215 | "cell_type": "code",
216 | "execution_count": null,
217 | "metadata": {
218 | "collapsed": true
219 | },
220 | "outputs": [],
221 | "source": [
222 | "def getDocumentObject(): \n",
223 | " valarry = []\n",
224 | " cnt = 1\n",
225 | " records = pe.iget_records(file_name=inputfile)\n",
226 | " for row in records:\n",
227 | " outdict = {}\n",
228 | " outdict['@search.action'] = 'upload'\n",
229 | "\n",
230 | " if (row[fields_map[0][0]]):\n",
231 | " outdict['Index'] = str(row['Index'])\n",
232 | " for (in_fld, out_fld) in fields_map:\n",
233 | " outdict[out_fld] = row[in_fld]\n",
234 | " valarry.append(outdict)\n",
235 | " cnt+=1\n",
236 | "\n",
237 | " return {'value' : valarry}\n",
238 | "\n",
239 | "def getDocumentObjectByChunk(start, end): \n",
240 | " valarry = []\n",
241 | " cnt = 1\n",
242 | " records = pe.iget_records(file_name=inputfile)\n",
243 | " for i, row in enumerate(records):\n",
244 | " if start <= i < end:\n",
245 | " outdict = {}\n",
246 | " outdict['@search.action'] = 'upload'\n",
247 | "\n",
248 | " if (row[fields_map[0][0]]):\n",
249 | " outdict['Index'] = str(row['Index'])\n",
250 | " for (in_fld, out_fld) in fields_map:\n",
251 | " outdict[out_fld] = row[in_fld]\n",
252 | " valarry.append(outdict)\n",
253 | " cnt+=1\n",
254 | "\n",
255 | " return {'value' : valarry}"
256 | ]
257 | },
258 | {
259 | "cell_type": "markdown",
260 | "metadata": {},
261 | "source": [
262 | "#### Main functions to upload and index documents in Azure Search\n",
263 | "\n",
264 | "Three methods are provided:\n",
265 | "- Upload all documents (rows) at once\n",
266 | "- Upload documents in chunks\n",
267 | "- Upload one document at a time\n",
268 | "\n",
269 | "**Note:** The method choice depends on the content size and whether it would fit in one or more REST request. "
270 | ]
271 | },
272 | {
273 | "cell_type": "code",
274 | "execution_count": null,
275 | "metadata": {
276 | "collapsed": true
277 | },
278 | "outputs": [],
279 | "source": [
280 | "# Upload content for indexing in one request if content is not too large\n",
281 | "def uploadDocuments():\n",
282 | " documents = json.dumps(getDocumentObject())\n",
283 | " servicePath = '/indexes/' + indexName + '/docs/index?api-version=' + apiVersion\n",
284 | " r = postMethod(servicePath, documents)\n",
285 | " if r.status_code == 200:\n",
286 | " print('Success: %s' % r) \n",
287 | " else:\n",
288 | " print('Failure: %s' % r.text)\n",
289 | " exit(1)\n",
290 | "\n",
291 | "# Upload content for indexing in chunks if content is too large for one request\n",
292 | "def uploadDocumentsInChunks(chunksize):\n",
293 | " records = pe.iget_records(file_name=inputfile)\n",
294 | " cnt = 0\n",
295 | " for row in records:\n",
296 | " cnt += 1\n",
297 | "\n",
298 | " for chunk in range(int(cnt/chunksize) + 1):\n",
299 | " print('Processing chunk number %d ...' % chunk)\n",
300 | " start = chunk * chunksize\n",
301 | " end = start + chunksize\n",
302 | " documents = json.dumps(getDocumentObjectByChunk(start, end))\n",
303 | " servicePath = '/indexes/' + indexName + '/docs/index?api-version=' + apiVersion\n",
304 | " r = postMethod(servicePath, documents)\n",
305 | " if r.status_code == 200:\n",
306 | " print('Success: %s' % r) \n",
307 | " else:\n",
308 | " print('Failure: %s' % r.text)\n",
309 | " return\n",
310 | "\n",
311 | "# Upload content for indexing one document at a time\n",
312 | "def uploadDocumentsOneByOne():\n",
313 | " records = pe.iget_records(file_name=inputfile)\n",
314 | " valarry = []\n",
315 | " for i, row in enumerate(records):\n",
316 | " outdict = {}\n",
317 | " outdict['@search.action'] = 'upload'\n",
318 | "\n",
319 | " if (row[fields_map[0][0]]):\n",
320 | " outdict['Index'] = str(row['Index'])\n",
321 | " for (in_fld, out_fld) in fields_map:\n",
322 | " outdict[out_fld] = row[in_fld]\n",
323 | " valarry.append(outdict)\n",
324 | "\n",
325 | " documents = json.dumps({'value' : valarry})\n",
326 | " servicePath = '/indexes/' + indexName + '/docs/index?api-version=' + apiVersion\n",
327 | " r = postMethod(servicePath, documents)\n",
328 | " if r.status_code == 200:\n",
329 | " print('%d Success: %s' % (i,r)) \n",
330 | " else:\n",
331 | " print('%d Failure: %s' % (i, r.text))\n",
332 | " exit(1)"
333 | ]
334 | },
335 | {
336 | "cell_type": "markdown",
337 | "metadata": {},
338 | "source": [
339 | "#### Helper functions to check and query an index"
340 | ]
341 | },
342 | {
343 | "cell_type": "code",
344 | "execution_count": null,
345 | "metadata": {
346 | "collapsed": true
347 | },
348 | "outputs": [],
349 | "source": [
350 | "def printDocumentCount():\n",
351 | " servicePath = '/indexes/' + indexName + '/docs/$count?api-version=' + apiVersion \n",
352 | " getMethod(servicePath)\n",
353 | "\n",
354 | "def sampleQuery(query, ntop=3):\n",
355 | " servicePath = '/indexes/' + indexName + '/docs?api-version=%s&search=%s&$top=%d' % \\\n",
356 | " (apiVersion, query, ntop)\n",
357 | " getMethod(servicePath)"
358 | ]
359 | },
360 | {
361 | "cell_type": "markdown",
362 | "metadata": {},
363 | "source": [
364 | "### Create index and upload all parsed content\n",
365 | "\n",
366 | "Now let's create the index, or delete and re-create the index if it exists, then upload all parsed documents in chunks. The small sample can be uploaded all at once, but the full tax code content would require multiple requests."
367 | ]
368 | },
369 | {
370 | "cell_type": "code",
371 | "execution_count": null,
372 | "metadata": {
373 | "collapsed": true
374 | },
375 | "outputs": [],
376 | "source": [
377 | "# Choose upload method to be used. Options: 'all', chunks' or 'one'\n",
378 | "upload_method = 'chunks'\n",
379 | "upload_chunk_size = 50\n",
380 | "\n",
381 | "# Python 2.x/3.x incompatibility of input() and raw_input()\n",
382 | "# Bind input() to raw_input() in Python 2.x, leave as-is in Python 3.x\n",
383 | "try:\n",
384 | " input = raw_input\n",
385 | "except NameError:\n",
386 | " pass\n",
387 | "\n",
388 | "# Create index if it does not exist\n",
389 | "if not getIndex():\n",
390 | " createIndex() \n",
391 | "else:\n",
392 | " ans = input('Index %s already exists ... Do you want to delete it? [Y/n]' % indexName)\n",
393 | " if ans.lower() == 'y':\n",
394 | " deleteIndex()\n",
395 | " print('Re-creating index %s ...' % indexName)\n",
396 | " createIndex()\n",
397 | " else:\n",
398 | " print('Index %s is not deleted ... New content will be added to existing index' % indexName)\n",
399 | "\n",
400 | "if upload_method == 'all':\n",
401 | " uploadDocuments()\n",
402 | "elif upload_method == 'chunks':\n",
403 | " uploadDocumentsInChunks(upload_chunk_size)\n",
404 | "else:\n",
405 | " uploadDocumentsOneByOne()\n",
406 | " \n",
407 | "# Verify and test the newly created index\n",
408 | "printDocumentCount()\n",
409 | "sampleQuery('child tax credit')"
410 | ]
411 | },
412 | {
413 | "cell_type": "markdown",
414 | "metadata": {},
415 | "source": [
416 | "#### The content is now ready for interactive or batch queries, as demonstrated in step #3."
417 | ]
418 | },
419 | {
420 | "cell_type": "code",
421 | "execution_count": null,
422 | "metadata": {
423 | "collapsed": true
424 | },
425 | "outputs": [],
426 | "source": []
427 | }
428 | ],
429 | "metadata": {
430 | "kernelspec": {
431 | "display_name": "Python 3",
432 | "language": "python",
433 | "name": "python3"
434 | },
435 | "language_info": {
436 | "codemirror_mode": {
437 | "name": "ipython",
438 | "version": 3
439 | },
440 | "file_extension": ".py",
441 | "mimetype": "text/x-python",
442 | "name": "python",
443 | "nbconvert_exporter": "python",
444 | "pygments_lexer": "ipython3",
445 | "version": "3.6.1"
446 | }
447 | },
448 | "nbformat": 4,
449 | "nbformat_minor": 2
450 | }
451 |
--------------------------------------------------------------------------------
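A quick aside before the next notebook: the operations above boil down to two REST calls against the Azure Search service, a POST to /indexes to create the index and a POST to /indexes/{name}/docs/index to upload documents. Below is a minimal, self-contained sketch of both calls with the field set trimmed to two fields for brevity; the service name, API key, and index name are placeholders you must replace.

    import json
    import requests

    serviceName = 'your_azure_search_service_name'  # placeholder
    indexName = 'myindex'                           # placeholder
    apiKey = 'your_azure_search_service_api_key'    # placeholder
    apiVersion = '2016-09-01'

    headers = {'Content-type': 'application/json', 'api-key': apiKey}
    baseUrl = 'https://' + serviceName + '.search.windows.net'

    # Create an index with a key field and one searchable text field
    index = {'name': indexName,
             'fields': [{'name': 'Index', 'type': 'Edm.String', 'key': True},
                        {'name': 'SubsectionText', 'type': 'Edm.String', 'searchable': True}]}
    r = requests.post(baseUrl + '/indexes/?api-version=' + apiVersion,
                      headers=headers, data=json.dumps(index))
    print(r.status_code)  # expect 201 (Created)

    # Upload one document; '@search.action': 'upload' inserts or replaces by key
    docs = {'value': [{'@search.action': 'upload', 'Index': '1',
                       'SubsectionText': 'Example content to index.'}]}
    r = requests.post(baseUrl + '/indexes/' + indexName + '/docs/index?api-version=' + apiVersion,
                      headers=headers, data=json.dumps(docs))
    print(r.status_code)  # expect 200 (OK)
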
/JupyterNotebooks/3-azure_search_query.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "## Building a Custom Search Engine\n",
8 | "### Step 3 - Query the Index and Retrieve Answers\n",
9 | "- Submit a single search query\n",
10 | "- Submit multiple queries in batch\n",
11 | "\n",
12 | "**Note:** A command-line script version is included under the Python folder of this project.\n",
13 | "- For interactive queries: azsearch_query.py\n",
14 | "- For batch queries in a file: azsearch_queryall.py"
15 | ]
16 | },
17 | {
18 | "cell_type": "code",
19 | "execution_count": 6,
20 | "metadata": {
21 | "collapsed": true
22 | },
23 | "outputs": [],
24 | "source": [
25 | "import requests\n",
26 | "import json\n",
27 | "import os\n",
28 | "import csv\n",
29 | "import pyexcel as pe\n",
30 | "import codecs\n",
31 | "import pandas as pd"
32 | ]
33 | },
34 | {
35 | "cell_type": "markdown",
36 | "metadata": {},
37 | "source": [
38 | "Initialize Azure Search configuration parameters to point to the content index to be used."
39 | ]
40 | },
41 | {
42 | "cell_type": "code",
43 | "execution_count": 7,
44 | "metadata": {
45 | "collapsed": true
46 | },
47 | "outputs": [],
48 | "source": [
49 | "# This is the service you've already created in Azure Portal\n",
50 | "serviceName = 'your_azure_search_service_name'\n",
51 | "\n",
52 | "# This is the index you've already created in Azure Portal or via the azsearch_mgmt.py script\n",
53 | "indexName = 'your_index_name_to_use'\n",
54 | "\n",
55 | "# Set your service API key, either via an environment variable or enter it below\n",
56 | "#apiKey = os.getenv('SEARCH_KEY_DEV', '')\n",
57 | "apiKey = 'your_azure_search_service_api_key'\n",
58 | "apiVersion = '2016-09-01'"
59 | ]
60 | },
61 | {
62 | "cell_type": "markdown",
63 | "metadata": {},
64 | "source": [
65 | "Optional configuration parameters to alter the search query request."
66 | ]
67 | },
68 | {
69 | "cell_type": "code",
70 | "execution_count": 9,
71 | "metadata": {
72 | "collapsed": true
73 | },
74 | "outputs": [],
75 | "source": [
76 | "# Retrieval options to alter the query results\n",
77 | "SEARCHFIELDS = None # use all searchable fields for retrieval\n",
78 | "#SEARCHFIELDS = 'Keywords, SubsectionText' # use selected fields only for retrieval\n",
79 | "FUZZY = False # enable fuzzy search (check API for details)\n",
80 | "NTOP = 5 # uumber of results to return"
81 | ]
82 | },
83 | {
84 | "cell_type": "markdown",
85 | "metadata": {},
86 | "source": [
87 | "#### Helper functions for basic REST API operations"
88 | ]
89 | },
90 | {
91 | "cell_type": "code",
92 | "execution_count": 10,
93 | "metadata": {
94 | "collapsed": true
95 | },
96 | "outputs": [],
97 | "source": [
98 | "def getServiceUrl():\n",
99 | " return 'https://' + serviceName + '.search.windows.net'\n",
100 | "\n",
101 | "def getMethod(servicePath):\n",
102 | " headers = {'Content-type': 'application/json', 'api-key': apiKey}\n",
103 | " r = requests.get(getServiceUrl() + servicePath, headers=headers)\n",
104 | " #print(r, r.text)\n",
105 | " return r\n",
106 | "\n",
107 | "def postMethod(servicePath, body):\n",
108 | " headers = {'Content-type': 'application/json', 'api-key': apiKey}\n",
109 | " r = requests.post(getServiceUrl() + servicePath, headers=headers, data=body)\n",
110 | " #print(r, r.text)\n",
111 | " return r"
112 | ]
113 | },
114 | {
115 | "cell_type": "markdown",
116 | "metadata": {},
117 | "source": [
118 | "#### Helper functions to submit a search query interactively or in batch"
119 | ]
120 | },
121 | {
122 | "cell_type": "code",
123 | "execution_count": 17,
124 | "metadata": {
125 | "collapsed": true
126 | },
127 | "outputs": [],
128 | "source": [
129 | "def submitQuery(query, fields=None, ntop=10, fuzzy=False):\n",
130 | " servicePath = '/indexes/' + indexName + '/docs?api-version=%s&search=%s&$top=%d' % \\\n",
131 | " (apiVersion, query, ntop)\n",
132 | " if fields != None:\n",
133 | " servicePath += '&searchFields=%s' % fields\n",
134 | " if fuzzy:\n",
135 | " servicePath += '&queryType=full'\n",
136 | " \n",
137 | " # Submit GET request\n",
138 | " r = getMethod(servicePath)\n",
139 | " if r.status_code != 200:\n",
140 | " print('Failed to retrieve search results')\n",
141 | " print(r, r.text)\n",
142 | " return\n",
143 | " \n",
144 | " # Parse and report search results\n",
145 | " docs = json.loads(r.text)['value']\n",
146 | " print('Number of search results = %d\\n' % len(docs))\n",
147 | " for i, doc in enumerate(docs):\n",
148 | " print('Results# %d' % (i+1))\n",
149 | " print('Chapter title : %s' % doc['ChapterTitle'].encode('utf8'))\n",
150 | " print('Section title : %s' % doc['SectionTitle'].encode('utf8'))\n",
151 | " print('Subsection title: %s' % doc['SubsectionTitle'].encode('utf8'))\n",
152 | " print('%s\\n' % doc['SubsectionText'].encode('utf8'))\n",
153 | " \n",
154 | "def submitBatchQuery(query, fields=None, ntop=10, fuzzy=False):\n",
155 | " servicePath = '/indexes/' + indexName + '/docs?api-version=%s&search=%s&$top=%d' % \\\n",
156 | " (apiVersion, query, ntop)\n",
157 | " if fields != None:\n",
158 | " servicePath += '&searchFields=%s' % fields\n",
159 | " if fuzzy:\n",
160 | " servicePath += '&queryType=full'\n",
161 | "\n",
162 | " # Submit GET request\n",
163 | " r = getMethod(servicePath)\n",
164 | " if r.status_code != 200:\n",
165 | " print('Failed to retrieve search results')\n",
166 | " print(query, r, r.text)\n",
167 | " return {}\n",
168 | "\n",
169 | " # Return search results\n",
170 | " docs = json.loads(r.text)['value']\n",
171 | " return docs"
172 | ]
173 | },
174 | {
175 | "cell_type": "markdown",
176 | "metadata": {},
177 | "source": [
178 | "Let's submit a query/question and retrieve the answers."
179 | ]
180 | },
181 | {
182 | "cell_type": "code",
183 | "execution_count": 12,
184 | "metadata": {
185 | "scrolled": true
186 | },
187 | "outputs": [
188 | {
189 | "name": "stdout",
190 | "output_type": "stream",
191 | "text": [
192 | "Number of search results = 5\n",
193 | "\n",
194 | "Results# 1\n",
195 | "Chapter title : b'Income Taxes - NORMAL TAXES AND SURTAXES'\n",
196 | "Section title : b'Determination of Tax Liability - TAX ON INDIVIDUALS'\n",
197 | "Subsection title: b'Tax imposed - Married individuals filing separate returns'\n",
198 | "b'(d) Married individuals filing separate returns There is hereby imposed on the taxable income of every married individual (as defined in section 7703) who does not make a single return jointly with his spouse under section 6013, a tax determined in accordance with the following table: If taxable income is: The tax is: Not over $18,450 15% of taxable income. Over $18,450 but not over $44,575 $2,767.50, plus 28% of the excess over $18,450. Over $44,575 but not over $70,000 $10,082.50, plus 31% of the excess over $44,575. Over $70,000 but not over $125,000 $17,964.25, plus 36% of the excess over $70,000. Over $125,000 $37,764.25, plus 39.6% of the excess over $125,000.'\n",
199 | "\n",
200 | "Results# 2\n",
201 | "Chapter title : b'Income Taxes - NORMAL TAXES AND SURTAXES'\n",
202 | "Section title : b'Determination of Tax Liability - CREDITS AGAINST TAX - Nonrefundable Personal Credits'\n",
203 | "Subsection title: b'Credit for the elderly and the permanently and totally disabled - Definitions and special rules'\n",
204 | "b'(e) Definitions and special rules For purposes of this section (1) Married couple must file joint return Except in the case of a husband and wife who live apart at all times during the taxable year, if the taxpayer is married at the close of the taxable year, the credit provided by this section shall be allowed only if the taxpayer and his spouse file a joint return for the taxable year. (2) Marital status Marital status shall be determined under section 7703. (3) Permanent and total disability defined An individual is permanently and totally disabled if he is unable to engage in any substantial gainful activity by reason of any medically determinable physical or mental impairment which can be expected to result in death or which has lasted or can be expected to last for a continuous period of not less than 12 months. An individual shall not be considered to be permanently and totally disabled unless he furnishes proof of the existence thereof in such form and manner, and at such times, as the Secretary may require.'\n",
205 | "\n",
206 | "Results# 3\n",
207 | "Chapter title : b'Income Taxes - NORMAL TAXES AND SURTAXES'\n",
208 | "Section title : b'Determination of Tax Liability - TAX ON INDIVIDUALS'\n",
209 | "Subsection title: b'Tax imposed - Married individuals filing joint returns and surviving spouses'\n",
210 | "b'(a) Married individuals filing joint returns and surviving spouses There is hereby imposed on the taxable income of (1) every married individual (as defined in section 7703) who makes a single return jointly with his spouse under section 6013, and (2) every surviving spouse (as defined in section 2(a)), a tax determined in accordance with the following table: If taxable income is: The tax is: Not over $36,900 15% of taxable income. Over $36,900 but not over $89,150 $5,535, plus 28% of the excess over $36,900. Over $89,150 but not over $140,000 $20,165, plus 31% of the excess over $89,150. Over $140,000 but not over $250,000 $35,928.50, plus 36% of the excess over $140,000. Over $250,000 $75,528.50, plus 39.6% of the excess over $250,000.'\n",
211 | "\n",
212 | "Results# 4\n",
213 | "Chapter title : b'Income Taxes - NORMAL TAXES AND SURTAXES'\n",
214 | "Section title : b'Determination of Tax Liability - TAX ON INDIVIDUALS'\n",
215 | "Subsection title: b'Tax imposed - Phaseout of marriage penalty in 15-percent bracket; adjustments in tax tables so that inflation will not result in tax increases'\n",
216 | "b'(f) Phaseout of marriage penalty in 15-percent bracket; adjustments in tax tables so that inflation will not result in tax increases (1) In general Not later than December 15 of 1993, and each subsequent calendar year, the Secretary shall prescribe tables which shall apply in lieu of the tables contained in subsections (a), (b), (c), (d), and (e) with respect to taxable years beginning in the succeeding calendar year. (2) Method of prescribing tables The table which under paragraph (1) is to apply in lieu of the table contained in subsection (a), (b), (c), (d), or (e), as the case may be, with respect to taxable years beginning in any calendar year shall be prescribed (A) except as provided in paragraph (8), by increasing the minimum and maximum dollar amounts for each rate bracket for which a tax is imposed under such table by the cost-of-living adjustment for such calendar year, (B) by not changing the rate applicable to any rate bracket as adjusted under subparagraph (A), and (C) by adjusting the amounts setting forth the tax to the extent necessary to reflect the adjustments in the rate brackets. (3) Cost-of-living adjustment For purposes of paragraph (2), the cost-of-living adjustment for any calendar year is the percentage (if any) by which (A) the CPI for the preceding calendar year, exceeds (B) the CPI for the calendar year 1992. (4) CPI for any calendar year For purposes of paragraph (3), the CPI for any calendar year is the average of the Consumer Price Index as of the close of the 12-month period ending on August 31 of such calendar year. (5) Consumer Price Index For purposes of paragraph (4), the term Consumer Price Index means the last Consumer Price Index for all-urban consumers published by the Department of Labor. For purposes of the preceding sentence, the revision of the Consumer Price Index which is most consistent with the Consumer Price Index for calendar year 1986 shall be used. (6) Rounding (A) In general If any increase determined under paragraph (2)(A), section 63(c)(4), section 68(b)(2) or section 151(d)(4) is not a multiple of $50, such increase shall be rounded to the next lowest multiple of $50. (B) Table for married individuals filing separately In the case of a married individual filing a separate return, subparagraph (A) (other than with respect to sections 63(c)(4) and 151(d)(4)(A)) shall be applied by substituting $25 for $50 each place it appears. (7) Special rule for certain brackets In prescribing tables under paragraph (1) which apply to taxable years beginning in a calendar year after 1994, the cost-of-living adjustment used in making adjustments to the dollar amounts at which the 36 percent rate bracket begins or at which the 39.6 percent rate bracket begins shall be determined under paragraph (3) by substituting 1993 for 1992. (8) Elimination of marriage penalty in 15-percent bracket With respect to taxable years beginning after December 31, 2003 , in prescribing the tables under paragraph (1) (A) the maximum taxable income in the 15-percent rate bracket in the table contained in subsection (a) (and the minimum taxable income in the next higher taxable income bracket in such table) shall be 200 percent of the maximum taxable income in the 15-percent rate bracket in the table contained in subsection (c) (after any other adjustment under this subsection), and (B) the comparable taxable income amounts in the table contained in subsection (d) shall be \\xc2\\xbd of the amounts determined under subparagraph (A).'\n",
217 | "\n",
218 | "Results# 5\n",
219 | "Chapter title : b'Income Taxes - NORMAL TAXES AND SURTAXES'\n",
220 | "Section title : b'Determination of Tax Liability - CREDITS AGAINST TAX - Nonrefundable Personal Credits'\n",
221 | "Subsection title: b'Adoption expenses - Filing requirements'\n",
222 | "b'(f) Filing requirements (1) Married couples must file joint returns Rules similar to the rules of paragraphs (2), (3), and (4) of section 21(e) shall apply for purposes of this section. (2) Taxpayer must include TIN (A) In general No credit shall be allowed under this section with respect to any eligible child unless the taxpayer includes (if known) the name, age, and TIN of such child on the return of tax for the taxable year. (B) Other methods The Secretary may, in lieu of the information referred to in subparagraph (A), require other information meeting the purposes of subparagraph (A), including identification of an agent assisting with the adoption.'\n",
223 | "\n"
224 | ]
225 | }
226 | ],
227 | "source": [
228 | "query = 'what is the tax bracket for married couple filing separately'\n",
229 | "if query != '':\n",
230 | " # Submit query to Azure Search and retrieve results\n",
231 | " searchFields = SEARCHFIELDS\n",
232 | " submitQuery(query, fields=searchFields, ntop=NTOP)"
233 | ]
234 | },
235 | {
236 | "cell_type": "markdown",
237 | "metadata": {},
238 | "source": [
239 | "Now let's submit a set of queries in batch and retrieve all ranked lists of results. This mode would be useful for performance evaluation given a set of queries and ground truth answers."
240 | ]
241 | },
242 | {
243 | "cell_type": "code",
244 | "execution_count": 27,
245 | "metadata": {
246 | "collapsed": true
247 | },
248 | "outputs": [],
249 | "source": [
250 | "# Input file coontaining the list of queries [tab-separated .txt or .tsv, Excel .xls or .xlsx]\n",
251 | "infile = os.path.join(os.getcwd(), '../sample/sample_queries.txt')\n",
252 | "outfile = os.path.join(os.getcwd(), '../sample/sample_query_answers.xlsx')\n",
253 | "\n",
254 | "if infile.endswith('.tsv') or infile.endswith('.txt'):\n",
255 | " records = pd.read_csv(infile, sep='\\t', header=0, encoding='utf-8')\n",
256 | " rows = records.iterrows()\n",
257 | "elif infile.endswith('.xls') or infile.endswith('.xlsx'):\n",
258 | " records = pe.iget_records(file_name=infile)\n",
259 | " rows = enumerate(records)\n",
260 | "else:\n",
261 | " print('Unsupported query file extension. Options: tsv, txt, xls, xlsx')"
262 | ]
263 | },
264 | {
265 | "cell_type": "code",
266 | "execution_count": 28,
267 | "metadata": {},
268 | "outputs": [
269 | {
270 | "name": "stdout",
271 | "output_type": "stream",
272 | "text": [
273 | "QID: 1\tNumber of results: 5\n",
274 | "QID: 2\tNumber of results: 5\n",
275 | "QID: 3\tNumber of results: 5\n",
276 | "QID: 4\tNumber of results: 5\n",
277 | "QID: 5\tNumber of results: 5\n",
278 | "QID: 6\tNumber of results: 5\n",
279 | "QID: 7\tNumber of results: 5\n",
280 | "Search results saved in file sample_query_answers.xlsx\n"
281 | ]
282 | }
283 | ],
284 | "source": [
285 | "# Dataframe to keep index of crawled pages\n",
286 | "df = pd.DataFrame(columns = ['Qid', 'Query', 'Rank', 'SubsectionText', 'ChapterTitle', 'SectionTitle', 'SubsectionTitle', 'Keywords'])\n",
287 | " \n",
288 | "for i, row in rows:\n",
289 | " qid = int(row['Qid'])\n",
290 | " query = row['Query']\n",
291 | " # Submit query to Azure Search and retrieve results\n",
292 | " searchFields = SEARCHFIELDS\n",
293 | " docs = submitBatchQuery(query, fields=searchFields, ntop=NTOP, fuzzy=FUZZY)\n",
294 | " print('QID: %4d\\tNumber of results: %d' % (qid, len(docs)))\n",
295 | " for id, doc in enumerate(docs):\n",
296 | " chapter_title = doc['ChapterTitle']\n",
297 | " section_title = doc['SectionTitle']\n",
298 | " subsection_title = doc['SubsectionTitle']\n",
299 | " subsection_text = doc['SubsectionText']\n",
300 | " keywords = doc['Keywords']\n",
301 | "\n",
302 | " df = df.append({'Qid' : qid, \n",
303 | " 'Query' : query, \n",
304 | " 'Rank' : (id + 1), \n",
305 | " 'SubsectionText' : subsection_text,\n",
306 | " 'ChapterTitle' : chapter_title,\n",
307 | " 'SectionTitle' : section_title,\n",
308 | " 'SubsectionTitle' : subsection_title,\n",
309 | " 'Keywords' : keywords},\n",
310 | " ignore_index=True)\n",
311 | "\n",
312 | "# Save all answers\n",
313 | "df['Qid'] = df['Qid'].astype(int)\n",
314 | "df['Rank'] = df['Rank'].astype(int)\n",
315 | "\n",
316 | "if outfile.endswith('.xls') or outfile.endswith('.xlsx'):\n",
317 | " df.to_excel(outfile, index=False, encoding='utf-8') \n",
318 | "else: # default tab-separated file\n",
319 | " df.to_csv(outfile, sep='\\t', index=False, encoding='utf-8') \n",
320 | "print('Search results saved in file %s' % os.path.basename(outfile))"
321 | ]
322 | },
323 | {
324 | "cell_type": "code",
325 | "execution_count": null,
326 | "metadata": {
327 | "collapsed": true
328 | },
329 | "outputs": [],
330 | "source": []
331 | }
332 | ],
333 | "metadata": {
334 | "kernelspec": {
335 | "display_name": "Python 3",
336 | "language": "python",
337 | "name": "python3"
338 | },
339 | "language_info": {
340 | "codemirror_mode": {
341 | "name": "ipython",
342 | "version": 3
343 | },
344 | "file_extension": ".py",
345 | "mimetype": "text/x-python",
346 | "name": "python",
347 | "nbconvert_exporter": "python",
348 | "pygments_lexer": "ipython3",
349 | "version": "3.6.1"
350 | }
351 | },
352 | "nbformat": 4,
353 | "nbformat_minor": 2
354 | }
355 |
--------------------------------------------------------------------------------
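For reference, each query issued by the notebook above is a single GET against the docs endpoint. A minimal sketch, assuming placeholder credentials and the field names defined in step 2:

    import requests

    serviceName = 'your_azure_search_service_name'  # placeholder
    indexName = 'your_index_name_to_use'            # placeholder
    apiKey = 'your_azure_search_service_api_key'    # placeholder
    apiVersion = '2016-09-01'

    # Restrict retrieval to selected fields and return the top 5 results
    url = ('https://%s.search.windows.net/indexes/%s/docs'
           '?api-version=%s&search=%s&$top=%d&searchFields=%s'
           % (serviceName, indexName, apiVersion,
              'child tax credit', 5, 'Keywords,SubsectionText'))
    r = requests.get(url, headers={'api-key': apiKey})
    for rank, doc in enumerate(r.json()['value'], start=1):
        print(rank, doc['SubsectionTitle'])
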
/JupyterNotebooks/AugmentingSearch_CreatingASynonymMap.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {
6 | "deletable": true,
7 | "editable": true
8 | },
9 | "source": [
10 | "# This notebook shows how you can get synonyms of key words/phrases by web-crawling Thesaurus.com and/or adding them manually. This can be used to augment a downstream search operation. #"
11 | ]
12 | },
13 | {
14 | "cell_type": "code",
15 | "execution_count": 14,
16 | "metadata": {
17 | "collapsed": true,
18 | "deletable": true,
19 | "editable": true
20 | },
21 | "outputs": [],
22 | "source": [
23 | "import pandas as pd\n",
24 | "import os\n",
25 | "import requests\n",
26 | "from bs4 import BeautifulSoup\n",
27 | "from collections import Counter"
28 | ]
29 | },
30 | {
31 | "cell_type": "markdown",
32 | "metadata": {
33 | "deletable": true,
34 | "editable": true
35 | },
36 | "source": [
37 | "# Part 1: Get synonyms of keywords from thesaurus.com #\n",
38 | "** Define a function to call and crawl thesearus.com. You pass in a word (which could be a phrase) and get back up to top N synonyms if they exist. You can also filter results by Part of Speech if you wish. \n",
39 | "Note: the advantage of crawling vs calling an API is that you can make unlimited free requests for getting synonyms.**"
40 | ]
41 | },
42 | {
43 | "cell_type": "code",
44 | "execution_count": 15,
45 | "metadata": {
46 | "collapsed": false,
47 | "deletable": true,
48 | "editable": true
49 | },
50 | "outputs": [
51 | {
52 | "data": {
53 | "text/plain": [
54 | "['welcome', 'howdy', 'hi', 'greetings', 'bonjour']"
55 | ]
56 | },
57 | "execution_count": 15,
58 | "metadata": {},
59 | "output_type": "execute_result"
60 | }
61 | ],
62 | "source": [
63 | "def get_web_syns(word, pos=None, n = 5):\n",
64 | " if pos == None:\n",
65 | " req = requests.get('http://www.thesaurus.com/browse/%s' % word)\n",
66 | " else:\n",
67 | " req = requests.get('http://www.thesaurus.com/browse/%s/%s' % (word, pos))\n",
68 | "\n",
69 | " soup = BeautifulSoup(req.text, 'html.parser')\n",
70 | " \n",
71 | " all_syns = soup.find('div', {'class' : 'relevancy-list'})\n",
72 | " syns = []\n",
73 | " if all_syns == None:\n",
74 | " return syns\n",
75 | " for ul in all_syns.findAll('ul'):\n",
76 | " for li in ul.findAll('span', {'class':'text'}):\n",
77 | " syns.append(li.text.split(\",\")[0])\n",
78 | " return syns[:n]\n",
79 | "\n",
80 | "# Example\n",
81 | "get_web_syns('hello')"
82 | ]
83 | },
84 | {
85 | "cell_type": "markdown",
86 | "metadata": {
87 | "deletable": true,
88 | "editable": true
89 | },
90 | "source": [
91 | "**Read in a sample input file, e.g. excel format. Show the raw raw text and keywords extracted columns: **"
92 | ]
93 | },
94 | {
95 | "cell_type": "code",
96 | "execution_count": 16,
97 | "metadata": {
98 | "collapsed": false,
99 | "deletable": true,
100 | "editable": true
101 | },
102 | "outputs": [
103 | {
104 | "name": "stdout",
105 | "output_type": "stream",
106 | "text": [
107 | " ParaText \\\n",
108 | "0 Your salary, interest you earn, dividends rece... \n",
109 | "1 You must include on your return all items of i... \n",
110 | "\n",
111 | " Keywords \n",
112 | "0 gross income, excluded, taxable income, passiv... \n",
113 | "1 income, tax law, taxable, nontaxable, items, d... \n"
114 | ]
115 | }
116 | ],
117 | "source": [
118 | "INPUT_FILE = \"raw_text_enriched_with_keywords_sample.xlsx\"\n",
119 | "df = pd.read_excel(INPUT_FILE)\n",
120 | "print(df[['ParaText','Keywords']])"
121 | ]
122 | },
123 | {
124 | "cell_type": "markdown",
125 | "metadata": {
126 | "deletable": true,
127 | "editable": true
128 | },
129 | "source": [
130 | "** We are going to extract all the keywords/phrases in the Keywords column, count frequency, and keep only keywords above a pre-defined threshold. Then, get the synonyms (if they exist) of each keyword, and save the resulting map to file: **"
131 | ]
132 | },
133 | {
134 | "cell_type": "code",
135 | "execution_count": 17,
136 | "metadata": {
137 | "collapsed": false,
138 | "deletable": true,
139 | "editable": true
140 | },
141 | "outputs": [
142 | {
143 | "name": "stdout",
144 | "output_type": "stream",
145 | "text": [
146 | "Number of keywords-synonym pairs before cleaning: 66\n",
147 | "Number of keywords-synonym pairs after cleaning: 56\n",
148 | "{' travel agency': ['holiday company', 'travel bureau'], ' friend': ['colleague', 'acquaintance', 'buddy', 'associate', 'companion'], ' discussions': ['conference', 'dialogue', 'deliberation', 'exchange', 'review'], ' many kinds': ['womankinds'], ' sale': ['purchase', 'transaction', 'deal', 'business', 'auction']}\n"
149 | ]
150 | }
151 | ],
152 | "source": [
153 | "MIN_KEYWORD_COUNT = 1\n",
154 | "keywords_list = df[\"Keywords\"].tolist()\n",
155 | "\n",
156 | "flattened_keywords_list = []\n",
157 | "for sublist in keywords_list:\n",
158 | " for val in sublist.split(\",\"):\n",
159 | " flattened_keywords_list.append(val)\n",
160 | " \n",
161 | "keywords_count = Counter(flattened_keywords_list)\n",
162 | "keywords_filtered = Counter(el for el in keywords_count.elements() if keywords_count[el] >=MIN_KEYWORD_COUNT)\n",
163 | "\n",
164 | "keyword_synonym = {keyword:get_web_syns(keyword) for keyword in keywords_filtered}\n",
165 | "#print(keyword_synonym)\n",
166 | "print(\"Number of keywords-synonym pairs before cleaning:\",len(keyword_synonym))\n",
167 | "\n",
168 | "# a helper function to identify and filter out keywords containing a digit - normally, you cannot find synonyms \n",
169 | "#for such words in thesaurus\n",
170 | "def hasNumbers(inputString):\n",
171 | " return any(char.isdigit() for char in inputString)\n",
172 | "\n",
173 | "keyword_synonym_clean = {}\n",
174 | "for k,v in keyword_synonym.items():\n",
175 | " if v!=[] and not hasNumbers(k):\n",
176 | " keyword_synonym_clean[k]=v\n",
177 | " \n",
178 | "print(\"Number of keywords-synonym pairs after cleaning:\",len(keyword_synonym_clean))\n",
179 | "# peek at a few keyword-synonyms pairs\n",
180 | "print(dict(list(keyword_synonym_clean.items())[0:5]))"
181 | ]
182 | },
183 | {
184 | "cell_type": "markdown",
185 | "metadata": {
186 | "deletable": true,
187 | "editable": true
188 | },
189 | "source": [
190 | "# Part 2: Manually adding synonym entries, typically for domain specific definitions #\n",
191 | "** Any synonym service would most like not be able to retrieve domain specific synonyms to acronym words. If you have such a domain specific acronym map, you can add it manually to your synonym map. **"
192 | ]
193 | },
194 | {
195 | "cell_type": "code",
196 | "execution_count": 18,
197 | "metadata": {
198 | "collapsed": true
199 | },
200 | "outputs": [],
201 | "source": [
202 | "# domain specific acronyms in the taxcode world\n",
203 | "acronym_dict = \"\"\"AAA, Accumulated Adjustment Account\n",
204 | "Acq., Acquiescence\n",
205 | "ACRS, Accelerated Cost Recovery System\n",
206 | "ADR, Asset Depreciation Range\n",
207 | "ADLs, Activities of Daily Living\n",
208 | "ADS, Alternative Depreciation System\n",
209 | "AFR, Applicable Federal Rate\n",
210 | "AGI, Adjusted Gross Income\n",
211 | "AIME, Average Indexed Monthly Earnings (Social Security)\n",
212 | "AMT, Alternative Minimum Tax\n",
213 | "AOD, Action on Decision\n",
214 | "ARM, Adjustable Rate Mortgage\n",
215 | "ATG, Audit Techniques Guide\n",
216 | "CB, Cumulative Bulletin\n",
217 | "CCA, Chief Council Advice\n",
218 | "CC-ITA, Chief Council - Income Tax and Accounting\n",
219 | "CCC, Commodity Credit Corporation\n",
220 | "CCP, Counter-Cyclical Program (government farm program)\n",
221 | "CDHP, Consumer-Driven Health Plan\n",
222 | "CFR, Code of Federal Regulations\n",
223 | "CLT, Charitable Lead Trust\n",
224 | "COBRA, Consolidated Omnibus Budget Reconciliations Act of 1985\n",
225 | "COGS, Cost of Goods Sold\n",
226 | "COLA, Cost of Living Adjustment\n",
227 | "CONUS, Continental United States\n",
228 | "CPI, Consurmer Price Index\n",
229 | "CRT, Charitable Remainder Trust\n",
230 | "CSRA, Community Spouse Resource Allowance\n",
231 | "CSRS, Civil Service Retirement System\n",
232 | "DOD, Date of Death\n",
233 | "DOI, Discharge of Indebtedness\n",
234 | "DP, Direct Payment (government farm program)\n",
235 | "DPAD, Domestic Production Activities Deduction\n",
236 | "DPAI, Domestic Production Activities Income\n",
237 | "DPAR, Domestic Production Activities Receipts\n",
238 | "DPGR, Domestic Production Gross Receipts\n",
239 | "EFIN, Electronic Filing Identification Number\n",
240 | "EFT, Electronic Funds Transfer\n",
241 | "EFTPS, Electronic Federal Tax Payment System\n",
242 | "EIC, Earned Income Credit\n",
243 | "EIN, Employer Identification Number\n",
244 | "f/b/o, For Benefit Of or For and On Behalf Of\n",
245 | "FICA, Federal Insurance Contribution Act\n",
246 | "FIFO, First In First Out\n",
247 | "FLP, Family Limited Partnership\n",
248 | "FMV, Fair Market Value\n",
249 | "FR, Federal Register\n",
250 | "FS, IRS Fact Sheets (example: FS-2005-10)\n",
251 | "FSA, Flexible Spending Account or Farm Service Agency\n",
252 | "FTD, Federal Tax Deposit\n",
253 | "FUTA, Federal Unemployment Tax Act\n",
254 | "GCM, General Counsel Memorandum\n",
255 | "GDS, General Depreciation System\n",
256 | "HDHP, High Deductible Health Plan\n",
257 | "HOH, Head of Household\n",
258 | "HRA, Health Reimbursement Account\n",
259 | "HSA, Health Savings Account\n",
260 | "IDC, Intangible Drilling Costs\n",
261 | "ILIT, Irrevocable Life Insurance Trust\n",
262 | "IR, IRS News Releases (example: IR-2005-2)\n",
263 | "IRA, Individual Retirement Arrangement\n",
264 | "IRB, Internal Revenue Bulletin\n",
265 | "IRC, Internal Revenue Code\n",
266 | "IRD, Income In Respect of Decedent\n",
267 | "IRP, Information Reporting Program\n",
268 | "ITA, Income Tax and Accounting\n",
269 | "ITIN, Individual Taxpayer Identification Number\n",
270 | "LDP, Loan Deficiency Payment\n",
271 | "LIFO, Last In First Out\n",
272 | "LLC, Limited Liability Company\n",
273 | "LLLP, Limited Liability Limited Partnership\n",
274 | "LP, Limited Partnership\n",
275 | "MACRS, Modified Accelerated Cost Recovery System\n",
276 | "MAGI, Modified Adjusted Gross Income\n",
277 | "MFJ, Married Filing Jointly\n",
278 | "MMMNA, Minimum Monthly Maintenance Needs Allowance\n",
279 | "MRD, Minimum Required Distribution\n",
280 | "MSA, Medical Savings Account (Archer MSA)\n",
281 | "MSSP, Market Segment Specialization Program\n",
282 | "NAICS, North American Industry Classification System\n",
283 | "NOL, Net Operating Loss\n",
284 | "OASDI, Old Age Survivor and Disability Insurance\n",
285 | "OIC, Offer in Compromise\n",
286 | "OID, Original Issue Discount\n",
287 | "PATR, Patronage Dividend\n",
288 | "PBA, Principal Business Activity\n",
289 | "PCP, Posted County Price, also referred to as AWP - adjusted world price\n",
290 | "PHC, Personal Holding Company\n",
291 | "PIA, Primary Insurance Amount (Social Security)\n",
292 | "PLR, Private Letter Ruling\n",
293 | "POD, Payable on Death\n",
294 | "PSC, Public Service Corporation\n",
295 | "QTIP, Qualified Terminable Interest Property\n",
296 | "RBD, Required Beginning Date\n",
297 | "REIT, Real Estate Investment Trust\n",
298 | "RMD, Required Minimum Distribution\n",
299 | "SCA, Service Center Advice\n",
300 | "SCIN, Self-Canceling Installment Note\n",
301 | "SE, Self Employment\n",
302 | "SEP, Simplified Employee Pension\n",
303 | "SIC, Service Industry Code\n",
304 | "SIMPLE, Savings Incentive Match Plan for Employees\n",
305 | "SL, Straight-Line Depreciation\n",
306 | "SMLLC, Single Member LLC\n",
307 | "SSA, Social Security Administration\n",
308 | "SSI, Supplemental Security Income\n",
309 | "SSN, Social Security Number\n",
310 | "SUTA, State Unemployment Tax Act\n",
311 | "TC, Tax Court\n",
312 | "TCMP, Taxpayer Compliance Measurement Program\n",
313 | "TD, Treasury Decision\n",
314 | "TIN, Taxpayer Identification Number\n",
315 | "TIR, Technical Information Release\n",
316 | "TOD, Transfer on Death\n",
317 | "USC, United States Code\n",
318 | "U/D/T, Under Declaration of Trust\n",
319 | "UNICAP, Uniform Capitalization Rules\n",
320 | "UTMA, Uniform Transfers to Minors Act\n",
321 | "VITA, Volunteer Income Tax Assistance\n",
322 | "GO Zone, Gulf Opportunity Zone\n",
323 | "Ct. D., Court Decision\n",
324 | "Ltr. Rul., Letter Rulings\n",
325 | "Prop. Reg., Proposed Treasury Regulations\n",
326 | "Pub. L., Public Law\n",
327 | "Rev. Proc., Revenue Procedure\n",
328 | "Rev. Rul., Revenue Ruling\n",
329 | "\"\"\""
330 | ]
331 | },
332 | {
333 | "cell_type": "markdown",
334 | "metadata": {},
335 | "source": [
336 | "** Add the thesaurus synonyms and the acronyms to a synonym map that can later be utilized by a search engine **"
337 | ]
338 | },
339 | {
340 | "cell_type": "code",
341 | "execution_count": 19,
342 | "metadata": {
343 | "collapsed": true,
344 | "deletable": true,
345 | "editable": true
346 | },
347 | "outputs": [],
348 | "source": [
349 | "OUTPUT_FILE = \"keywords_synonym.txt\"\n",
350 | "\n",
351 | "file = open(OUTPUT_FILE, 'w')\n",
352 | "# 1. add the acronyms: comma separated to indicate both ways relationship, e.g. \"<=>\"\n",
353 | "file.write(acronym_dict)\n",
354 | "# 2. add the synonyms: \"=>\" separated to indicate a relationship from left to right only\n",
355 | "for k,v in keyword_synonym_clean.items():\n",
356 | " line = k.strip() + \"=>\" + ','.join(v) + \"\\n\"\n",
357 | " file.write(line)\n",
358 | " \n",
359 | "file.close()"
360 | ]
361 | },
362 | {
363 | "cell_type": "markdown",
364 | "metadata": {
365 | "deletable": true,
366 | "editable": true
367 | },
368 | "source": [
369 | "** Peek at a few synonym map entries **"
370 | ]
371 | },
372 | {
373 | "cell_type": "code",
374 | "execution_count": 20,
375 | "metadata": {
376 | "collapsed": false,
377 | "deletable": true,
378 | "editable": true
379 | },
380 | "outputs": [
381 | {
382 | "name": "stdout",
383 | "output_type": "stream",
384 | "text": [
385 | "AAA, Accumulated Adjustment Account\n",
386 | "Acq., Acquiescence\n",
387 | "ACRS, Accelerated Cost Recovery System\n",
388 | "ADR, Asset Depreciation Range\n",
389 | "ADLs, Activities of Daily Living\n",
390 | "organizing=>run,formulate,form,set up,create\n",
391 | "inheritances=>legacy,bequest,estate,heritage,devise\n",
392 | "reported=>recorded,noted,announced,rumored,said\n",
393 | "interest=>importance,significance,sympathy,passion,activity\n",
394 | "book=>essay,album,novel,publication,dictionary\n"
395 | ]
396 | }
397 | ],
398 | "source": [
399 | "%%bash\n",
400 | "cat keywords_synonym.txt | head -5 | less -S\n",
401 | "cat keywords_synonym.txt | tail -5 | less -S"
402 | ]
403 | },
404 | {
405 | "cell_type": "code",
406 | "execution_count": null,
407 | "metadata": {
408 | "collapsed": true,
409 | "deletable": true,
410 | "editable": true
411 | },
412 | "outputs": [],
413 | "source": []
414 | }
415 | ],
416 | "metadata": {
417 | "anaconda-cloud": {},
418 | "kernelspec": {
419 | "display_name": "Python 3.5",
420 | "language": "python",
421 | "name": "python3"
422 | },
423 | "language_info": {
424 | "codemirror_mode": {
425 | "name": "ipython",
426 | "version": 3
427 | },
428 | "file_extension": ".py",
429 | "mimetype": "text/x-python",
430 | "name": "python",
431 | "nbconvert_exporter": "python",
432 | "pygments_lexer": "ipython3",
433 | "version": "3.5.2"
434 | }
435 | },
436 | "nbformat": 4,
437 | "nbformat_minor": 1
438 | }
439 |
--------------------------------------------------------------------------------
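The keywords_synonym.txt file written above follows the Solr synonym syntax consumed by Azure Search synonym maps (uploading it is covered in the companion notebook AugmentingSearch_UploadingSynonymMapToAzureSearch.ipynb): a comma-separated line declares equivalent terms that expand in both directions, while "lhs=>rhs" rewrites the left-hand side to the right-hand terms only. A small sketch that classifies the entries in the generated file, assuming it sits in the working directory:

    # Classify each line of the generated synonym map as one-way or two-way
    with open('keywords_synonym.txt') as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            if '=>' in line:
                lhs, rhs = line.split('=>', 1)
                print('one-way: %s -> %s' % (lhs.strip(), rhs.strip()))
            else:
                terms = [t.strip() for t in line.split(',')]
                print('two-way: %s' % ' | '.join(terms))
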
/JupyterNotebooks/SmartStoplist.txt:
--------------------------------------------------------------------------------
1 | #stop word list from SMART (Salton,1971). Available at ftp://ftp.cs.cornell.edu/pub/smart/english.stop
2 | a
3 | a's
4 | able
5 | about
6 | above
7 | according
8 | accordingly
9 | across
10 | actually
11 | after
12 | afterwards
13 | again
14 | against
15 | ain't
16 | all
17 | allow
18 | allows
19 | almost
20 | alone
21 | along
22 | already
23 | also
24 | although
25 | always
26 | am
27 | among
28 | amongst
29 | an
30 | and
31 | another
32 | any
33 | anybody
34 | anyhow
35 | anyone
36 | anything
37 | anyway
38 | anyways
39 | anywhere
40 | apart
41 | appear
42 | appreciate
43 | appropriate
44 | are
45 | aren't
46 | around
47 | as
48 | aside
49 | ask
50 | asking
51 | associated
52 | at
53 | available
54 | away
55 | awfully
56 | b
57 | be
58 | became
59 | because
60 | become
61 | becomes
62 | becoming
63 | been
64 | before
65 | beforehand
66 | behind
67 | being
68 | believe
69 | below
70 | beside
71 | besides
72 | best
73 | better
74 | between
75 | beyond
76 | both
77 | brief
78 | but
79 | by
80 | c
81 | c'mon
82 | c's
83 | came
84 | can
85 | can't
86 | cannot
87 | cant
88 | cause
89 | causes
90 | certain
91 | certainly
92 | changes
93 | clearly
94 | co
95 | com
96 | come
97 | comes
98 | concerning
99 | consequently
100 | consider
101 | considering
102 | contain
103 | containing
104 | contains
105 | corresponding
106 | could
107 | couldn't
108 | course
109 | currently
110 | d
111 | definitely
112 | described
113 | despite
114 | did
115 | didn't
116 | different
117 | do
118 | does
119 | doesn't
120 | doing
121 | don't
122 | done
123 | down
124 | downwards
125 | during
126 | e
127 | each
128 | edu
129 | eg
130 | eight
131 | either
132 | else
133 | elsewhere
134 | enough
135 | entirely
136 | especially
137 | et
138 | etc
139 | even
140 | ever
141 | every
142 | everybody
143 | everyone
144 | everything
145 | everywhere
146 | ex
147 | exactly
148 | example
149 | except
150 | f
151 | far
152 | few
153 | fifth
154 | first
155 | five
156 | followed
157 | following
158 | follows
159 | for
160 | former
161 | formerly
162 | forth
163 | four
164 | from
165 | further
166 | furthermore
167 | g
168 | get
169 | gets
170 | getting
171 | given
172 | gives
173 | go
174 | goes
175 | going
176 | gone
177 | got
178 | gotten
179 | greetings
180 | h
181 | had
182 | hadn't
183 | happens
184 | hardly
185 | has
186 | hasn't
187 | have
188 | haven't
189 | having
190 | he
191 | he's
192 | hello
193 | help
194 | hence
195 | her
196 | here
197 | here's
198 | hereafter
199 | hereby
200 | herein
201 | hereupon
202 | hers
203 | herself
204 | hi
205 | him
206 | himself
207 | his
208 | hither
209 | hopefully
210 | how
211 | howbeit
212 | however
213 | i
214 | i'd
215 | i'll
216 | i'm
217 | i've
218 | ie
219 | if
220 | ignored
221 | immediate
222 | in
223 | inasmuch
224 | inc
225 | indeed
226 | indicate
227 | indicated
228 | indicates
229 | inner
230 | insofar
231 | instead
232 | into
233 | inward
234 | is
235 | isn't
236 | it
237 | it'd
238 | it'll
239 | it's
240 | its
241 | itself
242 | j
243 | just
244 | k
245 | keep
246 | keeps
247 | kept
248 | know
249 | knows
250 | known
251 | l
252 | last
253 | lately
254 | later
255 | latter
256 | latterly
257 | least
258 | less
259 | lest
260 | let
261 | let's
262 | like
263 | liked
264 | likely
265 | little
266 | look
267 | looking
268 | looks
269 | ltd
270 | m
271 | mainly
272 | many
273 | may
274 | maybe
275 | me
276 | mean
277 | meanwhile
278 | merely
279 | might
280 | more
281 | moreover
282 | most
283 | mostly
284 | much
285 | must
286 | my
287 | myself
288 | n
289 | name
290 | namely
291 | nd
292 | near
293 | nearly
294 | necessary
295 | need
296 | needs
297 | neither
298 | never
299 | nevertheless
300 | new
301 | next
302 | nine
303 | no
304 | nobody
305 | non
306 | none
307 | noone
308 | nor
309 | normally
310 | not
311 | nothing
312 | novel
313 | now
314 | nowhere
315 | o
316 | obviously
317 | of
318 | off
319 | often
320 | oh
321 | ok
322 | okay
323 | old
324 | on
325 | once
326 | one
327 | ones
328 | only
329 | onto
330 | or
331 | other
332 | others
333 | otherwise
334 | ought
335 | our
336 | ours
337 | ourselves
338 | out
339 | outside
340 | over
341 | overall
342 | own
343 | p
344 | particular
345 | particularly
346 | per
347 | perhaps
348 | placed
349 | please
350 | plus
351 | possible
352 | presumably
353 | probably
354 | provides
355 | q
356 | que
357 | quite
358 | qv
359 | r
360 | rather
361 | rd
362 | re
363 | really
364 | reasonably
365 | regarding
366 | regardless
367 | regards
368 | relatively
369 | respectively
370 | right
371 | s
372 | said
373 | same
374 | saw
375 | say
376 | saying
377 | says
378 | second
379 | secondly
380 | see
381 | seeing
382 | seem
383 | seemed
384 | seeming
385 | seems
386 | seen
387 | self
388 | selves
389 | sensible
390 | sent
391 | serious
392 | seriously
393 | seven
394 | several
395 | shall
396 | she
397 | should
398 | shouldn't
399 | since
400 | six
401 | so
402 | some
403 | somebody
404 | somehow
405 | someone
406 | something
407 | sometime
408 | sometimes
409 | somewhat
410 | somewhere
411 | soon
412 | sorry
413 | specified
414 | specify
415 | specifying
416 | still
417 | sub
418 | such
419 | sup
420 | sure
421 | t
422 | t's
423 | take
424 | taken
425 | tell
426 | tends
427 | th
428 | than
429 | thank
430 | thanks
431 | thanx
432 | that
433 | that's
434 | thats
435 | the
436 | their
437 | theirs
438 | them
439 | themselves
440 | then
441 | thence
442 | there
443 | there's
444 | thereafter
445 | thereby
446 | therefore
447 | therein
448 | theres
449 | thereupon
450 | these
451 | they
452 | they'd
453 | they'll
454 | they're
455 | they've
456 | think
457 | third
458 | this
459 | thorough
460 | thoroughly
461 | those
462 | though
463 | three
464 | through
465 | throughout
466 | thru
467 | thus
468 | to
469 | together
470 | too
471 | took
472 | toward
473 | towards
474 | tried
475 | tries
476 | truly
477 | try
478 | trying
479 | twice
480 | two
481 | u
482 | un
483 | under
484 | unfortunately
485 | unless
486 | unlikely
487 | until
488 | unto
489 | up
490 | upon
491 | us
492 | use
493 | used
494 | useful
495 | uses
496 | using
497 | usually
498 | uucp
499 | v
500 | value
501 | various
502 | very
503 | via
504 | viz
505 | vs
506 | w
507 | want
508 | wants
509 | was
510 | wasn't
511 | way
512 | we
513 | we'd
514 | we'll
515 | we're
516 | we've
517 | welcome
518 | well
519 | went
520 | were
521 | weren't
522 | what
523 | what's
524 | whatever
525 | when
526 | whence
527 | whenever
528 | where
529 | where's
530 | whereafter
531 | whereas
532 | whereby
533 | wherein
534 | whereupon
535 | wherever
536 | whether
537 | which
538 | while
539 | whither
540 | who
541 | who's
542 | whoever
543 | whole
544 | whom
545 | whose
546 | why
547 | will
548 | willing
549 | wish
550 | with
551 | within
552 | without
553 | won't
554 | wonder
555 | would
556 | would
557 | wouldn't
558 | x
559 | y
560 | yes
561 | yet
562 | you
563 | you'd
564 | you'll
565 | you're
566 | you've
567 | your
568 | yours
569 | yourself
570 | yourselves
571 | z
572 | zero
573 |
--------------------------------------------------------------------------------
/JupyterNotebooks/SmartStoplist_extended.txt:
--------------------------------------------------------------------------------
1 | #stop word list from SMART (Salton,1971). Available at ftp://ftp.cs.cornell.edu/pub/smart/english.stop
2 | a
3 | a's
4 | able
5 | about
6 | above
7 | according
8 | accordingly
9 | across
10 | actually
11 | after
12 | afterwards
13 | again
14 | against
15 | ain't
16 | all
17 | allow
18 | allows
19 | almost
20 | alone
21 | along
22 | already
23 | also
24 | although
25 | always
26 | am
27 | among
28 | amongst
29 | an
30 | and
31 | another
32 | any
33 | anybody
34 | anyhow
35 | anyone
36 | anything
37 | anyway
38 | anyways
39 | anywhere
40 | apart
41 | appear
42 | appreciate
43 | appropriate
44 | are
45 | aren't
46 | around
47 | as
48 | aside
49 | ask
50 | asking
51 | associated
52 | at
53 | available
54 | away
55 | awfully
56 | b
57 | be
58 | became
59 | because
60 | become
61 | becomes
62 | becoming
63 | been
64 | before
65 | beforehand
66 | behind
67 | being
68 | believe
69 | below
70 | beside
71 | besides
72 | best
73 | better
74 | between
75 | beyond
76 | both
77 | brief
78 | but
79 | by
80 | c
81 | c'mon
82 | c's
83 | came
84 | can
85 | can't
86 | cannot
87 | cant
88 | cause
89 | causes
90 | certain
91 | certainly
92 | changes
93 | clearly
94 | co
95 | com
96 | come
97 | comes
98 | concerning
99 | consequently
100 | consider
101 | considering
102 | contain
103 | containing
104 | contains
105 | corresponding
106 | could
107 | couldn't
108 | course
109 | currently
110 | d
111 | definitely
112 | described
113 | despite
114 | did
115 | didn't
116 | different
117 | do
118 | does
119 | doesn't
120 | doing
121 | don't
122 | done
123 | down
124 | downwards
125 | during
126 | e
127 | each
128 | edu
129 | eg
130 | eight
131 | either
132 | else
133 | elsewhere
134 | enough
135 | entirely
136 | especially
137 | et
138 | etc
139 | even
140 | ever
141 | every
142 | everybody
143 | everyone
144 | everything
145 | everywhere
146 | ex
147 | exactly
148 | example
149 | except
150 | f
151 | far
152 | few
153 | fifth
154 | first
155 | five
156 | followed
157 | following
158 | follows
159 | for
160 | former
161 | formerly
162 | forth
163 | four
164 | from
165 | further
166 | furthermore
167 | g
168 | get
169 | gets
170 | getting
171 | given
172 | gives
173 | go
174 | goes
175 | going
176 | gone
177 | got
178 | gotten
179 | greetings
180 | h
181 | had
182 | hadn't
183 | happens
184 | hardly
185 | has
186 | hasn't
187 | have
188 | haven't
189 | having
190 | he
191 | he's
192 | hello
193 | help
194 | hence
195 | her
196 | here
197 | here's
198 | hereafter
199 | hereby
200 | herein
201 | hereupon
202 | hers
203 | herself
204 | hi
205 | him
206 | himself
207 | his
208 | hither
209 | hopefully
210 | how
211 | howbeit
212 | however
213 | i
214 | i'd
215 | i'll
216 | i'm
217 | i've
218 | ie
219 | if
220 | ignored
221 | immediate
222 | in
223 | inasmuch
224 | inc
225 | indeed
226 | indicate
227 | indicated
228 | indicates
229 | inner
230 | insofar
231 | instead
232 | into
233 | inward
234 | is
235 | isn't
236 | it
237 | it'd
238 | it'll
239 | it's
240 | its
241 | itself
242 | j
243 | just
244 | k
245 | keep
246 | keeps
247 | kept
248 | know
249 | knows
250 | known
251 | l
252 | last
253 | lately
254 | later
255 | latter
256 | latterly
257 | least
258 | less
259 | lest
260 | let
261 | let's
262 | like
263 | liked
264 | likely
265 | little
266 | look
267 | looking
268 | looks
269 | ltd
270 | m
271 | mainly
272 | many
273 | may
274 | maybe
275 | me
276 | mean
277 | meanwhile
278 | merely
279 | might
280 | more
281 | moreover
282 | most
283 | mostly
284 | much
285 | must
286 | my
287 | myself
288 | n
289 | name
290 | namely
291 | nd
292 | near
293 | nearly
294 | necessary
295 | need
296 | needs
297 | neither
298 | never
299 | nevertheless
300 | new
301 | next
302 | nine
303 | no
304 | nobody
305 | non
306 | none
307 | noone
308 | nor
309 | normally
310 | not
311 | nothing
312 | novel
313 | now
314 | nowhere
315 | o
316 | obviously
317 | of
318 | off
319 | often
320 | oh
321 | ok
322 | okay
323 | old
324 | on
325 | once
326 | one
327 | ones
328 | only
329 | onto
330 | or
331 | other
332 | others
333 | otherwise
334 | ought
335 | our
336 | ours
337 | ourselves
338 | out
339 | outside
340 | over
341 | overall
342 | own
343 | p
344 | particular
345 | particularly
346 | per
347 | perhaps
348 | placed
349 | please
350 | plus
351 | possible
352 | presumably
353 | probably
354 | provides
355 | q
356 | que
357 | quite
358 | qv
359 | r
360 | rather
361 | rd
362 | re
363 | really
364 | reasonably
365 | regarding
366 | regardless
367 | regards
368 | relatively
369 | respectively
370 | right
371 | s
372 | said
373 | same
374 | saw
375 | say
376 | saying
377 | says
378 | second
379 | secondly
380 | see
381 | seeing
382 | seem
383 | seemed
384 | seeming
385 | seems
386 | seen
387 | self
388 | selves
389 | sensible
390 | sent
391 | serious
392 | seriously
393 | seven
394 | several
395 | shall
396 | she
397 | should
398 | shouldn't
399 | since
400 | six
401 | so
402 | some
403 | somebody
404 | somehow
405 | someone
406 | something
407 | sometime
408 | sometimes
409 | somewhat
410 | somewhere
411 | soon
412 | sorry
413 | specified
414 | specify
415 | specifying
416 | still
417 | sub
418 | such
419 | sup
420 | sure
421 | t
422 | t's
423 | take
424 | taken
425 | tell
426 | tends
427 | th
428 | than
429 | thank
430 | thanks
431 | thanx
432 | that
433 | that's
434 | thats
435 | the
436 | their
437 | theirs
438 | them
439 | themselves
440 | then
441 | thence
442 | there
443 | there's
444 | thereafter
445 | thereby
446 | therefore
447 | therein
448 | theres
449 | thereupon
450 | these
451 | they
452 | they'd
453 | they'll
454 | they're
455 | they've
456 | think
457 | third
458 | this
459 | thorough
460 | thoroughly
461 | those
462 | though
463 | three
464 | through
465 | throughout
466 | thru
467 | thus
468 | to
469 | together
470 | too
471 | took
472 | toward
473 | towards
474 | tried
475 | tries
476 | truly
477 | try
478 | trying
479 | twice
480 | two
481 | u
482 | un
483 | under
484 | unfortunately
485 | unless
486 | unlikely
487 | until
488 | unto
489 | up
490 | upon
491 | us
492 | use
493 | used
494 | useful
495 | uses
496 | using
497 | usually
498 | uucp
499 | v
500 | value
501 | various
502 | very
503 | via
504 | viz
505 | vs
506 | w
507 | want
508 | wants
509 | was
510 | wasn't
511 | way
512 | we
513 | we'd
514 | we'll
515 | we're
516 | we've
517 | welcome
518 | well
519 | went
520 | were
521 | weren't
522 | what
523 | what's
524 | whatever
525 | when
526 | whence
527 | whenever
528 | where
529 | where's
530 | whereafter
531 | whereas
532 | whereby
533 | wherein
534 | whereupon
535 | wherever
536 | whether
537 | which
538 | while
539 | whither
540 | who
541 | who's
542 | whoever
543 | whole
544 | whom
545 | whose
546 | why
547 | will
548 | willing
549 | wish
550 | with
551 | within
552 | without
553 | won't
554 | wonder
555 | would
556 | would
557 | wouldn't
558 | x
559 | y
560 | yes
561 | yet
562 | you
563 | you'd
564 | you'll
565 | you're
566 | you've
567 | your
568 | yours
569 | yourself
570 | yourselves
571 | z
572 | zero
573 | ###################
574 | section
575 | subsection
576 | sections
577 | subsections
578 | chapter
579 | chapters
580 | example
581 | paragraph
582 | paragraphs
583 | regard
584 | clause
585 | subclause
586 | case
587 | subparagraph
588 | subparagraphs
589 | i
590 | ii
591 | iii
592 | iv
593 | v
594 | vi
595 | vii
596 | viii
597 | ix
598 | x
599 |
600 |
--------------------------------------------------------------------------------
/JupyterNotebooks/rake.py:
--------------------------------------------------------------------------------
1 | # Implementation of RAKE - Rapid Automatic Keyword Extraction algorithm
2 | # as described in:
3 | # Rose, S., D. Engel, N. Cramer, and W. Cowley (2010).
4 | # Automatic keyword extraction from individual documents.
5 | # In M. W. Berry and J. Kogan (Eds.), Text Mining: Applications and Theory. John Wiley and Sons, Ltd.
6 |
7 | import re
8 | import operator
9 |
10 | debug = False
11 | test = False
12 |
13 |
14 | def is_number(s):
15 | try:
16 | float(s) if '.' in s else int(s)
17 | return True
18 | except ValueError:
19 | return False
20 |
21 |
22 | def load_stop_words(stop_word_file):
23 | """
24 | Utility function to load stop words from a file and return as a list of words
25 | @param stop_word_file Path and file name of a file containing stop words.
26 | @return list A list of stop words.
27 | """
28 | stop_words = []
29 | for line in open(stop_word_file):
30 | if line.strip()[0:1] != "#":
31 | for word in line.split(): # in case more than one per line
32 | stop_words.append(word)
33 | return stop_words
34 |
35 |
36 | def separate_words(text, min_word_return_size):
37 | """
38 | Utility function to return a list of all words that have a length greater than a specified number of characters.
39 | @param text The text that must be split into words.
40 | @param min_word_return_size The minimum number of characters a word must have to be included.
41 | """
42 | splitter = re.compile('[^a-zA-Z0-9_\\+\\-/]')
43 | words = []
44 | for single_word in splitter.split(text):
45 | current_word = single_word.strip().lower()
46 | #leave numbers in phrase, but don't count as words, since they tend to invalidate scores of their phrases
47 | if len(current_word) > min_word_return_size and current_word != '' and not is_number(current_word):
48 | words.append(current_word)
49 | return words
50 |
51 |
52 | def split_sentences(text):
53 | """
54 | Utility function to return a list of sentences.
55 | @param text The text that must be split into sentences.
56 | """
57 | sentence_delimiters = re.compile(u'[.!?,;:\t\\\\"\\(\\)\\\'\u2019\u2013]|\\s\\-\\s')
58 | sentences = sentence_delimiters.split(text)
59 | return sentences
60 |
61 |
62 | def build_stop_word_regex(stop_word_file_path):
63 | stop_word_list = load_stop_words(stop_word_file_path)
64 | stop_word_regex_list = []
65 | for word in stop_word_list:
66 | word_regex = r'\b' + word + r'(?![\w-])' # added look ahead for hyphen
67 | stop_word_regex_list.append(word_regex)
68 | stop_word_pattern = re.compile('|'.join(stop_word_regex_list), re.IGNORECASE)
69 | return stop_word_pattern
70 |
71 |
72 | def generate_candidate_keywords(sentence_list, stopword_pattern):
73 | phrase_list = []
74 | for s in sentence_list:
75 | tmp = re.sub(stopword_pattern, '|', s.strip())
76 | phrases = tmp.split("|")
77 | for phrase in phrases:
78 | phrase = phrase.strip().lower()
79 | if phrase != "":
80 | phrase_list.append(phrase)
81 | return phrase_list
82 |
83 |
84 | def calculate_word_scores(phraseList):
85 | word_frequency = {}
86 | word_degree = {}
87 | for phrase in phraseList:
88 | word_list = separate_words(phrase, 0)
89 | word_list_length = len(word_list)
90 | word_list_degree = word_list_length - 1
91 | #if word_list_degree > 3: word_list_degree = 3 #exp.
92 | for word in word_list:
93 | word_frequency.setdefault(word, 0)
94 | word_frequency[word] += 1
95 | word_degree.setdefault(word, 0)
96 | word_degree[word] += word_list_degree #orig.
97 | #word_degree[word] += 1/(word_list_length*1.0) #exp.
98 | for item in word_frequency:
99 | word_degree[item] = word_degree[item] + word_frequency[item]
100 |
101 | # Calculate word scores = deg(w) / freq(w)
102 | word_score = {}
103 | for item in word_frequency:
104 | word_score.setdefault(item, 0)
105 | word_score[item] = word_degree[item] / (word_frequency[item] * 1.0) #orig.
106 | #word_score[item] = word_frequency[item]/(word_degree[item] * 1.0) #exp.
107 | return word_score
108 |
109 |
110 | def generate_candidate_keyword_scores(phrase_list, word_score):
111 | keyword_candidates = {}
112 | for phrase in phrase_list:
113 | keyword_candidates.setdefault(phrase, 0)
114 | word_list = separate_words(phrase, 0)
115 | candidate_score = 0
116 | for word in word_list:
117 | candidate_score += word_score[word]
118 | keyword_candidates[phrase] = candidate_score
119 | return keyword_candidates
120 |
121 |
122 | class Rake(object):
123 | def __init__(self, stop_words_path):
124 | self.stop_words_path = stop_words_path
125 | self.__stop_words_pattern = build_stop_word_regex(stop_words_path)
126 |
127 | def run(self, text):
128 | sentence_list = split_sentences(text)
129 |
130 | phrase_list = generate_candidate_keywords(sentence_list, self.__stop_words_pattern)
131 | word_scores = calculate_word_scores(phrase_list)
132 |
133 | keyword_candidates = generate_candidate_keyword_scores(phrase_list, word_scores)
134 |
135 | sorted_keywords = sorted(keyword_candidates.items(), key=operator.itemgetter(1), reverse=True)
136 | return sorted_keywords
137 |
138 |
139 | if test:
140 | text = "Compatibility of systems of linear constraints over the set of natural numbers. Criteria of compatibility of a system of linear Diophantine equations, strict inequations, and nonstrict inequations are considered. Upper bounds for components of a minimal set of solutions and algorithms of construction of minimal generating sets of solutions for all types of systems are given. These criteria and the corresponding algorithms for constructing a minimal supporting set of solutions can be used in solving all the considered types of systems and systems of mixed types."
141 |
142 | # Split text into sentences
143 | sentenceList = split_sentences(text)
144 | #stoppath = "FoxStoplist.txt" #Fox stoplist contains "numbers", so it will not find "natural numbers" like in Table 1.1
145 | stoppath = "SmartStoplist.txt" #SMART stoplist misses some of the lower-scoring keywords in Figure 1.5, which means that the top 1/3 cuts off one of the 4.0 score words in Table 1.1
146 | stopwordpattern = build_stop_word_regex(stoppath)
147 |
148 | # generate candidate keywords
149 | phraseList = generate_candidate_keywords(sentenceList, stopwordpattern)
150 |
151 | # calculate individual word scores
152 | wordscores = calculate_word_scores(phraseList)
153 |
154 | # generate candidate keyword scores
155 | keywordcandidates = generate_candidate_keyword_scores(phraseList, wordscores)
156 | if debug: print(keywordcandidates)
157 |
158 | sortedKeywords = sorted(keywordcandidates.items(), key=operator.itemgetter(1), reverse=True)
159 | if debug: print(sortedKeywords)
160 |
161 | totalKeywords = len(sortedKeywords)
162 | if debug: print(totalKeywords)
163 | print(sortedKeywords[0:(totalKeywords // 3)])  # integer division for Python 3
164 |
165 | rake = Rake("SmartStoplist.txt")
166 | keywords = rake.run(text)
167 | print(keywords)
168 |
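[Editor's note] A small usage sketch of the deg(w)/freq(w) scoring implemented above (not part of rake.py; it assumes rake.py and SmartStoplist.txt sit in the working directory):

    from rake import Rake

    rake = Rake("SmartStoplist.txt")
    # In "linear constraints" each word co-occurs with one other word, so
    # deg = 2 and freq = 1, giving a word score of 2.0 and a phrase score of 4.0;
    # a standalone candidate such as "criteria" scores deg/freq = 1.0.
    text = ("Compatibility of systems of linear constraints. "
            "Criteria of compatibility are considered.")
    for phrase, score in rake.run(text):
        print("%.1f  %s" % (score, phrase))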
--------------------------------------------------------------------------------
/JupyterNotebooks/sample_page.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/CatalystCode/CustomSearch/3d9f82f676577210dde159ef7218753a2e8c309b/JupyterNotebooks/sample_page.png
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
1 | The MIT License (MIT)
2 | Copyright (c) Microsoft Corporation
3 |
4 | Permission is hereby granted, free of charge, to any person obtaining a copy of this software and
5 | associated documentation files (the "Software"), to deal in the Software without restriction,
6 | including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense,
7 | and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so,
8 | subject to the following conditions:
9 |
10 | The above copyright notice and this permission notice shall be included in all copies or substantial
11 | portions of the Software.
12 |
13 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT
14 | NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
15 | IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY,
16 | WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
17 | SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
--------------------------------------------------------------------------------
/Python/azsearch_mgmt.py:
--------------------------------------------------------------------------------
1 | """
2 | Python code to upload data to Azure Search for the Custom Search example.
3 |
4 | This script will upload all of the session information where
5 | each individual session equates to a document in an index
6 | in an Azure Search service.
7 |
8 | Go to http://portal.azure.com and sign up for a search service.
9 | Get the service name and service key and plug it in below.
10 | This is NOT production level code. Please do not use it as such.
11 | You might have to pip install the imported modules here.
12 |
13 | Run this script in the 'Python' directory:
14 | python azsearch_mgmt.py
15 |
16 | See Azure Search REST API docs for more info:
17 | https://docs.microsoft.com/en-us/rest/api/searchservice/index
18 |
19 | Dependencies: This script requires pyexcel, pyexcel-xls and pyexcel-xlsx
20 | To install dependencies: pip install pyexcel pyexcel-xls pyexcel-xlsx
21 | """
22 |
23 | import requests
24 | import json
25 | import csv
26 | import datetime
27 | import pytz
28 | import calendar
29 | import os
30 | import pyexcel as pe
31 |
32 | # This is the service you've already created in Azure Portal
33 | serviceName = 'your_azure_search_service_name'
34 |
35 | # Index to be created
36 | indexName = 'name_of_index_to_create'
37 |
38 | # Set your service API key, either via an environment variable or enter it below
39 | #apiKey = os.getenv('SEARCH_KEY_DEV', '')
40 | apiKey = 'your_azure_search_service_api_key'
41 | apiVersion = '2016-09-01'
42 |
43 | # Input parsed content Excel file, e.g., output of step #1 in
44 | # https://github.com/CatalystCode/CustomSearch/tree/master/JupyterNotebooks/1-content_extraction.ipynb
45 | inputfile = os.path.join(os.getcwd(), '../sample/parsed_content.xlsx')
46 |
47 | # Define fields mapping from Excel file column names to search index field names (except Index)
48 | # Change this mapping to match your content fields and rename output fields as desired
49 | # Search field names should match their definition in getIndexDefinition()
50 | fields_map = [ ('File' , 'File'),
51 | ('ChapterTitle' , 'ChapterTitle'),
52 | ('SectionTitle' , 'SectionTitle'),
53 | ('SubsectionTitle' , 'SubsectionTitle'),
54 | ('SubsectionText' , 'SubsectionText'),
55 | ('Keywords' , 'Keywords') ]
56 |
57 | # Fields: Index File ChapterTitle SectionTitle SubsectionTitle SubsectionText Keywords
58 | def getIndexDefinition():
59 | return {
60 | "name": indexName,
61 | "fields": [
62 | {"name": "Index", "type": "Edm.String", "key": True, "retrievable": True, "searchable": False, "filterable": False, "sortable": True, "facetable": False},
63 |
64 | {"name": "File", "type": "Edm.String", "retrievable": True, "searchable": False, "filterable": True, "sortable": True, "facetable": False},
65 |
66 | {"name": "ChapterTitle", "type": "Edm.String", "retrievable": True, "searchable": True, "filterable": True, "sortable": True, "facetable": True},
67 |
68 | {"name": "SectionTitle", "type": "Edm.String", "retrievable": True, "searchable": True, "filterable": True, "sortable": False, "facetable": True},
69 |
70 | {"name": "SubsectionTitle", "type": "Edm.String", "retrievable": True, "searchable": True, "filterable": True, "sortable": True, "facetable": False},
71 |
72 | {"name": "SubsectionText", "type": "Edm.String", "retrievable": True, "searchable": True, "filterable": False, "sortable": False, "facetable": False, "analyzer": "en.microsoft"},
73 |
74 | {"name": "Keywords", "type": "Edm.String", "retrievable": True, "searchable": True, "filterable": False, "sortable": False, "facetable": False, "analyzer": "en.microsoft"}
75 | ]
76 | }
77 |
78 | def getServiceUrl():
79 | return 'https://' + serviceName + '.search.windows.net'
80 |
81 | def getMethod(servicePath):
82 | headers = {'Content-type': 'application/json', 'api-key': apiKey}
83 | r = requests.get(getServiceUrl() + servicePath, headers=headers)
84 | #print(r.text)
85 | return r
86 |
87 | def postMethod(servicePath, body):
88 | headers = {'Content-type': 'application/json', 'api-key': apiKey}
89 | r = requests.post(getServiceUrl() + servicePath, headers=headers, data=body)
90 | #print(r, r.text)
91 | return r
92 |
93 | def createIndex():
94 | indexDefinition = json.dumps(getIndexDefinition())
95 | servicePath = '/indexes/?api-version=%s' % apiVersion
96 | r = postMethod(servicePath, indexDefinition)
97 | #print(r.text)
98 | if r.status_code == 201:
99 | print('Index %s created' % indexName)
100 | else:
101 | print('Failed to create index %s' % indexName)
102 | exit(1)
103 |
104 | def deleteIndex():
105 | servicePath = '/indexes/%s?api-version=%s' % (indexName, apiVersion)
106 | headers = {'Content-type': 'application/json', 'api-key': apiKey}
107 | r = requests.delete(getServiceUrl() + servicePath, headers=headers)
108 | #print(r.text)
109 |
110 | def getIndex():
111 | servicePath = '/indexes/%s?api-version=%s' % (indexName, apiVersion)
112 | r = getMethod(servicePath)
113 | if r.status_code == 200:
114 | return True
115 | else:
116 | return False
117 |
118 | def getDocumentObject():
119 | valarry = []
120 | cnt = 1
121 | records = pe.iget_records(file_name=inputfile)
122 | for row in records:
123 | outdict = {}
124 | outdict['@search.action'] = 'upload'
125 |
126 | if (row[fields_map[0][0]]):
127 | outdict['Index'] = str(row['Index'])
128 | for (in_fld, out_fld) in fields_map:
129 | outdict[out_fld] = row[in_fld]
130 | valarry.append(outdict)
131 | cnt+=1
132 |
133 | return {'value' : valarry}
134 |
135 | def getDocumentObjectByChunk(start, end):
136 | valarry = []
137 | cnt = 1
138 | records = pe.iget_records(file_name=inputfile)
139 | for i, row in enumerate(records):
140 | if start <= i < end:
141 | outdict = {}
142 | outdict['@search.action'] = 'upload'
143 |
144 | if (row[fields_map[0][0]]):
145 | outdict['Index'] = str(row['Index'])
146 | for (in_fld, out_fld) in fields_map:
147 | outdict[out_fld] = row[in_fld]
148 | valarry.append(outdict)
149 | cnt+=1
150 |
151 | return {'value' : valarry}
152 |
153 | # Upload content for indexing in one request if content is not too large
154 | def uploadDocuments():
155 | documents = json.dumps(getDocumentObject())
156 | servicePath = '/indexes/' + indexName + '/docs/index?api-version=' + apiVersion
157 | r = postMethod(servicePath, documents)
158 | if r.status_code == 200:
159 | print('Success: %s' % r)
160 | else:
161 | print('Failure: %s' % r.text)
162 | exit(1)
163 |
164 | # Upload content for indexing in chunks if content is too large for one request
165 | def uploadDocumentsInChunks(chunksize):
166 | records = pe.iget_records(file_name=inputfile)
167 | cnt = 0
168 | for row in records:
169 | cnt += 1
170 |
171 | for chunk in range(int(cnt/chunksize) + 1):
172 | print('Processing chunk number %d ...' % chunk)
173 | start = chunk * chunksize
174 | end = start + chunksize
175 | documents = json.dumps(getDocumentObjectByChunk(start, end))
176 | servicePath = '/indexes/' + indexName + '/docs/index?api-version=' + apiVersion
177 | r = postMethod(servicePath, documents)
178 | if r.status_code == 200:
179 | print('Success: %s' % r)
180 | else:
181 | print('Failure: %s' % r.text)
182 | return
183 |
184 | # Upload content for indexing one document at a time
185 | def uploadDocumentsOneByOne():
186 | records = pe.iget_records(file_name=inputfile)
187 | valarry = []
188 | for i, row in enumerate(records):
189 | outdict = {}
190 | outdict['@search.action'] = 'upload'
191 |
192 | if (row[fields_map[0][0]]):
193 | outdict['Index'] = str(row['Index'])
194 | for (in_fld, out_fld) in fields_map:
195 | outdict[out_fld] = row[in_fld]
196 | valarry.append(outdict)
197 |
198 | documents = json.dumps({'value' : valarry})
199 | servicePath = '/indexes/' + indexName + '/docs/index?api-version=' + apiVersion
200 | r = postMethod(servicePath, documents)
201 | if r.status_code == 200:
202 | print('%d Success: %s' % (i,r))
203 | else:
204 | print('%d Failure: %s' % (i, r.text))
205 | exit(1)
206 |
207 | def printDocumentCount():
208 | servicePath = '/indexes/' + indexName + '/docs/$count?api-version=' + apiVersion
209 | getMethod(servicePath)
210 |
211 | def sampleQuery(query, ntop=3):
212 | servicePath = '/indexes/' + indexName + '/docs?api-version=%s&search=%s&$top=%d' % \
213 | (apiVersion, query, ntop)
214 | getMethod(servicePath)
215 |
216 | # Python 2.x/3.x incompatibility of input() and raw_input()
217 | # Bind input() to raw_input() in Python 2.x, leave as-is in Python 3.x
218 | try:
219 | input = raw_input
220 | except NameError:
221 | pass
222 |
223 | if __name__ == '__main__':
224 | # Create index if it does not exist
225 | if not getIndex():
226 | createIndex()
227 | else:
228 | ans = input('Index %s already exists ... Do you want to delete it? [Y/n]' % indexName)
229 | if ans.lower() == 'y':
230 | deleteIndex()
231 | print('Re-creating index %s ...' % indexName)
232 | createIndex()
233 | else:
234 | print('Index %s is not deleted ... New content will be added to existing index' % indexName)
235 |
236 | #getIndex()
237 | #uploadDocuments()
238 | uploadDocumentsInChunks(50)
239 | #uploadDocumentsOneByOne()
240 | printDocumentCount()
241 | sampleQuery('child tax credit')
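[Editor's note] All three upload helpers above post the same JSON batch shape to /indexes/<indexName>/docs/index. A sketch of that body with invented field values (only "@search.action" and the field names come from the script; Azure Search also accepts the merge, mergeOrUpload and delete actions):

    example_batch = {
        "value": [
            {
                "@search.action": "upload",
                "Index": "42",                      # key field, must be a string
                "File": "1.1.1.1.1.2.html",
                "ChapterTitle": "Normal Taxes and Surtaxes",
                "SectionTitle": "Tax on Individuals",
                "SubsectionTitle": "Definitions",
                "SubsectionText": "For purposes of this part ...",
                "Keywords": "taxable year, taxable income"
            }
        ]
    }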
--------------------------------------------------------------------------------
/Python/azsearch_query.py:
--------------------------------------------------------------------------------
1 | """
2 | Python code to query Azure Search interactively
3 |
4 | Run this script in the 'Python' directory:
5 | python azsearch_query.py
6 |
7 | See Azure Search REST API docs for more info:
8 | https://docs.microsoft.com/en-us/rest/api/searchservice/index
9 |
10 | """
11 |
12 | import requests
13 | import json
14 | import os
15 |
16 | # This is the service you've already created in Azure Portal
17 | serviceName = 'your_azure_search_service_name'
18 |
19 | # This is the index you've already created in Azure Portal or via the azsearch_mgmt.py script
20 | indexName = 'your_index_name_to_use'
21 |
22 | # Set your service API key, either via an environment variable or enter it below
23 | #apiKey = os.getenv('SEARCH_KEY_DEV', '')
24 | apiKey = 'your_azure_search_service_api_key'
25 | apiVersion = '2016-09-01'
26 |
27 | # Retrieval options to alter the query results
28 | SEARCHFIELDS = None # use all searchable fields for retrieval
29 | #SEARCHFIELDS = 'Keywords, SubsectionText' # use selected fields only for retrieval
30 | FUZZY = False # enable fuzzy search (check API for details)
31 | NTOP = 5 # number of results to return
32 |
33 |
34 | def getServiceUrl():
35 | return 'https://' + serviceName + '.search.windows.net'
36 |
37 | def getMethod(servicePath):
38 | headers = {'Content-type': 'application/json', 'api-key': apiKey}
39 | r = requests.get(getServiceUrl() + servicePath, headers=headers)
40 | #print(r, r.text)
41 | return r
42 |
43 | def postMethod(servicePath, body):
44 | headers = {'Content-type': 'application/json', 'api-key': apiKey}
45 | r = requests.post(getServiceUrl() + servicePath, headers=headers, data=body)
46 | #print(r, r.text)
47 | return r
48 |
49 | def submitQuery(query, fields=None, ntop=10):
50 | servicePath = '/indexes/' + indexName + '/docs?api-version=%s&search=%s&$top=%d' % \
51 | (apiVersion, query, ntop)
52 | if fields != None:
53 | servicePath += '&searchFields=%s' % fields
54 | if FUZZY:
55 | servicePath += '&queryType=full'
56 | r = getMethod(servicePath)
57 | if r.status_code != 200:
58 | print('Failed to retrieve search results')
59 | print(r, r.text)
60 | return
61 | docs = json.loads(r.text)['value']
62 | print('Number of search results = %d\n' % len(docs))
63 | for i, doc in enumerate(docs):
64 | print('Results# %d' % (i+1))
65 | print('Chapter title : %s' % doc['ChapterTitle'].encode('utf8'))
66 | print('Section title : %s' % doc['SectionTitle'].encode('utf8'))
67 | print('Subsection title: %s' % doc['SubsectionTitle'].encode('utf8'))
68 | print('%s\n' % doc['SubsectionText'].encode('utf8'))
69 |
70 | # Python 2.x/3.x incompatibility of input() and raw_input()
71 | # Bind input() to raw_input() in Python 2.x, leave as-is in Python 3.x
72 | try:
73 | input = raw_input
74 | except NameError:
75 | pass
76 |
77 | #####################################################################
78 | # Azure Search interactive query - command-line interface
79 | # Retrieve Azure Search documents via an interactive query
80 | # Fields: Index File ChapterTitle SectionTitle SubsectionTitle SubsectionText Keywords
81 | #####################################################################
82 | if __name__ == '__main__':
83 | while True:
84 | print()
85 | print("Hit enter with no input to quit.")
86 | query = input("Query: ")
87 | if query == '':
88 | exit(0)
89 |
90 | # Submit query to Azure Search and retrieve results
91 | #searchFields = None
92 | searchFields = SEARCHFIELDS
93 | submitQuery(query, fields=searchFields, ntop=NTOP)
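[Editor's note] With FUZZY = True the script appends queryType=full to the request, which switches Azure Search to the full Lucene query syntax. Illustrative queries for the interactive prompt (the misspelling below is deliberate):

    # Simple syntax (default):    child tax credit
    # Full syntax, phrase match:  "child tax credit"
    # Full syntax, fuzzy match:   credti~1   (terms within edit distance 1)
    submitQuery('credti~1', fields=SEARCHFIELDS, ntop=3)  # requires FUZZY = True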
--------------------------------------------------------------------------------
/Python/azsearch_queryall.py:
--------------------------------------------------------------------------------
1 | """
2 | Python code for batch retrieval of Azure Search results for multiple queries in a file
3 |
4 | Run this script in the 'Python' directory:
5 | python azsearch_queryall.py
6 |
7 | See Azure Search REST API docs for more info:
8 | https://docs.microsoft.com/en-us/rest/api/searchservice/index
9 |
10 | """
11 |
12 | import requests
13 | import json
14 | import csv
15 | import os
16 | import pyexcel as pe
17 | import codecs
18 | import pandas as pd
19 |
20 | # This is the service you've already created in Azure Portal
21 | serviceName = 'your_azure_search_service_name'
22 |
23 | # This is the index you've already created in Azure Portal or via the azsearch_mgmt.py script
24 | indexName = 'your_index_name_to_use'
25 |
26 | # Set your service API key, either via an environment variable or enter it below
27 | #apiKey = os.getenv('SEARCH_KEY_DEV', '')
28 | apiKey = 'your_azure_search_service_api_key'
29 | apiVersion = '2016-09-01'
30 |
31 | # Input file containing the list of queries [tab-separated .txt or .tsv, Excel .xls or .xlsx]
32 | infile = os.path.join(os.getcwd(), '../sample/sample_queries.txt')
33 | outfile = os.path.join(os.getcwd(), '../sample/sample_query_answers.xlsx')
34 |
35 | # Retrieval options to alter the query results
36 | SEARCHFIELDS = None # use all searchable fields for retrieval
37 | #SEARCHFIELDS = 'Keywords, SubsectionText' # use selected fields only for retrieval
38 | FUZZY = False # enable fuzzy search (check API for details)
39 | NTOP = 5 # number of results to return
40 |
41 |
42 | def getServiceUrl():
43 | return 'https://' + serviceName + '.search.windows.net'
44 |
45 | def getMethod(servicePath):
46 | headers = {'Content-type': 'application/json', 'api-key': apiKey}
47 | r = requests.get(getServiceUrl() + servicePath, headers=headers)
48 | #print(r, r.text)
49 | return r
50 |
51 | def postMethod(servicePath, body):
52 | headers = {'Content-type': 'application/json', 'api-key': apiKey}
53 | r = requests.post(getServiceUrl() + servicePath, headers=headers, data=body)
54 | #print(r, r.text)
55 | return r
56 |
57 | def submitQuery(query, fields=None, ntop=10, fuzzy=False):
58 | servicePath = '/indexes/' + indexName + '/docs?api-version=%s&search=%s&$top=%d' % \
59 | (apiVersion, query, ntop)
60 | if fields != None:
61 | servicePath += '&searchFields=%s' % fields
62 | if fuzzy:
63 | servicePath += '&queryType=full'
64 |
65 | r = getMethod(servicePath)
66 | if r.status_code != 200:
67 | print('Failed to retrieve search results')
68 | print(query, r, r.text)
69 | return {}
70 | docs = json.loads(r.text)['value']
71 | return docs
72 |
73 |
74 | #############################################################################
75 | # Retrieve Azure Search documents for all queries in batch
76 | # Fields: Index File ChapterTitle SectionTitle SubsectionTitle SubsectionText Keywords
77 | #############################################################################
78 | if __name__ == '__main__':
79 | # Dataframe to keep index of crawled pages
80 | df = pd.DataFrame(columns = ['Qid', 'Query', 'Rank', 'SubsectionText', 'ChapterTitle', 'SectionTitle', 'SubsectionTitle', 'Keywords'])
81 |
82 | if infile.endswith('.tsv') or infile.endswith('.txt'):
83 | records = pd.read_csv(infile, sep='\t', header=0, encoding='utf-8')
84 | rows = records.iterrows()
85 | elif infile.endswith('.xls') or infile.endswith('.xlsx'):
86 | records = pe.iget_records(file_name=infile)
87 | rows = enumerate(records)
88 | else:
89 | print('Unsupported query file extension. Options: tsv, txt, xls, xlsx')
90 | exit(1)
91 |
92 | for i, row in rows:
93 | qid = int(row['Qid'])
94 | query = row['Query']
95 | # Submit query to Azure Search and retrieve results
96 | searchFields = SEARCHFIELDS
97 | docs = submitQuery(query, fields=searchFields, ntop=NTOP, fuzzy=FUZZY)
98 | print('QID: %4d\tNumber of results: %d' % (qid, len(docs)))
99 | for id, doc in enumerate(docs):
100 | chapter_title = doc['ChapterTitle']
101 | section_title = doc['SectionTitle']
102 | subsection_title = doc['SubsectionTitle']
103 | subsection_text = doc['SubsectionText']
104 | keywords = doc['Keywords']
105 |
106 | df = df.append({'Qid' : qid,
107 | 'Query' : query,
108 | 'Rank' : (id + 1),
109 | 'SubsectionText' : subsection_text,
110 | 'ChapterTitle' : chapter_title,
111 | 'SectionTitle' : section_title,
112 | 'SubsectionTitle' : subsection_title,
113 | 'Keywords' : keywords},
114 | ignore_index=True)
115 |
116 | # Save all answers
117 | df['Qid'] = df['Qid'].astype(int)
118 | df['Rank'] = df['Rank'].astype(int)
119 |
120 | if outfile.endswith('.xls') or outfile.endswith('.xlsx'):
121 | df.to_excel(outfile, index=False, encoding='utf-8')
122 | else: # default tab-separated file
123 | df.to_csv(outfile, sep='\t', index=False, encoding='utf-8')
124 |
125 |
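[Editor's note] DataFrame.append, used in the results loop above, was deprecated in pandas 1.4 and removed in 2.0. On current pandas the same logic can collect plain dicts and build the frame once; a minimal sketch:

    rows_out = []
    # ... inside the results loop, instead of df = df.append({...}):
    rows_out.append({'Qid': qid, 'Query': query, 'Rank': id + 1,
                     'SubsectionText': subsection_text, 'ChapterTitle': chapter_title,
                     'SectionTitle': section_title, 'SubsectionTitle': subsection_title,
                     'Keywords': keywords})
    # ... after the loop:
    df = pd.DataFrame(rows_out, columns=['Qid', 'Query', 'Rank', 'SubsectionText',
                                         'ChapterTitle', 'SectionTitle',
                                         'SubsectionTitle', 'Keywords'])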
--------------------------------------------------------------------------------
/Python/keyphrase_extract.py:
--------------------------------------------------------------------------------
1 | ###########################################################################################
2 | # Keyphrase extractor example for experimentation
3 | # Supported algorithms: RAKE, topic rank, single rank, TFIDF and KPMINER
4 | #
5 | # For more info about the RAKE algorithm and implementation, see https://github.com/aneesha/RAKE
6 | # Note: Copies of rake.py and the SmartStoplist.txt stopwords list are included in the
7 | # folder ../JupyterNotebooks. Copy rake.py and the stopwords list files to the current folder.
8 | #
9 | # For more info about the PKE implementations, see https://github.com/boudinfl/pke
10 | # Note: Install PKE from the GitHub repo https://github.com/boudinfl/pke
11 | # Incompatibility alert: PKE only works in Python 2.x at the moment. For Python 3.x, use RAKE.
12 | ###########################################################################################
13 |
14 | # Import base packages
15 | from bs4 import BeautifulSoup
16 | import os, glob, sys, re, string  # string is needed by clean_text(nopunct=True)
17 | from rake import *
18 | import pke
19 |
20 |
21 | # Strip non-ascii characters that break the overlap check
22 | def strip_non_ascii(s):
23 | s = (c for c in s if 0 < ord(c) < 255)
24 | s = ''.join(s)
25 | return s
26 |
27 | # Clean text: remove newlines, compact spaces, strip non_ascii, etc.
28 | def clean_text(text, lowercase=False, nopunct=False):
29 | # Convert to lowercase
30 | if lowercase:
31 | text = text.lower()
32 |
33 | # Remove punctuation
34 | if nopunct:
35 | puncts = string.punctuation
36 | for c in puncts:
37 | text = text.replace(c, ' ')
38 |
39 | # Strip non-ascii characters
40 | text = strip_non_ascii(text)
41 |
42 | # Remove newlines - Compact and strip whitespaces
43 | text = re.sub('[\r\n]+', ' ', text)
44 | text = re.sub('\s+', ' ', text)
45 | return text.strip()
46 |
47 | # Load custom stopwords list
48 | def load_stop_words(stoplist_path):
49 | stop_words = []
50 | for line in open(stoplist_path):
51 | if line.strip()[0:1] != "#":
52 | for word in line.split():
53 | stop_words.append(word)
54 | return stop_words
55 |
56 | # Extract keyphrases using RAKE algorithm. Limit results by minimum score.
57 | def get_keyphrases_rake(infile, stoplist_path=None, min_score=0):
58 | if stoplist_path == None:
59 | stoplist_path = 'SmartStoplist.txt'
60 |
61 | rake = Rake(stoplist_path)
62 | text = open(infile, 'r').read()
63 | keywords = rake.run(text)
64 | phrases = []
65 | for keyword in keywords:
66 | score = keyword[1]
67 | if score >= min_score:
68 | phrases.append(keyword)
69 |
70 | return phrases
71 |
72 | # Extract keyphrases using various algorithms provided by PKE
73 | def get_keyphrases_pke(infile, mode='topic', stoplist_path=None, postags=None, ntop=100):
74 | if stoplist_path == None:
75 | stoplist_path = 'SmartStoplist.txt'
76 | stoplist = load_stop_words(stoplist_path)
77 |
78 | if postags == None:
79 | postags = ['NN', 'NNS', 'NNP', 'NNPS', 'JJ', 'JJR', 'JJS', 'VBN', 'VBD']
80 |
81 | # Run keyphrase extractor - Topic_Rank unsupervised method
82 | if mode == 'topic':
83 | try:
84 | extractor = pke.TopicRank(input_file=infile, language='english')
85 | extractor.read_document(format='raw', stemmer=None)
86 | extractor.candidate_selection(stoplist=stoplist, pos=postags)
87 | extractor.candidate_weighting(threshold=0.25, method='average')
88 | # Candidate ranking happens once for all modes below via get_n_best()
89 | except Exception:
90 | extractor = None
91 |
92 | # Run keyphrase extractor - Single_Rank unsupervised method
93 | elif mode == 'single':
94 | try:
95 | extractor = pke.SingleRank(input_file=infile, language='english')
96 | extractor.read_document(format='raw', stemmer=None)
97 | extractor.candidate_selection(stoplist=stoplist)
98 | extractor.candidate_weighting(normalized=True)
99 | except Exception:
100 | extractor = None
101 |
102 | # Run keyphrase extractor - TfIdf unsupervised method
103 | elif mode == 'tfidf':
104 | try:
105 | extractor = pke.TfIdf(input_file=infile, language='english')
106 | extractor.read_document(format='raw', stemmer=None)
107 | extractor.candidate_selection(stoplist=stoplist)
108 | extractor.candidate_weighting()
109 | except Exception:
110 | extractor = None
111 |
112 | # Run keyphrase extractor - KP_Miner unsupervised method
113 | elif mode == 'kpminer':
114 | try:
115 | extractor = pke.KPMiner(input_file=infile, language='english')
116 | extractor.read_document(format='raw', stemmer=None)
117 | extractor.candidate_selection(stoplist=stoplist)
118 | extractor.candidate_weighting()
119 | except Exception:
120 | extractor = None
121 |
122 | else: # invalid mode
123 | print("Invalid keyphrase extraction algorithm: %s" % mode)
124 | print("Valid PKE algorithms: [topic, single, kpminer, tfidf]")
125 | exit(1)
126 |
127 | phrases = extractor.get_n_best(ntop, redundancy_removal=True) if extractor else []
128 | return phrases
129 |
130 | def usage():
131 | print('Usage %s filename [algo]' % os.path.basename(sys.argv[0]))
132 | print('Algo options: rake, topic, single, tfidf, kpminer')
133 |
134 |
135 | ##############################
136 | # Main processing
137 | ##############################
138 |
139 | if len(sys.argv) < 2:
140 | print('Missing content file name')
141 | usage()
142 | exit(1)
143 |
144 | infile = sys.argv[1]
145 | if len(sys.argv) >= 3:
146 | algo = sys.argv[2]
147 | if algo not in ['rake', 'topic', 'single', 'tfidf', 'kpminer']:
148 | print("Invalid keyphrase extraction algorithm: %s" % algo)
149 | usage()
150 | exit(1)
151 | else:
152 | algo = 'rake'
153 |
154 | # Read custom stopwords list from file - Applies to all algos
155 | # If no stopwords file is supplied, default uses SmartStoplist.txt
156 | stoplist_file = 'SmartStoplist_extended.txt'
157 |
158 | # Select POS tags to use for PKE candidate selection, use default if None
159 | postags = ['NN', 'NNS', 'NNP', 'NNPS', 'JJ', 'JJR', 'JJS', 'VBN', 'VBD']
160 |
161 | # Run keyphrase extraction
162 | if algo == 'rake':
163 | min_score = 1
164 | phrases = get_keyphrases_rake(infile, stoplist_path=stoplist_file, min_score=min_score)
165 | else:
166 | ntop = 200
167 | phrases = get_keyphrases_pke(infile, mode=algo, stoplist_path=stoplist_file, postags=postags, ntop=ntop)
168 |
169 | # Report all keyphrases and their scores
170 | print('Number of extracted keyphrases = %d' % len(phrases))
171 | for phrase in phrases:
172 | print(phrase)
173 |
174 | # Combined list of keyphrases (no scores)
175 | all_phrases = ', '.join(p[0] for p in phrases)
176 | print('\nKeyphrases list: %s' % all_phrases)
177 |
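[Editor's note] Example invocations (the input path is illustrative; the PKE modes additionally require the pke package and, per the header note, Python 2.x):

    python keyphrase_extract.py ../sample/html/1.1.1.1.1.2.html rake
    python keyphrase_extract.py ../sample/html/1.1.1.1.1.2.html topic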
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 |
2 | # Custom Search
3 |
4 | > Sample custom search project using Azure Search and the US Tax Code.
5 |
6 | > Python scripts and Jupyter notebooks that allow you to quickly and iteratively customize,
7 | improve and measure your custom search experience.
8 |
9 |
10 | ## Custom Search Service Development Features in the Python Scripts
11 | * Upload and update search index in Azure Search
12 | * Query interactively to test results
13 | * Query on batch basis to analyze performance
14 | * Extract keyphrases to enhance search index metadata
15 |
16 | ## End-to-End Example Provided in Jupyter Notebooks
17 | * Collect, pre-process, and augment content with keyphrases
18 | * Create an Azure Search index
19 | * Query the index and retrieve results interactively and/or in batch
20 |
21 | ## Getting Started
22 |
23 | 1. Read the [Real Life Code Story](https://www.microsoft.com/reallifecode/), "[Developing a Custom Search Engine for an Expert Chat System.](https://www.microsoft.com/reallifecode/)"
24 | 2. Review the [Azure Search service features](https://azure.microsoft.com/en-us/services/search/).
25 | 3. Get a [free trial subscription to Azure Search.](https://azure.microsoft.com/en-us/free/)
26 | 4. Copy your Azure Search service name and key.
27 | 5. Review the [sample](https://github.com/CatalystCode/CustomSearch/tree/master/sample)
28 | search index input and enriched input in the sample folder to understand the content.
29 | 6. Try the sample Jupyter notebooks for an overview of the end-to-end process for content extraction, augmentation with keyphrases, indexing and retrieval.
30 | * Step 1: Content and keyphrase extraction: [1-content_extraction.ipynb](https://github.com/CatalystCode/CustomSearch/blob/master/JupyterNotebooks/1-content_extraction.ipynb)
31 | * Step 2: Index creation: [2-content_indexing.ipynb](https://github.com/CatalystCode/CustomSearch/blob/master/JupyterNotebooks/2-content_indexing.ipynb)
32 | * Step 3: Interactive and batch search queries: [3-azure_search_query.ipynb](https://github.com/CatalystCode/CustomSearch/blob/master/JupyterNotebooks/3-azure_search_query.ipynb)
33 | 7. A command-line version of the scripts is available under the Python folder; an example session follows this list.
34 | * Run the [azsearch_mgmt.py script](https://github.com/CatalystCode/CustomSearch/blob/master/Python/azsearch_mgmt.py), using your Azure Search name, key and index name of your choice to create a search index.
35 | * Run the [azsearch_query.py script](https://github.com/CatalystCode/CustomSearch/blob/master/Python/azsearch_query.py) to interactively query your new search index and see results.
36 | * Run the [azsearch_queryall.py script](https://github.com/CatalystCode/CustomSearch/blob/master/Python/azsearch_queryall.py) to batch query your new search index and evaluate the results.
37 | * Run the [keyphrase_extract.py script](https://github.com/CatalystCode/CustomSearch/blob/master/Python/keyphrase_extract.py) to experiment with various keyphrase extraction algorithms to enrich the search index metadata. Note this script is Python 2.7 only.
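[Editor's note] A typical command-line session, after setting your service name, API key, and index name inside each script (the package list reflects the imports the scripts use):

    pip install requests pyexcel pyexcel-xls pyexcel-xlsx pandas pytz
    cd Python
    python azsearch_mgmt.py       # create the index and upload the parsed content
    python azsearch_query.py      # query the index interactively
    python azsearch_queryall.py   # batch-run the queries in sample/sample_queries.txt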
38 |
39 |
40 | ## Description
41 | Querying specific content areas quickly and easily is a common enterprise need. Fast traversal of specialized publications, customer support knowledge bases or document repositories allows enterprises to deliver service efficiently and effectively. Simple FAQs don’t cover enough ground, and a string search isn’t effective or efficient for those not familiar with the domain or the document set. Instead, enterprises can deliver a custom search experience that saves their clients time and provides them better service through a question-and-answer format. In this project, we leveraged Azure Search and Cognitive Services, and we share our custom code for iterative testing, measurement and indexer redeployment. In our solution, the customized search engine will form the foundation for delivering a question-and-answer experience in a specific domain area.
42 |
--------------------------------------------------------------------------------
/sample/html/1.1.1.1.1.2.html:
--------------------------------------------------------------------------------
[Text excerpts recovered from the sample US Tax Code HTML pages; markup, blank lines, and interleaved line-number fragments are omitted, and each excerpt keeps its source line number.]
338 | For purposes of this part, an individual shall be treated as not married at the close of the taxable year if such individual is so treated under the provisions of section 7703(b).
353 | In the case of a nonresident alien individual, the taxes imposed by sections 1 and 55 shall apply only as provided by section 871 or 877.
368 | For definition of taxable income, see section 63.
81 | For purposes of paragraph (1), the term “ceiling amount” means, with respect to any taxpayer, the amount (not less than $20,000) determined by the Secretary for the tax rate category in which such taxpayer falls.
96 | The Secretary may provide that this section shall apply also for any taxable year to individuals who itemize their deductions. Any tables prescribed under the preceding sentence shall be on the basis of taxable income.
145 | For purposes of this title, the tax imposed by this section shall be treated as tax imposed by section 1.
160 | Whenever it is necessary to determine the taxable income of an individual to whom this section applies, the taxable income shall be determined under section 63.
175 | For computation of tax by Secretary, see section 6014.
36 | A tax is hereby imposed for each taxable year on the taxable income of every corporation.
116 | Notwithstanding paragraph (1), the amount of the tax imposed by subsection (a) on the taxable income of a qualified personal service corporation (as defined in section 448(d)(2)) shall be equal to 35 percent of the taxable income.
175 | In the case of a foreign corporation, the taxes imposed by subsection (a) and section 55 shall apply only as provided by section 882.
36 | In the case of an eligible individual, there shall be allowed as a credit against the tax imposed by this subtitle for the taxable year an amount equal to the applicable percentage of so much of the qualified retirement savings contributions of the eligible individual for the taxable year as do not exceed $2,000.
198 | The term “eligible individual” means any individual if such individual has attained the age of 18 as of the close of the taxable year.
331 | The qualified retirement savings contributions determined under paragraph (1) shall be reduced (but not below zero) by the aggregate distributions received by the individual during the testing period from any entity of a type to which contributions under paragraph (1) may be made. The preceding sentence shall not apply to the portion of any distribution which is not includible in gross income by reason of a trustee-to-trustee transfer or a rollover distribution.
422 | For purposes of determining distributions received by an individual under subparagraph (A) for any taxable year, any distribution received by the spouse of such individual shall be treated as received by such individual if such individual and spouse file a joint return for such taxable year and for the taxable year during which the spouse receives the distribution.
439 | For purposes of this section, adjusted gross income shall be determined without regard to sections 911, 931, and 933.
454 | Notwithstanding any other provision of law, a qualified retirement savings contribution shall not fail to be included in determining the investment in the contract for purposes of section 72 by reason of the credit under this section.
27 | The basis of an interest in a partnership acquired by a contribution of property, including money, to the partnership shall be the amount of such money and the adjusted basis of such property to the contributing partner at the time of the contribution increased by the amount (if any) of gain recognized under section 721(b) to the contributing partner at such time.