├── JupyterNotebooks
│   ├── 1-content_extraction.ipynb
│   ├── 2-content_indexing.ipynb
│   ├── 3-azure_search_query.ipynb
│   ├── AugmentingSearch_CreatingASynonymMap.ipynb
│   ├── AugmentingSearch_UploadingSynonymMapToAzureSearch.ipynb
│   ├── SmartStoplist.txt
│   ├── SmartStoplist_extended.txt
│   ├── rake.py
│   └── sample_page.png
├── LICENSE
├── Python
│   ├── azsearch_mgmt.py
│   ├── azsearch_query.py
│   ├── azsearch_queryall.py
│   └── keyphrase_extract.py
├── README.md
└── sample
    ├── html
    │   ├── 1.1.1.1.1.1.html
    │   ├── 1.1.1.1.1.2.html
    │   ├── 1.1.1.1.1.3.html
    │   ├── 1.1.1.1.2.1.html
    │   ├── 1.1.1.1.4.1.1.html
    │   ├── 1.1.1.1.4.1.2.html
    │   ├── 1.1.1.1.4.1.3.html
    │   ├── 1.1.1.1.4.1.4.html
    │   ├── 1.1.1.1.4.1.5.html
    │   ├── 1.1.1.1.4.1.6.html
    │   ├── 1.1.1.1.4.1.7.html
    │   ├── 1.1.1.1.4.1.8.html
    │   ├── 1.1.1.11.2.1.2.html
    │   └── styles
    │       ├── css_Q4z0-iME7xTpui0Tzf4MEFv02rRuJ1dHZbo9kP_JLBg.css
    │       ├── css_XgGKW_fNRFCK5BruHWlbChY4U8WE0xT4CWGilKSjSXA.css
    │       ├── css_dolo-SIAwemLdrlTs99Lrug9kFXMYlMG3OlznBv4Kho.css
    │       ├── css_kShW4RPmRstZ3SpIC-ZvVGNFVAi0WEMuCnI0ZkYIaFw.css
    │       ├── css_rJ3pqftttKVzxtjsOG18hAid4RqqjfFMw3d1C89lWd4.css
    │       └── css_tuqeOBz1ozigHOvScJR2wasCmXBizZ9rfd58u6_20EE.css
    ├── parsed_content.xlsx
    ├── parsed_content_sample.xlsx
    ├── raw_text_enriched_with_keywords_sample.xlsx
    ├── sample_page.png
    ├── sample_queries.txt
    ├── sample_queries.xlsx
    ├── sample_query_answers.txt
    └── sample_query_answers.xlsx
/JupyterNotebooks/2-content_indexing.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "## Building a Custom Search Engine\n",
8 | "### Step 2 - Create Azure Search Index\n",
9 | "- Define new index structure\n",
10 | "- Create Azure Search index\n",
11 | "- Upload and index parsed content from step 1\n",
12 | "- Optional: Simple management of Azure Search index\n",
13 | "\n",
14 | "Dependencies: Please install pyexcel, pyexcel-xls and pyexcel-xlsx
\n",
15 | "To install dependencies: pip install pyexcel pyexcel-xls pyexcel-xlsx"
16 | ]
17 | },
18 | {
19 | "cell_type": "code",
20 | "execution_count": 1,
21 | "metadata": {
22 | "collapsed": true
23 | },
24 | "outputs": [],
25 | "source": [
26 | "# Import base packages\n",
27 | "import requests\n",
28 | "import json\n",
29 | "import csv\n",
30 | "import os\n",
31 | "import pyexcel as pe"
32 | ]
33 | },
34 | {
35 | "cell_type": "markdown",
36 | "metadata": {},
37 | "source": [
38 | "First, initialize Azure Search configuration parameters to be used for index creation"
39 | ]
40 | },
41 | {
42 | "cell_type": "code",
43 | "execution_count": 2,
44 | "metadata": {
45 | "collapsed": true
46 | },
47 | "outputs": [],
48 | "source": [
49 | "# This is the service you've already created in Azure Portal\n",
50 | "serviceName = 'your_azure_search_service_name'\n",
51 | "\n",
52 | "# Index to be created\n",
53 | "indexName = 'name_of_index_to_create'\n",
54 | "\n",
55 | "# Set your service API key, either via an environment variable or enter it below\n",
56 | "#apiKey = os.getenv('SEARCH_KEY_DEV', '')\n",
57 | "apiKey = 'your_azure_search_service_api_key'\n",
58 | "apiVersion = '2016-09-01'"
59 | ]
60 | },
61 | {
62 | "cell_type": "markdown",
63 | "metadata": {},
64 | "source": [
65 | "Set the path to the parsed content file from step 1, and define a basic mapping of the input fields to the desired target field names in the new index. Input and output field names do not need to be the same. However, the target names should match the index definition in getIndexDefinition()."
66 | ]
67 | },
68 | {
69 | "cell_type": "code",
70 | "execution_count": null,
71 | "metadata": {
72 | "collapsed": true
73 | },
74 | "outputs": [],
75 | "source": [
76 | "# Input parsed content Excel file, e.g., output of step #1 in\n",
77 | "# https://github.com/CatalystCode/CustomSearch/tree/master/JupyterNotebooks/1-content_extraction.ipynb\n",
78 | "inputfile = os.path.join(os.getcwd(), '../sample/parsed_content.xlsx')\n",
79 | "\n",
80 | "# Define fields mapping from Excel file column names to search index field names (except Index)\n",
81 | "# Change this mapping to match your content fields and rename output fields as desired\n",
82 | "# Search field names should match their definition in getIndexDefinition()\n",
83 | "fields_map = [ ('File' , 'File'),\n",
84 | " ('ChapterTitle' , 'ChapterTitle'),\n",
85 | " ('SectionTitle' , 'SectionTitle'),\n",
86 | " ('SubsectionTitle' , 'SubsectionTitle'),\n",
87 | " ('SubsectionText' , 'SubsectionText'),\n",
88 | " ('Keywords' , 'Keywords') ]"
89 | ]
90 | },
91 | {
92 | "cell_type": "markdown",
93 | "metadata": {},
94 | "source": [
95 | "Now, let's define the structure of the new index to be created. In this example, all titles, content text and keywords fields are full-text searchable. Queries will use all searchable fields by default to retrieve a ranked list of results.\n",
96 | "\n",
97 | "For more details, refer to [Create an Azure Search Index](https://docs.microsoft.com/en-us/azure/search/search-what-is-an-index)."
98 | ]
99 | },
100 | {
101 | "cell_type": "code",
102 | "execution_count": null,
103 | "metadata": {
104 | "collapsed": true
105 | },
106 | "outputs": [],
107 | "source": [
108 | "# Fields: Index\tFile\tChapterTitle\tSectionTitle\tSubsectionTitle\t\tSubsectionText\tKeywords\n",
109 | "def getIndexDefinition():\n",
110 | " return {\n",
111 | " \"name\": indexName, \n",
112 | " \"fields\": [\n",
113 | " {\"name\": \"Index\", \"type\": \"Edm.String\", \"key\": True, \"retrievable\": True, \"searchable\": False, \"filterable\": False, \"sortable\": True, \"facetable\": False},\n",
114 | "\n",
115 | " {\"name\": \"File\", \"type\": \"Edm.String\", \"retrievable\": True, \"searchable\": False, \"filterable\": True, \"sortable\": True, \"facetable\": False},\n",
116 | "\n",
117 | " {\"name\": \"ChapterTitle\", \"type\": \"Edm.String\", \"retrievable\": True, \"searchable\": True, \"filterable\": True, \"sortable\": True, \"facetable\": True},\n",
118 | "\n",
119 | " {\"name\": \"SectionTitle\", \"type\": \"Edm.String\", \"retrievable\": True, \"searchable\": True, \"filterable\": True, \"sortable\": False, \"facetable\": True},\n",
120 | "\n",
121 | " {\"name\": \"SubsectionTitle\", \"type\": \"Edm.String\", \"retrievable\": True, \"searchable\": True, \"filterable\": True, \"sortable\": True, \"facetable\": False},\n",
122 | "\n",
123 | " {\"name\": \"SubsectionText\", \"type\": \"Edm.String\", \"retrievable\": True, \"searchable\": True, \"filterable\": False, \"sortable\": False, \"facetable\": False, \"analyzer\": \"en.microsoft\"},\n",
124 | "\n",
125 | " {\"name\": \"Keywords\", \"type\": \"Edm.String\", \"retrievable\": True, \"searchable\": True, \"filterable\": False, \"sortable\": False, \"facetable\": False, \"analyzer\": \"en.microsoft\"}\n",
126 | " ]\n",
127 | " }"
128 | ]
129 | },
130 | {
131 | "cell_type": "markdown",
132 | "metadata": {},
133 | "source": [
134 | "#### Helper functions for basic REST API operations"
135 | ]
136 | },
137 | {
138 | "cell_type": "code",
139 | "execution_count": null,
140 | "metadata": {
141 | "collapsed": true
142 | },
143 | "outputs": [],
144 | "source": [
145 | "def getServiceUrl():\n",
146 | " return 'https://' + serviceName + '.search.windows.net'\n",
147 | "\n",
148 | "def getMethod(servicePath):\n",
149 | " headers = {'Content-type': 'application/json', 'api-key': apiKey}\n",
150 | " r = requests.get(getServiceUrl() + servicePath, headers=headers)\n",
151 | " #print(r.text)\n",
152 | " return r\n",
153 | "\n",
154 | "def postMethod(servicePath, body):\n",
155 | " headers = {'Content-type': 'application/json', 'api-key': apiKey}\n",
156 | " r = requests.post(getServiceUrl() + servicePath, headers=headers, data=body)\n",
157 | " #print(r, r.text)\n",
158 | " return r"
159 | ]
160 | },
161 | {
162 | "cell_type": "markdown",
163 | "metadata": {},
164 | "source": [
165 | "#### Simple index management functions\n",
166 | "- Create a new index\n",
167 | "- Delete an existing index\n",
168 | "- Check if index exists"
169 | ]
170 | },
171 | {
172 | "cell_type": "code",
173 | "execution_count": null,
174 | "metadata": {
175 | "collapsed": true
176 | },
177 | "outputs": [],
178 | "source": [
179 | "def createIndex():\n",
180 | " indexDefinition = json.dumps(getIndexDefinition()) \n",
181 | " servicePath = '/indexes/?api-version=%s' % apiVersion\n",
182 | " r = postMethod(servicePath, indexDefinition)\n",
183 | " #print r.text\n",
184 | " if r.status_code == 201:\n",
185 | " print('Index %s created' % indexName) \n",
186 | " else:\n",
187 | " print('Failed to create index %s' % indexName)\n",
188 | " exit(1)\n",
189 | "\n",
190 | "def deleteIndex():\n",
191 | " servicePath = '/indexes/%s?api-version=%s&delete' % (indexName, apiVersion)\n",
192 | " headers = {'Content-type': 'application/json', 'api-key': apiKey}\n",
193 | " r = requests.delete(getServiceUrl() + servicePath, headers=headers)\n",
194 | " #print(r.text)\n",
195 | "\n",
196 | "def getIndex():\n",
197 | " servicePath = '/indexes/%s?api-version=%s' % (indexName, apiVersion)\n",
198 | " r = getMethod(servicePath)\n",
199 | " if r.status_code == 200: \n",
200 | " return True\n",
201 | " else:\n",
202 | " return False"
203 | ]
204 | },
205 | {
206 | "cell_type": "markdown",
207 | "metadata": {},
208 | "source": [
209 | "#### Helper functions to fetch one or more documents from the parsed content file\n",
210 | "\n",
211 | "Note: In this exercise, a *document* corresponds to one row from the parsed content Excel file."
212 | ]
213 | },
214 | {
215 | "cell_type": "code",
216 | "execution_count": null,
217 | "metadata": {
218 | "collapsed": true
219 | },
220 | "outputs": [],
221 | "source": [
222 | "def getDocumentObject(): \n",
223 | " valarry = []\n",
224 | " cnt = 1\n",
225 | " records = pe.iget_records(file_name=inputfile)\n",
226 | " for row in records:\n",
227 | " outdict = {}\n",
228 | " outdict['@search.action'] = 'upload'\n",
229 | "\n",
230 | " if (row[fields_map[0][0]]):\n",
231 | " outdict['Index'] = str(row['Index'])\n",
232 | " for (in_fld, out_fld) in fields_map:\n",
233 | " outdict[out_fld] = row[in_fld]\n",
234 | " valarry.append(outdict)\n",
235 | " cnt+=1\n",
236 | "\n",
237 | " return {'value' : valarry}\n",
238 | "\n",
239 | "def getDocumentObjectByChunk(start, end): \n",
240 | " valarry = []\n",
241 | " cnt = 1\n",
242 | " records = pe.iget_records(file_name=inputfile)\n",
243 | " for i, row in enumerate(records):\n",
244 | " if start <= i < end:\n",
245 | " outdict = {}\n",
246 | " outdict['@search.action'] = 'upload'\n",
247 | "\n",
248 | " if (row[fields_map[0][0]]):\n",
249 | " outdict['Index'] = str(row['Index'])\n",
250 | " for (in_fld, out_fld) in fields_map:\n",
251 | " outdict[out_fld] = row[in_fld]\n",
252 | " valarry.append(outdict)\n",
253 | " cnt+=1\n",
254 | "\n",
255 | " return {'value' : valarry}"
256 | ]
257 | },
258 | {
259 | "cell_type": "markdown",
260 | "metadata": {},
261 | "source": [
262 | "#### Main functions to upload and index documents in Azure Search\n",
263 | "\n",
264 | "Three methods are provided:\n",
265 | "- Upload all documents (rows) at once\n",
266 | "- Upload documents in chunks\n",
267 | "- Upload one document at a time\n",
268 | "\n",
269 | "**Note:** The method choice depends on the content size and whether it would fit in one or more REST request. "
270 | ]
271 | },
272 | {
273 | "cell_type": "code",
274 | "execution_count": null,
275 | "metadata": {
276 | "collapsed": true
277 | },
278 | "outputs": [],
279 | "source": [
280 | "# Upload content for indexing in one request if content is not too large\n",
281 | "def uploadDocuments():\n",
282 | " documents = json.dumps(getDocumentObject())\n",
283 | " servicePath = '/indexes/' + indexName + '/docs/index?api-version=' + apiVersion\n",
284 | " r = postMethod(servicePath, documents)\n",
285 | " if r.status_code == 200:\n",
286 | " print('Success: %s' % r) \n",
287 | " else:\n",
288 | " print('Failure: %s' % r.text)\n",
289 | " exit(1)\n",
290 | "\n",
291 | "# Upload content for indexing in chunks if content is too large for one request\n",
292 | "def uploadDocumentsInChunks(chunksize):\n",
293 | " records = pe.iget_records(file_name=inputfile)\n",
294 | " cnt = 0\n",
295 | " for row in records:\n",
296 | " cnt += 1\n",
297 | "\n",
298 | " for chunk in range(int(cnt/chunksize) + 1):\n",
299 | " print('Processing chunk number %d ...' % chunk)\n",
300 | " start = chunk * chunksize\n",
301 | " end = start + chunksize\n",
302 | " documents = json.dumps(getDocumentObjectByChunk(start, end))\n",
303 | " servicePath = '/indexes/' + indexName + '/docs/index?api-version=' + apiVersion\n",
304 | " r = postMethod(servicePath, documents)\n",
305 | " if r.status_code == 200:\n",
306 | " print('Success: %s' % r) \n",
307 | " else:\n",
308 | " print('Failure: %s' % r.text)\n",
309 | " return\n",
310 | "\n",
311 | "# Upload content for indexing one document at a time\n",
312 | "def uploadDocumentsOneByOne():\n",
313 | " records = pe.iget_records(file_name=inputfile)\n",
314 | " valarry = []\n",
315 | " for i, row in enumerate(records):\n",
316 | " outdict = {}\n",
317 | " outdict['@search.action'] = 'upload'\n",
318 | "\n",
319 | " if (row[fields_map[0][0]]):\n",
320 | " outdict['Index'] = str(row['Index'])\n",
321 | " for (in_fld, out_fld) in fields_map:\n",
322 | " outdict[out_fld] = row[in_fld]\n",
323 | " valarry.append(outdict)\n",
324 | "\n",
325 | " documents = json.dumps({'value' : valarry})\n",
326 | " servicePath = '/indexes/' + indexName + '/docs/index?api-version=' + apiVersion\n",
327 | " r = postMethod(servicePath, documents)\n",
328 | " if r.status_code == 200:\n",
329 | " print('%d Success: %s' % (i,r)) \n",
330 | " else:\n",
331 | " print('%d Failure: %s' % (i, r.text))\n",
332 | " exit(1)"
333 | ]
334 | },
335 | {
336 | "cell_type": "markdown",
337 | "metadata": {},
338 | "source": [
339 | "#### Helper functions to check and query an index"
340 | ]
341 | },
342 | {
343 | "cell_type": "code",
344 | "execution_count": null,
345 | "metadata": {
346 | "collapsed": true
347 | },
348 | "outputs": [],
349 | "source": [
350 | "def printDocumentCount():\n",
351 | " servicePath = '/indexes/' + indexName + '/docs/$count?api-version=' + apiVersion \n",
352 | " getMethod(servicePath)\n",
353 | "\n",
354 | "def sampleQuery(query, ntop=3):\n",
355 | " servicePath = '/indexes/' + indexName + '/docs?api-version=%s&search=%s&$top=%d' % \\\n",
356 | " (apiVersion, query, ntop)\n",
357 | " getMethod(servicePath)"
358 | ]
359 | },
360 | {
361 | "cell_type": "markdown",
362 | "metadata": {},
363 | "source": [
364 | "### Create index and upload all parsed content\n",
365 | "\n",
366 | "Now let's create the index, or delete and re-create the index if it exists, then upload all parsed documents in chunks. The small sample can be uploaded all at once, but the full tax code content would require multiple requests."
367 | ]
368 | },
369 | {
370 | "cell_type": "code",
371 | "execution_count": null,
372 | "metadata": {
373 | "collapsed": true
374 | },
375 | "outputs": [],
376 | "source": [
377 | "# Choose upload method to be used. Options: 'all', chunks' or 'one'\n",
378 | "upload_method = 'chunks'\n",
379 | "upload_chunk_size = 50\n",
380 | "\n",
381 | "# Python 2.x/3.x incompatibility of input() and raw_input()\n",
382 | "# Bind input() to raw_input() in Python 2.x, leave as-is in Python 3.x\n",
383 | "try:\n",
384 | " input = raw_input\n",
385 | "except NameError:\n",
386 | " pass\n",
387 | "\n",
388 | "# Create index if it does not exist\n",
389 | "if not getIndex():\n",
390 | " createIndex() \n",
391 | "else:\n",
392 | " ans = input('Index %s already exists ... Do you want to delete it? [Y/n]' % indexName)\n",
393 | " if ans.lower() == 'y':\n",
394 | " deleteIndex()\n",
395 | " print('Re-creating index %s ...' % indexName)\n",
396 | " createIndex()\n",
397 | " else:\n",
398 | " print('Index %s is not deleted ... New content will be added to existing index' % indexName)\n",
399 | "\n",
400 | "if upload_method == 'all':\n",
401 | " uploadDocuments()\n",
402 | "elif upload_method == 'chunks':\n",
403 | " uploadDocumentsInChunks(upload_chunk_size)\n",
404 | "else:\n",
405 | " uploadDocumentsOneByOne()\n",
406 | " \n",
407 | "# Verify and test the newly created index\n",
408 | "printDocumentCount()\n",
409 | "sampleQuery('child tax credit')"
410 | ]
411 | },
412 | {
413 | "cell_type": "markdown",
414 | "metadata": {},
415 | "source": [
416 | "#### The content is now ready for interactive or batch queries, as demonstrated in step #3."
417 | ]
418 | },
419 | {
420 | "cell_type": "code",
421 | "execution_count": null,
422 | "metadata": {
423 | "collapsed": true
424 | },
425 | "outputs": [],
426 | "source": []
427 | }
428 | ],
429 | "metadata": {
430 | "kernelspec": {
431 | "display_name": "Python 3",
432 | "language": "python",
433 | "name": "python3"
434 | },
435 | "language_info": {
436 | "codemirror_mode": {
437 | "name": "ipython",
438 | "version": 3
439 | },
440 | "file_extension": ".py",
441 | "mimetype": "text/x-python",
442 | "name": "python",
443 | "nbconvert_exporter": "python",
444 | "pygments_lexer": "ipython3",
445 | "version": "3.6.1"
446 | }
447 | },
448 | "nbformat": 4,
449 | "nbformat_minor": 2
450 | }
451 |
--------------------------------------------------------------------------------
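A quick aside before the next notebook: the operations above boil down to two REST calls against the Azure Search service, a POST to /indexes to create the index and a POST to /indexes/{name}/docs/index to upload documents. Below is a minimal, self-contained sketch of both calls with the field set trimmed to two fields for brevity; the service name, API key, and index name are placeholders you must replace.

    import json
    import requests

    serviceName = 'your_azure_search_service_name'  # placeholder
    indexName = 'myindex'                           # placeholder
    apiKey = 'your_azure_search_service_api_key'    # placeholder
    apiVersion = '2016-09-01'

    headers = {'Content-type': 'application/json', 'api-key': apiKey}
    baseUrl = 'https://' + serviceName + '.search.windows.net'

    # Create an index with a key field and one searchable text field
    index = {'name': indexName,
             'fields': [{'name': 'Index', 'type': 'Edm.String', 'key': True},
                        {'name': 'SubsectionText', 'type': 'Edm.String', 'searchable': True}]}
    r = requests.post(baseUrl + '/indexes/?api-version=' + apiVersion,
                      headers=headers, data=json.dumps(index))
    print(r.status_code)  # expect 201 (Created)

    # Upload one document; '@search.action': 'upload' inserts or replaces by key
    docs = {'value': [{'@search.action': 'upload', 'Index': '1',
                       'SubsectionText': 'Example content to index.'}]}
    r = requests.post(baseUrl + '/indexes/' + indexName + '/docs/index?api-version=' + apiVersion,
                      headers=headers, data=json.dumps(docs))
    print(r.status_code)  # expect 200 (OK)
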
/JupyterNotebooks/3-azure_search_query.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "## Building a Custom Search Engine\n",
8 | "### Step 3 - Query the Index and Retrieve Answers\n",
9 | "- Submit a single search query\n",
10 | "- Submit multiple queries in batch\n",
11 | "\n",
12 | "**Note:** A command-line script version is included under the Python folder of this project.\n",
13 | "- For interactive queries: azsearch_query.py\n",
14 | "- For batch queries in a file: azsearch_queryall.py"
15 | ]
16 | },
17 | {
18 | "cell_type": "code",
19 | "execution_count": 6,
20 | "metadata": {
21 | "collapsed": true
22 | },
23 | "outputs": [],
24 | "source": [
25 | "import requests\n",
26 | "import json\n",
27 | "import os\n",
28 | "import csv\n",
29 | "import pyexcel as pe\n",
30 | "import codecs\n",
31 | "import pandas as pd"
32 | ]
33 | },
34 | {
35 | "cell_type": "markdown",
36 | "metadata": {},
37 | "source": [
38 | "Initialize Azure Search configuration parameters to point to the content index to be used."
39 | ]
40 | },
41 | {
42 | "cell_type": "code",
43 | "execution_count": 7,
44 | "metadata": {
45 | "collapsed": true
46 | },
47 | "outputs": [],
48 | "source": [
49 | "# This is the service you've already created in Azure Portal\n",
50 | "serviceName = 'your_azure_search_service_name'\n",
51 | "\n",
52 | "# This is the index you've already created in Azure Portal or via the azsearch_mgmt.py script\n",
53 | "indexName = 'your_index_name_to_use'\n",
54 | "\n",
55 | "# Set your service API key, either via an environment variable or enter it below\n",
56 | "#apiKey = os.getenv('SEARCH_KEY_DEV', '')\n",
57 | "apiKey = 'your_azure_search_service_api_key'\n",
58 | "apiVersion = '2016-09-01'"
59 | ]
60 | },
61 | {
62 | "cell_type": "markdown",
63 | "metadata": {},
64 | "source": [
65 | "Optional configuration parameters to alter the search query request."
66 | ]
67 | },
68 | {
69 | "cell_type": "code",
70 | "execution_count": 9,
71 | "metadata": {
72 | "collapsed": true
73 | },
74 | "outputs": [],
75 | "source": [
76 | "# Retrieval options to alter the query results\n",
77 | "SEARCHFIELDS = None # use all searchable fields for retrieval\n",
78 | "#SEARCHFIELDS = 'Keywords, SubsectionText' # use selected fields only for retrieval\n",
79 | "FUZZY = False # enable fuzzy search (check API for details)\n",
80 | "NTOP = 5 # uumber of results to return"
81 | ]
82 | },
83 | {
84 | "cell_type": "markdown",
85 | "metadata": {},
86 | "source": [
87 | "#### Helper functions for basic REST API operations"
88 | ]
89 | },
90 | {
91 | "cell_type": "code",
92 | "execution_count": 10,
93 | "metadata": {
94 | "collapsed": true
95 | },
96 | "outputs": [],
97 | "source": [
98 | "def getServiceUrl():\n",
99 | " return 'https://' + serviceName + '.search.windows.net'\n",
100 | "\n",
101 | "def getMethod(servicePath):\n",
102 | " headers = {'Content-type': 'application/json', 'api-key': apiKey}\n",
103 | " r = requests.get(getServiceUrl() + servicePath, headers=headers)\n",
104 | " #print(r, r.text)\n",
105 | " return r\n",
106 | "\n",
107 | "def postMethod(servicePath, body):\n",
108 | " headers = {'Content-type': 'application/json', 'api-key': apiKey}\n",
109 | " r = requests.post(getServiceUrl() + servicePath, headers=headers, data=body)\n",
110 | " #print(r, r.text)\n",
111 | " return r"
112 | ]
113 | },
114 | {
115 | "cell_type": "markdown",
116 | "metadata": {},
117 | "source": [
118 | "#### Helper functions to submit a search query interactively or in batch"
119 | ]
120 | },
121 | {
122 | "cell_type": "code",
123 | "execution_count": 17,
124 | "metadata": {
125 | "collapsed": true
126 | },
127 | "outputs": [],
128 | "source": [
129 | "def submitQuery(query, fields=None, ntop=10, fuzzy=False):\n",
130 | " servicePath = '/indexes/' + indexName + '/docs?api-version=%s&search=%s&$top=%d' % \\\n",
131 | " (apiVersion, query, ntop)\n",
132 | " if fields != None:\n",
133 | " servicePath += '&searchFields=%s' % fields\n",
134 | " if fuzzy:\n",
135 | " servicePath += '&queryType=full'\n",
136 | " \n",
137 | " # Submit GET request\n",
138 | " r = getMethod(servicePath)\n",
139 | " if r.status_code != 200:\n",
140 | " print('Failed to retrieve search results')\n",
141 | " print(r, r.text)\n",
142 | " return\n",
143 | " \n",
144 | " # Parse and report search results\n",
145 | " docs = json.loads(r.text)['value']\n",
146 | " print('Number of search results = %d\\n' % len(docs))\n",
147 | " for i, doc in enumerate(docs):\n",
148 | " print('Results# %d' % (i+1))\n",
149 | " print('Chapter title : %s' % doc['ChapterTitle'].encode('utf8'))\n",
150 | " print('Section title : %s' % doc['SectionTitle'].encode('utf8'))\n",
151 | " print('Subsection title: %s' % doc['SubsectionTitle'].encode('utf8'))\n",
152 | " print('%s\\n' % doc['SubsectionText'].encode('utf8'))\n",
153 | " \n",
154 | "def submitBatchQuery(query, fields=None, ntop=10, fuzzy=False):\n",
155 | " servicePath = '/indexes/' + indexName + '/docs?api-version=%s&search=%s&$top=%d' % \\\n",
156 | " (apiVersion, query, ntop)\n",
157 | " if fields != None:\n",
158 | " servicePath += '&searchFields=%s' % fields\n",
159 | " if fuzzy:\n",
160 | " servicePath += '&queryType=full'\n",
161 | "\n",
162 | " # Submit GET request\n",
163 | " r = getMethod(servicePath)\n",
164 | " if r.status_code != 200:\n",
165 | " print('Failed to retrieve search results')\n",
166 | " print(query, r, r.text)\n",
167 | " return {}\n",
168 | "\n",
169 | " # Return search results\n",
170 | " docs = json.loads(r.text)['value']\n",
171 | " return docs"
172 | ]
173 | },
174 | {
175 | "cell_type": "markdown",
176 | "metadata": {},
177 | "source": [
178 | "Let's submit a query/question and retrieve the answers."
179 | ]
180 | },
181 | {
182 | "cell_type": "code",
183 | "execution_count": 12,
184 | "metadata": {
185 | "scrolled": true
186 | },
187 | "outputs": [
188 | {
189 | "name": "stdout",
190 | "output_type": "stream",
191 | "text": [
192 | "Number of search results = 5\n",
193 | "\n",
194 | "Results# 1\n",
195 | "Chapter title : b'Income Taxes - NORMAL TAXES AND SURTAXES'\n",
196 | "Section title : b'Determination of Tax Liability - TAX ON INDIVIDUALS'\n",
197 | "Subsection title: b'Tax imposed - Married individuals filing separate returns'\n",
198 | "b'(d) Married individuals filing separate returns There is hereby imposed on the taxable income of every married individual (as defined in section 7703) who does not make a single return jointly with his spouse under section 6013, a tax determined in accordance with the following table: If taxable income is: The tax is: Not over $18,450 15% of taxable income. Over $18,450 but not over $44,575 $2,767.50, plus 28% of the excess over $18,450. Over $44,575 but not over $70,000 $10,082.50, plus 31% of the excess over $44,575. Over $70,000 but not over $125,000 $17,964.25, plus 36% of the excess over $70,000. Over $125,000 $37,764.25, plus 39.6% of the excess over $125,000.'\n",
199 | "\n",
200 | "Results# 2\n",
201 | "Chapter title : b'Income Taxes - NORMAL TAXES AND SURTAXES'\n",
202 | "Section title : b'Determination of Tax Liability - CREDITS AGAINST TAX - Nonrefundable Personal Credits'\n",
203 | "Subsection title: b'Credit for the elderly and the permanently and totally disabled - Definitions and special rules'\n",
204 | "b'(e) Definitions and special rules For purposes of this section (1) Married couple must file joint return Except in the case of a husband and wife who live apart at all times during the taxable year, if the taxpayer is married at the close of the taxable year, the credit provided by this section shall be allowed only if the taxpayer and his spouse file a joint return for the taxable year. (2) Marital status Marital status shall be determined under section 7703. (3) Permanent and total disability defined An individual is permanently and totally disabled if he is unable to engage in any substantial gainful activity by reason of any medically determinable physical or mental impairment which can be expected to result in death or which has lasted or can be expected to last for a continuous period of not less than 12 months. An individual shall not be considered to be permanently and totally disabled unless he furnishes proof of the existence thereof in such form and manner, and at such times, as the Secretary may require.'\n",
205 | "\n",
206 | "Results# 3\n",
207 | "Chapter title : b'Income Taxes - NORMAL TAXES AND SURTAXES'\n",
208 | "Section title : b'Determination of Tax Liability - TAX ON INDIVIDUALS'\n",
209 | "Subsection title: b'Tax imposed - Married individuals filing joint returns and surviving spouses'\n",
210 | "b'(a) Married individuals filing joint returns and surviving spouses There is hereby imposed on the taxable income of (1) every married individual (as defined in section 7703) who makes a single return jointly with his spouse under section 6013, and (2) every surviving spouse (as defined in section 2(a)), a tax determined in accordance with the following table: If taxable income is: The tax is: Not over $36,900 15% of taxable income. Over $36,900 but not over $89,150 $5,535, plus 28% of the excess over $36,900. Over $89,150 but not over $140,000 $20,165, plus 31% of the excess over $89,150. Over $140,000 but not over $250,000 $35,928.50, plus 36% of the excess over $140,000. Over $250,000 $75,528.50, plus 39.6% of the excess over $250,000.'\n",
211 | "\n",
212 | "Results# 4\n",
213 | "Chapter title : b'Income Taxes - NORMAL TAXES AND SURTAXES'\n",
214 | "Section title : b'Determination of Tax Liability - TAX ON INDIVIDUALS'\n",
215 | "Subsection title: b'Tax imposed - Phaseout of marriage penalty in 15-percent bracket; adjustments in tax tables so that inflation will not result in tax increases'\n",
216 | "b'(f) Phaseout of marriage penalty in 15-percent bracket; adjustments in tax tables so that inflation will not result in tax increases (1) In general Not later than December 15 of 1993, and each subsequent calendar year, the Secretary shall prescribe tables which shall apply in lieu of the tables contained in subsections (a), (b), (c), (d), and (e) with respect to taxable years beginning in the succeeding calendar year. (2) Method of prescribing tables The table which under paragraph (1) is to apply in lieu of the table contained in subsection (a), (b), (c), (d), or (e), as the case may be, with respect to taxable years beginning in any calendar year shall be prescribed (A) except as provided in paragraph (8), by increasing the minimum and maximum dollar amounts for each rate bracket for which a tax is imposed under such table by the cost-of-living adjustment for such calendar year, (B) by not changing the rate applicable to any rate bracket as adjusted under subparagraph (A), and (C) by adjusting the amounts setting forth the tax to the extent necessary to reflect the adjustments in the rate brackets. (3) Cost-of-living adjustment For purposes of paragraph (2), the cost-of-living adjustment for any calendar year is the percentage (if any) by which (A) the CPI for the preceding calendar year, exceeds (B) the CPI for the calendar year 1992. (4) CPI for any calendar year For purposes of paragraph (3), the CPI for any calendar year is the average of the Consumer Price Index as of the close of the 12-month period ending on August 31 of such calendar year. (5) Consumer Price Index For purposes of paragraph (4), the term Consumer Price Index means the last Consumer Price Index for all-urban consumers published by the Department of Labor. For purposes of the preceding sentence, the revision of the Consumer Price Index which is most consistent with the Consumer Price Index for calendar year 1986 shall be used. (6) Rounding (A) In general If any increase determined under paragraph (2)(A), section 63(c)(4), section 68(b)(2) or section 151(d)(4) is not a multiple of $50, such increase shall be rounded to the next lowest multiple of $50. (B) Table for married individuals filing separately In the case of a married individual filing a separate return, subparagraph (A) (other than with respect to sections 63(c)(4) and 151(d)(4)(A)) shall be applied by substituting $25 for $50 each place it appears. (7) Special rule for certain brackets In prescribing tables under paragraph (1) which apply to taxable years beginning in a calendar year after 1994, the cost-of-living adjustment used in making adjustments to the dollar amounts at which the 36 percent rate bracket begins or at which the 39.6 percent rate bracket begins shall be determined under paragraph (3) by substituting 1993 for 1992. (8) Elimination of marriage penalty in 15-percent bracket With respect to taxable years beginning after December 31, 2003 , in prescribing the tables under paragraph (1) (A) the maximum taxable income in the 15-percent rate bracket in the table contained in subsection (a) (and the minimum taxable income in the next higher taxable income bracket in such table) shall be 200 percent of the maximum taxable income in the 15-percent rate bracket in the table contained in subsection (c) (after any other adjustment under this subsection), and (B) the comparable taxable income amounts in the table contained in subsection (d) shall be \\xc2\\xbd of the amounts determined under subparagraph (A).'\n",
217 | "\n",
218 | "Results# 5\n",
219 | "Chapter title : b'Income Taxes - NORMAL TAXES AND SURTAXES'\n",
220 | "Section title : b'Determination of Tax Liability - CREDITS AGAINST TAX - Nonrefundable Personal Credits'\n",
221 | "Subsection title: b'Adoption expenses - Filing requirements'\n",
222 | "b'(f) Filing requirements (1) Married couples must file joint returns Rules similar to the rules of paragraphs (2), (3), and (4) of section 21(e) shall apply for purposes of this section. (2) Taxpayer must include TIN (A) In general No credit shall be allowed under this section with respect to any eligible child unless the taxpayer includes (if known) the name, age, and TIN of such child on the return of tax for the taxable year. (B) Other methods The Secretary may, in lieu of the information referred to in subparagraph (A), require other information meeting the purposes of subparagraph (A), including identification of an agent assisting with the adoption.'\n",
223 | "\n"
224 | ]
225 | }
226 | ],
227 | "source": [
228 | "query = 'what is the tax bracket for married couple filing separately'\n",
229 | "if query != '':\n",
230 | " # Submit query to Azure Search and retrieve results\n",
231 | " searchFields = SEARCHFIELDS\n",
232 | " submitQuery(query, fields=searchFields, ntop=NTOP)"
233 | ]
234 | },
235 | {
236 | "cell_type": "markdown",
237 | "metadata": {},
238 | "source": [
239 | "Now let's submit a set of queries in batch and retrieve all ranked lists of results. This mode would be useful for performance evaluation given a set of queries and ground truth answers."
240 | ]
241 | },
242 | {
243 | "cell_type": "code",
244 | "execution_count": 27,
245 | "metadata": {
246 | "collapsed": true
247 | },
248 | "outputs": [],
249 | "source": [
250 | "# Input file coontaining the list of queries [tab-separated .txt or .tsv, Excel .xls or .xlsx]\n",
251 | "infile = os.path.join(os.getcwd(), '../sample/sample_queries.txt')\n",
252 | "outfile = os.path.join(os.getcwd(), '../sample/sample_query_answers.xlsx')\n",
253 | "\n",
254 | "if infile.endswith('.tsv') or infile.endswith('.txt'):\n",
255 | " records = pd.read_csv(infile, sep='\\t', header=0, encoding='utf-8')\n",
256 | " rows = records.iterrows()\n",
257 | "elif infile.endswith('.xls') or infile.endswith('.xlsx'):\n",
258 | " records = pe.iget_records(file_name=infile)\n",
259 | " rows = enumerate(records)\n",
260 | "else:\n",
261 | " print('Unsupported query file extension. Options: tsv, txt, xls, xlsx')"
262 | ]
263 | },
264 | {
265 | "cell_type": "code",
266 | "execution_count": 28,
267 | "metadata": {},
268 | "outputs": [
269 | {
270 | "name": "stdout",
271 | "output_type": "stream",
272 | "text": [
273 | "QID: 1\tNumber of results: 5\n",
274 | "QID: 2\tNumber of results: 5\n",
275 | "QID: 3\tNumber of results: 5\n",
276 | "QID: 4\tNumber of results: 5\n",
277 | "QID: 5\tNumber of results: 5\n",
278 | "QID: 6\tNumber of results: 5\n",
279 | "QID: 7\tNumber of results: 5\n",
280 | "Search results saved in file sample_query_answers.xlsx\n"
281 | ]
282 | }
283 | ],
284 | "source": [
285 | "# Dataframe to keep index of crawled pages\n",
286 | "df = pd.DataFrame(columns = ['Qid', 'Query', 'Rank', 'SubsectionText', 'ChapterTitle', 'SectionTitle', 'SubsectionTitle', 'Keywords'])\n",
287 | " \n",
288 | "for i, row in rows:\n",
289 | " qid = int(row['Qid'])\n",
290 | " query = row['Query']\n",
291 | " # Submit query to Azure Search and retrieve results\n",
292 | " searchFields = SEARCHFIELDS\n",
293 | " docs = submitBatchQuery(query, fields=searchFields, ntop=NTOP, fuzzy=FUZZY)\n",
294 | " print('QID: %4d\\tNumber of results: %d' % (qid, len(docs)))\n",
295 | " for id, doc in enumerate(docs):\n",
296 | " chapter_title = doc['ChapterTitle']\n",
297 | " section_title = doc['SectionTitle']\n",
298 | " subsection_title = doc['SubsectionTitle']\n",
299 | " subsection_text = doc['SubsectionText']\n",
300 | " keywords = doc['Keywords']\n",
301 | "\n",
302 | " df = df.append({'Qid' : qid, \n",
303 | " 'Query' : query, \n",
304 | " 'Rank' : (id + 1), \n",
305 | " 'SubsectionText' : subsection_text,\n",
306 | " 'ChapterTitle' : chapter_title,\n",
307 | " 'SectionTitle' : section_title,\n",
308 | " 'SubsectionTitle' : subsection_title,\n",
309 | " 'Keywords' : keywords},\n",
310 | " ignore_index=True)\n",
311 | "\n",
312 | "# Save all answers\n",
313 | "df['Qid'] = df['Qid'].astype(int)\n",
314 | "df['Rank'] = df['Rank'].astype(int)\n",
315 | "\n",
316 | "if outfile.endswith('.xls') or outfile.endswith('.xlsx'):\n",
317 | " df.to_excel(outfile, index=False, encoding='utf-8') \n",
318 | "else: # default tab-separated file\n",
319 | " df.to_csv(outfile, sep='\\t', index=False, encoding='utf-8') \n",
320 | "print('Search results saved in file %s' % os.path.basename(outfile))"
321 | ]
322 | },
323 | {
324 | "cell_type": "code",
325 | "execution_count": null,
326 | "metadata": {
327 | "collapsed": true
328 | },
329 | "outputs": [],
330 | "source": []
331 | }
332 | ],
333 | "metadata": {
334 | "kernelspec": {
335 | "display_name": "Python 3",
336 | "language": "python",
337 | "name": "python3"
338 | },
339 | "language_info": {
340 | "codemirror_mode": {
341 | "name": "ipython",
342 | "version": 3
343 | },
344 | "file_extension": ".py",
345 | "mimetype": "text/x-python",
346 | "name": "python",
347 | "nbconvert_exporter": "python",
348 | "pygments_lexer": "ipython3",
349 | "version": "3.6.1"
350 | }
351 | },
352 | "nbformat": 4,
353 | "nbformat_minor": 2
354 | }
355 |
--------------------------------------------------------------------------------
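For reference, each query issued by the notebook above is a single GET against the docs endpoint. A minimal sketch, assuming placeholder credentials and the field names defined in step 2:

    import requests

    serviceName = 'your_azure_search_service_name'  # placeholder
    indexName = 'your_index_name_to_use'            # placeholder
    apiKey = 'your_azure_search_service_api_key'    # placeholder
    apiVersion = '2016-09-01'

    # Restrict retrieval to selected fields and return the top 5 results
    url = ('https://%s.search.windows.net/indexes/%s/docs'
           '?api-version=%s&search=%s&$top=%d&searchFields=%s'
           % (serviceName, indexName, apiVersion,
              'child tax credit', 5, 'Keywords,SubsectionText'))
    r = requests.get(url, headers={'api-key': apiKey})
    for rank, doc in enumerate(r.json()['value'], start=1):
        print(rank, doc['SubsectionTitle'])
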
/JupyterNotebooks/AugmentingSearch_CreatingASynonymMap.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {
6 | "deletable": true,
7 | "editable": true
8 | },
9 | "source": [
10 | "# This notebook shows how you can get synonyms of key words/phrases by web-crawling Thesaurus.com and/or adding them manually. This can be used to augment a downstream search operation. #"
11 | ]
12 | },
13 | {
14 | "cell_type": "code",
15 | "execution_count": 14,
16 | "metadata": {
17 | "collapsed": true,
18 | "deletable": true,
19 | "editable": true
20 | },
21 | "outputs": [],
22 | "source": [
23 | "import pandas as pd\n",
24 | "import os\n",
25 | "import requests\n",
26 | "from bs4 import BeautifulSoup\n",
27 | "from collections import Counter"
28 | ]
29 | },
30 | {
31 | "cell_type": "markdown",
32 | "metadata": {
33 | "deletable": true,
34 | "editable": true
35 | },
36 | "source": [
37 | "# Part 1: Get synonyms of keywords from thesaurus.com #\n",
38 | "** Define a function to call and crawl thesearus.com. You pass in a word (which could be a phrase) and get back up to top N synonyms if they exist. You can also filter results by Part of Speech if you wish. \n",
39 | "Note: the advantage of crawling vs calling an API is that you can make unlimited free requests for getting synonyms.**"
40 | ]
41 | },
42 | {
43 | "cell_type": "code",
44 | "execution_count": 15,
45 | "metadata": {
46 | "collapsed": false,
47 | "deletable": true,
48 | "editable": true
49 | },
50 | "outputs": [
51 | {
52 | "data": {
53 | "text/plain": [
54 | "['welcome', 'howdy', 'hi', 'greetings', 'bonjour']"
55 | ]
56 | },
57 | "execution_count": 15,
58 | "metadata": {},
59 | "output_type": "execute_result"
60 | }
61 | ],
62 | "source": [
63 | "def get_web_syns(word, pos=None, n = 5):\n",
64 | " if pos == None:\n",
65 | " req = requests.get('http://www.thesaurus.com/browse/%s' % word)\n",
66 | " else:\n",
67 | " req = requests.get('http://www.thesaurus.com/browse/%s/%s' % (word, pos))\n",
68 | "\n",
69 | " soup = BeautifulSoup(req.text, 'html.parser')\n",
70 | " \n",
71 | " all_syns = soup.find('div', {'class' : 'relevancy-list'})\n",
72 | " syns = []\n",
73 | " if all_syns == None:\n",
74 | " return syns\n",
75 | " for ul in all_syns.findAll('ul'):\n",
76 | " for li in ul.findAll('span', {'class':'text'}):\n",
77 | " syns.append(li.text.split(\",\")[0])\n",
78 | " return syns[:n]\n",
79 | "\n",
80 | "# Example\n",
81 | "get_web_syns('hello')"
82 | ]
83 | },
84 | {
85 | "cell_type": "markdown",
86 | "metadata": {
87 | "deletable": true,
88 | "editable": true
89 | },
90 | "source": [
91 | "**Read in a sample input file, e.g. excel format. Show the raw raw text and keywords extracted columns: **"
92 | ]
93 | },
94 | {
95 | "cell_type": "code",
96 | "execution_count": 16,
97 | "metadata": {
98 | "collapsed": false,
99 | "deletable": true,
100 | "editable": true
101 | },
102 | "outputs": [
103 | {
104 | "name": "stdout",
105 | "output_type": "stream",
106 | "text": [
107 | " ParaText \\\n",
108 | "0 Your salary, interest you earn, dividends rece... \n",
109 | "1 You must include on your return all items of i... \n",
110 | "\n",
111 | " Keywords \n",
112 | "0 gross income, excluded, taxable income, passiv... \n",
113 | "1 income, tax law, taxable, nontaxable, items, d... \n"
114 | ]
115 | }
116 | ],
117 | "source": [
118 | "INPUT_FILE = \"raw_text_enriched_with_keywords_sample.xlsx\"\n",
119 | "df = pd.read_excel(INPUT_FILE)\n",
120 | "print(df[['ParaText','Keywords']])"
121 | ]
122 | },
123 | {
124 | "cell_type": "markdown",
125 | "metadata": {
126 | "deletable": true,
127 | "editable": true
128 | },
129 | "source": [
130 | "** We are going to extract all the keywords/phrases in the Keywords column, count frequency, and keep only keywords above a pre-defined threshold. Then, get the synonyms (if they exist) of each keyword, and save the resulting map to file: **"
131 | ]
132 | },
133 | {
134 | "cell_type": "code",
135 | "execution_count": 17,
136 | "metadata": {
137 | "collapsed": false,
138 | "deletable": true,
139 | "editable": true
140 | },
141 | "outputs": [
142 | {
143 | "name": "stdout",
144 | "output_type": "stream",
145 | "text": [
146 | "Number of keywords-synonym pairs before cleaning: 66\n",
147 | "Number of keywords-synonym pairs after cleaning: 56\n",
148 | "{' travel agency': ['holiday company', 'travel bureau'], ' friend': ['colleague', 'acquaintance', 'buddy', 'associate', 'companion'], ' discussions': ['conference', 'dialogue', 'deliberation', 'exchange', 'review'], ' many kinds': ['womankinds'], ' sale': ['purchase', 'transaction', 'deal', 'business', 'auction']}\n"
149 | ]
150 | }
151 | ],
152 | "source": [
153 | "MIN_KEYWORD_COUNT = 1\n",
154 | "keywords_list = df[\"Keywords\"].tolist()\n",
155 | "\n",
156 | "flattened_keywords_list = []\n",
157 | "for sublist in keywords_list:\n",
158 | " for val in sublist.split(\",\"):\n",
159 | " flattened_keywords_list.append(val)\n",
160 | " \n",
161 | "keywords_count = Counter(flattened_keywords_list)\n",
162 | "keywords_filtered = Counter(el for el in keywords_count.elements() if keywords_count[el] >=MIN_KEYWORD_COUNT)\n",
163 | "\n",
164 | "keyword_synonym = {keyword:get_web_syns(keyword) for keyword in keywords_filtered}\n",
165 | "#print(keyword_synonym)\n",
166 | "print(\"Number of keywords-synonym pairs before cleaning:\",len(keyword_synonym))\n",
167 | "\n",
168 | "# a helper function to identify and filter out keywords containing a digit - normally, you cannot find synonyms \n",
169 | "#for such words in thesaurus\n",
170 | "def hasNumbers(inputString):\n",
171 | " return any(char.isdigit() for char in inputString)\n",
172 | "\n",
173 | "keyword_synonym_clean = {}\n",
174 | "for k,v in keyword_synonym.items():\n",
175 | " if v!=[] and not hasNumbers(k):\n",
176 | " keyword_synonym_clean[k]=v\n",
177 | " \n",
178 | "print(\"Number of keywords-synonym pairs after cleaning:\",len(keyword_synonym_clean))\n",
179 | "# peek at a few keyword-synonyms pairs\n",
180 | "print(dict(list(keyword_synonym_clean.items())[0:5]))"
181 | ]
182 | },
183 | {
184 | "cell_type": "markdown",
185 | "metadata": {
186 | "deletable": true,
187 | "editable": true
188 | },
189 | "source": [
190 | "# Part 2: Manually adding synonym entries, typically for domain specific definitions #\n",
191 | "** Any synonym service would most like not be able to retrieve domain specific synonyms to acronym words. If you have such a domain specific acronym map, you can add it manually to your synonym map. **"
192 | ]
193 | },
194 | {
195 | "cell_type": "code",
196 | "execution_count": 18,
197 | "metadata": {
198 | "collapsed": true
199 | },
200 | "outputs": [],
201 | "source": [
202 | "# domain specific acronyms in the taxcode world\n",
203 | "acronym_dict = \"\"\"AAA, Accumulated Adjustment Account\n",
204 | "Acq., Acquiescence\n",
205 | "ACRS, Accelerated Cost Recovery System\n",
206 | "ADR, Asset Depreciation Range\n",
207 | "ADLs, Activities of Daily Living\n",
208 | "ADS, Alternative Depreciation System\n",
209 | "AFR, Applicable Federal Rate\n",
210 | "AGI, Adjusted Gross Income\n",
211 | "AIME, Average Indexed Monthly Earnings (Social Security)\n",
212 | "AMT, Alternative Minimum Tax\n",
213 | "AOD, Action on Decision\n",
214 | "ARM, Adjustable Rate Mortgage\n",
215 | "ATG, Audit Techniques Guide\n",
216 | "CB, Cumulative Bulletin\n",
217 | "CCA, Chief Council Advice\n",
218 | "CC-ITA, Chief Council - Income Tax and Accounting\n",
219 | "CCC, Commodity Credit Corporation\n",
220 | "CCP, Counter-Cyclical Program (government farm program)\n",
221 | "CDHP, Consumer-Driven Health Plan\n",
222 | "CFR, Code of Federal Regulations\n",
223 | "CLT, Charitable Lead Trust\n",
224 | "COBRA, Consolidated Omnibus Budget Reconciliations Act of 1985\n",
225 | "COGS, Cost of Goods Sold\n",
226 | "COLA, Cost of Living Adjustment\n",
227 | "CONUS, Continental United States\n",
228 | "CPI, Consurmer Price Index\n",
229 | "CRT, Charitable Remainder Trust\n",
230 | "CSRA, Community Spouse Resource Allowance\n",
231 | "CSRS, Civil Service Retirement System\n",
232 | "DOD, Date of Death\n",
233 | "DOI, Discharge of Indebtedness\n",
234 | "DP, Direct Payment (government farm program)\n",
235 | "DPAD, Domestic Production Activities Deduction\n",
236 | "DPAI, Domestic Production Activities Income\n",
237 | "DPAR, Domestic Production Activities Receipts\n",
238 | "DPGR, Domestic Production Gross Receipts\n",
239 | "EFIN, Electronic Filing Identification Number\n",
240 | "EFT, Electronic Funds Transfer\n",
241 | "EFTPS, Electronic Federal Tax Payment System\n",
242 | "EIC, Earned Income Credit\n",
243 | "EIN, Employer Identification Number\n",
244 | "f/b/o, For Benefit Of or For and On Behalf Of\n",
245 | "FICA, Federal Insurance Contribution Act\n",
246 | "FIFO, First In First Out\n",
247 | "FLP, Family Limited Partnership\n",
248 | "FMV, Fair Market Value\n",
249 | "FR, Federal Register\n",
250 | "FS, IRS Fact Sheets (example: FS-2005-10)\n",
251 | "FSA, Flexible Spending Account or Farm Service Agency\n",
252 | "FTD, Federal Tax Deposit\n",
253 | "FUTA, Federal Unemployment Tax Act\n",
254 | "GCM, General Counsel Memorandum\n",
255 | "GDS, General Depreciation System\n",
256 | "HDHP, High Deductible Health Plan\n",
257 | "HOH, Head of Household\n",
258 | "HRA, Health Reimbursement Account\n",
259 | "HSA, Health Savings Account\n",
260 | "IDC, Intangible Drilling Costs\n",
261 | "ILIT, Irrevocable Life Insurance Trust\n",
262 | "IR, IRS News Releases (example: IR-2005-2)\n",
263 | "IRA, Individual Retirement Arrangement\n",
264 | "IRB, Internal Revenue Bulletin\n",
265 | "IRC, Internal Revenue Code\n",
266 | "IRD, Income In Respect of Decedent\n",
267 | "IRP, Information Reporting Program\n",
268 | "ITA, Income Tax and Accounting\n",
269 | "ITIN, Individual Taxpayer Identification Number\n",
270 | "LDP, Loan Deficiency Payment\n",
271 | "LIFO, Last In First Out\n",
272 | "LLC, Limited Liability Company\n",
273 | "LLLP, Limited Liability Limited Partnership\n",
274 | "LP, Limited Partnership\n",
275 | "MACRS, Modified Accelerated Cost Recovery System\n",
276 | "MAGI, Modified Adjusted Gross Income\n",
277 | "MFJ, Married Filing Jointly\n",
278 | "MMMNA, Minimum Monthly Maintenance Needs Allowance\n",
279 | "MRD, Minimum Required Distribution\n",
280 | "MSA, Medical Savings Account (Archer MSA)\n",
281 | "MSSP, Market Segment Specialization Program\n",
282 | "NAICS, North American Industry Classification System\n",
283 | "NOL, Net Operating Loss\n",
284 | "OASDI, Old Age Survivor and Disability Insurance\n",
285 | "OIC, Offer in Compromise\n",
286 | "OID, Original Issue Discount\n",
287 | "PATR, Patronage Dividend\n",
288 | "PBA, Principal Business Activity\n",
289 | "PCP, Posted County Price, also referred to as AWP - adjusted world price\n",
290 | "PHC, Personal Holding Company\n",
291 | "PIA, Primary Insurance Amount (Social Security)\n",
292 | "PLR, Private Letter Ruling\n",
293 | "POD, Payable on Death\n",
294 | "PSC, Public Service Corporation\n",
295 | "QTIP, Qualified Terminable Interest Property\n",
296 | "RBD, Required Beginning Date\n",
297 | "REIT, Real Estate Investment Trust\n",
298 | "RMD, Required Minimum Distribution\n",
299 | "SCA, Service Center Advice\n",
300 | "SCIN, Self-Canceling Installment Note\n",
301 | "SE, Self Employment\n",
302 | "SEP, Simplified Employee Pension\n",
303 | "SIC, Service Industry Code\n",
304 | "SIMPLE, Savings Incentive Match Plan for Employees\n",
305 | "SL, Straight-Line Depreciation\n",
306 | "SMLLC, Single Member LLC\n",
307 | "SSA, Social Security Administration\n",
308 | "SSI, Supplemental Security Income\n",
309 | "SSN, Social Security Number\n",
310 | "SUTA, State Unemployment Tax Act\n",
311 | "TC, Tax Court\n",
312 | "TCMP, Taxpayer Compliance Measurement Program\n",
313 | "TD, Treasury Decision\n",
314 | "TIN, Taxpayer Identification Number\n",
315 | "TIR, Technical Information Release\n",
316 | "TOD, Transfer on Death\n",
317 | "USC, United States Code\n",
318 | "U/D/T, Under Declaration of Trust\n",
319 | "UNICAP, Uniform Capitalization Rules\n",
320 | "UTMA, Uniform Transfers to Minors Act\n",
321 | "VITA, Volunteer Income Tax Assistance\n",
322 | "GO Zone, Gulf Opportunity Zone\n",
323 | "Ct. D., Court Decision\n",
324 | "Ltr. Rul., Letter Rulings\n",
325 | "Prop. Reg., Proposed Treasury Regulations\n",
326 | "Pub. L., Public Law\n",
327 | "Rev. Proc., Revenue Procedure\n",
328 | "Rev. Rul., Revenue Ruling\n",
329 | "\"\"\""
330 | ]
331 | },
332 | {
333 | "cell_type": "markdown",
334 | "metadata": {},
335 | "source": [
336 | "** Add the thesaurus synonyms and the acronyms to a synonym map that can later be utilized by a search engine **"
337 | ]
338 | },
339 | {
340 | "cell_type": "code",
341 | "execution_count": 19,
342 | "metadata": {
343 | "collapsed": true,
344 | "deletable": true,
345 | "editable": true
346 | },
347 | "outputs": [],
348 | "source": [
349 | "OUTPUT_FILE = \"keywords_synonym.txt\"\n",
350 | "\n",
351 | "file = open(OUTPUT_FILE, 'w')\n",
352 | "# 1. add the acronyms: comma separated to indicate both ways relationship, e.g. \"<=>\"\n",
353 | "file.write(acronym_dict)\n",
354 | "# 2. add the synonyms: \"=>\" separated to indicate a relationship from left to right only\n",
355 | "for k,v in keyword_synonym_clean.items():\n",
356 | " line = k.strip() + \"=>\" + ','.join(v) + \"\\n\"\n",
357 | " file.write(line)\n",
358 | " \n",
359 | "file.close()"
360 | ]
361 | },
362 | {
363 | "cell_type": "markdown",
364 | "metadata": {
365 | "deletable": true,
366 | "editable": true
367 | },
368 | "source": [
369 | "** Peek at a few synonym map entries **"
370 | ]
371 | },
372 | {
373 | "cell_type": "code",
374 | "execution_count": 20,
375 | "metadata": {
376 | "collapsed": false,
377 | "deletable": true,
378 | "editable": true
379 | },
380 | "outputs": [
381 | {
382 | "name": "stdout",
383 | "output_type": "stream",
384 | "text": [
385 | "AAA, Accumulated Adjustment Account\n",
386 | "Acq., Acquiescence\n",
387 | "ACRS, Accelerated Cost Recovery System\n",
388 | "ADR, Asset Depreciation Range\n",
389 | "ADLs, Activities of Daily Living\n",
390 | "organizing=>run,formulate,form,set up,create\n",
391 | "inheritances=>legacy,bequest,estate,heritage,devise\n",
392 | "reported=>recorded,noted,announced,rumored,said\n",
393 | "interest=>importance,significance,sympathy,passion,activity\n",
394 | "book=>essay,album,novel,publication,dictionary\n"
395 | ]
396 | }
397 | ],
398 | "source": [
399 | "%%bash\n",
400 | "cat keywords_synonym.txt | head -5 | less -S\n",
401 | "cat keywords_synonym.txt | tail -5 | less -S"
402 | ]
403 | },
404 | {
405 | "cell_type": "code",
406 | "execution_count": null,
407 | "metadata": {
408 | "collapsed": true,
409 | "deletable": true,
410 | "editable": true
411 | },
412 | "outputs": [],
413 | "source": []
414 | }
415 | ],
416 | "metadata": {
417 | "anaconda-cloud": {},
418 | "kernelspec": {
419 | "display_name": "Python 3.5",
420 | "language": "python",
421 | "name": "python3"
422 | },
423 | "language_info": {
424 | "codemirror_mode": {
425 | "name": "ipython",
426 | "version": 3
427 | },
428 | "file_extension": ".py",
429 | "mimetype": "text/x-python",
430 | "name": "python",
431 | "nbconvert_exporter": "python",
432 | "pygments_lexer": "ipython3",
433 | "version": "3.5.2"
434 | }
435 | },
436 | "nbformat": 4,
437 | "nbformat_minor": 1
438 | }
439 |
--------------------------------------------------------------------------------
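The keywords_synonym.txt file written above follows the Solr synonym syntax consumed by Azure Search synonym maps (uploading it is covered in the companion notebook AugmentingSearch_UploadingSynonymMapToAzureSearch.ipynb): a comma-separated line declares equivalent terms that expand in both directions, while "lhs=>rhs" rewrites the left-hand side to the right-hand terms only. A small sketch that classifies the entries in the generated file, assuming it sits in the working directory:

    # Classify each line of the generated synonym map as one-way or two-way
    with open('keywords_synonym.txt') as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            if '=>' in line:
                lhs, rhs = line.split('=>', 1)
                print('one-way: %s -> %s' % (lhs.strip(), rhs.strip()))
            else:
                terms = [t.strip() for t in line.split(',')]
                print('two-way: %s' % ' | '.join(terms))
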
/JupyterNotebooks/SmartStoplist.txt:
--------------------------------------------------------------------------------
1 | #stop word list from SMART (Salton,1971). Available at ftp://ftp.cs.cornell.edu/pub/smart/english.stop
2 | a
3 | a's
4 | able
5 | about
6 | above
7 | according
8 | accordingly
9 | across
10 | actually
11 | after
12 | afterwards
13 | again
14 | against
15 | ain't
16 | all
17 | allow
18 | allows
19 | almost
20 | alone
21 | along
22 | already
23 | also
24 | although
25 | always
26 | am
27 | among
28 | amongst
29 | an
30 | and
31 | another
32 | any
33 | anybody
34 | anyhow
35 | anyone
36 | anything
37 | anyway
38 | anyways
39 | anywhere
40 | apart
41 | appear
42 | appreciate
43 | appropriate
44 | are
45 | aren't
46 | around
47 | as
48 | aside
49 | ask
50 | asking
51 | associated
52 | at
53 | available
54 | away
55 | awfully
56 | b
57 | be
58 | became
59 | because
60 | become
61 | becomes
62 | becoming
63 | been
64 | before
65 | beforehand
66 | behind
67 | being
68 | believe
69 | below
70 | beside
71 | besides
72 | best
73 | better
74 | between
75 | beyond
76 | both
77 | brief
78 | but
79 | by
80 | c
81 | c'mon
82 | c's
83 | came
84 | can
85 | can't
86 | cannot
87 | cant
88 | cause
89 | causes
90 | certain
91 | certainly
92 | changes
93 | clearly
94 | co
95 | com
96 | come
97 | comes
98 | concerning
99 | consequently
100 | consider
101 | considering
102 | contain
103 | containing
104 | contains
105 | corresponding
106 | could
107 | couldn't
108 | course
109 | currently
110 | d
111 | definitely
112 | described
113 | despite
114 | did
115 | didn't
116 | different
117 | do
118 | does
119 | doesn't
120 | doing
121 | don't
122 | done
123 | down
124 | downwards
125 | during
126 | e
127 | each
128 | edu
129 | eg
130 | eight
131 | either
132 | else
133 | elsewhere
134 | enough
135 | entirely
136 | especially
137 | et
138 | etc
139 | even
140 | ever
141 | every
142 | everybody
143 | everyone
144 | everything
145 | everywhere
146 | ex
147 | exactly
148 | example
149 | except
150 | f
151 | far
152 | few
153 | fifth
154 | first
155 | five
156 | followed
157 | following
158 | follows
159 | for
160 | former
161 | formerly
162 | forth
163 | four
164 | from
165 | further
166 | furthermore
167 | g
168 | get
169 | gets
170 | getting
171 | given
172 | gives
173 | go
174 | goes
175 | going
176 | gone
177 | got
178 | gotten
179 | greetings
180 | h
181 | had
182 | hadn't
183 | happens
184 | hardly
185 | has
186 | hasn't
187 | have
188 | haven't
189 | having
190 | he
191 | he's
192 | hello
193 | help
194 | hence
195 | her
196 | here
197 | here's
198 | hereafter
199 | hereby
200 | herein
201 | hereupon
202 | hers
203 | herself
204 | hi
205 | him
206 | himself
207 | his
208 | hither
209 | hopefully
210 | how
211 | howbeit
212 | however
213 | i
214 | i'd
215 | i'll
216 | i'm
217 | i've
218 | ie
219 | if
220 | ignored
221 | immediate
222 | in
223 | inasmuch
224 | inc
225 | indeed
226 | indicate
227 | indicated
228 | indicates
229 | inner
230 | insofar
231 | instead
232 | into
233 | inward
234 | is
235 | isn't
236 | it
237 | it'd
238 | it'll
239 | it's
240 | its
241 | itself
242 | j
243 | just
244 | k
245 | keep
246 | keeps
247 | kept
248 | know
249 | knows
250 | known
251 | l
252 | last
253 | lately
254 | later
255 | latter
256 | latterly
257 | least
258 | less
259 | lest
260 | let
261 | let's
262 | like
263 | liked
264 | likely
265 | little
266 | look
267 | looking
268 | looks
269 | ltd
270 | m
271 | mainly
272 | many
273 | may
274 | maybe
275 | me
276 | mean
277 | meanwhile
278 | merely
279 | might
280 | more
281 | moreover
282 | most
283 | mostly
284 | much
285 | must
286 | my
287 | myself
288 | n
289 | name
290 | namely
291 | nd
292 | near
293 | nearly
294 | necessary
295 | need
296 | needs
297 | neither
298 | never
299 | nevertheless
300 | new
301 | next
302 | nine
303 | no
304 | nobody
305 | non
306 | none
307 | noone
308 | nor
309 | normally
310 | not
311 | nothing
312 | novel
313 | now
314 | nowhere
315 | o
316 | obviously
317 | of
318 | off
319 | often
320 | oh
321 | ok
322 | okay
323 | old
324 | on
325 | once
326 | one
327 | ones
328 | only
329 | onto
330 | or
331 | other
332 | others
333 | otherwise
334 | ought
335 | our
336 | ours
337 | ourselves
338 | out
339 | outside
340 | over
341 | overall
342 | own
343 | p
344 | particular
345 | particularly
346 | per
347 | perhaps
348 | placed
349 | please
350 | plus
351 | possible
352 | presumably
353 | probably
354 | provides
355 | q
356 | que
357 | quite
358 | qv
359 | r
360 | rather
361 | rd
362 | re
363 | really
364 | reasonably
365 | regarding
366 | regardless
367 | regards
368 | relatively
369 | respectively
370 | right
371 | s
372 | said
373 | same
374 | saw
375 | say
376 | saying
377 | says
378 | second
379 | secondly
380 | see
381 | seeing
382 | seem
383 | seemed
384 | seeming
385 | seems
386 | seen
387 | self
388 | selves
389 | sensible
390 | sent
391 | serious
392 | seriously
393 | seven
394 | several
395 | shall
396 | she
397 | should
398 | shouldn't
399 | since
400 | six
401 | so
402 | some
403 | somebody
404 | somehow
405 | someone
406 | something
407 | sometime
408 | sometimes
409 | somewhat
410 | somewhere
411 | soon
412 | sorry
413 | specified
414 | specify
415 | specifying
416 | still
417 | sub
418 | such
419 | sup
420 | sure
421 | t
422 | t's
423 | take
424 | taken
425 | tell
426 | tends
427 | th
428 | than
429 | thank
430 | thanks
431 | thanx
432 | that
433 | that's
434 | thats
435 | the
436 | their
437 | theirs
438 | them
439 | themselves
440 | then
441 | thence
442 | there
443 | there's
444 | thereafter
445 | thereby
446 | therefore
447 | therein
448 | theres
449 | thereupon
450 | these
451 | they
452 | they'd
453 | they'll
454 | they're
455 | they've
456 | think
457 | third
458 | this
459 | thorough
460 | thoroughly
461 | those
462 | though
463 | three
464 | through
465 | throughout
466 | thru
467 | thus
468 | to
469 | together
470 | too
471 | took
472 | toward
473 | towards
474 | tried
475 | tries
476 | truly
477 | try
478 | trying
479 | twice
480 | two
481 | u
482 | un
483 | under
484 | unfortunately
485 | unless
486 | unlikely
487 | until
488 | unto
489 | up
490 | upon
491 | us
492 | use
493 | used
494 | useful
495 | uses
496 | using
497 | usually
498 | uucp
499 | v
500 | value
501 | various
502 | very
503 | via
504 | viz
505 | vs
506 | w
507 | want
508 | wants
509 | was
510 | wasn't
511 | way
512 | we
513 | we'd
514 | we'll
515 | we're
516 | we've
517 | welcome
518 | well
519 | went
520 | were
521 | weren't
522 | what
523 | what's
524 | whatever
525 | when
526 | whence
527 | whenever
528 | where
529 | where's
530 | whereafter
531 | whereas
532 | whereby
533 | wherein
534 | whereupon
535 | wherever
536 | whether
537 | which
538 | while
539 | whither
540 | who
541 | who's
542 | whoever
543 | whole
544 | whom
545 | whose
546 | why
547 | will
548 | willing
549 | wish
550 | with
551 | within
552 | without
553 | won't
554 | wonder
555 | would
556 | would
557 | wouldn't
558 | x
559 | y
560 | yes
561 | yet
562 | you
563 | you'd
564 | you'll
565 | you're
566 | you've
567 | your
568 | yours
569 | yourself
570 | yourselves
571 | z
572 | zero
573 |
--------------------------------------------------------------------------------
/JupyterNotebooks/SmartStoplist_extended.txt:
--------------------------------------------------------------------------------
1 | #stop word list from SMART (Salton,1971). Available at ftp://ftp.cs.cornell.edu/pub/smart/english.stop
2 | a
3 | a's
4 | able
5 | about
6 | above
7 | according
8 | accordingly
9 | across
10 | actually
11 | after
12 | afterwards
13 | again
14 | against
15 | ain't
16 | all
17 | allow
18 | allows
19 | almost
20 | alone
21 | along
22 | already
23 | also
24 | although
25 | always
26 | am
27 | among
28 | amongst
29 | an
30 | and
31 | another
32 | any
33 | anybody
34 | anyhow
35 | anyone
36 | anything
37 | anyway
38 | anyways
39 | anywhere
40 | apart
41 | appear
42 | appreciate
43 | appropriate
44 | are
45 | aren't
46 | around
47 | as
48 | aside
49 | ask
50 | asking
51 | associated
52 | at
53 | available
54 | away
55 | awfully
56 | b
57 | be
58 | became
59 | because
60 | become
61 | becomes
62 | becoming
63 | been
64 | before
65 | beforehand
66 | behind
67 | being
68 | believe
69 | below
70 | beside
71 | besides
72 | best
73 | better
74 | between
75 | beyond
76 | both
77 | brief
78 | but
79 | by
80 | c
81 | c'mon
82 | c's
83 | came
84 | can
85 | can't
86 | cannot
87 | cant
88 | cause
89 | causes
90 | certain
91 | certainly
92 | changes
93 | clearly
94 | co
95 | com
96 | come
97 | comes
98 | concerning
99 | consequently
100 | consider
101 | considering
102 | contain
103 | containing
104 | contains
105 | corresponding
106 | could
107 | couldn't
108 | course
109 | currently
110 | d
111 | definitely
112 | described
113 | despite
114 | did
115 | didn't
116 | different
117 | do
118 | does
119 | doesn't
120 | doing
121 | don't
122 | done
123 | down
124 | downwards
125 | during
126 | e
127 | each
128 | edu
129 | eg
130 | eight
131 | either
132 | else
133 | elsewhere
134 | enough
135 | entirely
136 | especially
137 | et
138 | etc
139 | even
140 | ever
141 | every
142 | everybody
143 | everyone
144 | everything
145 | everywhere
146 | ex
147 | exactly
148 | example
149 | except
150 | f
151 | far
152 | few
153 | fifth
154 | first
155 | five
156 | followed
157 | following
158 | follows
159 | for
160 | former
161 | formerly
162 | forth
163 | four
164 | from
165 | further
166 | furthermore
167 | g
168 | get
169 | gets
170 | getting
171 | given
172 | gives
173 | go
174 | goes
175 | going
176 | gone
177 | got
178 | gotten
179 | greetings
180 | h
181 | had
182 | hadn't
183 | happens
184 | hardly
185 | has
186 | hasn't
187 | have
188 | haven't
189 | having
190 | he
191 | he's
192 | hello
193 | help
194 | hence
195 | her
196 | here
197 | here's
198 | hereafter
199 | hereby
200 | herein
201 | hereupon
202 | hers
203 | herself
204 | hi
205 | him
206 | himself
207 | his
208 | hither
209 | hopefully
210 | how
211 | howbeit
212 | however
213 | i
214 | i'd
215 | i'll
216 | i'm
217 | i've
218 | ie
219 | if
220 | ignored
221 | immediate
222 | in
223 | inasmuch
224 | inc
225 | indeed
226 | indicate
227 | indicated
228 | indicates
229 | inner
230 | insofar
231 | instead
232 | into
233 | inward
234 | is
235 | isn't
236 | it
237 | it'd
238 | it'll
239 | it's
240 | its
241 | itself
242 | j
243 | just
244 | k
245 | keep
246 | keeps
247 | kept
248 | know
249 | knows
250 | known
251 | l
252 | last
253 | lately
254 | later
255 | latter
256 | latterly
257 | least
258 | less
259 | lest
260 | let
261 | let's
262 | like
263 | liked
264 | likely
265 | little
266 | look
267 | looking
268 | looks
269 | ltd
270 | m
271 | mainly
272 | many
273 | may
274 | maybe
275 | me
276 | mean
277 | meanwhile
278 | merely
279 | might
280 | more
281 | moreover
282 | most
283 | mostly
284 | much
285 | must
286 | my
287 | myself
288 | n
289 | name
290 | namely
291 | nd
292 | near
293 | nearly
294 | necessary
295 | need
296 | needs
297 | neither
298 | never
299 | nevertheless
300 | new
301 | next
302 | nine
303 | no
304 | nobody
305 | non
306 | none
307 | noone
308 | nor
309 | normally
310 | not
311 | nothing
312 | novel
313 | now
314 | nowhere
315 | o
316 | obviously
317 | of
318 | off
319 | often
320 | oh
321 | ok
322 | okay
323 | old
324 | on
325 | once
326 | one
327 | ones
328 | only
329 | onto
330 | or
331 | other
332 | others
333 | otherwise
334 | ought
335 | our
336 | ours
337 | ourselves
338 | out
339 | outside
340 | over
341 | overall
342 | own
343 | p
344 | particular
345 | particularly
346 | per
347 | perhaps
348 | placed
349 | please
350 | plus
351 | possible
352 | presumably
353 | probably
354 | provides
355 | q
356 | que
357 | quite
358 | qv
359 | r
360 | rather
361 | rd
362 | re
363 | really
364 | reasonably
365 | regarding
366 | regardless
367 | regards
368 | relatively
369 | respectively
370 | right
371 | s
372 | said
373 | same
374 | saw
375 | say
376 | saying
377 | says
378 | second
379 | secondly
380 | see
381 | seeing
382 | seem
383 | seemed
384 | seeming
385 | seems
386 | seen
387 | self
388 | selves
389 | sensible
390 | sent
391 | serious
392 | seriously
393 | seven
394 | several
395 | shall
396 | she
397 | should
398 | shouldn't
399 | since
400 | six
401 | so
402 | some
403 | somebody
404 | somehow
405 | someone
406 | something
407 | sometime
408 | sometimes
409 | somewhat
410 | somewhere
411 | soon
412 | sorry
413 | specified
414 | specify
415 | specifying
416 | still
417 | sub
418 | such
419 | sup
420 | sure
421 | t
422 | t's
423 | take
424 | taken
425 | tell
426 | tends
427 | th
428 | than
429 | thank
430 | thanks
431 | thanx
432 | that
433 | that's
434 | thats
435 | the
436 | their
437 | theirs
438 | them
439 | themselves
440 | then
441 | thence
442 | there
443 | there's
444 | thereafter
445 | thereby
446 | therefore
447 | therein
448 | theres
449 | thereupon
450 | these
451 | they
452 | they'd
453 | they'll
454 | they're
455 | they've
456 | think
457 | third
458 | this
459 | thorough
460 | thoroughly
461 | those
462 | though
463 | three
464 | through
465 | throughout
466 | thru
467 | thus
468 | to
469 | together
470 | too
471 | took
472 | toward
473 | towards
474 | tried
475 | tries
476 | truly
477 | try
478 | trying
479 | twice
480 | two
481 | u
482 | un
483 | under
484 | unfortunately
485 | unless
486 | unlikely
487 | until
488 | unto
489 | up
490 | upon
491 | us
492 | use
493 | used
494 | useful
495 | uses
496 | using
497 | usually
498 | uucp
499 | v
500 | value
501 | various
502 | very
503 | via
504 | viz
505 | vs
506 | w
507 | want
508 | wants
509 | was
510 | wasn't
511 | way
512 | we
513 | we'd
514 | we'll
515 | we're
516 | we've
517 | welcome
518 | well
519 | went
520 | were
521 | weren't
522 | what
523 | what's
524 | whatever
525 | when
526 | whence
527 | whenever
528 | where
529 | where's
530 | whereafter
531 | whereas
532 | whereby
533 | wherein
534 | whereupon
535 | wherever
536 | whether
537 | which
538 | while
539 | whither
540 | who
541 | who's
542 | whoever
543 | whole
544 | whom
545 | whose
546 | why
547 | will
548 | willing
549 | wish
550 | with
551 | within
552 | without
553 | won't
554 | wonder
555 | would
556 | would
557 | wouldn't
558 | x
559 | y
560 | yes
561 | yet
562 | you
563 | you'd
564 | you'll
565 | you're
566 | you've
567 | your
568 | yours
569 | yourself
570 | yourselves
571 | z
572 | zero
573 | ###################
574 | section
575 | subsection
576 | sections
577 | subsections
578 | chapter
579 | chapters
580 | example
581 | paragraph
582 | paragraphs
583 | regard
584 | clause
585 | subclause
586 | case
587 | subparagraph
588 | subparagraphs
589 | i
590 | ii
591 | iii
592 | iv
593 | v
594 | vi
595 | vii
596 | viii
597 | ix
598 | x
599 |
600 |
--------------------------------------------------------------------------------
/JupyterNotebooks/rake.py:
--------------------------------------------------------------------------------
1 | # Implementation of RAKE - Rapid Automatic Keyword Extraction algorithm
2 | # as described in:
3 | # Rose, S., D. Engel, N. Cramer, and W. Cowley (2010).
4 | # Automatic keyword extraction from individual documents.
5 | # In M. W. Berry and J. Kogan (Eds.), Text Mining: Applications and Theory. John Wiley and Sons, Ltd.
6 |
7 | import re
8 | import operator
9 |
10 | debug = False
11 | test = False
12 |
13 |
14 | def is_number(s):
15 | try:
16 | float(s) if '.' in s else int(s)
17 | return True
18 | except ValueError:
19 | return False
20 |
21 |
22 | def load_stop_words(stop_word_file):
23 | """
24 | Utility function to load stop words from a file and return as a list of words
25 | @param stop_word_file Path and file name of a file containing stop words.
26 | @return list A list of stop words.
27 | """
28 | stop_words = []
29 | for line in open(stop_word_file):
30 | if line.strip()[0:1] != "#":
31 | for word in line.split(): # in case more than one per line
32 | stop_words.append(word)
33 | return stop_words
34 |
35 |
36 | def separate_words(text, min_word_return_size):
37 | """
38 | Utility function to return a list of all words that have a length greater than a specified number of characters.
39 | @param text The text that must be split into words.
40 | @param min_word_return_size The minimum number of characters a word must have to be included.
41 | """
42 | splitter = re.compile('[^a-zA-Z0-9_\\+\\-/]')
43 | words = []
44 | for single_word in splitter.split(text):
45 | current_word = single_word.strip().lower()
46 | #leave numbers in phrase, but don't count as words, since they tend to invalidate scores of their phrases
47 | if len(current_word) > min_word_return_size and current_word != '' and not is_number(current_word):
48 | words.append(current_word)
49 | return words
50 |
51 |
52 | def split_sentences(text):
53 | """
54 | Utility function to return a list of sentences.
55 | @param text The text that must be split into sentences.
56 | """
57 | sentence_delimiters = re.compile(u'[.!?,;:\t\\\\"\\(\\)\\\'\u2019\u2013]|\\s\\-\\s')
58 | sentences = sentence_delimiters.split(text)
59 | return sentences
60 |
61 |
62 | def build_stop_word_regex(stop_word_file_path):
63 | stop_word_list = load_stop_words(stop_word_file_path)
64 | stop_word_regex_list = []
65 | for word in stop_word_list:
66 | word_regex = r'\b' + word + r'(?![\w-])' # added look ahead for hyphen
67 | stop_word_regex_list.append(word_regex)
68 | stop_word_pattern = re.compile('|'.join(stop_word_regex_list), re.IGNORECASE)
69 | return stop_word_pattern
70 |
71 |
72 | def generate_candidate_keywords(sentence_list, stopword_pattern):
73 | phrase_list = []
74 | for s in sentence_list:
75 | tmp = re.sub(stopword_pattern, '|', s.strip())
76 | phrases = tmp.split("|")
77 | for phrase in phrases:
78 | phrase = phrase.strip().lower()
79 | if phrase != "":
80 | phrase_list.append(phrase)
81 | return phrase_list
82 |
83 |
84 | def calculate_word_scores(phraseList):
85 | word_frequency = {}
86 | word_degree = {}
87 | for phrase in phraseList:
88 | word_list = separate_words(phrase, 0)
89 | word_list_length = len(word_list)
90 | word_list_degree = word_list_length - 1
91 | #if word_list_degree > 3: word_list_degree = 3 #exp.
92 | for word in word_list:
93 | word_frequency.setdefault(word, 0)
94 | word_frequency[word] += 1
95 | word_degree.setdefault(word, 0)
96 | word_degree[word] += word_list_degree #orig.
97 | #word_degree[word] += 1/(word_list_length*1.0) #exp.
98 | for item in word_frequency:
99 | word_degree[item] = word_degree[item] + word_frequency[item]
100 |
101 | # Calculate word scores = deg(w) / freq(w)
102 | word_score = {}
103 | for item in word_frequency:
104 | word_score.setdefault(item, 0)
105 | word_score[item] = word_degree[item] / (word_frequency[item] * 1.0) #orig.
106 | #word_score[item] = word_frequency[item]/(word_degree[item] * 1.0) #exp.
107 | return word_score
108 |
109 |
110 | def generate_candidate_keyword_scores(phrase_list, word_score):
111 | keyword_candidates = {}
112 | for phrase in phrase_list:
113 | keyword_candidates.setdefault(phrase, 0)
114 | word_list = separate_words(phrase, 0)
115 | candidate_score = 0
116 | for word in word_list:
117 | candidate_score += word_score[word]
118 | keyword_candidates[phrase] = candidate_score
119 | return keyword_candidates
120 |
121 |
122 | class Rake(object):
123 | def __init__(self, stop_words_path):
124 | self.stop_words_path = stop_words_path
125 | self.__stop_words_pattern = build_stop_word_regex(stop_words_path)
126 |
127 | def run(self, text):
128 | sentence_list = split_sentences(text)
129 |
130 | phrase_list = generate_candidate_keywords(sentence_list, self.__stop_words_pattern)
131 | word_scores = calculate_word_scores(phrase_list)
132 |
133 | keyword_candidates = generate_candidate_keyword_scores(phrase_list, word_scores)
134 |
135 | sorted_keywords = sorted(keyword_candidates.items(), key=operator.itemgetter(1), reverse=True)
136 | return sorted_keywords
137 |
138 |
139 | if test:
140 | text = "Compatibility of systems of linear constraints over the set of natural numbers. Criteria of compatibility of a system of linear Diophantine equations, strict inequations, and nonstrict inequations are considered. Upper bounds for components of a minimal set of solutions and algorithms of construction of minimal generating sets of solutions for all types of systems are given. These criteria and the corresponding algorithms for constructing a minimal supporting set of solutions can be used in solving all the considered types of systems and systems of mixed types."
141 |
142 | # Split text into sentences
143 | sentenceList = split_sentences(text)
144 | #stoppath = "FoxStoplist.txt" #Fox stoplist contains "numbers", so it will not find "natural numbers" like in Table 1.1
145 | stoppath = "SmartStoplist.txt" #SMART stoplist misses some of the lower-scoring keywords in Figure 1.5, which means that the top 1/3 cuts off one of the 4.0 score words in Table 1.1
146 | stopwordpattern = build_stop_word_regex(stoppath)
147 |
148 | # generate candidate keywords
149 | phraseList = generate_candidate_keywords(sentenceList, stopwordpattern)
150 |
151 | # calculate individual word scores
152 | wordscores = calculate_word_scores(phraseList)
153 |
154 | # generate candidate keyword scores
155 | keywordcandidates = generate_candidate_keyword_scores(phraseList, wordscores)
156 | if debug: print(keywordcandidates)
157 |
158 | sortedKeywords = sorted(keywordcandidates.items(), key=operator.itemgetter(1), reverse=True)
159 | if debug: print(sortedKeywords)
160 |
161 | totalKeywords = len(sortedKeywords)
162 | if debug: print(totalKeywords)
163 | print(sortedKeywords[0:(totalKeywords // 3)])  # integer division for Python 3
164 |
165 | rake = Rake("SmartStoplist.txt")
166 | keywords = rake.run(text)
167 | print(keywords)
168 |
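[Editor's note] A small usage sketch of the deg(w)/freq(w) scoring implemented above (not part of rake.py; it assumes rake.py and SmartStoplist.txt sit in the working directory):

    from rake import Rake

    rake = Rake("SmartStoplist.txt")
    # In "linear constraints" each word co-occurs with one other word, so
    # deg = 2 and freq = 1, giving a word score of 2.0 and a phrase score of 4.0;
    # a standalone candidate such as "criteria" scores deg/freq = 1.0.
    text = ("Compatibility of systems of linear constraints. "
            "Criteria of compatibility are considered.")
    for phrase, score in rake.run(text):
        print("%.1f  %s" % (score, phrase))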
--------------------------------------------------------------------------------
/JupyterNotebooks/sample_page.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/CatalystCode/CustomSearch/3d9f82f676577210dde159ef7218753a2e8c309b/JupyterNotebooks/sample_page.png
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
1 | The MIT License (MIT)
2 | Copyright (c) Microsoft Corporation
3 |
4 | Permission is hereby granted, free of charge, to any person obtaining a copy of this software and
5 | associated documentation files (the "Software"), to deal in the Software without restriction,
6 | including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense,
7 | and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so,
8 | subject to the following conditions:
9 |
10 | The above copyright notice and this permission notice shall be included in all copies or substantial
11 | portions of the Software.
12 |
13 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT
14 | NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
15 | IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY,
16 | WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
17 | SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
--------------------------------------------------------------------------------
/Python/azsearch_mgmt.py:
--------------------------------------------------------------------------------
1 | """
2 | Python code to upload data to Azure Search for the Custom Search example.
3 |
4 | This script will upload all of the session information where
5 | each individual session equates to a document in an index
6 | in an Azure Search service.
7 |
8 | Go to http://portal.azure.com and sign up for a search service.
9 | Get the service name and service key and plug it in below.
10 | This is NOT production level code. Please do not use it as such.
11 | You might have to pip install the imported modules here.
12 |
13 | Run this script in the 'Python' directory:
14 | python azsearch_mgmt.py
15 |
16 | See Azure Search REST API docs for more info:
17 | https://docs.microsoft.com/en-us/rest/api/searchservice/index
18 |
19 | Dependencies: This script requires pyexcel, pyexcel-xls and pyexcel-xlsx
20 | To install dependencies: pip install pyexcel pyexcel-xls pyexcel-xlsx
21 | """
22 |
23 | import requests
24 | import json
25 | import csv
26 | import datetime
27 | import pytz
28 | import calendar
29 | import os
30 | import pyexcel as pe
31 |
32 | # This is the service you've already created in Azure Portal
33 | serviceName = 'your_azure_search_service_name'
34 |
35 | # Index to be created
36 | indexName = 'name_of_index_to_create'
37 |
38 | # Set your service API key, either via an environment variable or enter it below
39 | #apiKey = os.getenv('SEARCH_KEY_DEV', '')
40 | apiKey = 'your_azure_search_service_api_key'
41 | apiVersion = '2016-09-01'
42 |
43 | # Input parsed content Excel file, e.g., output of step #1 in
44 | # https://github.com/CatalystCode/CustomSearch/tree/master/JupyterNotebooks/1-content_extraction.ipynb
45 | inputfile = os.path.join(os.getcwd(), '../sample/parsed_content.xlsx')
46 |
47 | # Define fields mapping from Excel file column names to search index field names (except Index)
48 | # Change this mapping to match your content fields and rename output fields as desired
49 | # Search field names should match their definition in getIndexDefinition()
50 | fields_map = [ ('File' , 'File'),
51 | ('ChapterTitle' , 'ChapterTitle'),
52 | ('SectionTitle' , 'SectionTitle'),
53 | ('SubsectionTitle' , 'SubsectionTitle'),
54 | ('SubsectionText' , 'SubsectionText'),
55 | ('Keywords' , 'Keywords') ]
56 |
57 | # Fields: Index File ChapterTitle SectionTitle SubsectionTitle SubsectionText Keywords
58 | def getIndexDefinition():
59 | return {
60 | "name": indexName,
61 | "fields": [
62 | {"name": "Index", "type": "Edm.String", "key": True, "retrievable": True, "searchable": False, "filterable": False, "sortable": True, "facetable": False},
63 |
64 | {"name": "File", "type": "Edm.String", "retrievable": True, "searchable": False, "filterable": True, "sortable": True, "facetable": False},
65 |
66 | {"name": "ChapterTitle", "type": "Edm.String", "retrievable": True, "searchable": True, "filterable": True, "sortable": True, "facetable": True},
67 |
68 | {"name": "SectionTitle", "type": "Edm.String", "retrievable": True, "searchable": True, "filterable": True, "sortable": False, "facetable": True},
69 |
70 | {"name": "SubsectionTitle", "type": "Edm.String", "retrievable": True, "searchable": True, "filterable": True, "sortable": True, "facetable": False},
71 |
72 | {"name": "SubsectionText", "type": "Edm.String", "retrievable": True, "searchable": True, "filterable": False, "sortable": False, "facetable": False, "analyzer": "en.microsoft"},
73 |
74 | {"name": "Keywords", "type": "Edm.String", "retrievable": True, "searchable": True, "filterable": False, "sortable": False, "facetable": False, "analyzer": "en.microsoft"}
75 | ]
76 | }
77 |
78 | def getServiceUrl():
79 | return 'https://' + serviceName + '.search.windows.net'
80 |
81 | def getMethod(servicePath):
82 | headers = {'Content-type': 'application/json', 'api-key': apiKey}
83 | r = requests.get(getServiceUrl() + servicePath, headers=headers)
84 | #print(r.text)
85 | return r
86 |
87 | def postMethod(servicePath, body):
88 | headers = {'Content-type': 'application/json', 'api-key': apiKey}
89 | r = requests.post(getServiceUrl() + servicePath, headers=headers, data=body)
90 | #print(r, r.text)
91 | return r
92 |
93 | def createIndex():
94 | indexDefinition = json.dumps(getIndexDefinition())
95 | servicePath = '/indexes/?api-version=%s' % apiVersion
96 | r = postMethod(servicePath, indexDefinition)
97 | #print(r.text)
98 | if r.status_code == 201:
99 | print('Index %s created' % indexName)
100 | else:
101 | print('Failed to create index %s' % indexName)
102 | exit(1)
103 |
104 | def deleteIndex():
105 | servicePath = '/indexes/%s?api-version=%s' % (indexName, apiVersion)
106 | headers = {'Content-type': 'application/json', 'api-key': apiKey}
107 | r = requests.delete(getServiceUrl() + servicePath, headers=headers)
108 | #print(r.text)
109 |
110 | def getIndex():
111 | servicePath = '/indexes/%s?api-version=%s' % (indexName, apiVersion)
112 | r = getMethod(servicePath)
113 | if r.status_code == 200:
114 | return True
115 | else:
116 | return False
117 |
118 | def getDocumentObject():
119 | valarry = []
120 | cnt = 1
121 | records = pe.iget_records(file_name=inputfile)
122 | for row in records:
123 | outdict = {}
124 | outdict['@search.action'] = 'upload'
125 |
126 | if (row[fields_map[0][0]]):
127 | outdict['Index'] = str(row['Index'])
128 | for (in_fld, out_fld) in fields_map:
129 | outdict[out_fld] = row[in_fld]
130 | valarry.append(outdict)
131 | cnt+=1
132 |
133 | return {'value' : valarry}
134 |
135 | def getDocumentObjectByChunk(start, end):
136 | valarry = []
137 | cnt = 1
138 | records = pe.iget_records(file_name=inputfile)
139 | for i, row in enumerate(records):
140 | if start <= i < end:
141 | outdict = {}
142 | outdict['@search.action'] = 'upload'
143 |
144 | if (row[fields_map[0][0]]):
145 | outdict['Index'] = str(row['Index'])
146 | for (in_fld, out_fld) in fields_map:
147 | outdict[out_fld] = row[in_fld]
148 | valarry.append(outdict)
149 | cnt+=1
150 |
151 | return {'value' : valarry}
152 |
153 | # Upload content for indexing in one request if content is not too large
154 | def uploadDocuments():
155 | documents = json.dumps(getDocumentObject())
156 | servicePath = '/indexes/' + indexName + '/docs/index?api-version=' + apiVersion
157 | r = postMethod(servicePath, documents)
158 | if r.status_code == 200:
159 | print('Success: %s' % r)
160 | else:
161 | print('Failure: %s' % r.text)
162 | exit(1)
163 |
164 | # Upload content for indexing in chunks if content is too large for one request
165 | def uploadDocumentsInChunks(chunksize):
166 | records = pe.iget_records(file_name=inputfile)
167 | cnt = 0
168 | for row in records:
169 | cnt += 1
170 |
171 | for chunk in range(int(cnt/chunksize) + 1):
172 | print('Processing chunk number %d ...' % chunk)
173 | start = chunk * chunksize
174 | end = start + chunksize
175 | documents = json.dumps(getDocumentObjectByChunk(start, end))
176 | servicePath = '/indexes/' + indexName + '/docs/index?api-version=' + apiVersion
177 | r = postMethod(servicePath, documents)
178 | if r.status_code == 200:
179 | print('Success: %s' % r)
180 | else:
181 | print('Failure: %s' % r.text)
182 | return
183 |
184 | # Upload content for indexing one document at a time
185 | def uploadDocumentsOneByOne():
186 | records = pe.iget_records(file_name=inputfile)
187 | valarry = []
188 | for i, row in enumerate(records):
189 | outdict = {}
190 | outdict['@search.action'] = 'upload'
191 |
192 | if (row[fields_map[0][0]]):
193 | outdict['Index'] = str(row['Index'])
194 | for (in_fld, out_fld) in fields_map:
195 | outdict[out_fld] = row[in_fld]
196 | valarry.append(outdict)
197 |
198 | documents = json.dumps({'value' : valarry})
199 | servicePath = '/indexes/' + indexName + '/docs/index?api-version=' + apiVersion
200 | r = postMethod(servicePath, documents)
201 | if r.status_code == 200:
202 | print('%d Success: %s' % (i,r))
203 | else:
204 | print('%d Failure: %s' % (i, r.text))
205 | exit(1)
206 |
207 | def printDocumentCount():
208 | servicePath = '/indexes/' + indexName + '/docs/$count?api-version=' + apiVersion
209 | getMethod(servicePath)
210 |
211 | def sampleQuery(query, ntop=3):
212 | servicePath = '/indexes/' + indexName + '/docs?api-version=%s&search=%s&$top=%d' % \
213 | (apiVersion, query, ntop)
214 | getMethod(servicePath)
215 |
216 | # Python 2.x/3.x incompatibility of input() and raw_input()
217 | # Bind input() to raw_input() in Python 2.x, leave as-is in Python 3.x
218 | try:
219 | input = raw_input
220 | except NameError:
221 | pass
222 |
223 | if __name__ == '__main__':
224 | # Create index if it does not exist
225 | if not getIndex():
226 | createIndex()
227 | else:
228 | ans = input('Index %s already exists ... Do you want to delete it? [Y/n]' % indexName)
229 | if ans.lower() == 'y':
230 | deleteIndex()
231 | print('Re-creating index %s ...' % indexName)
232 | createIndex()
233 | else:
234 | print('Index %s is not deleted ... New content will be added to existing index' % indexName)
235 |
236 | #getIndex()
237 | #uploadDocuments()
238 | uploadDocumentsInChunks(50)
239 | #uploadDocumentsOneByOne()
240 | printDocumentCount()
241 | sampleQuery('child tax credit')
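[Editor's note] All three upload helpers above post the same JSON batch shape to /indexes/<indexName>/docs/index. A sketch of that body with invented field values (only "@search.action" and the field names come from the script; Azure Search also accepts the merge, mergeOrUpload and delete actions):

    example_batch = {
        "value": [
            {
                "@search.action": "upload",
                "Index": "42",                      # key field, must be a string
                "File": "1.1.1.1.1.2.html",
                "ChapterTitle": "Normal Taxes and Surtaxes",
                "SectionTitle": "Tax on Individuals",
                "SubsectionTitle": "Definitions",
                "SubsectionText": "For purposes of this part ...",
                "Keywords": "taxable year, taxable income"
            }
        ]
    }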
--------------------------------------------------------------------------------
/Python/azsearch_query.py:
--------------------------------------------------------------------------------
1 | """
2 | Python code to query Azure Search interactively
3 |
4 | Run this script in the 'Python' directory:
5 | python azsearch_query.py
6 |
7 | See Azure Search REST API docs for more info:
8 | https://docs.microsoft.com/en-us/rest/api/searchservice/index
9 |
10 | """
11 |
12 | import requests
13 | import json
14 | import os
15 |
16 | # This is the service you've already created in Azure Portal
17 | serviceName = 'your_azure_search_service_name'
18 |
19 | # This is the index you've already created in Azure Portal or via the azsearch_mgmt.py script
20 | indexName = 'your_index_name_to_use'
21 |
22 | # Set your service API key, either via an environment variable or enter it below
23 | #apiKey = os.getenv('SEARCH_KEY_DEV', '')
24 | apiKey = 'your_azure_search_service_api_key'
25 | apiVersion = '2016-09-01'
26 |
27 | # Retrieval options to alter the query results
28 | SEARCHFIELDS = None # use all searchable fields for retrieval
29 | #SEARCHFIELDS = 'Keywords, SubsectionText' # use selected fields only for retrieval
30 | FUZZY = False # enable fuzzy search (check API for details)
31 | NTOP = 5 # number of results to return
32 |
33 |
34 | def getServiceUrl():
35 | return 'https://' + serviceName + '.search.windows.net'
36 |
37 | def getMethod(servicePath):
38 | headers = {'Content-type': 'application/json', 'api-key': apiKey}
39 | r = requests.get(getServiceUrl() + servicePath, headers=headers)
40 | #print(r, r.text)
41 | return r
42 |
43 | def postMethod(servicePath, body):
44 | headers = {'Content-type': 'application/json', 'api-key': apiKey}
45 | r = requests.post(getServiceUrl() + servicePath, headers=headers, data=body)
46 | #print(r, r.text)
47 | return r
48 |
49 | def submitQuery(query, fields=None, ntop=10):
50 | servicePath = '/indexes/' + indexName + '/docs?api-version=%s&search=%s&$top=%d' % \
51 | (apiVersion, query, ntop)
52 | if fields != None:
53 | servicePath += '&searchFields=%s' % fields
54 | if FUZZY:
55 | servicePath += '&queryType=full'
56 | r = getMethod(servicePath)
57 | if r.status_code != 200:
58 | print('Failed to retrieve search results')
59 | print(r, r.text)
60 | return
61 | docs = json.loads(r.text)['value']
62 | print('Number of search results = %d\n' % len(docs))
63 | for i, doc in enumerate(docs):
64 | print('Results# %d' % (i+1))
65 | print('Chapter title : %s' % doc['ChapterTitle'].encode('utf8'))
66 | print('Section title : %s' % doc['SectionTitle'].encode('utf8'))
67 | print('Subsection title: %s' % doc['SubsectionTitle'].encode('utf8'))
68 | print('%s\n' % doc['SubsectionText'].encode('utf8'))
69 |
70 | # Python 2.x/3.x incompatibility of input() and raw_input()
71 | # Bind input() to raw_input() in Python 2.x, leave as-is in Python 3.x
72 | try:
73 | input = raw_input
74 | except NameError:
75 | pass
76 |
77 | #####################################################################
78 | # Azure Search interactive query - command-line interface
79 | # Retrieve Azure Search documents via an interactive query
80 | # Fields: Index File ChapterTitle SectionTitle SubsectionTitle SubsectionText Keywords
81 | #####################################################################
82 | if __name__ == '__main__':
83 | while True:
84 | print()
85 | print("Hit enter with no input to quit.")
86 | query = input("Query: ")
87 | if query == '':
88 | exit(0)
89 |
90 | # Submit query to Azure Search and retrieve results
91 | #searchFields = None
92 | searchFields = SEARCHFIELDS
93 | submitQuery(query, fields=searchFields, ntop=NTOP)
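[Editor's note] With FUZZY = True the script appends queryType=full to the request, which switches Azure Search to the full Lucene query syntax. Illustrative queries for the interactive prompt (the misspelling below is deliberate):

    # Simple syntax (default):    child tax credit
    # Full syntax, phrase match:  "child tax credit"
    # Full syntax, fuzzy match:   credti~1   (terms within edit distance 1)
    submitQuery('credti~1', fields=SEARCHFIELDS, ntop=3)  # requires FUZZY = True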
--------------------------------------------------------------------------------
/Python/azsearch_queryall.py:
--------------------------------------------------------------------------------
1 | """
2 | Python code for batch retrieval of Azure Search results for multiple queries in a file
3 |
4 | Run this script in the 'Python' directory:
5 | python azsearch_queryall.py
6 |
7 | See Azure Search REST API docs for more info:
8 | https://docs.microsoft.com/en-us/rest/api/searchservice/index
9 |
10 | """
11 |
12 | import requests
13 | import json
14 | import csv
15 | import os
16 | import pyexcel as pe
17 | import codecs
18 | import pandas as pd
19 |
20 | # This is the service you've already created in Azure Portal
21 | serviceName = 'your_azure_search_service_name'
22 |
23 | # This is the index you've already created in Azure Portal or via the azsearch_mgmt.py script
24 | indexName = 'your_index_name_to_use'
25 |
26 | # Set your service API key, either via an environment variable or enter it below
27 | #apiKey = os.getenv('SEARCH_KEY_DEV', '')
28 | apiKey = 'your_azure_search_service_api_key'
29 | apiVersion = '2016-09-01'
30 |
31 | # Input file containing the list of queries [tab-separated .txt or .tsv, Excel .xls or .xlsx]
32 | infile = os.path.join(os.getcwd(), '../sample/sample_queries.txt')
33 | outfile = os.path.join(os.getcwd(), '../sample/sample_query_answers.xlsx')
34 |
35 | # Retrieval options to alter the query results
36 | SEARCHFIELDS = None # use all searchable fields for retrieval
37 | #SEARCHFIELDS = 'Keywords, SubsectionText' # use selected fields only for retrieval
38 | FUZZY = False # enable fuzzy search (check API for details)
39 | NTOP = 5 # number of results to return
40 |
41 |
42 | def getServiceUrl():
43 | return 'https://' + serviceName + '.search.windows.net'
44 |
45 | def getMethod(servicePath):
46 | headers = {'Content-type': 'application/json', 'api-key': apiKey}
47 | r = requests.get(getServiceUrl() + servicePath, headers=headers)
48 | #print(r, r.text)
49 | return r
50 |
51 | def postMethod(servicePath, body):
52 | headers = {'Content-type': 'application/json', 'api-key': apiKey}
53 | r = requests.post(getServiceUrl() + servicePath, headers=headers, data=body)
54 | #print(r, r.text)
55 | return r
56 |
57 | def submitQuery(query, fields=None, ntop=10, fuzzy=False):
58 | servicePath = '/indexes/' + indexName + '/docs?api-version=%s&search=%s&$top=%d' % \
59 | (apiVersion, query, ntop)
60 | if fields != None:
61 | servicePath += '&searchFields=%s' % fields
62 | if fuzzy:
63 | servicePath += '&queryType=full'
64 |
65 | r = getMethod(servicePath)
66 | if r.status_code != 200:
67 | print('Failed to retrieve search results')
68 | print(query, r, r.text)
69 | return {}
70 | docs = json.loads(r.text)['value']
71 | return docs
72 |
73 |
74 | #############################################################################
75 | # Retrieve Azure Search documents for all queries in batch
76 | # Fields: Index File ChapterTitle SectionTitle SubsectionTitle SubsectionText Keywords
77 | #############################################################################
78 | if __name__ == '__main__':
79 | # Dataframe to keep index of crawled pages
80 | df = pd.DataFrame(columns = ['Qid', 'Query', 'Rank', 'SubsectionText', 'ChapterTitle', 'SectionTitle', 'SubsectionTitle', 'Keywords'])
81 |
82 | if infile.endswith('.tsv') or infile.endswith('.txt'):
83 | records = pd.read_csv(infile, sep='\t', header=0, encoding='utf-8')
84 | rows = records.iterrows()
85 | elif infile.endswith('.xls') or infile.endswith('.xlsx'):
86 | records = pe.iget_records(file_name=infile)
87 | rows = enumerate(records)
88 | else:
89 | print('Unsupported query file extension. Options: tsv, txt, xls, xlsx')
90 | exit(1)
91 |
92 | for i, row in rows:
93 | qid = int(row['Qid'])
94 | query = row['Query']
95 | # Submit query to Azure Search and retrieve results
96 | searchFields = SEARCHFIELDS
97 | docs = submitQuery(query, fields=searchFields, ntop=NTOP, fuzzy=FUZZY)
98 | print('QID: %4d\tNumber of results: %d' % (qid, len(docs)))
99 | for id, doc in enumerate(docs):
100 | chapter_title = doc['ChapterTitle']
101 | section_title = doc['SectionTitle']
102 | subsection_title = doc['SubsectionTitle']
103 | subsection_text = doc['SubsectionText']
104 | keywords = doc['Keywords']
105 |
106 | df = df.append({'Qid' : qid,
107 | 'Query' : query,
108 | 'Rank' : (id + 1),
109 | 'SubsectionText' : subsection_text,
110 | 'ChapterTitle' : chapter_title,
111 | 'SectionTitle' : section_title,
112 | 'SubsectionTitle' : subsection_title,
113 | 'Keywords' : keywords},
114 | ignore_index=True)
115 |
116 | # Save all answers
117 | df['Qid'] = df['Qid'].astype(int)
118 | df['Rank'] = df['Rank'].astype(int)
119 |
120 | if outfile.endswith('.xls') or outfile.endswith('.xlsx'):
121 | df.to_excel(outfile, index=False, encoding='utf-8')
122 | else: # default tab-separated file
123 | df.to_csv(outfile, sep='\t', index=False, encoding='utf-8')
124 |
125 |
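[Editor's note] DataFrame.append, used in the results loop above, was deprecated in pandas 1.4 and removed in 2.0. On current pandas the same logic can collect plain dicts and build the frame once; a minimal sketch:

    rows_out = []
    # ... inside the results loop, instead of df = df.append({...}):
    rows_out.append({'Qid': qid, 'Query': query, 'Rank': id + 1,
                     'SubsectionText': subsection_text, 'ChapterTitle': chapter_title,
                     'SectionTitle': section_title, 'SubsectionTitle': subsection_title,
                     'Keywords': keywords})
    # ... after the loop:
    df = pd.DataFrame(rows_out, columns=['Qid', 'Query', 'Rank', 'SubsectionText',
                                         'ChapterTitle', 'SectionTitle',
                                         'SubsectionTitle', 'Keywords'])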
--------------------------------------------------------------------------------
/Python/keyphrase_extract.py:
--------------------------------------------------------------------------------
1 | ###########################################################################################
2 | # Keyphrase extractor example for experimentation
3 | # Supported algorithms: RAKE, topic rank, single rank, TFIDF and KPMINER
4 | #
5 | # For more info about the RAKE algorithm and implementation, see https://github.com/aneesha/RAKE
6 | # Note: Copies of rake.py and the SmartStoplist.txt stopwords list are included in the
7 | # folder ../JupyterNotebooks. Copy rake.py and the stopwords list files to the current folder.
8 | #
9 | # For more info about the PKE implementations, see https://github.com/boudinfl/pke
10 | # Note: Install PKE from the GitHub repo https://github.com/boudinfl/pke
11 | # Incompatibility alert: PKE only works in Python 2.x at the moment. For Python 3.x, use RAKE.
12 | ###########################################################################################
13 |
14 | # Import base packages
15 | from bs4 import BeautifulSoup
16 | import os, glob, sys, re, string  # string is needed by clean_text(nopunct=True)
17 | from rake import *
18 | import pke
19 |
20 |
21 | # Strip non-ascii characters that break the overlap check
22 | def strip_non_ascii(s):
23 | s = (c for c in s if 0 < ord(c) < 255)
24 | s = ''.join(s)
25 | return s
26 |
27 | # Clean text: remove newlines, compact spaces, strip non_ascii, etc.
28 | def clean_text(text, lowercase=False, nopunct=False):
29 | # Convert to lowercase
30 | if lowercase:
31 | text = text.lower()
32 |
33 | # Remove punctuation
34 | if nopunct:
35 | puncts = string.punctuation
36 | for c in puncts:
37 | text = text.replace(c, ' ')
38 |
39 | # Strip non-ascii characters
40 | text = strip_non_ascii(text)
41 |
42 | # Remove newlines - Compact and strip whitespaces
43 | text = re.sub('[\r\n]+', ' ', text)
44 | text = re.sub('\s+', ' ', text)
45 | return text.strip()
46 |
47 | # Load custom stopwords list
48 | def load_stop_words(stoplist_path):
49 | stop_words = []
50 | for line in open(stoplist_path):
51 | if line.strip()[0:1] != "#":
52 | for word in line.split():
53 | stop_words.append(word)
54 | return stop_words
55 |
56 | # Extract keyphrases using RAKE algorithm. Limit results by minimum score.
57 | def get_keyphrases_rake(infile, stoplist_path=None, min_score=0):
58 | if stoplist_path == None:
59 | stoplist_path = 'SmartStoplist.txt'
60 |
61 | rake = Rake(stoplist_path)
62 | text = open(infile, 'r').read()
63 | keywords = rake.run(text)
64 | phrases = []
65 | for keyword in keywords:
66 | score = keyword[1]
67 | if score >= min_score:
68 | phrases.append(keyword)
69 |
70 | return phrases
71 |
72 | # Extract keyphrases using various algorithms provided by PKE
73 | def get_keyphrases_pke(infile, mode='topic', stoplist_path=None, postags=None, ntop=100):
74 | if stoplist_path == None:
75 | stoplist_path = 'SmartStoplist.txt'
76 | stoplist = load_stop_words(stoplist_path)
77 |
78 | if postags == None:
79 | postags = ['NN', 'NNS', 'NNP', 'NNPS', 'JJ', 'JJR', 'JJS', 'VBN', 'VBD']
80 |
81 | # Run keyphrase extractor - Topic_Rank unsupervised method
82 | if mode == 'topic':
83 | try:
84 | extractor = pke.TopicRank(input_file=infile, language='english')
85 | extractor.read_document(format='raw', stemmer=None)
86 | extractor.candidate_selection(stoplist=stoplist, pos=postags)
87 | extractor.candidate_weighting(threshold=0.25, method='average')
88 | # Candidate ranking happens once for all modes below via get_n_best()
89 | except Exception:
90 | extractor = None
91 |
92 | # Run keyphrase extractor - Single_Rank unsupervised method
93 | elif mode == 'single':
94 | try:
95 | extractor = pke.SingleRank(input_file=infile, language='english')
96 | extractor.read_document(format='raw', stemmer=None)
97 | extractor.candidate_selection(stoplist=stoplist)
98 | extractor.candidate_weighting(normalized=True)
99 | except Exception:
100 | extractor = None
101 |
102 | # Run keyphrase extractor - TfIdf unsupervised method
103 | elif mode == 'tfidf':
104 | try:
105 | extractor = pke.TfIdf(input_file=infile, language='english')
106 | extractor.read_document(format='raw', stemmer=None)
107 | extractor.candidate_selection(stoplist=stoplist)
108 | extractor.candidate_weighting()
109 | except Exception:
110 | extractor = None
111 |
112 | # Run keyphrase extractor - KP_Miner unsupervised method
113 | elif mode == 'kpminer':
114 | try:
115 | extractor = pke.KPMiner(input_file=infile, language='english')
116 | extractor.read_document(format='raw', stemmer=None)
117 | extractor.candidate_selection(stoplist=stoplist)
118 | extractor.candidate_weighting()
119 | except Exception:
120 | extractor = None
121 |
122 | else: # invalid mode
123 | print("Invalid keyphrase extraction algorithm: %s" % mode)
124 | print("Valid PKE algorithms: [topic, single, kpminer, tfidf]")
125 | exit(1)
126 |
127 | phrases = extractor.get_n_best(ntop, redundancy_removal=True) if extractor else []
128 | return phrases
129 |
130 | def usage():
131 | print('Usage %s filename [algo]' % os.path.basename(sys.argv[0]))
132 | print('Algo options: rake, topic, single, tfidf, kpminer')
133 |
134 |
135 | ##############################
136 | # Main processing
137 | ##############################
138 |
139 | if len(sys.argv) < 2:
140 | print('Missing content file name')
141 | usage()
142 | exit(1)
143 |
144 | infile = sys.argv[1]
145 | if len(sys.argv) >= 3:
146 | algo = sys.argv[2]
147 | if algo not in ['rake', 'topic', 'single', 'tfidf', 'kpminer']:
148 | print("Invalid keyphrase extraction algorithm: %s" % algo)
149 | usage()
150 | exit(1)
151 | else:
152 | algo = 'rake'
153 |
154 | # Read custom stopwords list from file - Applies to all algos
155 | # If no stopwords file is supplied, default uses SmartStoplist.txt
156 | stoplist_file = 'SmartStoplist_extended.txt'
157 |
158 | # Select POS tags to use for PKE candidate selection, use default if None
159 | postags = ['NN', 'NNS', 'NNP', 'NNPS', 'JJ', 'JJR', 'JJS', 'VBN', 'VBD']
160 |
161 | # Run keyphrase extraction
162 | if algo == 'rake':
163 | min_score = 1
164 | phrases = get_keyphrases_rake(infile, stoplist_path=stoplist_file, min_score=min_score)
165 | else:
166 | ntop = 200
167 | phrases = get_keyphrases_pke(infile, mode=algo, stoplist_path=stoplist_file, postags=postags, ntop=ntop)
168 |
169 | # Report all keyphrases and their scores
170 | print('Number of extracted keyphrases = %d' % len(phrases))
171 | for phrase in phrases:
172 | print(phrase)
173 |
174 | # Combined list of keyphrases (no scores)
175 | all_phrases = ', '.join(p[0] for p in phrases)
176 | print('\nKeyphrases list: %s' % all_phrases)
177 |
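[Editor's note] Example invocations (the input path is illustrative; the PKE modes additionally require the pke package and, per the header note, Python 2.x):

    python keyphrase_extract.py ../sample/html/1.1.1.1.1.2.html rake
    python keyphrase_extract.py ../sample/html/1.1.1.1.1.2.html topic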
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 |
2 | # Custom Search
3 |
4 | > Sample custom search project using Azure Search and the US Tax Code.
5 |
6 | > Python scripts and Jupyter notebooks that allow you to quickly and iteratively customize,
7 | improve and measure your custom search experience.
8 |
9 |
10 | ## Custom Search Service Development Features in the Python Scripts
11 | * Upload and update search index in Azure Search
12 | * Query interactively to test results
13 | * Query on batch basis to analyze performance
14 | * Extract keyphrases to enhance search index metadata
15 |
16 | ## End-to-End Example Provided in Jupyter Notebooks
17 | * Collect, pre-process, and augment content with keyphrases
18 | * Create an Azure Search index
19 | * Query the index and retrieve results interactively and/or in batch
20 |
21 | ## Getting Started
22 |
23 | 1. Read the [Real Life Code Story](https://www.microsoft.com/reallifecode/), "[Developing a Custom Search Engine for an Expert Chat System.](https://www.microsoft.com/reallifecode/)"
24 | 2. Review the [Azure Search service features](https://azure.microsoft.com/en-us/services/search/).
25 | 3. Get a [free trial subscription to Azure Search.](https://azure.microsoft.com/en-us/free/)
26 | 4. Copy your Azure Search service name and key.
27 | 5. Review the [sample](https://github.com/CatalystCode/CustomSearch/tree/master/sample)
28 | search index input and enriched input in the sample folder to understand the content.
29 | 6. Try the sample Jupyter notebooks for an overview of the end-to-end process for content extraction, augmentation with keyphrases, indexing and retrieval.
30 | * Step 1: Content and keyphrase extraction: [1-content_extraction.ipynb](https://github.com/CatalystCode/CustomSearch/blob/master/JupyterNotebooks/1-content_extraction.ipynb)
31 | * Step 2: Index creation: [2-content_indexing.ipynb](https://github.com/CatalystCode/CustomSearch/blob/master/JupyterNotebooks/2-content_indexing.ipynb)
32 | * Step 3: Interactive and batch search queries: [3-azure_search_query.ipynb](https://github.com/CatalystCode/CustomSearch/blob/master/JupyterNotebooks/3-azure_search_query.ipynb)
33 | 7. A command-line version of the scripts is available under the Python folder; an example session follows this list.
34 | * Run the [azsearch_mgmt.py script](https://github.com/CatalystCode/CustomSearch/blob/master/Python/azsearch_mgmt.py), using your Azure Search name, key and index name of your choice to create a search index.
35 | * Run the [azsearch_query.py script](https://github.com/CatalystCode/CustomSearch/blob/master/Python/azsearch_query.py) to interactively query your new search index and see results.
36 | * Run the [azsearch_queryall.py script](https://github.com/CatalystCode/CustomSearch/blob/master/Python/azsearch_queryall.py) to batch query your new search index and evaluate the results.
37 | * Run the [keyphrase_extract.py script](https://github.com/CatalystCode/CustomSearch/blob/master/Python/keyphrase_extract.py) to experiment with various keyphrase extraction algorithms to enrich the search index metadata. Note this script is Python 2.7 only.
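[Editor's note] A typical command-line session, after setting your service name, API key, and index name inside each script (the package list reflects the imports the scripts use):

    pip install requests pyexcel pyexcel-xls pyexcel-xlsx pandas pytz
    cd Python
    python azsearch_mgmt.py       # create the index and upload the parsed content
    python azsearch_query.py      # query the index interactively
    python azsearch_queryall.py   # batch-run the queries in sample/sample_queries.txt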
38 |
39 |
40 | ## Description
41 | Querying specific content areas quickly and easily is a common enterprise need. Fast traversal of specialized publications, customer support knowledge bases or document repositories allows enterprises to deliver service efficiently and effectively. Simple FAQs don’t cover enough ground, and a string search isn’t effective or efficient for those not familiar with the domain or the document set. Instead, enterprises can deliver a custom search experience that saves their clients time and provides them better service through a question-and-answer format. In this project, we leveraged Azure Search and Cognitive Services, and we share our custom code for iterative testing, measurement and indexer redeployment. In our solution, the customized search engine will form the foundation for delivering a question-and-answer experience in a specific domain area.
42 |
--------------------------------------------------------------------------------
/sample/html/1.1.1.1.1.2.html:
--------------------------------------------------------------------------------
[Text excerpts recovered from the sample US Tax Code HTML pages; markup, blank lines, and interleaved line-number fragments are omitted, and each excerpt keeps its source line number.]
338 | For purposes of this part, an individual shall be treated as not married at the close of the taxable year if such individual is so treated under the provisions of section 7703(b).
353 | In the case of a nonresident alien individual, the taxes imposed by sections 1 and 55 shall apply only as provided by section 871 or 877.
368 | For definition of taxable income, see section 63.
81 | For purposes of paragraph (1), the term “ceiling amount” means, with respect to any taxpayer, the amount (not less than $20,000) determined by the Secretary for the tax rate category in which such taxpayer falls.
96 | The Secretary may provide that this section shall apply also for any taxable year to individuals who itemize their deductions. Any tables prescribed under the preceding sentence shall be on the basis of taxable income.
145 | For purposes of this title, the tax imposed by this section shall be treated as tax imposed by section 1.
160 | Whenever it is necessary to determine the taxable income of an individual to whom this section applies, the taxable income shall be determined under section 63.
175 | For computation of tax by Secretary, see section 6014.
36 | A tax is hereby imposed for each taxable year on the taxable income of every corporation.
116 | Notwithstanding paragraph (1), the amount of the tax imposed by subsection (a) on the taxable income of a qualified personal service corporation (as defined in section 448(d)(2)) shall be equal to 35 percent of the taxable income.
175 | In the case of a foreign corporation, the taxes imposed by subsection (a) and section 55 shall apply only as provided by section 882.
36 | In the case of an eligible individual, there shall be allowed as a credit against the tax imposed by this subtitle for the taxable year an amount equal to the applicable percentage of so much of the qualified retirement savings contributions of the eligible individual for the taxable year as do not exceed $2,000.
198 | The term “eligible individual” means any individual if such individual has attained the age of 18 as of the close of the taxable year.
331 | The qualified retirement savings contributions determined under paragraph (1) shall be reduced (but not below zero) by the aggregate distributions received by the individual during the testing period from any entity of a type to which contributions under paragraph (1) may be made. The preceding sentence shall not apply to the portion of any distribution which is not includible in gross income by reason of a trustee-to-trustee transfer or a rollover distribution.
422 | For purposes of determining distributions received by an individual under subparagraph (A) for any taxable year, any distribution received by the spouse of such individual shall be treated as received by such individual if such individual and spouse file a joint return for such taxable year and for the taxable year during which the spouse receives the distribution.
439 | For purposes of this section, adjusted gross income shall be determined without regard to sections 911, 931, and 933.
454 | Notwithstanding any other provision of law, a qualified retirement savings contribution shall not fail to be included in determining the investment in the contract for purposes of section 72 by reason of the credit under this section.
27 | The basis of an interest in a partnership acquired by a contribution of property, including money, to the partnership shall be the amount of such money and the adjusted basis of such property to the contributing partner at the time of the contribution increased by the amount (if any) of gain recognized under section 721(b) to the contributing partner at such time.