├── .gitignore ├── LICENSE ├── README.md ├── qacrawler ├── README.md ├── __init__.py ├── crawler.py ├── driver_wrapper.py ├── google_dom_info.py ├── jeopardy.py ├── main.py └── sr_parser.py ├── requirements.txt └── tests ├── .DS_Store ├── data ├── cheese - Google Search.html ├── one_result.html ├── parsed.tsv └── tiny_dataset.json ├── main_test.py ├── study_codes.py ├── test_crawler.py └── test_jeopardy.py /.gitignore: -------------------------------------------------------------------------------- 1 | .DS_Store 2 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | BSD 3-Clause License 2 | 3 | Copyright (c) 2017, New York University (Kyunghyun Cho) 4 | All rights reserved. 5 | 6 | Redistribution and use in source and binary forms, with or without 7 | modification, are permitted provided that the following conditions are met: 8 | 9 | * Redistributions of source code must retain the above copyright notice, this 10 | list of conditions and the following disclaimer. 11 | 12 | * Redistributions in binary form must reproduce the above copyright notice, 13 | this list of conditions and the following disclaimer in the documentation 14 | and/or other materials provided with the distribution. 15 | 16 | * Neither the name of the copyright holder nor the names of its 17 | contributors may be used to endorse or promote products derived from 18 | this software without specific prior written permission. 19 | 20 | THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" 21 | AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE 22 | IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE 23 | DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE 24 | FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL 25 | DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR 26 | SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER 27 | CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, 28 | OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE 29 | OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 30 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # SearchQA 2 | 3 | Associated paper: 4 | https://arxiv.org/abs/1704.05179 5 | 6 | Here are the raw, split, and processed files: 7 | https://drive.google.com/drive/u/2/folders/1kBkQGooNyG0h8waaOJpgdGtOnlb1S649 8 | 9 | ------- 10 | 11 | One can collect the original JSON files through web search using the scripts in qacrawler. Please refer to the README in that folder for further details on how to use the scraper. Furthermore, one can use the files in the tests folder to try it out. The above link also contains the original JSON files that were collected using the Jeopardy! dataset. 12 | 13 | There are also stat files that give the number of snippets found for the question associated with the filename. This number can range from 0 to 100. For some questions the crawler was set to collect the first 50 snippets, and for others 100. When the search doesn't return enough results to reach that number, only the available ones are collected.
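For illustration, the snippet count that drives the filtering described next can be read straight from a collected file's `search_results` list (a minimal sketch, assuming a crawler output file; the file name and the `count_snippets` helper are hypothetical and not part of this repository):

```python
import json


def count_snippets(path):
    """Return the number of search-result snippets stored in one collected JSON file."""
    with open(path, 'rt') as f:
        data = json.load(f)
    return len(data['search_results'])


# Keep a question for training only if more than 40 snippets were collected for it
# (the file name below is a made-up example of the crawler's "id-tag.json" pattern).
if count_snippets('000000-4680_jeopardy_history_200.json') > 40:
    pass  # include this question in the training set
```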
During training, we ignored all files that contain 40 or fewer snippets, to eliminate possible trivial cases. Also, the training data ignores snippets from the 51st onward. 14 | 15 | And here is the link for the Jeopardy! files themselves: 16 | https://www.reddit.com/r/datasets/comments/1uyd0t/200000_jeopardy_questions_in_a_json_file/ 17 | 18 | NOTE: We will release the script that converts these to the training files above, with appropriate restrictions. 19 | 20 | ------- 21 | 22 | Some requirements: 23 | nltk==3.2.1 24 | pandas==0.18.1 25 | selenium==2.53.6 26 | pytest==3.0.2 27 | pytorch==0.1.11 28 | -------------------------------------------------------------------------------- /qacrawler/README.md: -------------------------------------------------------------------------------- 1 | # Build 2 | 3 | ## Virtual environment 4 | 5 | If you prefer to work in a new Python virtual environment, first create one via 6 | A) [virtualenvwrapper](https://virtualenvwrapper.readthedocs.io) (easier) or 7 | B) [virtualenv](https://virtualenv.pypa.io) 8 | 9 | A) `mkvirtualenv qasearch` to create the environment. 10 | (By default this creates the `~/.virtualenvs/qasearch/` directory.) 11 | Then `workon qasearch` to activate the environment. 12 | 13 | B) `virtualenv ENV_DIR` to create the environment. 14 | Then `source ENV_DIR/bin/activate` to activate the environment. 15 | 16 | ## Requirements 17 | 18 | Type `pip install -r requirements.txt` in the project folder to install the required Python packages. 19 | (If you don't create a virtual environment, the packages will be added to your system Python installation.) 20 | 21 | # Drivers 22 | 23 | Put the driver executables on your system's PATH. (Ugur: "I sudo-copied them into `/usr/local/bin`" ^_^) 24 | 25 | - [Chrome driver](https://sites.google.com/a/chromium.org/chromedriver/downloads). 26 | - [Firefox driver](https://developer.mozilla.org/en-US/docs/Mozilla/QA/Marionette/WebDriver). 27 | - Use [v0.9.0](https://github.com/mozilla/geckodriver/releases/tag/v0.9.0) 28 | - For some reason you have to rename the executable to `wires` (from `geckodriver`) 29 | 30 | 31 | # Run 32 | 33 | An example command to run the crawler: 34 | 35 | ``` 36 | $ workon PROJECT_VIRTUAL_ENVIRONMENT 37 | $ cd /PROJECT_GIT_REPOSITORY_FOLDER/qacrawler 38 | $ python main.py --jeopardy-json PATH_TO_JEOPARDY_QUESTIONS1.json --first 0 --last 1000 --output-folder OUTPUT_FOLDER --wait-duration 2 --log-level DEBUG --driver Firefox --disable-javascript --num-pages 1 --results-per-page 50 39 | ``` 40 | 41 | Then, to follow the logs, run `tail -f jeopardy_crawler.log`. 42 | -------------------------------------------------------------------------------- /qacrawler/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/nyu-dl/dl4ir-searchQA/fe1ea9fdb888001723ee89daf72b35601d72a4ea/qacrawler/__init__.py -------------------------------------------------------------------------------- /qacrawler/crawler.py: -------------------------------------------------------------------------------- 1 | """ 2 | Module to crawl entries and save the results. 3 | """ 4 | import json 5 | import logging 6 | import os 7 | 8 | import sr_parser 9 | 10 | 11 | def crawl(settings, entries): 12 | """ 13 | Crawl the search results of the given Jeopardy entries according to the given crawler settings.
14 | 15 | :param settings: Crawler settings object 16 | :type settings: CrawlerSettings 17 | :param entries: Jeopary dataset entries 18 | :type entries: collections.Iterable[qacrawler.jeopardy.Entry] 19 | :return: 20 | """ 21 | for entry in entries: 22 | logging.info('Question no %06d: %s. Crawl!' % (entry.id, entry.question)) 23 | results = sr_parser.collect_query_results_from_google(entry.question, settings) 24 | logging.info('Question no %06d. Collected %d search results.' % (entry.id, len(results))) 25 | if results: 26 | save_results_for_entry(results, entry, settings.output_folder) 27 | 28 | 29 | def save_results_for_entry(results, entry, output_folder, file_type='json'): 30 | """ 31 | Format search results into json or tsv and save them to a file. 32 | 33 | :param results: a list of SearchResults 34 | :type results: list[sr_parser.SearchResult] 35 | :param entry: jeopardy Entry 36 | :type entry: jeopardy.Entry 37 | :param output_folder: path to output folder 38 | :type output_folder: str 39 | :param file_type: the type of the saved file. Can only have values ['json', 'tsv'] 40 | :type file_type: str 41 | :rtype: None 42 | """ 43 | if file_type == 'json': 44 | formatted_results = results_list_to_output(results, entry) 45 | else: 46 | formatted_results = results_list_to_tsv(results) 47 | 48 | file_folder = output_folder if output_folder else '.' 49 | file_name = generate_filename(entry, file_type) 50 | file_path = os.path.join(file_folder, file_name) 51 | with open(file_path, 'wt') as f: 52 | f.write(formatted_results) 53 | 54 | 55 | def results_list_to_output(results, entry): 56 | """ 57 | Format a list of SearchResults into a JSON string. 58 | 59 | :param results: a list of SearchResults 60 | :type results: list[sr_parser.SearchResult] 61 | :param entry: Jeopardy entry 62 | :type entry: jeopardy.Entry 63 | :return: string in JSON format 64 | :rtype: str 65 | """ 66 | result_dicts = [res.to_dict() for res in results] 67 | output_dict = {'search_results': result_dicts} 68 | output_dict.update(entry.to_dict()) 69 | results_json = json.dumps(output_dict, indent=4) # Pretty-print via indent. Splits keys into multiple lines. 70 | return results_json 71 | 72 | 73 | def results_list_to_tsv(results): 74 | """ 75 | Format a list of SearchResults into a TSV (Tab Separated Values) string. 76 | 77 | :param results: a list of SearchResults 78 | :type results: list[sr_parser.SearchResult] 79 | :return: string in TSV format 80 | :rtype: str 81 | """ 82 | results_tab_separated = [str(res) for res in results] 83 | results_str = '\n'.join(results_tab_separated) 84 | return results_str 85 | 86 | 87 | def generate_filename(entry, extension): 88 | filename = '%06d-%s.%s' % (entry.id, entry.tag, extension) 89 | return filename 90 | 91 | 92 | # TODO have a process pipeline 93 | def process_pipeline(): 94 | pass 95 | # Get a SearchResult 96 | # save %06d-%s-raw.txt % (q_num_id, q_text_id) 97 | # Filter out exact matches 98 | # save %06d-%s-flt.txt % (q_num_id, q_text_id) 99 | # Tokenize 100 | # save %06d-%s-tok.txt % (q_num_id, q_text_id) 101 | # Other NLP operator 102 | # save %06d-%s-oth.txt % (q_num_id, q_text_id) 103 | 104 | 105 | class CrawlerSettings: 106 | def __init__(self, driver, num_pages, output_folder, wait_duration, 107 | simulate_typing, simulate_clicking, disable_javascript): 108 | """ 109 | Singleton class that holds configuration info for crawler. 110 | 111 | It is to make sharing of crawler settings easier among the project. 
We do not want higher-level 112 | functions in the functionality abstraction hierarchy to be bloated with too many arguments. 113 | 114 | :param driver: selenium driver with which we are going to visit Google 115 | :type driver: selenium.webdriver.chrome.webdriver.WebDriver 116 | :param num_pages: number of results pages to read (at most) 117 | :type num_pages: int 118 | :param output_folder: 119 | :param wait_duration: duration to wait between search results pages in seconds 120 | :type wait_duration: float 121 | :param simulate_typing: indicates whether or not to simulate human key typing 122 | :type simulate_typing: bool 123 | :param simulate_typing: indicates whether or not to simulate human mouse clicking 124 | :type simulate_clicking: bool 125 | """ 126 | self.driver = driver 127 | self.num_pages = num_pages 128 | self.output_folder = output_folder 129 | self.wait_duration = wait_duration 130 | self.simulate_typing = simulate_typing 131 | self.simulate_clicking = simulate_clicking 132 | self.disable_javascript = disable_javascript 133 | -------------------------------------------------------------------------------- /qacrawler/driver_wrapper.py: -------------------------------------------------------------------------------- 1 | """ 2 | This module abstracts away getting a browser driver. 3 | 4 | Currently we only have Chrome bindings. 5 | """ 6 | import logging 7 | import time 8 | 9 | from selenium import webdriver 10 | from selenium.webdriver.common.desired_capabilities import DesiredCapabilities 11 | from selenium.webdriver.common.keys import Keys 12 | from selenium.common.exceptions import NoSuchElementException 13 | 14 | 15 | def get_selenium_driver(driver_type='Firefox'): 16 | if driver_type == 'Firefox': 17 | return get_firefox_driver() 18 | elif driver_type == 'Chrome': 19 | return get_chrome_driver() 20 | elif driver_type == 'PhantomJS': 21 | raise NotImplementedError('PhantomJS usage is not implemented.') 22 | 23 | 24 | def get_chrome_driver(): 25 | """ 26 | Get a Chrome Driver. 27 | 28 | The driver executable (chromedriver) must be in the system path. 29 | 30 | :return: selenium Chrome webdriver 31 | :rtype: selenium.webdriver.chrome.webdriver.WebDriver 32 | """ 33 | driver = webdriver.Chrome() 34 | return driver 35 | 36 | 37 | def get_firefox_driver(): 38 | """ 39 | Get a Firefox Driver. 40 | 41 | The driver executable (wires) must be in the system path. 42 | 43 | :return: selenium Firefox webdriver 44 | :rtype: selenium.webdriver.firefox.webdriver.WebDriver 45 | """ 46 | firefox_capabilities = DesiredCapabilities.FIREFOX 47 | firefox_capabilities['marionette'] = True 48 | driver = webdriver.Firefox(capabilities=firefox_capabilities) 49 | return driver 50 | 51 | 52 | def disable_javascript(driver, driver_type='Firefox'): 53 | if driver_type == 'Firefox': 54 | return disable_javascript_on_firefox(driver) 55 | elif driver_type == 'Chrome': 56 | raise NotImplementedError('Disabling Javascript on Chrome is not implemented.') 57 | 58 | 59 | def disable_javascript_on_firefox(driver): 60 | """ 61 | Manually disable Javascript on Firefox. 62 | 63 | Setting profile preference on Javascript method (shown below) does not work anymore. 64 | Hence, we have to manually 65 | - visit Firefox Configuration page (about:config) 66 | - click on 'I accept the risk!' 
button that warns about messing with Firefox settings 67 | - narrow down the configurations list by typing 'javascript.enabled' into text box 68 | - togge javascript.enabled setting to False by highlighting it and pressing return 69 | 70 | Not working method: 71 | profile = webdriver.FirefoxProfile() 72 | profile.set_preference("javascript.enabled", False); 73 | driver = webdriver.Firefox(profile) 74 | """ 75 | logging.info('Disabling Javascript...') 76 | driver.get(FirefoxConfigInfo.CONFIG_PAGE_URL) # Firefox's configuration page opens when this url is visited 77 | try: 78 | warning_button = driver.find_element_by_id(FirefoxConfigInfo.ACCEPT_WARNING_BUTTON_ID) 79 | warning_button.click() 80 | except NoSuchElementException: 81 | pass 82 | text_box = driver.find_element_by_id(FirefoxConfigInfo.CONFIGURATIONS_SEARCH_BOX) 83 | text_box.send_keys(FirefoxConfigInfo.JAVASCRIPT_CONFIGURATION_NAME) 84 | time.sleep(2.0) # wait for the configuration list to respond to entered text 85 | config_tree = driver.find_element_by_id(FirefoxConfigInfo.CONFIGURATION_ELEMENTS_LIST_ID) 86 | config_tree.send_keys(Keys.ARROW_DOWN) # Select first item in the list 87 | time.sleep(1.0) # wait for the configuration list to respond to selecting setting 88 | config_tree.send_keys(Keys.RETURN) # Toggle selected item 89 | time.sleep(1.0) 90 | 91 | 92 | class FirefoxConfigInfo(object): 93 | """Has DOM information for Firefox's Configuration Page.""" 94 | FIREFOX_VERSION = '49.0.1' # works on this version 95 | CONFIG_PAGE_URL = 'about:config' 96 | ACCEPT_WARNING_BUTTON_ID = 'warningButton' 97 | CONFIGURATIONS_SEARCH_BOX = 'textbox' 98 | JAVASCRIPT_CONFIGURATION_NAME = 'javascript.enabled' 99 | CONFIGURATION_ELEMENTS_LIST_ID = 'configTree' 100 | 101 | 102 | def get_phantomjs_driver(): 103 | # TODO Implement PhantomJS driver version instead of Chrome driver 104 | pass 105 | -------------------------------------------------------------------------------- /qacrawler/google_dom_info.py: -------------------------------------------------------------------------------- 1 | """ 2 | This module includes information about Google search page specific HTML information to be used in parsing. 3 | 4 | It includes class names such as the class assigned to search result DIV's etc. 5 | """ 6 | 7 | 8 | class GoogleDomInfoBase(object): 9 | RESULT_TITLE_CLASS = 'r' 10 | RESULT_URL_CLASS = '_Rm' 11 | RESULT_DESCRIPTION_CLASS = 'st' 12 | RESULT_RELATED_LINKS_DIV_CLASS = 'osl' 13 | RESULT_RELATED_LINK_CLASS = 'fl' 14 | NEXT_PAGE_ID = 'pnnext' 15 | 16 | 17 | class GoogleDomInfoWithJS(GoogleDomInfoBase): 18 | RESULT_DIV_CLASS = 'rc' 19 | SEARCH_BOX_XPATH = '//*[@id="lst-ib"]' 20 | 21 | 22 | class GoogleDomInfoWithoutJS(GoogleDomInfoBase): 23 | RESULT_DIV_CLASS = 'g' 24 | SEARCH_BOX_XPATH = '//*[@id="sbhost"]' 25 | NAVIGATION_LINK_CLASS = 'fl' 26 | PREFERENCES_BUTTON_ID = 'gbi5' 27 | NUMBER_OF_RESULTS_SELECT_ID = 'numsel' 28 | SAVE_PREFERENCES_BUTTON_NAME = 'submit2' 29 | -------------------------------------------------------------------------------- /qacrawler/jeopardy.py: -------------------------------------------------------------------------------- 1 | """ 2 | This module is about abstracting away reading Jeopardy dataset. 3 | 4 | An Entry is an object that holds parsed Jeopardy entry information such as question, answer etc. 5 | 6 | Dataset class is responsible from reading Jeopardy dataset file and converting entries to Entry objects. 
7 | """ 8 | import json 9 | 10 | from bs4 import BeautifulSoup 11 | 12 | 13 | class Dataset(object): 14 | """ 15 | Class that represents Jeopardy! question answer dataset. 16 | """ 17 | def __init__(self, filepath): 18 | self.data = self.load_jeopardy_dataset_from_json_file(filepath) 19 | self.size = len(self.data) 20 | 21 | @staticmethod 22 | def load_jeopardy_dataset_from_json_file(filepath): 23 | """ 24 | Read Jeopardy entries from file. 25 | 26 | :param filepath: the path of json file 27 | :return: A list of dictionaries that hold entry information. 28 | :rtype: list[dict] 29 | """ 30 | with open(filepath, 'rt') as json_file: 31 | data = json.load(json_file) 32 | return data 33 | 34 | def get_entry(self, no): 35 | """ 36 | Create an Entry object from the jeopardy entry at line no. 37 | 38 | :param no: The number of entry, in the order they are saved in json file 39 | :return: at the at the line no 40 | :rtype: Entry 41 | """ 42 | entry = Entry(entry_dict=self.data[no], entry_id=no) 43 | return entry 44 | 45 | 46 | class Entry(object): 47 | """Class that represents a Jeopardy! entry.""" 48 | KEYS = [u'category', u'air_date', u'question', u'value', u'answer', u'round', u'show_number'] 49 | 50 | def __init__(self, entry_dict, entry_id): 51 | """ 52 | Construct an Entry object 53 | 54 | :param entry_dict: Entry data read from json file 55 | :type entry_dict: dict 56 | :param entry_id: An integer ID for the entry, which is the spatial rank in the dataset file 57 | :type entry_id: int 58 | """ 59 | self.id = entry_id 60 | self.question = self.get_question(entry_dict['question']) 61 | self.answer = entry_dict['answer'] 62 | self.category = entry_dict['category'] 63 | self.air_date = entry_dict['air_date'] 64 | self.show_number = entry_dict['show_number'] 65 | self.round = entry_dict['round'] 66 | self.value = entry_dict['value'] 67 | self.tag = self.get_tag(entry_dict) 68 | 69 | def to_dict(self): 70 | d = {key: getattr(self, key) for key in Entry.KEYS} 71 | d.update({'id': self.id}) 72 | return d 73 | 74 | @staticmethod 75 | def get_question(string): 76 | """ 77 | Trim quotation marks and remove HTML tags. 78 | 79 | Trim quotation marks in the beginning and at the end of the string. Remove HTML tags if there are any. (Some 80 | questions have and tags. 81 | 82 | For some reason json version of the dataset has that quotes. 83 | """ 84 | string = string[1:-1] 85 | soup = BeautifulSoup(string, 'html.parser') 86 | return soup.text 87 | 88 | @staticmethod 89 | def get_tag(entry_dict): 90 | """ 91 | Generate a unique tag from entry metadata. 92 | 93 | example: 4999_Double Jeopardy!_AUTHORS IN THEIR YOUTH_$800 94 | 95 | :param entry_dict: A dictionary with keys ['show_number', 'round', 'category', 'value'] 96 | :type entry_dict: dict 97 | :return: tag of entry 98 | :rtype: str 99 | """ 100 | tag_keys = ['show_number', 'round', 'category', 'value'] 101 | tag_parts = [Entry.format_tag_part(entry_dict[key]) for key in tag_keys] 102 | tag = '_'.join(tag_parts) 103 | return tag 104 | 105 | @staticmethod 106 | def format_tag_part(s): 107 | """ 108 | Make input string all lowercase and filter out non-alphanumeric characters. 
109 | 110 | :param s: tag part string to format 111 | :type s: str 112 | :return: formatted string 113 | :rtype: str 114 | """ 115 | if s is None: 116 | return '' 117 | s = s.lower() 118 | s = ''.join([ch for ch in s if ch.isalnum()]) 119 | return s 120 | -------------------------------------------------------------------------------- /qacrawler/main.py: -------------------------------------------------------------------------------- 1 | """ 2 | Main entry point of qacrawler 3 | 4 | Parses command line arguments. Loads the dataset. Chooses the entries to crawl. Crawl chosen entries. Quit. 5 | """ 6 | import argparse 7 | import logging 8 | import os 9 | import time 10 | 11 | from selenium.webdriver.common.by import By 12 | from selenium.webdriver.support.ui import Select 13 | 14 | import driver_wrapper 15 | import jeopardy 16 | import crawler 17 | import sr_parser 18 | from google_dom_info import GoogleDomInfoWithoutJS as GDom 19 | 20 | 21 | def main(): 22 | crawler_settings, entries = initialize() 23 | crawler.crawl(crawler_settings, entries) 24 | finalize(crawler_settings.driver) 25 | 26 | 27 | def initialize(): 28 | """Initialize collector. 29 | 30 | Initialize by parsing command line arguments, reading Jeopardy entries from dataset file 31 | and getting browser driver. 32 | """ 33 | args = parse_command_line_arguments() 34 | configure_logging(log_level=args.log_level) 35 | create_folder_if_not_exists(args.output_folder) 36 | dataset = jeopardy.Dataset(filepath=args.jeopardy_json) 37 | entries = get_entries_to_search(dataset, first=args.first, last=args.last) 38 | driver = driver_wrapper.get_selenium_driver(args.driver_type) 39 | if args.disable_javascript: 40 | driver_wrapper.disable_javascript(driver, args.driver_type) 41 | set_number_of_results_per_page(driver, args.results_per_page) 42 | settings = crawler.CrawlerSettings(driver, args.num_pages, args.output_folder, args.wait_duration, 43 | args.simulate_typing, args.simulate_clicking, args.disable_javascript) 44 | logging.info('Start.') 45 | return settings, entries 46 | 47 | 48 | def parse_command_line_arguments(): 49 | """Parse command line arguments 50 | 51 | :return: An object of which attributes are command line arguments 52 | :rtype: argparse.ArgumentParser 53 | """ 54 | argparser = argparse.ArgumentParser(description='Crawl Google to generate QA dataset') 55 | argparser.add_argument('-j', '--jeopardy-json', type=str, required=True, 56 | help='Path to Jeopardy dataset file') 57 | argparser.add_argument('-d', '--driver-type', type=str, default='Firefox', 58 | help='The browser/driver type to be used by crawler', 59 | choices=['Firefox', 'Chrome', 'PhantomJS']) 60 | argparser.add_argument('-o', '--output-folder', type=str, required=True, 61 | help='If a folder is given the output files will be written there. 
' 62 | 'If given folder does not exist it will be created first.') 63 | argparser.add_argument('-f', '--first', type=int, default=0, 64 | help='First entry from which to start reading questions') 65 | argparser.add_argument('-l', '--last', type=int, default=216930, 66 | help='Last entry at which to stop reading questions') 67 | argparser.add_argument('-n', '--num-pages', type=int, default=1, 68 | help='Number of search result pages to parse per query') 69 | argparser.add_argument('-g', '--log-level', type=str, default='INFO', 70 | help='Set the level of log messages below which will be saved to log file', 71 | choices=['DEBUG', 'INFO', 'WARNING', 'ERROR']) 72 | argparser.add_argument('-w', '--wait-duration', type=int, default=4, 73 | help='Number of seconds to wait before getting the next page') 74 | argparser.add_argument('--simulate-typing', action='store_true', 75 | help='When included simulates human typing by pressing keys one-by-one') 76 | argparser.add_argument('--simulate-clicking', action='store_true', 77 | help='When included simulates mouse clicking on next page link') 78 | argparser.add_argument('--disable-javascript', action='store_true', 79 | help='When included disables JavaScript (only for Firefox)') 80 | argparser.add_argument('--results-per-page', type=int, default=10, 81 | help='The number of search results in a page per query', 82 | choices=[10, 20, 30, 50, 100]) 83 | args = argparser.parse_args() 84 | return args 85 | 86 | 87 | def configure_logging(log_level, log_file='jeopardy_crawler.log', 88 | log_format='%(levelname)s:%(asctime)s:%(module)s:%(funcName)s:%(message)s'): 89 | """Configure Crawler's logger and Suppress selenium's browser logger. 90 | 91 | Selenium's browser logger logs every HTTP requests etc. We are not interested in that.""" 92 | log_level_num = getattr(logging, log_level) 93 | logging.basicConfig(filename=log_file, format=log_format, level=log_level_num) 94 | from selenium.webdriver.remote.remote_connection import LOGGER as SELENIUM_LOGGER 95 | SELENIUM_LOGGER.setLevel(logging.INFO) 96 | 97 | 98 | def create_folder_if_not_exists(folder): 99 | """ 100 | :param folder: path to folder 101 | :type folder: str 102 | :rtype None: 103 | """ 104 | if not os.path.exists(folder): 105 | os.makedirs(folder) 106 | 107 | 108 | def get_entries_to_search(dataset, first, last): 109 | """Get entries to do search queries. 110 | 111 | :rtype generator[jeopardy.Entry]""" 112 | if last >= dataset.size: last = dataset.size 113 | entries = (dataset.get_entry(no) for no in range(first, last)) 114 | return entries 115 | 116 | 117 | def set_number_of_results_per_page(driver, num_results): 118 | """Manually set the number of search results per page when Javascript is disabled. 119 | 120 | Manually 121 | - visit Google Search's preferences page 122 | - find the element to set number of results per page 123 | - set it to the value given in command-line arguments 124 | """ 125 | logging.debug('Setting search results per page to %d...' % num_results) 126 | redirect_nonjavascript_version_of_google_by_making_a_dummy_query(driver) 127 | visit_google_search_preferences_page(driver) 128 | find_select_and_set(driver, num_results) 129 | save_preferences(driver) 130 | 131 | 132 | def redirect_nonjavascript_version_of_google_by_making_a_dummy_query(driver): 133 | """ 134 | If we first visit preferences page directly, Google gives "Your cookies seem to be disabled." warning. 135 | Hence first do a dummy search, so that Google redirects us to its non-Javascript version. 
136 | """ 137 | sr_parser.visit_google(driver, 'Increase search results per page...') 138 | sr_parser.wait_for_presence_and_get_element(driver, (By.XPATH, GDom.SEARCH_BOX_XPATH)) 139 | 140 | 141 | def visit_google_search_preferences_page(driver): 142 | google_preferences_page_url = 'http://www.google.com/preferences?hl=en' 143 | driver.get(google_preferences_page_url) 144 | 145 | 146 | def find_select_and_set(driver, num_results): 147 | select = wait_for_and_scroll_into_view_and_get_num_results_select(driver) 148 | time.sleep(1) 149 | select.select_by_value(str(num_results)) 150 | logging.info('Selected value: %s' % select.first_selected_option.text) 151 | time.sleep(1) 152 | 153 | 154 | def wait_for_and_scroll_into_view_and_get_num_results_select(driver): 155 | """ 156 | :return: Select html element to set number of search results per page 157 | :rtype: selenium.webdriver.support.ui.Select 158 | """ 159 | select_locator = (By.ID, GDom.NUMBER_OF_RESULTS_SELECT_ID) 160 | select = sr_parser.wait_for_presence_and_get_element(driver, select_locator) 161 | coordinates = select.location_once_scrolled_into_view # first scrolls to element then returns location 162 | select = Select(select) 163 | return select 164 | 165 | 166 | def save_preferences(driver): 167 | save_preferences_button = driver.find_element(By.NAME, GDom.SAVE_PREFERENCES_BUTTON_NAME) 168 | save_preferences_button.click() 169 | time.sleep(1) 170 | 171 | 172 | def finalize(driver): 173 | logging.info('End.') 174 | driver.quit() 175 | 176 | 177 | if __name__ == '__main__': 178 | main() 179 | -------------------------------------------------------------------------------- /qacrawler/sr_parser.py: -------------------------------------------------------------------------------- 1 | """ 2 | This module deals with crawling Google search results and parsing result pages. 3 | 4 | A SearchResult is an object that holds parsed search results. 5 | 6 | collect_query_results_from_google() function is to crawl a query's search results into SearchResult objects. 7 | 8 | Terminology is taken from Google help at: https://support.google.com/websearch/answer/35891?hl=en#results 9 | """ 10 | import logging 11 | import random 12 | import sys 13 | import time 14 | 15 | from selenium.webdriver.common.action_chains import ActionChains 16 | from selenium.webdriver.common.by import By 17 | from selenium.webdriver.common.keys import Keys 18 | from selenium.common.exceptions import NoSuchElementException, StaleElementReferenceException 19 | from selenium.webdriver.support.ui import WebDriverWait 20 | from selenium.webdriver.support import expected_conditions as EC 21 | from selenium.common.exceptions import TimeoutException 22 | from bs4 import BeautifulSoup 23 | 24 | import google_dom_info 25 | 26 | GDOM = None 27 | 28 | 29 | def collect_query_results_from_google(query, settings): 30 | """ 31 | Get formatted search results given a search query and the number of pages to parse. 
32 | 33 | :param query: search query 34 | :type query: str 35 | :param settings: Crawler settings object 36 | :type settings: qacrawler.crawler.CrawlerSettings 37 | :return: A list of SearchResult objects 38 | :rtype: list[SearchResult] 39 | """ 40 | set_gdom(settings.disable_javascript) 41 | if settings.disable_javascript: 42 | visit_google(settings.driver, query='%20') # request a search result page with no results 43 | else: 44 | visit_google(settings.driver) 45 | check_google_bot_police(settings.driver) 46 | search_box = wait_for_and_get_search_box(settings.driver) 47 | submit_query(query, search_box, settings.simulate_typing, settings.driver) 48 | all_results = parse_n_search_result_pages(settings, 49 | num_pages=settings.num_pages, wait_duration=settings.wait_duration) 50 | return all_results 51 | 52 | 53 | def set_gdom(disable_javascript): 54 | """Choose the Google DOM information according whether Javascript is disabled or not.""" 55 | global GDOM 56 | if disable_javascript: 57 | GDOM = google_dom_info.GoogleDomInfoWithoutJS 58 | else: 59 | GDOM = google_dom_info.GoogleDomInfoWithJS 60 | 61 | 62 | def visit_google(driver, query=None): 63 | """If a query is given directly search that query otherwise just open Google's front page.""" 64 | url = 'http://google.com' 65 | if query is not None: 66 | url = url + '/search?q=' + query 67 | driver.get(url) # visit a search results page with no search results 68 | 69 | 70 | def check_google_bot_police(driver): 71 | """Give error and quit if caught by Google Bot Police.""" 72 | robot_police_text = 'Our systems have detected unusual traffic' 73 | if robot_police_text in driver.page_source: 74 | logging.critical('Caught by Google Bot Police :-(. Exiting...') 75 | quit_driver_and_exit(driver) 76 | 77 | 78 | def wait_for_and_get_search_box(driver): 79 | search_box_locator = (By.XPATH, GDOM.SEARCH_BOX_XPATH) 80 | search_box = wait_for_presence_and_get_element(driver, search_box_locator) 81 | return search_box 82 | 83 | 84 | def wait_for_presence_and_get_element(driver, locator, timeout=10): 85 | try: 86 | condition = EC.presence_of_element_located(locator) 87 | element = WebDriverWait(driver, timeout=timeout).until(condition) 88 | return element 89 | except TimeoutException: 90 | logging.critical('TimeoutException: Could not get the element at %s. ' % str(locator) + 91 | 'There is a connection problem. 
Exiting.') 92 | quit_driver_and_exit(driver) 93 | 94 | 95 | def quit_driver_and_exit(driver): 96 | driver.quit() 97 | sys.exit() 98 | 99 | 100 | def submit_query(query, search_box, simulate_typing, driver): 101 | """Choose query submission type according to simulate_typing.""" 102 | if simulate_typing: 103 | submit_query_by_typing(driver, query, search_box) 104 | else: 105 | submit_query_at_once(query, search_box) 106 | 107 | 108 | def submit_query_by_typing(driver, query, search_box): 109 | """Submit query by sending keys one-by-one.""" 110 | ActionChains(driver).move_to_element(search_box).click().perform() 111 | simulate_typing(search_box, query) 112 | result_divs = driver.find_elements_by_class_name(GDOM.RESULT_DIV_CLASS) 113 | search_box.send_keys(Keys.ENTER) 114 | search_box.submit() 115 | if result_divs: 116 | wait_for_page_load_after_submission(driver, result_divs) 117 | 118 | 119 | def wait_for_page_load_after_submission(driver, result_divs): 120 | """Wait for the query string to arrive Search Engine.""" 121 | logging.debug('wait for staleness after submitting search query') 122 | WebDriverWait(driver, timeout=5).until( 123 | EC.staleness_of(result_divs[0]) 124 | ) 125 | logging.debug('staleness ended') 126 | 127 | 128 | def simulate_typing(element, text): 129 | for char in text: 130 | element.send_keys(char) 131 | wait_with_variance(0.05, variation=0.05) 132 | 133 | 134 | def submit_query_at_once(query, search_box): 135 | """Submit query by entering whole question string at once.""" 136 | search_box.clear() 137 | search_box.send_keys(query) 138 | search_box.send_keys(Keys.ENTER) 139 | # search_box.submit() 140 | 141 | 142 | def parse_n_search_result_pages(settings, num_pages, wait_duration): 143 | """ 144 | Parse num_pages of search result pages and return all SearchResults found. 145 | 146 | :param wait_duration: duration to wait between search results pages in seconds 147 | :type wait_duration: float 148 | :param num_pages: Number of search result pages to parse per query 149 | :type num_pages: int 150 | :return: list of SearchResult parsed from num_pages of search result pages 151 | :rtype: list[SearchResult] 152 | """ 153 | all_results = [] 154 | for page_no in range(num_pages): 155 | logging.debug('Parsing page %d.' % page_no) 156 | page_results = parse_one_search_result_page(settings.driver) 157 | if not page_results: 158 | return all_results 159 | all_results.extend(page_results) 160 | wait_with_variance(duration=wait_duration) 161 | next_page_exists = request_next_page(settings.driver, settings.simulate_clicking, settings.disable_javascript) 162 | if not next_page_exists: 163 | break 164 | return all_results 165 | 166 | 167 | def request_next_page(driver, simulate_clicking, disable_javascript): 168 | """ 169 | Find next page element/url and if exists request it. 
170 | 171 | :param driver: selenium driver with which we'll collect data from opened search result page 172 | :type driver: selenium.webdriver.chrome.webdriver.WebDriver 173 | :param simulate_clicking: if True request next page by clicking on element otherwise request it via 174 | driver.get method 175 | :type simulate_clicking: bool 176 | :return: Whether the next page element exists or not 177 | :rtype: bool 178 | """ 179 | if simulate_clicking: 180 | next_page_element = get_next_page_element(driver) 181 | if next_page_element: 182 | ActionChains(driver).move_to_element(next_page_element).perform() 183 | wait_with_variance(duration=0.2, variation=0.2) 184 | ActionChains(driver).click(next_page_element).perform() 185 | wait_for_page_load_after_clicking_on_link(driver) 186 | return True 187 | else: 188 | next_page_url = get_next_page_url_no_js(driver) if disable_javascript else get_next_page_url_js(driver) 189 | if next_page_url: 190 | driver.get(next_page_url) 191 | return True 192 | return False 193 | 194 | 195 | def wait_for_page_load_after_clicking_on_link(driver): 196 | """ 197 | Unlike opening a new page via driver.get opening a new page via clicking on a link is a more complicated process. 198 | Clicking on a link is an asynchronious operation, i.e. browser driver does not wait for the action that is 199 | trigger by clicking to finish. Because the action can be anything including opening a new page, 200 | sending an AJAX request, or even nothing. 201 | 202 | Our technique to make the operation a blocking operation by waiting until the new page request is to see whether the 203 | search results that appeared while we were simulated-typing went stale. 204 | """ 205 | a_div = get_search_result_divs(driver)[0] 206 | logging.debug('wait for staleness after clicking next page') 207 | WebDriverWait(driver, timeout=5).until( 208 | EC.staleness_of(a_div) 209 | ) 210 | logging.debug('staleness ended') 211 | 212 | 213 | def parse_one_search_result_page(driver): 214 | """ 215 | Wait the search result page to load and call page parse function. 216 | 217 | :param driver: selenium driver with which we'll open results page 218 | :type driver: selenium.webdriver.chrome.webdriver.WebDriver 219 | :return: A list of SearchResult objects. If there are no results return an empty list 220 | :rtype: list[SearchResult] 221 | """ 222 | try: 223 | wait_for_search_results(driver) 224 | except TimeoutException: 225 | logging.warning('TimeoutException: Either there are no search results (e.g. false alarm) ' 226 | 'or Google realized that we are a bot :-( ' 227 | 'or there is connection problem (less likely).') 228 | return [] 229 | results = parse_opened_results_page(driver) 230 | logging.debug('Collected %d search results.' % len(results)) 231 | return results 232 | 233 | 234 | def wait_for_search_results(driver, timeout=10): 235 | elem = WebDriverWait(driver, timeout=timeout).until( 236 | EC.presence_of_element_located((By.CLASS_NAME, GDOM.RESULT_DIV_CLASS)) 237 | ) 238 | logging.debug('wait_for_search_results ENDED') 239 | 240 | 241 | def wait_with_variance(duration, variation=1.0): 242 | """Wait a while to pass Google's bot detection.""" 243 | uniform = random.random() * variation 244 | duration += uniform 245 | time.sleep(duration) 246 | 247 | 248 | def get_next_page_element(driver): 249 | """ 250 | Get HTML element for next page link so that we can click on it. 
251 | 252 | :param driver: the selenium driver that is used to open the search results page 253 | :type driver: selenium.webdriver.chrome.webdriver.WebDriver 254 | :return: next page element if exists else None 255 | :rtype: selenium.webdriver.remote.webelement.WebElement 256 | """ 257 | next_page_elements = driver.find_elements(By.ID, GDOM.NEXT_PAGE_ID) 258 | if next_page_elements: 259 | next_page_element = next_page_elements[0] 260 | else: 261 | next_page_element = None 262 | logging.debug('There is no next page.') 263 | return next_page_element 264 | 265 | 266 | def get_next_page_url_js(driver): 267 | """ 268 | Get next page url when Javascript is enabled. 269 | 270 | :param driver: the selenium driver that is used to open the search results page 271 | :type driver: selenium.webdriver.chrome.webdriver.WebDriver 272 | :return: url if exists else None 273 | :rtype: str 274 | """ 275 | next_page_link = driver.find_elements(By.ID, GDOM.NEXT_PAGE_ID) 276 | if next_page_link: 277 | next_page_url = next_page_link[0].get_attribute('href') 278 | else: 279 | next_page_url = None 280 | logging.debug('There is no next page.') 281 | return next_page_url 282 | 283 | 284 | def get_next_page_url_no_js(driver): 285 | """ 286 | Get next page url when Javascript is disabled. 287 | 288 | Google does not give an HTML id to next page link element when Javascript is disabled. 289 | 290 | :param driver: the selenium driver that is used to open the search results page 291 | :type driver: selenium.webdriver.chrome.webdriver.WebDriver 292 | :return: url if exists else None 293 | :rtype: str 294 | """ 295 | navigation_elements = driver.find_elements(By.CLASS_NAME, GDOM.NAVIGATION_LINK_CLASS) 296 | if not navigation_elements: 297 | logging.debug('There is no next page.') 298 | return None 299 | last_navigation_element = navigation_elements[-1] 300 | if last_navigation_element.text != 'Next': 301 | logging.debug('There is no next page.') 302 | return None 303 | next_page_url = last_navigation_element.get_attribute('href') 304 | return next_page_url 305 | 306 | 307 | def parse_opened_results_page(driver): 308 | """Parse a loaded search result page into a list of SearchResult objects. 309 | 310 | It parses 4 parts ("title", "url", "short description" and, if exists, "related links") from each 311 | search result item. 312 | 313 | :param driver: the selenium driver that is used to open the search results page 314 | :type driver: selenium.webdriver.chrome.webdriver.WebDriver 315 | :return: A list of SearchResult objects 316 | :rtype: list[SearchResult] 317 | """ 318 | elements = get_search_result_divs(driver) 319 | results = [] 320 | for no, elem in enumerate(elements): 321 | try: 322 | results.append(SearchResult(elem)) 323 | except NotAParsableSearchResult: 324 | logging.debug('Search result DIV no %d is not parsable. ' 325 | 'It can be a non-website result such as a video.' % no) 326 | continue 327 | return results 328 | 329 | 330 | def get_search_result_divs(driver): 331 | """From the opened page, get a list of DIVs where each DIV is a result. 332 | 333 | :param driver: selenium driver with which results page is opened 334 | :type driver: selenium.webdriver.chrome.webdriver.WebDriver 335 | :rtype: list[bs4.element.Tag] 336 | """ 337 | soup = BeautifulSoup(driver.page_source, 'html.parser') 338 | elements = soup.select('.' 
+ GDOM.RESULT_DIV_CLASS) # tag: div 339 | return elements 340 | 341 | 342 | class SearchResult(object): 343 | """Parses search result information such as title, url, snippet from div and keep them in related attributes""" 344 | def __init__(self, element): 345 | """Parse a search result DIV to get title, url, short description. 346 | 347 | :param element: HTML element, a DIV that holds a search result 348 | :type element: bs4.element.Tag 349 | """ 350 | self.element = element 351 | self.title = self.parse_title(element) 352 | logging.debug('parsing result ' + self.title) 353 | self.url = self.parse_url(element) 354 | self.snippet = self.parse_snippet(element) 355 | self.related_links = self.parse_related_links(element) 356 | 357 | @staticmethod 358 | def parse_title(element): 359 | title = element.select_one('.' + GDOM.RESULT_TITLE_CLASS) 360 | if title is None: 361 | raise NotAParsableSearchResult 362 | return title.text.encode('ascii', 'ignore') 363 | 364 | @staticmethod 365 | def parse_url(element): 366 | h3 = element.select_one('.' + GDOM.RESULT_TITLE_CLASS) 367 | anchor = h3.find('a') # tag: a 368 | if anchor is None: 369 | raise NotAParsableSearchResult 370 | url = anchor['href'] 371 | return url.encode('ascii', 'ignore') 372 | 373 | @staticmethod 374 | def parse_snippet(element): 375 | # snippet might not exist for result item 376 | snippet = element.select_one('.' + GDOM.RESULT_DESCRIPTION_CLASS) # tag: span 377 | return snippet.text.encode('ascii', 'ignore') if snippet else None 378 | 379 | @staticmethod 380 | def parse_related_links(element): 381 | # related links might not exist for result item 382 | # first check if their div exists. If yes, get the list of link elements 383 | related_links_div = element.select_one('.' + GDOM.RESULT_RELATED_LINKS_DIV_CLASS) # tag: div 384 | if related_links_div: 385 | related_links = related_links_div.select('.' 
+ GDOM.RESULT_RELATED_LINK_CLASS) # tag: a 386 | related_links = [rl.text.encode('ascii', 'ignore') for rl in related_links] 387 | else: 388 | related_links = None 389 | return related_links 390 | 391 | def to_dict(self): 392 | """Convert data in SearchResult object into a dictionary""" 393 | dict_representation = {'title': self.title, 394 | 'url': self.url, 395 | 'snippet': self.snippet, 396 | 'related_links': self.related_links} 397 | return dict_representation 398 | 399 | def __str__(self): 400 | """Format a parsed Google search result into a tab separated string""" 401 | s = '%s\t' % self.title 402 | s += '%s\t' % self.url 403 | s += '%s\t' % (self.snippet if self.snippet else '') 404 | s += '%s' % (';'.join(self.related_links) if self.related_links else '') 405 | s = s 406 | return s 407 | 408 | 409 | class NotAParsableSearchResult(Exception): 410 | pass 411 | -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | nltk==3.2.1 2 | pandas==0.18.1 3 | selenium==2.53.6 4 | # development 5 | pytest==3.0.2 6 | -------------------------------------------------------------------------------- /tests/.DS_Store: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/nyu-dl/dl4ir-searchQA/fe1ea9fdb888001723ee89daf72b35601d72a4ea/tests/.DS_Store -------------------------------------------------------------------------------- /tests/data/one_result.html: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | 5 | Title 6 | 7 | 8 |

<div class="g"><div class="rc"><h3 class="r"><a href="https://en.wikipedia.org/wiki/Cheese">Cheese - Wikipedia, the free encyclopedia</a></h3><cite class="_Rm">https://en.wikipedia.org/wiki/Cheese</cite> <span>Wikipedia</span> <span>Loading...</span> <span class="st">Cheese is a food derived from milk that is produced in a wide range of flavors, textures, and forms by coagulation of the milk protein casein. It comprises proteins ...</span><div class="osl"><a class="fl">Etymology</a> · <a class="fl">History</a> · <a class="fl">Production</a> · <a class="fl">Processing</a></div></div></div>
9 | 10 | 11 | -------------------------------------------------------------------------------- /tests/data/parsed.tsv: -------------------------------------------------------------------------------- 1 | Cheese - Wikipedia, the free encyclopedia https://en.wikipedia.org/wiki/Cheese Cheese is a food derived from milk that is produced in a wide range of flavors, textures, and forms by coagulation of the milk protein casein. It comprises proteins ... Etymology;History;Production;Processing 2 | Cheese.com - World's Greatest Cheese Resource http://www.cheese.com/ Everything you want to know about cheese. Includes search features. 3 | Alphabetical list of cheeses - Cheese.com http://www.cheese.com/alphabetical/ All cheeses in alphabetical order. ... Aged Cashew & Blue Green Algae Cheese. 1; 2; 3; 4; 5. Per page: 20, 40, 60, 80, 100. #. 2015 Worldnews, Inc. Privacy ... 4 | Murray's Cheese http://www.murrayscheese.com/ Imported and domestic cheeses, searchable by name or by country. Also offers gift baskets and certificates. 5 | Order Cheese Online | Fast Delivery by FreshDirect https://www.freshdirect.com/browse.jsp?id=che Visit our cheese counter to order high quality cheese favorites. Choose from a wide selection of best-selling brands and favorites from around the world. 6 | Murray's Cheese Bar http://www.murrayscheesebar.com/ ... Reservations Events Press Contact. cheese-bar-spring-menu-burrata-pea-pesto-280A7887. fried-chicken-sandwich-cheese-bar-stadium-280A8551.jpg. 7 | French cheese board http://frenchcheeseboard.com/ The French Cheese Board is also a fab lab of ideas, where people can participate in cooking lessons, wine and cheese pairings sessions, interactive ... 8 | Cabot Creamery: Cheddar Cheese & Other Dairy Products from Vermont https://www.cabotcheese.coop/ Cabot Cheese, owned & operated by real farmers, has been making award winning cheddar cheese & other dairy products since 1919. Learn More! 9 | No, cheese is not just like crack | Science News https://www.sciencenews.org/blog/scicurious/no-cheese-not-just-crack Cheese is a delicious invention. But if you saw the news last week, you might think it's on its way to being classified as a Schedule II drug. 10 | What is Big Block of Cheese Day? - The Atlantic http://www.theatlantic.com/politics/archive/2015/01/the-real-story-of-the-white-house-and-the-big-block-of-cheese/384676/ Wednesday, as you may have heard, is the White House's second annual Big Block of Cheese Day. Inspired by a fictional character in a ... 11 | After 75 Years, the Cheese Stands Alone - The New York Times http://www.nytimes.com/2016/07/17/nyregion/after-75-years-the-cheese-stands-alone.html A Brooklyn family's edible heirloom: a quarter wheel of cheese. ... That anything as perishable as cheese should survive 75 years is remarkable ... -------------------------------------------------------------------------------- /tests/data/tiny_dataset.json: -------------------------------------------------------------------------------- 1 | [ 2 | { 3 | "category": "HISTORY", 4 | "air_date": "2004-12-31", 5 | "question": "'For the last 8 years of his life, Galileo was under house arrest for espousing this man's theory'", 6 | "value": "$200", 7 | "answer": "Copernicus", 8 | "round": "Jeopardy!", 9 | "show_number": "4680" 10 | }, 11 | { 12 | "category": "ESPN's TOP 10 ALL-TIME ATHLETES", 13 | "air_date": "2004-12-31", 14 | "question": "'No. 
2: 1912 Olympian; football star at Carlisle Indian School; 6 MLB seasons with the Reds, Giants & Braves'", 15 | "value": "$200", 16 | "answer": "Jim Thorpe", 17 | "round": "Jeopardy!", 18 | "show_number": "4680" 19 | }, 20 | { 21 | "category": "EVERYBODY TALKS ABOUT IT...", 22 | "air_date": "2004-12-31", 23 | "question": "'The city of Yuma in this state has a record average of 4,055 hours of sunshine each year'", 24 | "value": "$200", 25 | "answer": "Arizona", 26 | "round": "Jeopardy!", 27 | "show_number": "4680" 28 | }, 29 | { 30 | "category": "THE COMPANY LINE", 31 | "air_date": "2004-12-31", 32 | "question": "'In 1963, live on \"The Art Linkletter Show\", this company served its billionth burger'", 33 | "value": "$200", 34 | "answer": "McDonald\\'s", 35 | "round": "Jeopardy!", 36 | "show_number": "4680" 37 | }, 38 | { 39 | "category": "EPITAPHS & TRIBUTES", 40 | "air_date": "2004-12-31", 41 | "question": "'Signer of the Dec. of Indep., framer of the Constitution of Mass., second President of the United States'", 42 | "value": "$200", 43 | "answer": "John Adams", 44 | "round": "Jeopardy!", 45 | "show_number": "4680" 46 | }, 47 | { 48 | "category": "3-LETTER WORDS", 49 | "air_date": "2004-12-31", 50 | "question": "'In the title of an Aesop fable, this insect shared billing with a grasshopper'", 51 | "value": "$200", 52 | "answer": "the ant", 53 | "round": "Jeopardy!", 54 | "show_number": "4680" 55 | }, 56 | { 57 | "category": "HISTORY", 58 | "air_date": "2004-12-31", 59 | "question": "'Built in 312 B.C. to link Rome & the South of Italy, it's still in use today'", 60 | "value": "$400", 61 | "answer": "the Appian Way", 62 | "round": "Jeopardy!", 63 | "show_number": "4680" 64 | } 65 | ] 66 | -------------------------------------------------------------------------------- /tests/main_test.py: -------------------------------------------------------------------------------- 1 | import os 2 | import sys 3 | sys.path.insert(0, os.path.abspath('..')) 4 | 5 | import pandas as pd 6 | 7 | from sr_parser import main as p 8 | 9 | 10 | def nlp_parsing(): 11 | text = ['What track on beautiful life has the same name as a country?', 12 | 'what kinds of movie is the desperate hours'] 13 | parsed_text = p.parser(text) 14 | all_results = p.subject_object_getter(parsed_text) 15 | for i, j in zip(all_results, text): 16 | i.append(j) 17 | df = pd.DataFrame(all_results, columns=[ 18 | 'subject', 'object', 'raw_question']) 19 | assert df.ix[0, 0][0] == 'track' 20 | -------------------------------------------------------------------------------- /tests/study_codes.py: -------------------------------------------------------------------------------- 1 | """ 2 | The codes in each function are meant to be run in a Python interpreter at project root folder. 3 | 4 | These are different than test modules. Here the goal is to have some variables in an interpreted to be 5 | analyzed interactively. For example, to create a SearchResult, or Entry object and see how it works. 
6 | 7 | In PyCharm, to run the source code from the editor in Python console: 8 | 1) Select code in editor 9 | 2) Choose "Execute selection in console" in context menu (ALT + SHIFT + E) 10 | 11 | Source: https://www.jetbrains.com/help/pycharm/2016.2/loading-code-from-editor-into-console.html 12 | """ 13 | 14 | def study_initialize_visit(): 15 | from selenium.webdriver.support.ui import WebDriverWait 16 | from selenium.webdriver.common.by import By 17 | from selenium.webdriver.support import expected_conditions as EC 18 | 19 | from qacrawler import driver_wrapper 20 | from qacrawler import google_dom_info 21 | from qacrawler import sr_parser 22 | 23 | sr_parser.gdom = google_dom_info.GoogleDomInfoWithoutJS 24 | driver = driver_wrapper.get_firefox_driver() 25 | driver_wrapper.disable_javascript_on_firefox(driver) 26 | 27 | sr_parser.visit_google(driver, 'foo') 28 | search_box = sr_parser.wait_for_and_get_search_box(driver) 29 | 30 | 31 | def study_google_preferences(): 32 | from qacrawler.google_dom_info import GoogleDomInfoWithoutJS as GDom 33 | from qacrawler import main 34 | 35 | google_preferences_page_url = 'http://www.google.com/preferences?hl=en' 36 | driver.get(google_preferences_page_url) 37 | 38 | main.set_number_of_results_per_page(driver, 20) 39 | 40 | 41 | def study_entry(): 42 | from crawler import jeopardy 43 | tiny_dataset = jeopardy.Dataset('tests/data/tiny_dataset.json') 44 | for no in range(len(tiny_dataset.data)): 45 | entry = tiny_dataset.get_entry(no) 46 | print entry.id, entry.tag, entry.question 47 | 48 | 49 | def study_next_page_links(): 50 | from crawler import parser 51 | from crawler import driver_wrapper 52 | 53 | driver = driver_wrapper.get_chrome_driver('/usr/local/bin/chromedriver') 54 | parser.visit_google_front_page(driver) 55 | search_box = parser.wait_for_and_get_search_box(driver) 56 | parser.submit_query('abidin', search_box) 57 | 58 | for n in range(2): 59 | parser.wait_for_search_results(driver) 60 | npu = parser.get_next_page_url(driver) 61 | print npu 62 | driver.get(npu) 63 | 64 | driver.quit() 65 | 66 | 67 | def study_parsing_result_divs(): 68 | import os 69 | from selenium.webdriver.common.by import By 70 | from crawler import parser 71 | from crawler import driver_wrapper 72 | from crawler import google_dom_info as gdom 73 | driver = driver_wrapper.get_chrome_driver('/usr/local/bin/chromedriver') 74 | 75 | html_file = 'tests/data/cheese%20-%20Google%20Search.html' 76 | local_file_url = 'file://' + os.path.abspath(html_file) 77 | driver.get(local_file_url) 78 | 79 | results = parser.get_search_result_divs(driver) 80 | 81 | for element in results: 82 | h3 = element.find_element(By.CLASS_NAME, gdom.RESULT_TITLE_CLASS) 83 | anchor = h3.find_element(By.TAG_NAME, 'a') # tag: a 84 | url = anchor.get_attribute('href') 85 | print url 86 | 87 | driver.quit() 88 | 89 | 90 | def study_logging(): 91 | """ 92 | Log message formatting items 93 | https://docs.python.org/2/library/logging.html#logrecord-attributes 94 | :return: 95 | """ 96 | import logging 97 | reload(logging) 98 | log_attributes = ['levelname', 'asctime', 'module', 'funcName', 'message'] 99 | log_format = ':'.join(['%%(%s)s' % attr for attr in log_attributes]) 100 | logging.basicConfig(filename='example.log', 101 | format=log_format, 102 | level=logging.DEBUG) 103 | logging.info('started.') 104 | logging.debug('analyzing this...') 105 | 106 | 107 | def study_command_line(): 108 | pass 109 | # Study logging level 110 | # python -c 'import main; print main.parse_command_line_arguments()' 
--log-level=DEBUG -j='aa' 111 | 112 | 113 | def misc(): 114 | from crawler import jeopardy 115 | from crawler import main 116 | dataset = jeopardy.Dataset('tests/data/tiny_dataset.json') 117 | gen = main.get_entries_to_search(dataset, 0, len(dataset.data)) 118 | -------------------------------------------------------------------------------- /tests/test_crawler.py: -------------------------------------------------------------------------------- 1 | import os 2 | import sys 3 | sys.path.insert(0, os.path.abspath('..')) 4 | 5 | from qacrawler import driver_wrapper 6 | from qacrawler import sr_parser as parser2 7 | from qacrawler import crawler 8 | from qacrawler import main 9 | 10 | 11 | def test_result_div_parsing(): 12 | driver = driver_wrapper.get_chrome_driver() 13 | html_file = 'data/one_result.html' 14 | local_file_url = 'file://' + os.path.abspath(html_file) 15 | driver.get(local_file_url) 16 | 17 | result_divs = parser2.get_search_result_divs(driver) 18 | 19 | result_div = result_divs[0] 20 | print 'result_div', result_div 21 | sr = parser2.SearchResult(result_div) 22 | assert sr.title == 'Cheese - Wikipedia, the free encyclopedia' 23 | assert sr.url == 'https://en.wikipedia.org/wiki/Cheese' 24 | assert sr.snippet == 'Cheese is a food derived from milk that is produced in a wide range of flavors, textures, and forms by coagulation of the milk protein casein. It comprises proteins ...' 25 | assert sr.related_links == ['Etymology', 'History', 'Production', 'Processing'] 26 | assert str(sr) == 'Cheese - Wikipedia, the free encyclopedia\thttps://en.wikipedia.org/wiki/Cheese\tCheese is a food derived from milk that is produced in a wide range of flavors, textures, and forms by coagulation of the milk protein casein. It comprises proteins ...\tEtymology;History;Production;Processing' 27 | 28 | driver.quit() 29 | 30 | 31 | def test_page_parsing(html_file='data/cheese%20-%20Google%20Search.html', parsed_file='data/parsed.tsv'): 32 | driver = driver_wrapper.get_chrome_driver() 33 | local_file_url = 'file://' + os.path.abspath(html_file) 34 | driver.get(local_file_url) 35 | 36 | parsed_results = parser2.parse_opened_results_page(driver) 37 | formatted = crawler.results_list_to_tsv(parsed_results) 38 | 39 | with open(parsed_file, 'rt') as f: 40 | from_file = f.read() 41 | 42 | assert formatted == from_file 43 | 44 | driver.quit() 45 | -------------------------------------------------------------------------------- /tests/test_jeopardy.py: -------------------------------------------------------------------------------- 1 | import os 2 | import sys 3 | sys.path.insert(0, os.path.abspath('..')) 4 | 5 | from crawler import jeopardy 6 | 7 | DATASET_PATH = os.path.join(os.path.dirname(__file__), 'data', 'tiny_dataset.json') 8 | DATASET = jeopardy.Dataset(DATASET_PATH) 9 | 10 | 11 | def test_loading(): 12 | dataset_size = 7 13 | assert len(DATASET.data) == dataset_size 14 | first_entry_dict = DATASET.data[0] 15 | assert first_entry_dict['answer'] == 'Copernicus' 16 | 17 | 18 | def test_getting_entry(): 19 | entry_no = 0 20 | entry = DATASET.get_entry(entry_no) 21 | assert isinstance(entry, jeopardy.Entry) 22 | assert entry.id == entry_no 23 | assert entry.answer == 'Copernicus' 24 | --------------------------------------------------------------------------------