├── .gitignore ├── LICENSE ├── README.md ├── qacrawler ├── README.md ├── __init__.py ├── crawler.py ├── driver_wrapper.py ├── google_dom_info.py ├── jeopardy.py ├── main.py └── sr_parser.py ├── requirements.txt └── tests ├── .DS_Store ├── data ├── cheese - Google Search.html ├── one_result.html ├── parsed.tsv └── tiny_dataset.json ├── main_test.py ├── study_codes.py ├── test_crawler.py └── test_jeopardy.py /.gitignore: -------------------------------------------------------------------------------- 1 | .DS_Store 2 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | BSD 3-Clause License 2 | 3 | Copyright (c) 2017, New York University (Kyunghyun Cho) 4 | All rights reserved. 5 | 6 | Redistribution and use in source and binary forms, with or without 7 | modification, are permitted provided that the following conditions are met: 8 | 9 | * Redistributions of source code must retain the above copyright notice, this 10 | list of conditions and the following disclaimer. 11 | 12 | * Redistributions in binary form must reproduce the above copyright notice, 13 | this list of conditions and the following disclaimer in the documentation 14 | and/or other materials provided with the distribution. 15 | 16 | * Neither the name of the copyright holder nor the names of its 17 | contributors may be used to endorse or promote products derived from 18 | this software without specific prior written permission. 19 | 20 | THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" 21 | AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE 22 | IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE 23 | DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE 24 | FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL 25 | DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR 26 | SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER 27 | CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, 28 | OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE 29 | OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 30 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # SearchQA 2 | 3 | Associated paper: 4 | https://arxiv.org/abs/1704.05179 5 | 6 | Here are the raw, split, and processed files: 7 | https://drive.google.com/drive/u/2/folders/1kBkQGooNyG0h8waaOJpgdGtOnlb1S649 8 | 9 | ------- 10 | 11 | One can collect the original JSON files through web search using the scripts in qacrawler. Please refer to the README in that folder for further details on how to use the scraper. Furthermore, one can use the files in the tests folder to try it out. The above link also contains the original JSON files that were collected using the Jeopardy! dataset. 12 | 13 | There are also stat files that give the number of snippets found for the question associated with the filename. This number can range from 0 to 100. For some questions the crawler was set to collect the first 50 snippets, and for others 100. When the search doesn't return enough results to reach that number, only the available ones are collected.
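For illustration, the snippet count that drives the filtering described next can be read straight from a collected file's `search_results` list (a minimal sketch, assuming a crawler output file; the file name and the `count_snippets` helper are hypothetical and not part of this repository):

```python
import json


def count_snippets(path):
    """Return the number of search-result snippets stored in one collected JSON file."""
    with open(path, 'rt') as f:
        data = json.load(f)
    return len(data['search_results'])


# Keep a question for training only if more than 40 snippets were collected for it
# (the file name below is a made-up example of the crawler's "id-tag.json" pattern).
if count_snippets('000000-4680_jeopardy_history_200.json') > 40:
    pass  # include this question in the training set
```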
During training, we ignored all files that contain 40 or fewer snippets, to eliminate possible trivial cases. Also, the training data ignores snippets from the 51st onward. 14 | 15 | And here is the link for the Jeopardy! files themselves: 16 | https://www.reddit.com/r/datasets/comments/1uyd0t/200000_jeopardy_questions_in_a_json_file/ 17 | 18 | NOTE: We will release the script that converts these to the training files above, with appropriate restrictions. 19 | 20 | ------- 21 | 22 | Some requirements: 23 | nltk==3.2.1 24 | pandas==0.18.1 25 | selenium==2.53.6 26 | pytest==3.0.2 27 | pytorch==0.1.11 28 | -------------------------------------------------------------------------------- /qacrawler/README.md: -------------------------------------------------------------------------------- 1 | # Build 2 | 3 | ## Virtual environment 4 | 5 | If you prefer to work in a new Python virtual environment, first create one via 6 | A) [virtualenvwrapper](https://virtualenvwrapper.readthedocs.io) (easier) or 7 | B) [virtualenv](https://virtualenv.pypa.io) 8 | 9 | A) `mkvirtualenv qasearch` to create the environment. 10 | (By default this creates the `~/.virtualenvs/qasearch/` directory.) 11 | Then `workon qasearch` to activate the environment. 12 | 13 | B) `virtualenv ENV_DIR` to create the environment. 14 | Then `source ENV_DIR/bin/activate` to activate the environment. 15 | 16 | ## Requirements 17 | 18 | Type `pip install -r requirements.txt` in the project folder to install the required Python packages. 19 | (If you don't create a virtual environment, the packages will be added to your system Python installation.) 20 | 21 | # Drivers 22 | 23 | Put the driver executables on your system's PATH. (Ugur: "I sudo-copied them into `/usr/local/bin`" ^_^) 24 | 25 | - [Chrome driver](https://sites.google.com/a/chromium.org/chromedriver/downloads). 26 | - [Firefox driver](https://developer.mozilla.org/en-US/docs/Mozilla/QA/Marionette/WebDriver). 27 | - Use [v0.9.0](https://github.com/mozilla/geckodriver/releases/tag/v0.9.0) 28 | - For some reason you have to rename the executable to `wires` (from `geckodriver`) 29 | 30 | 31 | # Run 32 | 33 | An example command to run the crawler: 34 | 35 | ``` 36 | $ workon PROJECT_VIRTUAL_ENVIRONMENT 37 | $ cd /PROJECT_GIT_REPOSITORY_FOLDER/qacrawler 38 | $ python main.py --jeopardy-json PATH_TO_JEOPARDY_QUESTIONS1.json --first 0 --last 1000 --output-folder OUTPUT_FOLDER --wait-duration 2 --log-level DEBUG --driver Firefox --disable-javascript --num-pages 1 --results-per-page 50 39 | ``` 40 | 41 | Then, to follow the logs, run `tail -f jeopardy_crawler.log`. 42 | -------------------------------------------------------------------------------- /qacrawler/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/nyu-dl/dl4ir-searchQA/fe1ea9fdb888001723ee89daf72b35601d72a4ea/qacrawler/__init__.py -------------------------------------------------------------------------------- /qacrawler/crawler.py: -------------------------------------------------------------------------------- 1 | """ 2 | Module to crawl entries and save the results. 3 | """ 4 | import json 5 | import logging 6 | import os 7 | 8 | import sr_parser 9 | 10 | 11 | def crawl(settings, entries): 12 | """ 13 | Crawl the search results of the given Jeopardy entries according to the given crawler settings.
14 | 15 | :param settings: Crawler settings object 16 | :type settings: CrawlerSettings 17 | :param entries: Jeopary dataset entries 18 | :type entries: collections.Iterable[qacrawler.jeopardy.Entry] 19 | :return: 20 | """ 21 | for entry in entries: 22 | logging.info('Question no %06d: %s. Crawl!' % (entry.id, entry.question)) 23 | results = sr_parser.collect_query_results_from_google(entry.question, settings) 24 | logging.info('Question no %06d. Collected %d search results.' % (entry.id, len(results))) 25 | if results: 26 | save_results_for_entry(results, entry, settings.output_folder) 27 | 28 | 29 | def save_results_for_entry(results, entry, output_folder, file_type='json'): 30 | """ 31 | Format search results into json or tsv and save them to a file. 32 | 33 | :param results: a list of SearchResults 34 | :type results: list[sr_parser.SearchResult] 35 | :param entry: jeopardy Entry 36 | :type entry: jeopardy.Entry 37 | :param output_folder: path to output folder 38 | :type output_folder: str 39 | :param file_type: the type of the saved file. Can only have values ['json', 'tsv'] 40 | :type file_type: str 41 | :rtype: None 42 | """ 43 | if file_type == 'json': 44 | formatted_results = results_list_to_output(results, entry) 45 | else: 46 | formatted_results = results_list_to_tsv(results) 47 | 48 | file_folder = output_folder if output_folder else '.' 49 | file_name = generate_filename(entry, file_type) 50 | file_path = os.path.join(file_folder, file_name) 51 | with open(file_path, 'wt') as f: 52 | f.write(formatted_results) 53 | 54 | 55 | def results_list_to_output(results, entry): 56 | """ 57 | Format a list of SearchResults into a JSON string. 58 | 59 | :param results: a list of SearchResults 60 | :type results: list[sr_parser.SearchResult] 61 | :param entry: Jeopardy entry 62 | :type entry: jeopardy.Entry 63 | :return: string in JSON format 64 | :rtype: str 65 | """ 66 | result_dicts = [res.to_dict() for res in results] 67 | output_dict = {'search_results': result_dicts} 68 | output_dict.update(entry.to_dict()) 69 | results_json = json.dumps(output_dict, indent=4) # Pretty-print via indent. Splits keys into multiple lines. 70 | return results_json 71 | 72 | 73 | def results_list_to_tsv(results): 74 | """ 75 | Format a list of SearchResults into a TSV (Tab Separated Values) string. 76 | 77 | :param results: a list of SearchResults 78 | :type results: list[sr_parser.SearchResult] 79 | :return: string in TSV format 80 | :rtype: str 81 | """ 82 | results_tab_separated = [str(res) for res in results] 83 | results_str = '\n'.join(results_tab_separated) 84 | return results_str 85 | 86 | 87 | def generate_filename(entry, extension): 88 | filename = '%06d-%s.%s' % (entry.id, entry.tag, extension) 89 | return filename 90 | 91 | 92 | # TODO have a process pipeline 93 | def process_pipeline(): 94 | pass 95 | # Get a SearchResult 96 | # save %06d-%s-raw.txt % (q_num_id, q_text_id) 97 | # Filter out exact matches 98 | # save %06d-%s-flt.txt % (q_num_id, q_text_id) 99 | # Tokenize 100 | # save %06d-%s-tok.txt % (q_num_id, q_text_id) 101 | # Other NLP operator 102 | # save %06d-%s-oth.txt % (q_num_id, q_text_id) 103 | 104 | 105 | class CrawlerSettings: 106 | def __init__(self, driver, num_pages, output_folder, wait_duration, 107 | simulate_typing, simulate_clicking, disable_javascript): 108 | """ 109 | Singleton class that holds configuration info for crawler. 110 | 111 | It is to make sharing of crawler settings easier among the project. 
We do not want higher-level 112 | functions in the functionality abstraction hierarchy to be bloated with too many arguments. 113 | 114 | :param driver: selenium driver with which we are going to visit Google 115 | :type driver: selenium.webdriver.chrome.webdriver.WebDriver 116 | :param num_pages: number of results pages to read (at most) 117 | :type num_pages: int 118 | :param output_folder: 119 | :param wait_duration: duration to wait between search results pages in seconds 120 | :type wait_duration: float 121 | :param simulate_typing: indicates whether or not to simulate human key typing 122 | :type simulate_typing: bool 123 | :param simulate_typing: indicates whether or not to simulate human mouse clicking 124 | :type simulate_clicking: bool 125 | """ 126 | self.driver = driver 127 | self.num_pages = num_pages 128 | self.output_folder = output_folder 129 | self.wait_duration = wait_duration 130 | self.simulate_typing = simulate_typing 131 | self.simulate_clicking = simulate_clicking 132 | self.disable_javascript = disable_javascript 133 | -------------------------------------------------------------------------------- /qacrawler/driver_wrapper.py: -------------------------------------------------------------------------------- 1 | """ 2 | This module abstracts away getting a browser driver. 3 | 4 | Currently we only have Chrome bindings. 5 | """ 6 | import logging 7 | import time 8 | 9 | from selenium import webdriver 10 | from selenium.webdriver.common.desired_capabilities import DesiredCapabilities 11 | from selenium.webdriver.common.keys import Keys 12 | from selenium.common.exceptions import NoSuchElementException 13 | 14 | 15 | def get_selenium_driver(driver_type='Firefox'): 16 | if driver_type == 'Firefox': 17 | return get_firefox_driver() 18 | elif driver_type == 'Chrome': 19 | return get_chrome_driver() 20 | elif driver_type == 'PhantomJS': 21 | raise NotImplementedError('PhantomJS usage is not implemented.') 22 | 23 | 24 | def get_chrome_driver(): 25 | """ 26 | Get a Chrome Driver. 27 | 28 | The driver executable (chromedriver) must be in the system path. 29 | 30 | :return: selenium Chrome webdriver 31 | :rtype: selenium.webdriver.chrome.webdriver.WebDriver 32 | """ 33 | driver = webdriver.Chrome() 34 | return driver 35 | 36 | 37 | def get_firefox_driver(): 38 | """ 39 | Get a Firefox Driver. 40 | 41 | The driver executable (wires) must be in the system path. 42 | 43 | :return: selenium Firefox webdriver 44 | :rtype: selenium.webdriver.firefox.webdriver.WebDriver 45 | """ 46 | firefox_capabilities = DesiredCapabilities.FIREFOX 47 | firefox_capabilities['marionette'] = True 48 | driver = webdriver.Firefox(capabilities=firefox_capabilities) 49 | return driver 50 | 51 | 52 | def disable_javascript(driver, driver_type='Firefox'): 53 | if driver_type == 'Firefox': 54 | return disable_javascript_on_firefox(driver) 55 | elif driver_type == 'Chrome': 56 | raise NotImplementedError('Disabling Javascript on Chrome is not implemented.') 57 | 58 | 59 | def disable_javascript_on_firefox(driver): 60 | """ 61 | Manually disable Javascript on Firefox. 62 | 63 | Setting profile preference on Javascript method (shown below) does not work anymore. 64 | Hence, we have to manually 65 | - visit Firefox Configuration page (about:config) 66 | - click on 'I accept the risk!' 
button that warns about messing with Firefox settings 67 | - narrow down the configurations list by typing 'javascript.enabled' into text box 68 | - togge javascript.enabled setting to False by highlighting it and pressing return 69 | 70 | Not working method: 71 | profile = webdriver.FirefoxProfile() 72 | profile.set_preference("javascript.enabled", False); 73 | driver = webdriver.Firefox(profile) 74 | """ 75 | logging.info('Disabling Javascript...') 76 | driver.get(FirefoxConfigInfo.CONFIG_PAGE_URL) # Firefox's configuration page opens when this url is visited 77 | try: 78 | warning_button = driver.find_element_by_id(FirefoxConfigInfo.ACCEPT_WARNING_BUTTON_ID) 79 | warning_button.click() 80 | except NoSuchElementException: 81 | pass 82 | text_box = driver.find_element_by_id(FirefoxConfigInfo.CONFIGURATIONS_SEARCH_BOX) 83 | text_box.send_keys(FirefoxConfigInfo.JAVASCRIPT_CONFIGURATION_NAME) 84 | time.sleep(2.0) # wait for the configuration list to respond to entered text 85 | config_tree = driver.find_element_by_id(FirefoxConfigInfo.CONFIGURATION_ELEMENTS_LIST_ID) 86 | config_tree.send_keys(Keys.ARROW_DOWN) # Select first item in the list 87 | time.sleep(1.0) # wait for the configuration list to respond to selecting setting 88 | config_tree.send_keys(Keys.RETURN) # Toggle selected item 89 | time.sleep(1.0) 90 | 91 | 92 | class FirefoxConfigInfo(object): 93 | """Has DOM information for Firefox's Configuration Page.""" 94 | FIREFOX_VERSION = '49.0.1' # works on this version 95 | CONFIG_PAGE_URL = 'about:config' 96 | ACCEPT_WARNING_BUTTON_ID = 'warningButton' 97 | CONFIGURATIONS_SEARCH_BOX = 'textbox' 98 | JAVASCRIPT_CONFIGURATION_NAME = 'javascript.enabled' 99 | CONFIGURATION_ELEMENTS_LIST_ID = 'configTree' 100 | 101 | 102 | def get_phantomjs_driver(): 103 | # TODO Implement PhantomJS driver version instead of Chrome driver 104 | pass 105 | -------------------------------------------------------------------------------- /qacrawler/google_dom_info.py: -------------------------------------------------------------------------------- 1 | """ 2 | This module includes information about Google search page specific HTML information to be used in parsing. 3 | 4 | It includes class names such as the class assigned to search result DIV's etc. 5 | """ 6 | 7 | 8 | class GoogleDomInfoBase(object): 9 | RESULT_TITLE_CLASS = 'r' 10 | RESULT_URL_CLASS = '_Rm' 11 | RESULT_DESCRIPTION_CLASS = 'st' 12 | RESULT_RELATED_LINKS_DIV_CLASS = 'osl' 13 | RESULT_RELATED_LINK_CLASS = 'fl' 14 | NEXT_PAGE_ID = 'pnnext' 15 | 16 | 17 | class GoogleDomInfoWithJS(GoogleDomInfoBase): 18 | RESULT_DIV_CLASS = 'rc' 19 | SEARCH_BOX_XPATH = '//*[@id="lst-ib"]' 20 | 21 | 22 | class GoogleDomInfoWithoutJS(GoogleDomInfoBase): 23 | RESULT_DIV_CLASS = 'g' 24 | SEARCH_BOX_XPATH = '//*[@id="sbhost"]' 25 | NAVIGATION_LINK_CLASS = 'fl' 26 | PREFERENCES_BUTTON_ID = 'gbi5' 27 | NUMBER_OF_RESULTS_SELECT_ID = 'numsel' 28 | SAVE_PREFERENCES_BUTTON_NAME = 'submit2' 29 | -------------------------------------------------------------------------------- /qacrawler/jeopardy.py: -------------------------------------------------------------------------------- 1 | """ 2 | This module is about abstracting away reading Jeopardy dataset. 3 | 4 | An Entry is an object that holds parsed Jeopardy entry information such as question, answer etc. 5 | 6 | Dataset class is responsible from reading Jeopardy dataset file and converting entries to Entry objects. 
7 | """ 8 | import json 9 | 10 | from bs4 import BeautifulSoup 11 | 12 | 13 | class Dataset(object): 14 | """ 15 | Class that represents Jeopardy! question answer dataset. 16 | """ 17 | def __init__(self, filepath): 18 | self.data = self.load_jeopardy_dataset_from_json_file(filepath) 19 | self.size = len(self.data) 20 | 21 | @staticmethod 22 | def load_jeopardy_dataset_from_json_file(filepath): 23 | """ 24 | Read Jeopardy entries from file. 25 | 26 | :param filepath: the path of json file 27 | :return: A list of dictionaries that hold entry information. 28 | :rtype: list[dict] 29 | """ 30 | with open(filepath, 'rt') as json_file: 31 | data = json.load(json_file) 32 | return data 33 | 34 | def get_entry(self, no): 35 | """ 36 | Create an Entry object from the jeopardy entry at line no. 37 | 38 | :param no: The number of entry, in the order they are saved in json file 39 | :return: at the at the line no 40 | :rtype: Entry 41 | """ 42 | entry = Entry(entry_dict=self.data[no], entry_id=no) 43 | return entry 44 | 45 | 46 | class Entry(object): 47 | """Class that represents a Jeopardy! entry.""" 48 | KEYS = [u'category', u'air_date', u'question', u'value', u'answer', u'round', u'show_number'] 49 | 50 | def __init__(self, entry_dict, entry_id): 51 | """ 52 | Construct an Entry object 53 | 54 | :param entry_dict: Entry data read from json file 55 | :type entry_dict: dict 56 | :param entry_id: An integer ID for the entry, which is the spatial rank in the dataset file 57 | :type entry_id: int 58 | """ 59 | self.id = entry_id 60 | self.question = self.get_question(entry_dict['question']) 61 | self.answer = entry_dict['answer'] 62 | self.category = entry_dict['category'] 63 | self.air_date = entry_dict['air_date'] 64 | self.show_number = entry_dict['show_number'] 65 | self.round = entry_dict['round'] 66 | self.value = entry_dict['value'] 67 | self.tag = self.get_tag(entry_dict) 68 | 69 | def to_dict(self): 70 | d = {key: getattr(self, key) for key in Entry.KEYS} 71 | d.update({'id': self.id}) 72 | return d 73 | 74 | @staticmethod 75 | def get_question(string): 76 | """ 77 | Trim quotation marks and remove HTML tags. 78 | 79 | Trim quotation marks in the beginning and at the end of the string. Remove HTML tags if there are any. (Some 80 | questions have and tags. 81 | 82 | For some reason json version of the dataset has that quotes. 83 | """ 84 | string = string[1:-1] 85 | soup = BeautifulSoup(string, 'html.parser') 86 | return soup.text 87 | 88 | @staticmethod 89 | def get_tag(entry_dict): 90 | """ 91 | Generate a unique tag from entry metadata. 92 | 93 | example: 4999_Double Jeopardy!_AUTHORS IN THEIR YOUTH_$800 94 | 95 | :param entry_dict: A dictionary with keys ['show_number', 'round', 'category', 'value'] 96 | :type entry_dict: dict 97 | :return: tag of entry 98 | :rtype: str 99 | """ 100 | tag_keys = ['show_number', 'round', 'category', 'value'] 101 | tag_parts = [Entry.format_tag_part(entry_dict[key]) for key in tag_keys] 102 | tag = '_'.join(tag_parts) 103 | return tag 104 | 105 | @staticmethod 106 | def format_tag_part(s): 107 | """ 108 | Make input string all lowercase and filter out non-alphanumeric characters. 
109 | 110 | :param s: tag part string to format 111 | :type s: str 112 | :return: formatted string 113 | :rtype: str 114 | """ 115 | if s is None: 116 | return '' 117 | s = s.lower() 118 | s = ''.join([ch for ch in s if ch.isalnum()]) 119 | return s 120 | -------------------------------------------------------------------------------- /qacrawler/main.py: -------------------------------------------------------------------------------- 1 | """ 2 | Main entry point of qacrawler 3 | 4 | Parses command line arguments. Loads the dataset. Chooses the entries to crawl. Crawl chosen entries. Quit. 5 | """ 6 | import argparse 7 | import logging 8 | import os 9 | import time 10 | 11 | from selenium.webdriver.common.by import By 12 | from selenium.webdriver.support.ui import Select 13 | 14 | import driver_wrapper 15 | import jeopardy 16 | import crawler 17 | import sr_parser 18 | from google_dom_info import GoogleDomInfoWithoutJS as GDom 19 | 20 | 21 | def main(): 22 | crawler_settings, entries = initialize() 23 | crawler.crawl(crawler_settings, entries) 24 | finalize(crawler_settings.driver) 25 | 26 | 27 | def initialize(): 28 | """Initialize collector. 29 | 30 | Initialize by parsing command line arguments, reading Jeopardy entries from dataset file 31 | and getting browser driver. 32 | """ 33 | args = parse_command_line_arguments() 34 | configure_logging(log_level=args.log_level) 35 | create_folder_if_not_exists(args.output_folder) 36 | dataset = jeopardy.Dataset(filepath=args.jeopardy_json) 37 | entries = get_entries_to_search(dataset, first=args.first, last=args.last) 38 | driver = driver_wrapper.get_selenium_driver(args.driver_type) 39 | if args.disable_javascript: 40 | driver_wrapper.disable_javascript(driver, args.driver_type) 41 | set_number_of_results_per_page(driver, args.results_per_page) 42 | settings = crawler.CrawlerSettings(driver, args.num_pages, args.output_folder, args.wait_duration, 43 | args.simulate_typing, args.simulate_clicking, args.disable_javascript) 44 | logging.info('Start.') 45 | return settings, entries 46 | 47 | 48 | def parse_command_line_arguments(): 49 | """Parse command line arguments 50 | 51 | :return: An object of which attributes are command line arguments 52 | :rtype: argparse.ArgumentParser 53 | """ 54 | argparser = argparse.ArgumentParser(description='Crawl Google to generate QA dataset') 55 | argparser.add_argument('-j', '--jeopardy-json', type=str, required=True, 56 | help='Path to Jeopardy dataset file') 57 | argparser.add_argument('-d', '--driver-type', type=str, default='Firefox', 58 | help='The browser/driver type to be used by crawler', 59 | choices=['Firefox', 'Chrome', 'PhantomJS']) 60 | argparser.add_argument('-o', '--output-folder', type=str, required=True, 61 | help='If a folder is given the output files will be written there. 
' 62 | 'If given folder does not exist it will be created first.') 63 | argparser.add_argument('-f', '--first', type=int, default=0, 64 | help='First entry from which to start reading questions') 65 | argparser.add_argument('-l', '--last', type=int, default=216930, 66 | help='Last entry at which to stop reading questions') 67 | argparser.add_argument('-n', '--num-pages', type=int, default=1, 68 | help='Number of search result pages to parse per query') 69 | argparser.add_argument('-g', '--log-level', type=str, default='INFO', 70 | help='Set the level of log messages below which will be saved to log file', 71 | choices=['DEBUG', 'INFO', 'WARNING', 'ERROR']) 72 | argparser.add_argument('-w', '--wait-duration', type=int, default=4, 73 | help='Number of seconds to wait before getting the next page') 74 | argparser.add_argument('--simulate-typing', action='store_true', 75 | help='When included simulates human typing by pressing keys one-by-one') 76 | argparser.add_argument('--simulate-clicking', action='store_true', 77 | help='When included simulates mouse clicking on next page link') 78 | argparser.add_argument('--disable-javascript', action='store_true', 79 | help='When included disables JavaScript (only for Firefox)') 80 | argparser.add_argument('--results-per-page', type=int, default=10, 81 | help='The number of search results in a page per query', 82 | choices=[10, 20, 30, 50, 100]) 83 | args = argparser.parse_args() 84 | return args 85 | 86 | 87 | def configure_logging(log_level, log_file='jeopardy_crawler.log', 88 | log_format='%(levelname)s:%(asctime)s:%(module)s:%(funcName)s:%(message)s'): 89 | """Configure Crawler's logger and Suppress selenium's browser logger. 90 | 91 | Selenium's browser logger logs every HTTP requests etc. We are not interested in that.""" 92 | log_level_num = getattr(logging, log_level) 93 | logging.basicConfig(filename=log_file, format=log_format, level=log_level_num) 94 | from selenium.webdriver.remote.remote_connection import LOGGER as SELENIUM_LOGGER 95 | SELENIUM_LOGGER.setLevel(logging.INFO) 96 | 97 | 98 | def create_folder_if_not_exists(folder): 99 | """ 100 | :param folder: path to folder 101 | :type folder: str 102 | :rtype None: 103 | """ 104 | if not os.path.exists(folder): 105 | os.makedirs(folder) 106 | 107 | 108 | def get_entries_to_search(dataset, first, last): 109 | """Get entries to do search queries. 110 | 111 | :rtype generator[jeopardy.Entry]""" 112 | if last >= dataset.size: last = dataset.size 113 | entries = (dataset.get_entry(no) for no in range(first, last)) 114 | return entries 115 | 116 | 117 | def set_number_of_results_per_page(driver, num_results): 118 | """Manually set the number of search results per page when Javascript is disabled. 119 | 120 | Manually 121 | - visit Google Search's preferences page 122 | - find the element to set number of results per page 123 | - set it to the value given in command-line arguments 124 | """ 125 | logging.debug('Setting search results per page to %d...' % num_results) 126 | redirect_nonjavascript_version_of_google_by_making_a_dummy_query(driver) 127 | visit_google_search_preferences_page(driver) 128 | find_select_and_set(driver, num_results) 129 | save_preferences(driver) 130 | 131 | 132 | def redirect_nonjavascript_version_of_google_by_making_a_dummy_query(driver): 133 | """ 134 | If we first visit preferences page directly, Google gives "Your cookies seem to be disabled." warning. 135 | Hence first do a dummy search, so that Google redirects us to its non-Javascript version. 
136 | """ 137 | sr_parser.visit_google(driver, 'Increase search results per page...') 138 | sr_parser.wait_for_presence_and_get_element(driver, (By.XPATH, GDom.SEARCH_BOX_XPATH)) 139 | 140 | 141 | def visit_google_search_preferences_page(driver): 142 | google_preferences_page_url = 'http://www.google.com/preferences?hl=en' 143 | driver.get(google_preferences_page_url) 144 | 145 | 146 | def find_select_and_set(driver, num_results): 147 | select = wait_for_and_scroll_into_view_and_get_num_results_select(driver) 148 | time.sleep(1) 149 | select.select_by_value(str(num_results)) 150 | logging.info('Selected value: %s' % select.first_selected_option.text) 151 | time.sleep(1) 152 | 153 | 154 | def wait_for_and_scroll_into_view_and_get_num_results_select(driver): 155 | """ 156 | :return: Select html element to set number of search results per page 157 | :rtype: selenium.webdriver.support.ui.Select 158 | """ 159 | select_locator = (By.ID, GDom.NUMBER_OF_RESULTS_SELECT_ID) 160 | select = sr_parser.wait_for_presence_and_get_element(driver, select_locator) 161 | coordinates = select.location_once_scrolled_into_view # first scrolls to element then returns location 162 | select = Select(select) 163 | return select 164 | 165 | 166 | def save_preferences(driver): 167 | save_preferences_button = driver.find_element(By.NAME, GDom.SAVE_PREFERENCES_BUTTON_NAME) 168 | save_preferences_button.click() 169 | time.sleep(1) 170 | 171 | 172 | def finalize(driver): 173 | logging.info('End.') 174 | driver.quit() 175 | 176 | 177 | if __name__ == '__main__': 178 | main() 179 | -------------------------------------------------------------------------------- /qacrawler/sr_parser.py: -------------------------------------------------------------------------------- 1 | """ 2 | This module deals with crawling Google search results and parsing result pages. 3 | 4 | A SearchResult is an object that holds parsed search results. 5 | 6 | collect_query_results_from_google() function is to crawl a query's search results into SearchResult objects. 7 | 8 | Terminology is taken from Google help at: https://support.google.com/websearch/answer/35891?hl=en#results 9 | """ 10 | import logging 11 | import random 12 | import sys 13 | import time 14 | 15 | from selenium.webdriver.common.action_chains import ActionChains 16 | from selenium.webdriver.common.by import By 17 | from selenium.webdriver.common.keys import Keys 18 | from selenium.common.exceptions import NoSuchElementException, StaleElementReferenceException 19 | from selenium.webdriver.support.ui import WebDriverWait 20 | from selenium.webdriver.support import expected_conditions as EC 21 | from selenium.common.exceptions import TimeoutException 22 | from bs4 import BeautifulSoup 23 | 24 | import google_dom_info 25 | 26 | GDOM = None 27 | 28 | 29 | def collect_query_results_from_google(query, settings): 30 | """ 31 | Get formatted search results given a search query and the number of pages to parse. 
32 | 33 | :param query: search query 34 | :type query: str 35 | :param settings: Crawler settings object 36 | :type settings: qacrawler.crawler.CrawlerSettings 37 | :return: A list of SearchResult objects 38 | :rtype: list[SearchResult] 39 | """ 40 | set_gdom(settings.disable_javascript) 41 | if settings.disable_javascript: 42 | visit_google(settings.driver, query='%20') # request a search result page with no results 43 | else: 44 | visit_google(settings.driver) 45 | check_google_bot_police(settings.driver) 46 | search_box = wait_for_and_get_search_box(settings.driver) 47 | submit_query(query, search_box, settings.simulate_typing, settings.driver) 48 | all_results = parse_n_search_result_pages(settings, 49 | num_pages=settings.num_pages, wait_duration=settings.wait_duration) 50 | return all_results 51 | 52 | 53 | def set_gdom(disable_javascript): 54 | """Choose the Google DOM information according whether Javascript is disabled or not.""" 55 | global GDOM 56 | if disable_javascript: 57 | GDOM = google_dom_info.GoogleDomInfoWithoutJS 58 | else: 59 | GDOM = google_dom_info.GoogleDomInfoWithJS 60 | 61 | 62 | def visit_google(driver, query=None): 63 | """If a query is given directly search that query otherwise just open Google's front page.""" 64 | url = 'http://google.com' 65 | if query is not None: 66 | url = url + '/search?q=' + query 67 | driver.get(url) # visit a search results page with no search results 68 | 69 | 70 | def check_google_bot_police(driver): 71 | """Give error and quit if caught by Google Bot Police.""" 72 | robot_police_text = 'Our systems have detected unusual traffic' 73 | if robot_police_text in driver.page_source: 74 | logging.critical('Caught by Google Bot Police :-(. Exiting...') 75 | quit_driver_and_exit(driver) 76 | 77 | 78 | def wait_for_and_get_search_box(driver): 79 | search_box_locator = (By.XPATH, GDOM.SEARCH_BOX_XPATH) 80 | search_box = wait_for_presence_and_get_element(driver, search_box_locator) 81 | return search_box 82 | 83 | 84 | def wait_for_presence_and_get_element(driver, locator, timeout=10): 85 | try: 86 | condition = EC.presence_of_element_located(locator) 87 | element = WebDriverWait(driver, timeout=timeout).until(condition) 88 | return element 89 | except TimeoutException: 90 | logging.critical('TimeoutException: Could not get the element at %s. ' % str(locator) + 91 | 'There is a connection problem. 
Exiting.') 92 | quit_driver_and_exit(driver) 93 | 94 | 95 | def quit_driver_and_exit(driver): 96 | driver.quit() 97 | sys.exit() 98 | 99 | 100 | def submit_query(query, search_box, simulate_typing, driver): 101 | """Choose query submission type according to simulate_typing.""" 102 | if simulate_typing: 103 | submit_query_by_typing(driver, query, search_box) 104 | else: 105 | submit_query_at_once(query, search_box) 106 | 107 | 108 | def submit_query_by_typing(driver, query, search_box): 109 | """Submit query by sending keys one-by-one.""" 110 | ActionChains(driver).move_to_element(search_box).click().perform() 111 | simulate_typing(search_box, query) 112 | result_divs = driver.find_elements_by_class_name(GDOM.RESULT_DIV_CLASS) 113 | search_box.send_keys(Keys.ENTER) 114 | search_box.submit() 115 | if result_divs: 116 | wait_for_page_load_after_submission(driver, result_divs) 117 | 118 | 119 | def wait_for_page_load_after_submission(driver, result_divs): 120 | """Wait for the query string to arrive Search Engine.""" 121 | logging.debug('wait for staleness after submitting search query') 122 | WebDriverWait(driver, timeout=5).until( 123 | EC.staleness_of(result_divs[0]) 124 | ) 125 | logging.debug('staleness ended') 126 | 127 | 128 | def simulate_typing(element, text): 129 | for char in text: 130 | element.send_keys(char) 131 | wait_with_variance(0.05, variation=0.05) 132 | 133 | 134 | def submit_query_at_once(query, search_box): 135 | """Submit query by entering whole question string at once.""" 136 | search_box.clear() 137 | search_box.send_keys(query) 138 | search_box.send_keys(Keys.ENTER) 139 | # search_box.submit() 140 | 141 | 142 | def parse_n_search_result_pages(settings, num_pages, wait_duration): 143 | """ 144 | Parse num_pages of search result pages and return all SearchResults found. 145 | 146 | :param wait_duration: duration to wait between search results pages in seconds 147 | :type wait_duration: float 148 | :param num_pages: Number of search result pages to parse per query 149 | :type num_pages: int 150 | :return: list of SearchResult parsed from num_pages of search result pages 151 | :rtype: list[SearchResult] 152 | """ 153 | all_results = [] 154 | for page_no in range(num_pages): 155 | logging.debug('Parsing page %d.' % page_no) 156 | page_results = parse_one_search_result_page(settings.driver) 157 | if not page_results: 158 | return all_results 159 | all_results.extend(page_results) 160 | wait_with_variance(duration=wait_duration) 161 | next_page_exists = request_next_page(settings.driver, settings.simulate_clicking, settings.disable_javascript) 162 | if not next_page_exists: 163 | break 164 | return all_results 165 | 166 | 167 | def request_next_page(driver, simulate_clicking, disable_javascript): 168 | """ 169 | Find next page element/url and if exists request it. 
170 | 171 | :param driver: selenium driver with which we'll collect data from opened search result page 172 | :type driver: selenium.webdriver.chrome.webdriver.WebDriver 173 | :param simulate_clicking: if True request next page by clicking on element otherwise request it via 174 | driver.get method 175 | :type simulate_clicking: bool 176 | :return: Whether the next page element exists or not 177 | :rtype: bool 178 | """ 179 | if simulate_clicking: 180 | next_page_element = get_next_page_element(driver) 181 | if next_page_element: 182 | ActionChains(driver).move_to_element(next_page_element).perform() 183 | wait_with_variance(duration=0.2, variation=0.2) 184 | ActionChains(driver).click(next_page_element).perform() 185 | wait_for_page_load_after_clicking_on_link(driver) 186 | return True 187 | else: 188 | next_page_url = get_next_page_url_no_js(driver) if disable_javascript else get_next_page_url_js(driver) 189 | if next_page_url: 190 | driver.get(next_page_url) 191 | return True 192 | return False 193 | 194 | 195 | def wait_for_page_load_after_clicking_on_link(driver): 196 | """ 197 | Unlike opening a new page via driver.get opening a new page via clicking on a link is a more complicated process. 198 | Clicking on a link is an asynchronious operation, i.e. browser driver does not wait for the action that is 199 | trigger by clicking to finish. Because the action can be anything including opening a new page, 200 | sending an AJAX request, or even nothing. 201 | 202 | Our technique to make the operation a blocking operation by waiting until the new page request is to see whether the 203 | search results that appeared while we were simulated-typing went stale. 204 | """ 205 | a_div = get_search_result_divs(driver)[0] 206 | logging.debug('wait for staleness after clicking next page') 207 | WebDriverWait(driver, timeout=5).until( 208 | EC.staleness_of(a_div) 209 | ) 210 | logging.debug('staleness ended') 211 | 212 | 213 | def parse_one_search_result_page(driver): 214 | """ 215 | Wait the search result page to load and call page parse function. 216 | 217 | :param driver: selenium driver with which we'll open results page 218 | :type driver: selenium.webdriver.chrome.webdriver.WebDriver 219 | :return: A list of SearchResult objects. If there are no results return an empty list 220 | :rtype: list[SearchResult] 221 | """ 222 | try: 223 | wait_for_search_results(driver) 224 | except TimeoutException: 225 | logging.warning('TimeoutException: Either there are no search results (e.g. false alarm) ' 226 | 'or Google realized that we are a bot :-( ' 227 | 'or there is connection problem (less likely).') 228 | return [] 229 | results = parse_opened_results_page(driver) 230 | logging.debug('Collected %d search results.' % len(results)) 231 | return results 232 | 233 | 234 | def wait_for_search_results(driver, timeout=10): 235 | elem = WebDriverWait(driver, timeout=timeout).until( 236 | EC.presence_of_element_located((By.CLASS_NAME, GDOM.RESULT_DIV_CLASS)) 237 | ) 238 | logging.debug('wait_for_search_results ENDED') 239 | 240 | 241 | def wait_with_variance(duration, variation=1.0): 242 | """Wait a while to pass Google's bot detection.""" 243 | uniform = random.random() * variation 244 | duration += uniform 245 | time.sleep(duration) 246 | 247 | 248 | def get_next_page_element(driver): 249 | """ 250 | Get HTML element for next page link so that we can click on it. 
251 | 252 | :param driver: the selenium driver that is used to open the search results page 253 | :type driver: selenium.webdriver.chrome.webdriver.WebDriver 254 | :return: next page element if exists else None 255 | :rtype: selenium.webdriver.remote.webelement.WebElement 256 | """ 257 | next_page_elements = driver.find_elements(By.ID, GDOM.NEXT_PAGE_ID) 258 | if next_page_elements: 259 | next_page_element = next_page_elements[0] 260 | else: 261 | next_page_element = None 262 | logging.debug('There is no next page.') 263 | return next_page_element 264 | 265 | 266 | def get_next_page_url_js(driver): 267 | """ 268 | Get next page url when Javascript is enabled. 269 | 270 | :param driver: the selenium driver that is used to open the search results page 271 | :type driver: selenium.webdriver.chrome.webdriver.WebDriver 272 | :return: url if exists else None 273 | :rtype: str 274 | """ 275 | next_page_link = driver.find_elements(By.ID, GDOM.NEXT_PAGE_ID) 276 | if next_page_link: 277 | next_page_url = next_page_link[0].get_attribute('href') 278 | else: 279 | next_page_url = None 280 | logging.debug('There is no next page.') 281 | return next_page_url 282 | 283 | 284 | def get_next_page_url_no_js(driver): 285 | """ 286 | Get next page url when Javascript is disabled. 287 | 288 | Google does not give an HTML id to next page link element when Javascript is disabled. 289 | 290 | :param driver: the selenium driver that is used to open the search results page 291 | :type driver: selenium.webdriver.chrome.webdriver.WebDriver 292 | :return: url if exists else None 293 | :rtype: str 294 | """ 295 | navigation_elements = driver.find_elements(By.CLASS_NAME, GDOM.NAVIGATION_LINK_CLASS) 296 | if not navigation_elements: 297 | logging.debug('There is no next page.') 298 | return None 299 | last_navigation_element = navigation_elements[-1] 300 | if last_navigation_element.text != 'Next': 301 | logging.debug('There is no next page.') 302 | return None 303 | next_page_url = last_navigation_element.get_attribute('href') 304 | return next_page_url 305 | 306 | 307 | def parse_opened_results_page(driver): 308 | """Parse a loaded search result page into a list of SearchResult objects. 309 | 310 | It parses 4 parts ("title", "url", "short description" and, if exists, "related links") from each 311 | search result item. 312 | 313 | :param driver: the selenium driver that is used to open the search results page 314 | :type driver: selenium.webdriver.chrome.webdriver.WebDriver 315 | :return: A list of SearchResult objects 316 | :rtype: list[SearchResult] 317 | """ 318 | elements = get_search_result_divs(driver) 319 | results = [] 320 | for no, elem in enumerate(elements): 321 | try: 322 | results.append(SearchResult(elem)) 323 | except NotAParsableSearchResult: 324 | logging.debug('Search result DIV no %d is not parsable. ' 325 | 'It can be a non-website result such as a video.' % no) 326 | continue 327 | return results 328 | 329 | 330 | def get_search_result_divs(driver): 331 | """From the opened page, get a list of DIVs where each DIV is a result. 332 | 333 | :param driver: selenium driver with which results page is opened 334 | :type driver: selenium.webdriver.chrome.webdriver.WebDriver 335 | :rtype: list[bs4.element.Tag] 336 | """ 337 | soup = BeautifulSoup(driver.page_source, 'html.parser') 338 | elements = soup.select('.' 
+ GDOM.RESULT_DIV_CLASS) # tag: div 339 | return elements 340 | 341 | 342 | class SearchResult(object): 343 | """Parses search result information such as title, url, snippet from div and keep them in related attributes""" 344 | def __init__(self, element): 345 | """Parse a search result DIV to get title, url, short description. 346 | 347 | :param element: HTML element, a DIV that holds a search result 348 | :type element: bs4.element.Tag 349 | """ 350 | self.element = element 351 | self.title = self.parse_title(element) 352 | logging.debug('parsing result ' + self.title) 353 | self.url = self.parse_url(element) 354 | self.snippet = self.parse_snippet(element) 355 | self.related_links = self.parse_related_links(element) 356 | 357 | @staticmethod 358 | def parse_title(element): 359 | title = element.select_one('.' + GDOM.RESULT_TITLE_CLASS) 360 | if title is None: 361 | raise NotAParsableSearchResult 362 | return title.text.encode('ascii', 'ignore') 363 | 364 | @staticmethod 365 | def parse_url(element): 366 | h3 = element.select_one('.' + GDOM.RESULT_TITLE_CLASS) 367 | anchor = h3.find('a') # tag: a 368 | if anchor is None: 369 | raise NotAParsableSearchResult 370 | url = anchor['href'] 371 | return url.encode('ascii', 'ignore') 372 | 373 | @staticmethod 374 | def parse_snippet(element): 375 | # snippet might not exist for result item 376 | snippet = element.select_one('.' + GDOM.RESULT_DESCRIPTION_CLASS) # tag: span 377 | return snippet.text.encode('ascii', 'ignore') if snippet else None 378 | 379 | @staticmethod 380 | def parse_related_links(element): 381 | # related links might not exist for result item 382 | # first check if their div exists. If yes, get the list of link elements 383 | related_links_div = element.select_one('.' + GDOM.RESULT_RELATED_LINKS_DIV_CLASS) # tag: div 384 | if related_links_div: 385 | related_links = related_links_div.select('.' 
+ GDOM.RESULT_RELATED_LINK_CLASS) # tag: a 386 | related_links = [rl.text.encode('ascii', 'ignore') for rl in related_links] 387 | else: 388 | related_links = None 389 | return related_links 390 | 391 | def to_dict(self): 392 | """Convert data in SearchResult object into a dictionary""" 393 | dict_representation = {'title': self.title, 394 | 'url': self.url, 395 | 'snippet': self.snippet, 396 | 'related_links': self.related_links} 397 | return dict_representation 398 | 399 | def __str__(self): 400 | """Format a parsed Google search result into a tab separated string""" 401 | s = '%s\t' % self.title 402 | s += '%s\t' % self.url 403 | s += '%s\t' % (self.snippet if self.snippet else '') 404 | s += '%s' % (';'.join(self.related_links) if self.related_links else '') 405 | s = s 406 | return s 407 | 408 | 409 | class NotAParsableSearchResult(Exception): 410 | pass 411 | -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | nltk==3.2.1 2 | pandas==0.18.1 3 | selenium==2.53.6 4 | # development 5 | pytest==3.0.2 6 | -------------------------------------------------------------------------------- /tests/.DS_Store: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/nyu-dl/dl4ir-searchQA/fe1ea9fdb888001723ee89daf72b35601d72a4ea/tests/.DS_Store -------------------------------------------------------------------------------- /tests/data/one_result.html: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | 5 | Title 6 | 7 | 8 |

<div class="g"><div class="rc"><h3 class="r"><a href="https://en.wikipedia.org/wiki/Cheese">Cheese - Wikipedia, the free encyclopedia</a></h3><cite class="_Rm">https://en.wikipedia.org/wiki/Cheese</cite> <span>Wikipedia</span> <span>Loading...</span> <span class="st">Cheese is a food derived from milk that is produced in a wide range of flavors, textures, and forms by coagulation of the milk protein casein. It comprises proteins ...</span><div class="osl"><a class="fl">Etymology</a> · <a class="fl">History</a> · <a class="fl">Production</a> · <a class="fl">Processing</a></div></div></div>
9 | 10 | 11 | -------------------------------------------------------------------------------- /tests/data/parsed.tsv: -------------------------------------------------------------------------------- 1 | Cheese - Wikipedia, the free encyclopedia https://en.wikipedia.org/wiki/Cheese Cheese is a food derived from milk that is produced in a wide range of flavors, textures, and forms by coagulation of the milk protein casein. It comprises proteins ... Etymology;History;Production;Processing 2 | Cheese.com - World's Greatest Cheese Resource http://www.cheese.com/ Everything you want to know about cheese. Includes search features. 3 | Alphabetical list of cheeses - Cheese.com http://www.cheese.com/alphabetical/ All cheeses in alphabetical order. ... Aged Cashew & Blue Green Algae Cheese. 1; 2; 3; 4; 5. Per page: 20, 40, 60, 80, 100. #. 2015 Worldnews, Inc. Privacy ... 4 | Murray's Cheese http://www.murrayscheese.com/ Imported and domestic cheeses, searchable by name or by country. Also offers gift baskets and certificates. 5 | Order Cheese Online | Fast Delivery by FreshDirect https://www.freshdirect.com/browse.jsp?id=che Visit our cheese counter to order high quality cheese favorites. Choose from a wide selection of best-selling brands and favorites from around the world. 6 | Murray's Cheese Bar http://www.murrayscheesebar.com/ ... Reservations Events Press Contact. cheese-bar-spring-menu-burrata-pea-pesto-280A7887. fried-chicken-sandwich-cheese-bar-stadium-280A8551.jpg. 7 | French cheese board http://frenchcheeseboard.com/ The French Cheese Board is also a fab lab of ideas, where people can participate in cooking lessons, wine and cheese pairings sessions, interactive ... 8 | Cabot Creamery: Cheddar Cheese & Other Dairy Products from Vermont https://www.cabotcheese.coop/ Cabot Cheese, owned & operated by real farmers, has been making award winning cheddar cheese & other dairy products since 1919. Learn More! 9 | No, cheese is not just like crack | Science News https://www.sciencenews.org/blog/scicurious/no-cheese-not-just-crack Cheese is a delicious invention. But if you saw the news last week, you might think it's on its way to being classified as a Schedule II drug. 10 | What is Big Block of Cheese Day? - The Atlantic http://www.theatlantic.com/politics/archive/2015/01/the-real-story-of-the-white-house-and-the-big-block-of-cheese/384676/ Wednesday, as you may have heard, is the White House's second annual Big Block of Cheese Day. Inspired by a fictional character in a ... 11 | After 75 Years, the Cheese Stands Alone - The New York Times http://www.nytimes.com/2016/07/17/nyregion/after-75-years-the-cheese-stands-alone.html A Brooklyn family's edible heirloom: a quarter wheel of cheese. ... That anything as perishable as cheese should survive 75 years is remarkable ... -------------------------------------------------------------------------------- /tests/data/tiny_dataset.json: -------------------------------------------------------------------------------- 1 | [ 2 | { 3 | "category": "HISTORY", 4 | "air_date": "2004-12-31", 5 | "question": "'For the last 8 years of his life, Galileo was under house arrest for espousing this man's theory'", 6 | "value": "$200", 7 | "answer": "Copernicus", 8 | "round": "Jeopardy!", 9 | "show_number": "4680" 10 | }, 11 | { 12 | "category": "ESPN's TOP 10 ALL-TIME ATHLETES", 13 | "air_date": "2004-12-31", 14 | "question": "'No. 
2: 1912 Olympian; football star at Carlisle Indian School; 6 MLB seasons with the Reds, Giants & Braves'", 15 | "value": "$200", 16 | "answer": "Jim Thorpe", 17 | "round": "Jeopardy!", 18 | "show_number": "4680" 19 | }, 20 | { 21 | "category": "EVERYBODY TALKS ABOUT IT...", 22 | "air_date": "2004-12-31", 23 | "question": "'The city of Yuma in this state has a record average of 4,055 hours of sunshine each year'", 24 | "value": "$200", 25 | "answer": "Arizona", 26 | "round": "Jeopardy!", 27 | "show_number": "4680" 28 | }, 29 | { 30 | "category": "THE COMPANY LINE", 31 | "air_date": "2004-12-31", 32 | "question": "'In 1963, live on \"The Art Linkletter Show\", this company served its billionth burger'", 33 | "value": "$200", 34 | "answer": "McDonald\\'s", 35 | "round": "Jeopardy!", 36 | "show_number": "4680" 37 | }, 38 | { 39 | "category": "EPITAPHS & TRIBUTES", 40 | "air_date": "2004-12-31", 41 | "question": "'Signer of the Dec. of Indep., framer of the Constitution of Mass., second President of the United States'", 42 | "value": "$200", 43 | "answer": "John Adams", 44 | "round": "Jeopardy!", 45 | "show_number": "4680" 46 | }, 47 | { 48 | "category": "3-LETTER WORDS", 49 | "air_date": "2004-12-31", 50 | "question": "'In the title of an Aesop fable, this insect shared billing with a grasshopper'", 51 | "value": "$200", 52 | "answer": "the ant", 53 | "round": "Jeopardy!", 54 | "show_number": "4680" 55 | }, 56 | { 57 | "category": "HISTORY", 58 | "air_date": "2004-12-31", 59 | "question": "'Built in 312 B.C. to link Rome & the South of Italy, it's still in use today'", 60 | "value": "$400", 61 | "answer": "the Appian Way", 62 | "round": "Jeopardy!", 63 | "show_number": "4680" 64 | } 65 | ] 66 | -------------------------------------------------------------------------------- /tests/main_test.py: -------------------------------------------------------------------------------- 1 | import os 2 | import sys 3 | sys.path.insert(0, os.path.abspath('..')) 4 | 5 | import pandas as pd 6 | 7 | from sr_parser import main as p 8 | 9 | 10 | def nlp_parsing(): 11 | text = ['What track on beautiful life has the same name as a country?', 12 | 'what kinds of movie is the desperate hours'] 13 | parsed_text = p.parser(text) 14 | all_results = p.subject_object_getter(parsed_text) 15 | for i, j in zip(all_results, text): 16 | i.append(j) 17 | df = pd.DataFrame(all_results, columns=[ 18 | 'subject', 'object', 'raw_question']) 19 | assert df.ix[0, 0][0] == 'track' 20 | -------------------------------------------------------------------------------- /tests/study_codes.py: -------------------------------------------------------------------------------- 1 | """ 2 | The codes in each function are meant to be run in a Python interpreter at project root folder. 3 | 4 | These are different than test modules. Here the goal is to have some variables in an interpreted to be 5 | analyzed interactively. For example, to create a SearchResult, or Entry object and see how it works. 
6 | 7 | In PyCharm, to run the source code from the editor in Python console: 8 | 1) Select code in editor 9 | 2) Choose "Execute selection in console" in context menu (ALT + SHIFT + E) 10 | 11 | Source: https://www.jetbrains.com/help/pycharm/2016.2/loading-code-from-editor-into-console.html 12 | """ 13 | 14 | def study_initialize_visit(): 15 | from selenium.webdriver.support.ui import WebDriverWait 16 | from selenium.webdriver.common.by import By 17 | from selenium.webdriver.support import expected_conditions as EC 18 | 19 | from qacrawler import driver_wrapper 20 | from qacrawler import google_dom_info 21 | from qacrawler import sr_parser 22 | 23 | sr_parser.gdom = google_dom_info.GoogleDomInfoWithoutJS 24 | driver = driver_wrapper.get_firefox_driver() 25 | driver_wrapper.disable_javascript_on_firefox(driver) 26 | 27 | sr_parser.visit_google(driver, 'foo') 28 | search_box = sr_parser.wait_for_and_get_search_box(driver) 29 | 30 | 31 | def study_google_preferences(): 32 | from qacrawler.google_dom_info import GoogleDomInfoWithoutJS as GDom 33 | from qacrawler import main 34 | 35 | google_preferences_page_url = 'http://www.google.com/preferences?hl=en' 36 | driver.get(google_preferences_page_url) 37 | 38 | main.set_number_of_results_per_page(driver, 20) 39 | 40 | 41 | def study_entry(): 42 | from crawler import jeopardy 43 | tiny_dataset = jeopardy.Dataset('tests/data/tiny_dataset.json') 44 | for no in range(len(tiny_dataset.data)): 45 | entry = tiny_dataset.get_entry(no) 46 | print entry.id, entry.tag, entry.question 47 | 48 | 49 | def study_next_page_links(): 50 | from crawler import parser 51 | from crawler import driver_wrapper 52 | 53 | driver = driver_wrapper.get_chrome_driver('/usr/local/bin/chromedriver') 54 | parser.visit_google_front_page(driver) 55 | search_box = parser.wait_for_and_get_search_box(driver) 56 | parser.submit_query('abidin', search_box) 57 | 58 | for n in range(2): 59 | parser.wait_for_search_results(driver) 60 | npu = parser.get_next_page_url(driver) 61 | print npu 62 | driver.get(npu) 63 | 64 | driver.quit() 65 | 66 | 67 | def study_parsing_result_divs(): 68 | import os 69 | from selenium.webdriver.common.by import By 70 | from crawler import parser 71 | from crawler import driver_wrapper 72 | from crawler import google_dom_info as gdom 73 | driver = driver_wrapper.get_chrome_driver('/usr/local/bin/chromedriver') 74 | 75 | html_file = 'tests/data/cheese%20-%20Google%20Search.html' 76 | local_file_url = 'file://' + os.path.abspath(html_file) 77 | driver.get(local_file_url) 78 | 79 | results = parser.get_search_result_divs(driver) 80 | 81 | for element in results: 82 | h3 = element.find_element(By.CLASS_NAME, gdom.RESULT_TITLE_CLASS) 83 | anchor = h3.find_element(By.TAG_NAME, 'a') # tag: a 84 | url = anchor.get_attribute('href') 85 | print url 86 | 87 | driver.quit() 88 | 89 | 90 | def study_logging(): 91 | """ 92 | Log message formatting items 93 | https://docs.python.org/2/library/logging.html#logrecord-attributes 94 | :return: 95 | """ 96 | import logging 97 | reload(logging) 98 | log_attributes = ['levelname', 'asctime', 'module', 'funcName', 'message'] 99 | log_format = ':'.join(['%%(%s)s' % attr for attr in log_attributes]) 100 | logging.basicConfig(filename='example.log', 101 | format=log_format, 102 | level=logging.DEBUG) 103 | logging.info('started.') 104 | logging.debug('analyzing this...') 105 | 106 | 107 | def study_command_line(): 108 | pass 109 | # Study logging level 110 | # python -c 'import main; print main.parse_command_line_arguments()' 
--log-level=DEBUG -j='aa' 111 | 112 | 113 | def misc(): 114 | from crawler import jeopardy 115 | from crawler import main 116 | dataset = jeopardy.Dataset('tests/data/tiny_dataset.json') 117 | gen = main.get_entries_to_search(dataset, 0, len(dataset.data)) 118 | -------------------------------------------------------------------------------- /tests/test_crawler.py: -------------------------------------------------------------------------------- 1 | import os 2 | import sys 3 | sys.path.insert(0, os.path.abspath('..')) 4 | 5 | from qacrawler import driver_wrapper 6 | from qacrawler import sr_parser as parser2 7 | from qacrawler import crawler 8 | from qacrawler import main 9 | 10 | 11 | def test_result_div_parsing(): 12 | driver = driver_wrapper.get_chrome_driver() 13 | html_file = 'data/one_result.html' 14 | local_file_url = 'file://' + os.path.abspath(html_file) 15 | driver.get(local_file_url) 16 | 17 | result_divs = parser2.get_search_result_divs(driver) 18 | 19 | result_div = result_divs[0] 20 | print 'result_div', result_div 21 | sr = parser2.SearchResult(result_div) 22 | assert sr.title == 'Cheese - Wikipedia, the free encyclopedia' 23 | assert sr.url == 'https://en.wikipedia.org/wiki/Cheese' 24 | assert sr.snippet == 'Cheese is a food derived from milk that is produced in a wide range of flavors, textures, and forms by coagulation of the milk protein casein. It comprises proteins ...' 25 | assert sr.related_links == ['Etymology', 'History', 'Production', 'Processing'] 26 | assert str(sr) == 'Cheese - Wikipedia, the free encyclopedia\thttps://en.wikipedia.org/wiki/Cheese\tCheese is a food derived from milk that is produced in a wide range of flavors, textures, and forms by coagulation of the milk protein casein. It comprises proteins ...\tEtymology;History;Production;Processing' 27 | 28 | driver.quit() 29 | 30 | 31 | def test_page_parsing(html_file='data/cheese%20-%20Google%20Search.html', parsed_file='data/parsed.tsv'): 32 | driver = driver_wrapper.get_chrome_driver() 33 | local_file_url = 'file://' + os.path.abspath(html_file) 34 | driver.get(local_file_url) 35 | 36 | parsed_results = parser2.parse_opened_results_page(driver) 37 | formatted = crawler.results_list_to_tsv(parsed_results) 38 | 39 | with open(parsed_file, 'rt') as f: 40 | from_file = f.read() 41 | 42 | assert formatted == from_file 43 | 44 | driver.quit() 45 | -------------------------------------------------------------------------------- /tests/test_jeopardy.py: -------------------------------------------------------------------------------- 1 | import os 2 | import sys 3 | sys.path.insert(0, os.path.abspath('..')) 4 | 5 | from crawler import jeopardy 6 | 7 | DATASET_PATH = os.path.join(os.path.dirname(__file__), 'data', 'tiny_dataset.json') 8 | DATASET = jeopardy.Dataset(DATASET_PATH) 9 | 10 | 11 | def test_loading(): 12 | dataset_size = 7 13 | assert len(DATASET.data) == dataset_size 14 | first_entry_dict = DATASET.data[0] 15 | assert first_entry_dict['answer'] == 'Copernicus' 16 | 17 | 18 | def test_getting_entry(): 19 | entry_no = 0 20 | entry = DATASET.get_entry(entry_no) 21 | assert isinstance(entry, jeopardy.Entry) 22 | assert entry.id == entry_no 23 | assert entry.answer == 'Copernicus' 24 | --------------------------------------------------------------------------------