├── LICENSE
├── README.md
└── openie_spider.py

/LICENSE:
--------------------------------------------------------------------------------
MIT License

Copyright (c) 2017 Ziyang Liao

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# Stanford-OpenIE-Spider
Extract information from a web corpus using Stanford Open Information Extraction.

## About Stanford IE

Open information extraction (open IE) refers to the extraction of structured relation triples from plain text, such that the schema for these relations does not need to be specified in advance. For example, the sentence "Barack Obama was born in Hawaii" would create a triple (Barack Obama; was born in; Hawaii), corresponding to the open domain relation "was born in".
Stanford OpenIE is a Java implementation of an open IE system as described in the paper:

Gabor Angeli, Melvin Johnson Premkumar, and Christopher D. Manning. Leveraging Linguistic Structure For Open Domain Information Extraction. In Proceedings of the Association of Computational Linguistics (ACL), 2015.

The system first splits each sentence into a set of entailed clauses. Each clause is then maximally shortened, producing a set of entailed shorter sentence fragments. These fragments are then segmented into OpenIE triples, and output by the system.

More information can be found here: http://nlp.stanford.edu/software/openie.html

## About Open Information Extraction (http://openie.allenai.org/)

This is a web service that implements information extraction from a web corpus using Stanford IE. You can search the corpus for relations on this website.

## Usage

First of all, make sure Python is installed.

```
python --version
```

Install scrapy.

```
sudo pip install scrapy
```
You can also follow the official installation documentation: https://doc.scrapy.org/en/latest/intro/install.html

Install beautifulsoup4.
```
sudo pip install beautifulsoup4
```

Clone the repository locally.

```
git clone git@github.com:liaoziyang/Stanford-OpenIE-Spider.git
cd Stanford-OpenIE-Spider
```

### Arguments

Pass at least one of the parameters below, each with the `-a` option.

- `args1`: Noun on the left side of the relationship. Default: empty.
- `rel`: The relationship. Default: empty.
- `args2`: Noun on the right side of the relationship. Default: empty.

You can write the result to a file with the `-o` option.

### Example

Extract the information "What kills bacteria?" from the web corpus.

```
scrapy runspider -a rel=kills -a args2=bacteria openie_spider.py -o result.json
```
The result in `result.json` looks like the following. Each `key` is the string representation of a result and each `value` is the number of times that result appeared in the web corpus.

```json
[
    {"Antibiotic": 165},
    {"Chlorine": 76},
    {"Water": 59},
    {"Benzoyl peroxide": 45},
    {"Heat": 40},
    {"Antiseptic": 38},
    {"Pasteurization": 35},
    {"Cooking": 34},
    {"Vinegar": 34},
    {"Honey": 28},
    {"Tea tree oil": 24},
    {"Ultraviolet": 24},
    {"Alcohol": 24},
    {"The process": 21},
    {"This drug": 21},
    {"the oil": 19},
    {"the skin": 17},
    {"Food": 16},
    {"chemicals": 15},
    {"Cell (biology)": 14},
    {"Isoniazid": 2},
    {"sebum production": 2},
    {"the UV light feature": 2},
    {"Lidocaine": 2},
    {"Patent-pending germicidal chamber": 2},
    {"Titanium dioxide": 2},
    {"shedding": 2},
    {"Gut flora": 2},
    {"six antiseptic agents": 2},
    {"The warm salt water": 2},
    {"Chlorhexidine": 2},
    {"bitter component": 2},
    {"a small anti-microbial bomb": 2},
    {"an herb": 2},
    {"Cotton": 2},
    {"a day": 2},
    {"the microwave": 2},
    {"cancer cells": 2},
    {"Food irradiation": 2},
    {"Virus": 2},
    {"Studies": 2},
    {"Polymer": 2},
    {"Active ingredient": 2},
    {"Electron": 2},
    {"Freezing": 2},
    {"Fahrenheit": 2},
    {"These peptides": 2},
    {"Clarithromycin": 2},
    {"redness": 2},
    {"Saltwater": 2},
    {"BP": 2},
    {"White blood cell": 2},
    {"a salve": 2},
    {"the soap": 2},
    {"Spice": 2},
    {"Egg (food)": 2},
    {"most toiletries": 2},
    {"both": 2},
    {"Neem": 2},
    {"a role": 2},
    {"Chamomile": 2},
    {"Clindamycin": 2},
    {"the formation": 2},
    {"humans": 2},
    {"three minutes": 2},
    {"Sparfloxacin": 2},
    {"one minute": 2},
    {"Burning": 2},
    {"High heat": 2},
    {"A sanitizer test kit": 2},
    {"Air purifier": 2},
    {"the teeth and gums": 2},
    {"Intestine": 2},
    {"Concentration": 2},
    {"Juice": 2},
    {"Your goal": 2},
    {"Sodium": 2},
    {"Zinc": 2},
    {"The face": 3},
    {"Swimming pool": 3},
    {"Vancomycin": 3},
    {"Alcoholic beverage": 3},
    {"Urine": 3},
    {"Nitric oxide": 3},
    {"Fever": 3},
    {"This combination": 3},
    {"The wine": 3},
    {"the heart": 3},
    {"Electricity": 3},
    {"lemon": 3},
    {"soothes": 3},
    {"Phagocyte": 3},
    {"devices": 3},
    {"The Aprilaire 5000 Air Cleaner": 3},
    {"Tanning bed": 2},
    {"dirt and dust": 2},
    {"These rinses": 2},
    {"This medication": 2},
    {"Bacteriophage": 3},
    {"Toothpaste": 3},
    {"chemicals and particles": 3},
    {"Cell wall": 3},
    {"Amoxicillin": 3},
    {"Blood pressure": 3},
    {"Miphil": 3},
    {"Roasting": 3},
    {"plaque": 3},
    {"all": 3},
    {"the tight-lidded pot": 3},
    {"UV germicidal light": 3},
    {"The idea": 3},
    {"Azelaic acid": 3},
    {"energy": 3},
    {"Sulfur": 3},
    {"Ammonium hydroxide": 3},
    {"The purpose": 3},
    {"these beds": 3},
    {"160 degrees": 3},
    {"Tea": 4},
    {"Macrophage": 4},
    {"Apple cider vinegar": 4},
    {"Sodium bicarbonate": 4},
    {"Additives": 4},
    {"Acne vulgaris": 4},
    {"Stomach": 4},
    {"Antibacterial soap": 4},
    {"Citric acid": 4},
    {"15 seconds": 4},
    {"Infection": 4},
    {"Iodine": 4},
    {"Disinfectant": 4},
    {"Bactericidal antibiotics": 4},
    {"Coconut oil": 4},
    {"the same time": 4},
    {"formula": 4},
    {"Levofloxacin": 4},
    {"The Lampe Berger": 4},
    {"Nanoparticle": 3},
    {"Saliva": 6},
    {"Ion": 6},
    {"the acidity": 6},
    {"Odor": 6},
    {"your body": 5},
    {"Acai": 5},
    {"Hair": 5},
    {"Protein": 5},
    {"Meat": 5},
    {"Cabbage": 5},
    {"Clothes dryer": 5},
    {"Sebaceous gland": 5},
    {"Lemon juice": 5},
    {"Laser": 5},
    {"Water filter": 5},
    {"Wound": 5},
    {"action": 5},
    {"Antimicrobial": 4},
    {"Antibody": 4},
    {"Aloe": 4},
    {"Immune system": 9},
    {"Inflammation": 9},
    {"Ingredient": 8},
    {"Bactericide": 8},
    {"Toxin": 8},
    {"These medicines": 8},
    {"Tap water": 8},
    {"Neutrophil granulocyte": 7},
    {"Salicylic acid": 7},
    {"Colloidal silver": 7},
    {"the treatment": 6},
    {"Sunlight": 6},
    {"Copper": 6},
    {"Two-week triple therapy": 6},
    {"Essential oil": 6},
    {"Chemotherapy": 6},
    {"the solution": 6},
    {"Penicillin": 6},
    {"Radical (chemistry)": 6},
    {"Hydrogen peroxide": 6},
    {"the sun": 14},
    {"Oxygen": 14},
    {"the light": 14},
    {"the product": 14},
    {"Irradiation": 12},
    {"pores": 12},
    {"Salt": 12},
    {"Bleach (manga)": 11},
    {"Temperature": 11},
    {"Silver": 11},
    {"NOT": 11},
    {"the ability": 11},
    {"Ozone": 11},
    {"Milk": 10},
    {"Bleach": 10},
    {"Garlic": 9},
    {"the growth": 9},
    {"numerous active agents": 9},
    {"Boiling": 9},
    {"the steam": 9}
]
```

--------------------------------------------------------------------------------
/openie_spider.py:
--------------------------------------------------------------------------------
import re

import scrapy
from bs4 import BeautifulSoup  # provided by the beautifulsoup4 package


class OpenieSpider(scrapy.Spider):
    name = 'openie_spider'
    custom_settings = {
        "DOWNLOAD_DELAY": 0.5,
    }

    def __init__(self, args1='', rel='', args2='', *args, **kwargs):
        super(OpenieSpider, self).__init__(*args, **kwargs)
        # Build the search URL from the values passed with -a on the command line.
        self.start_url = 'http://openie.allenai.org/search/?arg1=%s&rel=%s&arg2=%s' % (args1, rel, args2)
        self.start_urls = [self.start_url]

    def parse(self, response):
        # The stats block reports the total answer count; keep the last integer found in it.
        answer_count = 0
        stats = response.xpath("//div[@id='stats']/b[1]/text()").extract_first(default='')
        for s in stats.split():
            if s.isdigit():
                answer_count = int(s)
        # The spider assumes 20 answers per page; use integer division.
        page_count = answer_count // 20

        relationships = response.xpath("//*[contains(@href, '#L')]/span/text()").extract()
        count_texts = response.xpath("//*[contains(@href, '#L')]/text()").extract()

        # Keep only the text nodes that carry a "(N)" frequency.
        pattern = re.compile(r"\((\d+)\)")
        scores = []
        for count_text in count_texts:
            found = pattern.findall(BeautifulSoup(count_text, "html.parser").text)
            if found:
                scores.append(found)

        # Emit one {result: frequency} item per relationship.
        for relationship, score in zip(relationships, scores):
            yield {
                BeautifulSoup(relationship, "html.parser").text: int(score[0])
            }

        # Queue the remaining result pages; Scrapy's duplicate filter drops repeated URLs.
        for page in range(1, page_count + 1):
            next_page_url = self.start_url + "&page=%s" % page
            yield scrapy.Request(response.urljoin(next_page_url))

--------------------------------------------------------------------------------
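
The spider above builds its search URL with plain string formatting, so a value containing spaces or `&` (for example `tea tree oil`) is sent unescaped. As a minimal sketch, assuming the same endpoint and the same 20-answers-per-page division used in `openie_spider.py`, the URL and the follow-up page URLs could be built as below. The helper names `build_search_url` and `page_urls` and the constant `PAGE_SIZE` are hypothetical and not part of the repository.

```python
from urllib.parse import urlencode

# Endpoint taken from openie_spider.py.
BASE_URL = "http://openie.allenai.org/search/"
# Assumption: 20 answers per page, mirroring the spider's division by 20.
PAGE_SIZE = 20


def build_search_url(args1="", rel="", args2=""):
    """Return the search URL with every query value percent-encoded."""
    query = urlencode({"arg1": args1, "rel": rel, "arg2": args2})
    return "%s?%s" % (BASE_URL, query)


def page_urls(search_url, answer_count, page_size=PAGE_SIZE):
    """Return follow-up page URLs, numbered the same way as the spider's loop."""
    page_count = answer_count // page_size
    return ["%s&page=%d" % (search_url, page) for page in range(1, page_count + 1)]


if __name__ == "__main__":
    url = build_search_url(rel="kills", args2="bacteria")
    print(url)
    for next_url in page_urls(url, answer_count=45):
        print(next_url)
```

Keeping URL construction in small pure functions makes it easy to check the encoding without running a crawl; the spider's `__init__` could call such a helper instead of formatting the string itself.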