├── IDEAS_AND_RESULTS.txt ├── LICENSE ├── README ├── data ├── climate-single-word.seed ├── climate.seed ├── hypernyms.climate2_2015_7.txt.2gram.model ├── models │   └── climate2_2015_7.txt.2gram.small.model ├── old │   ├── our-orig-cc.db │   └── our.db ├── our.db └── to_check │   └── old │   └── climate_2015_7_single_filtered.binstep_3.NEW └── src ├── analyse_db.py ├── config.py ├── db.py ├── fs_helpers.py ├── old ├── CHECK_ME ├── NEW ├── create_and_load_model.py ├── create_tables.sql └── minic_judge_terms.py └── run_ontology_learning.py /IDEAS_AND_RESULTS.txt: -------------------------------------------------------------------------------- 1 | 2 | Copy our ontology learning with word2vec 3 | Step 1: copy the term extraction process 4 | Step 2: do taxonomy extraction 5 | 6 | 7 | Implementation: 8 | a) basically copy what our existing ontology learning system does. 9 | - starting from a seed concept, 3 iterations in which 25 term candidates are generated 10 | - manual validation of term candidates, evaluations are saved and re-used (via an sqlite3 database table) 11 | 12 | TODO: - do a relation type detection algorithm with word2vec -- in a similar fashion as the word2vec is-A relation extraction 13 | - compare to existing approaches (e.g. the one described in Wohlgenannt's PhD thesis / http://eprints.weblyzard.com/18/1/weichselbraun2010.pdf ) 14 | 15 | Results: 16 | - taxonomy extraction: typical tradeoff between precision and recall -- accepting a lower threshold 17 | yields more suggestions (in the list), but at lower precision 18 | - taxonomy extraction: the direction of the isA relation is often not correct. 19 | Thoughts: not good enough by itself, but combined with other existing taxonomy methods? 20 | Thoughts: well suited to extract unlabeled relations 21 | 22 | - term extraction: without stemming / lemmatization, the singular and plural of the same term are often both suggested. 23 | general problem (but expected): generated terms are very close (often too close?)
to seed terms 24 | 25 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | Apache License 2 | Version 2.0, January 2004 3 | http://www.apache.org/licenses/ 4 | 5 | TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION 6 | 7 | 1. Definitions. 8 | 9 | "License" shall mean the terms and conditions for use, reproduction, 10 | and distribution as defined by Sections 1 through 9 of this document. 11 | 12 | "Licensor" shall mean the copyright owner or entity authorized by 13 | the copyright owner that is granting the License. 14 | 15 | "Legal Entity" shall mean the union of the acting entity and all 16 | other entities that control, are controlled by, or are under common 17 | control with that entity. For the purposes of this definition, 18 | "control" means (i) the power, direct or indirect, to cause the 19 | direction or management of such entity, whether by contract or 20 | otherwise, or (ii) ownership of fifty percent (50%) or more of the 21 | outstanding shares, or (iii) beneficial ownership of such entity. 22 | 23 | "You" (or "Your") shall mean an individual or Legal Entity 24 | exercising permissions granted by this License. 25 | 26 | "Source" form shall mean the preferred form for making modifications, 27 | including but not limited to software source code, documentation 28 | source, and configuration files. 29 | 30 | "Object" form shall mean any form resulting from mechanical 31 | transformation or translation of a Source form, including but 32 | not limited to compiled object code, generated documentation, 33 | and conversions to other media types. 34 | 35 | "Work" shall mean the work of authorship, whether in Source or 36 | Object form, made available under the License, as indicated by a 37 | copyright notice that is included in or attached to the work 38 | (an example is provided in the Appendix below). 
39 | 40 | "Derivative Works" shall mean any work, whether in Source or Object 41 | form, that is based on (or derived from) the Work and for which the 42 | editorial revisions, annotations, elaborations, or other modifications 43 | represent, as a whole, an original work of authorship. For the purposes 44 | of this License, Derivative Works shall not include works that remain 45 | separable from, or merely link (or bind by name) to the interfaces of, 46 | the Work and Derivative Works thereof. 47 | 48 | "Contribution" shall mean any work of authorship, including 49 | the original version of the Work and any modifications or additions 50 | to that Work or Derivative Works thereof, that is intentionally 51 | submitted to Licensor for inclusion in the Work by the copyright owner 52 | or by an individual or Legal Entity authorized to submit on behalf of 53 | the copyright owner. For the purposes of this definition, "submitted" 54 | means any form of electronic, verbal, or written communication sent 55 | to the Licensor or its representatives, including but not limited to 56 | communication on electronic mailing lists, source code control systems, 57 | and issue tracking systems that are managed by, or on behalf of, the 58 | Licensor for the purpose of discussing and improving the Work, but 59 | excluding communication that is conspicuously marked or otherwise 60 | designated in writing by the copyright owner as "Not a Contribution." 61 | 62 | "Contributor" shall mean Licensor and any individual or Legal Entity 63 | on behalf of whom a Contribution has been received by Licensor and 64 | subsequently incorporated within the Work. 65 | 66 | 2. Grant of Copyright License. 
Subject to the terms and conditions of 67 | this License, each Contributor hereby grants to You a perpetual, 68 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable 69 | copyright license to reproduce, prepare Derivative Works of, 70 | publicly display, publicly perform, sublicense, and distribute the 71 | Work and such Derivative Works in Source or Object form. 72 | 73 | 3. Grant of Patent License. Subject to the terms and conditions of 74 | this License, each Contributor hereby grants to You a perpetual, 75 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable 76 | (except as stated in this section) patent license to make, have made, 77 | use, offer to sell, sell, import, and otherwise transfer the Work, 78 | where such license applies only to those patent claims licensable 79 | by such Contributor that are necessarily infringed by their 80 | Contribution(s) alone or by combination of their Contribution(s) 81 | with the Work to which such Contribution(s) was submitted. If You 82 | institute patent litigation against any entity (including a 83 | cross-claim or counterclaim in a lawsuit) alleging that the Work 84 | or a Contribution incorporated within the Work constitutes direct 85 | or contributory patent infringement, then any patent licenses 86 | granted to You under this License for that Work shall terminate 87 | as of the date such litigation is filed. 88 | 89 | 4. Redistribution. 
You may reproduce and distribute copies of the 90 | Work or Derivative Works thereof in any medium, with or without 91 | modifications, and in Source or Object form, provided that You 92 | meet the following conditions: 93 | 94 | (a) You must give any other recipients of the Work or 95 | Derivative Works a copy of this License; and 96 | 97 | (b) You must cause any modified files to carry prominent notices 98 | stating that You changed the files; and 99 | 100 | (c) You must retain, in the Source form of any Derivative Works 101 | that You distribute, all copyright, patent, trademark, and 102 | attribution notices from the Source form of the Work, 103 | excluding those notices that do not pertain to any part of 104 | the Derivative Works; and 105 | 106 | (d) If the Work includes a "NOTICE" text file as part of its 107 | distribution, then any Derivative Works that You distribute must 108 | include a readable copy of the attribution notices contained 109 | within such NOTICE file, excluding those notices that do not 110 | pertain to any part of the Derivative Works, in at least one 111 | of the following places: within a NOTICE text file distributed 112 | as part of the Derivative Works; within the Source form or 113 | documentation, if provided along with the Derivative Works; or, 114 | within a display generated by the Derivative Works, if and 115 | wherever such third-party notices normally appear. The contents 116 | of the NOTICE file are for informational purposes only and 117 | do not modify the License. You may add Your own attribution 118 | notices within Derivative Works that You distribute, alongside 119 | or as an addendum to the NOTICE text from the Work, provided 120 | that such additional attribution notices cannot be construed 121 | as modifying the License. 
122 | 123 | You may add Your own copyright statement to Your modifications and 124 | may provide additional or different license terms and conditions 125 | for use, reproduction, or distribution of Your modifications, or 126 | for any such Derivative Works as a whole, provided Your use, 127 | reproduction, and distribution of the Work otherwise complies with 128 | the conditions stated in this License. 129 | 130 | 5. Submission of Contributions. Unless You explicitly state otherwise, 131 | any Contribution intentionally submitted for inclusion in the Work 132 | by You to the Licensor shall be under the terms and conditions of 133 | this License, without any additional terms or conditions. 134 | Notwithstanding the above, nothing herein shall supersede or modify 135 | the terms of any separate license agreement you may have executed 136 | with Licensor regarding such Contributions. 137 | 138 | 6. Trademarks. This License does not grant permission to use the trade 139 | names, trademarks, service marks, or product names of the Licensor, 140 | except as required for reasonable and customary use in describing the 141 | origin of the Work and reproducing the content of the NOTICE file. 142 | 143 | 7. Disclaimer of Warranty. Unless required by applicable law or 144 | agreed to in writing, Licensor provides the Work (and each 145 | Contributor provides its Contributions) on an "AS IS" BASIS, 146 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or 147 | implied, including, without limitation, any warranties or conditions 148 | of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A 149 | PARTICULAR PURPOSE. You are solely responsible for determining the 150 | appropriateness of using or redistributing the Work and assume any 151 | risks associated with Your exercise of permissions under this License. 152 | 153 | 8. Limitation of Liability. 
In no event and under no legal theory, 154 | whether in tort (including negligence), contract, or otherwise, 155 | unless required by applicable law (such as deliberate and grossly 156 | negligent acts) or agreed to in writing, shall any Contributor be 157 | liable to You for damages, including any direct, indirect, special, 158 | incidental, or consequential damages of any character arising as a 159 | result of this License or out of the use or inability to use the 160 | Work (including but not limited to damages for loss of goodwill, 161 | work stoppage, computer failure or malfunction, or any and all 162 | other commercial damages or losses), even if such Contributor 163 | has been advised of the possibility of such damages. 164 | 165 | 9. Accepting Warranty or Additional Liability. While redistributing 166 | the Work or Derivative Works thereof, You may choose to offer, 167 | and charge a fee for, acceptance of support, warranty, indemnity, 168 | or other liability obligations and/or rights consistent with this 169 | License. However, in accepting such obligations, You may act only 170 | on Your own behalf and on Your sole responsibility, not on behalf 171 | of any other Contributor, and only if You agree to indemnify, 172 | defend, and hold each Contributor harmless for any liability 173 | incurred by, or claims asserted against, such Contributor by reason 174 | of your accepting any such warranty or additional liability. 175 | 176 | END OF TERMS AND CONDITIONS 177 | 178 | APPENDIX: How to apply the Apache License to your work. 179 | 180 | To apply the Apache License to your work, attach the following 181 | boilerplate notice, with the fields enclosed by brackets "{}" 182 | replaced with your own identifying information. (Don't include 183 | the brackets!) The text should be enclosed in the appropriate 184 | comment syntax for the file format. 
We also recommend that a 185 | file or class name and description of purpose be included on the 186 | same "printed page" as the copyright notice for easier 187 | identification within third-party archives. 188 | 189 | Copyright {yyyy} {name of copyright owner} 190 | 191 | Licensed under the Apache License, Version 2.0 (the "License"); 192 | you may not use this file except in compliance with the License. 193 | You may obtain a copy of the License at 194 | 195 | http://www.apache.org/licenses/LICENSE-2.0 196 | 197 | Unless required by applicable law or agreed to in writing, software 198 | distributed under the License is distributed on an "AS IS" BASIS, 199 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 200 | See the License for the specific language governing permissions and 201 | limitations under the License. 202 | -------------------------------------------------------------------------------- /README: -------------------------------------------------------------------------------- 1 | 2 | Dear user, 3 | 4 | here you find most of the code and results from the ISWC 2016 poster submission 5 | "Using word2vec to Build a Simple Ontology Learning System -- Gerhard Wohlgenannt and Filip Minic". 6 | 7 | 8 | src: contains the source code used. 9 | run_ontology_extension.py -- starts the word2vec-based ontology learning tasks 10 | db.py -- contains most of the functions 11 | 12 | 13 | data: 14 | The sqlite3 database named "our.db" contains the results from the concept candidate generation. 15 | The columns are as follows: 16 | term .. the term suggested by word2vec 17 | corpus .. which corpus/word2vec model was used 18 | step .. in which of the three iterations of ontology extension the term was generated 19 | relevant .. was the term judged as relevant by the domain experts? 20 | 21 | 22 | Furthermore, the directory contains the two seed ontologies for the unigram and bigram case. 23 | Other results, e.g. the hypernym pairs, are also found here.
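The database layout described above can be exercised with a short sqlite3 sketch. The column names follow the README's description (term, corpus, step, relevant); the table name and all example rows below are invented for illustration -- the real judgements live in data/our.db:

```python
import sqlite3

# In-memory stand-in for data/our.db; swap ":memory:" for the real path.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE data (term TEXT, corpus TEXT, step INT, relevant BOOLEAN)")

# Invented example rows: (term, corpus, step, relevant)
rows = [("climate_change", "demo", 1, 1),
        ("global_warming", "demo", 1, 1),
        ("weather",        "demo", 1, 0),
        ("emissions",      "demo", 2, 1)]
cur.executemany("INSERT INTO data VALUES (?, ?, ?, ?)", rows)

# Fraction of suggested terms judged relevant, per extension step --
# the same kind of number that analyse_db.py reports.
cur.execute("""SELECT step, SUM(relevant) * 1.0 / COUNT(*)
               FROM data GROUP BY step ORDER BY step""")
for step, frac in cur.fetchall():
    print("step %d: %.2f relevant" % (step, frac))
```

Note the use of `?` placeholders rather than string interpolation; the repository's own queries mix both styles.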
24 | 25 | We include a word2vec model, which you can use to start your experiments, but for more in-depth usage you 26 | will want to use your own models. The provided model is a small one, in order to stay within the 100MB file size limit of github.com. 27 | The dummy model is found at data/models/climate2_2015_7.txt.2gram.small.model 28 | 29 | 30 | ---------------------------------------- 31 | 32 | USAGE: 33 | 34 | step 1) configuration in config.py 35 | a) copy your word2vec model into data/models 36 | b) set the name of your word2vec model in conf['model'] 37 | c) set up seed terms (for example, 2) for your ontology <-> this depends on whether you are using a bigram or unigram model, see the next step 38 | d) If you use a bigram word2vec model --> use the corresponding configs in config.py for 39 | conf['seedfn'] 40 | conf['num_taxomy_best'] 41 | conf['sim_threshold'] 42 | 43 | 44 | step 2) 45 | a) python run_ontology_extension.py 46 | this collects new terms for your ontology 47 | manually validate the terms after every ontology extension step (i.e. after every call of "python run_ontology_extension.py") 48 | 49 | --> go into "data/to_check" and edit the "...CHECK_ME" file -- remove all terms from the file that are not relevant to the domain 50 | --> when done, rename the file to "...NEW" (that is, replace CHECK_ME with NEW) 51 | b) go back to step a) -- after 3 iterations (configurable) the process is finished 52 | 53 | --> after the 3rd iteration the system automatically tries to detect hypernymy relations between the terms using word2vec (analogy feature) 54 | this is very experimental, precision is not very high 55 | 56 | step 3) you can run "python analyse_db.py" to get accuracy numbers for the term detection service (according to the manual evaluations) 57 | 58 | 59 | -------------------------------------------------------------------------------- /data/climate-single-word.seed: -------------------------------------------------------------------------------- 1 | climate 2 | change 3 | global 4 |
warming 5 | -------------------------------------------------------------------------------- /data/climate.seed: -------------------------------------------------------------------------------- 1 | climate_change 2 | global_warming 3 | -------------------------------------------------------------------------------- /data/hypernyms.climate2_2015_7.txt.2gram.model: -------------------------------------------------------------------------------- 1 | global_warming isA climate_change 0.530227482319 2 | climate isA climate_change 0.609103441238 3 | climate_change isA global_warming 0.599485456944 4 | global_warming isA climate 0.469211667776 5 | climate_change isA climate 0.648139595985 6 | global_warming isA warming 0.472602009773 7 | emissions isA greenhouse_gas 0.419310569763 8 | greenhouse_gases isA greenhouse_gas 0.518722057343 9 | methane_emissions isA greenhouse_gas 0.598697304726 10 | carbon_emissions isA emissions 0.472410559654 11 | co_emissions isA emissions 0.682210624218 12 | methane_emissions isA emissions 0.536791503429 13 | emissions isA carbon_emissions 0.422982931137 14 | carbon_pollution isA carbon_emissions 0.483261227608 15 | methane_emissions isA carbon_emissions 0.458557844162 16 | fossil_fuel isA fossil_fuels 0.432914733887 17 | carbon_dioxide isA greenhouse_gases 0.44994315505 18 | reduce_greenhouse isA gas_emissions 0.427438735962 19 | unfccc isA framework_convention 0.552678823471 20 | greenhouse_gases isA heat_trapping 0.478779554367 21 | carbon_emissions isA co_emissions 0.414626151323 22 | ghg_emissions isA co_emissions 0.49120503664 23 | methane_emissions isA co_emissions 0.460364580154 24 | greenhouse_gases isA carbon_dioxide 0.485840529203 25 | carbon_dioxide isA carbon 0.448072195053 26 | carbon_pollution isA emissions_reductions 0.489338219166 27 | co_emissions isA ghg_emissions 0.539622783661 28 | methane_emissions isA ghg_emissions 0.520867884159 29 | greenhouse_gases isA ghgs 0.480785787106 30 | ghgs isA reducing_carbon 
0.408048093319 31 | carbon_pollution isA reduction_targets 0.417098611593 32 | framework_convention isA kyoto_protocol 0.407713919878 33 | atmospheric_carbon isA dioxide 0.472179561853 34 | carbon_pollution isA reduce_emissions 0.458016932011 35 | methane_emissions isA reduce_emissions 0.400881558657 36 | emissions_reductions isA emission_reduction 0.496207267046 37 | framework_convention isA unfccc 0.479127287865 38 | ghg_emissions isA ghg 0.448527961969 39 | dioxide isA regulate_carbon 0.446149080992 40 | reduce_emissions isA reducing_emissions 0.471959024668 41 | gas_emissions isA reduce_greenhouse 0.574008405209 42 | methane_emissions isA emitters 0.428135544062 43 | emissions_reductions isA emissions_reduction 0.454973936081 44 | carbon isA gtco 0.413796901703 45 | greenhouse_gases isA emission 0.431531518698 46 | hfcs isA hydrofluorocarbons_hfcs 0.402177661657 47 | potent_greenhouse isA hydrofluorocarbons_hfcs 0.499326348305 48 | cutting_greenhouse isA reducing_greenhouse 0.430492818356 49 | gas_emissions isA reducing_greenhouse 0.536002457142 50 | carbon_pollution isA carbon_emission 0.405747771263 51 | ghgs isA greenhouse_gasses 0.459626466036 52 | methane_pollution isA methane_emissions 0.442805022001 53 | methane_emissions isA methane_pollution 0.413904160261 54 | gas_emissions isA cutting_greenhouse 0.480299115181 55 | dioxide isA atmospheric_carbon 0.484083265066 56 | fossil_fuels isA fossil_fuel 0.409488976002 57 | -------------------------------------------------------------------------------- /data/models/climate2_2015_7.txt.2gram.small.model: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/gwohlgen/w2v_ol/bb67a3a1b54ca756cff1f28ef5ca5842e8a9c91d/data/models/climate2_2015_7.txt.2gram.small.model -------------------------------------------------------------------------------- /data/old/our-orig-cc.db: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/gwohlgen/w2v_ol/bb67a3a1b54ca756cff1f28ef5ca5842e8a9c91d/data/old/our-orig-cc.db -------------------------------------------------------------------------------- /data/old/our.db: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/gwohlgen/w2v_ol/bb67a3a1b54ca756cff1f28ef5ca5842e8a9c91d/data/old/our.db -------------------------------------------------------------------------------- /data/our.db: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/gwohlgen/w2v_ol/bb67a3a1b54ca756cff1f28ef5ca5842e8a9c91d/data/our.db -------------------------------------------------------------------------------- /data/to_check/old/climate_2015_7_single_filtered.binstep_3.NEW: -------------------------------------------------------------------------------- 1 | species 2 | exploitation 3 | reefs 4 | resilience 5 | coastal 6 | forest 7 | regions 8 | ecologists 9 | -------------------------------------------------------------------------------- /src/analyse_db.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/python 2 | 3 | """ 4 | make some analysis on the contents of the DB 5 | """ 6 | import db 7 | from config import * 8 | 9 | # details? 10 | PRINT_DETAILS=True 11 | 12 | # simple: connect to the sqlite DB 13 | get_db() 14 | conn, model = conf['db'], conf['model'] 15 | 16 | 17 | 18 | # get models used 19 | 20 | dbc = conn.cursor() 21 | 22 | # 1. 
get the distinct models used in the DB 23 | dbc.execute("select distinct model from data") 24 | models = [c[0] for c in dbc.fetchall()] 25 | 26 | for model in models: 27 | # see if max_step == 3, else continue 28 | dbc.execute("SELECT MAX(step) FROM data WHERE model='%s'" % (model,)) 29 | if (dbc.fetchone()[0] != 3): continue 30 | 31 | # total number of concepts: 32 | dbc.execute("SELECT COUNT(*) FROM data WHERE model='%s' AND step IN (1,2,3)" % (model,)) 33 | tot_num = dbc.fetchone()[0] 34 | if tot_num != 75: raise Exception("expected 75 concepts (3 steps x 25 candidates)") 35 | 36 | # num relevant 37 | dbc.execute("SELECT COUNT(*) FROM data WHERE model='%s' AND step IN (1,2,3) AND relevant=1" % (model,)) 38 | rel_num = dbc.fetchone()[0] 39 | 40 | print "\n\nNext model:", model, " -- Total number of concepts:", tot_num, "Num relevant:", rel_num, "Perc. relevant:", rel_num / float(tot_num) 41 | 42 | if PRINT_DETAILS: 43 | for step in [1,2,3]: 44 | dbc.execute("SELECT term FROM data WHERE model='%s' AND step IN (%d) AND relevant=1" % (model, step)) 45 | rel = [t[0] for t in dbc.fetchall()] 46 | print "\n\tStep: %d -- Number of relevant concepts: %d -- Percent: %f" % (step, len(rel), len(rel)/25.0) 47 | for t in rel: 48 | print "\t\t+ ", t 49 | 50 | dbc.execute("SELECT term FROM data WHERE model='%s' AND step IN (%d) AND relevant=0" % (model, step)) 51 | nrel = [t[0] for t in dbc.fetchall()] 52 | for t in nrel: 53 | print "\t\t- ", t 54 | 55 | 56 | 57 | 58 | -------------------------------------------------------------------------------- /src/config.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/python 2 | 3 | # you need to configure this! 4 | # set the model file, and if the model supports bigrams: use a seed with bigrams..
5 | 6 | 7 | ## the conf dict stores all relevant config parameters 8 | 9 | conf={} 10 | conf['model'] = "climate2_2015_7.txt.2gram.small.model" # default dummy model 11 | #conf['model'] = "climate2_2015_7.txt.2gram.model" 12 | 13 | 14 | 15 | # if using a bigram model 16 | conf['seedfn'] = "../data/climate.seed" # bigram seed for climate change models 17 | 18 | # config for hypernym extraction 19 | conf['num_taxomy_best'] = 1 # number of most similar terms to consider when building a taxonomy 20 | conf['sim_threshold'] = 0.40 21 | 22 | 23 | # if using a unigram model 24 | #conf['seedfn'] = "../data/climate-single-word.seed" 25 | 26 | # config for hypernym extraction 27 | #conf['num_taxomy_best'] = 3 # number of most similar terms to consider when building a taxonomy 28 | #conf['sim_threshold'] = 0.23 29 | 30 | 31 | conf['binary_model'] = True # default: using a binary word2vec model (like created by Mikolov's C implementation) 32 | conf['domain'] = "climate change" # your domain of knowledge -- not important for the algorithms .. 
33 | 34 | ######################################################################################################################## 35 | 36 | # no need to change anything below this line 37 | DB_PATH= "../data/our.db" 38 | #DB_PATH= "/home/wohlg/workspace/dl4j-0.4-examples/src/main/java/MinicBac/python/data/our.db" 39 | 40 | print "db-path", DB_PATH 41 | 42 | import sqlite3 43 | def get_db(): 44 | """ just connect to the sqlite3 database """ 45 | conf['db'] = sqlite3.connect(DB_PATH) 46 | 47 | 48 | # model file name 49 | conf['MFN'] = "../data/models/" + conf['model'] 50 | 51 | # setup logging 52 | import logging, os 53 | logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO) 54 | -------------------------------------------------------------------------------- /src/db.py: -------------------------------------------------------------------------------- 1 | import os,sys 2 | from config import * 3 | import fs_helpers 4 | 5 | def check_tables_exist(conf): 6 | """ those two tables are used 7 | -- create them in the sqlite3 DB if needed ..""" 8 | 9 | dbc = conf['db'].cursor() 10 | 11 | data_table = """CREATE TABLE IF NOT EXISTS data ( 12 | id INTEGER PRIMARY KEY, 13 | domain VARCHAR, 14 | model VARCHAR, 15 | term VARCHAR, 16 | step INT, 17 | relevant BOOLEAN); """ 18 | 19 | checked_table = """ 20 | CREATE TABLE IF NOT EXISTS checked_terms ( 21 | id INTEGER PRIMARY KEY AUTOINCREMENT, 22 | domain VARCHAR, 23 | term VARCHAR, 24 | relevant BOOLEAN 25 | ); """ 26 | 27 | dbc.execute(data_table) 28 | dbc.execute(checked_table) 29 | 30 | 31 | def get_status(conf): 32 | """ 33 | Find out in what state we are, i.e. what steps have been done, etc. 34 | - are there unjudged terms? update with judgements from the file (if existing) 35 | - if all terms are judged, we can proceed to the next step, so we set the new seed terms 36 | """ 37 | conn, domain, model = conf['db'], conf['domain'], conf['model'] 38 | dbc = conn.cursor() 39 | 40 | # 1.
which is the last step 41 | dbc.execute("SELECT MAX(step) FROM data WHERE domain='%s' AND model='%s'" % (domain, model)) 42 | max_step = dbc.fetchone()[0] 43 | conf['step'] = max_step 44 | print "current max_step", max_step 45 | 46 | 47 | # see if there are unjudged terms in DB? 48 | dbc.execute("SELECT COUNT(*) FROM data WHERE domain='%s' AND model='%s' AND relevant IS NULL" % (domain, model)) 49 | c = dbc.fetchone()[0] 50 | 51 | if c>0: 52 | file_terms = fs_helpers.load_judgements_from_fs(conf) 53 | 54 | # step 1 # construct lists of relevant and not relevant-terms and update data table 55 | pos_terms, neg_terms, num_missing = update_data_table(conf, file_terms) 56 | 57 | # step 2 # insert into checked_terms table 58 | save_checked_terms(conf, pos_terms, neg_terms) 59 | 60 | # are there still unjudged terms? # TODO ? check in DB? 61 | fn = "../data/to_check/" + conf['model'] + "step_" + str(conf['step']) + ".CHECK_ME" 62 | if (num_missing>0): 63 | print "\nTHERE ARE TERMS IN THE TABLE WITHOUT JUDGEMENT -- set relevance\n" 64 | print "See file:", fn, "\n" 65 | sys.exit() 66 | 67 | # everything done for this step 68 | if max_step == 3: 69 | print "\n\nstep 3 and everything judged -- we are finished" 70 | print "\n\nlet's try to create a taxonomy:" 71 | generate_taxonomy(conf) 72 | sys.exit() 73 | 74 | # get current terms 75 | dbc.execute("SELECT term FROM data WHERE domain='%s' AND model='%s' AND relevant=1" % (domain, model)) 76 | rows = dbc.fetchall() 77 | current_terms = [row[0] for row in rows] 78 | print "current_terms", current_terms 79 | 80 | ### set current seed terms -- for the next iteration! 
81 | conf['seeds'] = current_terms 82 | 83 | def save_candidates(conf, candidates): 84 | """ 85 | save the new candidates to the DB and to the filesystem (if unjudged candidates exist) 86 | """ 87 | 88 | conn, domain, model, step = conf['db'], conf['domain'], conf['model'], conf['step'] 89 | dbc = conn.cursor() 90 | 91 | inserted_terms = [] 92 | for (term, sim) in candidates: 93 | 94 | # insert only 25 candidates max! 95 | if len(inserted_terms) >= 25: break 96 | 97 | # filters 98 | if term.find("'") >= 0: 99 | print "Filtering term", term 100 | continue 101 | 102 | # check if already exists 103 | dbc.execute("SELECT count(*) FROM data WHERE domain=? AND model=? AND term=?", (domain, model, term)) 104 | if dbc.fetchone()[0]>0: 105 | print "term %s already exists!!! won't insert again" % (term) 106 | 107 | else: 108 | # insert a new term into the DB 109 | dbc.execute("INSERT INTO data (domain, model, term, step) VALUES (?,?,?,?)", (domain, model, term, step+1)) 110 | inserted_terms.append( (term,sim) ) 111 | 112 | conn.commit() 113 | 114 | # reuse existing judgements 115 | unjudged_terms = set_judgements_from_existing(inserted_terms, conf) 116 | 117 | if unjudged_terms: 118 | # write into a checkme file 119 | fn = "../data/to_check/"+ conf['model'] + "step_" + str(conf['step']+1) + ".CHECK_ME" 120 | fh = open(fn, 'wb') 121 | for (term,sim) in unjudged_terms: 122 | fh.write(term.encode('utf-8') + "\n") 123 | fh.close() 124 | 125 | def set_judgements_from_existing(inserted_terms, conf): 126 | conn, domain, model = conf['db'], conf['domain'], conf['model'] 127 | dbc = conn.cursor() 128 | 129 | unjudged_terms = [] 130 | 131 | for (term, sim) in inserted_terms: 132 | dbc.execute("SELECT relevant FROM checked_terms WHERE domain=? AND term=?", (domain, term)) 133 | 134 | # if found 135 | res = dbc.fetchone() 136 | if res: 137 | dbc.execute("UPDATE data SET relevant=? WHERE domain=? AND model=?
AND term=?", (res[0], domain, model, term)) 138 | print "updating in set_judgements_from_existing -- term:", term, res[0] 139 | else: 140 | unjudged_terms.append( (term,sim) ) 141 | 142 | conn.commit() 143 | 144 | return unjudged_terms 145 | 146 | 147 | 148 | def check_initial_seeds(conf): 149 | """ 150 | check that the initial seed concepts exist in the DB 151 | else: add them 152 | """ 153 | conn, domain, model = conf['db'], conf['domain'], conf['model'] 154 | dbc = conn.cursor() 155 | 156 | dbc.execute("SELECT count(*) FROM data WHERE domain='%s' AND model='%s' AND step=0" % (domain, model)) 157 | if dbc.fetchone()[0]==0: 158 | print "No seeds found for model: %s" % (model,) 159 | 160 | # read seeds from file 161 | seeds = [l.strip() for l in open(conf['seedfn']).readlines()] 162 | print "Using seeds from file:", seeds 163 | 164 | for term in seeds: 165 | dbc.execute("INSERT INTO data (domain, model, term, step, relevant) VALUES ('%s', '%s', '%s', 0, 1)" % (domain, model, term)) 166 | conn.commit() 167 | 168 | print "Initial seeds entered in DB." 169 | else: 170 | print "Initial seeds exist." 171 | 172 | def update_data_table(conf, pos_terms): 173 | """ 174 | update the database with judgements given in the file 175 | the file only contains the positive judgements, so the rest must be the negative ("non-relevant") ones 176 | @pos_terms: the list of terms judged as relevant (for the given "step") 177 | @returns: the positive terms, the negative terms, and the number of missing (unjudged) terms 178 | """ 179 | conn, domain, model, step = conf['db'], conf['domain'], conf['model'], conf['step'] 180 | dbc = conn.cursor() 181 | 182 | dbc.execute("SELECT count(*) FROM data WHERE domain='%s' AND model='%s' AND step=%d AND relevant IS NULL" % (domain, model, step)) 183 | num_unj = dbc.fetchone()[0] 184 | print "Number of unjudged terms before update_data_table:", num_unj 185 | 186 | for term in pos_terms: 187 | dbc.execute("UPDATE data SET relevant=1 WHERE domain=? AND model=? AND term=?
AND step=?", (domain, model, term, step)) 188 | conn.commit() 189 | 190 | # get terms which are obviously not relevant (which were not in the pos_terms) 191 | dbc.execute("SELECT term FROM data WHERE domain='%s' AND model='%s' AND step=%d AND relevant IS NULL" % (domain, model, step)) 192 | res = dbc.fetchall() 193 | 194 | neg_terms = [t[0] for t in res] 195 | # update in DB 196 | for term in neg_terms: 197 | dbc.execute("UPDATE data SET relevant=0 WHERE domain=? AND model=? AND term=? AND step=?", (domain, model, term, step)) 198 | conn.commit() 199 | 200 | #get missing terms 201 | dbc.execute("SELECT count(*) FROM data WHERE domain='%s' AND model='%s' AND relevant IS NULL" % (domain, model)) 202 | num_missing = dbc.fetchone()[0] 203 | print "Number of unjudged terms after update_data_table:", num_missing 204 | 205 | return pos_terms, neg_terms, num_missing 206 | 207 | def save_checked_terms(conf, pos_terms, neg_terms): 208 | conn, domain, model, step = conf['db'], conf['domain'], conf['model'], conf['step'] 209 | dbc = conn.cursor() 210 | 211 | for term in pos_terms: 212 | if not term_in_checked_terms(term): 213 | dbc.execute("INSERT INTO checked_terms (domain, term, relevant) VALUES (?, ?, 1)", (domain, term)) 214 | conn.commit() 215 | 216 | for term in neg_terms: 217 | if not term_in_checked_terms(term): 218 | dbc.execute("INSERT INTO checked_terms (domain, term, relevant) VALUES (?, ?, 0)", (domain, term)) 219 | conn.commit() 220 | 221 | def term_in_checked_terms(term): 222 | conn, domain = conf['db'], conf['domain'] 223 | dbc = conn.cursor() 224 | 225 | dbc.execute("SELECT count(*) FROM checked_terms WHERE domain=? 
AND term=?", (domain, term)) 226 | n = dbc.fetchone()[0] 227 | 228 | if n>0: return True 229 | else: return False 230 | 231 | def generate_taxonomy(conf): 232 | """ 233 | try to generate a taxonomy using simple word2vec operations 234 | """ 235 | conn, domain, model = conf['db'], conf['domain'], conf['model'] 236 | dbc = conn.cursor() 237 | 238 | # get all terms 239 | dbc.execute("SELECT term FROM data WHERE domain='%s' AND model='%s' AND relevant=1" % (domain, model)) 240 | res = dbc.fetchall() 241 | terms = [r[0] for r in res] 242 | 243 | model = fs_helpers.load_model(conf) 244 | 245 | result_pairs = [] 246 | 247 | input_hyperynm_pairs = [('tree', 'forest'), ('carbon_emissions', 'emissions'), ('methane', 'greenhouse_gas')] 248 | 249 | for term in terms: 250 | print "Getting Hypernyms for", term 251 | 252 | for input_pair in input_hyperynm_pairs: 253 | 254 | try: 255 | hypo = model.most_similar(positive=[input_pair[0], term], negative=[input_pair[1]], topn=conf['num_taxomy_best']) 256 | except KeyError: 257 | continue 258 | # go to next term 259 | 260 | #print "HYPER", hyper 261 | for (hy, sim) in hypo: 262 | if hy in terms: 263 | print "found connection", term, "hyopym of", hy, "sim", sim 264 | result_pairs.append( (hy, term, sim ) ) 265 | 266 | # write to file 267 | fn = "../data/hypernyms." 
+ conf['model'] 268 | fh = open(fn, 'wb') 269 | 270 | final_pairs = [] 271 | # filter result pairs 272 | for pair in result_pairs: 273 | if pair[2] > conf['sim_threshold']: 274 | # check if pair already in final_pairs 275 | found = False 276 | for p in final_pairs: 277 | if p[0] == pair[0] and p[1] == pair[1]: 278 | found = True 279 | if not found: 280 | final_pairs.append(pair) 281 | else: 282 | print "found already, skipping", pair 283 | 284 | 285 | for pair in final_pairs: 286 | fh.write(pair[0] + " isA " + pair[1] + " " + str(pair[2]) + "\n") 287 | 288 | 289 | fh.close() 290 | 291 | 292 | if __name__== "__main__": 293 | 294 | # some testing 295 | inserted_terms = [('climate',0.8) , ('asfsafsadf',0.9), ('climate_scientist',0.7)] 296 | get_db() 297 | #set_judgements_from_existing(inserted_terms, conf) 298 | 299 | generate_taxonomy(conf) 300 | -------------------------------------------------------------------------------- /src/fs_helpers.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/python 2 | 3 | import os 4 | from config import * 5 | import gensim.models 6 | 7 | 8 | 9 | def load_judgements_from_fs(conf): 10 | """ 11 | load the manual evalations from the file system 12 | """ 13 | fn = "../data/to_check/" + conf['model'] + "step_" + str(conf['step']) + ".NEW" 14 | 15 | try: 16 | lines = open(fn).readlines() 17 | except Exception, e: 18 | print "\n\n *** \n Please check in the data/to_check folder if there are terms to manually evaluate!!\n *** \n\n" 19 | raise e 20 | 21 | terms = [l.strip() for l in lines] 22 | return terms 23 | 24 | def load_model(conf): 25 | """ 26 | load the gensim model for usage 27 | """ 28 | 29 | # load model 30 | print "loading model", conf['MFN'] 31 | 32 | if conf['binary_model']: 33 | model = gensim.models.Word2Vec.load_word2vec_format(conf['MFN'], binary=True) 34 | else: 35 | model = gensim.models.Word2Vec.load_word2vec_format(conf['MFN'], binary=False) 36 | 
model.init_sims(replace=True) # clean up RAM 37 | return model 38 | -------------------------------------------------------------------------------- /src/old/CHECK_ME: -------------------------------------------------------------------------------- 1 | climatechange_a 2 | newsclimate 3 | sids 4 | emitters 5 | eurekamag 6 | peatland 7 | bumblebees' 8 | malthusian 9 | economy 10 | competitiveness 11 | macroeconomic 12 | cyclical 13 | embracing 14 | challenges 15 | pressing 16 | turmoil 17 | strengthening 18 | growth 19 | adapting 20 | prosperity 21 | realities 22 | characterized 23 | rapidly 24 | confronting 25 | weakness 26 | landscape 27 | today's 28 | transformational 29 | changing 30 | instability 31 | china's 32 | emerging 33 | economies 34 | headwinds 35 | widespread 36 | innovation 37 | innovating 38 | disparities 39 | disruptive 40 | explores 41 | consumerism 42 | inflationary 43 | divergence 44 | reforms 45 | midst 46 | impacting 47 | innovate 48 | obesity 49 | homebuilding 50 | trend 51 | untapped 52 | marketplace 53 | challenge 54 | disinflation 55 | facing 56 | threats 57 | policymakers 58 | pains 59 | threat 60 | spur 61 | consensus 62 | implications 63 | contributing 64 | megatrends 65 | secular 66 | evolve 67 | risks 68 | emission 69 | scientists 70 | gases 71 | temperatures 72 | catastrophic 73 | permafrost 74 | precipitation 75 | reduction 76 | bleaching 77 | manmade 78 | dispersal 79 | paris 80 | countries 81 | ambitious 82 | decarbonisation 83 | pact 84 | agreement 85 | agreement's 86 | accord 87 | pledges 88 | governments 89 | nations 90 | developing 91 | decarbonization 92 | commitments 93 | humanity's 94 | ambition 95 | stakeholder 96 | aimed 97 | conservation 98 | efforts 99 | allstate's 100 | biomass 101 | insecurity 102 | stance 103 | stewardship 104 | weak 105 | hazards 106 | combating 107 | monarch 108 | improving 109 | farming 110 | ragweed 111 | disasters 112 | inundation 113 | waterborne 114 | earthquakes 115 | dioxide 116 | soils 117 | 
wildfires 118 | effects 119 | smog 120 | planet's 121 | planet 122 | earth's 123 | toxins 124 | ghg 125 | slcps 126 | collectively 127 | reductions 128 | methane 129 | renewables 130 | energy 131 | indc 132 | mechanism 133 | fossil 134 | binding 135 | celsius 136 | decarbonise 137 | reducing 138 | limit 139 | adverse 140 | projections 141 | bold 142 | mitigate 143 | fluctuation 144 | negotiators 145 | disproportionately 146 | combat 147 | uncertainties 148 | hazard 149 | wto 150 | consequences 151 | action 152 | avert 153 | systemic 154 | unavoidable 155 | worsening 156 | urgent 157 | amplify 158 | disruption 159 | runaway 160 | slowing 161 | rise 162 | decisive 163 | tackles 164 | stevesgoddard 165 | climatereality 166 | barackobama 167 | warming's 168 | carbonbrief 169 | cagw 170 | tcot 171 | potus 172 | climatecentral 173 | agw 174 | aerosol 175 | climatehawk 176 | 'global 177 | hoax 178 | chriscmooney 179 | abpoli 180 | sciam 181 | deniers 182 | empirical 183 | alarmists 184 | skeptic 185 | sceptics 186 | pope 187 | denier 188 | flawed 189 | predictions 190 | bipartisanism 191 | conservative 192 | denying 193 | radical 194 | petefrt 195 | alec 196 | skeptics 197 | terrorism 198 | asthma 199 | daughter's 200 | pause 201 | conservatives 202 | cnn 203 | yaleclimatecomm 204 | conspiracy 205 | proved 206 | poses 207 | obamas 208 | scientific 209 | ccdeditor 210 | republicans 211 | dangers 212 | globalist 213 | documentarythe 214 | changeclimate 215 | donors 216 | elites 217 | congestion 218 | agree 219 | flooding 220 | debunked 221 | posed 222 | prostitutes 223 | inhofe 224 | christians 225 | drastic 226 | morillo 227 | dire 228 | doomsday 229 | junkscience 230 | isis 231 | deception 232 | warming' 233 | synod 234 | drought 235 | exposes 236 | hysteria 237 | gender 238 | hockeyschtick 239 | temps 240 | proves 241 | linked 242 | causing 243 | caused 244 | scam 245 | mammoths 246 | megafauna 247 | reversed 248 | deny 249 | vanish 250 | unless 251 | bumblebee 252 | 
abrupt 253 | guardianeco 254 | conflict 255 | decline 256 | haarp 257 | tackle 258 | cause 259 | worried 260 | tenth 261 | underway 262 | ivar 263 | examine 264 | audiobook 265 | warns 266 | causes 267 | blame 268 | solve 269 | kyoto 270 | priorities 271 | biblical 272 | fraud 273 | shocks' 274 | 'food 275 | tonyabbottmhr 276 | vulnerable 277 | politicians 278 | severe 279 | climatechangrr 280 | myth 281 | spokesman 282 | billmckibben 283 | cruz 284 | serious 285 | evidence 286 | poll 287 | berlin 288 | theory 289 | signsyourerightwing 290 | philippeheller 291 | summer's 292 | gore 293 | gop 294 | jrehling 295 | farts 296 | issue 297 | havoc 298 | crucial 299 | gore's 300 | proposal 301 | patterns 302 | nanotechnology 303 | wreaking 304 | ari 305 | climatologist 306 | o'reilly 307 | statistical 308 | frozen 309 | affect 310 | affected 311 | seas 312 | religious 313 | faith 314 | threaten 315 | entertains 316 | humanity 317 | problems 318 | particularly 319 | devastating 320 | bumblebees 321 | increasing 322 | societies 323 | concern 324 | scholars 325 | profound 326 | warned 327 | excessive 328 | rains 329 | exxon 330 | provinces 331 | recognized 332 | failure 333 | industrialised 334 | occurring 335 | rising 336 | stronger 337 | findings 338 | signs 339 | geographies 340 | pneumophila 341 | weather 342 | markets 343 | pressures 344 | diseases 345 | ripple 346 | swiftly 347 | exports 348 | pests 349 | uncontrollable 350 | sectors 351 | recovers 352 | nahal 353 | complexities 354 | accommodative 355 | abuses 356 | apocalyptic 357 | hunger 358 | conflicts 359 | scenario 360 | exist 361 | bangladesh 362 | floods 363 | victim 364 | rainfall 365 | allergies 366 | encyclical 367 | cdnpoli 368 | debunks 369 | alarmist 370 | alarmism 371 | whitehouse 372 | vatican 373 | spaceweather 374 | tarsands 375 | vd 376 | debunking 377 | orgs 378 | squeeze 379 | contributor 380 | pins 381 | dna 382 | bleak 383 | cited 384 | cow 385 | cycles 386 | walker 387 | jeb 388 | 
schwarzenegger 389 | blames 390 | deadly 391 | finprint 392 | extreme 393 | epidemics 394 | resilient 395 | -------------------------------------------------------------------------------- /src/old/NEW: -------------------------------------------------------------------------------- 1 | emitters 2 | economy 3 | competitiveness 4 | macroeconomic 5 | cyclical 6 | challenges 7 | turmoil 8 | strengthening 9 | growth 10 | adapting 11 | prosperity 12 | realities 13 | landscape 14 | transformational 15 | instability 16 | emerging 17 | economies 18 | innovation 19 | innovating 20 | disparities 21 | disruptive 22 | consumerism 23 | inflationary 24 | divergence 25 | reforms 26 | impacting 27 | trend 28 | marketplace 29 | challenge 30 | threats 31 | policymakers 32 | pains 33 | threat 34 | spur 35 | consensus 36 | implications 37 | contributing 38 | megatrends 39 | risks 40 | emission 41 | scientists 42 | gases 43 | temperatures 44 | catastrophic 45 | permafrost 46 | precipitation 47 | reduction 48 | bleaching 49 | manmade 50 | countries 51 | decarbonisation 52 | pact 53 | agreement 54 | pledges 55 | governments 56 | nations 57 | developing 58 | decarbonization 59 | commitments 60 | ambition 61 | stakeholder 62 | conservation 63 | efforts 64 | biomass 65 | insecurity 66 | stance 67 | hazards 68 | combating 69 | farming 70 | disasters 71 | earthquakes 72 | dioxide 73 | soils 74 | wildfires 75 | effects 76 | smog 77 | planet 78 | toxins 79 | ghg 80 | slcps 81 | reductions 82 | methane 83 | renewables 84 | energy 85 | indc 86 | mechanism 87 | fossil 88 | binding 89 | celsius 90 | decarbonise 91 | reducing 92 | limit 93 | adverse 94 | projections 95 | mitigate 96 | fluctuation 97 | negotiators 98 | combat 99 | uncertainties 100 | hazard 101 | wto 102 | consequences 103 | action 104 | disruption 105 | runaway 106 | slowing 107 | rise 108 | decisive 109 | cagw 110 | agw 111 | aerosol 112 | hoax 113 | deniers 114 | empirical 115 | alarmists 116 | skeptic 117 | sceptics 118 | pope 
119 | denier 120 | flawed 121 | predictions 122 | bipartisanism 123 | conservative 124 | radical 125 | skeptics 126 | conservatives 127 | conspiracy 128 | scientific 129 | republicans 130 | dangers 131 | globalist 132 | elites 133 | congestion 134 | flooding 135 | debunked 136 | doomsday 137 | junkscience 138 | deception 139 | drought 140 | hysteria 141 | scam 142 | mammoths 143 | megafauna 144 | deny 145 | vanish 146 | conflict 147 | decline 148 | causes 149 | blame 150 | kyoto 151 | priorities 152 | fraud 153 | politicians 154 | severe 155 | spokesman 156 | evidence 157 | poll 158 | theory 159 | gore 160 | gop 161 | issue 162 | havoc 163 | proposal 164 | patterns 165 | nanotechnology 166 | wreaking 167 | ari 168 | climatologist 169 | statistical 170 | frozen 171 | seas 172 | humanity 173 | problems 174 | devastating 175 | societies 176 | scholars 177 | rains 178 | exxon 179 | provinces 180 | failure 181 | industrialised 182 | rising 183 | findings 184 | signs 185 | geographies 186 | pneumophila 187 | weather 188 | markets 189 | pressures 190 | diseases 191 | ripple 192 | exports 193 | uncontrollable 194 | sectors 195 | complexities 196 | apocalyptic 197 | hunger 198 | conflicts 199 | scenario 200 | bangladesh 201 | floods 202 | victim 203 | rainfall 204 | alarmist 205 | alarmism 206 | vatican 207 | tarsands 208 | contributor 209 | cow 210 | cycles 211 | epidemics 212 | resilient 213 |
--------------------------------------------------------------------------------
/src/old/create_and_load_model.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/python
2 |
3 | BASE="/home/wohlg/word2vec/trunk/"
4 | #FN= BASE + "climate2_2015_7.txt" # unfiltered unigrams
5 | FN= BASE + "climate2_2015_7.txt-norm1-phrase1"
6 | end = FN.split("/")[-1].strip()
7 | MFN="/home/wohlg/word2vec/trunk/my_onto/data/"+end
8 |
9 |
10 | import gensim.models
11 | # setup logging
12 | import logging, os
13 | logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
14 |
15 | # train the word2vec model on the (phrase-detected) climate corpus
16 | # roughly equivalent settings of the original word2vec tool:
17 | # -cbow 1 -size 200 -window 8 -negative 25 -hs 0 -sample 1e-4 -threads 20 -binary 1 -iter 15
18 |
19 | if not os.path.isfile(MFN):
20 |
21 | model = gensim.models.Word2Vec(size=200, window=8, min_count=10, workers=4)
22 | sentences = gensim.models.word2vec.LineSentence(FN)
23 | model.build_vocab(sentences)
24 | model.train(sentences)
25 | model.save(MFN)
26 |
27 | else:
28 | model = gensim.models.Word2Vec.load(MFN)
29 |
30 | # Evaluation: sanity-check some similarities
31 | print model.n_similarity(["king"], ["duke"])
32 | print model.n_similarity(["king"], ["queen"])
33 | print model.n_similarity(["climate"], ["greenhouse"])
34 | print model.n_similarity(["king"], ["greenhouse"])
35 | print "climate+greenhouse", model.most_similar(positive=['climate', 'greenhouse'])
36 | print "climate", model.most_similar(positive=['climate'])
37 |
38 |
--------------------------------------------------------------------------------
/src/old/create_tables.sql:
--------------------------------------------------------------------------------
1 |
2 | /* these two tables are used by the learning scripts
3 | -- create a sqlite3 DB for them ..
4 | */
5 |
6 | CREATE TABLE data (
7 | id INTEGER PRIMARY KEY,
8 | domain VARCHAR,
9 | model VARCHAR,
10 | term VARCHAR,
11 | step INT,
12 | relevant BOOLEAN
13 | );
14 |
15 |
16 | CREATE TABLE checked_terms (
17 | id INTEGER PRIMARY KEY AUTOINCREMENT,
18 | domain VARCHAR,
19 | term VARCHAR,
20 | relevant BOOLEAN
21 | );
22 |
23 |
--------------------------------------------------------------------------------
/src/old/minic_judge_terms.py:
--------------------------------------------------------------------------------
1 | import sqlite3, os, subprocess
2 |
3 | DBFILE="/home/wohlg/workspace/dl4j-0.4-examples/src/main/java/MinicBac/python/data/our.db"
4 | conn = sqlite3.connect(DBFILE)
5 | dbc = conn.cursor()
6 |
7 | def get_and_save_terms():
8 |
9 | # get the terms which have not been judged yet
10 | dbc.execute("SELECT DISTINCT term FROM data WHERE relevant IS NULL")
11 | res = dbc.fetchall()
12 |
13 | terms = [t[0] for t in res]
14 | print "number of terms", len(terms)
15 |
16 |
17 |
18 | if terms:
19 | # write them into a checkme file
20 | fn = "CHECK_ME"
21 | fh = open(fn, 'wb')
22 | for term in terms:
23 | fh.write(term.encode('utf-8') + "\n")
24 |
25 | def load_and_save_terms():
26 | fn = "NEW"
27 |
28 | if os.path.isfile(fn):
29 | lines = open(fn).readlines()
30 | terms = [l.strip() for l in lines]
31 |
32 | print "number of relevant terms", len(terms)
33 |
34 | for term in terms:
35 | # mark the term as relevant in the DB
36 | dbc.execute("UPDATE data SET relevant=1 WHERE term=? AND relevant IS NULL", (term,))
37 | conn.commit()
38 |
39 | dbc.execute("UPDATE data SET relevant = 0 WHERE relevant IS NULL")
40 | conn.commit()
41 |
42 | get_and_save_terms()
43 | load_and_save_terms()
44 | subprocess.call(["svn", "commit", "-m", "wohlg:new judgements", DBFILE])
--------------------------------------------------------------------------------
/src/run_ontology_learning.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/python
2 |
3 | """
4 | this is the main script which drives the whole pipeline:
5 | it loads the model, gets the term candidates, and saves them for evaluation
6 | a sqlite3 DB is used to store the terms and the relevance ratings
7 | """
8 |
9 | import fs_helpers, db
10 | from config import *
11 |
12 | # connect to the sqlite DB
13 | get_db()
14 |
15 | # check if the sqlite tables exist, else create them
16 | db.check_tables_exist(conf)
17 |
18 | # check if the seed terms are already in the DB -- if not, insert them
19 | db.check_initial_seeds(conf)
20 |
21 | # find out in which iteration we are, load the seed terms, etc.
22 | db.get_status(conf)
23 |
24 | # load the word2vec model
25 | model = fs_helpers.load_model(conf)
26 |
27 | # use word2vec to get term candidates
28 | candidates = model.most_similar(positive=conf['seeds'], topn=75)
29 |
30 | # save the term candidates
31 | db.save_candidates(conf, candidates)
--------------------------------------------------------------------------------
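
The taxonomy step in src/db.py (generate_taxonomy) relies on the word2vec analogy offset: for a seed pair such as ('tree', 'forest') and a candidate term, it queries most_similar(positive=[hyponym_example, term], negative=[hypernym_example]) and keeps any hit that is itself an accepted term, recording it as "hit isA term". The sketch below re-implements that query (gensim's 3CosAdd scoring) with NumPy over hypothetical toy vectors to illustrate the mechanics; the vectors, term list, and threshold are made up for the example, while real runs use the trained gensim model instead.

```python
import numpy as np

def unit(v):
    v = np.asarray(v, dtype=float)
    return v / np.linalg.norm(v)

def most_similar(vectors, positive, negative, topn=3):
    # 3CosAdd analogy query, mirroring gensim's most_similar()
    query = unit(sum(np.asarray(vectors[w], dtype=float) for w in positive)
                 - sum(np.asarray(vectors[w], dtype=float) for w in negative))
    scores = [(w, float(np.dot(query, unit(v))))
              for w, v in vectors.items() if w not in positive and w not in negative]
    return sorted(scores, key=lambda x: -x[1])[:topn]

def extract_isa(vectors, terms, seed_pairs, threshold):
    # for every accepted term, project each seed (hyponym, hypernym) offset
    # onto it and keep in-vocabulary hits above the similarity threshold
    pairs = []
    for term in terms:
        for hypo_ex, hyper_ex in seed_pairs:
            for hy, sim in most_similar(vectors, [hypo_ex, term], [hyper_ex]):
                if hy in terms and sim > threshold \
                        and all((hy, term) != (p[0], p[1]) for p in pairs):
                    pairs.append((hy, term, sim))  # read as: hy isA term
    return pairs

# hypothetical toy vectors -- constructed so that 'methane' relates to
# 'greenhouse_gas' roughly as 'tree' relates to 'forest'
vecs = {
    'tree': [1, 0, 0, 0],
    'forest': [0, 1, 0, 0],
    'greenhouse_gas': [0, 0, 1, 0],
    'methane': unit([1, -1, 1, 0]),
    'policy': [0, 0, 0, 1],
}
isa_pairs = extract_isa(vecs, ['greenhouse_gas', 'methane', 'policy'],
                        [('tree', 'forest')], threshold=0.8)
for hy, term, sim in isa_pairs:
    print(hy, "isA", term, round(sim, 3))
```

This mirrors the "hy isA term" lines that generate_taxonomy writes to data/hypernyms.*; as IDEAS_AND_RESULTS.txt notes, the direction of the extracted isA relation is often unreliable, so the threshold mainly trades precision against recall.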