├── .gitignore ├── README.md ├── code ├── brown-cluster │ ├── .gitignore │ ├── Makefile │ ├── README │ ├── basic │ │ ├── city.cc │ │ ├── city.h │ │ ├── hard-ofstream.h │ │ ├── indent.cc │ │ ├── indent.h │ │ ├── lisp.cc │ │ ├── lisp.h │ │ ├── logging.cc │ │ ├── logging.h │ │ ├── mem-tracker.cc │ │ ├── mem-tracker.h │ │ ├── mem.h │ │ ├── multi-ostream.cc │ │ ├── multi-ostream.h │ │ ├── opt.cc │ │ ├── opt.h │ │ ├── pipe.h │ │ ├── prob-utils.cc │ │ ├── prob-utils.h │ │ ├── stats.cc │ │ ├── stats.h │ │ ├── std.cc │ │ ├── std.h │ │ ├── stl-basic.cc │ │ ├── stl-basic.h │ │ ├── stl-utils.cc │ │ ├── stl-utils.h │ │ ├── str-str-db.cc │ │ ├── str-str-db.h │ │ ├── str.cc │ │ ├── str.h │ │ ├── strdb.cc │ │ ├── strdb.h │ │ ├── timer.cc │ │ ├── timer.h │ │ ├── union-set.cc │ │ └── union-set.h │ ├── cluster-viewer │ │ ├── LICENSE │ │ ├── README.md │ │ ├── build-viewer.sh │ │ └── code │ │ │ ├── final.py │ │ │ ├── htmlrows.html │ │ │ ├── make_html.py │ │ │ ├── style.css │ │ │ └── template.html │ ├── generateBClusterInput.py │ ├── input.txt │ ├── wcluster │ └── wcluster.cc ├── distantSupervision.py └── generateJson.py ├── data ├── documents.txt ├── emTypeMap.txt ├── rmTypeMap.txt └── test.txt └── getInputJsonFile.sh /.gitignore: -------------------------------------------------------------------------------- 1 | **/eigen-3.2.5/ 2 | *.pyc 3 | *.DS_Store 4 | *.o 5 | *.zip 6 | DataProcessor/stanford-corenlp-python/ 7 | Intermediate/* 8 | Results/* 9 | Data/BBN/* 10 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # StructMineDataPipeline 2 | Data Processing Pipeline for StructMine Tools: [CoType](https://github.com/shanzhenren/CoType), [PLE](https://github.com/shanzhenren/PLE), [AFET](https://github.com/shanzhenren/AFET). 
3 | 4 | ## Description 5 | This pipeline generates the train & test json files and the brown cluster file that the above three information extraction models take as input. Each line of a json file describes one sentence, including its entity mentions, relation mentions, etc. 6 | To generate these json files, you need to provide the following input files (examples are included in the ./data folder): 7 | 8 | ### Training: 9 | 1. Freebase files (download from [here](https://drive.google.com/file/d/0B--ZKWD8ahE4aXhOLXFUeDZBVzA/view?usp=sharing) (8G) and put the unzipped freebase folder alongside the code/ and data/ folders) 10 | 11 | * The freebase folder should contain: 12 | 13 | freebase-facts.txt (relation triples in the format of 'id of entity 1, relation type, id of entity 2'); 14 | 15 | freebase-mid-name.map (entity id to name map in the format of 'entity id, entity surface name'); 16 | 17 | freebase-mid-type.map (entity id to type map in the format of 'entity id, entity type'). 18 | 19 | 2. Raw training corpus file (each line as a document) 20 | 21 | 3. Entity & relation mention target type mappings from freebase type name to target type name 22 | 23 | ### Test: 24 | 1. Raw test corpus file (each line as a document) 25 | 26 | ## Dependencies 27 | The instructions below use Ubuntu as the example. 28 | 29 | * python 2.7 30 | * Python library dependencies 31 | ``` 32 | $ pip install nltk 33 | ``` 34 | 35 | * [Stanford CoreNLP 3.7.0](http://stanfordnlp.github.io/CoreNLP/) and its [python wrapper](https://github.com/stanfordnlp/stanza). Please put the library under `CoType/code/DataProcessor/`. 36 | 37 | ``` 38 | $ cd code/ 39 | $ git clone git@github.com:stanfordnlp/stanza.git 40 | $ cd stanza 41 | $ pip install -e . 
42 | $ wget http://nlp.stanford.edu/software/stanford-corenlp-full-2016-10-31.zip 43 | $ unzip stanford-corenlp-full-2016-10-31.zip 44 | ``` 45 | 46 | ## Example Run 47 | Run CoTypeDataProcessing to generate the json input files of CoType for the example raw training and test corpora. 48 | 49 | ``` 50 | $ java -mx4g -cp "code/stanford-corenlp-full-2016-10-31/*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer 51 | $ ./getInputJsonFile.sh 52 | ``` 53 | Our example data files are located in the ./data folder. After running the above commands, you should see three generated files in the same folder: train.json, test.json and brown (the brown cluster file). 54 | 55 | ## Parameters - getInputJsonFile.sh 56 | Raw train & test files to run on. 57 | ``` 58 | inTrainFile='./data/documents.txt' 59 | inTestFile='./data/test.txt' 60 | ``` 61 | Output files (input json files for CoType, PLE, AFET). 62 | ``` 63 | outTrainFile='./data/train.json' 64 | outTestFile='./data/test.json' 65 | bcOutFile='./data/brown' 66 | ``` 67 | Directory path of freebase files: 68 | ``` 69 | freebaseDir='./freebase' 70 | ``` 71 | Mention type(s) to generate. You can generate entity mentions only, relation mentions only, or both, by setting the parameter to 'em', 'rm' or 'both'. 72 | ``` 73 | mentionType='both' 74 | ``` 75 | Target mention type mapping files. 76 | ``` 77 | emTypeMapFile='./data/emTypeMap.txt' 78 | rmTypeMapFile='./data/rmTypeMap.txt' # leave it empty if only entity mention is needed 79 | ``` 80 | Parsing tool to do sentence splitting, tokenization, entity mention detection, etc. It can be 'nltk' or 'stanford'. 81 | ``` 82 | parseTool='stanford' 83 | ``` 84 | Set this parameter to true if you already have a pretrained model and only need to generate the test json file. 
85 | ``` 86 | testOnly=false 87 | ``` 88 | -------------------------------------------------------------------------------- /code/brown-cluster/.gitignore: -------------------------------------------------------------------------------- 1 | *.o 2 | -------------------------------------------------------------------------------- /code/brown-cluster/Makefile: -------------------------------------------------------------------------------- 1 | # 1.2: need to make sure opt.o goes in the right order to get the right scope on the command-line arguments 2 | # Use this for Linux 3 | ifeq ($(shell uname),Linux) 4 | files=$(subst .cc,.o,basic/logging.cc $(shell /bin/ls *.cc) $(shell /bin/ls basic/*.cc | grep -v logging.cc)) 5 | else 6 | files=$(subst .cc,.o,basic/opt.cc $(shell /bin/ls *.cc) $(shell /bin/ls basic/*.cc | grep -v opt.cc)) 7 | endif 8 | 9 | wcluster: $(files) 10 | g++ -Wall -g -std=c++0x -O3 -o wcluster $(files) -lpthread 11 | 12 | %.o: %.cc 13 | g++ -Wall -g -O3 -std=c++0x -o $@ -c $< 14 | 15 | clean: 16 | rm wcluster basic/*.o *.o 17 | -------------------------------------------------------------------------------- /code/brown-cluster/README: -------------------------------------------------------------------------------- 1 | Implementation of the Brown hierarchical word clustering algorithm. 2 | Percy Liang 3 | Release 1.3 4 | 2012.07.24 5 | 6 | Input: a sequence of words separated by whitespace (see input.txt for an example). 7 | Output: for each word type, its cluster (see output.txt for an example). 8 | In particular, each line is: 9 | <cluster represented as a bit string> <word> <number of times word occurs in input> 10 | 11 | Runs in $O(N C^2)$, where $N$ is the number of word types and $C$ 12 | is the number of clusters. 
13 | 14 | References: 15 | 16 | Brown, et al.: Class-Based n-gram Models of Natural Language 17 | http://acl.ldc.upenn.edu/J/J92/J92-4003.pdf 18 | 19 | Liang: Semi-supervised learning for natural language processing 20 | http://cs.stanford.edu/~pliang/papers/meng-thesis.pdf 21 | 22 | Compile: 23 | 24 | make 25 | 26 | Run: 27 | 28 | # Clusters input.txt into 50 clusters: 29 | ./wcluster --text input.txt --c 50 30 | # Output in input-c50-p1.out/paths 31 | 32 | ============================================================ 33 | Change Log 34 | 35 | 1.3: compatibility updates for newer versions of g++ (courtesy of Chris Dyer). 36 | 1.2: make compatible with MacOS (replaced timespec with timeval and changed order of linking). 37 | 1.1: Removed deprecated operators so it works with GCC 4.3. 38 | 39 | ============================================================ 40 | (C) Copyright 2007-2012, Percy Liang 41 | 42 | http://cs.stanford.edu/~pliang 43 | 44 | Permission is granted for anyone to copy, use, or modify these programs and 45 | accompanying documents for purposes of research or education, provided this 46 | copyright notice is retained, and note is made of any changes that have been 47 | made. 48 | 49 | These programs and documents are distributed without any warranty, express or 50 | implied. As the programs were written for research purposes only, they have 51 | not been tested to the degree that would be advisable in any important 52 | application. All use of these programs is entirely at the user's own risk. 53 | -------------------------------------------------------------------------------- /code/brown-cluster/basic/city.cc: -------------------------------------------------------------------------------- 1 | // Copyright (c) 2011 Google, Inc. 
2 | // 3 | // Permission is hereby granted, free of charge, to any person obtaining a copy 4 | // of this software and associated documentation files (the "Software"), to deal 5 | // in the Software without restriction, including without limitation the rights 6 | // to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 7 | // copies of the Software, and to permit persons to whom the Software is 8 | // furnished to do so, subject to the following conditions: 9 | // 10 | // The above copyright notice and this permission notice shall be included in 11 | // all copies or substantial portions of the Software. 12 | // 13 | // THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 14 | // IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 15 | // FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 16 | // AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 17 | // LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 18 | // OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN 19 | // THE SOFTWARE. 20 | // 21 | // CityHash, by Geoff Pike and Jyrki Alakuijala 22 | // 23 | // This file provides CityHash64() and related functions. 24 | // 25 | // It's probably possible to create even faster hash functions by 26 | // writing a program that systematically explores some of the space of 27 | // possible hash functions, by using SIMD instructions, or by 28 | // compromising on hash quality. 
29 | 30 | #include "city.h" 31 | 32 | #include <algorithm> 33 | #include <string.h> // for memcpy and memset 34 | 35 | using namespace std; 36 | 37 | static uint64 UNALIGNED_LOAD64(const char *p) { 38 | uint64 result; 39 | memcpy(&result, p, sizeof(result)); 40 | return result; 41 | } 42 | 43 | static uint32 UNALIGNED_LOAD32(const char *p) { 44 | uint32 result; 45 | memcpy(&result, p, sizeof(result)); 46 | return result; 47 | } 48 | 49 | #if !defined(WORDS_BIGENDIAN) 50 | 51 | #define uint32_in_expected_order(x) (x) 52 | #define uint64_in_expected_order(x) (x) 53 | 54 | #else 55 | 56 | #ifdef _MSC_VER 57 | #include <stdlib.h> 58 | #define bswap_32(x) _byteswap_ulong(x) 59 | #define bswap_64(x) _byteswap_uint64(x) 60 | 61 | #elif defined(__APPLE__) 62 | // Mac OS X / Darwin features 63 | #include <libkern/OSByteOrder.h> 64 | #define bswap_32(x) OSSwapInt32(x) 65 | #define bswap_64(x) OSSwapInt64(x) 66 | 67 | #else 68 | #include <byteswap.h> 69 | #endif 70 | 71 | #define uint32_in_expected_order(x) (bswap_32(x)) 72 | #define uint64_in_expected_order(x) (bswap_64(x)) 73 | 74 | #endif // WORDS_BIGENDIAN 75 | 76 | #if !defined(LIKELY) 77 | #if HAVE_BUILTIN_EXPECT 78 | #define LIKELY(x) (__builtin_expect(!!(x), 1)) 79 | #else 80 | #define LIKELY(x) (x) 81 | #endif 82 | #endif 83 | 84 | static uint64 Fetch64(const char *p) { 85 | return uint64_in_expected_order(UNALIGNED_LOAD64(p)); 86 | } 87 | 88 | static uint32 Fetch32(const char *p) { 89 | return uint32_in_expected_order(UNALIGNED_LOAD32(p)); 90 | } 91 | 92 | // Some primes between 2^63 and 2^64 for various uses. 93 | static const uint64 k0 = 0xc3a5c85c97cb3127ULL; 94 | static const uint64 k1 = 0xb492b66fbe98f273ULL; 95 | static const uint64 k2 = 0x9ae16a3b2f90404fULL; 96 | static const uint64 k3 = 0xc949d7c7509e6557ULL; 97 | 98 | // Bitwise right rotate. Normally this will compile to a single 99 | // instruction, especially if the shift is a manifest constant. 100 | static uint64 Rotate(uint64 val, int shift) { 101 | // Avoid shifting by 64: doing so yields an undefined result. 
102 | return shift == 0 ? val : ((val >> shift) | (val << (64 - shift))); 103 | } 104 | 105 | // Equivalent to Rotate(), but requires the second arg to be non-zero. 106 | // On x86-64, and probably others, it's possible for this to compile 107 | // to a single instruction if both args are already in registers. 108 | static uint64 RotateByAtLeast1(uint64 val, int shift) { 109 | return (val >> shift) | (val << (64 - shift)); 110 | } 111 | 112 | static uint64 ShiftMix(uint64 val) { 113 | return val ^ (val >> 47); 114 | } 115 | 116 | static uint64 HashLen16(uint64 u, uint64 v) { 117 | return Hash128to64(uint128(u, v)); 118 | } 119 | 120 | static uint64 HashLen0to16(const char *s, size_t len) { 121 | if (len > 8) { 122 | uint64 a = Fetch64(s); 123 | uint64 b = Fetch64(s + len - 8); 124 | return HashLen16(a, RotateByAtLeast1(b + len, len)) ^ b; 125 | } 126 | if (len >= 4) { 127 | uint64 a = Fetch32(s); 128 | return HashLen16(len + (a << 3), Fetch32(s + len - 4)); 129 | } 130 | if (len > 0) { 131 | uint8 a = s[0]; 132 | uint8 b = s[len >> 1]; 133 | uint8 c = s[len - 1]; 134 | uint32 y = static_cast<uint32>(a) + (static_cast<uint32>(b) << 8); 135 | uint32 z = len + (static_cast<uint32>(c) << 2); 136 | return ShiftMix(y * k2 ^ z * k3) * k2; 137 | } 138 | return k2; 139 | } 140 | 141 | // This probably works well for 16-byte strings as well, but it may be overkill 142 | // in that case. 143 | static uint64 HashLen17to32(const char *s, size_t len) { 144 | uint64 a = Fetch64(s) * k1; 145 | uint64 b = Fetch64(s + 8); 146 | uint64 c = Fetch64(s + len - 8) * k2; 147 | uint64 d = Fetch64(s + len - 16) * k0; 148 | return HashLen16(Rotate(a - b, 43) + Rotate(c, 30) + d, 149 | a + Rotate(b ^ k3, 20) - c + len); 150 | } 151 | 152 | // Return a 16-byte hash for 48 bytes. Quick and dirty. 153 | // Callers do best to use "random-looking" values for a and b. 
154 | static pair<uint64, uint64> WeakHashLen32WithSeeds( 155 | uint64 w, uint64 x, uint64 y, uint64 z, uint64 a, uint64 b) { 156 | a += w; 157 | b = Rotate(b + a + z, 21); 158 | uint64 c = a; 159 | a += x; 160 | a += y; 161 | b += Rotate(a, 44); 162 | return make_pair(a + z, b + c); 163 | } 164 | 165 | // Return a 16-byte hash for s[0] ... s[31], a, and b. Quick and dirty. 166 | static pair<uint64, uint64> WeakHashLen32WithSeeds( 167 | const char* s, uint64 a, uint64 b) { 168 | return WeakHashLen32WithSeeds(Fetch64(s), 169 | Fetch64(s + 8), 170 | Fetch64(s + 16), 171 | Fetch64(s + 24), 172 | a, 173 | b); 174 | } 175 | 176 | // Return an 8-byte hash for 33 to 64 bytes. 177 | static uint64 HashLen33to64(const char *s, size_t len) { 178 | uint64 z = Fetch64(s + 24); 179 | uint64 a = Fetch64(s) + (len + Fetch64(s + len - 16)) * k0; 180 | uint64 b = Rotate(a + z, 52); 181 | uint64 c = Rotate(a, 37); 182 | a += Fetch64(s + 8); 183 | c += Rotate(a, 7); 184 | a += Fetch64(s + 16); 185 | uint64 vf = a + z; 186 | uint64 vs = b + Rotate(a, 31) + c; 187 | a = Fetch64(s + 16) + Fetch64(s + len - 32); 188 | z = Fetch64(s + len - 8); 189 | b = Rotate(a + z, 52); 190 | c = Rotate(a, 37); 191 | a += Fetch64(s + len - 24); 192 | c += Rotate(a, 7); 193 | a += Fetch64(s + len - 16); 194 | uint64 wf = a + z; 195 | uint64 ws = b + Rotate(a, 31) + c; 196 | uint64 r = ShiftMix((vf + ws) * k2 + (wf + vs) * k0); 197 | return ShiftMix(r * k0 + vs) * k2; 198 | } 199 | 200 | uint64 CityHash64(const char *s, size_t len) { 201 | if (len <= 32) { 202 | if (len <= 16) { 203 | return HashLen0to16(s, len); 204 | } else { 205 | return HashLen17to32(s, len); 206 | } 207 | } else if (len <= 64) { 208 | return HashLen33to64(s, len); 209 | } 210 | 211 | // For strings over 64 bytes we hash the end first, and then as we 212 | // loop we keep 56 bytes of state: v, w, x, y, and z. 
213 | uint64 x = Fetch64(s + len - 40); 214 | uint64 y = Fetch64(s + len - 16) + Fetch64(s + len - 56); 215 | uint64 z = HashLen16(Fetch64(s + len - 48) + len, Fetch64(s + len - 24)); 216 | pair<uint64, uint64> v = WeakHashLen32WithSeeds(s + len - 64, len, z); 217 | pair<uint64, uint64> w = WeakHashLen32WithSeeds(s + len - 32, y + k1, x); 218 | x = x * k1 + Fetch64(s); 219 | 220 | // Decrease len to the nearest multiple of 64, and operate on 64-byte chunks. 221 | len = (len - 1) & ~static_cast<size_t>(63); 222 | do { 223 | x = Rotate(x + y + v.first + Fetch64(s + 8), 37) * k1; 224 | y = Rotate(y + v.second + Fetch64(s + 48), 42) * k1; 225 | x ^= w.second; 226 | y += v.first + Fetch64(s + 40); 227 | z = Rotate(z + w.first, 33) * k1; 228 | v = WeakHashLen32WithSeeds(s, v.second * k1, x + w.first); 229 | w = WeakHashLen32WithSeeds(s + 32, z + w.second, y + Fetch64(s + 16)); 230 | std::swap(z, x); 231 | s += 64; 232 | len -= 64; 233 | } while (len != 0); 234 | return HashLen16(HashLen16(v.first, w.first) + ShiftMix(y) * k1 + z, 235 | HashLen16(v.second, w.second) + x); 236 | } 237 | 238 | uint64 CityHash64WithSeed(const char *s, size_t len, uint64 seed) { 239 | return CityHash64WithSeeds(s, len, k2, seed); 240 | } 241 | 242 | uint64 CityHash64WithSeeds(const char *s, size_t len, 243 | uint64 seed0, uint64 seed1) { 244 | return HashLen16(CityHash64(s, len) - seed0, seed1); 245 | } 246 | 247 | // A subroutine for CityHash128(). Returns a decent 128-bit hash for strings 248 | // of any length representable in signed long. Based on City and Murmur. 249 | static uint128 CityMurmur(const char *s, size_t len, uint128 seed) { 250 | uint64 a = Uint128Low64(seed); 251 | uint64 b = Uint128High64(seed); 252 | uint64 c = 0; 253 | uint64 d = 0; 254 | signed long l = len - 16; 255 | if (l <= 0) { // len <= 16 256 | a = ShiftMix(a * k1) * k1; 257 | c = b * k1 + HashLen0to16(s, len); 258 | d = ShiftMix(a + (len >= 8 ? 
Fetch64(s) : c)); 259 | } else { // len > 16 260 | c = HashLen16(Fetch64(s + len - 8) + k1, a); 261 | d = HashLen16(b + len, c + Fetch64(s + len - 16)); 262 | a += d; 263 | do { 264 | a ^= ShiftMix(Fetch64(s) * k1) * k1; 265 | a *= k1; 266 | b ^= a; 267 | c ^= ShiftMix(Fetch64(s + 8) * k1) * k1; 268 | c *= k1; 269 | d ^= c; 270 | s += 16; 271 | l -= 16; 272 | } while (l > 0); 273 | } 274 | a = HashLen16(a, c); 275 | b = HashLen16(d, b); 276 | return uint128(a ^ b, HashLen16(b, a)); 277 | } 278 | 279 | uint128 CityHash128WithSeed(const char *s, size_t len, uint128 seed) { 280 | if (len < 128) { 281 | return CityMurmur(s, len, seed); 282 | } 283 | 284 | // We expect len >= 128 to be the common case. Keep 56 bytes of state: 285 | // v, w, x, y, and z. 286 | pair<uint64, uint64> v, w; 287 | uint64 x = Uint128Low64(seed); 288 | uint64 y = Uint128High64(seed); 289 | uint64 z = len * k1; 290 | v.first = Rotate(y ^ k1, 49) * k1 + Fetch64(s); 291 | v.second = Rotate(v.first, 42) * k1 + Fetch64(s + 8); 292 | w.first = Rotate(y + z, 35) * k1 + x; 293 | w.second = Rotate(x + Fetch64(s + 88), 53) * k1; 294 | 295 | // This is the same inner loop as CityHash64(), manually unrolled. 
296 | do { 297 | x = Rotate(x + y + v.first + Fetch64(s + 8), 37) * k1; 298 | y = Rotate(y + v.second + Fetch64(s + 48), 42) * k1; 299 | x ^= w.second; 300 | y += v.first + Fetch64(s + 40); 301 | z = Rotate(z + w.first, 33) * k1; 302 | v = WeakHashLen32WithSeeds(s, v.second * k1, x + w.first); 303 | w = WeakHashLen32WithSeeds(s + 32, z + w.second, y + Fetch64(s + 16)); 304 | std::swap(z, x); 305 | s += 64; 306 | x = Rotate(x + y + v.first + Fetch64(s + 8), 37) * k1; 307 | y = Rotate(y + v.second + Fetch64(s + 48), 42) * k1; 308 | x ^= w.second; 309 | y += v.first + Fetch64(s + 40); 310 | z = Rotate(z + w.first, 33) * k1; 311 | v = WeakHashLen32WithSeeds(s, v.second * k1, x + w.first); 312 | w = WeakHashLen32WithSeeds(s + 32, z + w.second, y + Fetch64(s + 16)); 313 | std::swap(z, x); 314 | s += 64; 315 | len -= 128; 316 | } while (LIKELY(len >= 128)); 317 | x += Rotate(v.first + z, 49) * k0; 318 | z += Rotate(w.first, 37) * k0; 319 | // If 0 < len < 128, hash up to 4 chunks of 32 bytes each from the end of s. 320 | for (size_t tail_done = 0; tail_done < len; ) { 321 | tail_done += 32; 322 | y = Rotate(x + y, 42) * k0 + v.second; 323 | w.first += Fetch64(s + len - tail_done + 16); 324 | x = x * k0 + w.first; 325 | z += w.second + Fetch64(s + len - tail_done); 326 | w.second += v.first; 327 | v = WeakHashLen32WithSeeds(s + len - tail_done, v.first + z, v.second); 328 | } 329 | // At this point our 56 bytes of state should contain more than 330 | // enough information for a strong 128-bit hash. We use two 331 | // different 56-byte-to-8-byte hashes to get a 16-byte final result. 
332 | x = HashLen16(x, v.first); 333 | y = HashLen16(y + z, w.first); 334 | return uint128(HashLen16(x + v.second, w.second) + y, 335 | HashLen16(x + w.second, y + v.second)); 336 | } 337 | 338 | uint128 CityHash128(const char *s, size_t len) { 339 | if (len >= 16) { 340 | return CityHash128WithSeed(s + 16, 341 | len - 16, 342 | uint128(Fetch64(s) ^ k3, 343 | Fetch64(s + 8))); 344 | } else if (len >= 8) { 345 | return CityHash128WithSeed(NULL, 346 | 0, 347 | uint128(Fetch64(s) ^ (len * k0), 348 | Fetch64(s + len - 8) ^ k1)); 349 | } else { 350 | return CityHash128WithSeed(s, len, uint128(k0, k1)); 351 | } 352 | } 353 | 354 | #ifdef __SSE4_2__ 355 | #include <citycrc.h> 356 | #include <nmmintrin.h> 357 | 358 | // Requires len >= 240. 359 | static void CityHashCrc256Long(const char *s, size_t len, 360 | uint32 seed, uint64 *result) { 361 | uint64 a = Fetch64(s + 56) + k0; 362 | uint64 b = Fetch64(s + 96) + k0; 363 | uint64 c = result[0] = HashLen16(b, len); 364 | uint64 d = result[1] = Fetch64(s + 120) * k0 + len; 365 | uint64 e = Fetch64(s + 184) + seed; 366 | uint64 f = seed; 367 | uint64 g = 0; 368 | uint64 h = 0; 369 | uint64 i = 0; 370 | uint64 j = 0; 371 | uint64 t = c + d; 372 | 373 | // 240 bytes of input per iter. 
374 | size_t iters = len / 240; 375 | len -= iters * 240; 376 | do { 377 | #define CHUNK(multiplier, z) \ 378 | { \ 379 | uint64 old_a = a; \ 380 | a = Rotate(b, 41 ^ z) * multiplier + Fetch64(s); \ 381 | b = Rotate(c, 27 ^ z) * multiplier + Fetch64(s + 8); \ 382 | c = Rotate(d, 41 ^ z) * multiplier + Fetch64(s + 16); \ 383 | d = Rotate(e, 33 ^ z) * multiplier + Fetch64(s + 24); \ 384 | e = Rotate(t, 25 ^ z) * multiplier + Fetch64(s + 32); \ 385 | t = old_a; \ 386 | } \ 387 | f = _mm_crc32_u64(f, a); \ 388 | g = _mm_crc32_u64(g, b); \ 389 | h = _mm_crc32_u64(h, c); \ 390 | i = _mm_crc32_u64(i, d); \ 391 | j = _mm_crc32_u64(j, e); \ 392 | s += 40 393 | 394 | CHUNK(1, 1); CHUNK(k0, 0); 395 | CHUNK(1, 1); CHUNK(k0, 0); 396 | CHUNK(1, 1); CHUNK(k0, 0); 397 | } while (--iters > 0); 398 | 399 | while (len >= 40) { 400 | CHUNK(k0, 0); 401 | len -= 40; 402 | } 403 | if (len > 0) { 404 | s = s + len - 40; 405 | CHUNK(k0, 0); 406 | } 407 | j += i << 32; 408 | a = HashLen16(a, j); 409 | h += g << 32; 410 | b += h; 411 | c = HashLen16(c, f) + i; 412 | d = HashLen16(d, e + result[0]); 413 | j += e; 414 | i += HashLen16(h, t); 415 | e = HashLen16(a, d) + j; 416 | f = HashLen16(b, c) + a; 417 | g = HashLen16(j, i) + c; 418 | result[0] = e + f + g + h; 419 | a = ShiftMix((a + g) * k0) * k0 + b; 420 | result[1] += a + result[0]; 421 | a = ShiftMix(a * k0) * k0 + c; 422 | result[2] = a + result[1]; 423 | a = ShiftMix((a + e) * k0) * k0; 424 | result[3] = a + result[2]; 425 | } 426 | 427 | // Requires len < 240. 
428 | static void CityHashCrc256Short(const char *s, size_t len, uint64 *result) { 429 | char buf[240]; 430 | memcpy(buf, s, len); 431 | memset(buf + len, 0, 240 - len); 432 | CityHashCrc256Long(buf, 240, ~static_cast<uint32>(len), result); 433 | } 434 | 435 | void CityHashCrc256(const char *s, size_t len, uint64 *result) { 436 | if (LIKELY(len >= 240)) { 437 | CityHashCrc256Long(s, len, 0, result); 438 | } else { 439 | CityHashCrc256Short(s, len, result); 440 | } 441 | } 442 | 443 | uint128 CityHashCrc128WithSeed(const char *s, size_t len, uint128 seed) { 444 | if (len <= 900) { 445 | return CityHash128WithSeed(s, len, seed); 446 | } else { 447 | uint64 result[4]; 448 | CityHashCrc256(s, len, result); 449 | uint64 u = Uint128High64(seed) + result[0]; 450 | uint64 v = Uint128Low64(seed) + result[1]; 451 | return uint128(HashLen16(u, v + result[2]), 452 | HashLen16(Rotate(v, 32), u * k0 + result[3])); 453 | } 454 | } 455 | 456 | uint128 CityHashCrc128(const char *s, size_t len) { 457 | if (len <= 900) { 458 | return CityHash128(s, len); 459 | } else { 460 | uint64 result[4]; 461 | CityHashCrc256(s, len, result); 462 | return uint128(result[2], result[3]); 463 | } 464 | } 465 | 466 | #endif 467 | -------------------------------------------------------------------------------- /code/brown-cluster/basic/city.h: -------------------------------------------------------------------------------- 1 | // Copyright (c) 2011 Google, Inc. 
2 | // 3 | // Permission is hereby granted, free of charge, to any person obtaining a copy 4 | // of this software and associated documentation files (the "Software"), to deal 5 | // in the Software without restriction, including without limitation the rights 6 | // to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 7 | // copies of the Software, and to permit persons to whom the Software is 8 | // furnished to do so, subject to the following conditions: 9 | // 10 | // The above copyright notice and this permission notice shall be included in 11 | // all copies or substantial portions of the Software. 12 | // 13 | // THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 14 | // IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 15 | // FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 16 | // AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 17 | // LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 18 | // OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN 19 | // THE SOFTWARE. 20 | // 21 | // CityHash, by Geoff Pike and Jyrki Alakuijala 22 | // 23 | // This file provides a few functions for hashing strings. On x86-64 24 | // hardware in 2011, CityHash64() is faster than other high-quality 25 | // hash functions, such as Murmur. This is largely due to higher 26 | // instruction-level parallelism. CityHash64() and CityHash128() also perform 27 | // well on hash-quality tests. 28 | // 29 | // CityHash128() is optimized for relatively long strings and returns 30 | // a 128-bit hash. For strings more than about 2000 bytes it can be 31 | // faster than CityHash64(). 32 | // 33 | // Functions in the CityHash family are not suitable for cryptography. 34 | // 35 | // WARNING: This code has not been tested on big-endian platforms! 
36 | // It is known to work well on little-endian platforms that have a small penalty 37 | // for unaligned reads, such as current Intel and AMD moderate-to-high-end CPUs. 38 | // 39 | // By the way, for some hash functions, given strings a and b, the hash 40 | // of a+b is easily derived from the hashes of a and b. This property 41 | // doesn't hold for any hash functions in this file. 42 | 43 | #ifndef CITY_HASH_H_ 44 | #define CITY_HASH_H_ 45 | 46 | #include <stdlib.h> // for size_t. 47 | #include <stdint.h> 48 | #include <utility> 49 | 50 | typedef uint8_t uint8; 51 | typedef uint32_t uint32; 52 | typedef uint64_t uint64; 53 | typedef std::pair<uint64, uint64> uint128; 54 | 55 | inline uint64 Uint128Low64(const uint128& x) { return x.first; } 56 | inline uint64 Uint128High64(const uint128& x) { return x.second; } 57 | 58 | // Hash function for a byte array. 59 | uint64 CityHash64(const char *buf, size_t len); 60 | 61 | // Hash function for a byte array. For convenience, a 64-bit seed is also 62 | // hashed into the result. 63 | uint64 CityHash64WithSeed(const char *buf, size_t len, uint64 seed); 64 | 65 | // Hash function for a byte array. For convenience, two seeds are also 66 | // hashed into the result. 67 | uint64 CityHash64WithSeeds(const char *buf, size_t len, 68 | uint64 seed0, uint64 seed1); 69 | 70 | // Hash function for a byte array. 71 | uint128 CityHash128(const char *s, size_t len); 72 | 73 | // Hash function for a byte array. For convenience, a 128-bit seed is also 74 | // hashed into the result. 75 | uint128 CityHash128WithSeed(const char *s, size_t len, uint128 seed); 76 | 77 | // Hash 128 input bits down to 64 bits of output. 78 | // This is intended to be a reasonably good hash function. 79 | inline uint64 Hash128to64(const uint128& x) { 80 | // Murmur-inspired hashing. 
81 | const uint64 kMul = 0x9ddfea08eb382d69ULL; 82 | uint64 a = (Uint128Low64(x) ^ Uint128High64(x)) * kMul; 83 | a ^= (a >> 47); 84 | uint64 b = (Uint128High64(x) ^ a) * kMul; 85 | b ^= (b >> 47); 86 | b *= kMul; 87 | return b; 88 | } 89 | 90 | #endif // CITY_HASH_H_ 91 | -------------------------------------------------------------------------------- /code/brown-cluster/basic/hard-ofstream.h: -------------------------------------------------------------------------------- 1 | #ifndef __HARD_OFSTREAM_H__ 2 | #define __HARD_OFSTREAM_H__ 3 | 4 | // On AFS, flushing a file writes it to the local disk but not AFS. 5 | // Hard flushing ensures that the file will be written, by closing 6 | // and re-opening the file. 7 | 8 | #include <fstream> 9 | #include <string> 10 | 11 | using namespace std; 12 | 13 | class hard_ofstream : public ofstream { 14 | public: 15 | hard_ofstream() { } 16 | hard_ofstream(const char *file, ofstream::openmode mode = ofstream::trunc) { open(file, mode); } 17 | 18 | void open(const char *file, ofstream::openmode mode = ofstream::trunc) { 19 | ofstream::open(file, mode); 20 | this->file = file; 21 | } 22 | 23 | void hard_flush() { 24 | close(); 25 | open(file.c_str(), ofstream::app); 26 | } 27 | 28 | private: 29 | string file; 30 | }; 31 | 32 | #endif 33 | -------------------------------------------------------------------------------- /code/brown-cluster/basic/indent.cc: -------------------------------------------------------------------------------- 1 | #include "indent.h" 2 | 3 | #include "opt.h" 4 | -------------------------------------------------------------------------------- /code/brown-cluster/basic/indent.h: -------------------------------------------------------------------------------- 1 | #ifndef __INDENT_H__ 2 | #define __INDENT_H__ 3 | 4 | #include <iostream> 5 | 6 | using namespace std; 7 | 8 | struct Indent { 9 | Indent(int level) : level(level) { } 10 | int level; 11 | }; 12 | 13 | inline ostream &operator<<(ostream &out, const Indent &ind) { 14 | for(int i 
= 0; i < ind.level; i++) out << " "; 15 | return out; 16 | } 17 | 18 | #endif 19 | -------------------------------------------------------------------------------- /code/brown-cluster/basic/lisp.cc: -------------------------------------------------------------------------------- 1 | #include "lisp.h" 2 | #include "std.h" 3 | #include "indent.h" 4 | 5 | void LispNode::destroy() { 6 | forvec(_, LispNode *, node, children) { 7 | node->destroy(); 8 | delete node; 9 | } 10 | } 11 | 12 | void LispNode::print(int ind) const { 13 | cout << Indent(ind) << (value.empty() ? "(empty)" : value) << endl; 14 | forvec(_, LispNode *, subnode, children) 15 | subnode->print(ind+1); 16 | } 17 | 18 | //////////////////////////////////////////////////////////// 19 | 20 | LispTree::~LispTree() { 21 | root->destroy(); 22 | delete root; 23 | } 24 | 25 | bool is_paren(char c) { 26 | return c == '(' || c == ')' || c == '[' || c == ']'; 27 | } 28 | bool is_paren(string s) { 29 | return s == "(" || s == ")" || s == "[" || s == "]"; 30 | } 31 | bool is_left_paren(string s) { 32 | return s == "(" || s == "["; 33 | } 34 | bool is_right_paren(string s) { 35 | return s == ")" || s == "]"; 36 | } 37 | string matching_right_paren(char c) { 38 | if(c == '(') return ")"; 39 | if(c == '[') return "]"; 40 | return ""; 41 | } 42 | 43 | // Return first non-space character. 44 | char skip_space(istream &in) { 45 | char c; 46 | while(true) { 47 | c = in.peek(); 48 | if(!isspace(c)) break; 49 | in.get(); 50 | } 51 | return c; 52 | } 53 | 54 | // Comments start with # and end with the line. 55 | // There must be a space before the #. 
56 | char skip_comments(istream &in) { 57 | while(true) { 58 | char c = skip_space(in); 59 | if(c == '#') 60 | while(in.peek() != '\n' && in.peek() != EOF) in.get(); // Stop at EOF so an unterminated comment cannot loop forever. 61 | else 62 | return c; 63 | } 64 | } 65 | 66 | bool LispTree::read_token(istream &in, string &s) { 67 | char c = skip_comments(in); 68 | 69 | if(is_paren(c)) { 70 | s = in.get(); 71 | return true; 72 | } 73 | 74 | s = ""; 75 | while(true) { 76 | c = in.peek(); 77 | if(!in.good()) return !s.empty(); // At EOF, keep a final token that has no trailing whitespace. 78 | if(isspace(c) || is_paren(c)) break; 79 | s += in.get(); 80 | } 81 | 82 | return true; 83 | } 84 | 85 | LispNode *LispTree::read_node(const vector<string> &tokens, int &i) { 86 | LispNode *node = new LispNode(); 87 | assert(i < len(tokens)); 88 | 89 | string s = tokens[i++]; 90 | if(is_left_paren(s)) { 91 | char left_paren = s[0]; 92 | 93 | if(left_paren == '(') { 94 | assert(i < len(tokens) && !is_paren(tokens[i])); 95 | node->value = tokens[i++]; 96 | } 97 | 98 | while(i < len(tokens) && !is_right_paren(tokens[i])) { 99 | node->children.push_back(read_node(tokens, i)); 100 | } 101 | 102 | assert(i < len(tokens)); 103 | s = tokens[i++]; 104 | assert(s == matching_right_paren(left_paren)); 105 | } 106 | else if(is_right_paren(s)) 107 | assert(false); 108 | else 109 | node->value = s; 110 | 111 | return node; 112 | } 113 | 114 | void LispTree::read(const char *file) { 115 | ifstream in(file); 116 | vector<string> tokens; 117 | string token; 118 | while(read_token(in, token)) { 119 | tokens.push_back(token); 120 | } 121 | int i = 0; 122 | root = read_node(tokens, i); 123 | assert(i == len(tokens)); 124 | } 125 | 126 | void LispTree::print() const { 127 | assert(root); 128 | root->print(0); 129 | } 130 | -------------------------------------------------------------------------------- /code/brown-cluster/basic/lisp.h: -------------------------------------------------------------------------------- 1 | #ifndef __LISP_H__ 2 | #define __LISP_H__ 3 | 4 | #include 5 | #include 6 | 7 | using namespace std; 8 | 9 | 
//////////////////////////////////////////////////////////// 10 | 11 | struct LispNode { 12 | void destroy(); 13 | void print(int ind) const; 14 | 15 | string value; 16 | vector<LispNode *> children; 17 | }; 18 | 19 | //////////////////////////////////////////////////////////// 20 | 21 | struct LispTree { 22 | LispTree() : root(NULL) { } 23 | ~LispTree(); 24 | 25 | bool read_token(istream &in, string &s); 26 | LispNode *read_node(const vector<string> &tokens, int &i); 27 | void read(const char *file); 28 | void print() const; 29 | 30 | LispNode *root; 31 | }; 32 | 33 | #endif 34 | -------------------------------------------------------------------------------- /code/brown-cluster/basic/logging.cc: -------------------------------------------------------------------------------- 1 | #include "logging.h" 2 | #include "opt.h" 3 | #include "mem.h" 4 | 5 | // The logging output has a tree structure, where each node is a 6 | // line of output, and the depth of a node is its indent level. 7 | // A run is the sequence of children of some node. 8 | // A subset of the lines in the run will get printed. 9 | 10 | //////////////////////////////////////////////////////////// 11 | 12 | void Run::init() { 13 | num_lines = 0; 14 | num_lines_printed = 0; 15 | next_line_to_print = 0; 16 | print_all_lines = false; 17 | timer.start(); 18 | } 19 | 20 | void Run::finish() { 21 | // Make it clear that this run is not printed. 22 | // Otherwise, logss might think its 23 | // parent was printed when it really wasn't. 24 | next_line_to_print = -1; 25 | timer.stop(); 26 | } 27 | 28 | bool Run::new_line() { 29 | bool p = print(); 30 | num_lines++; 31 | if(!p) return false; 32 | 33 | // We're going to print this line. Now decide next line to print. 34 | int ms_per_line = log_info.ms_per_line; 35 | if(num_lines <= 2 || // Print first few lines anyway. 36 | ms_per_line == 0 || // Print everything. 37 | print_all_lines) // Print every line in this run. 
38 | next_line_to_print++; 39 | else { 40 | timer.stop(); 41 | if(timer.ms == 0) // No time has elapsed. 42 | next_line_to_print *= 2; // Exponentially increase time between lines. 43 | else 44 | next_line_to_print += max(int((double)num_lines * ms_per_line / timer.ms), 1); 45 | } 46 | 47 | num_lines_printed++; 48 | return true; 49 | } 50 | 51 | //////////////////////////////////////////////////////////// 52 | // Global information about logging. 53 | 54 | LogInfo::LogInfo() { 55 | ms_per_line = 0; //1000; // 1 second 56 | max_ind_level = 3; 57 | 58 | ind_level = 0; 59 | buf = ""; 60 | 61 | runs.resize(128); 62 | timer.start(); 63 | } 64 | 65 | LogInfo::~LogInfo() { 66 | out.flush(); 67 | } 68 | 69 | void LogInfo::init() { 70 | if (log_file.empty()) { 71 | out.open("/dev/stdout"); 72 | } else { 73 | cout << "Logging to " << log_file << endl; 74 | out.open(log_file.c_str()); 75 | } 76 | } 77 | 78 | LogInfo log_info; 79 | 80 | //////////////////////////////////////////////////////////// 81 | // LogTracker:: For tracking functions or blocks. 82 | 83 | void LogTracker::begin(bool print_all_lines) { 84 | if(_ind_within) { 85 | if(log_info.this_run().print()) { 86 | const string &s = descrip.str(); 87 | 88 | _logs(name); 89 | if(s.size() > 0 && name[0]) 90 | lout << ": "; 91 | lout << s; 92 | 93 | lout.flush(); 94 | log_info.buf = " {\n"; // Open the block. 95 | 96 | log_info.child_run().init(); 97 | log_info.child_run().print_all_lines = print_all_lines; 98 | } 99 | else { 100 | log_info.max_ind_level = -log_info.max_ind_level; // Prevent children from outputting. 101 | output_stopped = true; 102 | } 103 | } 104 | 105 | log_info.ind_level++; 106 | } 107 | 108 | LogTracker::~LogTracker() { 109 | log_info.ind_level--; 110 | 111 | if(output_stopped) 112 | log_info.max_ind_level = -log_info.max_ind_level; // Restore indent level. 113 | 114 | if(_ind_within) { 115 | if(log_info.this_run().new_line()) { 116 | // Finish up child level. 
117 | log_info.ind_level++; 118 | int n = log_info.this_run().num_omitted(); 119 | if(n > 0) 120 | _logs("... " << n << " lines omitted ...\n"); 121 | log_info.ind_level--; 122 | log_info.child_run().finish(); 123 | 124 | if(log_info.buf[0]) // Nothing was printed, because buf hasn't been emptied. 125 | log_info.buf = ""; // Just pretend we didn't open the block. 126 | else // Something indented was printed. 127 | _logs("}"); // Close the block. 128 | 129 | // Print time 130 | Timer &ct = log_info.child_run().timer; 131 | lout << " [" << ct; 132 | if(log_info.ind_level > 0) { 133 | Timer &tt = log_info.this_run().timer; 134 | tt.stop(); 135 | lout << ", cumulative " << tt; 136 | } 137 | lout << "]\n"; 138 | } 139 | } 140 | } 141 | 142 | // Options for logging. 143 | int _log_info_max_ind_level = opt_define_int_wrap("max-ind-level", &log_info.max_ind_level, log_info.max_ind_level, "Maximum indent level for logging", false); 144 | int _log_info_ms_per_line = opt_define_int_wrap("ms-per-line", &log_info.ms_per_line, log_info.ms_per_line, "Print a line out every this many milliseconds", false); 145 | string _log_info_log_file = opt_define_string_wrap("log", &log_info.log_file, log_info.log_file, "File to write log to (\"\" for stdout)", false); 146 | -------------------------------------------------------------------------------- /code/brown-cluster/basic/logging.h: -------------------------------------------------------------------------------- 1 | #ifndef __LOGGING_H__ 2 | #define __LOGGING_H__ 3 | 4 | #include "std.h" 5 | #include "mem.h" 6 | #include "timer.h" 7 | #include "indent.h" 8 | 9 | //////////////////////////////////////////////////////////// 10 | 11 | // State associated with a run. 
12 | struct Run { 13 | Run() { init(); } 14 | bool print() const { return num_lines == next_line_to_print; } 15 | 16 | int num_omitted() { return num_lines - num_lines_printed; } 17 | bool new_line(); 18 | 19 | void init(); 20 | void finish(); 21 | 22 | int num_lines; // Number of lines that we've gone through so far in this run. 23 | int num_lines_printed; // Number of lines actually printed. 24 | int next_line_to_print; // Next line to be printed (lines are 0-based). 25 | Timer timer; // Keeps track of time spent on this run. 26 | bool print_all_lines; // Whether or not to force the printing of each line. 27 | }; 28 | 29 | //////////////////////////////////////////////////////////// 30 | // Global information about logging. 31 | 32 | struct LogInfo { 33 | LogInfo(); 34 | ~LogInfo(); 35 | 36 | void init(); 37 | void hard_flush() { out.flush(); } 38 | 39 | Run &parent_run() { return runs[ind_level-1]; } 40 | Run &this_run() { return runs[ind_level]; } 41 | Run &child_run() { return runs[ind_level+1]; } 42 | 43 | // Parameters. 44 | int max_ind_level; // Maximum indent level. 45 | int ms_per_line; // Number of milliseconds between consecutive lines of output. 46 | string log_file; 47 | 48 | // State. 49 | ofstream out; 50 | int ind_level; // Current indent level. 51 | const char *buf; // The buffer to be flushed out the next time _logs is called. 
52 | vector<Run> runs; // Indent level -> state 53 | Timer timer; // Timer that starts at the beginning of the program 54 | }; 55 | 56 | extern LogInfo log_info; 57 | 58 | //////////////////////////////////////////////////////////// 59 | 60 | #define lout (log_info.out) 61 | #define here lout << "HERE " << __FILE__ << ':' << __LINE__ << endl 62 | #define _ind_within (log_info.ind_level <= log_info.max_ind_level) 63 | #define _parent_ind_within (log_info.ind_level-1 <= log_info.max_ind_level) 64 | #define _logs(x) \ 65 | do { lout << log_info.buf << Indent(log_info.ind_level) << x; log_info.buf = ""; } while(0) 66 | #define logs(x) \ 67 | do { \ 68 | if(_ind_within && log_info.this_run().new_line()) { \ 69 | _logs(x << endl); \ 70 | } \ 71 | } while(0) 72 | // Output something if parent outputted something. 73 | // Subtle note: parent must have been a track, not logs, so its run 74 | // information has not been updated yet until it closes. 75 | // Therefore, calling print() on it is valid. 76 | #define logss(x) \ 77 | do { \ 78 | if(_parent_ind_within && log_info.parent_run().print()) { \ 79 | log_info.this_run().new_line(); \ 80 | _logs(x << endl); \ 81 | } \ 82 | } while(0) 83 | 84 | #define LOGS(x) _logs(x << endl) 85 | 86 | //////////////////////////////////////////////////////////// 87 | // For tracking functions or blocks. 88 | struct LogTracker { 89 | LogTracker(const char *name) : b(true), output_stopped(false), name(name) { } 90 | void begin(bool print_all_lines); 91 | ~LogTracker(); 92 | 93 | bool b; // Trick used in track_block to execute the for loop exactly once. 
94 | bool output_stopped; 95 | const char *name; 96 | ostringstream descrip; 97 | }; 98 | 99 | #define track(name, x, all) \ 100 | LogTracker _lt(name); \ 101 | (_ind_within && log_info.this_run().print() && _lt.descrip << x), _lt.begin(all) 102 | #define track_block(name, x, all) \ 103 | for(LogTracker _lt(name); \ 104 | _lt.b && ((_ind_within && log_info.this_run().print() && _lt.descrip << x), _lt.begin(all), true); \ 105 | _lt.b = false) 106 | 107 | #define track_foridx(i, n, s, all) \ 108 | foridx(i, n) track_block(s, i << '/' << n, all) 109 | #define track_forvec(i, tx, x, vec, s, all) \ 110 | forvec(i, tx, x, vec) track_block(s, i << '/' << len(vec), all) 111 | 112 | #define init_log \ 113 | log_info.init(); \ 114 | track("main", to_vector(argv, argc), true); \ 115 | logs(now() << " on " << hostname() << " (" << cpu_speed_mhz() << "MHz)"); 116 | 117 | #define prog_status \ 118 | "PROG_STATUS: " << \ 119 | "time = " << log_info.timer.stop() << \ 120 | ", memory = " << Mem(mem_usage()*1024) 121 | 122 | #endif 123 | -------------------------------------------------------------------------------- /code/brown-cluster/basic/mem-tracker.cc: -------------------------------------------------------------------------------- 1 | #include "mem-tracker.h" 2 | #include "mem.h" 3 | 4 | /* 5 | * Currently, memory tracking is not accurate. 6 | * Always underestimates. 
7 | */ 8 | 9 | //////////////////////////////////////////////////////////// 10 | 11 | int MemTracker::compute_mem_usage(const MemRecord &r) { 12 | switch(r.type) { 13 | list_types(define_case); 14 | default: assert(0); 15 | } 16 | return 0; 17 | } 18 | 19 | int MemTracker::compute_mem_usage() { 20 | int total_mem = 0; 21 | forvec(_, MemRecord &, r, records) { 22 | if(r.type != T_RAWNUMBER) r.mem = compute_mem_usage(r); 23 | total_mem += r.mem; 24 | } 25 | return total_mem; 26 | } 27 | 28 | static bool record_less_than(const MemRecord &r1, const MemRecord &r2) { 29 | return r1.mem > r2.mem; 30 | } 31 | 32 | void MemTracker::report_mem_usage() { 33 | track("report_mem_usage()", "", true); 34 | 35 | int total_mem = compute_mem_usage(); 36 | 37 | sort(records.begin(), records.end(), record_less_than); 38 | 39 | forvec(_, const MemRecord &, r, records) { 40 | logs(type_names[r.type] << ' ' << r.name << ": " << 41 | Mem(r.mem) << " (" << (double)r.mem/total_mem << ')'); 42 | } 43 | logs("Total: " << Mem(total_mem)); 44 | } 45 | 46 | //////////////////////////////////////////////////////////// 47 | 48 | MemTracker mem_tracker; 49 | 50 | const char *MemTracker::type_names[] = { 51 | "?", 52 | list_types(define_str) 53 | }; 54 | -------------------------------------------------------------------------------- /code/brown-cluster/basic/mem-tracker.h: -------------------------------------------------------------------------------- 1 | #ifndef __MEM_TRACKER_H__ 2 | #define __MEM_TRACKER_H__ 3 | 4 | #include "std.h" 5 | #include "stl-basic.h" 6 | #include "union-set.h" 7 | #include "strdb.h" 8 | 9 | // Currently, memory tracking is not accurate. 10 | // Always underestimates. 11 | 12 | // Call this function. Don't use anything else. 
13 | #define track_mem(x) mem_tracker.add(__STRING(x), x) 14 | 15 | #define list_types(f) \ 16 | f(IntVec) \ 17 | f(IntMat) \ 18 | f(IntIntMap) \ 19 | f(IntDoubleMap) \ 20 | f(IntIntPairMap) \ 21 | f(IntPairDoubleMap) \ 22 | f(IntSet) \ 23 | f(DoubleVec) \ 24 | f(DoubleVecVec) \ 25 | f(StrVec) \ 26 | f(StrIntMap) \ 27 | f(UnionSet) \ 28 | f(StrDB) 29 | 30 | #define prefix_t(type) T_##type, 31 | #define define_str(type) __STRING(type), 32 | #define define_add(type) \ 33 | void add(const char *name, const type &data) { \ 34 | records.push_back(MemRecord(name, T_##type, &data)); \ 35 | } 36 | #define define_case(type) \ 37 | case T_##type: return mem_usage(*((const type *)r.data)); 38 | 39 | enum MemType { T_RAWNUMBER, list_types(prefix_t) }; 40 | 41 | struct MemRecord { 42 | MemRecord(const char *name, int mem) : 43 | name(name), type(T_RAWNUMBER), data(NULL), mem(mem) { } 44 | MemRecord(const char *name, MemType type, const void *data) : 45 | name(name), type(type), data(data), mem(0) { } 46 | string name; 47 | MemType type; 48 | const void *data; 49 | int mem; 50 | }; 51 | 52 | // Track amount of memory used. 53 | class MemTracker { 54 | public: 55 | static const char *type_names[]; 56 | 57 | list_types(define_add) 58 | 59 | void add(const char *name, int mem) { 60 | records.push_back(MemRecord(name, mem)); 61 | } 62 | 63 | int compute_mem_usage(const MemRecord &r); 64 | int compute_mem_usage(); 65 | void report_mem_usage(); 66 | 67 | private: 68 | vector<MemRecord> records; 69 | }; 70 | 71 | extern MemTracker mem_tracker; 72 | 73 | //////////////////////////////////////////////////////////// 74 | // Various mem_usage() functions on various data types. 
75 | 76 | template<class T> int mem_usage(const vector< vector< vector< vector<T> > > > &mat) { // matrix 77 | int mem = 0; 78 | foridx(i, len(mat)) { 79 | foridx(j, len(mat[i])) { 80 | foridx(k, len(mat[i][j])) 81 | mem += len(mat[i][j][k]) * sizeof(T); 82 | mem += len(mat[i][j]) * sizeof(vector<T>); 83 | } 84 | mem += len(mat[i]) * sizeof(vector<T>); 85 | } 86 | mem += len(mat) * sizeof(vector<T>); 87 | return mem; 88 | } 89 | 90 | template<class T> int mem_usage(const vector< vector< vector<T> > > &mat) { // matrix 91 | int mem = 0; 92 | foridx(i, len(mat)) { 93 | foridx(j, len(mat[i])) 94 | mem += len(mat[i][j]) * sizeof(T); 95 | mem += len(mat[i]) * sizeof(vector<T>); 96 | } 97 | mem += len(mat) * sizeof(vector<T>); 98 | return mem; 99 | } 100 | 101 | template<class T> int mem_usage(const vector< vector<T> > &mat) { // matrix 102 | int mem = 0; 103 | foridx(i, len(mat)) 104 | mem += len(mat[i]) * sizeof(T); 105 | mem += len(mat) * sizeof(vector<T>); 106 | return mem; 107 | } 108 | 109 | template<class T> int mem_usage(const vector<T> &vec) { // vector 110 | return len(vec) * sizeof(T); 111 | } 112 | 113 | template<class T> int mem_usage(const unordered_set<T> &set) { // hash_set 114 | return (int)set.bucket_count()*4 + len(set)*(sizeof(T)+sizeof(void *)); 115 | } 116 | 117 | template<class Tx, class Ty> int mem_usage(const unordered_map<Tx, Ty> &map) { // hash_map 118 | return (int)map.bucket_count()*4 + len(map)*(sizeof(Tx)+sizeof(Ty)+sizeof(void *)); 119 | } 120 | 121 | inline int mem_usage(const UnionSet &u) { // UnionSet 122 | return mem_usage(u.parent); 123 | } 124 | 125 | inline int mem_usage(const StrDB &db) { // StrDB 126 | int mem = mem_usage(db.s2i) + mem_usage(db.i2s); 127 | foridx(i, len(db)) 128 | mem += (strlen(db[i])+1) * sizeof(char); 129 | return mem; 130 | } 131 | 132 | #endif 133 | -------------------------------------------------------------------------------- /code/brown-cluster/basic/mem.h: -------------------------------------------------------------------------------- 1 | #ifndef __MEM_H__ 2 | #define __MEM_H__ 3 | 4 | // Takes memory in 
bytes and formats it nicely 5 | struct Mem { Mem(int mem) : mem(mem) { } int mem; }; 6 | inline ostream &operator<<(ostream &out, const Mem &m) { 7 | unsigned int mem = m.mem; 8 | if(mem < 1024) out << mem; 9 | else if(mem < 1024*1024) out << mem/1024 << 'K'; 10 | else out << mem/(1024*1024) << 'M'; 11 | return out; 12 | } 13 | 14 | #endif 15 | -------------------------------------------------------------------------------- /code/brown-cluster/basic/multi-ostream.cc: -------------------------------------------------------------------------------- 1 | #include "multi-ostream.h" 2 | 3 | /* 4 | * Create a multi_ostream, and you can add many files or any ostream objects 5 | * to it. The output sent to the multi_ostream will be redirected to the many 6 | * destinations. 7 | * Useful for logging to a file and stdout. 8 | */ 9 | 10 | #include 11 | #include 12 | #include 13 | 14 | using namespace std; 15 | 16 | multi_buf::~multi_buf() { 17 | flush(); 18 | for(int i = 0; i < (int)infos.size(); i++) 19 | infos[i].destroy(); 20 | } 21 | 22 | void multi_buf::add(ostream *out, bool own, bool hard) { 23 | infos.push_back(ostream_info(out, own, hard)); 24 | } 25 | 26 | void multi_buf::flush() { 27 | for(int i = 0; i < (int)infos.size(); i++) { 28 | ostream_info &info = infos[i]; 29 | info.out->write(buf, buf_i); 30 | info.out->flush(); 31 | } 32 | buf_i = 0; 33 | } 34 | 35 | void multi_buf::hard_flush() { 36 | for(int i = 0; i < (int)infos.size(); i++) { 37 | ostream_info &info = infos[i]; 38 | info.out->write(buf, buf_i); 39 | if(info.hard) 40 | ((hard_ofstream *)info.out)->hard_flush(); 41 | else 42 | info.out->flush(); 43 | } 44 | buf_i = 0; 45 | } 46 | 47 | int multi_buf::overflow(int ch) { 48 | buf[buf_i++] = ch; 49 | if(buf_i == sizeof(buf) || ch == '\n') flush(); 50 | return ch; 51 | } 52 | 53 | ostream &multi_ostream::flush() { 54 | sbuf.flush(); 55 | return *this; 56 | } 57 | 58 | ostream &multi_ostream::hard_flush() { 59 | sbuf.hard_flush(); 60 | return *this; 61 | } 62 
| -------------------------------------------------------------------------------- /code/brown-cluster/basic/multi-ostream.h: -------------------------------------------------------------------------------- 1 | #ifndef __MULTI_OSTREAM_H__ 2 | #define __MULTI_OSTREAM_H__ 3 | 4 | /* 5 | * Create a multi_ostream, and you can add many files or any ostream objects 6 | * to it. The output sent to the multi_ostream will be redirected to the many 7 | * destinations. 8 | * Useful for logging to a file and stdout. 9 | */ 10 | 11 | #include 12 | #include 13 | #include 14 | 15 | #include "hard-ofstream.h" 16 | 17 | using namespace std; 18 | 19 | struct ostream_info { 20 | ostream_info(ostream *out, bool own, bool hard) : out(out), own(own), hard(hard) { } 21 | ostream *out; 22 | bool own; // Whether we own the ostream and should destroy it at the end. 23 | bool hard; // Whether this is a hard_ofstream. 24 | 25 | void destroy() { if(own) delete out; } 26 | }; 27 | 28 | class multi_buf : public streambuf { 29 | public: 30 | multi_buf() : buf_i(0) { } 31 | ~multi_buf(); 32 | 33 | void flush(); 34 | void hard_flush(); 35 | 36 | void add(ostream *out, bool own, bool hard); 37 | void remove_last() { flush(); infos.back().destroy(); infos.pop_back(); } 38 | 39 | protected: 40 | virtual int overflow(int ch); 41 | 42 | private: 43 | vector infos; 44 | char buf[16384]; 45 | int buf_i; 46 | }; 47 | 48 | class multi_ostream : public basic_ostream > { 49 | public: 50 | multi_ostream() : basic_ostream >(&sbuf) { } 51 | 52 | virtual ostream &flush(); 53 | virtual ostream &hard_flush(); 54 | 55 | void add(const char *file, bool hard = false) { 56 | ostream *out = hard ? 
new hard_ofstream(file) : new ofstream(file); 57 | sbuf.add(out, true, hard); 58 | } 59 | void add(ostream *out) { sbuf.add(out, false, false); } 60 | 61 | void remove_last() { sbuf.remove_last(); } 62 | 63 | private: 64 | multi_buf sbuf; 65 | }; 66 | 67 | #endif 68 | -------------------------------------------------------------------------------- /code/brown-cluster/basic/opt.cc: -------------------------------------------------------------------------------- 1 | #include "opt.h" 2 | #include "std.h" 3 | #include "logging.h" 4 | #include 5 | 6 | //////////////////////////////////////////////////////////////////////// 7 | // command-line arguments 8 | 9 | void GetOpt::AddOpt(const string &name, bool has_arg) { 10 | opts.push_back(pair(name, has_arg)); 11 | } 12 | 13 | void GetOpt::Parse(int argc, char *argv[]) { 14 | option *opt_list = new option[opts.size()+1]; 15 | for(int i = 0; i <= (int)opts.size(); i++) { 16 | option *o = &opt_list[i]; 17 | if(i < (int)opts.size()) { 18 | o->name = opts[i].first.c_str(); 19 | o->has_arg = opts[i].second; 20 | //printf("N %s\n", o->name); 21 | } 22 | else { 23 | o->name = NULL; 24 | o->has_arg = 0; 25 | } 26 | o->flag = NULL; 27 | o->val = 0; 28 | } 29 | 30 | int i; 31 | 32 | values.clear(); 33 | values.resize(opts.size()); 34 | while(true) { 35 | int status = getopt_long(argc, argv, "", opt_list, &i); 36 | if(status == -1) break; 37 | assert(status == 0); 38 | //debug("%d %s -> %s\n", i, opt_list[i].name, optarg); 39 | // put a 1 to signify that the argument exists 40 | values[i] = optarg ? optarg : "1"; 41 | } 42 | 43 | delete [] opt_list; 44 | } 45 | 46 | int GetOpt::Lookup(const string &name) const { 47 | for(int i = 0; i < (int)opts.size(); i++) { 48 | if(opts[i].first == name) return i; 49 | } 50 | return -1; 51 | } 52 | 53 | string GetOpt::Get(const string &name, const string &default_value) const { 54 | int i = Lookup(name); 55 | return i != -1 && !values[i].empty() ? 
values[i] : default_value; 56 | } 57 | 58 | string GetOpt::Get(const string &name) const { 59 | string x = Get(name, ""); 60 | if(x.empty()) { 61 | fprintf(stderr, "Missing required parameter `%s'.\n", name.c_str()); 62 | exit(1); 63 | } 64 | return x; 65 | } 66 | 67 | bool GetOpt::Exists(const string &name) const { 68 | return !Get(name, "").empty(); 69 | } 70 | 71 | int GetOpt::GetInt(const string &name) const { 72 | int x; 73 | int r = sscanf(Get(name).c_str(), "%d", &x); 74 | assert(r == 1); 75 | return x; 76 | } 77 | 78 | int GetOpt::GetInt(const string &name, int default_value) const { 79 | return Exists(name) ? GetInt(name) : default_value; 80 | } 81 | 82 | double GetOpt::GetDouble(const string &name) const { 83 | double x; 84 | int r = sscanf(Get(name).c_str(), "%lf", &x); 85 | assert(r == 1); 86 | return x; 87 | } 88 | 89 | double GetOpt::GetDouble(const string &name, double default_value) const { 90 | return Exists(name) ? GetDouble(name) : default_value; 91 | } 92 | 93 | //////////////////////////////////////////////////////////// 94 | 95 | void process_opt(int argc, char *argv[]) { 96 | GetOpt opt; 97 | 98 | // set up GetOpt to parse 99 | for(int i = 0; i < (int)bool_opts.size(); i++) { 100 | opt.AddOpt(bool_opts[i].name, false); 101 | opt.AddOpt("no" + bool_opts[i].name, false); 102 | } 103 | for(int i = 0; i < (int)int_opts.size(); i++) 104 | opt.AddOpt(int_opts[i].name, true); 105 | for(int i = 0; i < (int)double_opts.size(); i++) 106 | opt.AddOpt(double_opts[i].name, true); 107 | for(int i = 0; i < (int)string_opts.size(); i++) 108 | opt.AddOpt(string_opts[i].name, true); 109 | opt.AddOpt("help", false); 110 | 111 | // parse 112 | opt.Parse(argc, argv); 113 | 114 | // print help if called for 115 | if(opt.Exists("help") || !opt.Exists("text")) { 116 | printf("usage: %s\n", argv[0]); 117 | for(int i = 0; i < (int)bool_opts.size(); i++) { 118 | const OptInfo &o = bool_opts[i]; 119 | printf(" %c%-20s: %s", " *"[o.required], o.name.c_str(), 
o.msg.c_str()); 120 | if(!o.required) printf(" [%s]", *(o.var) ? "true" : "false"); 121 | printf("\n"); 122 | } 123 | for(int i = 0; i < (int)int_opts.size(); i++) { 124 | const OptInfo &o = int_opts[i]; 125 | printf(" %c%-13s : %s", " *"[o.required], o.name.c_str(), o.msg.c_str()); 126 | if(!o.required) printf(" [%d]", *(o.var)); 127 | printf("\n"); 128 | } 129 | for(int i = 0; i < (int)double_opts.size(); i++) { 130 | const OptInfo &o = double_opts[i]; 131 | printf(" %c%-13s : %s", " *"[o.required], o.name.c_str(), o.msg.c_str()); 132 | if(!o.required) printf(" [%f]", *(o.var)); 133 | printf("\n"); 134 | } 135 | for(int i = 0; i < (int)string_opts.size(); i++) { 136 | const OptInfo &o = string_opts[i]; 137 | printf(" %c%-13s : %s", " *"[o.required], o.name.c_str(), o.msg.c_str()); 138 | if(!o.required) printf(" [%s]", (o.var)->c_str()); 139 | printf("\n"); 140 | } 141 | exit(1); 142 | } 143 | 144 | // retrieve data; store the variables 145 | for(int i = 0; i < (int)bool_opts.size(); i++) { 146 | const OptInfo &o = bool_opts[i]; 147 | bool yes = opt.Exists(o.name); 148 | bool no = opt.Exists("no" + o.name); 149 | assert(!o.required || (yes || no)); 150 | assert(!yes || !no); 151 | if(yes) *(o.var) = true; 152 | if(no) *(o.var) = false; 153 | } 154 | for(int i = 0; i < (int)int_opts.size(); i++) { 155 | const OptInfo &o = int_opts[i]; 156 | *(o.var) = o.required ? opt.GetInt(o.name) : opt.GetInt(o.name, *(o.var)); 157 | } 158 | for(int i = 0; i < (int)double_opts.size(); i++) { 159 | const OptInfo &o = double_opts[i]; 160 | *(o.var) = o.required ? opt.GetDouble(o.name) : opt.GetDouble(o.name, *(o.var)); 161 | } 162 | for(int i = 0; i < (int)string_opts.size(); i++) { 163 | const OptInfo &o = string_opts[i]; 164 | *(o.var) = o.required ? 
opt.Get(o.name) : opt.Get(o.name, *(o.var)); 165 | } 166 | } 167 | 168 | void init_opt(int argc, char *argv[]) { 169 | process_opt(argc, argv); 170 | srand(rand_seed); 171 | } 172 | 173 | void print_opts() { 174 | track("print_opts()", "", true); 175 | forvec(_, const OptInfo &, o, bool_opts) 176 | logs(o.name << " = " << (*o.var ? "true" : "false")); 177 | forvec(_, const OptInfo &, o, int_opts) 178 | logs(o.name << " = " << *o.var); 179 | forvec(_, const OptInfo &, o, double_opts) 180 | logs(o.name << " = " << *o.var); 181 | forvec(_, const OptInfo &, o, string_opts) 182 | logs(o.name << " = " << *o.var); 183 | } 184 | 185 | //////////////////////////////////////////////////////////// 186 | // Pre defined options. 187 | 188 | // allow user to specify a comment always, so some arbitrary description 189 | // of this program execution can be embedded in the command-line 190 | -------------------------------------------------------------------------------- /code/brown-cluster/basic/opt.h: -------------------------------------------------------------------------------- 1 | #ifndef __OPT_H__ 2 | #define __OPT_H__ 3 | 4 | #include 5 | #include 6 | #include 7 | 8 | using namespace std; 9 | 10 | // First thing to call in main(). 
11 | void init_opt(int argc, char *argv[]); 12 | 13 | //////////////////////////////////////////////////////////////////////// 14 | // command-line arguments 15 | 16 | class GetOpt { 17 | public: 18 | GetOpt() { } 19 | 20 | void AddOpt(const string &name, bool has_arg); 21 | void Parse(int argc, char *argv[]); 22 | int Lookup(const string &name) const; 23 | 24 | bool Exists(const string &name) const; 25 | string Get(const string &name, const string &default_value) const; 26 | string Get(const string &name) const; 27 | int GetInt(const string &name) const; 28 | int GetInt(const string &name, int default_value) const; 29 | double GetDouble(const string &name) const; 30 | double GetDouble(const string &name, double default_value) const; 31 | 32 | private: 33 | vector< pair<string, bool> > opts; 34 | vector<string> values; 35 | }; 36 | 37 | template<class T> struct OptInfo { 38 | OptInfo(const string &name, T *var, const string &msg, bool required) 39 | : name(name), var(var), msg(msg), required(required) { } 40 | 41 | string name; 42 | T *var; // location of the variable that stores this value 43 | string msg; 44 | bool required; 45 | }; 46 | 47 | extern vector< OptInfo<bool> > bool_opts; 48 | extern vector< OptInfo<int> > int_opts; 49 | extern vector< OptInfo<double> > double_opts; 50 | extern vector< OptInfo<string> > string_opts; 51 | 52 | //////////////////////////////////////////////////////////// 53 | 54 | // Two versions of each macro: the _req version makes the option required; the other takes a default value. 55 | #define opt_define_bool_req(var, name, msg) \ 56 | bool var = opt_define_bool_wrap(name, &var, false, msg, true) 57 | #define opt_define_bool(var, name, val, msg) \ 58 | bool var = opt_define_bool_wrap(name, &var, val, msg, false) 59 | #define opt_define_int_req(var, name, msg) \ 60 | int var = opt_define_int_wrap(name, &var, 0, msg, true) 61 | #define opt_define_int(var, name, val, msg) \ 62 | int var = opt_define_int_wrap(name, &var, val, msg, false) 63 | #define opt_define_double_req(var, name, msg) \ 64 | double var = opt_define_double_wrap(name, &var, 0.0, msg, 
true) 65 | #define opt_define_double(var, name, val, msg) \ 66 | double var = opt_define_double_wrap(name, &var, val, msg, false) 67 | #define opt_define_string_req(var, name, msg) \ 68 | string var = opt_define_string_wrap(name, &var, "", msg, true) 69 | #define opt_define_string(var, name, val, msg) \ 70 | string var = opt_define_string_wrap(name, &var, val, msg, false) 71 | 72 | inline bool opt_define_bool_wrap(const string &name, bool *var, bool val, const string &msg, bool required) { 73 | bool_opts.push_back(OptInfo<bool>(name, var, msg, required)); 74 | return val; 75 | } 76 | 77 | inline int opt_define_int_wrap(const string &name, int *var, int val, const string &msg, bool required) { 78 | //printf("HELLO %s\n", name.c_str()); 79 | int_opts.push_back(OptInfo<int>(name, var, msg, required)); 80 | //printf("N %d\n", (int)int_opts.size()); 81 | return val; 82 | } 83 | inline double opt_define_double_wrap(const string &name, double *var, double val, const string &msg, bool required) { 84 | double_opts.push_back(OptInfo<double>(name, var, msg, required)); 85 | return val; 86 | } 87 | inline string opt_define_string_wrap(const string &name, string *var, const string &val, const string &msg, bool required) { 88 | string_opts.push_back(OptInfo<string>(name, var, msg, required)); 89 | return val; 90 | } 91 | 92 | //////////////////////////////////////////////////////////// 93 | 94 | void print_opts(); 95 | 96 | extern int rand_seed; 97 | extern string comment; 98 | extern int initC; 99 | 100 | #endif 101 | -------------------------------------------------------------------------------- /code/brown-cluster/basic/pipe.h: -------------------------------------------------------------------------------- 1 | /* 2 | Execute another application, piping input to and from its stdin and stdout. 3 | */ 4 | 5 | #ifndef __PIPE_H__ 6 | #define __PIPE_H__ 7 | 8 | typedef pair<FILE *, FILE *> FILEPair; 9 | 10 | // Return input and output file pointers. 11 | // User is responsible for closing them. 
12 | // May have to close out before reading from in. 13 | FILEPair create_pipe(char *const cmd[]) { 14 | int p2c_fds[2], c2p_fds[2]; 15 | 16 | assert(pipe(p2c_fds) == 0); 17 | assert(pipe(c2p_fds) == 0); 18 | 19 | int pid = fork(); 20 | assert(pid != -1); 21 | if(pid != 0) { // parent 22 | close(p2c_fds[0]); 23 | close(c2p_fds[1]); 24 | 25 | FILE *in = fdopen(c2p_fds[0], "r"); 26 | FILE *out = fdopen(p2c_fds[1], "w"); 27 | 28 | assert(in && out); 29 | 30 | return FILEPair(in, out); 31 | } 32 | else { // child 33 | close(p2c_fds[1]); 34 | close(c2p_fds[0]); 35 | 36 | assert(dup2(p2c_fds[0], fileno(stdin)) != -1); 37 | assert(dup2(c2p_fds[1], fileno(stdout)) != -1); 38 | execvp(cmd[0], cmd); 39 | 40 | // Execution should not reach here. 41 | assert(0); 42 | return FILEPair(NULL, NULL); 43 | } 44 | } 45 | 46 | #endif 47 | -------------------------------------------------------------------------------- /code/brown-cluster/basic/prob-utils.cc: -------------------------------------------------------------------------------- 1 | #include "prob-utils.h" 2 | 3 | double rand_gaussian(double mean, double var) { 4 | // Use the Box-Muller Transformation 5 | // if x_1 and x_2 are independent uniform [0, 1], 6 | // then sqrt(-2 ln x_1) * cos(2*pi*x_2) is Gaussian with mean 0 and variance 1 7 | double x1 = rand_double(), x2 = rand_double(); 8 | double z = sqrt(-2*log(x1))*cos(2*M_PI*x2); 9 | return z * sqrt(var) + mean; 10 | } 11 | 12 | // The probability of heads is p. 13 | // Throw n coin tosses. 14 | // Return number of heads. 
15 | int rand_binomial(int n, double p) { 16 | int k = 0; 17 | while(n--) k += rand_double() < p; 18 | return k; 19 | } 20 | 21 | inline double factorial(int n) { 22 | double ans = 1; 23 | while(n > 1) ans *= n--; 24 | return ans; 25 | } 26 | 27 | inline double choose(int n, int k) { 28 | if(n-k < k) k = n-k; 29 | double ans = 1; 30 | for(int i = 0; i < k; i++) ans *= n-i; 31 | ans /= factorial(k); 32 | return ans; 33 | } 34 | 35 | double binomial_prob(int n, int k, double p) { 36 | return choose(n, k) * pow(p, k) * pow(1-p, n-k); 37 | } 38 | 39 | int rand_index(const fvector &probs) { 40 | double v = rand_double(); 41 | double sum = 0; 42 | foridx(i, len(probs)) { 43 | sum += probs[i]; 44 | if(v < sum) return i; 45 | } 46 | assert(0); 47 | } 48 | 49 | void norm_distrib(fvector &vec) { 50 | double sum = 0; 51 | foridx(i, len(vec)) sum += vec[i]; 52 | foridx(i, len(vec)) vec[i] /= sum; 53 | } 54 | 55 | void norm_distrib(fmatrix &mat, int c) { 56 | double sum = 0; 57 | foridx(r, len(mat)) sum += mat[r][c]; 58 | foridx(r, len(mat)) mat[r][c] /= sum; 59 | } 60 | 61 | void rand_distrib(fvector &probs, int n) { 62 | probs.resize(n); 63 | foridx(i, n) probs[i] = rand(); 64 | norm_distrib(probs); 65 | } 66 | 67 | IntVec rand_permutation(int n) { 68 | IntVec perm(n); 69 | foridx(i, n) perm[i] = i; 70 | foridx(i, n) { 71 | int j = mrand(i, n); 72 | int t = perm[i]; perm[i] = perm[j]; perm[j] = t; 73 | } 74 | return perm; 75 | } 76 | -------------------------------------------------------------------------------- /code/brown-cluster/basic/prob-utils.h: -------------------------------------------------------------------------------- 1 | #ifndef __PROB_UTILS__ 2 | #define __PROB_UTILS__ 3 | 4 | #include "stl-basic.h" 5 | 6 | int rand_binomial(int n, double p); 7 | int rand_index(const fvector &probs); 8 | double rand_gaussian(double mean, double var); 9 | 10 | inline double factorial(int n); 11 | inline double choose(int n, int k); 12 | double binomial_prob(int n, int k, double 
p); 13 | 14 | void norm_distrib(fvector &vec); 15 | void norm_distrib(fmatrix &mat, int c); 16 | void rand_distrib(fvector &probs, int n); 17 | IntVec rand_permutation(int n); 18 | 19 | #endif 20 | -------------------------------------------------------------------------------- /code/brown-cluster/basic/stats.cc: -------------------------------------------------------------------------------- 1 | #include "stats.h" 2 | -------------------------------------------------------------------------------- /code/brown-cluster/basic/stats.h: -------------------------------------------------------------------------------- 1 | #ifndef __STATS_H__ 2 | #define __STATS_H__ 3 | 4 | #include "std.h" 5 | #include "stl-basic.h" 6 | #define DBL_MAX 1e300 7 | #define DBL_MIN (-1e300) 8 | 9 | struct StatFig { 10 | StatFig() { clear(); } 11 | StatFig(double sum, int n) : sum(sum), n(n) { } 12 | virtual ~StatFig() { } 13 | 14 | static double F1(const StatFig &fig1, const StatFig &fig2) { 15 | if(fig1.n == 0 || fig2.n == 0) return 0; 16 | return 2*fig1.val()*fig2.val() / (fig1.val()+fig2.val()); 17 | } 18 | 19 | void add() { add(1); } 20 | virtual void add(double v) { sum += v; n++; } 21 | virtual void clear() { sum = n = 0; } 22 | int size() const { return n; } 23 | double val() const { return sum / n; } 24 | double mean() const { return sum / n; } 25 | double sum; 26 | int n; 27 | }; 28 | 29 | inline ostream &operator<<(ostream &out, const StatFig &fig) { 30 | return out << fig.sum << '/' << fig.n << '=' << fig.val(); 31 | } 32 | 33 | //////////////////////////////////////////////////////////// 34 | // Stores the min and the max 35 | 36 | struct BigStatFig : public StatFig { 37 | BigStatFig() { clear(); } 38 | void add(double v) { if(v < min) min = v; if(v > max) max = v; StatFig::add(v); } 39 | void clear() { min = DBL_MAX; max = DBL_MIN; StatFig::clear(); } 40 | double min, max; 41 | }; 42 | 43 | inline ostream &operator<<(ostream &out, const BigStatFig &fig) { 44 | return out <<
fig.n << ':' << fig.min << "/<< " << fig.val() << " >>/" << fig.max; 45 | } 46 | 47 | //////////////////////////////////////////////////////////// 48 | // Stores the standard deviation (and all points) 49 | 50 | struct FullStatFig : public BigStatFig { 51 | FullStatFig() { clear(); } 52 | virtual ~FullStatFig() { } 53 | void add(double v) { data.push_back(v); BigStatFig::add(v); } 54 | void clear() { data.clear(); BigStatFig::clear(); } 55 | 56 | double variance() const { 57 | double var = 0, mean = val(); 58 | forvec(_, double, v, data) var += sq(v-mean); 59 | var /= n; 60 | return var; 61 | } 62 | double stddev() const { return sqrt(variance()); } 63 | 64 | DoubleVec data; 65 | }; 66 | 67 | inline ostream &operator<<(ostream &out, const FullStatFig &fig) { 68 | return out << (BigStatFig)fig << '~' << fig.stddev(); 69 | } 70 | 71 | #endif 72 | -------------------------------------------------------------------------------- /code/brown-cluster/basic/std.cc: -------------------------------------------------------------------------------- 1 | #include 2 | #include 3 | #include 4 | #include "std.h" 5 | #include "str.h" 6 | #include "timer.h" 7 | 8 | // Return the current date/time. 9 | string now() { 10 | time_t t = time(NULL); 11 | return substr(ctime(&t), 0, -1); 12 | } 13 | 14 | string hostname() { 15 | char buf[1024]; 16 | gethostname(buf, sizeof(buf)); 17 | return buf; 18 | } 19 | 20 | // Return the amount of memory (kB) used by this process 21 | int mem_usage() { 22 | ifstream in("/proc/self/status"); 23 | if(!in) return 0; 24 | char buf[1024]; 25 | static const char *key = "VmRSS"; 26 | 27 | while(in.getline(buf, sizeof(buf))) { 28 | if(strncmp(buf, key, strlen(key)) != 0) continue; 29 | char *s = strchr(buf, ':'); 30 | if(!s) return 0; 31 | int x; 32 | sscanf(s+1, "%d", &x); 33 | return x; 34 | } 35 | return -1; 36 | } 37 | 38 | // Return whether the file exists. 
39 | bool file_exists(const char *file) { 40 | return access(file, F_OK) == 0; 41 | } 42 | 43 | // Create an empty file. Return success. 44 | bool create_file(const char *file) { 45 | ofstream out(file); 46 | if(!out) return false; 47 | out.close(); 48 | return true; 49 | } 50 | 51 | time_t file_modified_time(const char *file) { 52 | struct stat stat_buf; 53 | if(stat(file, &stat_buf) != 0) 54 | return 0; 55 | return stat_buf.st_mtime; 56 | } 57 | 58 | // Return the cpu speed in MHz. 59 | int cpu_speed_mhz() { 60 | ifstream in("/proc/cpuinfo"); 61 | if(!in) return 0; 62 | char buf[1024]; 63 | static const char *key = "cpu MHz"; 64 | 65 | while(in.getline(buf, sizeof(buf))) { 66 | if(strncmp(buf, key, strlen(key)) != 0) continue; 67 | char *s = strchr(buf, ':'); 68 | if(!s) return 0; 69 | double x; 70 | sscanf(s+1, "%lf", &x); 71 | return (int)x; 72 | } 73 | return 0; 74 | } 75 | 76 | // "file" -> "file" 77 | // "dir/file" -> "file" 78 | string strip_dir(string s) { 79 | return substr(s, s.rfind('/')+1); 80 | } 81 | 82 | // "file" -> "file" 83 | // "dir/file" -> "dir" 84 | string get_dir(string s) { 85 | int i = s.rfind('/'); 86 | return i == -1 ? "." : substr(s, 0, s.rfind('/')); 87 | } 88 | 89 | // "base" -> "base" 90 | // "base.ext" -> "base" 91 | string file_base(string s) { 92 | int i = s.rfind('.'); 93 | return i == -1 ? s : substr(s, 0, i); 94 | } 95 | 96 | bool get_files_in_dir(string dirname, bool fullpath, vector &files) { 97 | DIR *dir = opendir(dirname.c_str()); 98 | if(!dir) return false; 99 | while(true) { 100 | dirent *ent = readdir(dir); 101 | if(!ent) break; 102 | // For some reason, sometimes files show up as d_type == DT_UNKNOWN, I 103 | // think due to AFS issues 104 | //cout << "FFF " << ent->d_name << ' ' << (int)ent->d_type << endl; 105 | if(ent->d_type != DT_DIR) { 106 | files.push_back((fullpath ? 
dirname+"/" : string()) + ent->d_name); 107 | } 108 | } 109 | closedir(dir); 110 | return true; 111 | } 112 | -------------------------------------------------------------------------------- /code/brown-cluster/basic/std.h: -------------------------------------------------------------------------------- 1 | #ifndef __STD_H__ 2 | #define __STD_H__ 3 | 4 | #include 5 | #include 6 | #include 7 | //#include 8 | #include 9 | #include 10 | #include 11 | #include 12 | #include 13 | #include 14 | #include 15 | #include 16 | #include 17 | #include 18 | #include 19 | #include 20 | 21 | using namespace std; 22 | 23 | //////////////////////////////////////////////////////////// 24 | 25 | #define len(vec) (int)(vec).size() 26 | #define sq(x) ((x)*(x)) 27 | 28 | // For loop sugar. This is such a hack! 29 | #define foridx(i, n) for(int i = 0; i < n; i++) 30 | #define forvec(i, tx, x, vec) for(int i = 0, _##i = 0; i < len(vec); i++) \ 31 | for(tx x = (vec)[i]; i == _##i; _##i++) 32 | #define formap(tx, x, ty, y, t, map) forstl(t, _##x##y, map) _mapvars(tx, x, ty, y) 33 | #define forcmap(tx, x, ty, y, t, map) forcstl(t, _##x##y, map) _mapvars(tx, x, ty, y) 34 | #define forstl(t, x, container) for(t::iterator x = (container).begin(); x != (container).end(); x++) 35 | #define forcstl(t, x, container) for(t::const_iterator x = (container).begin(); x != (container).end(); x++) 36 | #define _mapvars(tx, x, ty, y) for(tx x = _##x##y->first, *_##x = &x; _##x; _##x = NULL) \ 37 | for(ty y = _##x##y->second, *_##y = &y; _##y; _##y = NULL) 38 | 39 | //////////////////////////////////////////////////////////// 40 | // Generate random numbers. 41 | 42 | inline int mrand(int a) { return rand() % a; } 43 | inline int mrand(int a, int b) { return rand() % (b-a) + a; } 44 | inline double rand_double() { 45 | static const int BASE = 100000; 46 | return (double)(rand()%BASE)/BASE; 47 | } 48 | 49 | //////////////////////////////////////////////////////////// 50 | // Floating point stuff. 
51 | 52 | const double TOL = 1e-10; 53 | 54 | inline bool flt(double u, double v) { return u + TOL < v; } 55 | inline bool fgt(double u, double v) { return u - TOL > v; } 56 | 57 | // Comparing floating point numbers. 58 | inline bool feq(double u, double v, double tol = TOL) { return fabs(u-v) < tol; } 59 | 60 | template inline int sign(T u) { 61 | if(u < 0) return -1; 62 | if(u > 0) return 1; 63 | return 0; 64 | } 65 | 66 | #define assert_feq(u, v) do { _assert_feq(u, v, __FILE__, __LINE__); } while(0); 67 | #define assert_feq2(u, v, tol) do { _assert_feq(u, v, tol, __FILE__, __LINE__); } while(0); 68 | #define assert_fneq(u, v) do { _assert_fneq(u, v, __FILE__, __LINE__); } while(0); 69 | inline void _assert_feq(double u, double v, const char *file, int line) { 70 | if(!feq(u, v)) { printf("At %s:%d, %f != %f\n", file, line, u, v); assert(0); } 71 | } 72 | inline void _assert_feq(double u, double v, double tol, const char *file, int line) { 73 | if(!feq(u, v, tol)) { printf("At %s:%d, %f != %f\n", file, line, u, v); assert(0); } 74 | } 75 | inline void _assert_fneq(double u, double v, const char *file, int line) { 76 | if(feq(u, v)) { printf("At %s:%d, %f == %f\n", file, line, u, v); assert(0); } 77 | } 78 | #define assert_eq(u, v) do { _assert_eq(u, v, __STRING(u), __STRING(v), __FILE__, __LINE__); } while(0) 79 | template inline void _assert_eq(const T &u, const T &v, const char *us, const char *vs, const char *file, int line) { 80 | if(u != v) { 81 | cout << "At " << file << ':' << line << ", " << 82 | us << '(' << u << ')' << " != " << 83 | vs << '(' << v << ')' << endl; 84 | assert(0); 85 | } 86 | } 87 | 88 | #define assert2(x, reason) \ 89 | do { \ 90 | if(!(x)) { \ 91 | cout << "\nFAILURE REASON: " << reason << endl; \ 92 | assert(x); \ 93 | } \ 94 | } while(0) 95 | 96 | string now(); 97 | string hostname(); 98 | int cpu_speed_mhz(); 99 | int mem_usage(); // in kB 100 | 101 | bool create_file(const char *file); 102 | bool file_exists(const char *file); 
103 | time_t file_modified_time(const char *file); 104 | 105 | string strip_dir(string s); 106 | string get_dir(string s); 107 | string file_base(string s); 108 | bool get_files_in_dir(string dirname, bool fullpath, vector &files); 109 | 110 | #endif 111 | -------------------------------------------------------------------------------- /code/brown-cluster/basic/stl-basic.cc: -------------------------------------------------------------------------------- 1 | #include "stl-basic.h" 2 | -------------------------------------------------------------------------------- /code/brown-cluster/basic/stl-basic.h: -------------------------------------------------------------------------------- 1 | #ifndef __STL_BASIC_H__ 2 | #define __STL_BASIC_H__ 3 | 4 | #include "std.h" 5 | #include "city.h" 6 | 7 | //////////////////////////////////////////////////////////// 8 | 9 | typedef double real; 10 | //typedef float real; 11 | 12 | typedef pair IntPair; 13 | typedef pair IntDouble; 14 | typedef pair DoubleInt; 15 | typedef pair DoublePair; 16 | typedef vector IntPairVec; 17 | typedef vector DoubleIntVec; 18 | typedef vector BoolVec; 19 | typedef vector IntVec; 20 | typedef vector StringVec; 21 | typedef vector IntMat; 22 | typedef vector IntVecVec; 23 | typedef vector IntVecVecVec; 24 | typedef vector IntVecVecVecVec; 25 | typedef vector DoubleVec; 26 | typedef vector DoubleVecVec; 27 | typedef vector DoubleVecVecVec; 28 | typedef vector DoubleVecVecVecVec; 29 | typedef vector IntDoubleVec; 30 | typedef vector IntDoubleVecVec; 31 | typedef vector IntDoubleVecVecVec; 32 | typedef vector IntDoubleVecVecVecVec; 33 | 34 | typedef IntVec ivector; 35 | typedef DoubleVec fvector; 36 | typedef DoubleVecVec fmatrix; 37 | 38 | //////////////////////////////////////////////////////////// 39 | 40 | struct vector_eq { 41 | bool operator()(const IntVec &v1, const IntVec &v2) const { 42 | return v1 == v2; 43 | } 44 | }; 45 | struct vector_hf { 46 | size_t operator()(const IntVec &v) const { 47 | 
return CityHash64(reinterpret_cast(&v[0]), sizeof(int) * v.size()); 48 | #if 0 49 | int h = 0; 50 | foridx(i, len(v)) 51 | h = (h<<4)^(h>>28)^v[i]; 52 | return h; 53 | #endif 54 | } 55 | }; 56 | 57 | struct pair_eq { 58 | bool operator()(const IntPair &p1, const IntPair &p2) const { 59 | return p1 == p2; 60 | } 61 | }; 62 | struct pair_hf { 63 | size_t operator()(const IntPair &p) const { 64 | return (p.first<<4)^(p.first>>28) ^ p.second; 65 | } 66 | }; 67 | 68 | struct str_eq { 69 | bool operator()(const char *s1, const char *s2) const { 70 | return strcmp(s1, s2) == 0; 71 | } 72 | }; 73 | struct str_hf { 74 | size_t operator()(const char *s) const { 75 | return CityHash64(s, strlen(s)); 76 | } 77 | }; 78 | 79 | struct string_eq { 80 | bool operator()(const string &s1, const string &s2) const { 81 | return s1 == s2; 82 | } 83 | }; 84 | struct string_hf { 85 | size_t operator()(const string &s) const { 86 | return CityHash64(s.c_str(), s.size()); 87 | } 88 | }; 89 | 90 | //////////////////////////////////////////////////////////// 91 | 92 | typedef unordered_set IntSet; 93 | typedef unordered_set IntPairSet; 94 | typedef unordered_set IntVecSet; 95 | typedef unordered_map IntVecDoubleMap; 96 | typedef unordered_map IntVecIntMap; 97 | typedef unordered_map IntIntMap; 98 | typedef unordered_map IntDoubleMap; 99 | typedef unordered_map IntIntPairMap; 100 | typedef unordered_map IntIntVecMap; 101 | typedef unordered_map IntIntIntMapMap; 102 | typedef unordered_map IntPairIntMap; 103 | typedef unordered_map IntPairDoubleMap; 104 | typedef unordered_map IntPairDoubleVecMap; 105 | typedef unordered_map IntVecIntVecMap; 106 | typedef unordered_map IntVecDoubleVecMap; 107 | typedef vector IntIntMapVec; 108 | 109 | typedef vector StrVec; 110 | typedef unordered_map StrIntMap; 111 | typedef unordered_map StrStrMap; 112 | 113 | #endif 114 | -------------------------------------------------------------------------------- /code/brown-cluster/basic/stl-utils.cc: 
-------------------------------------------------------------------------------- 1 | #include "stl-utils.h" 2 | -------------------------------------------------------------------------------- /code/brown-cluster/basic/stl-utils.h: -------------------------------------------------------------------------------- 1 | #ifndef __STL_UTILS__ 2 | #define __STL_UTILS__ 3 | 4 | #include "stl-basic.h" 5 | #include 6 | 7 | #define contains(X, x) ((X).find(x) != (X).end()) 8 | 9 | inline void improve(DoubleInt &x, const DoubleInt &y) { 10 | if(y.first > x.first) x = y; // Bigger is better. 11 | } 12 | 13 | template inline void improve(DoubleInt &x, const DoubleInt &y, Compare compare) { 14 | if(compare(y.first, x.first)) x = y; 15 | } 16 | 17 | // Free up the memory in a vector or hash_map. 18 | template void destroy(T &obj) { 19 | T empty_obj; 20 | obj.swap(empty_obj); 21 | } 22 | 23 | template int index_of(const vector &vec, const T &x, int i0 = 0) { 24 | for(int i = i0; i < len(vec); i++) 25 | if(vec[i] == x) return i; 26 | return -1; 27 | } 28 | 29 | template int count_of(const vector &vec, const T &x) { 30 | int n = 0; 31 | forvec(_, const T &, y, vec) 32 | if(x == y) n++; 33 | return n; 34 | } 35 | 36 | // Get vec[i], but if i is out of range, expand the vector and fill 37 | // everything with x. 
38 | template T &expand_get(vector &vec, int i, const T &x) { 39 | int n = len(vec); 40 | if(i >= n) { 41 | vec.resize(i+1); 42 | for(int ii = n; ii <= i; ii++) vec[ii] = x; 43 | } 44 | return vec[i]; 45 | } 46 | template T &expand_get(vector< vector > &mat, int i, int j, const T &x) { 47 | int n = len(mat); 48 | if(i >= n) mat.resize(i+1); 49 | return expand_get(mat[i], j, x); 50 | } 51 | template T &expand_get(vector< vector< vector > > &mat, int i, int j, int k, const T &x) { 52 | int n = len(mat); 53 | if(i >= n) mat.resize(i+1); 54 | return expand_get(mat[i], j, k, x); 55 | } 56 | 57 | // Assuming this vector/matrix will not grow any more, 58 | // we can safely call compact to reduce the memory usage. 59 | // This is only effective after deletions. 60 | // This isn't necessary if we haven't actually touched 61 | // the memory past size (i.e., we didn't have a bigger 62 | // structure). 63 | template void vector_compact(vector &vec) { 64 | vector new_vec(len(vec)); 65 | new_vec = vec; 66 | vec.swap(new_vec); 67 | } 68 | template void matrix_compact(vector< vector > &mat) { 69 | vector< vector > new_mat(len(mat)); 70 | foridx(i, len(mat)) vector_compact(mat[i]); 71 | new_mat = mat; 72 | mat.swap(new_mat); 73 | } 74 | 75 | // Append to a vector and return a reference to the new element.
76 | template inline T &push_back(vector &vec, const T &x = T()) { 77 | vec.push_back(x); 78 | return vec[len(vec)-1]; 79 | } 80 | 81 | template inline void matrix_resize(vector< vector > &mat, int nr, int nc) { 82 | mat.resize(nr); 83 | foridx(r, nr) mat[r].resize(nc); 84 | } 85 | 86 | template inline void matrix_resize(vector< vector< vector > > &mat, int n1, int n2, int n3) { 87 | mat.resize(n1); 88 | foridx(i, n1) { 89 | mat[i].resize(n2); 90 | foridx(j, n2) 91 | mat[i][j].resize(n3); 92 | } 93 | } 94 | 95 | template inline vector< vector > new_matrix(int nr, int nc, T v) { 96 | vector< vector > mat; 97 | mat.resize(nr); 98 | foridx(r, nr) { 99 | mat[r].resize(nc); 100 | foridx(c, nc) 101 | mat[r][c] = v; 102 | } 103 | return mat; 104 | } 105 | 106 | template inline void matrix_fill(vector< vector > &mat, T v) { 107 | foridx(i, len(mat)) vector_fill(mat[i], v); 108 | } 109 | 110 | template inline void vector_fill(vector &vec, T v) { 111 | foridx(i, len(vec)) vec[i] = v; 112 | } 113 | 114 | template inline T vector_sum(const vector &vec) { 115 | T sum = 0; 116 | foridx(i, len(vec)) sum += vec[i]; 117 | return sum; 118 | } 119 | 120 | // Returns the index of the minimum element in vec. 121 | template inline int vector_index_min(const vector &vec) { 122 | T min = vec[0]; 123 | int best_i = 0; 124 | foridx(i, len(vec)) { 125 | if(vec[i] < min) { 126 | min = vec[i]; 127 | best_i = i; 128 | } 129 | } 130 | return best_i; 131 | } 132 | 133 | template inline int vector_min(const vector &vec) { 134 | return vec[vector_index_min(vec)]; 135 | } 136 | 137 | // Returns the index of the maximum element in vec. 
138 | template inline int vector_index_max(const vector &vec) { 139 | T max = vec[0]; 140 | int best_i = 0; 141 | foridx(i, len(vec)) { 142 | if(vec[i] > max) { 143 | max = vec[i]; 144 | best_i = i; 145 | } 146 | } 147 | return best_i; 148 | } 149 | 150 | template inline int vector_max(const vector &vec) { 151 | return vec[vector_index_max(vec)]; 152 | } 153 | 154 | // Returns the index of the maximum element in vec. 155 | template inline IntPair matrix_index_max(const vector< vector > &mat) { 156 | T max = mat[0][0]; 157 | IntPair best_ij = IntPair(0, 0); 158 | foridx(i, len(mat)) { 159 | foridx(j, len(mat[i])) { 160 | if(mat[i][j] > max) { 161 | max = mat[i][j]; 162 | best_ij = IntPair(i, j); 163 | } 164 | } 165 | } 166 | return best_ij; 167 | } 168 | 169 | // Returns the sum of the elements in column c. 170 | template inline T matrix_col_sum(const vector< vector > &mat, int c) { 171 | T sum = 0; 172 | foridx(r, len(mat)) sum += mat[r][c]; 173 | return sum; 174 | } 175 | 176 | template ostream &operator<<(ostream &out, const pair &p) { 177 | return out << p.first << ' ' << p.second; 178 | } 179 | 180 | template ostream &operator<<(ostream &out, const vector &vec) { 181 | foridx(i, len(vec)) { 182 | if(i > 0) out << ' '; 183 | out << vec[i]; 184 | } 185 | return out; 186 | } 187 | 188 | template ostream &operator<<(ostream &out, const vector< vector > &mat) { 189 | foridx(r, len(mat)) out << mat[r] << endl; 190 | return out; 191 | } 192 | 193 | template vector subvector(const vector &vec, int i, int j = -1) { 194 | int N = len(vec); 195 | if(j < 0) j += N; 196 | if(j < i) j = i; 197 | 198 | // Probably some fancy STL way to do this. 199 | vector subvec(j-i); 200 | foridx(k, j-i) subvec[k] = vec[i+k]; 201 | return subvec; 202 | } 203 | 204 | template vector to_vector(T arr[], int n) { 205 | vector vec(n); 206 | foridx(i, n) vec[i] = arr[i]; 207 | return vec; 208 | } 209 | 210 | inline IntVec to_vector(int n, ...) 
{ 211 | va_list ap; 212 | IntVec vec; 213 | va_start(ap, n); 214 | foridx(i, n) vec.push_back(va_arg(ap, int)); 215 | va_end(ap); 216 | return vec; 217 | } 218 | 219 | inline DoubleVec to_fvector(int n, ...) { 220 | va_list ap; 221 | DoubleVec vec; 222 | va_start(ap, n); 223 | foridx(i, n) vec.push_back(va_arg(ap, double)); 224 | va_end(ap); 225 | return vec; 226 | } 227 | 228 | template inline void operator+=(vector &vec1, const vector &vec2) { 229 | foridx(i, len(vec1)) vec1[i] += vec2[i]; 230 | } 231 | 232 | #endif 233 | -------------------------------------------------------------------------------- /code/brown-cluster/basic/str-str-db.cc: -------------------------------------------------------------------------------- 1 | #include "str-str-db.h" 2 | #include "std.h" 3 | #include "str.h" 4 | #include "strdb.h" 5 | 6 | StrStrDB::~StrStrDB() { 7 | destroy_strings(s2t); 8 | } 9 | 10 | // File format: lines of \t\t<...junk...> 11 | void StrStrDB::read(const char *file) { 12 | track("StrStrDB::read()", file, true); 13 | char buf[1024]; 14 | ifstream in(file); 15 | assert2(in, file); 16 | 17 | // Read the s2t for each word. 18 | max_t_len = 0; 19 | while(in.getline(buf, sizeof(buf))) { 20 | char *t = strtok(buf, "\t"); 21 | char *s = strtok(NULL, "\t"); 22 | assert(s && t); 23 | 24 | assert2(!contains(s2t, s), s << " appears twice"); 25 | s2t[copy_str(s)] = copy_str(t); 26 | max_t_len = max(max_t_len, (int)strlen(t)); 27 | } 28 | logs("Read " << len(s2t) << " strings"); 29 | logs("Longest mapped string is " << max_t_len << " characters."); 30 | } 31 | 32 | const char *StrStrDB::operator[](const char *word) const { 33 | StrStrMap::const_iterator it = s2t.find(word); 34 | return it == s2t.end() ? 
"" : it->second; 35 | } 36 | -------------------------------------------------------------------------------- /code/brown-cluster/basic/str-str-db.h: -------------------------------------------------------------------------------- 1 | #ifndef __STR_STR_DB_H__ 2 | #define __STR_STR_DB_H__ 3 | 4 | #include "stl-basic.h" 5 | 6 | // Maps strings (s) to strings (t). 7 | class StrStrDB { 8 | public: 9 | ~StrStrDB(); 10 | 11 | void read(const char *file); 12 | const char *operator[](const char *s) const; 13 | 14 | int max_t_len; 15 | private: 16 | StrStrMap s2t; 17 | }; 18 | 19 | #endif 20 | -------------------------------------------------------------------------------- /code/brown-cluster/basic/str.cc: -------------------------------------------------------------------------------- 1 | #include "stl-basic.h" 2 | #include 3 | 4 | string substr(const string &s, int i, int j) { 5 | if(i < 0) i += len(s); 6 | if(j < 0) j += len(s); 7 | i = max(i, 0); 8 | j = max(j, i); 9 | return s.substr(i, j-i); 10 | } 11 | string substr(const string &s, int i) { 12 | return substr(s, i, len(s)); 13 | } 14 | 15 | string str_printf(const char *fmt, ...) { 16 | char buf[16384]; 17 | va_list ap; 18 | va_start(ap, fmt); 19 | vsnprintf(buf, sizeof(buf), fmt, ap); 20 | va_end(ap); 21 | return buf; 22 | } 23 | 24 | char *copy_str(const char *s) { 25 | char *t = new char[strlen(s)+1]; 26 | strcpy(t, s); 27 | return t; 28 | } 29 | 30 | string int2str(int x) { 31 | return str_printf("%d", x); 32 | } 33 | 34 | string double2str(double x) { 35 | ostringstream os; 36 | os << x; 37 | return os.str(); 38 | } 39 | 40 | StringVec split(const char *str, const char *delims, bool keep_empty) { 41 | StringVec vec; // Store the result. 42 | // Build quick lookup table. 
43 | BoolVec is_delim(256); 44 | for(const char *p = delims; *p; p++) is_delim[*p] = true; 45 | is_delim['\0'] = true; 46 | 47 | const char *end = str; 48 | while(true) { 49 | if(is_delim[*end]) { 50 | if(keep_empty || end-str > 0) // Extract token. 51 | vec.push_back(string(str, end-str)); 52 | str = end+1; 53 | } 54 | if(!*end) break; 55 | end++; 56 | } 57 | return vec; 58 | } 59 | 60 | StrVec mutate_split(char *str, const char *delims) { 61 | StrVec vec; 62 | for(char *p = strtok(str, delims); p; p = strtok(NULL, delims)) 63 | vec.push_back(p); 64 | return vec; 65 | } 66 | 67 | // Remove leading and trailing white space. 68 | char *trim(char *s) { 69 | // Remove leading spaces. 70 | while(*s && isspace(*s)) s++; 71 | 72 | // Remove trailing spaces (safe when s is empty or all white space). 73 | char *t; 74 | for(t = s+strlen(s); t != s && isspace((unsigned char)t[-1]); t--); 75 | *t = '\0'; 76 | return s; 77 | } 78 | 79 | string tolower(const char *s) { 80 | string t = s; 81 | foridx(i, len(t)) t[i] = tolower(t[i]); 82 | return t; 83 | } 84 | 85 | // String matching with brute force.
86 | int index_of(const char *s, const char *t) { 87 | int ns = strlen(s), nt = strlen(t); 88 | foridx(i, ns-nt+1) 89 | if(strncmp(s+i, t, nt) == 0) return i; 90 | return -1; 91 | } 92 | -------------------------------------------------------------------------------- /code/brown-cluster/basic/str.h: -------------------------------------------------------------------------------- 1 | #ifndef __STR_H__ 2 | #define __STR_H__ 3 | 4 | #include "stl-basic.h" 5 | 6 | string substr(const string &s, int i, int j); 7 | string substr(const string &s, int i); 8 | 9 | string str_printf(const char *fmt, ...); 10 | char *copy_str(const char *s); 11 | string int2str(int x); 12 | string double2str(double x); 13 | 14 | StringVec split(const char *str, const char *delims, bool keep_empty); 15 | StrVec mutate_split(char *str, const char *delims); 16 | 17 | char *trim(char *s); 18 | string tolower(const char *s); 19 | 20 | int index_of(const char *s, const char *t); 21 | 22 | #endif 23 | -------------------------------------------------------------------------------- /code/brown-cluster/basic/strdb.cc: -------------------------------------------------------------------------------- 1 | #include "strdb.h" 2 | #include "str.h" 3 | 4 | void destroy_strings(StrVec &vec) { 5 | foridx(i, len(vec)) 6 | delete [] vec[i]; 7 | } 8 | 9 | void destroy_strings(StrStrMap &map) { 10 | typedef const char *const_char_ptr; 11 | StrVec strs; 12 | formap(const_char_ptr, s, const_char_ptr, t, StrStrMap, map) { 13 | strs.push_back(s); 14 | strs.push_back(t); 15 | } 16 | destroy_strings(strs); 17 | } 18 | 19 | //////////////////////////////////////////////////////////// 20 | 21 | int StrDB::read(istream &in, int N, bool one_way) { 22 | char s[16384]; 23 | clear(); 24 | while(size() < N && in >> s) { 25 | if(one_way) i2s.push_back(copy_str(s)); 26 | else (*this)[s]; 27 | } 28 | logs(size() << " strings read"); 29 | return size(); 30 | } 31 | 32 | int StrDB::read(const char *file, bool one_way) { 33 | 
track("StrDB::read()", file << ", one_way=" << one_way, true); 34 | ifstream in(file); 35 | assert(in); 36 | return read(in, INT_MAX, one_way); 37 | } 38 | 39 | void StrDB::write(ostream &out) { 40 | foridx(i, size()) 41 | out << i2s[i] << endl; 42 | logs(size() << " strings written"); 43 | } 44 | 45 | void StrDB::write(const char *file) { 46 | track("StrDB::write()", file, true); 47 | ofstream out(file); 48 | write(out); 49 | } 50 | 51 | const char *StrDB::operator[](int i) const { 52 | assert(i >= 0 && i < len(i2s)); 53 | return i2s[i]; 54 | } 55 | 56 | int StrDB::lookup(const char *s, bool incorp_new, int default_i) { 57 | StrIntMap::const_iterator it = s2i.find(s); 58 | if(it != s2i.end()) return it->second; 59 | if(incorp_new) { 60 | char *t = copy_str(s); 61 | int i = s2i[t] = len(i2s); 62 | i2s.push_back(t); 63 | return i; 64 | } 65 | else 66 | return default_i; 67 | } 68 | 69 | IntVec StrDB::lookup(const StrVec &svec) { 70 | IntVec ivec(len(svec)); 71 | foridx(i, len(svec)) 72 | ivec[i] = lookup(svec[i], true, -1); 73 | return ivec; 74 | } 75 | 76 | int StrDB::operator[](const char *s) const { 77 | StrIntMap::const_iterator it = s2i.find(s); 78 | if(it != s2i.end()) return it->second; 79 | return -1; 80 | } 81 | 82 | int StrDB::operator[](const char *s) { 83 | return lookup(s, true, -1); 84 | } 85 | 86 | ostream &operator<<(ostream &out, const StrDB &db) { 87 | foridx(i, len(db)) out << db[i] << endl; 88 | return out; 89 | } 90 | 91 | //////////////////////////////////////////////////////////// 92 | 93 | int IntPairIntDB::lookup(const IntPair &p, bool incorp_new, int default_i) { 94 | IntPairIntMap::const_iterator it = p2i.find(p); 95 | if(it != p2i.end()) return it->second; 96 | 97 | if(incorp_new) { 98 | int i = p2i[p] = len(i2p); 99 | i2p.push_back(p); 100 | return i; 101 | } 102 | else 103 | return default_i; 104 | } 105 | 106 | int IntPairIntDB::read(istream &in, int N) { 107 | assert(size() == 0); 108 | int a, b; 109 | while(size() < N && in >> a >> 
b) 110 | (*this)[IntPair(a, b)]; 111 | return size(); 112 | } 113 | 114 | void IntPairIntDB::write(ostream &out) { 115 | forvec(_, const IntPair &, p, i2p) 116 | out << p.first << ' ' << p.second << endl; 117 | } 118 | 119 | //////////////////////////////////////////////////////////// 120 | 121 | int IntVecIntDB::lookup(const IntVec &v, bool incorp_new, int default_i) { 122 | IntVecIntMap::const_iterator it = v2i.find(v); 123 | if(it != v2i.end()) return it->second; 124 | 125 | if(incorp_new) { 126 | int i = v2i[v] = len(i2v); 127 | i2v.push_back(v); 128 | return i; 129 | } 130 | else 131 | return default_i; 132 | } 133 | 134 | //////////////////////////////////////////////////////////// 135 | 136 | // A text is basically a string of words. 137 | // Normally, we just read the strings from file, put them in db, 138 | // and call back func. 139 | // But if the db already exists and the strings have been converted 140 | // into integers (i.e., .{strdb,int} exist), then use those. 141 | // If incorp_new is false, then words not in db will just get passed -1. 142 | typedef void int_func(int a); 143 | void read_text(const char *file, int_func *func, StrDB &db, bool read_cached, bool write_cached, bool incorp_new) { 144 | track("read_text()", file, true); 145 | 146 | string strdb_file = string(file)+".strdb"; 147 | string int_file = string(file)+".int"; 148 | 149 | // Use the cached strdb and int files only if they exist and they are 150 | // newer than the text file. 151 | read_cached &= file_exists(strdb_file.c_str()) && 152 | file_exists(int_file.c_str()) && 153 | file_modified_time(strdb_file.c_str()) > file_modified_time(file) && 154 | file_modified_time(int_file.c_str()) > file_modified_time(file); 155 | 156 | if(read_cached) { 157 | // Read from strdb and int. 
158 | assert(db.size() == 0); // db must be empty because we're going to clobber it all 159 | db.read(strdb_file.c_str(), true); 160 | track_block("", "Reading from " << int_file, false) { 161 | ifstream in(int_file.c_str()); 162 | char buf[16384]; 163 | while(true) { 164 | in.read(buf, sizeof(buf)); 165 | if(in.gcount() == 0) break; 166 | assert(in.gcount() % sizeof(int) == 0); 167 | for(int buf_i = 0; buf_i < in.gcount(); buf_i += sizeof(int)) { 168 | int a = *((int *)(buf+buf_i)); 169 | assert(a >= 0 && a < db.size()); 170 | if(func) func(a); 171 | } 172 | } 173 | } 174 | } 175 | else { 176 | track_block("", "Reading from " << file, false) { 177 | // Write to strdb and int. 178 | ifstream in(file); 179 | ofstream out; 180 | 181 | if(write_cached) { 182 | out.open(int_file.c_str()); 183 | if(!out) write_cached = false; 184 | } 185 | if(write_cached) logs("Writing to " << int_file); 186 | 187 | char s[16384]; 188 | char buf[16384]; int buf_i = 0; // Output buffer 189 | while(in >> s) { // Read a string 190 | int a = db.lookup(s, incorp_new, -1); 191 | if(func) func(a); 192 | 193 | if(write_cached) { 194 | if(buf_i + sizeof(a) > sizeof(buf)) { // Flush buffer if full 195 | out.write(buf, buf_i); 196 | buf_i = 0; 197 | } 198 | *((int *)(buf+buf_i)) = a; 199 | buf_i += sizeof(a); 200 | } 201 | } 202 | if(write_cached) // Final flush 203 | out.write(buf, buf_i); 204 | } 205 | 206 | if(write_cached && create_file(strdb_file.c_str())) 207 | db.write(strdb_file.c_str()); 208 | } 209 | } 210 | -------------------------------------------------------------------------------- /code/brown-cluster/basic/strdb.h: -------------------------------------------------------------------------------- 1 | #ifndef __STRDB_H__ 2 | #define __STRDB_H__ 3 | 4 | #include "std.h" 5 | #include "stl-basic.h" 6 | #include "stl-utils.h" 7 | #include "logging.h" 8 | 9 | void destroy_strings(StrVec &vec); 10 | void destroy_strings(StrStrMap &map); 11 | 12 | // Map between strings and integers. 
13 | // Strings must not have spaces in them. 14 | // File format: strings, one per line. Assume strings are distinct. 15 | struct StrDB { 16 | StrDB() { } 17 | ~StrDB() { destroy_strings(); } 18 | 19 | int read(istream &in, int n, bool one_way); 20 | int read(const char *file, bool one_way); 21 | 22 | void write(ostream &out); 23 | void write(const char *file); 24 | 25 | int size() const { return len(i2s); } 26 | void clear() { destroy_strings(); i2s.clear(); s2i.clear(); } 27 | void destroy() { destroy_strings(); ::destroy(i2s); ::destroy(s2i); } 28 | void destroy_s2i() { ::destroy(s2i); } 29 | void clear_keep_strings() { i2s.clear(); s2i.clear(); } 30 | 31 | const char *operator[](int i) const; 32 | int operator[](const char *s) const; 33 | int operator[](const char *s); 34 | int lookup(const char *s, bool incorp_new, int default_i); 35 | 36 | IntVec lookup(const StrVec &svec); 37 | 38 | bool exists(const char *s) const { return s2i.find(s) != s2i.end(); } 39 | 40 | // /usr/bin/top might not show the memory reduced. 41 | void destroy_strings() { ::destroy_strings(i2s); } 42 | 43 | StrVec i2s; 44 | StrIntMap s2i; 45 | }; 46 | 47 | ostream &operator<<(ostream &out, const StrDB &db); 48 | 49 | //////////////////////////////////////////////////////////// 50 | 51 | // Map between IntPairs and ints. 52 | struct IntPairIntDB { 53 | IntPair operator[](int i) const { return i2p[i]; } 54 | int operator[](const IntPair &p) { return lookup(p, true, -1); } 55 | int lookup(const IntPair &p, bool incorp_new, int default_i); 56 | int size() const { return len(i2p); } 57 | 58 | int read(istream &in, int N); 59 | void write(ostream &out); 60 | 61 | IntPairIntMap p2i; 62 | IntPairVec i2p; 63 | }; 64 | 65 | //////////////////////////////////////////////////////////// 66 | 67 | // Map between IntVecs and ints. 
68 | struct IntVecIntDB { 69 | const IntVec &operator[](int i) const { return i2v[i]; } 70 | int operator[](const IntVec &v) { return lookup(v, true, -1); } 71 | int lookup(const IntVec &v, bool incorp_new, int default_i); 72 | int size() const { return len(i2v); } 73 | 74 | IntVecIntMap v2i; 75 | IntVecVec i2v; 76 | }; 77 | 78 | //////////////////////////////////////////////////////////// 79 | 80 | #if 0 81 | // Map between IntArrays and ints. Arrays terminate with -1. 82 | struct IntArrayIntDB { 83 | int *operator[](int i) const { return i2a[i]; } 84 | int operator[](const IntArray &a) { return lookup(a, true, -1); } 85 | int lookup(const IntArray &a, bool incorp_new, int default_i); 86 | int size() const { return len(i2a); } 87 | 88 | int read(istream &in, int N); 89 | void write(ostream &out); 90 | 91 | hash_map p2i; 92 | vector i2a; 93 | }; 94 | #endif 95 | 96 | //////////////////////////////////////////////////////////// 97 | 98 | typedef void int_func(int a); 99 | void read_text(const char *file, int_func *func, StrDB &db, bool read_cached, bool write_cached, bool incorp_new); 100 | 101 | #endif 102 | -------------------------------------------------------------------------------- /code/brown-cluster/basic/timer.cc: -------------------------------------------------------------------------------- 1 | #include "timer.h" 2 | 3 | ostream &operator<<(ostream &out, const Timer &timer) { 4 | int ms = timer.ms; 5 | int m = ms / 60000; ms %= 60000; 6 | int h = m / 60; m %= 60; 7 | if(h > 0) out << h << 'h'; 8 | if(h > 0 || m > 0) out << m << 'm'; 9 | out << ms/1000.0 << 's'; 10 | return out; 11 | } 12 | -------------------------------------------------------------------------------- /code/brown-cluster/basic/timer.h: -------------------------------------------------------------------------------- 1 | #ifndef __TIMER_H__ 2 | #define __TIMER_H__ 3 | 4 | #include 5 | #include 6 | #include 7 | #include 8 | 9 | using namespace std; 10 | 11 | struct Timer { 12 | Timer() { 
} 13 | Timer(int ms) : ms(ms) { } 14 | 15 | //void start() { clock_gettime(0, &start_time); } 16 | void start() { gettimeofday(&start_time, NULL); } 17 | Timer &stop() { 18 | //clock_gettime(0, &end_time); 19 | gettimeofday(&end_time, NULL); 20 | ms = Timer::to_ms(end_time) - Timer::to_ms(start_time); 21 | return *this; 22 | } 23 | //static int to_ms(const timespec &tv) { return tv.tv_sec*1000 + tv.tv_nsec/1000000; } 24 | static int to_ms(const timeval &tv) { return tv.tv_sec*1000 + tv.tv_usec/1000; } 25 | 26 | //timespec start_time; 27 | //timespec end_time; 28 | timeval start_time; 29 | timeval end_time; 30 | int ms; 31 | }; 32 | 33 | ostream &operator<<(ostream &out, const Timer &timer); 34 | 35 | #endif 36 | -------------------------------------------------------------------------------- /code/brown-cluster/basic/union-set.cc: -------------------------------------------------------------------------------- 1 | #include "union-set.h" 2 | 3 | void UnionSet::Init(int n) { 4 | parent.resize(n); 5 | for(int v = 0; v < n; v++) 6 | parent[v] = v; 7 | } 8 | 9 | // return whether u and v are in the same connected component; 10 | // connect them if they aren't 11 | bool UnionSet::Do(int u, int v, bool doit) { 12 | int ru = GetRoot(u); 13 | int rv = GetRoot(v); 14 | if(ru == rv) return true; 15 | if(doit) parent[ru] = rv; 16 | return false; 17 | } 18 | 19 | int UnionSet::GetRoot(int v) { 20 | int rv = v; 21 | while(parent[rv] != rv) 22 | rv = parent[rv]; 23 | while(v != rv) { 24 | int pv = parent[v]; 25 | parent[v] = rv; 26 | v = pv; 27 | } 28 | return rv; 29 | } 30 | -------------------------------------------------------------------------------- /code/brown-cluster/basic/union-set.h: -------------------------------------------------------------------------------- 1 | #ifndef __UNION_SET_H__ 2 | #define __UNION_SET_H__ 3 | 4 | #include <vector> 5 | 6 | using namespace std; 7 | 8 | struct UnionSet { 9 | UnionSet() { } 10 | UnionSet(int n) { Init(n); } 11 | void Init(int n); 12 | 
13 | bool Join(int u, int v) { return Do(u, v, true); } 14 | bool InSameSet(int u, int v) { return Do(u, v, false); } 15 | 16 | bool Do(int u, int v, bool doit); 17 | int GetRoot(int v); 18 | 19 | vector<int> parent; 20 | }; 21 | 22 | #endif 23 | -------------------------------------------------------------------------------- /code/brown-cluster/cluster-viewer/LICENSE: -------------------------------------------------------------------------------- 1 | The MIT License (MIT) 2 | 3 | Copyright (c) 2014 Chris Dyer and Brendan O'Connor 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 
22 | 23 | -------------------------------------------------------------------------------- /code/brown-cluster/cluster-viewer/README.md: -------------------------------------------------------------------------------- 1 | This code generates an HTML viewer for the clustering tree generated, similar to [this clustering of the words in a corpus of English Twitter data](http://www.ark.cs.cmu.edu/TweetNLP/cluster_viewer.html). 2 | 3 | ## Instructions 4 | 5 | The `wcluster` tool generates a directory with a file called `paths` that contains the bit string representations of the clustering tree, e.g. 6 | 7 | 000000 Westfalenpokalfinale 10 8 | 000000 Heimpunktspiel 10 9 | 000000 Jugendhallenturnier 10 10 | ... 11 | 12 | The script `cluster-viewer/build-viewer.sh` creates an HTML visualization of the contents of this file. You can run it as follows: 13 | 14 | ./cluster-viewer/build-viewer.sh corpus.out/paths 15 | 16 | This command creates a directory called `clusters/` containing the HTML viewer. Specify an alternative directory as follows: 17 | 18 | ./cluster-viewer/build-viewer.sh corpus.out/paths /some/other/output-dir 19 | 20 | ## Requirements 21 | 22 | * Python 2 must be in your path (the scripts use Python 2 `print` statements) 23 | 24 | ## Acknowledgements 25 | 26 | These scripts were originally written by [Brendan O'Connor](http://brenocon.com/) and extended by [Chris Dyer](http://www.cs.cmu.edu/~cdyer/). 27 | -------------------------------------------------------------------------------- /code/brown-cluster/cluster-viewer/build-viewer.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | set -e 3 | 4 | CODEDIR=`dirname $0`/code 5 | 6 | if [ "$#" -lt "1" ] || [ "$#" -gt "2" ] 7 | then 8 | echo "Usage: $0 path/to/clusters.out/paths [outdir]" 1>&2 9 | echo 1>&2 10 | echo "Builds an HTML cluster viewer." 
1>&2 11 | echo 1>&2 12 | exit 13 | fi 14 | MAPFILE=$1 15 | CATCMD=cat 16 | if [[ "$MAPFILE" == *.gz ]] 17 | then 18 | CATCMD='gunzip -c' 19 | fi 20 | OUTDIR=clusters 21 | if [ $# -eq 2 ] 22 | then 23 | OUTDIR=$2 24 | fi 25 | 26 | echo "Creating output in $OUTDIR ..." 1>&2 27 | mkdir -p $OUTDIR 28 | mkdir -p $OUTDIR/paths 29 | $CATCMD $MAPFILE | python $CODEDIR/make_html.py $CODEDIR $OUTDIR > $OUTDIR/htmlrows.html 30 | python $CODEDIR/final.py $CODEDIR $OUTDIR > $OUTDIR/cluster_viewer.html 31 | echo "Done. View clusters in $OUTDIR/cluster_viewer.html" 1>&2 32 | 33 | -------------------------------------------------------------------------------- /code/brown-cluster/cluster-viewer/code/final.py: -------------------------------------------------------------------------------- 1 | import sys 2 | template = open(sys.argv[1] + '/template.html').read() 3 | final = template 4 | final = final.replace('STYLE', open(sys.argv[1] + '/style.css').read()) 5 | htmlrows = open(sys.argv[2] + '/htmlrows.html').read() 6 | final = final.replace('TABLE', htmlrows) 7 | print final 8 | 9 | -------------------------------------------------------------------------------- /code/brown-cluster/cluster-viewer/code/htmlrows.html: -------------------------------------------------------------------------------- 1 | 2 | 3 | ^000000 (3) 4 | Westfalenpokalfinale Heimpunktspiel Jugendhallenturnier 5 | 6 | 7 | 8 | 9 | ^0000010 (3) 10 | Friesendorf Fallenstellen Strafjustizsystem 11 | 12 | 13 | 14 | 15 | ^00000110 (3) 16 | Gewerbeflächenkonzept Musikprotokoll Familienbetreuungszentrum 17 | 18 | 19 | -------------------------------------------------------------------------------- /code/brown-cluster/cluster-viewer/code/make_html.py: -------------------------------------------------------------------------------- 1 | import sys,itertools 2 | 3 | style = open(sys.argv[1] + '/style.css').read() 4 | 5 | def get_word_rows(): 6 | for line in sys.stdin: 7 | path, word, count = line.split('\t') 8 | count = 
int(count) 9 | yield path,word,count 10 | 11 | def get_cluster_rows(): 12 | for path, rows in itertools.groupby(get_word_rows(), key=lambda x: x[0]): 13 | wordcounts = [(w,c) for _,w,c in rows] 14 | wordcounts.sort(key=lambda (w,c): -c) 15 | 16 | yield path, len(wordcounts), wordcounts[:50], wordcounts 17 | 18 | def htmlescape(s): 19 | return s.replace('&','&amp;').replace('<','&lt;').replace('>','&gt;') 20 | 21 | def wc_table(wordcounts, tdword=''): 22 | r = [''] 23 | for i,(w,c) in enumerate(wordcounts): 24 | r.append('
{} {} {:,}'.format(i+1, tdword, htmlescape(w), c)) 25 | r.append('
') 26 | return '\n'.join(r) 27 | 28 | def top(wc, th): 29 | cutoff = int(wc[0][1] * th) 30 | res = [] 31 | for (w,c) in wc: 32 | if c > cutoff: res.append((w,c)) 33 | return res 34 | 35 | for path, nwords, wordcounts, allwc in get_cluster_rows(): 36 | # wc1 = ' '.join("{w}[{c}]".format( 37 | # w=htmlescape(w), c=c) for w,c in wordcounts) 38 | wc1 = ' '.join("{w}".format( 39 | w=htmlescape(w)) for w,c in top(wordcounts, 0.01)) 40 | 41 | print """ 42 | 43 | ^{path} ({nwords}) 44 | {wc} 45 | """.format(path=path, nwords=nwords, wc=wc1) 46 | print "" 47 | 48 | with open(sys.argv[2] + '/paths/{path}.html'.format(**locals()),'w') as f: 49 | print>>f,"""""".format(**locals()) 50 | print>>f,"""""" 51 | print>>f,"back to cluster viewer" 52 | print>>f,"

cluster path {path}

".format(path=path) 53 | 54 | print>>f, "{n:,} words, {t:,} tokens".format(n=nwords, t=sum(c for w,c in allwc)) 55 | print>>f, "freq alpha suffix" 56 | 57 | print>>f,"

Words in frequency order

" 58 | allwc.sort(key=lambda (w,c): (-c,w)) 59 | print>>f, wc_table(allwc) 60 | # wc1 = ' '.join("{w} ({c})".format( 61 | # w=htmlescape(w), c=c) for w,c in allwc) 62 | # print>>f, wc1 63 | 64 | print>>f, "

Words in alphabetical order

" 65 | allwc.sort(key=lambda (w,c): (w,-c)) 66 | print>>f, wc_table(allwc) 67 | 68 | print>>f, "

Words in suffix order

" 69 | allwc.sort(key=lambda (w,c): (list(reversed(w)),-c)) 70 | print>>f, wc_table(allwc, tdword='suffixsort') 71 | # wc1 = ' '.join("{w} ({c})".format( 72 | # w=htmlescape(w), c=c) for w,c in allwc) 73 | # print>>f, wc1 74 | 75 | 76 | -------------------------------------------------------------------------------- /code/brown-cluster/cluster-viewer/code/style.css: -------------------------------------------------------------------------------- 1 | table { border-collapse:collapse; border-spacing:0; } 2 | body { font-family: times; font-size: 11pt; } 3 | td { border: 1px solid gray; padding:2px 8px; } 4 | th { border: 1px solid gray; padding:2px 8px; } 5 | .count { font-size:9pt; color: solid gray; } 6 | .c { font-size:7pt; color: solid gray; } 7 | .tdcount { text-align:right } 8 | .info { font-size: 12pt; } 9 | .suffixsort { text-align: right } 10 | -------------------------------------------------------------------------------- /code/brown-cluster/cluster-viewer/code/template.html: -------------------------------------------------------------------------------- 1 | 2 | 3 | 6 | 7 |

Word cluster viewer

8 | 9 |
10 | Word cluster viewer. 11 |
12 | 13 |

14 | 15 | 16 | 19 | TABLE 20 |
Cluster path (and word type count) 17 | Words (most frequent) 18 |
21 | 22 | 23 | -------------------------------------------------------------------------------- /code/brown-cluster/generateBClusterInput.py: -------------------------------------------------------------------------------- 1 | __author__ = 'ZeqiuWu' 2 | import json 3 | import sys 4 | from collections import defaultdict 5 | import unicodedata 6 | 7 | 8 | trainFile = sys.argv[1] 9 | testFile = sys.argv[2] 10 | outFile = sys.argv[3] 11 | file = open(trainFile, 'r') 12 | f = open(outFile, 'w') 13 | writtenSents = set() 14 | for line in file.readlines(): 15 | sent = json.loads(line) 16 | sentText = unicodedata.normalize('NFKD', sent['sentText']).encode('ascii','ignore').rstrip('\n').rstrip('\r') 17 | if sentText in writtenSents: 18 | continue 19 | f.write(sentText) 20 | f.write('\n') 21 | writtenSents.add(sentText) 22 | file.close() 23 | file = open(testFile, 'r') 24 | for line in file.readlines(): 25 | sent = json.loads(line) 26 | sentText = unicodedata.normalize('NFKD', sent['sentText']).encode('ascii','ignore').rstrip('\n').rstrip('\r') 27 | if sentText in writtenSents: 28 | continue 29 | f.write(sentText) 30 | f.write('\n') 31 | writtenSents.add(sentText) 32 | file.close() 33 | f.close() 34 | -------------------------------------------------------------------------------- /code/brown-cluster/input.txt: -------------------------------------------------------------------------------- 1 | the cat chased the mouse 2 | the dog chased the cat 3 | the mouse chased the dog 4 | -------------------------------------------------------------------------------- /code/brown-cluster/wcluster: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/INK-USC/StructMineDataPipeline/240c18bf325e71c1b954c7b6b9781fbe390f40b2/code/brown-cluster/wcluster -------------------------------------------------------------------------------- /code/brown-cluster/wcluster.cc: 
-------------------------------------------------------------------------------- 1 | /* 2 | Hierarchically clusters phrases. 3 | Running time: O(N*C^2). 4 | 5 | We want to cluster the phrases so that the pairwise mutual information between 6 | clusters is maximized. This mutual information is a sum over terms between 7 | each pair of clusters: q2[a, b] for clusters a and b. The trick is to compute 8 | quickly the loss of mutual information when two clusters a and b are merged. 9 | 10 | The four structures p1, p2, q2, L2 allow this quick computation. 11 | p1[a] = probability of cluster a. 12 | p2[a, b] = probability of cluster a followed by cluster b. 13 | q2[a, b] = contribution to the mutual information from clusters a and b (computed from p2[a, b]). 14 | L2[a, b] = the loss of mutual information if clusters a and b were merged. 15 | 16 | Changes: 17 | * Removed hash tables for efficiency. 18 | * Notation: a is a phrase (sequence of words), c is a cluster, s is a slot. 19 | * Removed hash tables for efficiency. 20 | * Notation: a is a phrase (sequence of words), c is a cluster, s is a slot. 21 | 22 | To cut down memory usage: 23 | * Change double to float. 24 | Ideas: 25 | * Hashing vectors is really slow. 26 | * Find intuition behind algorithm based on simple cases 27 | * Test clustering algorithm on artificially generated data. Generate a text 28 | with a class-based ngram model. 
29 | */ 30 | 31 | #include "basic/std.h" 32 | #include "basic/stl-basic.h" 33 | #include "basic/stl-utils.h" 34 | #include "basic/str.h" 35 | #include "basic/strdb.h" 36 | #include "basic/union-set.h" 37 | #include "basic/mem-tracker.h" 38 | #include "basic/opt.h" 39 | #include 40 | #include 41 | #include 42 | #include 43 | 44 | vector< OptInfo > bool_opts; 45 | vector< OptInfo > int_opts; 46 | vector< OptInfo > double_opts; 47 | vector< OptInfo > string_opts; 48 | 49 | opt_define_string(output_dir, "output_dir", "", "Output everything to this directory."); 50 | opt_define_string(text_file, "text", "", "Text file with corpora (input)."); 51 | opt_define_string(restrict_file, "restrict", "", "Only consider words that appear in this text (input)."); 52 | opt_define_string(paths_file, "paths", "", "File containing root-to-node paths in the clustering tree (input/output)."); 53 | opt_define_string(map_file, "map", "", "File containing lots of good information about each phrase, more general than paths (output)"); 54 | opt_define_string(collocs_file, "collocs", "", "Collocations with most mutual information (output)."); 55 | opt_define_string(featvec_file, "featvec", "", "Feature vectors (output)."); 56 | opt_define_string(comment, "comment", "", "Description of this run."); 57 | 58 | opt_define_int(ncollocs, "ncollocs", 500, "Collocations with most mutual information (output)."); 59 | opt_define_int(initC, "c", 1000, "Number of clusters."); 60 | opt_define_int(plen, "plen", 1, "Maximum length of a phrase to consider."); 61 | opt_define_int(min_occur, "min-occur", 1, "Keep phrases that occur at least this many times."); 62 | opt_define_int(rand_seed, "rand", time(NULL)*getpid(), "Number to call srand with."); 63 | opt_define_int(num_threads, "threads", 1, "Number of threads to use in the worker pool."); 64 | 65 | opt_define_bool(chk, "chk", false, "Check data structures are valid (expensive)."); 66 | opt_define_bool(print_stats, "stats", false, "Just print out stats."); 
67 | opt_define_bool(paths2map, "paths2map", false, "Take the paths file and generate a map file."); 68 | 69 | #define use_restrict (!restrict_file.empty()) 70 | const char *delim_str = "$#$"; 71 | 72 | typedef IntPair _; 73 | 74 | StrDB db; // word database 75 | IntVec phrase_freqs; // phrase a < N -> number of times a appears in the text 76 | IntVecVec left_phrases; // phrase a < N -> list of phrases that appear to left of a in the text 77 | IntVecVec right_phrases; // phrase a < N -> list of phrases that appear to right of a in the text 78 | IntIntPairMap cluster_tree; // cluster c -> the 2 sub-clusters that merged to create c 79 | int delim_word; 80 | 81 | IntVec freq_order_phrases; // List of phrases in decreasing order of frequency. 82 | 83 | // Allows for very quick (inverse Ackermann) lookup of clusters and merging 84 | // of clusters. Each phrase points to an arbitrary representative phrase of 85 | // the cluster. 86 | UnionSet phrase2rep; // phrase a -> the rep phrase in the same cluster as a 87 | IntIntMap rep2cluster; // rep phrase a -> the cluster that contains a 88 | IntIntMap cluster2rep; // cluster a -> the rep phrase in cluster a 89 | 90 | // Store all the phrases efficiently. Just for printing out. 91 | // For each phrase length, we store a flattened list of words. 92 | IntVecVec phrases; // length of phrase -> flattened list of words 93 | 94 | // Each cluster will occupy a slot. There will always be two extra slots 95 | // as intermediate scratch space. 96 | IntVec slot2cluster; // slot index -> cluster (-1 if none exists) 97 | IntIntMap cluster2slot; // cluster -> slot index 98 | int free_slot1, free_slot2; // two free slots 99 | int nslots; 100 | 101 | // Partial results that allow quick computation and update of mutual information. 102 | // Mutual information is the sum of all the q2 terms. 103 | // Update p1, p2, q2 for 0..N-1, but L2 only for 0..initC-1. 
104 | DoubleVec p1; // slot s (containing cluster a) -> probability Pr(a) 105 | DoubleVecVec p2; // slots s, t (containing clusters a, b) -> probability Pr(a, b) 106 | DoubleVecVec q2; // slots s, t (containing clusters a, b) -> contribution to mutual information 107 | DoubleVecVec L2; // slots s, t (containing clusters a, b) -> loss of mutual information if merge a and b 108 | 109 | int curr_cluster_id; // ID to assign to a new cluster 110 | int stage2_cluster_offset; // start of the IDs of clusters created in stage 2 111 | 112 | double curr_minfo; // Mutual info, should be sum of all q2's 113 | 114 | // Map phrase to the KL divergence to its cluster 115 | DoubleVec kl_map[2]; 116 | 117 | // Variables used to control the thread pool 118 | mutex * thread_idle; 119 | mutex * thread_start; 120 | thread * threads; 121 | struct Compute_L2_Job { 122 | int s; 123 | int t; 124 | int u; 125 | bool is_type_a; 126 | }; 127 | Compute_L2_Job the_job; 128 | bool all_done = false; 129 | 130 | #define FOR_SLOT(s) \ 131 | for(int s = 0; s < len(slot2cluster); s++) \ 132 | for(bool _tmp = true; slot2cluster[s] != -1 && _tmp; _tmp = false) 133 | 134 | // We store only L2[s, t] for which the cluster ID in slot s is smaller 135 | // than the one in slot t. 136 | #define ORDER_VALID(s, t) (slot2cluster[s] < slot2cluster[t]) 137 | 138 | #define num_phrases(l) (len(phrases[l])/(l)) 139 | 140 | int N; // number of phrases 141 | int T; // length of text 142 | 143 | // Output a phrase. 144 | struct Phrase { Phrase(int a) : a(a) { } int a; }; 145 | ostream &operator<<(ostream &out, const Phrase &phrase) { 146 | // Decode the phrase ID into the length and the offset in phrases. 147 | int a = phrase.a; 148 | int l; for(l = 1; a >= num_phrases(l); a -= num_phrases(l), l++); 149 | 150 | foridx(i, l) { 151 | if(i > 0) out << ' '; 152 | out << db[phrases[l][a*l+i]]; 153 | } 154 | return out; 155 | } 156 | 157 | // For pretty-printing of clusters. 
158 | struct Cluster { Cluster(int c) : c(c) { } int c; }; 159 | ostream &operator<<(ostream &out, const Cluster &cluster) { 160 | int c = cluster.c; 161 | out << c; 162 | 163 | int a; 164 | bool more; 165 | if(c < N) 166 | a = c, more = false; 167 | else { 168 | assert(contains(cluster2rep, c)); 169 | a = cluster2rep[c], more = true; 170 | } 171 | 172 | out << '(' << Phrase(a); 173 | if(more) out << "|..."; 174 | out << ')'; 175 | return out; 176 | } 177 | 178 | #define Slot(s) Cluster(slot2cluster[s]) 179 | 180 | //////////////////////////////////////////////////////////// 181 | 182 | // p2[s, t] + p2[t, s]. 183 | inline double bi_p2(int s, int t) { 184 | if(s == t) return p2[s][s]; 185 | return p2[s][t] + p2[t][s]; 186 | } 187 | 188 | // q2[s, t] + q2[t, s]. 189 | inline double bi_q2(int s, int t) { 190 | if(s == t) return q2[s][s]; 191 | return q2[s][t] + q2[t][s]; 192 | } 193 | 194 | // Hypothetical p1[st] = p1[s] + p1[t]. 195 | inline double hyp_p1(int s, int t) { 196 | return p1[s] + p1[t]; 197 | } 198 | 199 | //// hyp_p2 200 | 201 | // Hypothetical p2[st, u] = p2[s, u] + p2[t, u]. 202 | inline double hyp_p2(const IntPair &st, int u) { 203 | return p2[st.first][u] + p2[st.second][u]; 204 | } 205 | 206 | // Hypothetical p2[u, st] = p2[u, s] + p2[u, t]. 207 | inline double hyp_p2(int u, const IntPair &st) { 208 | return p2[u][st.first] + p2[u][st.second]; 209 | } 210 | 211 | inline double bi_hyp_p2(const IntPair &st, int u) { 212 | return hyp_p2(st, u) + hyp_p2(u, st); 213 | } 214 | 215 | // Hypothetical p2[st, st] = p2[s, s] + p2[s, t] + p2[t, s] + p2[t, t]. 216 | inline double hyp_p2(const IntPair &st) { 217 | return p2[st.first][st.first] + p2[st.first][st.second] + 218 | p2[st.second][st.first] + p2[st.second][st.second]; 219 | } 220 | 221 | //// hyp_q2 222 | 223 | inline double p2q(double pst, double ps, double pt) { 224 | if(feq(pst, 0.0)) return 0.0; 225 | return pst * log2(pst / (ps*pt)); 226 | } 227 | 228 | // Hypothetical q2[st, u]. 
229 | inline double hyp_q2(const IntPair &st, int u) { 230 | return p2q(hyp_p2(st, u), hyp_p1(st.first, st.second), p1[u]); 231 | } 232 | 233 | // Hypothetical q2[u, st]. 234 | inline double hyp_q2(int u, const IntPair &st) { 235 | return p2q(hyp_p2(u, st), hyp_p1(st.first, st.second), p1[u]); 236 | } 237 | 238 | inline double bi_hyp_q2(const IntPair &st, int u) { 239 | return hyp_q2(st, u) + hyp_q2(u, st); 240 | } 241 | 242 | // Hypothetical q2[st, st]. 243 | inline double hyp_q2(const IntPair &st) { 244 | double p = hyp_p2(_(st.first, st.second)); // p2[st,st] 245 | double P = hyp_p1(st.first, st.second); 246 | return p2q(p, P, P); 247 | } 248 | 249 | //////////////////////////////////////////////////////////// 250 | 251 | // Return slot. 252 | void put_cluster_in_slot(int a, int s) { 253 | cluster2slot[a] = s; 254 | slot2cluster[s] = a; 255 | } 256 | inline int put_cluster_in_free_slot(int a) { 257 | int s = -1; 258 | 259 | // Find available slot. 260 | if(free_slot1 != -1) s = free_slot1, free_slot1 = -1; 261 | else if(free_slot2 != -1) s = free_slot2, free_slot2 = -1; 262 | assert(s != -1); 263 | 264 | put_cluster_in_slot(a, s); 265 | return s; 266 | } 267 | 268 | inline void free_up_slots(int s, int t) { 269 | free_slot1 = s; 270 | free_slot2 = t; 271 | cluster2slot.erase(slot2cluster[s]); 272 | cluster2slot.erase(slot2cluster[t]); 273 | slot2cluster[s] = slot2cluster[t] = -1; 274 | } 275 | 276 | void init_slot(int s) { 277 | // Clear any entries that relates to s. 278 | // The p1 and L2 will be filled in densely, so they 279 | // will be overwritten anyway. 
280 | FOR_SLOT(t) 281 | p2[s][t] = q2[s][t] = p2[t][s] = q2[t][s] = 0; 282 | } 283 | 284 | void add_to_set(const IntVec &phrases, IntIntMap &phrase_counts, int offset) { 285 | forvec(_, int, a, phrases) 286 | phrase_counts[a+offset]++; 287 | } 288 | 289 | bool is_good_phrase(const IntVec &phrase) { 290 | if(len(phrase) == 1) return phrase[0] != delim_word && phrase[0] != -1; // Can't be delimiter or an invalid word 291 | 292 | // HACK HACK HACK - pick out some phrases 293 | // Can't be too many delim words. 294 | int di = index_of(phrase, delim_word, 1); 295 | if(di > 0 && di < len(phrase)-1) return false; // Delimiter must occur at the ends 296 | if(phrase[0] == delim_word && phrase[len(phrase)-1] == delim_word) return false; // Only one delimiter allowed 297 | 298 | // If every word is capitalized with the exception of some function 299 | // words which must go in the middle 300 | forvec(i, int, a, phrase) { 301 | bool at_end = i == 0 || i == len(phrase)-1; 302 | const string &word = db[a]; 303 | bool is_upper = isupper(word[0]); 304 | 305 | if(at_end && !is_upper) return false; // Ends must be uppercase 306 | if(is_upper) continue; // Ok 307 | if(word[0] == '\'' || word == "of" || word == "and") continue; // Ok 308 | return false; 309 | } 310 | return true; 311 | } 312 | 313 | void read_restrict_text() { 314 | // Read the words from the text file that restricts what words we will cluster 315 | if(restrict_file.empty()) return; 316 | track("read_restrict_text()", restrict_file, false); 317 | read_text(restrict_file.c_str(), NULL, db, false, false, true); 318 | } 319 | 320 | IntVecIntMap vec2phrase; 321 | IntVec text; 322 | void read_text_process_word(int w) { 323 | text.push_back(w); 324 | } 325 | void read_text() { 326 | track("read_text()", "", false); 327 | 328 | read_text(text_file.c_str(), read_text_process_word, db, !use_restrict, !use_restrict, !use_restrict); 329 | T = len(text); 330 | delim_word = db.lookup(delim_str, false, -1); 331 | if(!paths2map) 
db.destroy_s2i(); // Conserve memory. 332 | 333 | // Count the phrases that we care about so we can map them all to integers. 334 | track_block("Counting phrases", "", false) { 335 | phrases.resize(plen+1); 336 | for(int l = 1; l <= plen; l++) { 337 | // Count. 338 | IntVecIntMap freqs; // phrase vector -> number of occurrences 339 | for(int i = 0; i < T-l+1; i++) { 340 | IntVec a_vec = subvector(text, i, i+l); 341 | if(!is_good_phrase(a_vec)) continue; 342 | freqs[a_vec]++; 343 | } 344 | 345 | forcmap(const IntVec &, a_vec, int, count, IntVecIntMap, freqs) { 346 | if(count < min_occur) continue; 347 | 348 | int a = len(phrase_freqs); 349 | phrase_freqs.push_back(count); 350 | vec2phrase[a_vec] = a; 351 | forvec(_, int, w, a_vec) phrases[l].push_back(w); 352 | } 353 | 354 | logs(len(freqs) << " distinct phrases of length " << l << ", keeping " << num_phrases(l) << " which occur at least " << min_occur << " times"); 355 | } 356 | } 357 | 358 | N = len(phrase_freqs); // number of phrases 359 | 360 | track_block("Finding left/right phrases", "", false) { 361 | left_phrases.resize(N); 362 | right_phrases.resize(N); 363 | for(int l = 1; l <= plen; l++) { 364 | for(int i = 0; i < T-l+1; i++) { 365 | IntVec a_vec = subvector(text, i, i+l); 366 | if(!contains(vec2phrase, a_vec)) continue; 367 | int a = vec2phrase[a_vec]; 368 | 369 | // Left 370 | for(int ll = 1; ll <= plen && i-ll >= 0; ll++) { 371 | IntVec aa_vec = subvector(text, i-ll, i); 372 | if(!contains(vec2phrase, aa_vec)) continue; 373 | int aa = vec2phrase[aa_vec]; 374 | left_phrases[a].push_back(aa); 375 | //logs(i << ' ' << Cluster(a) << " L"); 376 | } 377 | 378 | // Right 379 | for(int ll = 1; ll <= plen && i+l+ll <= T; ll++) { 380 | IntVec aa_vec = subvector(text, i+l, i+l+ll); 381 | if(!contains(vec2phrase, aa_vec)) continue; 382 | int aa = vec2phrase[aa_vec]; 383 | right_phrases[a].push_back(aa); 384 | //logs(i << ' ' << Cluster(a) << " R"); 385 | } 386 | } 387 | } 388 | } 389 | 390 | #if 1 391 | 
if(!featvec_file.empty()) { 392 | ofstream out(featvec_file.c_str()); 393 | out << N << ' ' << 2*N << endl; 394 | foridx(a, N) { 395 | IntIntMap phrase_counts; 396 | add_to_set(left_phrases[a], phrase_counts, 0); 397 | add_to_set(right_phrases[a], phrase_counts, N); 398 | out << Phrase(a) << ' ' << len(phrase_counts); 399 | forcmap(int, b, int, count, IntIntMap, phrase_counts) 400 | out << '\t' << b << ' ' << count; 401 | out << endl; 402 | } 403 | } 404 | #endif 405 | 406 | #if 0 407 | foridx(a, N) { 408 | track("", Cluster(a), true); 409 | forvec(_, int, b, left_phrases[a]) 410 | logs("LEFT " << Cluster(b)); 411 | forvec(_, int, b, right_phrases[a]) 412 | logs("RIGHT " << Cluster(b)); 413 | } 414 | #endif 415 | 416 | destroy(text); 417 | initC = min(initC, N); 418 | 419 | logs("Text length: " << T << ", " << N << " phrases, " << len(db) << " words"); 420 | } 421 | 422 | // O(C) time. 423 | double compute_s1(int s) { // compute s1[s] 424 | double q = 0.0; 425 | 426 | for(int t = 0; t < len(slot2cluster); t++) { 427 | if (slot2cluster[t] == -1) continue; 428 | q += bi_q2(s, t); 429 | } 430 | 431 | return q; 432 | } 433 | 434 | // O(C) time. 
435 | double compute_L2(int s, int t) { // compute L2[s, t] 436 | assert(ORDER_VALID(s, t)); 437 | // st is the hypothetical new cluster that combines s and t 438 | 439 | // Lose old associations with s and t 440 | double l = 0.0; 441 | for (int w = 0; w < len(slot2cluster); w++) { 442 | if ( slot2cluster[w] == -1) continue; 443 | l += q2[s][w] + q2[w][s]; 444 | l += q2[t][w] + q2[w][t]; 445 | } 446 | l -= q2[s][s] + q2[t][t]; 447 | l -= bi_q2(s, t); 448 | 449 | // Form new associations with st 450 | FOR_SLOT(u) { 451 | if(u == s || u == t) continue; 452 | l -= bi_hyp_q2(_(s, t), u); 453 | } 454 | l -= hyp_q2(_(s, t)); // q2[st, st] 455 | return l; 456 | } 457 | 458 | void repcheck() { 459 | if(!chk) return; 460 | double sum; 461 | 462 | assert_eq(len(rep2cluster), len(cluster2rep)); 463 | assert_eq(len(rep2cluster), len(cluster2slot)); 464 | 465 | assert(free_slot1 == -1 || slot2cluster[free_slot1] == -1); 466 | assert(free_slot2 == -1 || slot2cluster[free_slot2] == -1); 467 | FOR_SLOT(s) { 468 | assert(contains(cluster2slot, slot2cluster[s])); 469 | assert(cluster2slot[slot2cluster[s]] == s); 470 | } 471 | 472 | sum = 0.0; 473 | FOR_SLOT(s) FOR_SLOT(t) { 474 | double q = q2[s][t]; 475 | //logs(s << ' ' << t << ' ' << p2[s][t] << ' ' << p1[s] << ' ' << p1[t]); 476 | assert_feq(q, p2q(p2[s][t], p1[s], p1[t])); 477 | sum += q; 478 | } 479 | assert_feq(sum, curr_minfo); 480 | 481 | FOR_SLOT(s) FOR_SLOT(t) { 482 | if(!ORDER_VALID(s, t)) continue; 483 | double l = L2[s][t]; 484 | assert(l + TOL >= 0); 485 | assert_feq(l, compute_L2(s, t)); 486 | } 487 | } 488 | 489 | void dump() { 490 | track("dump()", "", true); 491 | FOR_SLOT(s) logs("p1[" << Slot(s) << "] = " << p1[s]); 492 | FOR_SLOT(s) FOR_SLOT(t) logs("p2[" << Slot(s) << ", " << Slot(t) << "] = " << p2[s][t]); 493 | FOR_SLOT(s) FOR_SLOT(t) logs("q2[" << Slot(s) << ", " << Slot(t) << "] = " << q2[s][t]); 494 | FOR_SLOT(s) FOR_SLOT(t) logs("L2[" << Slot(s) << ", " << Slot(t) << "] = " << L2[s][t]); 495 | 
logs("curr_minfo = " << curr_minfo); 496 | } 497 | 498 | 499 | // u is the new cluster that has just been formed from s and t 500 | // Want to compute L2[v, w] 501 | // O(1) time. 502 | double compute_L2_using_old(int s, int t, int u, int v, int w) { 503 | assert(ORDER_VALID(v, w)); 504 | assert(v != u && w != u); 505 | 506 | double l = L2[v][w]; 507 | 508 | // Remove old associations between v and w with s and t 509 | l -= bi_q2(v, s) + bi_q2(w, s) + bi_q2(v, t) + bi_q2(w, t); 510 | l += bi_hyp_q2(_(v, w), s) + bi_hyp_q2(_(v, w), t); 511 | 512 | // Add new associations between v and w with u (the merged s and t) 513 | l += bi_q2(v, u) + bi_q2(w, u); 514 | l -= bi_hyp_q2(_(v, w), u); 515 | 516 | return l; 517 | } 518 | 519 | // return q2 520 | double set_p2_q2_from_count(int s, int t, int count) { 521 | double pst = (double)count / (T-1); // p2[s,t] 522 | double ps = p1[s]; 523 | double pt = p1[t]; 524 | double qst = p2q(pst, ps, pt); // q2[s,t] 525 | p2[s][t] = pst; 526 | q2[s][t] = qst; 527 | return qst; 528 | } 529 | 530 | // O(N lg N) time. 531 | // Sort the phrases by decreasing frequency and then set the initC most frequent 532 | // phrases to be in the initial cluster. 533 | bool phrase_freq_greater(int a, int b) { 534 | return phrase_freqs[a] > phrase_freqs[b]; 535 | } 536 | void create_initial_clusters() { 537 | track("create_initial_clusters()", "", true); 538 | 539 | freq_order_phrases.resize(N); 540 | foridx(a, N) freq_order_phrases[a] = a; 541 | 542 | logs("Sorting " << N << " phrases by frequency"); 543 | sort(freq_order_phrases.begin(), freq_order_phrases.end(), phrase_freq_greater); 544 | 545 | // Initialize slots 546 | logs("Selecting top " << initC << " phrases to be initial clusters"); 547 | nslots = initC+2; 548 | slot2cluster.resize(nslots); 549 | free_up_slots(initC, initC+1); 550 | 551 | // Create the initial clusters.
552 | phrase2rep.Init(N); // Init union-set: each phrase starts out in its own cluster 553 | curr_minfo = 0.0; 554 | foridx(s, initC) { 555 | int a = freq_order_phrases[s]; 556 | put_cluster_in_slot(a, s); 557 | 558 | rep2cluster[a] = a; 559 | cluster2rep[a] = a; 560 | } 561 | 562 | // Allocate memory 563 | p1.resize(nslots); 564 | matrix_resize(p2, nslots, nslots); 565 | matrix_resize(q2, nslots, nslots); 566 | matrix_resize(L2, nslots, nslots); 567 | 568 | FOR_SLOT(s) init_slot(s); 569 | 570 | // Compute p1 571 | FOR_SLOT(s) { 572 | int a = slot2cluster[s]; 573 | p1[s] = (double)phrase_freqs[a] / T; 574 | } 575 | 576 | // Compute p2, q2, curr_minfo 577 | FOR_SLOT(s) { 578 | int a = slot2cluster[s]; 579 | IntIntMap right_phrase_freqs; 580 | 581 | // Find collocations of (a, b), where both are clusters. 582 | forvec(_, int, b, right_phrases[a]) 583 | if(contains(cluster2slot, b)) 584 | right_phrase_freqs[b]++; 585 | 586 | forcmap(int, b, int, count, IntIntMap, right_phrase_freqs) { 587 | int t = cluster2slot[b]; 588 | curr_minfo += set_p2_q2_from_count(s, t, count); 589 | } 590 | } 591 | } 592 | 593 | // Output the ncollocs bigrams that have the highest mutual information. 594 | void output_best_collocations() { 595 | if(collocs_file.empty()) return; 596 | logs("Writing to " << collocs_file); 597 | 598 | vector< pair<double, IntPair> > collocs; 599 | FOR_SLOT(s) FOR_SLOT(t) { 600 | collocs.push_back(pair<double, IntPair>(q2[s][t], _(slot2cluster[s], slot2cluster[t]))); 601 | } 602 | ncollocs = min(ncollocs, len(collocs)); 603 | partial_sort(collocs.begin(), collocs.begin()+ncollocs, collocs.end(), greater< pair<double, IntPair> >()); 604 | 605 | ofstream out(collocs_file.c_str()); 606 | assert(out); 607 | for(int i = 0; i < ncollocs; i++) { 608 | const IntPair &ab = collocs[i].second; 609 | out << collocs[i].first << '\t' << Phrase(ab.first) << '\t' << Phrase(ab.second) << endl; 610 | } 611 | } 612 | 613 | // O(C^3) time.
614 | void compute_L2() { 615 | track("compute_L2()", "", true); 616 | 617 | track_block("Computing L2", "", false) 618 | FOR_SLOT(s) { 619 | track_block("L2", "L2[" << Slot(s) << ", *]", false) 620 | FOR_SLOT(t) { 621 | if(!ORDER_VALID(s, t)) continue; 622 | double l = L2[s][t] = compute_L2(s, t); 623 | logs("L2[" << Slot(s) << "," << Slot(t) << "] = " << l << ", resulting minfo = " << curr_minfo-l); 624 | } 625 | } 626 | } 627 | 628 | // Add new phrase as a cluster. 629 | // Compute its L2 between a and all existing clusters. 630 | // O(C^2) time, O(T) time over all calls. 631 | void incorporate_new_phrase(int a) { 632 | track("incorporate_new_phrase()", Cluster(a), false); 633 | 634 | int s = put_cluster_in_free_slot(a); 635 | init_slot(s); 636 | cluster2rep[a] = a; 637 | rep2cluster[a] = a; 638 | 639 | // Compute p1 640 | p1[s] = (double)phrase_freqs[a] / T; 641 | 642 | // Overall all calls: O(T) 643 | // Compute p2, q2 between a and everything in clusters 644 | IntIntMap freqs; 645 | freqs.clear(); // right bigrams 646 | forvec(_, int, b, right_phrases[a]) { 647 | b = phrase2rep.GetRoot(b); 648 | if(!contains(rep2cluster, b)) continue; 649 | b = rep2cluster[b]; 650 | if(!contains(cluster2slot, b)) continue; 651 | freqs[b]++; 652 | } 653 | forcmap(int, b, int, count, IntIntMap, freqs) { 654 | curr_minfo += set_p2_q2_from_count(cluster2slot[a], cluster2slot[b], count); 655 | logs(Cluster(a) << ' ' << Cluster(b) << ' ' << count << ' ' << set_p2_q2_from_count(cluster2slot[a], cluster2slot[b], count)); 656 | } 657 | 658 | freqs.clear(); // left bigrams 659 | forvec(_, int, b, left_phrases[a]) { 660 | b = phrase2rep.GetRoot(b); 661 | if(!contains(rep2cluster, b)) continue; 662 | b = rep2cluster[b]; 663 | if(!contains(cluster2slot, b)) continue; 664 | freqs[b]++; 665 | } 666 | forcmap(int, b, int, count, IntIntMap, freqs) { 667 | curr_minfo += set_p2_q2_from_count(cluster2slot[b], cluster2slot[a], count); 668 | logs(Cluster(b) << ' ' << Cluster(a) << ' ' << count << 
' ' << set_p2_q2_from_count(cluster2slot[b], cluster2slot[a], count)); 669 | } 670 | 671 | curr_minfo -= q2[s][s]; // q2[s, s] was double-counted 672 | 673 | // Update L2: O(C^2) 674 | track_block("Update L2", "", false) { 675 | 676 | the_job.s = s; 677 | the_job.is_type_a = true; 678 | // start the jobs 679 | for (int ii=0; ii number of times a-b appears 893 | IntIntMap count1; // cluster a -> number of times a appears 894 | 895 | // Compute cluster distributions 896 | foridx(a, N) { 897 | int ca = phrase2cluster(a); 898 | forvec(_, int, b, right_phrases[a]) { 899 | int cb = phrase2cluster(b); 900 | count2[IntPair(ca, cb)]++; 901 | count1[ca]++; 902 | count1[cb]++; 903 | } 904 | } 905 | 906 | // For each word (phrase), compute its distribution 907 | kl_map[0].resize(N); 908 | kl_map[1].resize(N); 909 | foridx(a, N) { 910 | int ca = phrase2cluster(a); 911 | IntIntMap a_count2; 912 | int a_count1 = 0; 913 | real kl; 914 | 915 | // Left distribution 916 | a_count2.clear(), a_count1 = 0; 917 | forvec(_, int, b, left_phrases[a]) { 918 | int cb = phrase2cluster(b); 919 | a_count2[cb]++; 920 | a_count1++; 921 | } 922 | kl = kl_map[0][a] = kl_divergence(a_count2, a_count1, count2, count1, ca, false); 923 | //logs("Left-KL(" << Phrase(a) << " | " << Cluster(ca) << ") = " << kl); 924 | 925 | // Right distribution 926 | a_count2.clear(), a_count1 = 0; 927 | forvec(_, int, b, right_phrases[a]) { 928 | int cb = phrase2cluster(b); 929 | a_count2[cb]++; 930 | a_count1++; 931 | } 932 | kl = kl_map[1][a] = kl_divergence(a_count2, a_count1, count2, count1, ca, true); 933 | //logs("Right-KL(" << Phrase(a) << " | " << Cluster(ca) << ") = " << kl); 934 | } 935 | } 936 | 937 | int word2phrase(int a) { 938 | IntVecIntMap::const_iterator it = vec2phrase.find(to_vector(1, a)); 939 | return it == vec2phrase.end() ? 
-1 : it->second; 940 | } 941 | 942 | // Read in from paths_file and fill in phrase2rep, rep2cluster 943 | void convert_paths_to_map() { 944 | track("convert_paths_to_map()", "", false); 945 | assert(!paths_file.empty() && !map_file.empty()); 946 | 947 | // Read clusters 948 | ifstream in(paths_file.c_str()); 949 | char buf[1024]; 950 | typedef unordered_map<string, StringVec> SSVMap; 951 | SSVMap map; 952 | while(in.getline(buf, sizeof(buf))) { 953 | char *path = strtok(buf, "\t"); 954 | char *word = strtok(NULL, "\t"); 955 | assert(word && path); 956 | map[path].push_back(word); 957 | } 958 | 959 | // Create the initial clusters. 960 | phrase2rep.Init(N); // Init union-set: each phrase starts out in its own cluster 961 | foridx(a, N) { 962 | rep2cluster[a] = a; 963 | cluster2rep[a] = a; 964 | } 965 | 966 | // Merge clusters 967 | curr_cluster_id = N; // New cluster ids will start at N, after all the phrases. 968 | forcmap(const string &, path, const StringVec &, words, SSVMap, map) { 969 | int a = -1; 970 | forvec(i, const string &, word, words) { 971 | int b = word2phrase(db.lookup(word.c_str(), false, -1)); 972 | if(b == -1) continue; 973 | if(a != -1) { 974 | // Record merge in the cluster tree 975 | int c = curr_cluster_id++; 976 | cluster_tree[c] = _(a, b); 977 | 978 | // Update relationship between clusters and rep phrases 979 | int A = cluster2rep[a]; 980 | int B = cluster2rep[b]; 981 | phrase2rep.Join(A, B); 982 | int C = phrase2rep.GetRoot(A); // New rep phrase of cluster c (merged a and b) 983 | 984 | cluster2rep.erase(a); 985 | cluster2rep.erase(b); 986 | rep2cluster.erase(A); 987 | rep2cluster.erase(B); 988 | cluster2rep[c] = C; 989 | rep2cluster[C] = c; 990 | a = c; 991 | } 992 | else 993 | a = b; 994 | } 995 | } 996 | 997 | compute_cluster_distribs(); 998 | 999 | // Write the map file 1000 | ofstream out(map_file.c_str()); 1001 | forcmap(const string &, path, const StringVec &, words, SSVMap, map) { 1002 | forvec(_, const string &, word, words) { 1003 | int a = 
word2phrase(db.lookup(word.c_str(), false, -1)); 1004 | if(a == -1) continue; 1005 | 1006 | /*cout << a << ' ' << N << endl; 1007 | cout << Phrase(a) << endl; 1008 | cout << kl_map[0][a] << endl; 1009 | cout << kl_map[1][a] << endl; 1010 | cout << phrase_freqs[a] << endl;*/ 1011 | 1012 | out << Phrase(a) << '\t' 1013 | << path << "-L " << kl_map[0][a] << '\t' 1014 | << path << "-R " << kl_map[1][a] << '\t' 1015 | << path << "-freq " << phrase_freqs[a] << endl; 1016 | } 1017 | } 1018 | } 1019 | 1020 | void do_clustering() { 1021 | track("do_clustering()", "", true); 1022 | 1023 | compute_L2(); 1024 | repcheck(); 1025 | 1026 | // start the threads 1027 | thread_start = new mutex[num_threads]; 1028 | thread_idle = new mutex[num_threads]; 1029 | threads = new thread[num_threads]; 1030 | for (int ii=0; iifirst, 0, '\0')); 1114 | 1115 | while(!stack.empty()) { 1116 | // Take off a stack item (a node in the tree). 1117 | StackItem item = stack.back(); 1118 | int a = item.a; 1119 | int path_i = item.path_i; 1120 | if(item.ch) 1121 | path[path_i-1] = item.ch; 1122 | stack.pop_back(); 1123 | 1124 | // Look at the node's children (if any). 1125 | IntIntPairMap::const_iterator it = cluster_tree.find(a); 1126 | if(it == cluster_tree.end()) { 1127 | path[path_i] = '\0'; 1128 | if(out_paths) paths_out << path << '\t' << Phrase(a) << '\t' << phrase_freqs[a] << endl; 1129 | if(out_map) map_out << Phrase(a) << '\t' 1130 | << path << "-L " << kl_map[0][a] << '\t' 1131 | << path << "-R " << kl_map[1][a] << '\t' 1132 | << path << "-freq " << phrase_freqs[a] << endl; 1133 | } 1134 | else { 1135 | const IntPair &children = it->second; 1136 | // Only print out paths through the part of the tree constructed in stage 2. 1137 | bool extend = a >= stage2_cluster_offset; 1138 | int new_path_i = path_i + extend; 1139 | 1140 | stack.push_back(StackItem(children.second, new_path_i, extend ? '1' : '\0')); 1141 | stack.push_back(StackItem(children.first, new_path_i, extend ? 
'0' : '\0')); 1142 | } 1143 | } 1144 | } 1145 | 1146 | int main(int argc, char *argv[]) { 1147 | init_opt(argc, argv); 1148 | 1149 | assert(file_exists(text_file.c_str())); 1150 | 1151 | // Set output_dir from arguments. 1152 | if(output_dir.empty()) { 1153 | output_dir = file_base(strip_dir(text_file)); 1154 | output_dir += str_printf("-c%d", initC); 1155 | output_dir += str_printf("-p%d", plen); 1156 | if(!restrict_file.empty()) output_dir += str_printf("-R%s", file_base(strip_dir(restrict_file)).c_str()); 1157 | output_dir += ".out"; 1158 | } 1159 | 1160 | if(system(("mkdir -p " + output_dir).c_str()) != 0) 1161 | assert2(false, "Can't create " << output_dir); 1162 | if(system(("rm -f " + output_dir + "/*").c_str()) != 0) 1163 | assert2(false, "Can't remove things in " << output_dir); 1164 | 1165 | // Set arguments from the output_dir. 1166 | if(!output_dir.empty()) { 1167 | if(paths_file.empty()) paths_file = output_dir+"/paths"; 1168 | if(map_file.empty()) map_file = output_dir+"/map"; 1169 | if(collocs_file.empty()) collocs_file = output_dir+"/collocs"; 1170 | if(log_info.log_file.empty()) log_info.log_file = output_dir+"/log"; 1171 | } 1172 | 1173 | init_log; 1174 | 1175 | track_mem(db); 1176 | track_mem(phrase_freqs); 1177 | track_mem(left_phrases); 1178 | track_mem(right_phrases); 1179 | track_mem(cluster_tree); 1180 | track_mem(freq_order_phrases); 1181 | track_mem(phrase2rep); 1182 | track_mem(rep2cluster); 1183 | track_mem(cluster2rep); 1184 | track_mem(phrases); 1185 | track_mem(slot2cluster); 1186 | track_mem(cluster2slot); 1187 | track_mem(p1); 1188 | track_mem(p2); 1189 | track_mem(q2); 1190 | track_mem(L2); 1191 | 1192 | read_restrict_text(); 1193 | read_text(); 1194 | if(featvec_file.empty()) { 1195 | if(paths2map) 1196 | convert_paths_to_map(); 1197 | else if(!print_stats) { 1198 | create_initial_clusters(); 1199 | output_best_collocations(); 1200 | do_clustering(); 1201 | output_cluster_paths(); 1202 | } 1203 | } 1204 | 1205 | return 0; 1206 | } 
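The `paths` file written by `output_cluster_paths()` above has one line per phrase: the cluster bit-string path, the phrase, and its frequency, separated by tabs. A minimal Python sketch of reading that file back, assuming the three-column layout holds for every line. `parse_brown_paths` and `path_prefixes` are hypothetical helper names, not part of this repository, and the prefix lengths are illustrative only:

```python
def parse_brown_paths(lines):
    """Map each phrase to (bit-string path, frequency).

    `lines` is any iterable of strings, e.g. an open `paths` file.
    Assumes each non-empty line is: path <TAB> phrase <TAB> frequency.
    """
    clusters = {}
    for line in lines:
        line = line.rstrip('\r\n')
        if not line:
            continue  # skip blank lines
        path, phrase, freq = line.split('\t')
        clusters[phrase] = (path, int(freq))
    return clusters


def path_prefixes(path, lengths=(4, 8, 12)):
    """Return the prefixes of a cluster path that are at most len(path) long.

    The default lengths are an assumption made for illustration.
    """
    return [path[:n] for n in lengths if len(path) >= n]
```

Phrases whose paths share a prefix sit under the same subtree of the merge tree, so shorter prefixes give coarser clusters.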
1207 | -------------------------------------------------------------------------------- /code/distantSupervision.py: -------------------------------------------------------------------------------- 1 | import json 2 | 3 | def loadTargetTypes(filename): 4 | map = {} 5 | with open(filename, 'r') as fin: 6 | for line in fin: 7 | seg = line.strip('\r\n').split('\t') 8 | fbType = seg[0] 9 | cleanType = seg[1] 10 | map[fbType] = cleanType 11 | return map 12 | 13 | def linkToFB(jsonFname, outFname, mentionTypeRequired, entityTypesFname, relationTypesFname, freebase_dir): 14 | mid2typeFname = freebase_dir+'/freebase-mid-type.map' 15 | mid2nameFname = freebase_dir+'/freebase-mid-name.map' 16 | relationTupleFname = freebase_dir+'/freebase-facts.txt' 17 | 18 | mid2types = {} 19 | name2mids = {} 20 | mids2relation = {} 21 | targetEMTypes = loadTargetTypes(entityTypesFname)#{'':'PERSON', '':'ORGANIZATION', '':'LOCATION'} 22 | 23 | with open(mid2typeFname, 'r') as mid2typeFile, open(mid2nameFname, 'r') as mid2nameFile, open(relationTupleFname, 'r') as relationTupleFile: 24 | for line in mid2typeFile: 25 | seg = line.strip('\r\n').split('\t') 26 | mid = seg[0] 27 | type = seg[1].split('/')[-1][:-1] 28 | if type in targetEMTypes: 29 | if mid in mid2types: 30 | mid2types[mid].add(targetEMTypes[type]) 31 | else: 32 | mid2types[mid] = set([targetEMTypes[type]]) 33 | print('finish loading mid2typeFile') 34 | 35 | if mentionTypeRequired != 'em': 36 | targetRMTypes = loadTargetTypes(relationTypesFname) 37 | for line in relationTupleFile: 38 | seg = line.strip('\r\n').split('\t') 39 | mid1 = seg[0] 40 | type = seg[1].split('/')[-1][:-1] 41 | mid2 = seg[2] 42 | if type in targetRMTypes and mid1 in mid2types and mid2 in mid2types: 43 | key = (mid1, mid2) 44 | if key in mids2relation: 45 | mids2relation[key].add(targetRMTypes[type]) 46 | else: 47 | mids2relation[key] = set([targetRMTypes[type]]) 48 | print('finish loading relationTupleFile') 49 | 50 | for line in mid2nameFile: 51 | seg = 
line.strip('\r\n').split('\t') 52 | mid = seg[0] 53 | name = seg[1].lower() 54 | if mid in mid2types and name.endswith('@en'): 55 | name = name[1:].replace('"@en', '') 56 | if name in name2mids: 57 | name2mids[name].add(mid) 58 | else: 59 | name2mids[name] = set([mid]) 60 | print('finish loading mid2nameFile') 61 | 62 | with open(jsonFname, 'r') as fin, open(outFname, 'w') as fout: 63 | linkableCt = 0 64 | for line in fin: 65 | sentDic = json.loads(line.strip('\r\n')) 66 | entityMentions = [] 67 | em2mids = {} 68 | for em in sentDic['entityMentions']: 69 | emText = em['text'].lower() 70 | types = set() 71 | if emText in name2mids: 72 | linkableCt += 1 73 | mids = name2mids[emText] 74 | em2mids[(int(em['start']), em['text'])] = set(mids) 75 | for mid in mids: 76 | types.update(set(mid2types[mid])) 77 | em['label'] = ','.join(types) 78 | if len(types) > 0: 79 | entityMentions.append(em) 80 | sentDic['entityMentions'] = entityMentions 81 | 82 | if mentionTypeRequired != 'em': 83 | sentDic['relationMentions'] = [] 84 | for (eid1, e1text) in em2mids: 85 | for (eid2, e2text) in em2mids: 86 | if eid2 != eid1: 87 | rmDic = dict() 88 | rmDic['em1Text'] = e1text 89 | rmDic['em2Text'] = e2text 90 | labels = set() 91 | for mid1 in em2mids[(eid1, e1text)]: 92 | for mid2 in em2mids[(eid2, e2text)]: 93 | if (mid1, mid2) in mids2relation: 94 | labels.update(set(mids2relation[(mid1, mid2)])) 95 | if len(labels) > 0: 96 | rmDic['label'] = ','.join(labels) 97 | sentDic['relationMentions'].append(rmDic) 98 | 99 | if mentionTypeRequired == 'rm': 100 | del sentDic['entityMentions'] 101 | 102 | fout.write(json.dumps(sentDic) + '\n') 103 | 104 | 105 | def getNegRMs(jsonFname, outputFname): 106 | with open(jsonFname, 'r') as fin, open(outputFname, 'w') as fout: 107 | for line in fin: 108 | sentDic = json.loads(line.strip('\r\n')) 109 | rms = set() 110 | ems = set() 111 | newRms = [] 112 | relationMentions = [] 113 | for em in sentDic['entityMentions']: 114 | ems.add(em['text']) 115 | for 
rm in sentDic['relationMentions']: 116 | relationMentions.append(rm) 117 | rms.add(frozenset([rm['em1Text'], rm['em2Text']])) 118 | for em1 in ems: 119 | for em2 in ems: 120 | if em1 != em2: 121 | if frozenset([em1, em2]) not in rms: 122 | newRm = dict() 123 | newRm['em1Text'] = em1 124 | newRm['em2Text'] = em2 125 | newRm['label'] = 'None' 126 | newRms.append(newRm) 127 | rms.add(frozenset([em1, em2])) 128 | #break 129 | 130 | for rm in newRms: 131 | relationMentions.append(rm) 132 | if len(relationMentions) > 0: 133 | sentDic['relationMentions'] = relationMentions 134 | fout.write(json.dumps(sentDic)+'\n') 135 | -------------------------------------------------------------------------------- /code/generateJson.py: -------------------------------------------------------------------------------- 1 | import sys 2 | import nltk 3 | from stanza.nlp.corenlp import CoreNLPClient 4 | import json 5 | from distantSupervision import linkToFB, getNegRMs 6 | 7 | INTERESTED_STANFORD_EM_TYPES = ['PERSON','LOCATION','ORGANIZATION'] 8 | 9 | class NLPParser(object): 10 | """ 11 | NLP parse, including Part-Of-Speech tagging. 
12 | Attributes 13 | ========== 14 | parser: StanfordCoreNLP 15 | the Stanford Core NLP parser 16 | """ 17 | def __init__(self): 18 | self.parser = CoreNLPClient(default_annotators=['ssplit', 'tokenize', 'ner']) 19 | 20 | def parse(self, sent): 21 | result = self.parser.annotate(sent) 22 | tokens_list, ner_list = [], [] 23 | for sent in result.sentences: 24 | tokens, ner = [], [] 25 | currNERType = 'O' 26 | currNER = '' 27 | for token in sent: 28 | token_ner = token.ner 29 | if token_ner not in INTERESTED_STANFORD_EM_TYPES: 30 | token_ner = 'O' 31 | tokens += [token.word] 32 | if token_ner == 'O': 33 | if currNER != '': 34 | ner.append(currNER.strip()) 35 | currNER = '' 36 | elif token_ner == currNERType: 37 | currNER += token.word + ' ' 38 | else: 39 | if currNER != '': 40 | ner.append(currNER.strip()) 41 | currNERType = token_ner 42 | currNER = token.word + ' ' 43 | if currNER != '': 44 | ner.append(currNER.strip()) 45 | if len(tokens) == 0 or len(ner) == 0: 46 | continue 47 | tokens_list.append(tokens) 48 | ner_list.append(ner) 49 | return tokens_list, ner_list 50 | 51 | 52 | def extract_np(data): 53 | nps = [] 54 | for d in data: 55 | np = "" 56 | for tup in d: 57 | if len(np) == 0: 58 | np = tup[0] 59 | else: 60 | np += " "+tup[0] 61 | 62 | nps.append(np) 63 | 64 | return nps 65 | 66 | def leaves(tree): 67 | #Finds NP (noun phrase) leaf nodes of a chunk tree. 68 | nps = [] 69 | for subtree in tree.subtrees(filter=lambda t: t.label() == 'NP'): 70 | nps.append(subtree.leaves()) 71 | 72 | return extract_np(nps) 73 | 74 | 75 | def writeToJson(inFile, outFile, parseTool, isTrain, mentionType): 76 | if parseTool == 'stanford': 77 | useNLTK = False 78 | elif parseTool == 'nltk': 79 | useNLTK = True 80 | else: 81 | raise Exception('parse tool has to be \'stanford\' or \'nltk\'') 82 | 83 | grammar = r""" 84 | NBAR: 85 | {<NN.*|JJ>*<NN.*>} # Nouns and Adjectives, terminated with Nouns 86 | NP: 87 | {<NBAR>} 88 | {<NBAR><IN><NBAR>} # Above, connected with in/of/etc... 
89 | """ 90 | cp = nltk.RegexpParser(grammar) #chunk parser 91 | with open(inFile, 'r') as fin, open(outFile, 'w') as fout: 92 | articleId = 0 93 | for line in fin: 94 | doc = line.strip('\r\n') 95 | if useNLTK: 96 | sents = nltk.sent_tokenize(doc) 97 | tokens_list = [] 98 | nps_list = [] 99 | for sent in sents: 100 | tokens = nltk.word_tokenize(sent) 101 | if len(tokens) == 0: 102 | continue 103 | nps = leaves(cp.parse(nltk.pos_tag(tokens))) 104 | if len(nps) == 0: 105 | continue 106 | tokens_list.append(tokens) 107 | nps_list.append(nps) 108 | else: 109 | parser = NLPParser() 110 | tokens_list, nps_list = parser.parse(doc) 111 | 112 | sentId = 0 113 | for i in range(len(tokens_list)): 114 | tokens = tokens_list[i] 115 | nps = nps_list[i] 116 | 117 | sentDic = dict() 118 | sentDic['sentId'] = sentId 119 | entityMentions = [] 120 | start = 0 121 | for np in nps: 122 | entityMention = dict() 123 | entityMention['text'] = np 124 | entityMention['label'] = 'None' 125 | entityMention['start'] = start 126 | entityMentions.append(entityMention) 127 | start += 1 128 | if isTrain or mentionType != 'rm': 129 | sentDic['entityMentions'] = entityMentions 130 | if not isTrain and mentionType != 'em': 131 | sentDic['relationMentions'] = [] 132 | for em1 in entityMentions: 133 | for em2 in entityMentions: 134 | if em1 is not em2: 135 | rmDic = dict() 136 | rmDic['em1Text'] = em1['text'] 137 | rmDic['em2Text'] = em2['text'] 138 | rmDic['label'] = 'None' 139 | sentDic['relationMentions'].append(rmDic) 140 | 141 | sentDic['sentText'] = ' '.join(tokens) 142 | sentDic['articleId'] = articleId 143 | fout.write(json.dumps(sentDic) + '\n') 144 | sentId += 1 145 | 146 | articleId += 1 147 | 148 | 149 | inFile = sys.argv[1] 150 | outFile = sys.argv[2] 151 | parseTool = sys.argv[3] 152 | if int(sys.argv[4]) == 1: 153 | isTrain = True 154 | else: 155 | isTrain = False 156 | mentionType = sys.argv[5] 157 | 158 | if isTrain: 159 | entityTypesFname = sys.argv[6] 160 | relationTypesFname = 
sys.argv[7] 161 | freebase_dir = sys.argv[8] 162 | print('start generating candidate entity mentions') 163 | writeToJson(inFile, './tmp1.json', parseTool, isTrain, mentionType) 164 | print('start linking to freebase') 165 | linkToFB('./tmp1.json', './tmp2.json', mentionType, entityTypesFname, relationTypesFname, freebase_dir) 166 | print('start generating negative examples') 167 | getNegRMs('./tmp2.json', outFile) 168 | else: 169 | writeToJson(inFile, outFile, parseTool, isTrain, mentionType) 170 | 171 | -------------------------------------------------------------------------------- /data/documents.txt: -------------------------------------------------------------------------------- 1 | Its success of hosting the second Middle East and North Africa economic summit in Amman , which drew the participation of 1,600 government officials and private businessmen from 62 countries across the world , added great credit to the kingdom and was considered the crowning accomplishment of Jordan 's foreign policy . AMMAN , January 1 ( Xinhua ) -- Jordan , in the past year , witnessed a series of remarkable achievements on its diplomatic front . In 1995 , Jordan regained sovereignty over its territories occupied by Israel and exchanged ambassadors with Tel Aviv . CAIRO , January 1 ( Xinhua ) -- Violence seems on the rise in Egypt as a latest sign that militant attacks may be intensified in the most populous Arab country . 2 | DAR ES SALAAM , January 1 ( Xinhua ) -- Tanzania has accomplished a peaceful transition to a multy-party system this year after Benjamin Mkapa was sworn in as President of the United Republic of Tanzania in late November . 3 | JERUSALEM , January 1 ( Xinhua ) -- King Hussein of Jordan will arrive in Tel Aviv next Wednesday to pay his first official visit to Israel since the two countries signed a peace treaty last year . 
4 | Since the signing of the accord in August , a pile of reports , both jubilant and disturbing , have been emanating from Monrovia , the capital of Liberia : 5 | However , a disturbing report from Monrovia on December 20 rang a warning bell that Liberia 's peace prospects were still fragile following the sacking of two ministers by the four-month-old Council of State in mid-December . LAGOS , January 1 ( Xinhua ) -- In 1995 , the strength of the once biggest warring faction , the National Patriotic Front of Liberia ( NPFL ) led by Charles Taylor , was seriously weakened following numerous internal strifes and a series of military defeats . 6 | It was also joined by three civilian members : Oscar Quiah , representative of the citizens of Monrovia ; Tamba Taylor , one of Liberia 's traditional leaders who was a `` paramount chief '' ; and Wilton Sankawolo , a professor from the University of Liberia who was appointed chairman of the council . 7 | The January talks in Accra , capital of Ghana , ended in vain . Fruits were indeed born out of the May meeting in Abuja , capital of Nigeria , but the pick of a 90-year-old civilian to chair the interim cabinet once again got vetoed by the warlord Charles Taylor . 8 | NAIROBI , January 1 ( Xinhua ) -- A rift is developing between liberia 's Ministry of Information and the Press Union of Liberia ( pul ) over who should accredit journalists in the country , the Pan African News Agency ( PANA ) reported from Monrovia today . 9 | CAIRO , January 1 ( Xinhua ) -- U.N. Secretary General Boutros Boutros-Ghali , currently on a visit to Egypt , will hold talks with Arab League ( AL ) Secretary General Esmat Abdul-Meguid here Tuesday mainly on the Yemeni-Eritrean dispute over the Red Sea island of Greater Hanish . 10 | BRUSSELS -- Italy , the rotating presidency of the European Union ( EU ) for the first half of 1996 , will `` try to serve the general purpose of the EU and bring European citizens closer to the EU . 
'' 11 | TEHRAN -- Iran claimed here Sunday that it will lodge a protest with the United Nations and the International Court of Justice in The Hague if Washington does not reject the congressional act against the Islamic republic . 12 | HANOI -- Vietnamese nationals as well as foreigners now can bring into and take out of Vietnam as much as 7,000 U.S. dollars without declaring to the customs . 13 | JOHANNESBURG , January 2 ( Xinhua ) -- An increase in the police budget was a main factor in curbing cirme this year , South Africa 's National Police Commissioner George Fivaz said today . 14 | Speaking to reporters in Pretoria , Fivaz said South Africa would risk becoming a `` gangster state where hijackers , drug-lords , muggers and other criminals would trample hard-won democratic rights into the dust unless the South African Police Service ( SAPS ) was given every means necessary to fight crime . '' 15 | LUSAKA , January 2 ( Xinhua ) -- Presidents of Zambia 's eight opposition parties will meet here tomorrow to work out a strategy of dealing with the method of adopting the draft constitution , among other issues . 16 | AMMAN , January 2 ( Xinhua ) -- King Hussein of Jordan met today with visiting Israeli Foreign Minister Ehud Barak and discussed bilateral relations and the Middle East peace process . 17 | LONDON , January 2 ( Xinhua ) -- Thousands of homes throughout the United Kingdom were flooded on Monday as the Christmas freeze gave way to a dramatic thaw , bursting water mains and domestic water systems . 18 | AMMAN , January 2 ( Xinhua ) -- Jordanian Prime Minister Sharif Zeid Ben Shaker stressed here today that Jordan is committed to realizing a just , comprehensive and lasting peace on all tracks of the Middle East peace process . 19 | He also said that candidates in Jerusalem are asked to contact the DEO in east Jerusalem 's Abu Dis , where scores of Palestinian policemen have already been deployed . 
20 | The project manager in the state , Lawrence Esho , told newsmen today in Akure , capital of Ondo State , that 160 million naira ( about 1.88 million U.S. dollars ) was spent on machineries , such as compressors , vehicles and drilling rigs . 21 | Lebanon upholds a coordinated stand with Syria in the peace negotiations with Israel , and refuses to sign any peace accord with the Jewish state before Syria does so . 22 | LUSAKA , January 3 ( Xinhua ) -- The Cooperative Bank of Zambia which closed in November last year will re-open to the public within the shortest possible time , it is announced here today . 23 | TEHRAN , January 3 ( Xinhua ) -- Iran has executed eight people , including a woman , on charges of armed robbery , Kidnaping and drug trafficking , the local evening newspaper Kayhan report today . 24 | The newspaper said that these criminals were executed in Iran 's southern province of Kerman and other 16 people were sentenced six to 11 years imprisonment on the same charges . 25 | LUSAKA , January 3 ( Xinhua ) -- Zambia 's ruling party chief today urged the youths in markets and residential areas to conduct door-to- door campaign to get more people registered as voters before the exercise ends . 26 | TEHRAN , January 3 ( Xinhua ) -- Iran announced here today that it has arrested 273 Afghan nationals for illegally entering the country . 27 | Iran strives to have relations with Europe , Russia , China , Japan , Southeast Asian , South Asian and Moslem countries and real Non-Aligned countries , he added . 28 | DAMASCUS , January 3 ( Xinhua ) -- No breakthrough is expected to be secured during the second round of Maryland talks between Syria and Israel , according to diplomatic sources here . 
29 | The sources added that despite the positive atmosphere pervading the Maryland talks between Syria and Israel , there is still a wide gap between the attitudes of the two countries regarding a number of questions , especially the question of the Israeli withdrawal from the Golan Heights captured in the 1967 Middle East war . 30 | Israel insists on setting up ground-based early warning stations in the Golan Heights after an Israeli pull-out . 31 | GENEVA , January 3 ( Xinhua ) -- With a per-capita-GDP of 37,180 U.S. dollars , Switzerland kept its position as the world 's second richest country in 1994 , just after Luxemburg , the latest list of rankings issued by the World Bank showed . 32 | ADDIS ABABA , January 3 ( Xinhua ) -- Ethiopian Foreign Minister Seyoum Mesfin left here today for Asmara and Sanaa in a renewed effort to help mediate the dispute on the Hanish islands between Eritrea and Yemen . 33 | Eritrea set free more than 200 Yemeni prisoners , handing them over to the International Society of Red Cross which flew them out of Asmara and returned them to Sanaa last month . 34 | LAGOS , January 4 ( Xinhua ) -- Liberia 's Council of State , the collective presidency , has called for a ceasefire in the northwestern town of Tubmanburg , where militiamen have been fighting the west African peacekeeping force known as ECOMOG . 35 | Charles Taylor , a member of the presidency , made the appeal in Liberia 's capital Monrovia Wednesday when he spoke to newsmen with the presence of two other members of the presidency , Alhaji Kromah and George Boley . 36 | KHARTOUM , January 4 ( Xinhua ) -- Abu Bakr Yunis , head of a Libyan delegation currently visiting Sudan , said here Wednesday that his country is mediating between Sudan and Uganda in a bid to remove the misunderstanding between the two countries . 
37 | Yunis , one of the Libyan leaders , affirmed while meeting with Sudanese President Omar al-Bashir in Khartoum that Libya is making efforts to help Sudan and Uganda settle their differences , local newspapers reported today . 38 | The trapped ECOMOG soldiers were on a mission to maintain the ceasefire under the peace accord signed by all the warring factions in Abuja , capital of Nigeria , on August 20 , 1995 . 39 | JERUSALEM , January 4 ( Xinhua ) -- Israel is to release some 1,200 Palestinian prisoners next week in accordance with the Israel-PLO Oslo 2 accord on expanding Palestinian autonomy . 40 | AMMAN , January 4 ( Xinhua ) -- Jordan 's Lower House of Parliament endorsed here today a draft budget bill for fiscal 1996 , according to `` Radio Jordan . '' 41 | This was disclosed by Kgosinkwe Moesi , the SADC information officer in Gaborone , capital of Botswana , in his interview with the Zimbabwean National News Agency . 42 | This year 's SADC annual summit will be held in August in Maseru , capital of Lesotho . 43 | Damascus , January 5 ( Xinhua ) -- The resuming of Syrian-Israeli peace talks in Maryland , the United States , in late December has witnessed a new phase of Syrian-Israeli track of the peace process after a six-month stalemate . 44 | These meetings revealed big gap between the stances of Syria and Israel , prominent among which is the establishment of early warning systems after proposed Israeli withdrawal from the Golan Heights , a strategic plateau Israel captured from Syria in the 1967 Middle East war . 45 | Syria had even expressed its indignation at Israel 's demand to maintain early warning stations at the Syrian side after Israeli pullout from the Golan Heights . 46 | He had on several occasions indicated hope that complete peace with syria could be attained in the no distant future , saying that peace with Syria and Lebanon would mean peace with 18 to 20 Arab states . 
47 | DAR ES SALAAM , January 5 ( Xinhua ) -- President Benjamin Mkapa said here today the Tanzanian government pursues the one-China policy , commending China for its support to Tanzania 's economic development . 48 | However , the field commander of the west African peace-keeping force , known as ECOMOG , Maj.-gen. John Inienger , said on Thursday that the fithting had ceased in Tubmanburg , where the Kahn wing fighters of the United Liberation Movement for Democracy in Liberia ( ULIMO ) on attacked and killed three ECOMOG soldiers on December 28 last year . 49 | Hashimoto , Deputy Prime Minister who heads the Liberal Democratic Party , the largest component within Japan 's three-way ruling coalition , is the most likely choice . 50 | The sources disclosed the mandrax drug comes from Asia and is sent to South Africa while heroine from India and Thailand and to Europe , North America and Canada . 51 | JOHANNESBURG , January 5 ( Xinhua ) -- The announcement of a new multi-party commission to investigate political violence in South Africa 's Kwazulu/Natal sparked a verbal battle on Thursday between political parties in the Province . 52 | Talking about Seguin 's visit at a press conference earlier today , Lebanese Foreign Minister Farez Boueiz pointed to a `` big concern '' for Europe generally , and for France particularly , about what is happening in the region and especially what is occurring in Lebanon . 53 | JOHANNESBURG , January 5 ( Xinhua ) -- South Africa is expected to send a multi-party delegation to Germany next week to study its federal system , the German Embassy said today in Pretoria . 54 | TOKYO , January 5 ( Xinhua ) -- Prime Minister Tomiichi Murayama said today he is stepping down after 18 months in office so that a new coalition cabinet can take measures to ensure Japan 's economic recovery and handle other pressing issues with the beginning of the new year . 
55 | ADDIS ABABA , January 5 ( Xinhua ) -- Ethiopian Foreign Minister Seyoum Mesfin said here today that both Yemen and Eritrea have accepted Ethiopia 's Proposal on the disputed Hanish Islands . 56 | The Ethiopian minister made the announcement when he returned home from Asmara , Eritrea , where he had discussions with Eritrean leaders on solutions to the Hanish Islands crisis . 57 | The Syrian official daily Tishrin expressed hope Wednesday that 1996 should be the year of peace on the Syrian and Lebanese tracks though reiterating Syria 's demand that Israel commit itself to a complete and early withdrawal from the Golan Heights . 58 | TEHRAN , January 5 ( Xinhua ) -- Iran has urged natives of the disputed island of Abu Musa to `` be alert against any aggression '' and `` defend every inch of our territory . '' 59 | The strongly-worded statement was made by Iranian First Vice- President Hassan Habibi today in the disputed island of Abu Musa , the highest ranking Iranian official ever set his foot on this island , the most controversial dispute between Iran and the United Arab Emirates ( UAE ) . 60 | UNITED NATIONS , January 5 ( Xinhua ) -- The U.N. Security Council , in its regular 60-day review of sanctions against Iraq , on Friday decided to maintain the economic embargo . 61 | ANKARA , January 5 ( Xinhua ) -- Turkish Pro-Islamist Welfare Party ( RP ) leader Necmettin Erbakan said today that the West has nothing to fear from his party 's coming to power because then Turkey will be a powerful ally of the West . 62 | Besides , it restricts Turkey 's trade with other countries , including the 'Turkish Republic of Northern Cyprus , ' Erbakan said , adding , `` I am deeply surprised how they could suggest such an agreement . '' 63 | LAGOS , January 5 ( Xinhua ) -- The Sierra Leonean High Commissioner to Nigeria , Joe Blell , has appealed to the Nigerian government to review its decision not to participate in the 20th African Cup . 
64 | The envoy told the News Agency of Nigeria ( NAN ) here today that Nigeria 's participation in the forthcoming tournament to be opened in South Africa would promote unity and peace in Africa . 65 | Commenting on the president 's decision , Medvedev said , `` The president 's actions were logical : Boris Yeltsin repeatedly and sharply criticized the Foreign Ministry and its leader for miscalculations and drawbacks connected with the conduct of Russia 's foreign policy and coordination of activities o various ministries and departments operating abroad . '' 66 | He cited Yeltsin as saying that foreign policy priorities are the building up of Russia 's ties with CIS ( Commonwealth of Independent States ) members , of its partnership with Western nations , China , Japan and India . 67 | NAIROBI , January 5 ( Xinhua ) -- Kenyan Foreign Minister Kalonzo Musyoka today called on foreign envoys to Kenya to issue positive travel advice to their fellow citizens as the security is steadily strengthened in the country . 68 | Diplomats including all the deans of various regions represented in Kenya congratulated Kenyan police and the government on the recent success in combating crime in Nairobi . 69 | At a press conference in the West Bank town of Ramallah , Abbas said that 1,0103,235 Palestinian voters in the West Bank and Gaza have registered for the PC elections due to take place on January 20 . 70 | The Syrian official daily Tishrin expressed hope Wednesday that 1996 should be the year of peace on the Syrian and Lebanese tracks though reiterating Syria 's demand that Israel commit itself to a complete and early withdrawal from the Golan Heights . 71 | ATHENS , January 7 ( Xinhua ) -- Greece 's main opposition party , the New Democracy Party ( ND ) , is to table a censure motion against the government Monday , claiming Prime Minister Andreas Papandreou 's extended incapacity had left the country virtually ungoverned . 
72 | LAGOS , January 7 ( Xinhua ) -- No fewer than 3.3 million Nigerians are suffering from river blindness , according to a report from the News Agency of Nigeria . 73 | Johannesburg , January 7 ( Xinhua ) -- The year 1996 is expected to witness South Africa 's ruling party facing unprecedented challenges as the African National Congress ( ANC ) fights to repair the damage apartheid has wrecked on the country . 74 | AMMAN , January 7 ( Xinhua ) -- U.S. Secretary of Defense William Perry arrived here today for talks on Jordan 's military needs , including a long-sought military aid package to modernize Jordan 's forces . 75 | However , it is reported that Washington has agreed in principle to provide Jordan with the single-engine F-16s , of which the number is not immediately known but is estimated at between 12 and 16 . 76 | JERUSALEM -- Israel went on high alert following the death of Hamas activist Yehia Ayyash , known as `` the Engineer , '' in Gaza Friday , as Hamas threatened to launch revenge attacks on Israeli targets . 77 | ISLAMABAD -- At least six people were killed and 14 others injured Sunday evening in Pakistan 's southern port city of Karachi when an explosive went off in a bus , according to local reports . 78 | The Ethiopian foreign minister has visited Asmara and Sanaa three times since the conflict erupted last October between Eritrea and Yemen . 79 | KUWAIT CITY , January 7 ( Xinhua ) -- Czech Prime Minister Vaclav klaus arrived here today on an official two-day visit to Kuwait . 80 | VIENNA , January 7 ( Xinhua ) -- Italy 's Alberto Tomba clinched his third consecutive slalom victory after a brilliant second run in the Alpine skiing men 's World Cup in Flachau , Austria , on Sunday . 81 | Mario Reiter of Austria finished second with 1:41.25 while Slovenia 's Jure Kosir was in third in 1:41.45 . 82 | Austria 's Thomas Sykora was the only skier to have a better second run than Tomba , vaulting him from seventh to fourth with 1:41.48 . 
83 | LUANDA , January 8 ( Xinhua ) -- A visiting World Bank official today pledged the bank 's `` full support '' to Angola which she said was torn apart by an `` economic crisis . '' 84 | He said that the UK accounted for one-fifth of man-made emissions of carbon dioxide in Europe , compared with 12 percent from France and 13 percent from Italy . 85 | NICOSIA , January 8 ( Xinhua ) -- Cyprus and the United States have agreed that there should be an overall approach to all key issues composing the Cyprus problem and that all fundamental questions should be on the negotiating table . 86 | In addition , Arab parliamentarians will hold a special session Thursday to discuss the U.S. Congress decision to move the American embassy to Israel from Tel Aviv to Jerusalem . 87 | BEIRUT , January 7 ( Xinhua ) -- Speaker of the Canadian House of Representatives Gilbert Adolph Parent reiterated here today his country 's support for the implementation of U.N. Security Council resolution 425 , which calls Israel 's withdrawal from Lebanon . 88 | JOHANNESBURG , January 8 ( Xinhua ) -- President Nelson Mandela today considered the death of former French President Francois Mitterrand as `` a great loss to the people and government of South Africa . '' 89 | ISLAMABAD , January 8 ( Xinhua ) -- Pakistan and Britain have reaffirmed their cooperation in rooting out drug menace . 90 | JOHANNESBURG , January 8 ( Xinhua ) -- Claims of National Intelligence Agency ( NIA ) agents spying on top police officials might have to referred to an independent investigator , South Africa 's First Deputy President Thabo Mbeki said today . 91 | ADDIS ABABA , January 9 ( Xinhua ) -- Ethiopia 's ambitious coffee export plans for 1996 have been met with the slide of coffee prices in the world market . 92 | LUSAKA , January 9 ( Xinhua ) -- Germany today extended three aid packages to Zambia totalling 21.5 million U.S. dollars to finance its rural water supply projects and other aid programs . 
93 | BAGHDAD , January 9 ( Xinhua ) -- Iraq today demanded the release of a sum of 50 million U.S. dollars of assets frozen in the countries of Saudi Arabia , Bahrain and the United Arab Emirates to cover expenses of Iraqis ' pilgrimage to Mecca and the cost of printing the holy Qur'an . 94 | Under the agreement , the EU pledged to provide about 518,000 U.S. dollars to finance the deployment of the first group of five observers in Burundi 's capital city of Bujumbura for a period of three and a half months . 95 | JERUSALEM , January 9 ( Xinhua ) -- The Palestinian National Authority ( PNA ) may ask Israel to extradite Kamal Hamad , suspected of being involved in the murder of Hamas activist Yihye Ayyash , Israel Radio reported this evening . 96 | HARARE , January 9 ( Xinhua ) -- The British government is willing to reconsider providing aid for Zimbabwe 's land redistribution if the Zimbabwean government spells out its needs and the new strategies to be adopted after the failure of the first resettlement program . 97 | JERUSALEM , January 9 ( Xinhua ) -- The head of the European Union Electoral Unit , Carl Lidbom , today called on Israel to allow free movements of the Palestinian candidates in their campaign prior to the scheduled January 20 elections . 98 | CAIRO , January 9 ( Xinhua ) -- Egyptian Prime Minister Kamal el-Ganzouri met here today with Israeli Minister of Energy Gonen Segev , who is currently visiting Egypt for talks on promoting bilateral cooperation in the field of oil . 99 | `` With the situation determined , the president assigned deputy from Konya Province , Necmettin Erbakan , chairman of the Welfare Party which has the most seats in the parliament , to form the government , '' the statement concluded . 100 | BEIJING -- Chinese Vice-Premier and Foreign Minister Qian Qichen said here today that China has always attached importance to its relations with Britain . 
101 | SHANGHAI -- Than Shwe , chairman of the State Law and Order Restoration Council ( SLORC ) of Myanmar , had a meeting here today with the city 's mayor Xu Kuangdi . 102 | LUANDA , January 9 ( Xinhua ) -- Visiting Portuguese President Mario Soares conferred here today the Medal of Afonso Henrique on the United Nations special envoy to Angola , Blondin Beye . 103 | -- China confirms that all the residents now with permanent residence status in Hong Kong will continue to have residence status after June 30 , 1997 . 104 | AMMAN , January 9 ( Xinhua ) -- Visiting Saudi Arabian Foreign Minister Prince Saud al-Faisal al Saud said here today that King Hussein of Jordan has accepted King Fahd iben Abdul-Aziz's invitation to visit Riyadh . 105 | JOHANNESBURG , January 10 ( Xinhua ) -- South Africa , southern Africa 's largest maize producer , is set to harvest between 7 and 8 million tons of maize this year following a good start to the rainy season , the state-run maize board said here today . 106 | Johannesburg , January 10 ( Xinhua ) -- South Africa 's ruling party , the African National Congress ( ANC ) , today regretted on the refusal by its rival , the Inkatha Freedom Party ( IFP ) , to resume talks on the country 's first post-apartheid Constitution . 107 | The blizzard left Philadelphia with over 30 inches of snow , dumped 27 inches of white powder in New York City and Washington , D.C. and buried Shenandoah National Park in Virginia with as much as 47 inches of flakes . 108 | Massachusetts , New York and Maryland already had light snow overnight . 109 | ADDIS ABABA , January 10 ( Xinhua ) -- France will give the highest regard to Ethiopia 's efforts to seek a political settlement to the problem in the Red Sea Hanish Islands , said French Ambassador to Ethiopia Francis Gutmann , who is special envoy of French President Jack Chirac . 
110 | Cape Town is the early favorite but will face challenge from Athens , Buenos Aires , Istanbul , Lille , Rio de Janeiro , Rome , San Juan , Seville , Stockholm and St Petersburg . 111 | LONDON , January 10 ( Xinhua ) -- The Primate of the Church of Ireland , Archbishop Robin Eames said today that he could act as a broker in the arms decommissioning process in Northern Ireland , reported BBC TV this afternoon . 112 | Thomas Muster , Austria , 4,474 4 . 113 | Boris Becker , Germany , 3,325 5 . 114 | Yevgeny Kafelnikov , Russia , 2,660 7 . 115 | Wayne Ferreira , South Africa , 2,144 10 . 116 | Goran Ivanisevic , Croatia , 1,861 11 . 117 | Sergi Bruguera , Spain , 1,666 13 . 118 | Michael Stich , Germany , 1,653 14 . 119 | Arnaud Boetsch , France , 1,412 15 . 120 | Marc Rosset , Switzerland , 1,391 16 . 121 | Gilbert Schaller , Austria , 1,256 20 . 122 | Andrea Gaudenzi , Italy , 1,212 123 | KUALA LUMPUR , January 11 ( Xinhua ) -- Malaysian Prime Minister Mahathir Mohamad said today that Malaysia should take steps to correct the country 's current balance of payment ( BOP ) deficit , which reached 18 billion Malaysia Ringgit ( 7.2 billion US dollars ) at the end of 1995 . 124 | WELLINGTON , January 10 ( Xinhua ) -- Jaime Yzaga of Peru recovered from a year of doldrums to upset top seed Thomas Enqvist of Sweden 7-5 , 6-4 in the 328,000-U.S.-Dollar BellSouth Open on Wednesday in Auckland , New Zealand . 125 | Second-seed MaliVai Washington moved into the last eight after outplaying Martin Damm 6-3 , 6-4 in 82 minutes , while eighth seed Jiri Novak of the Czech Republic disposed of Italian Stefano Pescosolido 6-2 , 6-3 . 126 | CANBERRA , January 11 ( Xinhua ) -- Video games have been proved of little educational value as they are promised , an education expert said in Australia . 
127 | Stephen Kline , professor from the Simon Fraser University of Canada , told an international meeting held in Hobart in Australia that video games offered little help in enhancing children's ability of learning , analysis , critique and intellectual inquiry . 128 | PHNOM PENH , January 11 ( Xinhua ) -- A U.S. military chief today ended his two-day visit to Cambodia , with the promise of continued U.S. aid for the country . 129 | -- Ryutaro Hashimoto , president of the Liberal Democratic Party , is to be elected Japan 's new prime minister in a parliamentary election today . 130 | YANGON , January 11 ( Xinhua ) -- Nearly 200 foreign devotees from 18 countries or regions including Singapore , Britain , Australia , USA , Japan and France will take part in collective ordination and novitiation in Myanmar . 131 | BEIJING , January 11 ( Xinhua ) -- China 's electronic industry witnessed a sustained and stable growth last year , with total industrial output value reaching 230 billion yuan , up 23.5 percent from that of previous year , according to Zhang Jinqiang , vice minister of Electronic Industry . 132 | -- China is confident that Hong Kong will have a smooth transfer and a stable and prosperous future , Premier Li Peng said yesterday during a meeting with visiting British Foreign Secretary Malcolm Rifkind . 133 | The two leaders fixed the date for Kohl 's forthcoming visit to Russia scheduled for the end of April after the `` eight '' summit in Moscow . 134 | ANKARA , January 12 ( Xinhua ) -- Turkey and Bulgaria are planning to open a new border gate between the two countries , the Anatolia News Agency reported today . 135 | Officials from the two countries will meet in Edirne province of Turkey on January 25 and then in Bulgaria on January 26 to discuss the opening of Hamzabeyli border gate , located in Edirne province . 
136 | CAIRO , January 12 ( Xinhua ) -- Kenyan Foreign Minister Stephen Musyoka arrived here today leading a delegation on a three-day visit to Egypt , during which he will hold talks with Egyptian officials on promoting bilateral cooperation . 137 | DAMASCUS , January 12 ( Xinhua ) -- U.S. Secretary of State Warren Christopher said here today that Israel and Syria have agreed to to resume their negotiations on January 24 with the participation of military experts from both sides . 138 | He said al-Assad and Israeli Prime Minister Shimon Peres had moved closer to achieving their goal -- an Israeli commitment to a complete withdrawal from the Golan Heights and a Syrian definition of peace terms with Israel . 139 | GENEVA , January 12 ( Xinhua ) -- Two large Swiss banks will arrange negotiations in Hong Kong next week to resolve a decade-long dispute over claims to the estimated 475 million U.S. dollars deposited by the late president of the Philippines , Ferdinand Marcos . 140 | The Swiss deposits represent a fraction of the billions of dollars that Marcos , who died in exile in 1989 , and his widow , Imelda , are accused of looting from the Philippines in their 20 years in power . 141 | KAMPALA , January 12 ( Xinhua ) -- Uganda 's Minister for Foreign Affairs Ruhakana Rugunda today urged the international community to supplement efforts of countries in this sub-region to get a lasting solution to problems in Burundi . 142 | BEIRUT , January 12 ( Xinhua ) -- The Palestine Liberation Organization ( PLO ) today denied a report that it was recruiting fresh bloods in Lebanon for the Palestinian police force in the Palestinian self-ruled areas of Gaza Strip and West Bank . 143 | On the report that the pro-Arafat Fatah faction was collecting light weapons in Ain Al-Hilweh , the largest Palestinian refugee camp in Lebanon , he said that the move was aimed to enhance the security in the Palestinian refugee camps in this country . 
144 | ATHENS , January 12 ( Xinhua ) -- Police in Greece 's northern city of Salonika have expressed fears that a large quantity of Russian-made automatic weapons have been channelled to organized criminals in the country . 145 | ANKARA , January 12 ( Xinhua ) -- Turkish police have seized more than 31 kilograms of heroin in their recent operations against drug trafficking in Turkey , the Anatolia News Agency reported today . 146 | KHARTOUM , January 13 ( Xinhua ) -- Sudan has delivered to the Ethiopian government a strong-worded protest , urging Ethiopia to withdraw its troops from the Sudanese territory . 147 | ADDIS ABABA , January 13 ( Xinhua ) -- Ethiopia 's Investment Code would soon be amended after four years of implementation , a spokesman of the Investment Office said here today . 148 | KUWAIT CITY , January 13 ( Xinhua ) -- Kuwaiti Defense Minister Sheikh Ahmed al-Humoud al-Sabah held talks here this afternoon with his British Secretary of State for Defense Michael Portillo on supplying Kuwait with advanced weapons and ways to boost bilateral relations . 
149 | HONG KONG , January 13 ( Xinhua ) -- Following are news items from the Asia-Pacific Desk of Xinhua in Hong Kong today : hke011329 -- Myanma Leader Returns Home after Visit to China hke011330 -- Malaysia Satellite Useful For Defense Purpose : hke011331 -- Weekly Report On Kuala Lumpur Stock Market hka011332 -- Weather Information for Asian-Pacific Cities hke011333 -- Weekly Report On Malaysia 's Tin , Rubber Markets hka011334 -- 1.5 Million HK Dollars of Electronic Components Looted hke011335 -- Canadian PM Predicts 4-Fold Increase in Trade with hke011336 -- 5 Killed in Road Mishap in Philippines hke011337 -- Hong Kong Exports More Jewelry Last Year hke011338 -- Karamat Sworn in As New Pakistani Army Chief hke011339 -- Roundup : Rushing Irrigation Projects to Boost Food hke011340 -- ADB Approves Loan to Pakistan for Drainage Program hke011341 -- India Set on Higher Growth Path , RBI Governor Says hke011342 -- Malaysia Not Ready For Diplomatic Ties with Israel hke011343 -- Indonesian Armed Forces Manage to Free 11 Hostages hke011344 -- Pakistan , Iran to Cooperate in Highway Projects hke011345 -- Roundup : Cooperative Tap on China 's Offshore Oil , hke011346 -- Indonesia 's Polls Committee Provides Hotline Service hke011347 -- Indonesia to Open More Single-Teacher Schools hka011348 -- Chinese Taipei Badminton Open Results ( 1 ) hke011349 -- Regional Meeting on Adventure Travel Opens in Nepal hke011350 -- Ye , Dong qualify for Chinese Taipei Open Badminton Final hke011351 -- Chinese Taipei Badminton Open Results ( 2-last ) 150 | HONG KONG , January 13 ( Xinhua ) -- Indonesia 's world champion Susi Susanti set up a show-down with China 's world number one Ye Zhaoying in the women 's singles final after their semi-final victories at the Chinese Taipei Open badminton tournament in Taipei today . 151 | And in the men 's singles , it was China 's Dong Jiong who stole the show , beating hot-favored top seed Allan Budikusuma from Indonesia . 
152 | Sun won high worlds from the local media after he scored a come-from-behind victory over Indonesia 's Ardy Wiranata , ranked fifth in the world , yesterday . 153 | Other finalists were Liu Jianjun and Sun Man of China and Danes R. Olsen and M. Sogaard for the mixed doubles , Ge Fei and Gu Jun of China for the women 's doubles , and Swedes P-G Jonsson and P. Axelsson in the men 's doubles . 154 | NEW YORK , January 13 ( Xinhua ) -- New York police have arrested two men and seized 213 kilograms of cocaine in New York City , local press reports said today . 155 | The satellite , together with experiments , will be brought home in the cargo bay of Endeavour , which will land at Kennedy Space Center in Cape Canaveral , Florida , on January 20 . 156 | Clinton was greeted at the airport by Croatian President Franjo Tudjman and other senior Croatian officials and the two presidents were expected to hold talks at the airport before Clinton goes to Washington later this evening . 157 | Earlier today , Clinton stopped over in Aviano of Italy , Taszar of Hungary and Tuzla of Bosnia to see U.S. troops involved in the NATO 's Bosnian peace-keeping operations and held talks with Hungarian and Bosnian presidents . 158 | JOHANNESBURG , January 13 ( Xinhua ) -- South Africa claimed the first victory of the African Nations ' Soccer Cup competitions which opened here today . 159 | The country 's national squad defeated Cameroon 3-0 in their first-ever participation of Africa 's most important soccer events in years . 
160 | BONN , January 13 ( Xinhua ) -- Following are the resutls in women 's Alpine skiing World Cup in Garmisch-Partenkirchen , Germany , on Saturday : 161 | Alexandra Meissnitzer , Austria , 282 points 162 | Katja Seizinger , Germany , 265 163 | Michaela Dorfmeister , Austria , 179 164 | Heidi Zurbriggen , Switzerland , 178 165 | Anita Wachter , Austria , 174 166 | Isolde Kostner , Italy , 111 167 | Elfi Eder , Austria , 440 168 | Pernilla Wiberg , Sweden , 302 169 | TIRANA , January 13 ( Xinhua ) -- Armed robbers looted the branch of Albania 's largest holding company at the Southern port of Vlora Friday and vanished with an amount of cash worth 144,000 U.S. dollars . 170 | SEOUL , January 13 ( Xinhua ) -- China 's nine-duan player Cao Dayuan went down to South Korea 's seven-duan Lee Chang Ho in the Three-country go-chess contest here on Saturday . 171 | So far , China and South Korea both have two players left while Japan has only one after Yoda 's loss on Friday . 172 | The players will move to China 's Shanghai to continue their competition in early February . 173 | He expressed satisfaction with the outcome of the talks held by the Dutch delegation in Damascus and their positive impact on bilateral relations between the Netherlands and Syria . 174 | The Crown Prince and Arafat affirmed that Moslems would not concede their Islamic and historical rights in Jerusalem , and reiterated their rejection of Israel 's attempts to Judaize the Holy City , Deeb said . 175 | On Tuesday , President Suleyman Demirel asked Necmettin Erbakan , leader of the Welfare Party , which came first at the December 24 general elections , to form a new government . 176 | JOHANNESBURG , January 14 ( Xinhua ) -- South Africa 's Deputy President Thabo Mbeki today led a government delegation to welcome returning refugees back to their Shobashobane homes in Kwazulu/Natal Province . 
177 | AMMAN , January 14 ( Xinhua ) -- Jordan 's Prime Minister Sharif Zeid Ben Shaker held talks here today with his Dutch counterpart Wim Kok who arrived earlier today for a two-day official visit to Jordan . 178 | OTTAWA , January 14 ( xinhua ) -- Doctors have warned Canadians that they will be hit by influenza which has been traveling from west to east across Canada in this winter . 179 | CAIRO , January 14 ( Xinhua ) -- The American investments in Egypt will be doubled next year from the present 1 billion U.S. dollars , American Ambassador to Egypt Edward Walker said here today prior to U.S. Vice President Al Gore 's visit to Egypt Monday . 180 | KUWAIT CITY , January 14 ( Xinhua ) -- French Defense Minister Charles Millon arrived here today from Djibouti on an official visit to Kuwait . 181 | In a departure statement in Paris to the Kuwaiti News Agency on the eve of his visit to Kuwait , Millon said the visit was `` an occasion to reaffirm France 's concern about the stability in the Gulf and security of Kuwait , and to review defense issues related to armament . 182 | He added that the 1991 Gulf war had opened the way for his country 's reaching of defense agreements with the states of the Arabian peninsula , adding France has signed defense agreements with with Kuwait , Qatar and the United Arab Emirates respectively . 183 | The 130 peacekeepers were held hostage following the outbreak on December 28 of a battle between the fighters of the Krahn wing of the United Liberation Movement for Democracy in Liberia ( ULIMO-J ) and the ECOMOG in Tubmanburg , 60 km north of capital Monrovia . 184 | Boutros-Ghali said in a BBC television interview this morning he hoped that Germany and Japan would finally join the United States , Russia , Britain , France and China as permanent members of the U.N. Security Council . 
185 | HONG KONG , January 14 ( Xinhua ) -- Following are news items from the Asia-Pacific Desk of Xinhua in Hong Kong today : hke011401 -- Philippines Continues to Contain Monetary Expansion hka011402 -- HK Police Officers Involved in Offenses on Rise hke011403 -- Myanmar Newspaper Hails Than Shwe 's China Visit hke011404 -- Major News Items in Leading Pakistani Newspapers hke011405 -- Major News Items in Leading Philippine Newspapers hke011406 -- Foreign Tourists Stay Longer in Nepal : Study hke011407 -- Major News Items in Indian Newspapers hke011408 -- Sri Lankan Troops Set For Major Offensive In East hke011409 -- Mahathir to go Live On Internet With Arafat , Ramos hke011410 -- India , Canada Sign Trade Deals Worth 3.39 Billion Dollars hke011411 -- Australia Outlines Trade Outlook To 2000 hke011412 -- India , Iran to Strengthen Bilateral Cooperation hke011414 -- Susi Susanti Wins Chinese Taipei Open Badminton hke011415 -- Pakistan Exports Down in 1st Half of This Fiscal Year hka011416 -- Weather Information for Asian-Pacific Cities hke011417 -- Dong Jiong Wins Chinese Taipei open Badminton Title hke011418 -- Canadian PM Visits Pakistan to Explore Market hke011419 -- Roundup : Manila Tightens Price Monitoring hke011420 -- Myanmar Trade Fair '96 Realizes 85.5 Million Dollars hke011421 -- Fatal Particles in Air threaten HK People 's Health : hka011422 -- Portuguese Presidential Election Not Affect Macao's hke011423 -- Final Results of Chinese Taipei Open Badminton hka011424 -- One third HK Women Suffer Urinary Incontinence : Survey hka011425 -- Karachi Stock Exchange Index Up hke011426 -- Pakistan , Netherlands Sign Economic Agreement hke011427 -- Chinese , Indonesian Win Chinese Taipei Open Badminton hke011428 -- UN Envoy to Afghanistan Insists on Previous Peace Plan hke011429 -- Nepal to Solve Problem of Water Shortage in Capital hke011430 -- Chinese Taipei Golfer wins Omega PGA Championship
186 | South Africa 's Richard Kaplan , who equaled the best score of the week with a final round 65 , finished third on three under par , and America 's Mike Cunning fourth a further stroke behind .
187 | Thailand 's Boonchu Ruangkit was second with 173,177 US dollars , and Indian Jeev Milkha Singh third with 154,403 US dollars .
188 | WASHINGTON , January 14 ( Xinhua ) -- Space shuttle Endeavour crew released a U.S. science satellite Sunday , a day after retrieving a Japanese spacecraft , said U.S. space agency officials in Cape Canaveral , Florida .
189 | WASHINGTON , January 14 ( Xinhua ) -- Space shuttle Endeavor crew released a U.S. science satellite today , a day after retrieving a Japanese spacecraft , said U.S. space agency officials in Cape Canaveral , Florida .
190 | JERUSALEM , January 14 ( Xinhua ) -- Israel will not interfere in the process of the Palestinian elections scheduled on January 20 , but will take security steps on its part on the day .
191 | MOSCOW -- Russia extended the deadline Sunday for Chechen rebels to release some 70 hostages from 10 a.m. ( 700 GMT ) to 1 p.m. ( 1000 GMT ) to avoid a bloodshed to end the five-day crisis in southern Russia , after the gunmen failed to free their captives by the old deadline .
192 | TEHRAN -- Iran 's security could be affected by the wars raging in Central Asia and the Caucasus , said Iranian foreign minister here Sunday .
193 | TEHRAN -- Iran 's Foreign Minister Ali Akbar Velayati said here Sunday that it is only of defensive purpose for Iran to earmark 25 billion rials ( about 14.28 million U.S. dollars ) as a tit-for-tat measure to fight back the U.S. covert-action against Iran .
194 | DAMASCUS -- Syria is expecting that the January 24 peace talks with Israel in the U.S. will secure substantial progress in bringing the two countries closer to peace .
195 | NEW DELHI -- India and Canada have signed business deals worth about 3.39 billion U.S. dollars during Canadian Prime Minister Jean Chretien 's visit that began on Tuesday , the National Herald said Sunday .
196 | HANOI , January 14 ( Xinhua ) -- Vietnam and Laos today signed a number of agreements on strengthening cooperation in economic , cultural , and scientific and technological fields in 1996 and in the 1996-2000 period , the official Vietnam News Agency said .
197 | BONN , January 14 ( Xinhua ) -- Following are the women 's ski racing World Cup standings after Sunday 's slalom in Garmisch- Partenkirchen , Germany :
198 | Elfi Eder , Austria , 520 points
199 | Martina Accola , Switzerland , 291
200 | Pernilla Wiberg , Sweden , 214
201 |
--------------------------------------------------------------------------------
/data/emTypeMap.txt:
--------------------------------------------------------------------------------
1 | people.person PERSON
2 | organization.organization ORGANIZATION
3 | location.location LOCATION
4 |
--------------------------------------------------------------------------------
/data/rmTypeMap.txt:
--------------------------------------------------------------------------------
1 | location.country.capital location.country.capital
2 | sports.sports_team_location.teams sports.sports_team_location.teams
3 | people.deceased_person.place_of_death people.deceased_person.place_of_death
4 | location.neighborhood.neighborhood_of location.neighborhood.neighborhood_of
5 | sports.sports_team.location sports.sports_team.location
6 | people.person.place_of_birth people.person.place_of_birth
7 | organization.organization.place_founded organization.organization.place_founded
8 | location.location.contains location.location.contains
9 | organization.organization.founders organization.organization.founders
10 | location.country.administrative_divisions location.country.administrative_divisions
11 | business.industry.companies business.industry.companies
12 | business.person.company business.person.company
13 | people.person.religion people.person.religion
14 | people.person.nationality people.person.nationality
15 | people.place_lived.person people.place_lived.person
16 | people.person.children people.person.children
17 | organization.organization.advisors organization.organization.advisors
18 |
--------------------------------------------------------------------------------
/data/test.txt:
--------------------------------------------------------------------------------
1 | Stephen A. Schwarzman , the co-founder of the Blackstone Group , which is in the process of going public , made $ 400 million last year .
2 | For Mrs. Clinton , the strategy for reaching black voters at this early stage of the campaign involves strong outreach to black elected officials , business leaders and others , followed by phone calls to reinforce her candidacy from her husband and supporters like Robert L. Johnson , who founded Black Entertainment Television .
3 | DEAL FOR ` NO DEAL ' CREATOR A group of investors led by John de Mol , one of the founders of Endemol , which pioneered reality television series like '' Big Brother '' and game shows like '' Deal or No Deal , '' agreed to buy a controlling stake in the company .
4 | It 's so hypocritical for any network in this culture to go all puritanical on the subject of condom use when their programming is so salacious , '' said Mark Crispin Miller , a media critic who teaches at New York University . ''
5 |
--------------------------------------------------------------------------------
/getInputJsonFile.sh:
--------------------------------------------------------------------------------
1 | inTestFile='./data/test.txt'
2 | inTrainFile='./data/documents.txt'
3 | mentionType='both' #'em' or 'rm' or 'both'
4 | emTypeMapFile='./data/emTypeMap.txt'
5 | rmTypeMapFile='./data/rmTypeMap.txt'
6 | outTrainFile='./data/train.json'
7 | outTestFile='./data/test.json'
8 | bcInputFile='./data/bc_input.txt'
9 | bcOutDir='./data/brown-out'
10 | bcOutOrigFile='./data/brown-out/paths'
11 | bcOutFile='./data/brown'
12 | parseTool='stanford' #'nltk' or 'stanford'
13 | testOnly=false
14 | freebaseDir='./freebase'
15 |
16 | if [ "$testOnly" = false ] ; then
17 |     echo 'start generating train json file'
18 |     python code/generateJson.py $inTrainFile $outTrainFile $parseTool 1 $mentionType $emTypeMapFile $rmTypeMapFile $freebaseDir
19 |
20 |     echo 'removing tmp files...'
21 |     rm tmp1.json
22 |     rm tmp2.json
23 | fi
24 |
25 | echo 'start generating test json file'
26 | python code/generateJson.py $inTestFile $outTestFile $parseTool 0 $mentionType
27 |
28 | echo 'start generating brown cluster input file from train & test json files'
29 | python code/brown-cluster/generateBClusterInput.py $outTrainFile $outTestFile $bcInputFile
30 |
31 | echo 'start generating brown file'
32 | code/brown-cluster/wcluster --text $bcInputFile --c 300 --output_dir $bcOutDir
33 | mv $bcOutOrigFile $bcOutFile
34 |
--------------------------------------------------------------------------------
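The two type-map files listed above (data/emTypeMap.txt and data/rmTypeMap.txt) hold one pair per line: a Freebase type name followed by the target type name. The sketch below shows one way such a file could be read into a dictionary. It is an illustration only: `load_type_map` is a hypothetical helper that does not appear in this repository, and the whitespace delimiter is an assumption inferred from the sample lines, not taken from generateJson.py.

```python
def load_type_map(lines):
    """Map each Freebase type name to its target type name.

    Hypothetical helper: assumes two whitespace-separated fields per line,
    as in the data/emTypeMap.txt sample.
    """
    type_map = {}
    for line in lines:
        fields = line.split()
        if len(fields) != 2:
            continue  # skip blank or malformed lines
        source_type, target_type = fields
        type_map[source_type] = target_type
    return type_map


# The three entries shown in data/emTypeMap.txt.
sample = [
    "people.person PERSON",
    "organization.organization ORGANIZATION",
    "location.location LOCATION",
]
em_type_map = load_type_map(sample)
print(em_type_map["people.person"])
```

Passing an open file object instead of the sample list works the same way, since the helper only iterates over lines.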