├── CRFs-latin-word-segmenation.ipynb
└── README.md
/CRFs-latin-word-segmenation.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# Using Conditional Random Fields and Python for Latin word segmentation\n",
8 | "\n",
9 | "In this project, most texts from the Latin Library are being utilized to train a CRF the segmentation of Latin texts. \n",
10 | "For several centuries, Latin text was written without the use of space characters or any other word delimiters (scriptio continua)."
11 | ]
12 | },
13 | {
14 | "cell_type": "markdown",
15 | "metadata": {},
16 | "source": [
17 | "First, perform some imports. The Python CRFSuite can be installed via\n",
18 | "__ pip install python-crfsuite __"
19 | ]
20 | },
21 | {
22 | "cell_type": "code",
23 | "execution_count": 1,
24 | "metadata": {},
25 | "outputs": [],
26 | "source": [
27 | "from bs4 import BeautifulSoup\n",
28 | "from urllib.request import urlopen, HTTPError\n",
29 | "import pycrfsuite"
30 | ]
31 | },
32 | {
33 | "cell_type": "code",
34 | "execution_count": 2,
35 | "metadata": {
36 | "collapsed": true
37 | },
38 | "outputs": [],
39 | "source": [
40 | "base_url = \"http://www.thelatinlibrary.com/\""
41 | ]
42 | },
43 | {
44 | "cell_type": "markdown",
45 | "metadata": {},
46 | "source": [
47 | "## Retrieving our training data\n",
48 | "\n",
49 | "Now, get most links on the Latin Library's homepage -- ignoring some links that are not associated with a particular author."
50 | ]
51 | },
52 | {
53 | "cell_type": "code",
54 | "execution_count": 3,
55 | "metadata": {
56 | "collapsed": true
57 | },
58 | "outputs": [],
59 | "source": [
60 | "home_content = urlopen(base_url)\n",
61 | "soup = BeautifulSoup(home_content, \"lxml\")\n",
62 | "author_page_links = soup.find_all(\"a\")\n",
63 | "author_pages = [ap[\"href\"] for i, ap in enumerate(author_page_links) if i < 49]\n",
64 | "ap_content = list()\n",
65 | "texts = list()"
66 | ]
67 | },
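68 | {
69 | "cell_type": "markdown",
70 | "metadata": {},
71 | "source": [
72 | "The cutoff of 49 links is tied to the current layout of the homepage. As a quick, purely illustrative sanity check, we can print a sample of the collected hrefs and verify that they look like author pages:"
73 | ]
74 | },
75 | {
76 | "cell_type": "code",
77 | "execution_count": null,
78 | "metadata": {},
79 | "outputs": [],
80 | "source": [
81 | "# Illustrative check: the entries should look like author pages (e.g. \"ammianus.html\").\n",
82 | "# The cutoff of 49 above assumes the homepage layout has not changed.\n",
83 | "print(len(author_pages))\n",
84 | "print(author_pages[:5])"
85 | ]
86 | },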
68 | {
69 | "cell_type": "code",
70 | "execution_count": 4,
71 | "metadata": {
72 | "collapsed": true
73 | },
74 | "outputs": [],
75 | "source": [
76 | "for ap in author_pages:\n",
77 | " ap_content.append(urlopen(base_url + ap))"
78 | ]
79 | },
80 | {
81 | "cell_type": "markdown",
82 | "metadata": {},
83 | "source": [
84 | "Next, create a list of all links pointing to Latin texts. The Latin Library uses a special format which makes it easy to find the corresponding links: All of these links contain the name of the text author."
85 | ]
86 | },
87 | {
88 | "cell_type": "code",
89 | "execution_count": 5,
90 | "metadata": {
91 | "collapsed": true
92 | },
93 | "outputs": [],
94 | "source": [
95 | "book_links = list()"
96 | ]
97 | },
98 | {
99 | "cell_type": "code",
100 | "execution_count": 6,
101 | "metadata": {},
102 | "outputs": [
103 | {
104 | "name": "stdout",
105 | "output_type": "stream",
106 | "text": [
107 | "Liber XIV\n"
108 | ]
109 | }
110 | ],
111 | "source": [
112 | "for path, content in zip(author_pages, ap_content):\n",
113 | " author_name = path.split(\".\")[0]\n",
114 | " ap_soup = BeautifulSoup(content, \"lxml\")\n",
115 | " book_links += ([link for link in ap_soup.find_all(\"a\", {\"href\": True}) if author_name in link[\"href\"]])\n",
116 | "\n",
117 | "print(book_links[0])"
118 | ]
119 | },
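120 | {
121 | "cell_type": "markdown",
122 | "metadata": {},
123 | "source": [
124 | "Depending on the markup, __book_links__ may contain duplicate hrefs (an author's index page can match its own name, for example). The following optional pass is just a sketch that keeps the first occurrence of each href:"
125 | ]
126 | },
127 | {
128 | "cell_type": "code",
129 | "execution_count": null,
130 | "metadata": {},
131 | "outputs": [],
132 | "source": [
133 | "# Optional sketch: drop duplicate hrefs while preserving order.\n",
134 | "# Whether duplicates actually occur depends on the site's markup.\n",
135 | "seen = set()\n",
136 | "deduped = []\n",
137 | "for bl in book_links:\n",
138 | "    if bl[\"href\"] not in seen:\n",
139 | "        seen.add(bl[\"href\"])\n",
140 | "        deduped.append(bl)\n",
141 | "book_links = deduped"
142 | ]
143 | },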
120 | {
121 | "cell_type": "markdown",
122 | "metadata": {},
123 | "source": [
124 | "Get the text content and write it to a list. We will not need all of the books available, just take the first 200 pages."
125 | ]
126 | },
127 | {
128 | "cell_type": "code",
129 | "execution_count": 7,
130 | "metadata": {},
131 | "outputs": [
132 | {
133 | "name": "stdout",
134 | "output_type": "stream",
135 | "text": [
136 | "Getting content 200 of 200\r"
137 | ]
138 | }
139 | ],
140 | "source": [
141 | "texts = list()\n",
142 | "num_pages = 200\n",
143 | "\n",
144 | "for i, bl in enumerate(book_links[:num_pages]):\n",
145 | " print(\"Getting content \" + str(i + 1) + \" of \" + str(num_pages), end=\"\\r\", flush=True)\n",
146 | " try:\n",
147 | " content = urlopen(base_url + bl[\"href\"]).read() \n",
148 | " texts.append(content)\n",
149 | " except HTTPError as err:\n",
150 | " print(\"Unable to retrieve \" + bl[\"href\"] + \".\")\n",
151 | " continue"
152 | ]
153 | },
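154 | {
155 | "cell_type": "markdown",
156 | "metadata": {},
157 | "source": [
158 | "Downloading 200 pages is slow and puts unnecessary load on the site when the notebook is re-run. A small caching helper -- a sketch assuming local write access, with an arbitrarily named cache directory -- could be dropped in as a replacement for the plain __urlopen__ call above:"
159 | ]
160 | },
161 | {
162 | "cell_type": "code",
163 | "execution_count": null,
164 | "metadata": {},
165 | "outputs": [],
166 | "source": [
167 | "# Sketch: cache each downloaded page on disk. The directory name\n",
168 | "# \"ll_cache\" is an arbitrary choice, not part of the original pipeline.\n",
169 | "import os\n",
170 | "\n",
171 | "def fetch_cached(url, cache_dir=\"ll_cache\"):\n",
172 | "    os.makedirs(cache_dir, exist_ok=True)\n",
173 | "    fname = os.path.join(cache_dir, url.replace(\"/\", \"_\").replace(\":\", \"_\"))\n",
174 | "    if os.path.exists(fname):\n",
175 | "        with open(fname, \"rb\") as f:\n",
176 | "            return f.read()\n",
177 | "    content = urlopen(url).read()\n",
178 | "    with open(fname, \"wb\") as f:\n",
179 | "        f.write(content)\n",
180 | "    return content"
181 | ]
182 | },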
154 | {
155 | "cell_type": "markdown",
156 | "metadata": {},
157 | "source": [
158 | "The text that we would like to retrieve is written on every book page in its paragraphs __1__ to __-1__. \n",
159 | "Then, split the text at periods to convert it into sentences which we will use for training later on."
160 | ]
161 | },
162 | {
163 | "cell_type": "code",
164 | "execution_count": 8,
165 | "metadata": {},
166 | "outputs": [
167 | {
168 | "name": "stdout",
169 | "output_type": "stream",
170 | "text": [
171 | "infamabat autem haec suspicio latinum domesticorum comitem et agilonem tribunum stabuli atque scudilonem scutariorum rectorem qui tunc ut dextris suis gestantes rem publicam colebantur\n"
172 | ]
173 | }
174 | ],
175 | "source": [
176 | "sentences = list()\n",
177 | "\n",
178 | "for i, text in enumerate(texts):\n",
179 | " print(\"Document \" + str(i + 1) + \" of \" + str(len(texts)), end=\"\\r\", flush=True)\n",
180 | " textSoup = BeautifulSoup(text, \"lxml\")\n",
181 | " paragraphs = textSoup.find_all(\"p\", attrs={\"class\":None})\n",
182 | " prepared = (\"\".join([p.text.strip().lower() for p in paragraphs[1:-1]]))\n",
183 | " for t in prepared.split(\".\"):\n",
184 | " part = \"\".join([c for c in t if c.isalpha() or c.isspace()])\n",
185 | " sentences.append(part.strip())\n",
186 | "\n",
187 | "print(sentences[200])"
188 | ]
189 | },
190 | {
191 | "cell_type": "code",
192 | "execution_count": 9,
193 | "metadata": {},
194 | "outputs": [],
195 | "source": [
196 | "sentences = [s for s in sentences if len(s) > 5] # remove very short \"sentences\""
197 | ]
198 | },
199 | {
200 | "cell_type": "code",
201 | "execution_count": 10,
202 | "metadata": {},
203 | "outputs": [
204 | {
205 | "name": "stdout",
206 | "output_type": "stream",
207 | "text": [
208 | "tentis igitur regis utriusque legatis et negotio tectius diu pensato cum pacem oportere tribui quae iustis condicionibus petebatur eamque ex re tum fore sententiarum via concinens adprobasset advocato in contionem exercitu imperator pro tempore pauca dicturus tribunali adsistens circumdatus potestatum coetu celsarum ad hunc disservit modum nemo quaeso miretur si post exsudatos labores itinerum longos congestosque adfatim commeatus fiducia vestri ductante barbaricos pagos adventans velut mutato repente consilio ad placidiora deverti\n"
209 | ]
210 | }
211 | ],
212 | "source": [
213 | "print(sentences[200])"
214 | ]
215 | },
216 | {
217 | "cell_type": "markdown",
218 | "metadata": {},
219 | "source": [
220 | "For preparing our training data, every sentence is converted into a char list together with the information wether the char marks the beginning of a new word."
221 | ]
222 | },
223 | {
224 | "cell_type": "code",
225 | "execution_count": 11,
226 | "metadata": {},
227 | "outputs": [
228 | {
229 | "name": "stdout",
230 | "output_type": "stream",
231 | "text": [
232 | "[('t', 0), ('e', 0), ('n', 0), ('t', 0), ('i', 0), ('s', 0), ('i', 1), ('g', 0), ('i', 0), ('t', 0), ('u', 0), ('r', 0), ('r', 1), ('e', 0), ('g', 0), ('i', 0), ('s', 0), ('u', 1), ('t', 0), ('r', 0), ('i', 0), ('u', 0), ('s', 0), ('q', 0), ('u', 0), ('e', 0), ('l', 1), ('e', 0), ('g', 0), ('a', 0), ('t', 0), ('i', 0), ('s', 0), ('e', 1), ('t', 0), ('n', 1), ('e', 0), ('g', 0), ('o', 0), ('t', 0), ('i', 0), ('o', 0), ('t', 1), ('e', 0), ('c', 0), ('t', 0), ('i', 0), ('u', 0), ('s', 0), ('d', 1), ('i', 0), ('u', 0), ('p', 1), ('e', 0), ('n', 0), ('s', 0), ('a', 0), ('t', 0), ('o', 0), ('c', 1), ('u', 0), ('m', 0), ('p', 1), ('a', 0), ('c', 0), ('e', 0), ('m', 0), ('o', 1), ('p', 0), ('o', 0), ('r', 0), ('t', 0), ('e', 0), ('r', 0), ('e', 0), ('t', 1), ('r', 0), ('i', 0), ('b', 0), ('u', 0), ('i', 0), ('q', 1), ('u', 0), ('a', 0), ('e', 0), ('i', 1), ('u', 0), ('s', 0), ('t', 0), ('i', 0), ('s', 0), ('c', 1), ('o', 0), ('n', 0), ('d', 0), ('i', 0), ('c', 0), ('i', 0), ('o', 0), ('n', 0), ('i', 0), ('b', 0), ('u', 0), ('s', 0), ('p', 1), ('e', 0), ('t', 0), ('e', 0), ('b', 0), ('a', 0), ('t', 0), ('u', 0), ('r', 0), ('e', 1), ('a', 0), ('m', 0), ('q', 0), ('u', 0), ('e', 0), ('e', 1), ('x', 0), ('r', 1), ('e', 0), ('t', 1), ('u', 0), ('m', 0), ('f', 1), ('o', 0), ('r', 0), ('e', 0), ('s', 1), ('e', 0), ('n', 0), ('t', 0), ('e', 0), ('n', 0), ('t', 0), ('i', 0), ('a', 0), ('r', 0), ('u', 0), ('m', 0), ('v', 1), ('i', 0), ('a', 0), ('c', 1), ('o', 0), ('n', 0), ('c', 0), ('i', 0), ('n', 0), ('e', 0), ('n', 0), ('s', 0), ('a', 1), ('d', 0), ('p', 0), ('r', 0), ('o', 0), ('b', 0), ('a', 0), ('s', 0), ('s', 0), ('e', 0), ('t', 0), ('a', 1), ('d', 0), ('v', 0), ('o', 0), ('c', 0), ('a', 0), ('t', 0), ('o', 0), ('i', 1), ('n', 0), ('c', 1), ('o', 0), ('n', 0), ('t', 0), ('i', 0), ('o', 0), ('n', 0), ('e', 0), ('m', 0), ('e', 1), ('x', 0), ('e', 0), ('r', 0), ('c', 0), ('i', 0), ('t', 0), ('u', 0), ('i', 1), ('m', 0), ('p', 0), ('e', 0), ('r', 0), ('a', 0), ('t', 0), ('o', 0), ('r', 0), ('p', 1), ('r', 0), ('o', 0), ('t', 1), ('e', 0), ('m', 0), ('p', 0), ('o', 0), ('r', 0), ('e', 0), ('p', 1), ('a', 0), ('u', 0), ('c', 0), ('a', 0), ('d', 1), ('i', 0), ('c', 0), ('t', 0), ('u', 0), ('r', 0), ('u', 0), ('s', 0), ('t', 1), ('r', 0), ('i', 0), ('b', 0), ('u', 0), ('n', 0), ('a', 0), ('l', 0), ('i', 0), ('a', 1), ('d', 0), ('s', 0), ('i', 0), ('s', 0), ('t', 0), ('e', 0), ('n', 0), ('s', 0), ('c', 1), ('i', 0), ('r', 0), ('c', 0), ('u', 0), ('m', 0), ('d', 0), ('a', 0), ('t', 0), ('u', 0), ('s', 0), ('p', 1), ('o', 0), ('t', 0), ('e', 0), ('s', 0), ('t', 0), ('a', 0), ('t', 0), ('u', 0), ('m', 0), ('c', 1), ('o', 0), ('e', 0), ('t', 0), ('u', 0), ('c', 1), ('e', 0), ('l', 0), ('s', 0), ('a', 0), ('r', 0), ('u', 0), ('m', 0), ('a', 1), ('d', 0), ('h', 1), ('u', 0), ('n', 0), ('c', 0), ('d', 1), ('i', 0), ('s', 0), ('s', 0), ('e', 0), ('r', 0), ('v', 0), ('i', 0), ('t', 0), ('m', 1), ('o', 0), ('d', 0), ('u', 0), ('m', 0), ('n', 1), ('e', 0), ('m', 0), ('o', 0), ('q', 1), ('u', 0), ('a', 0), ('e', 0), ('s', 0), ('o', 0), ('m', 1), ('i', 0), ('r', 0), ('e', 0), ('t', 0), ('u', 0), ('r', 0), ('s', 1), ('i', 0), ('p', 1), ('o', 0), ('s', 0), ('t', 0), ('e', 1), ('x', 0), ('s', 0), ('u', 0), ('d', 0), ('a', 0), ('t', 0), ('o', 0), ('s', 0), ('l', 1), ('a', 0), ('b', 0), ('o', 0), ('r', 0), ('e', 0), ('s', 0), ('i', 1), ('t', 0), ('i', 0), ('n', 0), ('e', 0), ('r', 0), ('u', 0), ('m', 0), ('l', 1), ('o', 0), ('n', 0), ('g', 0), ('o', 0), ('s', 0), ('c', 1), ('o', 0), ('n', 0), ('g', 0), ('e', 0), ('s', 
0), ('t', 0), ('o', 0), ('s', 0), ('q', 0), ('u', 0), ('e', 0), ('a', 1), ('d', 0), ('f', 0), ('a', 0), ('t', 0), ('i', 0), ('m', 0), ('c', 1), ('o', 0), ('m', 0), ('m', 0), ('e', 0), ('a', 0), ('t', 0), ('u', 0), ('s', 0), ('f', 1), ('i', 0), ('d', 0), ('u', 0), ('c', 0), ('i', 0), ('a', 0), ('v', 1), ('e', 0), ('s', 0), ('t', 0), ('r', 0), ('i', 0), ('d', 1), ('u', 0), ('c', 0), ('t', 0), ('a', 0), ('n', 0), ('t', 0), ('e', 0), ('b', 1), ('a', 0), ('r', 0), ('b', 0), ('a', 0), ('r', 0), ('i', 0), ('c', 0), ('o', 0), ('s', 0), ('p', 1), ('a', 0), ('g', 0), ('o', 0), ('s', 0), ('a', 1), ('d', 0), ('v', 0), ('e', 0), ('n', 0), ('t', 0), ('a', 0), ('n', 0), ('s', 0), ('v', 1), ('e', 0), ('l', 0), ('u', 0), ('t', 0), ('m', 1), ('u', 0), ('t', 0), ('a', 0), ('t', 0), ('o', 0), ('r', 1), ('e', 0), ('p', 0), ('e', 0), ('n', 0), ('t', 0), ('e', 0), ('c', 1), ('o', 0), ('n', 0), ('s', 0), ('i', 0), ('l', 0), ('i', 0), ('o', 0), ('a', 1), ('d', 0), ('p', 1), ('l', 0), ('a', 0), ('c', 0), ('i', 0), ('d', 0), ('i', 0), ('o', 0), ('r', 0), ('a', 0), ('d', 1), ('e', 0), ('v', 0), ('e', 0), ('r', 0), ('t', 0), ('i', 0)]\n"
233 | ]
234 | }
235 | ],
236 | "source": [
237 | "prepared_sentences = list()\n",
238 | "\n",
239 | "for sentence in sentences:\n",
240 | " lengths = [len(w) for w in sentence.split(\" \")]\n",
241 | " positions = []\n",
242 | "\n",
243 | " next_pos = 0\n",
244 | " for length in lengths:\n",
245 | " next_pos = next_pos + length\n",
246 | " positions.append(next_pos)\n",
247 | " concatenated = sentence.replace(\" \", \"\")\n",
248 | "\n",
249 | " chars = [c for c in concatenated]\n",
250 | " labels = [0 if not i in positions else 1 for i, c in enumerate(concatenated)]\n",
251 | "\n",
252 | " prepared_sentences.append(list(zip(chars, labels)))\n",
253 | " \n",
254 | " \n",
255 | "print([d for d in prepared_sentences[200]])"
256 | ]
257 | },
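258 | {
259 | "cell_type": "markdown",
260 | "metadata": {},
261 | "source": [
262 | "A cheap way to verify the labels is a round-trip check: re-inserting a space before every character labelled __1__ should reproduce the original sentence. This is a sketch of a sanity test, not part of the pipeline; sentences that contained runs of whitespace (e.g. where punctuation was stripped) may not round-trip exactly."
263 | ]
264 | },
265 | {
266 | "cell_type": "code",
267 | "execution_count": null,
268 | "metadata": {},
269 | "outputs": [],
270 | "source": [
271 | "# Sanity sketch: rebuild each sentence from its (char, label) pairs and\n",
272 | "# compare it to the original. A large mismatch count would indicate a bug.\n",
273 | "mismatches = sum(\n",
274 | "    \"\".join((\" \" + c) if label == 1 else c for c, label in prepared) != original\n",
275 | "    for original, prepared in zip(sentences, prepared_sentences)\n",
276 | ")\n",
277 | "print(mismatches, \"of\", len(sentences), \"sentences do not round-trip exactly\")"
278 | ]
279 | },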
258 | {
259 | "cell_type": "markdown",
260 | "metadata": {},
261 | "source": [
262 | "## Transforming the characters to feature vectors.\n",
263 | "\n",
264 | "Finally, we can create some simple n-gram features. Obviously, you could think of much more sophisticated features and possibly improve our model's performance."
265 | ]
266 | },
267 | {
268 | "cell_type": "code",
269 | "execution_count": 12,
270 | "metadata": {},
271 | "outputs": [],
272 | "source": [
273 | "def create_char_features(sentence, i):\n",
274 | " features = [\n",
275 | " 'bias',\n",
276 | " 'char=' + sentence[i][0] \n",
277 | " ]\n",
278 | " \n",
279 | " if i >= 1:\n",
280 | " features.extend([\n",
281 | " 'char-1=' + sentence[i-1][0],\n",
282 | " 'char-1:0=' + sentence[i-1][0] + sentence[i][0],\n",
283 | " ])\n",
284 | " else:\n",
285 | " features.append(\"BOS\")\n",
286 | " \n",
287 | " if i >= 2:\n",
288 | " features.extend([\n",
289 | " 'char-2=' + sentence[i-2][0],\n",
290 | " 'char-2:0=' + sentence[i-2][0] + sentence[i-1][0] + sentence[i][0],\n",
291 | " 'char-2:-1=' + sentence[i-2][0] + sentence[i-1][0],\n",
292 | " ])\n",
293 | " \n",
294 | " if i >= 3:\n",
295 | " features.extend([\n",
296 | " 'char-3:0=' + sentence[i-3][0] + sentence[i-2][0] + sentence[i-1][0] + sentence[i][0],\n",
297 | " 'char-3:-1=' + sentence[i-3][0] + sentence[i-2][0] + sentence[i-1][0],\n",
298 | " ])\n",
299 | " \n",
300 | " \n",
301 | " if i + 1 < len(sentence):\n",
302 | " features.extend([\n",
303 | " 'char+1=' + sentence[i+1][0],\n",
304 | " 'char:+1=' + sentence[i][0] + sentence[i+1][0],\n",
305 | " ])\n",
306 | " else:\n",
307 | " features.append(\"EOS\")\n",
308 | " \n",
309 | " if i + 2 < len(sentence):\n",
310 | " features.extend([\n",
311 | " 'char+2=' + sentence[i+2][0],\n",
312 | " 'char:+2=' + sentence[i][0] + sentence[i+1][0] + sentence[i+2][0],\n",
313 | " 'char+1:+2=' + sentence[i+1][0] + sentence[i+2][0],\n",
314 | " ])\n",
315 | " \n",
316 | " if i + 3 < len(sentence):\n",
317 | " features.extend([\n",
318 | " 'char:+3=' + sentence[i][0] + sentence[i+1][0] + sentence[i+2][0]+ sentence[i+3][0],\n",
319 | " 'char+1:+3=' + sentence[i+1][0] + sentence[i+2][0] + sentence[i+3][0],\n",
320 | " ])\n",
321 | " \n",
322 | " return features\n",
323 | "\n",
324 | "\n",
325 | "\n",
326 | "def create_sentence_features(prepared_sentence):\n",
327 | " return [create_char_features(prepared_sentence, i) for i in range(len(prepared_sentence))]\n",
328 | "\n",
329 | "def create_sentence_labels(prepared_sentence):\n",
330 | " return [str(part[1]) for part in prepared_sentence]"
331 | ]
332 | },
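333 | {
334 | "cell_type": "markdown",
335 | "metadata": {},
336 | "source": [
337 | "To get a feeling for what the feature extraction produces, we can inspect the features of a single character (the sentence and position indices here are arbitrary):"
338 | ]
339 | },
340 | {
341 | "cell_type": "code",
342 | "execution_count": null,
343 | "metadata": {},
344 | "outputs": [],
345 | "source": [
346 | "# Illustrative: the feature list for the 7th character of one sentence,\n",
347 | "# i.e. 'bias', 'char=...' plus the surrounding n-gram features.\n",
348 | "print(create_char_features(prepared_sentences[200], 6))"
349 | ]
350 | },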
333 | {
334 | "cell_type": "code",
335 | "execution_count": 13,
336 | "metadata": {},
337 | "outputs": [],
338 | "source": [
339 | "X = [create_sentence_features(ps) for ps in prepared_sentences[:-10000]]\n",
340 | "y = [create_sentence_labels(ps) for ps in prepared_sentences[:-10000]]\n",
341 | "\n",
342 | "X_test = [create_sentence_features(ps) for ps in prepared_sentences[-10000:]]\n",
343 | "y_test = [create_sentence_labels(ps) for ps in prepared_sentences[-10000:]]"
344 | ]
345 | },
346 | {
347 | "cell_type": "markdown",
348 | "metadata": {},
349 | "source": [
350 | "## Training a CRF\n",
351 | "Now, we use Python-CRFSuite for training a CRF."
352 | ]
353 | },
354 | {
355 | "cell_type": "code",
356 | "execution_count": 14,
357 | "metadata": {},
358 | "outputs": [],
359 | "source": [
360 | "trainer = pycrfsuite.Trainer(verbose=False)\n",
361 | "\n",
362 | "for xseq, yseq in zip(X, y):\n",
363 | " trainer.append(xseq, yseq)"
364 | ]
365 | },
366 | {
367 | "cell_type": "code",
368 | "execution_count": 15,
369 | "metadata": {
370 | "collapsed": true
371 | },
372 | "outputs": [],
373 | "source": [
374 | "trainer.set_params({\n",
375 | " 'c1': 1.0, \n",
376 | " 'c2': 1e-3,\n",
377 | " 'max_iterations': 60,\n",
378 | " 'feature.possible_transitions': True\n",
379 | "})"
380 | ]
381 | },
382 | {
383 | "cell_type": "code",
384 | "execution_count": 16,
385 | "metadata": {
386 | "collapsed": true
387 | },
388 | "outputs": [],
389 | "source": [
390 | "trainer.train('latin-text-segmentation.crfsuite')"
391 | ]
392 | },
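393 | {
394 | "cell_type": "markdown",
395 | "metadata": {},
396 | "source": [
397 | "If you are curious about the optimization run, the trainer keeps a parsed log even with __verbose=False__; for instance, the state of the final iteration can be inspected:"
398 | ]
399 | },
400 | {
401 | "cell_type": "code",
402 | "execution_count": null,
403 | "metadata": {},
404 | "outputs": [],
405 | "source": [
406 | "# The parsed training log records per-iteration statistics such as the loss;\n",
407 | "# last_iteration holds the values of the final iteration.\n",
408 | "print(trainer.logparser.last_iteration)"
409 | ]
410 | },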
393 | {
394 | "cell_type": "code",
395 | "execution_count": 17,
396 | "metadata": {},
397 | "outputs": [
398 | {
399 | "data": {
400 | "text/plain": [
401 | ""
402 | ]
403 | },
404 | "execution_count": 17,
405 | "metadata": {},
406 | "output_type": "execute_result"
407 | }
408 | ],
409 | "source": [
410 | "tagger = pycrfsuite.Tagger()\n",
411 | "tagger.open('latin-text-segmentation.crfsuite')"
412 | ]
413 | },
414 | {
415 | "cell_type": "code",
416 | "execution_count": 18,
417 | "metadata": {},
418 | "outputs": [],
419 | "source": [
420 | "def segment_sentence(sentence):\n",
421 | " sent = sentence.replace(\" \", \"\")\n",
422 | " prediction = tagger.tag(create_sentence_features(sent))\n",
423 | " complete = \"\"\n",
424 | " for i, p in enumerate(prediction):\n",
425 | " if p == \"1\":\n",
426 | " complete += \" \" + sent[i]\n",
427 | " else:\n",
428 | " complete += sent[i]\n",
429 | " return complete"
430 | ]
431 | },
432 | {
433 | "cell_type": "code",
434 | "execution_count": 19,
435 | "metadata": {},
436 | "outputs": [
437 | {
438 | "name": "stdout",
439 | "output_type": "stream",
440 | "text": [
441 | "dominus ad templum properat\n",
442 | "porta patet\n"
443 | ]
444 | }
445 | ],
446 | "source": [
447 | "print(segment_sentence(\"dominusadtemplumproperat\"))\n",
448 | "print(segment_sentence(\"portapatet\"))"
449 | ]
450 | },
451 | {
452 | "cell_type": "markdown",
453 | "metadata": {},
454 | "source": [
455 | "Finally, let's find out how well our model performs."
456 | ]
457 | },
458 | {
459 | "cell_type": "code",
460 | "execution_count": 20,
461 | "metadata": {},
462 | "outputs": [],
463 | "source": [
464 | "tp = 0\n",
465 | "fp = 0\n",
466 | "fn = 0\n",
467 | "n_correct = 0\n",
468 | "n_incorrect = 0\n",
469 | "\n",
470 | "for s in prepared_sentences[-10000:]:\n",
471 | " prediction = tagger.tag(create_sentence_features(s))\n",
472 | " correct = create_sentence_labels(s)\n",
473 | " zipped = list(zip(prediction, correct))\n",
474 | " tp += len([_ for l, c in zipped if l == c and l == \"1\"])\n",
475 | " fp += len([_ for l, c in zipped if l == \"1\" and c == \"0\"])\n",
476 | " fn += len([_ for l, c in zipped if l == \"0\" and c == \"1\"])\n",
477 | " n_incorrect += len([_ for l, c in zipped if l != c])\n",
478 | " n_correct += len([_ for l, c in zipped if l == c])"
479 | ]
480 | },
481 | {
482 | "cell_type": "code",
483 | "execution_count": 21,
484 | "metadata": {},
485 | "outputs": [
486 | {
487 | "name": "stdout",
488 | "output_type": "stream",
489 | "text": [
490 | "Precision:\t0.9314353553833713\n",
491 | "Recall:\t\t0.9171904737701122\n",
492 | "Accuracy:\t0.9766116709677363\n"
493 | ]
494 | }
495 | ],
496 | "source": [
497 | "print(\"Precision:\\t\" + str(tp/(tp+fp)))\n",
498 | "print(\"Recall:\\t\\t\" + str(tp/(tp+fn)))\n",
499 | "print(\"Accuracy:\\t\" + str(n_correct/(n_correct+n_incorrect)))"
500 | ]
501 | },
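502 | {
503 | "cell_type": "markdown",
504 | "metadata": {},
505 | "source": [
506 | "Precision and recall can be combined into a single F1 score, their harmonic mean:"
507 | ]
508 | },
509 | {
510 | "cell_type": "code",
511 | "execution_count": null,
512 | "metadata": {},
513 | "outputs": [],
514 | "source": [
515 | "# F1 is the harmonic mean of precision and recall.\n",
516 | "precision = tp / (tp + fp)\n",
517 | "recall = tp / (tp + fn)\n",
518 | "print(\"F1 score:\\t\" + str(2 * precision * recall / (precision + recall)))"
519 | ]
520 | }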
511 | ],
512 | "metadata": {
513 | "kernelspec": {
514 | "display_name": "Python 3",
515 | "language": "python",
516 | "name": "python3"
517 | },
518 | "language_info": {
519 | "codemirror_mode": {
520 | "name": "ipython",
521 | "version": 3
522 | },
523 | "file_extension": ".py",
524 | "mimetype": "text/x-python",
525 | "name": "python",
526 | "nbconvert_exporter": "python",
527 | "pygments_lexer": "ipython3",
528 | "version": "3.6.1"
529 | }
530 | },
531 | "nbformat": 4,
532 | "nbformat_minor": 2
533 | }
534 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # NLP-with-Python
2 |
3 | Using Conditional Random Fields to segment Latin text written in *scriptio continua*.
4 |
--------------------------------------------------------------------------------